Post-Training Alignment against Prompt Injections
Post-Training Alignment Against PIs | DSC291: Safety in Generative AI, UCSD | Sept 2025 -- Dec 2025 |
- Implemented LoRA-based supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and a two-stage SFT→DPO pipeline on small 3–4B LLMs (Llama-3.2-3B Base, Qwen3-4B-Instruct-2507) to study post-training defenses against prompt injection and jailbreak attacks without re-pretraining.
- Curated a 20k-example preference dataset from WildJailbreak (adversarial) and Alpaca (benign) and evaluated on AdvBench, JailbreakBench, and a held-out Alpaca split, showing that SFT+DPO on Qwen-4B cuts attack success rate from 9% to 2.7% while boosting benign helpfulness from 37% to 72%.