Sep 3 – 4, 2025
Hörsaalgebäude, Campus Poppelsdorf, Universität Bonn
Europe/Berlin timezone

Mitigating Emerging Misalignment with Training Regularization

Not scheduled
1h 30m
Open Space (first floor)

Poster | Natural Language Processing | Poster Session

Speaker

David Kaczér (University of Bonn)

Description

Emergent Misalignment (EMA) is a puzzling phenomenon in which models finetuned on a narrowly misaligned task (e.g., including insecure backdoors in code) learn to be broadly misaligned. EMA is concerning because even models trained on superficially harmless data might become broadly misaligned. At the same time, the fact that alignment behavior is so strongly correlated across domains during training presents an opportunity to align models robustly; it also suggests that a simple mitigation through intervention during training may be possible. We show that EMA occurs even in small (7B) models and with adapters of rank as low as 1. We investigate regularizing interventions during training, including SafeLoRA, Generalized Knowledge Distillation, and gradient projection, to increase the models' robustness. Our current results indicate that while these interventions successfully mitigate EMA, they come at the cost of inhibiting learning of the benign task.
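To illustrate the flavor of one such intervention, here is a minimal sketch (not the authors' implementation) of gradient projection as a training-time regularizer: before each optimizer step, the component of every gradient that points along a precomputed alignment-relevant direction is removed, so the update cannot move the model along it. The dictionary alignment_dirs and the function name are hypothetical; how the directions are estimated (e.g., from gradients of an alignment-preserving objective) is an assumption here.

import torch

def project_out_alignment_grads(model, alignment_dirs):
    """Remove gradient components along per-parameter alignment directions.

    alignment_dirs: dict mapping parameter name -> unit-norm tensor of the
    same shape as the parameter, spanning directions whose change is
    presumed to harm alignment (hypothetical; construction not specified).
    """
    for name, param in model.named_parameters():
        d = alignment_dirs.get(name)
        if param.grad is None or d is None:
            continue
        g = param.grad
        # Scalar projection of the gradient onto the (unit-norm) direction.
        coeff = torch.sum(g * d)
        # Subtract that component so the update is orthogonal to it.
        param.grad = g - coeff * d

# Usage inside a standard training loop:
#   loss.backward()
#   project_out_alignment_grads(model, alignment_dirs)
#   optimizer.step()

Constraining updates this way is what makes the trade-off reported above plausible: directions useful for the benign task may overlap with the projected-out subspace, inhibiting benign learning.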

Author

David Kaczér (University of Bonn)

Co-authors

Akbar Karimi (University of Bonn), Dr Florian Mai (University of Bonn), Prof. Lucie Flek (University of Bonn)
