Lamarr Scientific Forum

Name: Lamarr Scientific Forum
Start: 2025-09-03T08:30:00+02:00
End: 2025-09-04T18:00:00+02:00
Location: Hörsaalgebäude, Campus Poppelsdorf, Universität Bonn

Sep 3 – 4, 2025

Hörsaalgebäude, Campus Poppelsdorf, Universität Bonn

Europe/Berlin timezone

Contact

Mitigating Emerging Misalignment with Training Regularization

NLP.2.1

Sep 3, 2025, 2:00 PM

1h 30m

Open Space (first floor)

Poster Natural Language Processing Poster Session

David Kaczér (University of Bonn)

Emergent Misalignment (EMA) is a puzzling phenomenon where models finetuned on a narrowly misaligned task (e.g., including insecure backdoors in code) learn to be broadly misaligned. EMA is concerning, as models trained on superficially harmless data might become broadly misaligned. At the same time, the fact that alignment behavior across different domains is so strongly correlated during training presents an opportunity to robustly align models. It also suggests that a simple mitigation through intervention during training may be possible. We show that EMA occurs even in small (7B) models and with as little as rank-1 adapters. We investigate regularizing interventions during training including SafeLoRA, Generalized Knowledge Distillation and gradient projection to increase the models' robustness. Our current results indicate that although these interventions successfully mitigate EMA, this comes at a cost of inhibiting learning of benign tasks.

David Kaczér (University of Bonn)

Akbar Karimi (University of Bonn) Dr Florian Mai (University of Bonn) Prof. Lucie Flek (University of Bonn)

Informatik_Poster_David_Kaczer_Sommerfest_2025.pdf

Lamarr Scientific Forum

Contact

Mitigating Emerging Misalignment with Training Regularization

Open Space (first floor)

Speaker

Description

Author

Co-authors

Presentation materials

Choose timezone

Lamarr Scientific Forum

Contact

Speaker

Description

Author

Co-authors

Presentation materials