Natural Language Processing

[lamarr-nlp] EMNLP Outstanding Paper Guest Talk by Dr. Martin Tutek from TakeLab, University of Zagreb | From Internals to Integrity: How Insights into Transformer LMs can Improve Interpretability and Faithfulness

Europe/Berlin
2.122 (Friedrich-Hirzebruch-Allee 6, Bonn)

Description

As part of the Lamarr NLP Colloquium, we have the pleasure of hosting Dr. Martin Tutek from TakeLab, University of Zagreb. Martin will give a talk on Transformer interpretability and faithfulness, based on a paper that received an Outstanding Paper Award at EMNLP 2025.

Title: From Internals to Integrity: How Insights into Transformer LMs can Improve Interpretability and Faithfulness


Abstract:

As language models increasingly rely on chain-of-thought (CoT) reasoning, a critical question for model interpretability and oversight emerges: do verbalized reasoning steps actually reflect the model's internal computations, or are they merely plausible-sounding post-hoc narratives? Prior approaches to measuring faithfulness perturb reasoning steps in context. However, such approaches only measure self-consistency, as models can reconstruct erased knowledge from their parameters. Precise faithfulness tests require intervening on the parameters that encode such knowledge themselves. We introduce Faithfulness by Unlearning Reasoning steps (FUR), which erases knowledge encoded in CoT steps directly from model weights and measures the effect on predictions to estimate faithfulness. Our experiments show that parametric faithfulness is meaningfully distinct from contextual faithfulness, and that the steps humans find most plausible are often not the ones the model actually relies on.
 
Bio:
Martin Tutek is a postdoctoral researcher at TakeLab, University of Zagreb, working on faithful explainability, safety, and controllability of language models. He received his PhD in 2022 from the University of Zagreb under the supervision of Jan Šnajder. After his PhD, he held postdoctoral positions at the UKP Lab at TU Darmstadt with Iryna Gurevych on the InterText initiative, and subsequently at the Technion with Yonatan Belinkov, focusing on machine unlearning and parametric faithfulness of language model reasoning. He is a recipient of an Outstanding Paper Award at EMNLP 2025 for this line of work, as well as a research grant from Coefficient Giving (formerly Open Philanthropy) to further investigate encoded reasoning within chains of thought using machine unlearning.
 
Looking forward to your participation.

Date: Wednesday, Mar 18, 2026
Time: 11:00 am - 12:00 pm (CET)
Location: Friedrich-Hirzebruch-Allee 6, 53115 Bonn, Germany
Room: 2.122 + Zoom
Zoom: https://uni-bonn.zoom-x.de/j/63819604806?pwd=64PSGa9HyTym9j1bjy6jhcJF3eHebi.1