Sep 3 – 4, 2025
Hörsaalgebäude, Campus Poppelsdorf, Universität Bonn
Europe/Berlin timezone

LLM Value Alignment

Not scheduled
1h 30m
Open Space (first floor)

Poster
Human-centered AI Systems Poster Session

Speaker

Shangrui Nie (Bonn-Aachen International Center for Information Technology (b-it))

Description

The social sciences define values as preferred behaviors or outcomes that motivate an individual's actions or judgments.
While LLMs often reflect biases from their training data, it remains unclear what values underlie their generation processes, and whether such internal value systems can be measured or modified.
In this paper, we investigate whether fine-tuning can steer a model’s internal moral preferences and whether such changes manifest in downstream behavior.
Building on a taxonomy of 20 human values, we fine-tune models using two approaches: supervised fine-tuning (SFT) on scalar value ratings from a survey, and direct preference optimization (DPO) on contrastive sentence pairs.
Each method downgrades a target value while keeping the others fixed.
We evaluate the models on moral judgments from the Am I The Asshole subreddit, using GPT-labeled examples with high vs. low value standards.
We measure both the prediction change rate and the directional consistency of changes with the expected value shifts.
Results show that SFT is more effective than DPO at inducing value-aligned behavioral changes, especially for values with sufficient evaluation data. These findings suggest that value-specific instruction tuning offers a promising path for aligning LLMs' moral behavior.
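
For context on the second approach, the following is a minimal sketch of the standard DPO objective applied to contrastive sentence pairs, assuming per-sequence log-probabilities are already available from the policy being tuned and a frozen reference model; the function and variable names are illustrative, not taken from the paper.

```python
# Minimal sketch of the DPO objective on contrastive sentence pairs.
# Assumes per-sequence log-probabilities have already been computed for
# the policy being tuned and a frozen reference model; names are
# illustrative, not from the paper.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss over a batch of contrastive pairs.

    When downgrading a target value, the "chosen" sentence in each pair
    would be the one assigning low importance to that value and the
    "rejected" sentence the one assigning high importance.
    """
    # Implicit reward of each response: log-ratio of policy vs. reference.
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    # Encourage a positive margin between chosen and rejected rewards.
    margin = beta * (chosen_rewards - rejected_rewards)
    return -F.logsigmoid(margin).mean()

# Toy usage with made-up log-probabilities for a batch of two pairs.
loss = dpo_loss(torch.tensor([-12.3, -9.8]), torch.tensor([-11.0, -10.5]),
                torch.tensor([-12.0, -10.0]), torch.tensor([-11.2, -10.1]))
```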
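
The two evaluation metrics can be read as follows, though the exact definitions used in the paper are not given on this page: the prediction change rate counts how many AITA verdicts flip after fine-tuning, and directional consistency measures how many of those flips move toward the verdict expected from the induced value shift. A minimal sketch under the assumption of binary verdicts and hypothetical GPT-derived expected labels:

```python
# Illustrative reading of the two metrics, assuming each AITA post gets a
# binary verdict ("YTA" or "NTA") from the base and the fine-tuned model,
# and a GPT-derived label says which verdict the downgraded value should
# push the model towards. The paper's exact definitions may differ.
from typing import Sequence

def prediction_change_rate(base: Sequence[str], tuned: Sequence[str]) -> float:
    """Fraction of examples whose verdict changed after fine-tuning."""
    return sum(b != t for b, t in zip(base, tuned)) / len(base)

def directional_consistency(base: Sequence[str], tuned: Sequence[str],
                            expected: Sequence[str]) -> float:
    """Among changed verdicts, the fraction that moved to the verdict
    expected from the induced value shift."""
    flips = [(t, e) for b, t, e in zip(base, tuned, expected) if b != t]
    return sum(t == e for t, e in flips) / len(flips) if flips else 0.0

# Toy usage.
base = ["NTA", "NTA", "YTA", "NTA"]
tuned = ["YTA", "NTA", "YTA", "YTA"]
expected = ["YTA", "YTA", "NTA", "YTA"]
print(prediction_change_rate(base, tuned))             # 0.5
print(directional_consistency(base, tuned, expected))  # 1.0
```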

Author

Shangrui Nie (Bonn-Aachen International Center for Information Technology (b-it))

Presentation materials

There are no materials yet.