RESEARCH · November 2025

Beyond the Bradley-Terry Paradigm: Defining Next-Generation Preference Alignment

Toward Cognitively Inspired "Uncertainty-Aware" Alignment

EMNLP 2025 arXiv

In the alignment of large language models (LLMs), offline preference optimization methods such as DPO have become the prevailing approach because of their efficiency. However, most existing methods adhere to the Bradley-Terry (BT) model, which faces three critical challenges in real-world scenarios: dependence on pairwise data, distribution shift during training, and the assumption of "rational" human behavior.

Recently, the Beijing Institute for General Artificial Intelligence (BIGAI), in collaboration with the University of Science and Technology of China, published research at EMNLP 2025 introducing UAPO (Adaptive Preference Optimization with Uncertainty-aware Utility Anchor). By incorporating a "utility anchor," this method achieves, for the first time, robust modeling of uncertain preference data.

Core Challenges

Why Do Existing Preference Optimization Methods Fall Short?

Current preference alignment methods (e.g., DPO, SimPO) encounter significant bottlenecks in practical applications:

Pairwise Constraints at the Data Level

The BT model mandates "preferred–dispreferred" pairwise data, yet in practice, human preferences are often non-comparative in nature.

Distribution Shift at the Optimization Level

Over-optimization (reward hacking) causes the model to produce unreliable signals when encountering out-of-distribution (OOD) samples.

Rationality Assumption at the Cognitive Level

The BT model assumes humans are perfectly rational utility maximizers, overlooking the well-established phenomena of "risk aversion" and "uncertainty" from behavioral economics.
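For reference, the BT assumption and the DPO objective built on it are usually written as below. The notation is standard usage rather than taken from this article: $r$ is a latent reward, $\pi_\theta$ the policy being trained, $\pi_{\mathrm{ref}}$ the frozen reference model, $\beta$ a temperature, and $(y_w, y_l)$ the preferred and dispreferred responses to a prompt $x$.

```latex
P(y_w \succ y_l \mid x) = \sigma\big( r(x, y_w) - r(x, y_l) \big)

\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

Both expressions score a response only relative to a paired alternative, which is why feedback on a single, unpaired response cannot be used directly.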

Innovative Design

UAPO: Introducing the Utility Anchor

Drawing on the anchoring effect from behavioral economics, UAPO introduces a learnable "utility anchor" $y_\bot$.

Figure 1 (Utility Anchor Mechanism): The utility anchor serves to balance the preferred and dispreferred distributions in preference alignment.

Core Advantages of the Framework:

Decoupling Pairwise Dependence

By decomposing the objective function into a pointwise form, UAPO enables the model to learn directly from unpaired data, significantly improving data utilization.

Uncertainty Awareness

The utility anchor captures ambiguous signals in the annotation process and is theoretically equivalent to introducing an "uncertainty penalty" in pessimistic reinforcement learning (Pessimistic RL), preventing the model from falling into reward traps.

Smoother Training Dynamics

Compared to DPO, UAPO exhibits lower and more stable KL divergence during training, better preserving the original capabilities of the pretrained model.
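The pointwise idea can be illustrated with a minimal Python sketch. This is not the paper's objective, which this article does not spell out. It assumes a DPO-style implicit reward and treats the anchor's utility as a plain scalar (in UAPO it is learnable), so that each response is compared against the anchor instead of against a paired alternative. The names `implicit_reward`, `pointwise_anchor_loss`, and `anchor_utility` are hypothetical.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """DPO-style implicit reward: beta * log(pi_theta / pi_ref) (assumed form)."""
    return beta * (logp_policy - logp_ref)


def pointwise_anchor_loss(reward: float, anchor_utility: float, is_preferred: bool) -> float:
    """Illustrative pointwise loss against an anchor (an assumption, not UAPO itself).

    A preferred response is pushed above the anchor, a dispreferred one
    below it, so no (preferred, dispreferred) pair is required.
    """
    margin = reward - anchor_utility
    return -log_sigmoid(margin if is_preferred else -margin)


# Unpaired examples: (log-prob under policy, log-prob under reference, label).
batch = [(-12.0, -14.0, True), (-9.0, -8.5, False), (-20.0, -20.0, True)]
anchor = 0.0  # fixed here for simplicity; a learnable parameter in practice
losses = [
    pointwise_anchor_loss(implicit_reward(lp, lr), anchor, good)
    for lp, lr, good in batch
]
mean_loss = sum(losses) / len(losses)
```

Because every example contributes its own term, a dataset containing only thumbs-up or thumbs-down labels on individual responses can be consumed without constructing pairs.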

Experimental Results

Outstanding Generalization and Robustness

The research team conducted extensive evaluations across multiple models, including Mistral, Llama-3, and Gemma-2. The results demonstrate:

Leading Benchmark Performance

On AlpacaEval 2 and Arena-Hard, UAPO variants (e.g., SimUAPO) consistently outperform the original SimPO and DPO. On Gemma-2-9B, SimUAPO achieves a length-controlled win rate (LC) of 73.5%.

Resilient to Distribution Shift

On OOD benchmarks such as RewardBench 2, UAPO demonstrates stronger transfer capabilities, particularly in mathematical reasoning and safety evaluation.

Robust Against Data Noise

Even under extreme conditions where 40% of preference annotations are randomly flipped (noise contamination), UAPO's performance degradation is substantially smaller than that of conventional methods.

Figure 2 (Performance of UAPO and SimUAPO): Comparison of UAPO and SimUAPO against existing mainstream methods.
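The label-noise condition described under "Robust Against Data Noise" can be simulated along the following lines. This is a sketch under stated assumptions, not the authors' evaluation code: it assumes each pair is flipped independently with probability 0.4 (the article does not say whether flips are independent or an exact 40% subset), and the function name `flip_preferences` and the toy data are hypothetical.

```python
import random


def flip_preferences(pairs, flip_rate=0.4, seed=0):
    """Return a copy of (chosen, rejected) pairs with labels randomly swapped.

    Each pair is flipped independently with probability `flip_rate`
    (an assumed noise model); `seed` makes the corruption reproducible.
    """
    rng = random.Random(seed)
    noisy = []
    for chosen, rejected in pairs:
        if rng.random() < flip_rate:
            noisy.append((rejected, chosen))  # swap the annotation
        else:
            noisy.append((chosen, rejected))
    return noisy


# Toy preference data: the "good_*" response is the clean chosen label.
clean = [(f"good_{i}", f"bad_{i}") for i in range(1000)]
noisy = flip_preferences(clean)
flipped_fraction = sum(c.startswith("bad") for c, _ in noisy) / len(noisy)
```

Training the same method on `clean` and on `noisy` and comparing the resulting scores gives the kind of degradation measurement the article refers to.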

Outlook

Toward Trustworthy Aligned AI

The release of UAPO represents not merely an algorithmic improvement, but a deeper insight into how to teach LLMs to make judgments. Alignment should not be limited to "feeding" standard answers; it should teach models to understand the scale of values and the boundaries of uncertainty.

Going forward, the team will continue to explore:

Self-Supervised Alignment

Leveraging the utility anchor for model self-play and iterative refinement.

Complex Task Alignment

Validating the effectiveness of UAPO in long-form text generation and complex reasoning chains.

LLM Alignment · Preference Optimization · Utility Anchor · Uncertainty

Authors

Xiaobo Wang¹,³, Zixia Jia³, Jiaqi Li³, Qi Liu*¹,², Zilong Zheng*³

¹ USTC, ² IAI, ³ BIGAI

* Corresponding authors.