Description
Reinforcement learning traditionally learns absolute state values, estimating how good a particular situation is in isolation. Yet in both biological systems and practical decision-making, what often matters is not the absolute value of a state, but how it compares to alternatives. Motivated by empirical findings in neuroscience, we introduce \textbf{Pairwise-TD}, a novel framework that learns \emph{value differences} directly.
Our method defines a new pairwise Bellman operator that estimates the relative value $\Delta(s_i, s_j) = V(s_i) - V(s_j)$, bypassing the need to compute $V(s)$ explicitly. We prove that this operator is a $\gamma$-contraction in a structured function space, ensuring convergence to a unique fixed point. Pairwise-TD integrates naturally into on-policy actor-critic methods and enables exact recovery of Generalized Advantage Estimation (GAE) using only pairwise differences. Building on this, we derive a pseudo-value approach that yields an unbiased policy gradient estimator despite the absence of an explicit value baseline. To handle pairwise comparisons in episodic environments with terminal states, we introduce a principled scheme for computing Bellman targets using only observable quantities, ensuring correct learning even when episode lengths vary. Finally, we present a lightweight neural network architecture that enforces antisymmetry via a shared encoder and linear projection, guaranteeing $\Delta(s_i, s_j) = -\Delta(s_j, s_i)$ by construction. Together, these contributions offer a biologically inspired, practically effective, and theoretically grounded alternative to traditional value learning.
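As a brief illustration (a minimal sketch implied by the definition of $\Delta$ above, not necessarily the exact construction used in the work), one natural form of the pairwise Bellman operator for a policy $\pi$ is
\[
(\mathcal{T}^{\pi}_{\Delta}\,\Delta)(s_i, s_j) \;=\; \mathbb{E}_{\,s_i' \sim P^{\pi}(\cdot \mid s_i),\; s_j' \sim P^{\pi}(\cdot \mid s_j)}\!\big[\, r^{\pi}(s_i) - r^{\pi}(s_j) + \gamma\, \Delta(s_i', s_j') \,\big],
\]
obtained by subtracting the standard Bellman equations for $V^{\pi}(s_i)$ and $V^{\pi}(s_j)$; an operator of this form inherits a $\gamma$-contraction in the sup norm, with fixed point $\Delta^{\pi}(s_i, s_j) = V^{\pi}(s_i) - V^{\pi}(s_j)$. Likewise, the antisymmetric architecture admits a parameterization such as $\Delta_{\theta}(s_i, s_j) = w^{\top}\!\big(\phi_{\theta}(s_i) - \phi_{\theta}(s_j)\big)$, where $\phi_{\theta}$ denotes the shared encoder and $w$ the linear projection (illustrative notation, not taken from the description), so that antisymmetry holds by construction.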