Preference-based reinforcement learning (RL) learns reward models from pairwise comparisons of behaviors, bypassing the need for explicit reward design. However, existing methods typically rely on passive data collection and suffer from poor sample efficiency, especially early in learning.
We introduce a model-based approach that actively directs exploration by jointly reasoning over uncertainties in the reward, dynamics, and value functions. Our method, Uncertainty-Balanced Preference Planning (UBP²), uses ensembles of reward, dynamics, and value function models to evaluate candidate trajectories according to a unified score that combines expected reward, terminal value, and epistemic uncertainty.
Planning under this objective yields an explicit tradeoff between exploitation and information acquisition without requiring ad hoc exploration heuristics. Under standard regularity assumptions, we establish sublinear regret guarantees for both finite-horizon and infinite-horizon settings.
Highlights: evaluated across 10 MetaWorld tasks; best success rate on 9 of 10 tasks and best average rank versus baselines; higher sample efficiency; sublinear regret for both finite and infinite horizons.
UBP² jointly learns ensembles of reward, dynamics, and value models from preference feedback and online interaction. A model-predictive controller plans over imagined trajectories using a unified optimistic objective; once the preference budget is exhausted, control passes to the learned policy.
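The uncertainty coefficients $\lambda_r$, $\lambda_d$, and $\lambda_q$, which weight the reward, dynamics, and value uncertainty terms in the planning objective, are tuned online via a Polyak-averaged target policy, eliminating manual tuning of the exploration–exploitation balance.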
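To make the unified score concrete, here is a minimal sketch of how a planner might rank candidate trajectories from ensemble predictions. The exact objective is not spelled out on this page, so the additive form, the use of ensemble standard deviation as the epistemic-uncertainty proxy, and all names (`optimistic_scores`, `lam_r`, `lam_d`, `lam_q`) are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def optimistic_scores(reward_preds, state_preds, value_preds,
                      lam_r=1.0, lam_d=1.0, lam_q=1.0):
    """Score K candidate trajectories of length H (illustrative form only).

    reward_preds: (E_r, K, H)    per-step rewards from E_r reward models
    state_preds:  (E_d, K, H, D) predicted states from E_d dynamics models
    value_preds:  (E_q, K)       terminal values from E_q value models
    """
    expected_return = reward_preds.mean(axis=0).sum(axis=-1)   # (K,)
    terminal_value = value_preds.mean(axis=0)                  # (K,)
    # Assumption: ensemble standard deviation stands in for epistemic uncertainty.
    u_r = reward_preds.std(axis=0).sum(axis=-1)                # (K,)
    u_d = state_preds.std(axis=0).mean(axis=-1).sum(axis=-1)   # (K,)
    u_q = value_preds.std(axis=0)                              # (K,)
    return (expected_return + terminal_value
            + lam_r * u_r + lam_d * u_d + lam_q * u_q)


# Toy usage with random ensemble outputs: 5 models, 8 candidates, horizon 7.
rng = np.random.default_rng(0)
E, K, H, D = 5, 8, 7, 4
scores = optimistic_scores(
    rng.normal(size=(E, K, H)),
    rng.normal(size=(E, K, H, D)),
    rng.normal(size=(E, K)),
)
best = int(np.argmax(scores))  # index of the candidate the planner would execute
```

Setting all three coefficients to zero reduces the score to expected return plus terminal value, which corresponds to planning without any optimism bonus.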
Unified optimistic planning objective combining predicted cumulative return with epistemic uncertainty over reward, dynamics, and value; it outperforms any single-component uncertainty approach.
Optimistic preference query strategy that globally ranks candidate trajectory segment pairs by predicted reward plus reward-model epistemic uncertainty, rather than sampling locally.
Sublinear regret bounds (finite- and infinite-horizon) with explicit dependence on the maximum information gains $\Gamma_{d,N}$ and $\Gamma_{r,N}$ of the dynamics and reward kernels.
Strong empirical results on 10 MetaWorld tasks: substantially higher sample efficiency than model-free (RUNE, MRN) and non-optimistic model-based (MBP) baselines.
We evaluate on 10 MetaWorld manipulation tasks with proprioceptive observations and report the interquartile mean (IQM) success rate over 1M environment steps. Feedback budgets follow those commonly used in prior work.
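For readers unfamiliar with the metric, the interquartile mean discards the lowest and highest quarter of the values and averages the rest, which makes it less sensitive to outlier runs than a plain mean. The sketch below is a simple version that trims `floor(n/4)` entries from each tail; the function name and the toy numbers are hypothetical, and the evaluation code behind the reported numbers may handle fractional quartiles differently.

```python
import numpy as np


def interquartile_mean(values):
    """Mean of the middle 50% of the values (simple floor-based trimming)."""
    x = np.sort(np.asarray(values, dtype=float).ravel())
    k = x.size // 4  # number of entries dropped from each tail
    return float(x[k:x.size - k].mean())


# Toy example: eight per-run success rates with one unusually high run.
runs = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 100.0]
iqm = interquartile_mean(runs)   # averages 20, 30, 40, 50
plain_mean = float(np.mean(runs))
```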
| Method | Door Close | Window Close | Handle Press | Coffee Button | Faucet Open | Door Open | Door Unlock | Sweep Into | Drawer Open | Hammer | Avg ↑ | Rank ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UBP² | **97.8** | **93.0** | **93.4** | 86.2 | **89.4** | **70.5** | **85.6** | **36.3** | **77.5** | **29.9** | **76.0** | **1.20** |
| MBP | 95.5 | 74.6 | 90.7 | 75.3 | 83.4 | 61.2 | 85.1 | 1.1 | 43.2 | 14.8 | 62.5 | 2.60 |
| RUNE | 91.5 | 85.2 | 84.5 | **90.4** | 83.1 | 41.7 | 56.1 | 35.5 | 35.9 | 4.6 | 60.8 | 3.10 |
| MRN | 94.2 | 84.9 | 81.5 | 87.7 | 80.1 | 48.2 | 65.6 | 28.3 | 38.1 | 7.5 | 61.6 | 3.10 |
Best value per column in bold. The asterisks that mark statistically significant improvements under the Wilcoxon signed-rank test (p < 0.05) are omitted here for brevity.
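The Avg and Rank columns can be reproduced from the per-task numbers in the table above. The sketch assumes that Rank is the per-task rank (1 = highest success rate) averaged over the 10 tasks; that reading is an inference, though it matches the reported values.

```python
import numpy as np

# Success rates copied from the table above (rows: methods, columns: the 10 tasks).
methods = ["UBP²", "MBP", "RUNE", "MRN"]
success = np.array([
    [97.8, 93.0, 93.4, 86.2, 89.4, 70.5, 85.6, 36.3, 77.5, 29.9],
    [95.5, 74.6, 90.7, 75.3, 83.4, 61.2, 85.1, 1.1, 43.2, 14.8],
    [91.5, 85.2, 84.5, 90.4, 83.1, 41.7, 56.1, 35.5, 35.9, 4.6],
    [94.2, 84.9, 81.5, 87.7, 80.1, 48.2, 65.6, 28.3, 38.1, 7.5],
])

# Rank 1 = best method on a task. No column contains ties,
# so a double argsort is enough to turn values into ranks.
order = np.argsort(-success, axis=0)
ranks = np.argsort(order, axis=0) + 1

avg_rank = ranks.mean(axis=1)        # one value per method
avg_success = success.mean(axis=1)   # one value per method

for name, s, r in zip(methods, avg_success, avg_rank):
    print(f"{name}: avg success {s:.2f}, avg rank {r:.2f}")
```

The average ranks come out to 1.20, 2.60, 3.10, and 3.10, matching the Rank column; the average success rates agree with the Avg column to within rounding of the displayed per-task values.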
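Each configuration is a four-digit flag string in the order (reward uncertainty, dynamics uncertainty, value uncertainty, optimistic preference selection), where 1 = enabled and 0 = disabled. For example, 1011 disables only the dynamics uncertainty term.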
| Config | Door Close | Faucet Open | Door Open | Drawer Open | Avg ↑ | Regret ↓ |
|---|---|---|---|---|---|---|
| 1111 (Full UBP²) | 97.8 | 89.5 | 73.0 | 77.7 | 76.1 | 11.5 |
| 0111 | 96.7 | 82.8 | 84.5 | 68.3 | 75.3 | 15.8 |
| 1011 | 95.9 | 89.7 | 58.5 | 58.4 | 69.6 | 26.0 |
| 1001 | 94.8 | 90.3 | 67.7 | 44.0 | 67.4 | 33.7 |
| 0000 (MBP) | 95.6 | 83.6 | 64.7 | 43.7 | 63.8 | 37.8 |
On both tasks shown, optimistic preference selection outperforms the entropy- and disagreement-based strategies.
| Preference Selection Strategy | Drawer Open | Door Open |
|---|---|---|
| UBP²-optimistic (ours) | 78.2 | 74.1 |
| UBP²-disagreement | 75.2 | 45.5 |
| UBP²-entropy | 42.0 | 0.1 |
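A minimal sketch of optimistic query selection follows: every candidate segment receives an optimistic value (ensemble-mean predicted return plus a multiple of the ensemble standard deviation), and all segment pairs in the pool are ranked globally. Scoring a pair by the sum of its two optimistic values, the `beta` weight, and the function name are assumptions made for illustration; the page does not specify how the pair score is formed.

```python
from itertools import combinations

import numpy as np


def select_query_pairs(segment_returns, n_queries, beta=1.0):
    """Pick the top-scoring segment pairs to send for preference labels.

    segment_returns: (E, M) predicted returns of M candidate segments
                     from E reward models.
    """
    mean = segment_returns.mean(axis=0)
    std = segment_returns.std(axis=0)   # proxy for reward-model epistemic uncertainty
    optimistic = mean + beta * std      # (M,)

    pairs = list(combinations(range(segment_returns.shape[1]), 2))
    # Assumption: a pair's score is the sum of its segments' optimistic values.
    pair_scores = np.array([optimistic[i] + optimistic[j] for i, j in pairs])
    top = np.argsort(-pair_scores)[:n_queries]
    return [pairs[t] for t in top]


# Toy pool: 2 reward models, 4 candidate segments.
toy_returns = np.array([
    [1.0, 0.0, 3.0, 0.0],
    [1.0, 4.0, 3.0, 0.0],
])
queries = select_query_pairs(toy_returns, n_queries=2)
```

In the toy pool, segment 1 has a lower mean than segment 2 but high ensemble disagreement, so with `beta=1.0` it is favored for querying.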
Increasing the planning horizon from 7 to 11 improves all three tasks. At horizon 15, Coffee Button improves further, while Door Open and Door Unlock regress below their horizon-11 results.
| Horizon H | Coffee Button | Door Open | Door Unlock |
|---|---|---|---|
| 7 | 77.5 | 74.1 | 75.7 |
| 11 | 88.8 | 79.3 | 86.5 |
| 15 | 92.4 | 68.6 | 74.1 |
Rollout videos for each MetaWorld task. Numbers in parentheses indicate the maximum preference feedback budget used.
Under standard RKHS/GP regularity assumptions with well-calibrated models, UBP² achieves sublinear cumulative regret in the number of environment interaction episodes.
$\Gamma_{d,N}$ and $\Gamma_{r,N}$ are the maximum information gains of the dynamics and reward kernels; $e_{q,N} = \sum_{n=1}^{N} \varepsilon_{q,n}$ accounts for Q-function sub-optimality. When $e_{q,N}$ is sublinear in $N$, the overall regret is sublinear.
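Linear: $\mathcal{O}\big(H^3 \sqrt{N}\, d_x^{3/2}\big)$
RBF: $\mathcal{O}\big(H^3 \sqrt{N}\, \log^{\frac{3}{2}(d+1)} N\big)$
Matérn: $\mathcal{O}\big(H^3 N^{\frac{1}{2} + \frac{3d}{2(2\nu+d)}}\big)$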
Both linear and RBF remain strictly sublinear for fixed state and action dimension.
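The Matérn rate, by contrast, is sublinear only for sufficiently smooth kernels. A short check of its exponent, using only the expression stated above, gives the condition:

```latex
\frac{1}{2} + \frac{3d}{2(2\nu + d)} < 1
\;\Longleftrightarrow\;
\frac{3d}{2(2\nu + d)} < \frac{1}{2}
\;\Longleftrightarrow\;
3d < 2\nu + d
\;\Longleftrightarrow\;
\nu > d .
```

For example, with $d = 2$ and $\nu = 5/2$ the exponent is $1/2 + 3/7 = 13/14 < 1$, so the bound is sublinear, whereas $\nu = 3/2$ gives $1/2 + 3/5 = 11/10 > 1$ and the bound becomes vacuous.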
A5.1 Continuous, bounded dynamics and reward
A5.2 RKHS regularity of dynamics & reward
A5.3 Q-function sub-optimality bounded by $\varepsilon_{q,n}$
A5.4 Epistemic Q-uncertainty via GP uncertainty propagation