LEAF Lab · University of Toronto

UBP2

Uncertainty-Balanced Preference Planning
Mohamed Nabail  ·  Leo Cheng*  ·  Jingmin Wang*  ·  Nicholas Rhinehart
* Equal contribution  ·  Learning, Embodied Autonomy, and Forecasting (LEAF) Lab
Preference-Based RL  ·  Model-Based RL  ·  Epistemic Uncertainty  ·  Meta-World  ·  Sublinear Regret
Figure 1. Illustration of planning in UBP². The agent rolls out the dynamics model to generate imagined trajectories. The planner selects the trajectory maximizing expected return and epistemic uncertainty, then executes only the first action a_{t=1}.

Abstract

Overview

Preference-based RL provides an approach to learning reward models from pairwise comparisons of behaviors, bypassing the need for explicit reward design. However, existing methods typically rely on passive data collection and suffer from poor sample efficiency, especially during the early stages of learning.

We introduce a model-based approach that actively directs exploration by jointly reasoning over uncertainties in the reward, dynamics, and value functions. Our method, Uncertainty-Balanced Preference Planning (UBP²), uses ensembles of reward, dynamics, and value function models to evaluate candidate trajectories according to a unified score that combines expected reward, terminal value, and epistemic uncertainty.

Planning under this objective yields an explicit tradeoff between exploitation and information acquisition without requiring ad hoc exploration heuristics. Under standard regularity assumptions, we establish sublinear regret guarantees for both finite-horizon and infinite-horizon settings.

76.0

Avg. IQM Success

across 10 MetaWorld tasks

1.20

Average Rank

best across all tasks vs. baselines

9/10

Tasks Won

higher sample efficiency

𝒪(√N)

Regret Bound

sublinear, both horizons


Method

UBP² Framework

UBP² jointly learns ensembles of reward, dynamics, and value models from preference feedback and online interaction. A model-predictive controller plans over imagined trajectories using a unified optimistic objective; once the preference budget is exhausted, control passes to the learned policy.

Figure 2. UBP² uses ensembles of reward (r_θ), dynamics (p_φ), and value (q_ψ) models to perform optimistic planning over imagined trajectories. Preferences are selected optimistically from executed trajectories and used to jointly train all three model ensembles.

Unified Planning Objective

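Eq. (1)
\[
\arg\max_{\mu,\sigma}\; \mathbb{E}_{a}\Big[ \sum_{t=0}^{H-1} \gamma^{t} \big( \mu_r^{n}(\hat{s}_t,a_t) + \lambda_r\,\sigma_r^{\mathrm{epi}}(\hat{s}_t,a_t) + \lambda_d\,\sigma_d^{\mathrm{epi}}(\hat{s}_t,a_t) \big) + \gamma^{H} \big( \mu_q^{n}(\hat{s}_H,a_H) + \lambda_q\,\sigma_q^{\mathrm{epi}}(\hat{s}_H,a_H) \big) \Big]
\]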

The coefficients λ_r, λ_d, λ_q are automatically tuned online via a Polyak-averaged target policy, eliminating manual tuning of the exploration–exploitation balance.
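The score inside the expectation of Eq. (1) can be sketched for a single imagined trajectory as follows. This is a minimal illustration, not the authors' implementation: it assumes epistemic uncertainty is estimated as the standard deviation across ensemble members (the page states only that ensembles are used), and the function names and argument layout are hypothetical.

```python
from statistics import fmean, pstdev


def ensemble_stats(predictions):
    """Mean and population standard deviation across ensemble members."""
    return fmean(predictions), pstdev(predictions)


def trajectory_score(reward_preds, dynamics_sigma, terminal_q_preds,
                     lam_r, lam_d, lam_q, gamma):
    """Score one imagined trajectory in the spirit of Eq. (1).

    reward_preds:     one list per step, holding each ensemble member's reward.
    dynamics_sigma:   one dynamics-uncertainty value per step (assumed given).
    terminal_q_preds: each ensemble member's Q-value at the final step.
    """
    horizon = len(reward_preds)
    score = 0.0
    for t in range(horizon):
        mu_r, sigma_r = ensemble_stats(reward_preds[t])
        # Expected reward plus weighted reward and dynamics uncertainty bonuses.
        score += gamma ** t * (mu_r + lam_r * sigma_r + lam_d * dynamics_sigma[t])
    mu_q, sigma_q = ensemble_stats(terminal_q_preds)
    # Terminal value plus weighted value uncertainty bonus.
    score += gamma ** horizon * (mu_q + lam_q * sigma_q)
    return score
```

Setting all three coefficients to zero reduces the score to a plain discounted return with a terminal value, which makes the role of each bonus term easy to isolate. A planner would average this score over sampled action sequences and keep the best candidate.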

Algorithm

ALGORITHM 1  ·  UBP²
Input: Replay buffer ℬ, preference buffer 𝒫
Params: model ensembles (p_φ, r_θ, q_ψ), policy π_learn
Init: t ← 0, n_pref ← 0, s ← env.reset()
while t ≤ N_T do
    if n_pref < N_pref then
        a ← MPC(s; λ_r, λ_d, λ_q)    // uncertainty-guided optimistic planning
    else
        a ← π_learn(s)    // switch to learned policy
    end if
    (s′, r) ← env.step(a) ; ℬ ← ℬ ∪ (s, a) ; t ← t + 1
    Update p_φ, r_θ, q_ψ, π_learn, {λ_r, λ_d, λ_q}
    if episode done and n_pref < N_pref then
        𝒫 ← 𝒫 ∪ OptimisticPrefSelect(ℬ, r_θ)
    end if
end while
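The control flow of Algorithm 1 can be sketched with a toy loop. Everything here is a stand-in chosen for illustration: `ToyEnv`, `mpc_action`, `learned_policy`, and `prefs_per_episode` are hypothetical, the preference count is read as the size of the preference buffer, and the state advance and episode reset are written out explicitly even though the algorithm box leaves them implicit. The loop also runs for exactly `total_steps` steps rather than testing t ≤ N_T.

```python
import random


class ToyEnv:
    """Hypothetical 1-D environment standing in for a manipulation task."""

    def __init__(self, episode_len=5):
        self.episode_len = episode_len
        self.steps = 0
        self.state = 0.0

    def reset(self):
        self.steps = 0
        self.state = 0.0
        return self.state

    def step(self, action):
        self.state += action
        self.steps += 1
        return self.state, self.steps >= self.episode_len


def mpc_action(state, rng):
    # Placeholder for uncertainty-guided planning: a random action.
    return rng.choice([-1.0, 1.0])


def learned_policy(state):
    # Placeholder for the learned policy: push the state back toward zero.
    return -1.0 if state > 0 else 1.0


def run(total_steps, pref_budget, prefs_per_episode, seed=0):
    rng = random.Random(seed)
    env = ToyEnv()
    replay, prefs = [], []
    planner_steps = policy_steps = 0
    state = env.reset()
    for _ in range(total_steps):
        if len(prefs) < pref_budget:
            action = mpc_action(state, rng)  # plan while budget remains
            planner_steps += 1
        else:
            action = learned_policy(state)  # budget exhausted
            policy_steps += 1
        next_state, done = env.step(action)
        replay.append((state, action))
        # Model, policy, and coefficient updates would happen here.
        state = next_state
        if done:
            if len(prefs) < pref_budget:
                # Placeholder for preference selection: add dummy queries.
                remaining = pref_budget - len(prefs)
                prefs.extend(["query"] * min(prefs_per_episode, remaining))
            state = env.reset()
    return planner_steps, policy_steps, len(prefs)
```

With five-step episodes, a budget of four queries, and two queries per episode, the planner controls the first two episodes and the learned policy controls the rest.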

Contributions


Experiments

Quantitative Results

We evaluate on 10 MetaWorld manipulation tasks with proprioceptive observations and report the interquartile mean (IQM) success rate over 1M environment steps. Feedback budgets reflect numbers commonly used in prior work.
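For readers unfamiliar with the metric, the IQM is the mean of the middle 50% of the sorted scores, which makes it less sensitive to outlier runs than a plain mean. A minimal sketch, assuming the common convention of trimming floor(n/4) values from each end; the per-seed success rates in the usage note are made up for illustration and are not results from this work.

```python
def interquartile_mean(values):
    """Mean of the values left after trimming the lowest and highest quarter."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one value")
    k = int(0.25 * n)  # number of values trimmed from each end
    middle = ordered[k:n - k]
    return sum(middle) / len(middle)
```

For eight hypothetical seeds with success rates 0, 60, 80, 85, 90, 95, 100, 100, the plain mean is 76.25 while the IQM is 87.5, because the single failed run and the two perfect runs are trimmed away.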

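| Method | Door Close | Window Close | Handle Press | Coffee Button | Faucet Open | Door Open | Door Unlock | Sweep Into | Drawer Open | Hammer | Avg ↑ | Rank ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UBP² | 97.8 | 93.0 | 93.4 | 86.2 | 89.4 | 70.5 | 85.6 | 36.3 | 77.5 | 29.9 | 76.0 | 1.20 |
| MBP | 95.5 | 74.6 | 90.7 | 75.3 | 83.4 | 61.2 | 85.1 | 1.1 | 43.2 | 14.8 | 62.5 | 2.60 |
| RUNE | 91.5 | 85.2 | 84.5 | 90.4 | 83.1 | 41.7 | 56.1 | 35.5 | 35.9 | 4.6 | 60.8 | 3.10 |
| MRN | 94.2 | 84.9 | 81.5 | 87.7 | 80.1 | 48.2 | 65.6 | 28.3 | 38.1 | 7.5 | 61.6 | 3.10 |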

* Asterisks (omitted here for brevity) denote statistically significant improvements under Wilcoxon signed-rank test (p < 0.05). Best per-column in bold.
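The Avg and Rank columns can be recomputed from the per-task scores. The sketch below copies the numbers from the results table above and assumes Rank is the mean of per-task ranks (1 = best). The helper names are hypothetical, and ties are not handled because none occur in this table.

```python
# Per-task IQM success rates, in the column order of the results table.
RESULTS = {
    "UBP2": [97.8, 93.0, 93.4, 86.2, 89.4, 70.5, 85.6, 36.3, 77.5, 29.9],
    "MBP": [95.5, 74.6, 90.7, 75.3, 83.4, 61.2, 85.1, 1.1, 43.2, 14.8],
    "RUNE": [91.5, 85.2, 84.5, 90.4, 83.1, 41.7, 56.1, 35.5, 35.9, 4.6],
    "MRN": [94.2, 84.9, 81.5, 87.7, 80.1, 48.2, 65.6, 28.3, 38.1, 7.5],
}


def average_success(results):
    """Mean score per method across tasks."""
    return {m: sum(scores) / len(scores) for m, scores in results.items()}


def average_rank(results):
    """Mean per-task rank per method (1 = highest score on a task)."""
    methods = list(results)
    n_tasks = len(results[methods[0]])
    totals = {m: 0 for m in methods}
    for j in range(n_tasks):
        ordered = sorted(methods, key=lambda m: results[m][j], reverse=True)
        for rank, method in enumerate(ordered, start=1):
            totals[method] += rank
    return {m: totals[m] / n_tasks for m in methods}


def tasks_won(results, method):
    """Number of tasks on which the method has the highest score."""
    n_tasks = len(results[method])
    return sum(
        1 for j in range(n_tasks)
        if results[method][j] == max(scores[j] for scores in results.values())
    )
```

Under that reading, the sketch reproduces the reported average rank of 1.20 and the 9/10 task wins for UBP², with Coffee Button being the one task where RUNE scores higher.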

Config = (Reward Unc. | Dynamics Unc. | Value Unc. | Opt. Pref. Selection). 1 = enabled, 0 = disabled.

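| Config | Door Close | Faucet Open | Door Open | Drawer Open | Avg ↑ | Regret ↓ |
|---|---|---|---|---|---|---|
| 1111 (Full UBP²) | 97.8 | 89.5 | 73.0 | 77.7 | 76.1 | 11.5 |
| 0111 | 96.7 | 82.8 | 84.5 | 68.3 | 75.3 | 15.8 |
| 1011 | 95.9 | 89.7 | 58.5 | 58.4 | 69.6 | 26.0 |
| 1001 | 94.8 | 90.3 | 67.7 | 44.0 | 67.4 | 33.7 |
| 0000 (MBP) | 95.6 | 83.6 | 64.7 | 43.7 | 63.8 | 37.8 |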

Optimistic preference selection consistently outperforms entropy- and disagreement-based strategies.

| Preference Selection Strategy | Drawer Open | Door Open |
|---|---|---|
| UBP²-optimistic (ours) | 78.2 | 74.1 |
| UBP²-disagreement | 75.2 | 45.5 |
| UBP²-entropy | 42.0 | 0.1 |
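The three strategies differ in how they score a candidate pair of trajectory segments. The sketch below is illustrative only. The entropy and disagreement scores follow the form commonly used in preference-based RL with a Bradley-Terry preference model. The optimistic score is a guess at one plausible criterion (an upper confidence bound on predicted segment return) and is not confirmed by this page. All function names and the `beta` parameter are hypothetical.

```python
import math
from statistics import fmean, pstdev


def preference_probs(returns_a, returns_b):
    """Bradley-Terry probability that B is preferred, per ensemble member."""
    return [1.0 / (1.0 + math.exp(ra - rb)) for ra, rb in zip(returns_a, returns_b)]


def disagreement_score(returns_a, returns_b):
    """Spread of the predicted preference across ensemble members."""
    return pstdev(preference_probs(returns_a, returns_b))


def entropy_score(returns_a, returns_b):
    """Binary entropy of the ensemble-averaged preference probability."""
    p = fmean(preference_probs(returns_a, returns_b))
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def upper_confidence_return(returns, beta):
    """Ensemble mean return plus beta times the ensemble standard deviation."""
    return fmean(returns) + beta * pstdev(returns)


def optimistic_score(returns_a, returns_b, beta=1.0):
    """Assumed criterion: the larger upper-confidence return in the pair."""
    return max(upper_confidence_return(returns_a, beta),
               upper_confidence_return(returns_b, beta))


def select_pair(candidates, score_fn):
    """Pick the candidate (returns_a, returns_b) pair with the highest score."""
    return max(candidates, key=lambda pair: score_fn(*pair))
```

Each element of `returns_a` and `returns_b` is one ensemble member's predicted return for that segment. Under these definitions, an entropy criterion favors pairs the reward model cannot tell apart, while the assumed optimistic criterion favors pairs that contain a segment the model believes could be high-return.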

Performance improves on all three tasks when the horizon increases from 7 to 11. At horizon 15, Coffee Button improves further, while Door Open and Door Unlock drop below their horizon-7 levels.

| Horizon H | Coffee Button | Door Open | Door Unlock |
|---|---|---|---|
| 7 | 77.5 | 74.1 | 75.7 |
| 11 | 88.8 | 79.3 | 86.5 |
| 15 | 92.4 | 68.6 | 74.1 |

Videos

Qualitative Results

Rollout videos for each MetaWorld task. Each entry lists the maximum preference feedback budget used and the UBP² IQM success rate.

| Task | Video | Budget (queries) | UBP² IQM |
|---|---|---|---|
| Door Close | door_close.mp4 | 500 | 97.8 |
| Window Close | window_close.mp4 | 500 | 93.0 |
| Handle Press | handle_press.mp4 | 1,000 | 93.4 |
| Coffee Button | coffee_button.mp4 | 1,000 | 86.2 |
| Faucet Open | faucet_open.mp4 | 2,000 | 89.4 |
| Door Open | door_open.mp4 | 2,000 | 70.5 |
| Door Unlock | door_unlock.mp4 | 2,500 | 85.6 |
| Sweep Into | sweep_into.mp4 | 5,000 | 36.3 |
| Drawer Open | drawer_open.mp4 | 5,000 | 77.5 |
| Hammer | hammer.mp4 | 10,000 | 29.9 |

Theory

Theoretical Guarantees

Under standard RKHS/GP regularity assumptions with well-calibrated models, UBP² achieves sublinear cumulative regret in the number of environment interaction episodes.

Theorem 5.8 Infinite-Horizon Regret Bound

\[
R_{\gamma,N} \;\le\; \mathcal{O}\Big( H^{3}\sqrt{N}\,\big( \Gamma_{d,N}^{3/2}\log N + \Gamma_{r,N}^{3/2}\log N \big) + e_{q,N} \Big)
\]

Γ_{d,N} and Γ_{r,N} are the maximum information gains of the dynamics and reward kernels; e_{q,N} = Σ_n ε_{q,n} accounts for Q-function sub-optimality. When e_{q,N} is sublinear, overall regret is sublinear.

Kernel-Specific Rates

Linear:  𝒪(H³ √N · d_x^{3/2})

RBF:     𝒪(H³ √N · log^{(3/2)(d+1)} N)

Matérn:  𝒪(H³ · N^{1/2 + 3d/(2(2ν+d))})

Both linear and RBF remain strictly sublinear for fixed state and action dimension.
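The Matérn rate, by contrast, is sublinear only when the kernel is smooth enough. A short derivation, assuming the Matérn exponent reads 1/2 + 3d/(2(2ν+d)) with ν > 0 and d > 0:

```latex
\[
\frac{1}{2} + \frac{3d}{2(2\nu + d)} < 1
\;\iff\; \frac{3d}{2(2\nu + d)} < \frac{1}{2}
\;\iff\; 3d < 2\nu + d
\;\iff\; \nu > d .
\]
```

Under that reading, the Matérn bound grows slower than N exactly when the smoothness parameter ν exceeds the dimension d.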

Key Assumptions

A5.1   Continuous, bounded dynamics and reward

A5.2   RKHS regularity of dynamics & reward

A5.3   Q-function sub-optimality bounded by ε_{q,n}

A5.4   Epistemic Q-uncertainty via GP uncertainty propagation


Citation

Cite This Work

@article{nabail2025ubp2,
  title = {UBP2: Uncertainty-Balanced Preference Planning},
  author = {Nabail, Mohamed and Cheng, Leo and Wang, Jingmin
            and Rhinehart, Nicholas},
  institution = {University of Toronto, LEAF Lab},
  year = {2025},
  url = {https://github.com/MohamedGNabail/ubp2/}
}