LEAF Lab · University of Toronto

UBP²

Uncertainty-Balanced Preference Planning

Mohamed Nabail · Leo Cheng* · Jingmin Wang* · Nicholas Rhinehart

* Equal contribution · Learning, Embodied Autonomy, and Forecasting (LEAF) Lab

Code · Paper (arXiv)

Preference-Based RL · Model-Based RL · Epistemic Uncertainty · Meta-World · Sublinear Regret

**Figure 1.** Illustration of planning in UBP². The agent rolls out the dynamics model to generate imagined trajectories. The planner selects the trajectory that maximizes expected return plus weighted epistemic uncertainty, then executes only its first action, a_{t=1}.

Abstract

Overview

Preference-based RL provides an approach to learning reward models from pairwise comparisons of behaviors, bypassing the need for explicit reward design. However, existing methods typically rely on passive data collection and suffer from poor sample efficiency, especially during the early stages of learning.

We introduce a model-based approach that actively directs exploration by jointly reasoning over uncertainties in the reward, dynamics, and value functions. Our method, Uncertainty-Balanced Preference Planning (UBP²), uses ensembles of reward, dynamics, and value function models to evaluate candidate trajectories according to a unified score that combines expected reward, terminal value, and epistemic uncertainty.

Planning under this objective yields an explicit tradeoff between exploitation and information acquisition without requiring ad hoc exploration heuristics. Under standard regularity assumptions, we establish sublinear regret guarantees for both finite-horizon and infinite-horizon settings.

76.0

Avg. IQM Success

across 10 Meta-World tasks

1.20

Average Rank

best across all tasks vs. baselines

9/10

Tasks Won

highest IQM success among compared methods

𝒪(√N)

Regret Bound

sublinear, both horizons

Method

UBP² Framework

UBP² jointly learns ensembles of reward, dynamics, and value models from preference feedback and online interaction. A model-predictive controller plans over imagined trajectories using a unified optimistic objective; once the preference budget is exhausted, control passes to the learned policy.

Figure 2. UBP² uses ensembles of reward (r_θ), dynamics (p_φ), and value (q_ψ) models to perform optimistic planning over imagined trajectories. Preferences are selected optimistically from executed trajectories and used to jointly train all three model ensembles.
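One common way to read an epistemic term off an ensemble is the disagreement between its members. The sketch below assumes each member is a plain callable and measures disagreement as the standard deviation of member predictions; the page does not state the exact estimator, and `ensemble_stats` and the toy models are hypothetical names used only for illustration.

```python
from statistics import fmean, pstdev
from typing import Callable, Sequence, Tuple


def ensemble_stats(
    members: Sequence[Callable[[float, float], float]],
    state: float,
    action: float,
) -> Tuple[float, float]:
    """Return (mean prediction, population std across ensemble members)."""
    preds = [member(state, action) for member in members]
    return fmean(preds), pstdev(preds)


# Three hypothetical reward models that disagree slightly (assumption, not from the paper).
toy_reward_ensemble = [
    lambda s, a: s + a,
    lambda s, a: s + a + 0.3,
    lambda s, a: s + a - 0.3,
]

mu_r, sigma_r = ensemble_stats(toy_reward_ensemble, state=1.0, action=0.5)
```

Here `mu_r` plays the role of the mean prediction and `sigma_r` the role of an epistemic-uncertainty estimate for one state-action pair; the same helper could be applied to dynamics or value ensembles.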

Unified Planning Objective

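Eq. (1)

$$
\arg\max_{\mu,\sigma}\; \mathbb{E}_{a}\!\left[\, \sum_{t=0}^{H-1} \gamma^{t}\Big( \mu^{r}_{n}(\hat{s}_t,a_t) + \lambda_r\,\sigma^{r}_{\mathrm{epi}}(\hat{s}_t,a_t) + \lambda_d\,\sigma^{d}_{\mathrm{epi}}(\hat{s}_t,a_t) \Big) + \gamma^{H}\Big( \mu^{q}_{n}(\hat{s}_H,a_H) + \lambda_q\,\sigma^{q}_{\mathrm{epi}}(\hat{s}_H,a_H) \Big) \right]
$$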

The coefficients λ_r, λ_d, λ_q are automatically tuned online via a Polyak-averaged target policy, eliminating manual tuning of the exploration–exploitation balance.
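To make the planning objective concrete, the sketch below evaluates the bracketed quantity in Eq. (1) for a single imagined trajectory, given per-step means and epistemic standard deviations that are assumed to come from the model ensembles. The function name, the toy numbers, and the choice of setting all three coefficients to the same value are illustrative assumptions; the expectation over sampled action sequences and the online tuning of the coefficients are not shown.

```python
from typing import Sequence


def trajectory_score(
    mu_r: Sequence[float],     # mean predicted reward at t = 0..H-1
    sigma_r: Sequence[float],  # reward epistemic std at each step
    sigma_d: Sequence[float],  # dynamics epistemic std at each step
    mu_q_H: float,             # mean terminal value at step H
    sigma_q_H: float,          # terminal value epistemic std
    gamma: float,
    lam_r: float,
    lam_d: float,
    lam_q: float,
) -> float:
    """Discounted optimistic return of one imagined trajectory."""
    horizon = len(mu_r)
    stage = sum(
        gamma**t * (mu_r[t] + lam_r * sigma_r[t] + lam_d * sigma_d[t])
        for t in range(horizon)
    )
    terminal = gamma**horizon * (mu_q_H + lam_q * sigma_q_H)
    return stage + terminal


# Two hypothetical imagined trajectories with H = 2 (numbers are made up).
familiar = dict(mu_r=[1.0, 1.0], sigma_r=[0.0, 0.0], sigma_d=[0.0, 0.0],
                mu_q_H=2.0, sigma_q_H=0.0)
novel = dict(mu_r=[0.8, 0.8], sigma_r=[0.2, 0.2], sigma_d=[0.2, 0.2],
             mu_q_H=2.0, sigma_q_H=0.4)


def pick(lam: float) -> str:
    """Name of the higher-scoring trajectory when all bonus weights equal lam."""
    scores = {
        name: trajectory_score(**traj, gamma=0.5, lam_r=lam, lam_d=lam, lam_q=lam)
        for name, traj in (("familiar", familiar), ("novel", novel))
    }
    return max(scores, key=scores.get)


greedy_choice = pick(0.0)      # no exploration bonus
optimistic_choice = pick(1.0)  # all bonus weights set to 1
```

With the bonus weights at zero the planner prefers the well-understood trajectory; with the weights at one the uncertain trajectory scores higher, which is the exploitation versus information-acquisition tradeoff the objective encodes.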

Algorithm

ALGORITHM 1 · UBP²

Input: replay buffer ℬ, preference buffer 𝒫
Params: model ensembles (p_φ, r_θ, q_ψ), policy π^learn
Init: t ← 0, n_pref ← 0, s ← env.reset()
while t ≤ N_T do
    if n_pref < N_pref then
        a ← MPC(s; λ_r, λ_d, λ_q)    // uncertainty-guided optimistic planning
    else
        a ← π^learn(s)    // switch to learned policy
    end if
    (s′, r) ← env.step(a) ; ℬ ← ℬ ∪ (s, a) ; t ← t + 1
    Update p_φ, r_θ, q_ψ, π^learn, {λ_r, λ_d, λ_q}
    if episode done and n_pref < N_pref then
        𝒫 ← 𝒫 ∪ OptimisticPrefSelect(ℬ, r_θ)
    end if
end while
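The control flow of Algorithm 1 can be sketched as a small runnable loop. Everything here is a stand-in: the environment, planner, policy, update step, and preference selector are hypothetical callables, the number of collected preferences is tracked as the size of the preference buffer, and the state carry-over and episode reset are assumptions added so the loop runs end to end.

```python
from typing import Callable, List, Tuple

Transition = Tuple[int, int]  # (state, action) in this toy setting


def run_loop(
    env_reset: Callable[[], int],
    env_step: Callable[[int, int], int],
    mpc: Callable[[int], int],
    learned_policy: Callable[[int], int],
    select_prefs: Callable[[List[Transition]], list],
    update: Callable[[List[Transition], list], None],
    n_steps: int,
    pref_budget: int,
    episode_len: int,
):
    replay: List[Transition] = []
    prefs: list = []
    mpc_steps = 0
    s = env_reset()
    for t in range(n_steps):
        if len(prefs) < pref_budget:
            a = mpc(s)  # optimistic planning while feedback budget remains
            mpc_steps += 1
        else:
            a = learned_policy(s)  # budget exhausted: act with the learned policy
        s_next = env_step(s, a)
        replay.append((s, a))
        update(replay, prefs)  # placeholder for model, policy, and coefficient updates
        episode_done = (t + 1) % episode_len == 0
        if episode_done and len(prefs) < pref_budget:
            prefs.extend(select_prefs(replay))
        # Assumption: carry the state forward, resetting at episode boundaries.
        s = env_reset() if episode_done else s_next
    return replay, prefs, mpc_steps


replay, prefs, mpc_steps = run_loop(
    env_reset=lambda: 0,
    env_step=lambda s, a: s + a,
    mpc=lambda s: 1,
    learned_policy=lambda s: -1,
    select_prefs=lambda buf: [(buf[-2], buf[-1])],  # one toy query per episode
    update=lambda buf, prefs: None,
    n_steps=10,
    pref_budget=2,
    episode_len=2,
)
```

In this toy run the planner acts for the first two episodes, after which the preference budget of two queries is used up and the learned policy takes over for the remaining steps.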

Contributions

Unified optimistic planning objective combining predicted cumulative return with epistemic uncertainty over reward, dynamics, and value — outperforming any single-component uncertainty approach.
Optimistic preference query strategy that globally ranks candidate trajectory segment pairs by predicted reward plus reward-model epistemic uncertainty, rather than sampling locally.
Sublinear regret bounds (finite- and infinite-horizon) with explicit dependence on the maximum information gain Γ_d,N and Γ_r,N of the dynamics and reward kernels.
Strong empirical results on 10 Meta-World tasks — substantially higher sample efficiency than model-free (RUNE, MRN) and non-optimistic model-based baselines.
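The optimistic query strategy in the second contribution can be illustrated with a small ranking sketch. It assumes each candidate segment already has a predicted return and a reward-model epistemic standard deviation, and that a pair is scored by the sum of its two segments' optimistic values; that pair score, the weight `lam`, and the segment values are assumptions for illustration, since the page does not give the exact formula.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def select_optimistic_pairs(
    segments: Dict[str, Tuple[float, float]],  # name -> (mean return, epistemic std)
    k: int,
    lam: float = 1.0,
) -> List[Tuple[str, str]]:
    """Rank all segment pairs globally by summed optimistic value; return the top k."""
    optimistic = {name: mu + lam * sd for name, (mu, sd) in segments.items()}
    pairs = combinations(sorted(segments), 2)
    ranked = sorted(
        pairs,
        key=lambda pair: optimistic[pair[0]] + optimistic[pair[1]],
        reverse=True,
    )
    return ranked[:k]


# Hypothetical candidate segments (values are made up).
segments = {"a": (1.0, 0.1), "b": (0.5, 0.9), "c": (0.9, 0.0), "d": (0.2, 0.1)}

top_optimistic = select_optimistic_pairs(segments, k=2)
top_greedy = select_optimistic_pairs(segments, k=1, lam=0.0)
```

With the uncertainty bonus, the highly uncertain segment `b` appears in both selected pairs; without it, the query goes to the two segments with the highest predicted return.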

Experiments

Quantitative Results

UBP² is evaluated on 10 Meta-World manipulation tasks with proprioceptive observations. We report the interquartile mean (IQM) success rate over 1M environment steps. Preference feedback budgets follow the values commonly used in prior work.
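For readers unfamiliar with the metric, the interquartile mean discards the lowest and highest quarter of the values and averages the rest, which makes it less sensitive to outlier runs than a plain mean. The sketch below is a minimal stdlib version under the assumption that the cut size is truncated to an integer; the per-run success rates are hypothetical, and how runs are aggregated in the actual evaluation is not specified here.

```python
from statistics import fmean
from typing import Sequence


def interquartile_mean(values: Sequence[float]) -> float:
    """Mean of the middle 50% of the values."""
    ordered = sorted(values)
    cut = int(0.25 * len(ordered))  # truncates, so small samples trim conservatively
    return fmean(ordered[cut:len(ordered) - cut])


# Hypothetical per-run success rates (not from the paper).
run_success = [0.0, 10.0, 80.0, 90.0, 95.0, 100.0, 100.0, 100.0]

iqm = interquartile_mean(run_success)
plain_mean = fmean(run_success)
```

In this example the two failed runs pull the plain mean down sharply, while the IQM reflects the typical run.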

Method	Door Close	Window Close	Handle Press	Coffee Button	Faucet Open	Door Open	Door Unlock	Sweep Into	Drawer Open	Hammer	Avg ↑	Rank ↓
UBP²	**97.8**	**93.0**	**93.4**	86.2	**89.4**	**70.5**	**85.6**	**36.3**	**77.5**	**29.9**	**76.0**	**1.20**
MBP	95.5	74.6	90.7	75.3	83.4	61.2	85.1	1.1	43.2	14.8	62.5	2.60
RUNE	91.5	85.2	84.5	**90.4**	83.1	41.7	56.1	35.5	35.9	4.6	60.8	3.10
MRN	94.2	84.9	81.5	87.7	80.1	48.2	65.6	28.3	38.1	7.5	61.6	3.10

* Asterisks (omitted here for brevity) denote statistically significant improvements under a Wilcoxon signed-rank test (p < 0.05). The best result in each column is shown in bold.
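The Avg and Rank columns, and the 9/10 task wins, can be recomputed directly from the per-task numbers in the table. The sketch below assumes a simple ordinal ranking per task (1 = highest IQM success), which is sufficient here because no column contains ties; the helper name `summarize` is hypothetical.

```python
from typing import Dict, List

# Per-task IQM success, in the column order of the table above.
results: Dict[str, List[float]] = {
    "UBP2": [97.8, 93.0, 93.4, 86.2, 89.4, 70.5, 85.6, 36.3, 77.5, 29.9],
    "MBP":  [95.5, 74.6, 90.7, 75.3, 83.4, 61.2, 85.1, 1.1, 43.2, 14.8],
    "RUNE": [91.5, 85.2, 84.5, 90.4, 83.1, 41.7, 56.1, 35.5, 35.9, 4.6],
    "MRN":  [94.2, 84.9, 81.5, 87.7, 80.1, 48.2, 65.6, 28.3, 38.1, 7.5],
}


def summarize(table: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Average score, average per-task rank, and number of task wins per method."""
    methods = list(table)
    n_tasks = len(next(iter(table.values())))
    rank_sum = {m: 0 for m in methods}
    wins = {m: 0 for m in methods}
    for j in range(n_tasks):
        ordered = sorted(methods, key=lambda m: table[m][j], reverse=True)
        for position, method in enumerate(ordered, start=1):
            rank_sum[method] += position
        wins[ordered[0]] += 1
    return {
        m: {
            "avg": sum(table[m]) / n_tasks,
            "rank": rank_sum[m] / n_tasks,
            "wins": wins[m],
        }
        for m in methods
    }


summary = summarize(results)
```

Coffee Button is the one task where UBP² is not first: it ranks third there, which gives an average rank of (9 × 1 + 3) / 10 = 1.20.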

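Each ablation configuration is written as four binary flags in the order (reward uncertainty | dynamics uncertainty | value uncertainty | optimistic preference selection), where 1 = enabled and 0 = disabled.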

Config	Door Close	Faucet Open	Door Open	Drawer Open	Avg ↑	Regret ↓
1111 (Full UBP²)	97.8	89.5	73.0	77.7	76.1	11.5
0111	96.7	82.8	84.5	68.3	75.3	15.8
1011	95.9	89.7	58.5	58.4	69.6	26.0
1001	94.8	90.3	67.7	44.0	67.4	33.7
0000 (MBP)	95.6	83.6	64.7	43.7	63.8	37.8

Optimistic preference selection consistently outperforms entropy- and disagreement-based strategies.

Preference Selection Strategy	Drawer Open	Door Open
UBP²-optimistic (ours)	78.2	74.1
UBP²-disagreement	75.2	45.5
UBP²-entropy	42.0	0.1

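All three tasks improve when the planning horizon increases from 7 to 11. At horizon 15, Coffee Button continues to improve (92.4), while Door Open (68.6) and Door Unlock (74.1) fall below their horizon-7 scores.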

Horizon H	Coffee Button	Door Open	Door Unlock
7	77.5	74.1	75.7
11	88.8	79.3	86.5
15	92.4	68.6	74.1

Videos

Qualitative Results

Rollout videos for each Meta-World task. Each entry lists the maximum preference feedback budget used and the UBP² IQM success rate.

door_close.mp4

Door Close

Budget: 500 queries

UBP²: 97.8 IQM

window_close.mp4

Window Close

Budget: 500 queries

UBP²: 93.0 IQM

handle_press.mp4

Handle Press

Budget: 1,000 queries

UBP²: 93.4 IQM

coffee_button.mp4

Coffee Button

Budget: 1,000 queries

UBP²: 86.2 IQM

faucet_open.mp4

Faucet Open

Budget: 2,000 queries

UBP²: 89.4 IQM

door_open.mp4

Door Open

Budget: 2,000 queries

UBP²: 70.5 IQM

door_unlock.mp4

Door Unlock

Budget: 2,500 queries

UBP²: 85.6 IQM

sweep_into.mp4

Sweep Into

Budget: 5,000 queries

UBP²: 36.3 IQM

drawer_open.mp4

Drawer Open

Budget: 5,000 queries

UBP²: 77.5 IQM

hammer.mp4

Hammer

Budget: 10,000 queries

UBP²: 29.9 IQM

Theory

Theoretical Guarantees

Under standard RKHS/GP regularity assumptions with well-calibrated models, UBP² achieves sublinear cumulative regret in the number of environment interaction episodes.

Theorem 5.8 Infinite-Horizon Regret Bound

$$
R_{\gamma,N} \le \mathcal{O}\!\left( H^{3} \sqrt{N}\,\Big( \Gamma_{d,N}^{3/2} \log N + \Gamma_{r,N}^{3/2} \log N \Big) + e_{q,N} \right)
$$

Γ_{d,N} and Γ_{r,N} are the maximum information gains of the dynamics and reward kernels; e_{q,N} = Σ_n ε_{q,n} accounts for Q-function sub-optimality. When e_{q,N} is sublinear in N, the overall regret is sublinear.

Kernel-Specific Rates

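Linear: $\mathcal{O}\big(H^{3} \sqrt{N}\, d_x^{3/2}\big)$

RBF: $\mathcal{O}\big(H^{3} \sqrt{N}\, \log^{\frac{3}{2}(d+1)} N\big)$

Matérn: $\mathcal{O}\big(H^{3} N^{\frac{1}{2} + \frac{3d}{2(2\nu+d)}}\big)$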

Both linear and RBF remain strictly sublinear for fixed state and action dimension.
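The kernel-specific rates are consistent with substituting the standard upper bounds on maximum information gain into the √N · Γ^{3/2} factor of the regret bound. The derivation below is a sketch under that assumption (the information-gain bounds are the commonly cited ones, not taken from this page), with d denoting the input dimension and lower-order logarithmic factors absorbed into the tilde notation.

```latex
% Assumed standard information-gain bounds (d = input dimension):
%   linear:  \Gamma_N = O(d \log N)
%   RBF:     \Gamma_N = O(\log^{d+1} N)
%   Matern:  \Gamma_N = \tilde{O}\big(N^{d/(2\nu+d)}\big)
\begin{align*}
\text{Linear:}\quad
  & \sqrt{N}\,\Gamma_N^{3/2} = \tilde{O}\big(\sqrt{N}\, d^{3/2}\big) \\
\text{RBF:}\quad
  & \sqrt{N}\,\Gamma_N^{3/2} = O\big(\sqrt{N}\,\log^{\frac{3}{2}(d+1)} N\big) \\
\text{Mat\'ern:}\quad
  & \sqrt{N}\,\Gamma_N^{3/2} = \tilde{O}\big(N^{\frac{1}{2} + \frac{3d}{2(2\nu+d)}}\big) \\
\text{Mat\'ern bound is sublinear when}\quad
  & \tfrac{1}{2} + \tfrac{3d}{2(2\nu+d)} < 1
    \;\iff\; 3d < 2\nu + d
    \;\iff\; \nu > d
\end{align*}
```

Under these assumed bounds, the linear and RBF rates grow as √N up to polylogarithmic factors, whereas the Matérn rate is sublinear only when the smoothness parameter ν exceeds the dimension d.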

Key Assumptions

A5.1 Continuous, bounded dynamics and reward

A5.2 RKHS regularity of dynamics & reward

A5.3 Q-function sub-optimality bounded by ε_q,n

A5.4 Epistemic Q-uncertainty via GP uncertainty propagation
