CRED: Counterfactual Reasoning and Environment Design for Active Preference Learning

Department of Computer Science
University of Colorado Boulder
International Conference on Robotics and Automation (ICRA) 2026

Problem: Aligning robot behavior with human preferences requires querying humans with trajectory comparisons, but which scenarios and trajectories should we show? Existing methods query in a single fixed environment with randomly generated trajectories, missing many informative comparisons.

Our approach: CRED efficiently searches for informative environments to query the human. In each environment, we present two trajectories that represent different reward hypotheses (sampled from our current belief over human preferences), creating counterfactual comparisons that maximally disambiguate the true reward.

Figure: (a) A fixed environment limits query diversity. (b) Counterfactual trajectories generated from sampled reward functions. (c) Environment design reveals new preference distinctions. (d) Learned preferences deployed on a real robot.

Method Overview

CRED formulates active preference learning as a bilevel optimization problem. The outer loop uses Bayesian optimization to select environment parameters θ that yield the most informative preference queries. The inner loop performs counterfactual reasoning: it samples diverse reward-weight hypotheses from the current belief, generates a trajectory for each hypothesis (via RL or trajectory optimization), and selects the pair that maximizes the mutual information between the human's answer and the reward weights. This joint optimization over environments and trajectories lets CRED efficiently explore the feature space and converge to accurate reward estimates with fewer human queries.


Figure: CRED's bilevel optimization framework. The outer loop (Bayesian optimization) proposes environment parameters θ. The inner loop (counterfactual reasoning) samples reward weights from the current belief, generates trajectories, and selects the most informative pair to present to the human.

Experimental Domains


Lunar Lander

Safely land a 2D lander on a designated pad. Features include vertical speed and horizontal position. Environment parameter: wind power (1–20). Policy trained with PPO.
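A linear reward over trajectory features of this kind might be computed as below; the exact feature definitions (here, mean vertical-speed magnitude and mean horizontal offset from a pad assumed at x = 0) are an illustrative sketch, not the paper's exact formulation.

```python
def trajectory_features(states):
    """Per-trajectory features for a Lunar-Lander-style domain: mean |vertical
    speed| and mean |horizontal offset| from the pad (assumed at x = 0).
    Each state is (x, y, vx, vy); an illustrative sketch only."""
    n = len(states)
    mean_vspeed = sum(abs(vy) for _, _, _, vy in states) / n
    mean_offset = sum(abs(x) for x, _, _, _ in states) / n
    return (mean_vspeed, mean_offset)

def linear_reward(weights, states):
    """Reward as a weighted sum of features: the standard linear-reward model."""
    return sum(w * f for w, f in zip(weights, trajectory_features(states)))
```

Negative weights penalize fast descent and drifting away from the pad, so preference learning reduces to estimating the weight vector.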


Tabletop Manipulation

Deliver a coffee cup across a cluttered table in MuJoCo. Features: hovering frequency over objects. Environment parameter: object placements (compressed via VAE). Trajectories via CHOMP.


Navigation

Deliver food between locations over a street network in Webots. Features: path length and terrain proportions. Environment parameter: surface type assignments. Policy via value iteration.
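Planning on a street network with value iteration can be sketched as follows; the graph, edge costs, and preference weights here are toy assumptions, not the paper's actual Navigation setup.

```python
def value_iteration(graph, w_len, w_terrain, gamma=1.0, iters=100):
    """Deterministic value iteration over a street graph. graph[s] is a list of
    (next_node, length, terrain) edges; terminal nodes map to []. The edge
    reward penalizes path length and terrain type according to a sampled
    preference hypothesis (w_len, w_terrain)."""
    V = {s: 0.0 for s in graph}
    for _ in range(iters):
        for s, edges in graph.items():
            if edges:
                V[s] = max(-(w_len * length + w_terrain[terrain]) + gamma * V[nxt]
                           for nxt, length, terrain in edges)
    return V

# Toy network: a short grass shortcut vs. a longer asphalt route (illustrative).
graph = {
    "start": [("goal", 1.0, "grass"), ("mid", 1.0, "asphalt")],
    "mid":   [("goal", 1.0, "asphalt")],
    "goal":  [],
}
V = value_iteration(graph, w_len=1.0, w_terrain={"grass": 5.0, "asphalt": 0.5})
```

With a heavy grass penalty, the optimal value at `start` comes from the longer asphalt route, which is exactly the kind of terrain trade-off the learned weights should capture.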

Simulation Results

We evaluate CRED against two baselines: RR (Random Rollouts with mutual information optimization) and MBP (Mean Belief Policy). We also conduct ablation studies isolating the effect of counterfactual reasoning (CR) and environment design (ED), and compare against domain randomization (DR) variants. CRED consistently achieves the highest reward correlation with fewer queries across all three domains.
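The reward-correlation metric can be sketched as the Pearson correlation between learned and ground-truth rewards on held-out trajectory features; this is our reading of the metric, and the exact evaluation details may differ.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def reward_correlation(w_learned, w_true, test_features):
    """Correlation between learned and ground-truth linear rewards over a set
    of held-out trajectory feature vectors (an illustrative sketch)."""
    r_hat = [sum(w * f for w, f in zip(w_learned, feats)) for feats in test_features]
    r_star = [sum(w * f for w, f in zip(w_true, feats)) for feats in test_features]
    return pearson(r_hat, r_star)
```

Because correlation is scale-invariant, a learned weight vector that points in the right direction scores r = 1 even if its magnitude differs from the ground truth.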

Figures: reward-correlation results for Lunar Lander, Tabletop, and Navigation, comparing CRED against the baselines (RR, MBP), the ablations (CR and ED in isolation), and the domain-randomization variants.

Key findings: CRED converges to near-perfect accuracy (r = 0.998) in Lunar Lander within 4 iterations. In Tabletop, CRED reaches r = 0.979 by iteration 10. In Navigation, CRED achieves r = 0.906 after 16 iterations. Ablation studies confirm that both counterfactual reasoning and environment design are essential: removing either component significantly reduces performance. CRED also outperforms domain randomization variants, which lack principled environment selection.

User Study

We conducted an IRB-approved user study with 25 participants (ages 20–39) in a within-subjects design. Each participant answered 6 preference queries per task under three conditions: CRED, MBP, and RR-DR. We measured reward correlation with ground truth, NASA-TLX workload, and ease-of-choice ratings on a 7-point Likert scale.

Figure: user study results. (a) Tabletop accuracy. (b) Navigation accuracy. (c) Ease of choice. (d) Overall workload (NASA-TLX). CRED outperforms the baselines on all four measures.

Key findings: CRED achieves a median reward correlation of 0.97 in Tabletop and 0.78 in Navigation, significantly outperforming both baselines. Participants reported the lowest mental workload with CRED (median NASA-TLX: 1.83) and the highest ease-of-choice ratings (median: 6.0 out of 7). Statistical significance was confirmed via Wilcoxon signed-rank tests with Holm–Bonferroni correction.
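The Holm–Bonferroni step-down procedure used for the significance tests works as sketched below; this is a generic implementation, not tied to the study's data.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm step-down correction for m hypotheses: sort p-values ascending and
    compare the k-th smallest (0-indexed) against alpha / (m - k); reject
    until the first comparison fails, then retain all remaining hypotheses."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject
```

Holm's method controls the family-wise error rate like plain Bonferroni but is uniformly more powerful, since only the smallest p-value faces the full alpha/m threshold.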

Real-World Deployment

Reward functions learned through CRED are deployed on real robotic platforms. The learned preferences transfer from simulation to hardware, enabling robots to navigate and manipulate according to user-specified trade-offs.

Figure: Unitree Go1 quadruped navigating on grass and on asphalt.

Video Presentation

Paper

BibTeX

@inproceedings{tung2026cred,
  title={CRED: Counterfactual Reasoning and Environment Design for Active Preference Learning},
  author={Tung, Yi-Shiuan and Kumar, Gyanig and Jiang, Wei and Hayes, Bradley and Roncone, Alessandro},
  booktitle={IEEE International Conference on Robotics and Automation (ICRA)},
  year={2026}
}