Abstract
As a robot's operational environment and the tasks it must perform grow in complexity, explicitly specifying and balancing optimization objectives to achieve a preferred behavior profile becomes increasingly intractable. Such systems benefit greatly from aligning their behavior with human preferences and responding to corrections, but manually encoding this feedback is infeasible. Active preference learning (APL) learns human reward functions by presenting trajectories for ranking. However, existing methods sample from fixed trajectory sets or replay buffers, which limits query diversity and often fails to identify informative comparisons. We propose CRED, a novel trajectory generation method for APL that improves reward inference by jointly optimizing environment design and trajectory selection to efficiently query and extract preferences from users. CRED "imagines" new scenarios through environment design and leverages counterfactual reasoning (sampling possible rewards from its current belief and asking "What if this were the true preference?") to generate trajectory pairs that expose differences between competing reward functions. Comprehensive experiments and a user study show that CRED significantly outperforms state-of-the-art methods in reward accuracy and sample efficiency, and receives higher user ratings.
Method Overview
CRED formulates active preference learning as a bilevel optimization problem. The outer loop uses Bayesian optimization to select the environment parameters θ that yield the most informative preference queries. The inner loop performs counterfactual reasoning: it samples diverse reward-weight hypotheses from the current belief, generates a trajectory for each hypothesis (via RL or trajectory optimization), and selects the pair whose preference answer carries the most mutual information about the reward weights. This joint optimization over environments and trajectories lets CRED explore the feature space efficiently and converge to accurate reward estimates with fewer human queries.
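The inner-loop selection step can be sketched as follows, assuming linear rewards R(ξ) = wᵀφ(ξ) and a Bradley-Terry preference model over belief samples. This is an illustrative sketch with toy values, not the paper's implementation; all function names here are hypothetical.

```python
import math

def pref_prob(w, phi_a, phi_b):
    """Bradley-Terry probability that trajectory A is preferred over B,
    assuming a linear reward R = w . phi."""
    diff = sum(wi * (a - b) for wi, a, b in zip(w, phi_a, phi_b))
    return 1.0 / (1.0 + math.exp(-diff))

def entropy(p):
    """Binary entropy in nats, clamped for numerical safety."""
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def mutual_information(belief_samples, phi_a, phi_b):
    """I(answer; w) = H(E[p]) - E[H(p)], estimated over sampled weights:
    high when the hypotheses disagree about which trajectory is better."""
    ps = [pref_prob(w, phi_a, phi_b) for w in belief_samples]
    mean_p = sum(ps) / len(ps)
    return entropy(mean_p) - sum(entropy(p) for p in ps) / len(ps)

def select_query(belief_samples, traj_features):
    """Return the trajectory pair whose preference answer is most
    informative about the reward weights."""
    best_pair, best_mi = None, -1.0
    for i in range(len(traj_features)):
        for j in range(i + 1, len(traj_features)):
            mi = mutual_information(belief_samples, traj_features[i], traj_features[j])
            if mi > best_mi:
                best_pair, best_mi = (i, j), mi
    return best_pair, best_mi

# Two competing hypotheses that weight the two features oppositely:
# the most informative pair is the one where they disagree most.
samples = [[1.0, 0.0], [0.0, 1.0]]
features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(select_query(samples, features))  # selects pair (0, 1)
```

In CRED, each candidate trajectory is generated by optimizing one sampled hypothesis, so the candidate set itself reflects the current belief; the sketch above only covers the final pair-selection step.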
Figure: CRED's bilevel optimization framework. The outer loop (Bayesian optimization) proposes environment parameters θ. The inner loop (counterfactual reasoning) samples reward weights from the current belief, generates trajectories, and selects the most informative pair to present to the human.
Experimental Domains
Lunar Lander
Safely land a 2D lander on a designated pad. Features include vertical speed and horizontal position. Environment parameter: wind power (1–20). Policy trained with PPO.
Tabletop Manipulation
Deliver a coffee cup across a cluttered table in MuJoCo. Features: hovering frequency over objects. Environment parameter: object placements (compressed via VAE). Trajectories via CHOMP.
Navigation
Deliver food between locations over a street network in Webots. Features: path length and terrain proportions. Environment parameter: surface type assignments. Policy via value iteration.
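All three domains share the same underlying preference model: a reward that is linear in the domain's features (e.g., vertical speed, hovering frequency, path length, terrain proportions). A minimal sketch of that form, using hypothetical Navigation feature values rather than the study's actual data:

```python
def trajectory_reward(weights, features):
    """Linear reward R(xi) = w . phi(xi), where phi(xi) collects the
    per-domain features (e.g., path length, terrain proportions)."""
    return sum(w * f for w, f in zip(weights, features))

# Hypothetical features: [path_length, fraction_grass, fraction_asphalt]
w_user = [-0.5, 0.2, 0.3]    # dislikes long paths, mildly prefers asphalt
route_a = [10.0, 0.8, 0.2]   # shorter route, mostly grass
route_b = [14.0, 0.1, 0.9]   # longer route, mostly asphalt
print(trajectory_reward(w_user, route_a) > trajectory_reward(w_user, route_b))
```

Preference learning then reduces to inferring the weight vector `w_user` from pairwise comparisons between such trajectories.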
Simulation Results
We evaluate CRED against two baselines: RR (Random Rollouts with mutual information optimization) and MBP (Mean Belief Policy). We also conduct ablation studies isolating the effect of counterfactual reasoning (CR) and environment design (ED), and compare against domain randomization (DR) variants. CRED consistently achieves the highest reward correlation with fewer queries across all three domains.
Key findings: CRED converges to near-perfect accuracy (r = 0.998) in Lunar Lander within 4 iterations. In Tabletop, CRED reaches r = 0.979 by iteration 10. In Navigation, CRED achieves r = 0.906 after 16 iterations. Ablation studies confirm that both counterfactual reasoning and environment design are essential—removing either component significantly reduces performance. CRED also outperforms domain randomization variants, which lack principled environment selection.
User Study
We conducted an IRB-approved user study with 25 participants (ages 20–39) in a within-subjects design. Each participant answered 6 preference queries per task under three conditions: CRED, MBP, and RR-DR. We measured reward correlation with ground truth, NASA-TLX workload, and ease-of-choice ratings on a 7-point Likert scale.
Figure: User study results. (a) Tabletop accuracy; (b) Navigation accuracy; (c) ease of choice; (d) overall workload (NASA-TLX).
Key findings: CRED achieves a median reward correlation of 0.97 in Tabletop and 0.78 in Navigation, significantly outperforming both baselines. Participants reported the lowest mental workload with CRED (median NASA-TLX: 1.83) and the highest ease-of-choice ratings (median: 6.0 out of 7). Statistical significance was confirmed via Wilcoxon signed-rank tests with Holm–Bonferroni correction.
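The Holm–Bonferroni step-down correction used above can be sketched in pure Python. The p-values in the example are hypothetical placeholders (e.g., from per-pair Wilcoxon signed-rank tests), not the study's actual values:

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm-Bonferroni step-down correction: sort p-values ascending,
    compare the k-th smallest against alpha / (m - k), and stop
    rejecting at the first failure. Returns a reject flag per test."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for k, i in enumerate(order):
        if p_values[i] <= alpha / (m - k):
            reject[i] = True
        else:
            break  # all larger p-values fail as well
    return reject

# Hypothetical p-values for three pairwise comparisons
# (e.g., CRED vs. MBP, CRED vs. RR-DR, MBP vs. RR-DR):
print(holm_bonferroni([0.004, 0.020, 0.300]))  # -> [True, True, False]
```

The step-down thresholds make Holm–Bonferroni uniformly more powerful than plain Bonferroni while still controlling the family-wise error rate.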
Real-World Deployment
Reward functions learned through CRED are deployed on real robotic platforms. The learned preferences transfer from simulation to hardware, enabling robots to navigate and manipulate according to user-specified trade-offs.
Go1 navigating on grass
Go1 navigating on asphalt
Video Presentation
Paper
BibTeX
@inproceedings{tung2026cred,
title={CRED: Counterfactual Reasoning and Environment Design for Active Preference Learning},
author={Tung, Yi-Shiuan and Kumar, Gyanig and Jiang, Wei and Hayes, Bradley and Roncone, Alessandro},
booktitle={IEEE International Conference on Robotics and Automation (ICRA)},
year={2026}
}