Abstract
As a robot's operational environment and the tasks it must perform grow in complexity, explicitly specifying and balancing optimization objectives to achieve a preferred behavior profile becomes increasingly intractable. Such systems benefit greatly from aligning their behavior with human preferences and responding to corrections, but manually encoding this feedback is infeasible. Active preference learning (APL) learns human reward functions by presenting trajectories for ranking. However, existing methods sample from fixed trajectory sets or replay buffers, which limits query diversity and often fails to identify informative comparisons. We propose CRED, a novel trajectory generation method for APL that improves reward inference by jointly optimizing environment design and trajectory selection to efficiently query and extract preferences from users. CRED "imagines" new scenarios through environment design and leverages counterfactual reasoning (sampling possible rewards from its current belief and asking "What if this were the true preference?") to generate trajectory pairs that expose differences between competing reward functions. Comprehensive experiments and a user study show that CRED significantly outperforms state-of-the-art methods in reward accuracy and sample efficiency, and receives higher user ratings.
Method Overview
CRED formulates active preference learning as a bilevel optimization problem. The outer loop uses Bayesian optimization to select the environment parameters θ that yield the most informative preference queries. The inner loop performs counterfactual reasoning: it samples diverse reward-weight hypotheses from the current belief, generates a trajectory for each hypothesis (via RL or trajectory optimization), and selects the pair whose preference answer carries the most mutual information about the reward weights. This joint optimization over environments and trajectories lets CRED explore the feature space efficiently and converge to accurate reward estimates with fewer human queries.
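The inner-loop selection step can be sketched as follows, assuming linear rewards R(ξ) = wᵀφ(ξ) and a Bradley-Terry preference model over belief samples. This is an illustrative sketch with toy values, not the paper's implementation; all function names here are hypothetical.

```python
import math

def pref_prob(w, phi_a, phi_b):
    """Bradley-Terry probability that trajectory A is preferred over B,
    assuming a linear reward R = w . phi."""
    diff = sum(wi * (a - b) for wi, a, b in zip(w, phi_a, phi_b))
    return 1.0 / (1.0 + math.exp(-diff))

def entropy(p):
    """Binary entropy in nats, clamped for numerical safety."""
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def mutual_information(belief_samples, phi_a, phi_b):
    """I(answer; w) = H(E[p]) - E[H(p)], estimated over sampled weights:
    high when the hypotheses disagree about which trajectory is better."""
    ps = [pref_prob(w, phi_a, phi_b) for w in belief_samples]
    mean_p = sum(ps) / len(ps)
    return entropy(mean_p) - sum(entropy(p) for p in ps) / len(ps)

def select_query(belief_samples, traj_features):
    """Return the trajectory pair whose preference answer is most
    informative about the reward weights."""
    best_pair, best_mi = None, -1.0
    for i in range(len(traj_features)):
        for j in range(i + 1, len(traj_features)):
            mi = mutual_information(belief_samples, traj_features[i], traj_features[j])
            if mi > best_mi:
                best_pair, best_mi = (i, j), mi
    return best_pair, best_mi

# Two competing hypotheses that weight the two features oppositely:
# the most informative pair is the one where they disagree most.
samples = [[1.0, 0.0], [0.0, 1.0]]
features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(select_query(samples, features))  # selects pair (0, 1)
```

In CRED, each candidate trajectory is generated by optimizing one sampled hypothesis, so the candidate set itself reflects the current belief; the sketch above only covers the final pair-selection step.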
Figure: CRED's bilevel optimization framework. The outer loop (Bayesian optimization) proposes environment parameters θ. The inner loop (counterfactual reasoning) samples reward weights from the current belief, generates trajectories, and selects the most informative pair to present to the human.
Experimental Domains
Lunar Lander
Safely land a 2D lander on a designated pad. Features include vertical speed and horizontal position. Environment parameter: wind power (1–20). Policy trained with PPO.
Tabletop Manipulation
Deliver a coffee cup across a cluttered table in MuJoCo. Features: hovering frequency over objects. Environment parameter: object placements (compressed via VAE). Trajectories via CHOMP.
Navigation
Deliver food between locations over a street network in Webots. Features: path length and terrain proportions. Environment parameter: surface type assignments. Policy via value iteration.
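All three domains share the same underlying preference model: a reward that is linear in the domain's features (e.g., vertical speed, hovering frequency, path length, terrain proportions). A minimal sketch of that form, using hypothetical Navigation feature values rather than the study's actual data:

```python
def trajectory_reward(weights, features):
    """Linear reward R(xi) = w . phi(xi), where phi(xi) collects the
    per-domain features (e.g., path length, terrain proportions)."""
    return sum(w * f for w, f in zip(weights, features))

# Hypothetical features: [path_length, fraction_grass, fraction_asphalt]
w_user = [-0.5, 0.2, 0.3]    # dislikes long paths, mildly prefers asphalt
route_a = [10.0, 0.8, 0.2]   # shorter route, mostly grass
route_b = [14.0, 0.1, 0.9]   # longer route, mostly asphalt
print(trajectory_reward(w_user, route_a) > trajectory_reward(w_user, route_b))
```

Preference learning then reduces to inferring the weight vector `w_user` from pairwise comparisons between such trajectories.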
Simulation Results
We evaluate CRED against two baselines: RR (Random Rollouts with mutual information optimization) and MBP (Mean Belief Policy). We also conduct ablation studies isolating the effect of counterfactual reasoning (CR) and environment design (ED), and compare against domain randomization (DR) variants. CRED consistently achieves the highest reward correlation with fewer queries across all three domains.
Key findings: CRED converges to near-perfect accuracy (r = 0.998) in Lunar Lander within 4 iterations. In Tabletop, CRED reaches r = 0.979 by iteration 10. In Navigation, CRED achieves r = 0.906 after 16 iterations. Ablation studies confirm that both counterfactual reasoning and environment design are essential—removing either component significantly reduces performance. CRED also outperforms domain randomization variants, which lack principled environment selection.
User Study
We conducted an IRB-approved user study with 25 participants (ages 20–39) in a within-subjects design. Each participant answered 6 preference queries per task under three conditions: CRED, MBP, and RR-DR. We measured reward correlation with ground truth, NASA-TLX workload, and ease-of-choice ratings on a 7-point Likert scale.
Figure: User study results. (a) Tabletop accuracy; (b) Navigation accuracy; (c) ease of choice; (d) overall workload (NASA-TLX).
Key findings: CRED achieves a median reward correlation of 0.97 in Tabletop and 0.78 in Navigation, significantly outperforming both baselines. Participants reported the lowest mental workload with CRED (median NASA-TLX: 1.83) and the highest ease-of-choice ratings (median: 6.0 out of 7). Statistical significance was confirmed via Wilcoxon signed-rank tests with Holm–Bonferroni correction.
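The Holm–Bonferroni step-down correction used above can be sketched in pure Python. The p-values in the example are hypothetical placeholders (e.g., from per-pair Wilcoxon signed-rank tests), not the study's actual values:

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm-Bonferroni step-down correction: sort p-values ascending,
    compare the k-th smallest against alpha / (m - k), and stop
    rejecting at the first failure. Returns a reject flag per test."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for k, i in enumerate(order):
        if p_values[i] <= alpha / (m - k):
            reject[i] = True
        else:
            break  # all larger p-values fail as well
    return reject

# Hypothetical p-values for three pairwise comparisons
# (e.g., CRED vs. MBP, CRED vs. RR-DR, MBP vs. RR-DR):
print(holm_bonferroni([0.004, 0.020, 0.300]))  # -> [True, True, False]
```

The step-down thresholds make Holm–Bonferroni uniformly more powerful than plain Bonferroni while still controlling the family-wise error rate.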
Real-World Deployment
Reward functions learned through CRED are deployed on real robotic platforms. The learned preferences transfer from simulation to hardware, enabling robots to navigate and manipulate according to user-specified trade-offs.
Go1 navigating on grass
Go1 navigating on asphalt
Video Presentation
Paper
BibTeX
@inproceedings{tung2026cred,
title={CRED: Counterfactual Reasoning and Environment Design for Active Preference Learning},
author={Tung, Yi-Shiuan and Kumar, Gyanig and Jiang, Wei and Hayes, Bradley and Roncone, Alessandro},
booktitle={IEEE International Conference on Robotics and Automation (ICRA)},
year={2026}
}