<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.3.4">Jekyll</generator><link href="https://yi-shiuan-tung.github.io/feed.xml" rel="self" type="application/atom+xml"/><link href="https://yi-shiuan-tung.github.io/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-04-08T16:24:42+00:00</updated><id>https://yi-shiuan-tung.github.io/feed.xml</id><title type="html">blank</title><subtitle>Yi-Shiuan Tung&apos;s academic website. PhD student at CU Boulder researching human-robot interaction, environment design, and machine learning. </subtitle><entry><title type="html">Variational Inference for Latent Variable Models</title><link href="https://yi-shiuan-tung.github.io/blog/2025/vae/" rel="alternate" type="text/html" title="Variational Inference for Latent Variable Models"/><published>2025-09-15T00:00:00+00:00</published><updated>2025-09-15T00:00:00+00:00</updated><id>https://yi-shiuan-tung.github.io/blog/2025/vae</id><content type="html" xml:base="https://yi-shiuan-tung.github.io/blog/2025/vae/"><![CDATA[<p>This post goes through the derivation of Evidence Lower Bound (ELBO) and an intuitive explanation of how variational inference works for latent variable models. Much of the intuition presented here is inspired by <a href="https://youtu.be/UTMpM4orS30?si=BzBurdtiXm5orxEo">Sergey Levine’s lectures on variational inference</a> and insights from discussions with <a href="https://www.colorado.edu/cs/christoffer-heckman">Chris Heckman</a> during my area exam.</p> <h3 id="latent-variable-models">Latent Variable Models</h3> <p>Suppose we want to model a complex data distribution \(p(x)\) given a dataset \(D = \{x_1, x_2, \ldots, x_N\}\) where \(x\) might represent images, robot trajectories, or other high-dimensional data. 
Directly modeling \(p(x)\) is often intractable due to its complexity.</p> <p>Latent variable models address this challenge by introducing an auxiliary random variable \(z\) drawn from a simple distribution \(p(z)\), such as a Gaussian. Rather than modeling \(p(x)\) directly, we instead model how data is generated <em>conditioned</em> on the latent variable via \(p(x\vert z)\).</p> <p>The conditional distribution \(p(x \vert z)\) is chosen to be easy to sample from. A common choice is also Gaussian:</p> <p>\(\begin{equation} p(x \vert z) = \mathcal{N}(\mu(z), \sigma(z)) \end{equation}\),</p> <p>where the mean \(\mu(z)\) and standard deviation \(\sigma(z)\) are functions of \(z\) learned from data. The marginal distribution over observations is then obtained by integrating out the latent variable:</p> <p>\(\begin{equation} p(x) = \int p(x \vert z)p(z)dz \end{equation}\).</p> <center> <img src="/blog/assets/img/vae/intro.png" alt="Latent variable model illustration" width="280"/> </center> <p>Sampling a latent variable \(z\) selects a Gaussian distribution over the data space via \(p(x \vert z)\). By drawing different values of \(z\) and sampling from the corresponding conditionals, the model can represent complex data distributions.</p> <h3 id="how-do-we-train-the-model-p_thetax">How do we train the model \(p_{\theta}(x)\)?</h3> <p>We can use maximum likelihood to train the model \(p_{\theta}(x)\) where \(\theta\) are the parameters of the model. However, the integration over \(z\) is intractable.</p> <p>\(\begin{align} \mathcal{L}(\theta) &amp;= \sum_{i=1}^{N} \log p_{\theta}(x_i) \\ &amp;= \sum_{i=1}^N \log \int p(x_i\vert z)p(z)dz\\ \end{align}\).</p> <p><strong>Why is the integration over \(z\) intractable?</strong> The integration over \(z\) is intractable because \(p(x \vert z)\) depends nonlinearly on \(z\). 
In deep latent variable models, \(p(x \vert z)\) is typically parameterized as a Gaussian whose mean and variance are outputs of a neural network.</p> \[\begin{equation} p(x \vert z) = \mathcal{N}(\mu_{nn}(z), \sigma_{nn}(z)) \end{equation}\] <p>Because neural networks introduce nonlinear dependencies on \(z\), the integrand is no longer a Gaussian in \(z\), and the resulting integral has no closed-form solution.</p> <p>Closed-form marginalization is only possible in restricted settings. For example, if \(p(z) = \mathcal{N}(0, I)\) and \(p(x \vert z) = \mathcal{N}(Az+b, \Sigma)\), then the model is linear-Gaussian, and the marginal distribution is \(p(x) = \mathcal{N}(b, AA^T + \Sigma)\).</p> <p><strong>What if \(z\) is discrete?</strong> If \(z\) were discrete, the integral would become a sum, but computing gradients of the log-likelihood would still require evaluating or summing over all latent states, which quickly becomes infeasible in large or structured latent spaces.</p> <p><strong>Why can’t we sample \(z\) to approximate the integral and gradients?</strong> We could approximate the marginal likelihood using Monte Carlo sampling: \(\begin{equation} p(x) \approx \frac{1}{M}\sum_{i=1}^M p(x \vert z_i), \quad z_i \sim p(z) \end{equation}\)</p> <p>However, maximum likelihood requires gradients of \(\text{log}p(x)\) which depend on the posterior \(p(z \vert x)\). Sampling from the prior \(p(z)\) does not provide samples from the posterior. As a result, naive Monte Carlo sampling results in high-variance estimates, leading to unstable and impractical learning.
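To see the variance problem concretely, here is a small numpy sketch of the naive estimator (a toy one-dimensional model; the linear decoder \(2z + 1\) and all constants are made up, and chosen linear only so the true marginal is known in closed form):

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2) evaluated at x.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def naive_marginal_estimate(x, M):
    # p(x) ~= (1/M) * sum_i p(x | z_i), with z_i drawn from the prior p(z) = N(0, 1).
    z = rng.standard_normal(M)
    # Toy "decoder": p(x | z) = N(2z + 1, 0.1^2).
    return gaussian_pdf(x, 2.0 * z + 1.0, 0.1).mean()

# For this linear-Gaussian model the exact marginal is N(1, 2^2 + 0.1^2).
x = 1.0
true_px = gaussian_pdf(x, 1.0, np.sqrt(4.0 + 0.01))
estimates = np.array([naive_marginal_estimate(x, M=100) for _ in range(200)])
print(true_px, estimates.mean(), estimates.std())  # estimates scatter widely around the truth
```

Only the few prior samples that happen to land near the narrow high-likelihood region contribute to each estimate, which is why the spread across runs is large even though the estimator is unbiased.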
Below is the derivation for why \(\nabla_{\theta} \text{log}p_{\theta}(x)\) depends on \(p(z \vert x)\).</p> \[\nabla_{\theta} \text{log}p_{\theta}(x) = \frac{1}{p_{\theta}(x)} \nabla_{\theta} p_{\theta}(x) = \frac{1}{p_{\theta}(x)} \int \nabla_{\theta} p_{\theta}(x \vert z)p(z)dz\] <p>Now, if we take the derivative of \(\text{log}p_{\theta}(x \vert z)\) with respect to \(\theta\), we get \(\nabla_{\theta} \text{log}p_{\theta}(x \vert z) = \frac{1}{p_{\theta}(x \vert z)} \nabla_{\theta} p_{\theta}(x \vert z)\). Rearranging gives \(\nabla_{\theta}p_{\theta}(x \vert z) = p_{\theta}(x \vert z) \nabla_{\theta} \text{log}p_{\theta}(x \vert z)\). Substituting this into the gradient of \(\text{log}p_{\theta}(x)\), we get</p> \[\begin{equation} \nabla_{\theta} \text{log}p_{\theta}(x) = \frac{1}{p_{\theta}(x)} \int p_{\theta}(x \vert z) p(z) \nabla_{\theta} \text{log}p_{\theta}(x \vert z)dz \end{equation}\] <p>The term \(\frac{p_{\theta}(x \vert z)p(z)}{p_{\theta}(x)}\) is exactly the posterior \(p_{\theta}(z \vert x)\). The gradient becomes</p> \[\begin{align} \nabla_{\theta} \text{log}p_{\theta}(x) &amp;= \int p_{\theta}(z \vert x) \nabla_{\theta} \text{log}p_{\theta}(x \vert z)dz\\ &amp;= \mathbb{E}_{p_{\theta}(z \vert x)}\left[\nabla_{\theta} \text{log}p_{\theta}(x \vert z)\right] \end{align}\] <h3 id="variational-approximation">Variational Approximation</h3> <p>This motivates the use of variational inference, which replaces the intractable posterior \(p(z \vert x)\) with a tractable approximation \(q_{\phi}(z \vert x)\) and yields a differentiable lower bound on the log-likelihood. Here we go through the derivation of the ELBO (Evidence Lower Bound).
Starting with the log-likelihood, which we want to maximize with respect to \(\theta\), we have</p> \[\begin{equation} \text{log}p_{\theta}(x) = \text{log} \int p_{\theta}(x \vert z)p(z)dz \end{equation}\] <p>We introduce a variational distribution \(q_{\phi}(z \vert x)\), which is parameterized by \(\phi\) and approximates the posterior \(p_{\theta}(z \vert x)\). The key idea is to rewrite the marginal likelihood in a way that allows us to take expectations with respect to \(q_{\phi}(z \vert x)\), which we can sample from. We multiply the integrand by \(\frac{q_{\phi}(z \vert x)}{q_{\phi}(z \vert x)} = 1\), which leaves the integral unchanged, to get</p> \[\begin{align} \text{log}p_{\theta}(x) &amp;= \text{log} \int p_{\theta}(x \vert z)p(z)dz\\ &amp;= \text{log} \int p_{\theta}(x \vert z)q_{\phi}(z \vert x)\frac{p(z)}{q_{\phi}(z \vert x)}dz\\ &amp;= \text{log} E_{z \sim q_{\phi}(z \vert x)}\left[\frac{p_{\theta}(x \vert z)p(z)}{q_{\phi}(z \vert x)}\right] \end{align}\] <p>The logarithm is a concave function, so we can apply Jensen’s inequality to get</p> \[\begin{align} \text{log}p_{\theta}(x) &amp;= \text{log} E_{z \sim q_{\phi}(z \vert x)}\left[\frac{p_{\theta}(x \vert z)p(z)}{q_{\phi}(z \vert x)}\right]\\ &amp;\geq E_{z \sim q_{\phi}(z \vert x)}\left[\text{log}\frac{p_{\theta}(x \vert z)p(z)}{q_{\phi}(z \vert x)}\right]\\ &amp;= \mathbb{E}_{z \sim q_{\phi}(z \vert x)}\left[\text{log}p_{\theta}(x \vert z) + \text{log}p(z) - \text{log}q_{\phi}(z \vert x)\right]\\ \end{align}\] <p>The right hand side is the ELBO, which is a lower bound on the log-likelihood.
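This bound can be sanity-checked numerically on a tiny discrete model (the probability tables below are made up for illustration):

```python
import numpy as np

# Toy model: binary latent z and binary observation x (tables are made up).
p_z = np.array([0.7, 0.3])                    # p(z)
p_x_given_z = np.array([[0.9, 0.1],           # p(x | z=0)
                        [0.2, 0.8]])          # p(x | z=1)

x = 1                                         # the observed value
log_px = np.log(np.sum(p_x_given_z[:, x] * p_z))   # exact log-marginal

def elbo(q):
    # E_q[log p(x|z) + log p(z) - log q(z|x)] for a variational q over z.
    return np.sum(q * (np.log(p_x_given_z[:, x]) + np.log(p_z) - np.log(q)))

q_arbitrary = np.array([0.5, 0.5])
posterior = p_x_given_z[:, x] * p_z
posterior = posterior / posterior.sum()       # exact p(z | x)

print(log_px, elbo(q_arbitrary), elbo(posterior))
```

For any valid \(q\) the ELBO stays below \(\text{log}p_{\theta}(x)\), and the bound is tight exactly when \(q\) equals the true posterior \(p(z \vert x)\), which is another way to see what a good variational distribution should look like.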
It is a differentiable function of \(\theta\) and \(\phi\), and can be used to train the model.</p> <p><strong>What makes a good \(q_{\phi}(z \vert x)\)?</strong></p> <p>The intuition is that \(q_{\phi}(z \vert x)\) should be close to \(p(z)\), and we can use KL-divergence to measure the difference between the two distributions.</p> \[\begin{align} D_{KL}(q_{\phi}(z \vert x) \vert \vert p(z)) &amp;= \mathbb{E}_{z \sim q_{\phi}(z \vert x)}\left[\text{log}\frac{q_{\phi}(z \vert x)}{p(z)}\right]\\ &amp;= \mathbb{E}_{z \sim q_{\phi}(z \vert x)}\left[\text{log}q_{\phi}(z \vert x) - \text{log}p(z)\right] \end{align}\] <p>So we can rewrite the ELBO as</p> \[\begin{align} \text{ELBO}(\theta, \phi) &amp;= \mathbb{E}_{z \sim q_{\phi}(z \vert x)}\left[\text{log}p_{\theta}(x \vert z) + \text{log}p(z) - \text{log}q_{\phi}(z \vert x)\right]\\ &amp;= \mathbb{E}_{z \sim q_{\phi}(z \vert x)}\left[\text{log}p_{\theta}(x \vert z)\right] - D_{KL}(q_{\phi}(z \vert x) \vert \vert p(z)) \end{align}\] <p>To maximize \(p_{\theta}(x)\), we can maximize the ELBO with respect to \(\theta\) and \(\phi\). Intuitively, the ELBO balances two objectives: the reconstruction term \(\text{log}p_{\theta}(x \vert z)\) and the KL-divergence \(D_{KL}(q_{\phi}(z \vert x) \vert \vert p(z))\). The reconstruction term encourages \(q_{\phi}(z \vert x)\) to place mass on latent variables that reconstruct \(x\) well, while the KL-divergence encourages \(q_{\phi}(z \vert x)\) to be close to \(p(z)\).</p> ]]></content><author><name></name></author><summary type="html"><![CDATA[This post goes through the derivation of the Evidence Lower Bound (ELBO) and an intuitive explanation of how variational inference works for latent variable models.
Much of the intuition presented here is inspired by Sergey Levine’s lectures on variational inference and insights from discussions with Chris Heckman during my area exam.]]></summary></entry><entry><title type="html">Counterfactual Reasoning and Environment Design for Active Preference Learning</title><link href="https://yi-shiuan-tung.github.io/blog/2025/cred/" rel="alternate" type="text/html" title="Counterfactual Reasoning and Environment Design for Active Preference Learning"/><published>2025-06-26T00:00:00+00:00</published><updated>2025-06-26T00:00:00+00:00</updated><id>https://yi-shiuan-tung.github.io/blog/2025/cred</id><content type="html" xml:base="https://yi-shiuan-tung.github.io/blog/2025/cred/"><![CDATA[<p><strong>Yi-Shiuan Tung</strong>, <strong>Bradley Hayes</strong>, and <strong>Alessandro Roncone</strong><br/> University of Colorado Boulder<br/> <a href="https://hitl-robot-learning.github.io/pdfs/cred.pdf">Workshop Paper PDF</a></p> <hr/> <h2 id="overview">Overview</h2> <center> <img src="/blog/assets/img/pref_learning/husky_question.png" alt="Delivery robot has multiple options for routes to take" width="310"/> </center> <p><br/></p> <p>Robots deployed in the real world must align their behaviors with human preferences—whether balancing speed and safety in delivery tasks or adapting routes based on distance, time, and terrain. But those preferences are hard to predefine and differ across users.</p> <p><strong>Active Preference Learning (APL)</strong> helps robots learn these preferences by asking users to compare and rank trajectories. To improve sample efficiency, we present the human with trajectory pairs that maximize information gain [1]. 
The objective maximizes the expected reduction in the entropy of the reward distribution \(H(\mathbf{w})\) after receiving human input \(I\).</p> <p>\(\begin{equation} \max_{\xi_A, \xi_B} f(\xi_A, \xi_B) = \max_{\xi_A, \xi_B} H(\mathbf{w}) - \mathbb{E}_{I}[H(\mathbf{w} | I)] \end{equation}\).</p> <p>To make the optimization tractable, prior work uses a pre-generated set of trajectories from random rollouts [1] or rollouts from a replay buffer [2] to find the trajectory pair that optimizes information gain. However, this is <strong>sample inefficient</strong> for long-horizon tasks because the number of possible trajectories grows exponentially with the horizon. For robot routing, we also have to query the human for preferences in different environments or scenarios to enable <strong>generalization</strong>. Therefore, we include the environment parameters as optimization variables.</p> <p>We propose <strong>CRED</strong>, a method that improves preference learning by:</p> <ul> <li>Using <strong>Counterfactual Reasoning</strong> to generate queries with trajectories that represent different preferences.</li> <li>Performing <strong>Environment Design</strong> to create “imagined” environments that better elicit informative preferences.</li> </ul> <p>CRED significantly improves sample efficiency and generalization across both simulated GridWorld and OpenStreetMap navigation.</p> <hr/> <h2 id="method">Method</h2> <h3 id="1-counterfactual-reasoning">1. Counterfactual Reasoning</h3> <p>When asking humans for preferences, we hypothesize that the trajectories should represent different preferences. To do that, CRED samples potential human reward functions from the current Bayesian belief over weights \(\mathbf{w}\), and generates trajectories that would be optimal if those weights were true. It then evaluates pairs of these counterfactual trajectories to find the most informative preference queries—those that maximize <a href="#overview">Eq.
1</a>.</p> <center> <img src="/blog/assets/img/pref_learning/cr.png" alt="Counterfactual reasoning samples rewards from current belief to generate trajectories that resemble different human preferences." width="600"/> </center> <p><br/></p> <h3 id="2-environment-design">2. Environment Design</h3> <p>The environment affects which preferences can be expressed. For example, distinguishing between preferences for “paved vs. gravel” requires an environment with both terrains.</p> <p>CRED uses <strong>Bayesian Optimization</strong> to find environment configurations that maximize the informativeness of queries. In practice, this means modifying terrain layouts or edge attributes (e.g., road slope or elevation) to elicit more useful feedback. Bayesian optimization uses a Gaussian process to guide its search, reducing the number of evaluations of <a href="#overview">Eq. 1</a>. Here, \(F\) is Eq. 1 but includes environment parameters \(\theta_E\) as optimization variables.</p> <center> <img src="/blog/assets/img/pref_learning/env_design.png" alt="Bayesian optimization finds an environment to query the human." width="800"/> </center> <p><br/></p> <hr/> <h2 id="experiments">Experiments</h2> <p>We evaluate CRED in two domains:</p> <h3 id="gridworld-navigation">GridWorld Navigation</h3> <p>A 15×15 terrain-based environment with brick (red), gravel (grey), sand (yellow), grass (green), and paved (white). The goal is to move from the top left corner to the bottom right corner. 
For environment design, we first compress the 15×15 grid into a 5-dimensional vector using a variational autoencoder.</p> <center> <img src="/blog/assets/img/pref_learning/gridworld_1.png" alt="GridWorld 1" width="150"/> <img src="/blog/assets/img/pref_learning/gridworld_2.png" alt="GridWorld 2" width="150"/> <img src="/blog/assets/img/pref_learning/gridworld_3.png" alt="GridWorld 3" width="150"/> <img src="/blog/assets/img/pref_learning/gridworld_4.png" alt="GridWorld 4" width="150"/> <img src="/blog/assets/img/pref_learning/gridworld_5.png" alt="GridWorld 5" width="150"/> </center> <h3 id="openstreetmap-routing">OpenStreetMap Routing</h3> <p>We use OpenStreetMap to extract nodes representing intersections and edges representing streets. We evaluate the algorithm’s ability to learn preferences for distance, time, and elevation (+/-). For environment design, we modify edge attributes such as traversal time (i.e. traffic) and elevation and evaluate generalization to new street networks.</p> <center> <img src="/blog/assets/img/pref_learning/StreetNav_boulder.png" alt="Boulder" width="150"/> <img src="/blog/assets/img/pref_learning/StreetNav_east_boulder.png" alt="East Boulder" width="150"/> <img src="/blog/assets/img/pref_learning/StreetNav_south_boulder.png" alt="South Boulder" width="150"/> </center> <hr/> <h2 id="baselines">Baselines</h2> <p>We compare CRED to:</p> <ul> <li><strong>RR (Random Rollouts):</strong> Pre-generated set of trajectories from random rollouts [1].</li> <li><strong>MBP (Mean Belief Policy):</strong> Uses the mean of the belief over rewards as the best guess and performs rollouts with a policy trained on that reward [3].</li> <li><strong>CR (Counterfactual Reasoning only):</strong> Ablation of CRED without environment design.</li> <li><strong>MBP + ED:</strong> Mean Belief Policy combined with environment design.</li> </ul> <hr/> <h2 id="key-results">Key Results</h2> <h3 id="-qualitative-results">👀 Qualitative Results</h3> <center> <video
width="250" controls=""> <source src="/blog/assets/img/pref_learning/GridWorld-v0_3_query_5.mp4" type="video/mp4"/> Your browser does not support the video tag. <figcaption>Mean Belief Policy [3]</figcaption> </video> <video width="250" controls=""> <source src="/blog/assets/img/pref_learning/GridWorld-v0_3_query_4.mp4" type="video/mp4"/> Your browser does not support the video tag. <figcaption>Counterfactual Reasoning</figcaption> </video> <video width="250" controls=""> <source src="/blog/assets/img/pref_learning/GridWorld-v0_3_query_0.mp4" type="video/mp4"/> Your browser does not support the video tag. <figcaption>Our Approach: CRED</figcaption> </video> </center> <p><br/></p> <p>While Mean Belief Policy [3] (left) can generate trajectories with different features, the trajectories are very similar. Counterfactual reasoning (middle) generates trajectories that better resemble different preferences. With environment design (right), we can query the human for feedback in different environments.</p> <h3 id="-higher-information-gain">🔍 Higher Information Gain</h3> <center> <img src="/blog/assets/img/pref_learning/GridWorld-v0_objective_values.png" alt="GridWorld Objective Values" width="230"/> <img src="/blog/assets/img/pref_learning/GridWorld-v0_entropy.png" alt="GridWorld Entropy" width="230"/> <img src="/blog/assets/img/pref_learning/SimpleStreetNav-v0_objective_values.png" alt="StreetNav Objective Values" width="230"/> <img src="/blog/assets/img/pref_learning/SimpleStreetNav-v0_entropy.png" alt="StreetNav Entropy" width="230"/> </center> <center> <img src="/blog/assets/img/pref_learning/legend.png" width="400"/> </center> <p>Left to right: GridWorld information gain, entropy of belief over reward weights, OpenStreetMap information gain, and entropy of belief over reward weights across different iterations of querying the human for feedback. 
CRED generates more informative preference queries early on, resulting in lower entropy of the belief over rewards.</p> <h3 id="-higher-generalization">✅ Higher Generalization</h3> <p>CRED-trained policies perform better in unseen environments, demonstrating faster convergence, higher rewards, and higher policy accuracy.</p> <center> <img src="/blog/assets/img/pref_learning/pref_results.png" alt="Results" width="1000"/> </center> <hr/> <h2 id="conclusion">Conclusion</h2> <p>We introduce CRED, an active preference learning method that improves the sample efficiency and generalization of learned reward functions. Counterfactual reasoning generates queries with trajectories that better resemble different reward functions. Environment design lets us jointly optimize the environment and the query, making it possible to query the human in different environments.</p> <hr/> <h2 id="acknowledgments">Acknowledgments</h2> <p>Thanks to Dusty Woods for help with visualizations and figure editing.</p> <hr/> <h2 id="references">References</h2> <p>[1] Biyik, E., Palan, M., Landolfi, N. C., Losey, D. P., &amp; Sadigh, D. (2020). <em>Asking easy questions: A user-friendly approach to active reward learning</em>. CoRL.</p> <p>[2] Lee, K., Smith, L. M., &amp; Abbeel, P. (2021). <em>PEBBLE: Feedback-efficient interactive reinforcement learning via relabeling experience and unsupervised pre-training</em>. ICML.</p> <p>[3] Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., &amp; Amodei, D. (2017). <em>Deep reinforcement learning from human preferences</em>.
NeurIPS.</p> <hr/> <h2 id="contact">Contact</h2> <p>Questions or collaboration ideas?<br/> 📧 yi-shiuan.tung@colorado.edu</p>]]></content><author><name></name></author><category term="publications"/><summary type="html"><![CDATA[Blog post for RSS'25 Human-in-the-Loop Robot Learning Workshop]]></summary></entry><entry><title type="html">Workspace Optimization Techniques to Improve Human Motion Prediction</title><link href="https://yi-shiuan-tung.github.io/blog/2024/workspace-optimization/" rel="alternate" type="text/html" title="Workspace Optimization Techniques to Improve Human Motion Prediction"/><published>2024-01-12T00:00:00+00:00</published><updated>2024-01-12T00:00:00+00:00</updated><id>https://yi-shiuan-tung.github.io/blog/2024/workspace-optimization</id><content type="html" xml:base="https://yi-shiuan-tung.github.io/blog/2024/workspace-optimization/"><![CDATA[<p align="center"> <img width="500" src="/assets/img/hri2024/intro.png"/> <p align="center"> Figure 1. </p> </p> <p align="justified"> Suppose that you are picking up the blue square cube shown in Figure 1 (left). The natural path (solid) makes it hard for the robot to predict whether you are picking up the blue square cube or the red triangle cube while the legible path (dotted) requires you to take a circuitous route. To improve a robot's prediction of a human teammate's goals during a collaborative task shown in Figure 1 (right), the robot can configure the workspace by rearranging objects and projecting "virtual obstacles" in augmented reality (cyan and red barriers), in order to induce naturally legible paths from the human. </p> <p align="justified"> In human-robot collaboration, the robot needs to predict human motion in order to coordinate its actions with those of the human. Current algorithms rely on the human motion model to achieve safe interactions, but human motion is inherently highly variable as humans can always move unexpectedly. 
Our work takes a different approach and addresses a fundamental challenge faced by all human motion prediction models; we reduce the uncertainty inherent in modeling the intentions of human collaborators by pushing them towards legible behavior via environment design. Our work improves human motion model predictions by increasing environmental structure to reduce uncertainties, facilitating more fluent human-robot interactions. </p> <h2> Methods </h2> <p align="center"> <img width="1000" src="/assets/img/hri2024/system_diagram.png"/> <p align="center"> Figure 2. </p> </p> <h3> Quality Diversity Search </h3> <p>We use a quality diversity (QD) algorithm called MAP-Elites to search through the space of environment configurations (i.e. object positions and virtual obstacle placements). MAP-Elites keeps track of a behavior performance map (also known as the solution map) that stores the best performing solution found for each combination of features chosen by the designer. For example, in the Overcooked game, we use two features 1) Number of Obstacles and 2) Ordering of Ingredient Placements. The environment shown in the middle of Figure 2 has 3 obstacles and ingredient ordering onions-fish-dish-tomatoes-cabbage. The solution map is a matrix where the first dimension is the number of obstacles and the second dimension includes the possible ingredient orderings. Note that the solution map can have more than 2 dimensions.</p> <p>MAP-Elites consists of two phases: 1) initialization phase where environments are randomly generated and placed into the solution map according to their features and 2) improvement phase where environments are randomly sampled from the map and mutated. While MAP-Elites generates diverse solutions by altering existing ones through random mutations, the process may require substantial computational time to yield high quality solutions. 
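A minimal sketch of these two MAP-Elites phases, with toy stand-ins for the designer-supplied environment generator, mutation operator, objective, and feature functions (all names and numbers here are hypothetical placeholders, not the paper's implementation):

```python
import random

random.seed(0)

# Toy stand-ins for the designer-supplied pieces: an "environment" is just a
# list of four numbers in [0, 1].
def random_env():
    return [random.uniform(0, 1) for _ in range(4)]

def mutate(env):
    # Random mutation: jitter each value with Gaussian noise, then clamp.
    return [min(1.0, max(0.0, v + random.gauss(0, 0.1))) for v in env]

def objective(env):
    # Toy quality measure (higher is better).
    return -sum((v - 0.5) ** 2 for v in env)

def features(env):
    # Discretized behavior descriptors that index a cell of the solution map.
    return (min(int(env[0] * 3), 2), min(int(env[1] * 3), 2))

solution_map = {}  # feature cell -> (score, environment)

def maybe_insert(env):
    cell, score = features(env), objective(env)
    if cell not in solution_map or score > solution_map[cell][0]:
        solution_map[cell] = (score, env)

# Phase 1: initialization with randomly generated environments.
for _ in range(100):
    maybe_insert(random_env())

# Phase 2: improvement by mutating elites sampled from the map.
for _ in range(1000):
    _, parent = random.choice(list(solution_map.values()))
    maybe_insert(mutate(parent))
```

Each cell of the map keeps only the best environment found so far with that feature combination, so the loop simultaneously improves quality within cells and fills out diversity across cells.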
Differentiable QD is a method that performs gradient descent on the objective function and the features to speed up MAP-Elites but requires both the objective and feature functions to be differentiable. We empirically approximate the gradient of the objective function through stochastic sampling, which may lead to “suboptimal” solutions that can, however, contribute to increased diversity. In our implementation, we continue this stochastic hill climbing until a local maximum is found. In Overcooked, we sample new locations for ingredients and additions/removals of virtual obstacles as possible mutations. The solution map is updated if the new environment generated from the mutation step is better than the existing solution in the solution map with the same features.</p> <p align="center"> <img width="300" src="/assets/img/hri2024/end_config.png"/> <img width="300" src="/assets/img/hri2024/improvement.gif"/> <p align="center"> Figure 3. </p> </p> <p>The tabletop task in Figure 1 requires the human and the robot to collaboratively place cubes into a desired configuration shown in Figure 3 (left). To mutate an existing environment, we sample new locations for the cubes based on a Gaussian with variance = 7cm. We also sample the locations and orientations of fixed-size virtual obstacles. An example of the stochastic hill climbing in the improvement phase of MAP-Elites is shown in Figure 3 (right).</p> <h3> Legibility Objective Function </h3> <p>The objective function considers all the possible goals the human might be reaching for at a given stage of task execution and maximizes the probability of correctly predicting the human’s chosen goal.
The probability of the human’s goal is given by the equation below:</p> \[\Pr(G | \mathcal{\xi}_{S \rightarrow Q}) \propto \frac{exp(-C(\mathcal{\xi}_{S \rightarrow Q}) - C(\mathcal{\xi}^*_{Q \rightarrow G}))}{exp(-C(\mathcal{\xi}^*_{S \rightarrow G}))}\] <p>The optimal human trajectory from point \(X\) to point \(Y\) with respect to cost function \(C\) is denoted by \(\mathcal{\xi}^*_{X \rightarrow Y}\). This equation evaluates how cost efficient (with respect to \(C\)) going to goal \(G\) is from start state \(S\) given the observed partial trajectory \(\mathcal{\xi}_{S \rightarrow Q}\) relative to the most efficient trajectory \(\mathcal{\xi}^*_{S \rightarrow G}\). For a given ground truth goal \(G_{true}\), if the predicted goal is not \(G_{true}\), we penalize by a constant \(c\) multiplied by the length of the observed trajectory \(\vert \mathcal{\xi}_{S \rightarrow Q} \vert\). If the predicted goal is correct, we encourage more confident predictions by maximizing the difference between the probability of the correct goal and the second highest goal probability. This is summarized in the equation below.</p> \[\text{EnvLegibility}(G_{true}) = \begin{cases} -c |\mathcal{\xi}_{S \rightarrow Q}|, \text{ if } \underset{G \in \mathcal{G}}{\arg\max} \Pr(G | \mathcal{\xi}_{S \rightarrow Q}) \neq G_{true} \\ margin(\mathcal{G}|\mathcal{\xi}_{S \rightarrow Q}) = \Pr(G_{(n)} | \mathcal{\xi}_{S \rightarrow Q}) - \Pr(G_{(n-1)} | \mathcal{\xi}_{S \rightarrow Q}), \text{ otherwise} \end{cases}\] <p>We compute EnvLegibility for each possible ground truth goal at each stage of the task execution. The function \(permutations(T)\) returns all the different ways a task \(T\) can be performed.
\(\mathcal{G}\) is the set of valid goals the human can reach for when performing subtask \(t\) with task ordering \(T'\).</p> \[\text{objective function} = \sum_{T' \in \text{permutations}(T)} \mathbb{1}\{\text{valid}(T')\} \times \sum_{t \in T'} \sum_{G \in \mathcal{G}} \text{EnvLegibility}(G)\] <h2> Experiments and Results </h2> <div align="center"> <iframe width="840" height="472.5" src="https://www.youtube.com/embed/CEl5-aQ29pk?si=n41TGVbaGbowm2lT" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen=""></iframe> </div> <p><br/></p> <p align="center"> <img width="200" src="/assets/img/hri2024/gaussian_A.png"/> <img width="200" src="/assets/img/hri2024/gaussian_B.png"/> <img width="200" src="/assets/img/hri2024/gaussian_D.png"/> <img width="200" src="/assets/img/hri2024/gaussian_C.png"/> <p align="justified"> Figure 4. Top down view of the workspace plotting the mean and covariance of the time series multivariate Gaussian for each condition. From left to right, the conditions are Baseline, Placement Optimized, Virtual Obstacle Optimized, and our approach Both Optimized. The model in the Both Optimized condition has less covariance compared to the models trained in the other environment configurations. </p> </p> <h2> Discussion </h2> <p>In this work, we introduce an algorithmic approach for autonomous workspace optimization to improve robot predictions of a human collaborator’s goals. We envision that our framework can improve human robot teaming, by improving goal prediction and situational awareness, for domains such as shared autonomy for assistive manipulation, warehouse stocking, cooking assistance, among others. 
Our approach is applicable for domains where the following conditions hold: 1) Multiple agents share the same physical space and the agents do not have access to other agents’ controllers or decision making processes (otherwise a centralized controller can be used), 2) the environment allows physical or virtual configurations, and 3) environment configuration can be performed prior to the interaction. Through dual experiments in 2D navigation (see paper) and tabletop manipulation, we show that our approach results in more accurate model predictions across two distinct goal inference methods, requiring less data to achieve these correct predictions. Importantly, we demonstrate that environmental adaptations can be discovered and leveraged to compensate for shortfalls of prediction models in otherwise unstructured settings.</p>]]></content><author><name></name></author><category term="publications"/><summary type="html"><![CDATA[Blog post for HRI'24 paper]]></summary></entry><entry><title type="html">Minimizing Entropy for Classification Problems</title><link href="https://yi-shiuan-tung.github.io/blog/2023/min-entropy/" rel="alternate" type="text/html" title="Minimizing Entropy for Classification Problems"/><published>2023-12-15T00:00:00+00:00</published><updated>2023-12-15T00:00:00+00:00</updated><id>https://yi-shiuan-tung.github.io/blog/2023/min-entropy</id><content type="html" xml:base="https://yi-shiuan-tung.github.io/blog/2023/min-entropy/"><![CDATA[<p>When we want to be more confident about our predictions for a classification problem, we often use an objective that minimizes the entropy. But why is this not enough? This post will discuss entropy and cross entropy losses.</p> <h2 id="what-is-entropy">What is Entropy?</h2> <p>Entropy (in information theory) is a measure of uncertainty; the higher the entropy, the more uncertain you are. 
Entropy is defined as</p> \[H(X) = - \sum_{x \in \mathcal{X}} p(x)\text{log} p(x)\] <p>where \(X\) is the discrete random variable that takes values in the alphabet \(\mathcal{X}\) and is distributed according to \(p: \mathcal{X} \rightarrow [0, 1]\). \(-\text{log}p(x)\) is the information of an event \(x\). So entropy \(H\) is the sum of the information for each possible event \(x \in \mathcal{X}\) weighted by the probability of the event \(p(x)\). Rare events (low probability) give more information and have higher values. Another way to think of it is that the entropy of a probability distribution is the optimal number of bits (when using log base 2) required to encode the distribution. When \(p(x)\) is high, we use fewer bits to represent the event \(x\) because we see it more often and it is cheaper to use fewer bits. When \(p(x)\) is low, we use more bits. This is given by the information of the event \(-\text{log}p(x)\).</p> <h2 id="classification-problems">Classification problems</h2> <p>In the context of human goal prediction, we want to train a model that outputs the correct human goal \(x\) given that the model observed some initial human trajectory \(\xi_{S \rightarrow Q}\) that started at point \(S\) and ended at point \(Q\). A human goal can be an object they are reaching towards or some task that they are performing. Suppose the human can reach towards the apple, banana, or grapes (\(\mathcal{X} = \{\text{apple}, \text{banana}, \text{grapes}\}\)), and we have a model \(f\) that outputs a distribution over the likelihood of goals (via neural network with softmax output or a Bayesian classifier). 
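To make this setup concrete, here is a quick numerical sketch of the model's output and its entropy (the goal probabilities are illustrative; entropy is computed in nats):

```python
import numpy as np

goals = ["apple", "banana", "grapes"]

def entropy(p):
    # H(X) = -sum_x p(x) log p(x), computed in nats.
    p = np.asarray(p)
    return float(-np.sum(p * np.log(p)))

# Two illustrative output distributions over the goals.
confident = [0.55, 0.25, 0.20]
ambiguous = [0.45, 0.44, 0.11]

for p in (confident, ambiguous):
    predicted = goals[int(np.argmax(p))]
    print(predicted, round(entropy(p), 3))
# Both distributions predict "apple", yet the less confident one has lower entropy.
```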
We can get the predicted goal by taking the argmax of the distribution \(\hat{x} = \text{argmax}_{x} f(\xi)\).</p> <center> <img src="/blog/assets/img/reaching_example.png" alt="Reaching Example" width="310"/> </center> <p><a href="https://www.flaticon.com/free-icons/grape" title="grape icons">Grape icons created by Dreamcreateicons - Flaticon</a></p> <p>To train our model to be more certain about its predictions, we can minimize the entropy of the output distribution during training. We can use the following loss function: given a predicted label \(\hat{x}\) and the true label \(x\), \(\mathcal{L}(x, \hat{x}) = \mathbb{1}\{x = \hat{x}\}H(f(\xi))+\mathbb{1}\{x \neq \hat{x}\}c\) for some constant \(c\). This equation penalizes the prediction by \(c\) if the prediction is incorrect and by the entropy if it is correct. \(\mathbb{1}\{q\}\) is the indicator function and evaluates to 1 if \(q\) is true and 0 otherwise. However, for predictions that are correct, minimizing the entropy may not give you more confident correct predictions. Suppose that the model has the following two predictions for the figure above where the human is reaching for the apple: 1) [0.55, 0.25, 0.2] and 2) [0.45, 0.44, 0.11]. The arrays correspond to the goal distribution over apple, banana, and grapes respectively. Intuitively, we prefer the first array because the model is more confident (\(55 \%\)) about the prediction. However, the entropy (using the natural log) for 1) is 0.998 and for 2) is 0.963. Minimizing the entropy will move the model outputs closer to 2) [0.45, 0.44, 0.11].</p> <center> <img src="/blog/assets/img/dist1.png" alt="Distribution 1" width="300"/> <img src="/blog/assets/img/dist2.png" alt="Distribution 2" width="300"/> <figcaption style="text-align:justify"> Two possible goal probability distributions. The model is more confident about its prediction on the left, but the entropy is smaller for the distribution on the right.
If our objective is to minimize the entropy for correct predictions, we could be pushing the model's output closer to the right distribution.</figcaption> </center> <p><br/></p> <h2 id="connection-to-cross-entropy">Connection to Cross Entropy</h2> <p>A common loss function for classification problems is the cross entropy loss. The cross entropy of distribution \(q\) relative to another distribution \(p\) is defined as</p> \[H(p, q) = - \sum_{x \in \mathcal{X}} p(x)\text{log}q(x)\] <p>Intuitively, it measures the average number of bits needed to encode samples from the true distribution \(p\) when using a code optimized for the distribution \(q\). We can rewrite \(H(p, q)\) as</p> \[\begin{align} H(p, q) &amp;= - \sum_{x \in \mathcal{X}} p(x)\text{log}q(x)\\ &amp;= -\sum_{x \in \mathcal{X}} p(x) \text{log}\left(\frac{q(x)}{p(x)} p(x)\right)\\ &amp;= -\sum_{x \in \mathcal{X}} p(x) \text{log}\frac{q(x)}{p(x)} - \sum_{x \in \mathcal{X}} p(x) \text{log}p(x)\\ &amp;= D_{KL}(p||q) + H(p) \end{align}\] <p>The first term is the Kullback-Leibler (KL) divergence which measures how different the distributions \(p\) and \(q\) are, and the second term is the entropy of \(p\). Unlike pure entropy minimization, the cross entropy loss minimizes the KL divergence between the true and predicted distributions, which resolves the issue above.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[When we want to be more confident about our predictions for a classification problem, we often use an objective that minimizes the entropy. But why is this not enough?
This post will discuss entropy and cross entropy losses.]]></summary></entry><entry><title type="html">Bilevel Optimization for Just-in-Time Robotic Kitting</title><link href="https://yi-shiuan-tung.github.io/blog/2022/robotic-kitting/" rel="alternate" type="text/html" title="Bilevel Optimization for Just-in-Time Robotic Kitting"/><published>2022-08-01T00:00:00+00:00</published><updated>2022-08-01T00:00:00+00:00</updated><id>https://yi-shiuan-tung.github.io/blog/2022/robotic-kitting</id><content type="html" xml:base="https://yi-shiuan-tung.github.io/blog/2022/robotic-kitting/"><![CDATA[<h2 id="overview">Overview</h2> <p><strong>Problem:</strong> Traditional kitting systems often use a few pre-defined kits that do not adapt in real time to variability such as part shortages or unexpected delays. This inflexibility can lead to inefficiencies, higher cognitive load for workers, and longer production times.</p> <p><strong>Solution:</strong> We propose a dynamic robotic kitting planner that segments and schedules assembly tasks to minimize idle time and reduce makespan. Using a bilevel optimization framework, the upper-level optimization determines the task segmentation to minimize idle time and the lower-level optimization designs the physical layout of parts on the kitting tray to ensure usability and logical grouping.</p> <h2 id="what-is-kitting">What is Kitting?</h2> <p>Kitting is the process of preparing and grouping the required components for assembly of a given product. Kitting is advantageous for assemblies involving numerous small components and products that support a wide range of customizations. For example, it is used in <a href="https://tulip.co/blog/the-kitting-process-for-manufacturers/">electronics and automobile manufacturing</a>.</p> <h2 id="problem-setup">Problem Setup</h2> <p>We have a set of tasks to finish represented by a Directed Acyclic Graph (DAG).
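As an illustration, such a precedence DAG can be stored as an adjacency list and ordered with a topological sort; this minimal Python sketch, including the task names, is hypothetical and not from the paper:

```python
from collections import defaultdict, deque

# Hypothetical precedence edges: a -> b means task a must finish before b.
edges = [("attach_leg", "attach_top"), ("attach_top", "secure_screws")]

def topological_order(edges):
    """Return one valid task ordering (Kahn's algorithm); raise on cycles."""
    succ = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for a, b in edges:
        succ[a].append(b)
        indegree[b] += 1
        nodes.update((a, b))
    ready = deque(n for n in sorted(nodes) if indegree[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for m in succ[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                ready.append(m)
    if len(order) != len(nodes):
        raise ValueError("precedence graph contains a cycle; not a DAG")
    return order

print(topological_order(edges))  # ['attach_leg', 'attach_top', 'secure_screws']
```

Any valid kit sequence must respect such an ordering, which is what makes the segmentation problem below well defined.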
An edge between two tasks <strong>a</strong> -&gt; <strong>b</strong> indicates that task <strong>a</strong> has to be finished before task <strong>b</strong> can begin. The tasks that the human can do depend on the assembly parts that the robot delivers on the kitting tray. The robot has estimates of how long it takes the human to perform each task and of the availability of parts. <strong>How does the robot determine what parts to place on the kitting tray at any given time in order to maximize team throughput?</strong></p> <h2 id="approach">Approach</h2> <h3 id="bilevel-optimization">Bilevel Optimization</h3> <p>The upper level problem optimizes an objective function that depends on the outcome of the lower level problem. In the upper level problem, the robot segments the task such that human idle time is minimized. The first segment is the kit that the robot delivers in the next time step. The “goodness” of the segmentation is influenced by the lower level problem of how well the assembly parts fit in the kit. For more details, please refer to the <a href="https://hiro-group.ronc.one/papers/2022_Tung_ROMAN_kitting.pdf">paper</a>.</p> <p align="center"> <img width="500" src="/assets/img/roman2022/bilevel-opt.png"/> <p align="center"> </p> </p> <h2 id="experiments">Experiments</h2> <h3 id="user-study">User Study</h3> <p>In the user study, participants assembled a miniature table that required connecting four legs, connectors, and a flat surface plank by snapping the pieces together and securing them with screws and nuts.
The robot has to deliver four legs, eight connectors, and boxes of small and large screws and nuts on the kitting tray.</p> <div align="center"> <img width="200" src="/assets/img/roman2022/arranging_cropped.png"/> <img width="200" src="/assets/img/roman2022/delivery_cropped.png"/> <img width="200" src="/assets/img/roman2022/building_cropped.png"/> <img width="200" src="/assets/img/roman2022/finished_cropped.png"/> </div> <p><br/></p> <p>Our optimization produced the following kitting strategy based on an initial data set of human task times.</p> <div align="center"> <img width="250" src="/assets/img/roman2022/segment1.jpg"/> <img width="250" src="/assets/img/roman2022/segment2.jpg"/> <img width="250" src="/assets/img/roman2022/segment3.jpg"/> </div> <p><br/></p> <p>The first kit allows the human to connect a leg to a top connector. The third kit is repeated until the task is done. Here we do not consider part shortages. In the <a href="#discrete-event-simulation">simulation experiment</a>, we model various assembly part arrival time distributions and part-feeding machine breakdown conditions (i.e., a part is not available to the robot until the machine is repaired).</p> <p>The baselines that we compare to are <strong>Single Task</strong>, where the robot delivers parts for a single task at a time, and <strong>Whole Assembly</strong>, where all the parts are delivered at once. Our approach, <strong>Optimized</strong>, achieves a shorter total task time and less idle time than <strong>Whole Assembly</strong> and is also rated more useful and efficient.</p> <div align="center"> <img width="300" src="/assets/img/roman2022/userstudy_tasktimes.png"/> <img width="300" src="/assets/img/roman2022/userstudy_postexp.png"/> </div> <h3 id="discrete-event-simulation">Discrete Event Simulation</h3> <p>We model the arrival of assembly parts as a Poisson process with rate \(1/MAT\) where \(MAT\) is the mean arrival time.
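Since the interarrival times of a Poisson process are exponentially distributed with mean \(MAT\), such an arrival stream can be sampled directly; here is a minimal Python sketch (illustrative, not the paper's simulator):

```python
import random

def simulate_arrivals(mat, horizon, seed=0):
    """Sample part arrival times from a Poisson process with rate 1/mat.

    Accumulate exponential interarrival gaps with mean `mat` until the
    simulation horizon is exceeded.
    """
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(1.0 / mat)  # rate = 1/MAT, mean gap = MAT
        if t > horizon:
            return arrivals
        arrivals.append(t)

# With MAT = 5, we expect about horizon / MAT = 20 arrivals on average.
print(len(simulate_arrivals(mat=5.0, horizon=100.0)))
```

Machine breakdowns could be layered on top by sampling failure times the same way with mean \(MTTF\).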
The machine breakdown is modeled similarly, with the mean time to failure denoted by \(MTTF\). The figures below show the percent improvement in total task time of <strong>Optimized</strong> over <strong>Whole Assembly</strong> (top-left) and over <strong>Single Task</strong> (top-right), and the percent improvement in human idle time of <strong>Optimized</strong> over <strong>Whole Assembly</strong> (bottom-left) and over <strong>Single Task</strong> (bottom-right). The total task time and human idle time are significantly shorter for <strong>Optimized</strong> than for the baselines. <strong>Optimized</strong> is most advantageous over <strong>Whole Assembly</strong> when parts are often in short supply (\(MAT\) is high). <strong>Optimized</strong> is most advantageous over <strong>Single Task</strong> when there are many machine failures (\(MTTF\) is low).</p> <div align="center" style="display: flex; flex-wrap: wrap; justify-content: center;"> <div style="display: flex; width: 100%; justify-content: center; gap: 10px; margin-bottom: 10px;"> <img width="300" src="/assets/img/roman2022/tt_optimized_over_whole.png"/> <img width="300" src="/assets/img/roman2022/tt_optimized_over_single.png"/> </div> <div style="display: flex; width: 100%; justify-content: center; gap: 10px;"> <img width="300" src="/assets/img/roman2022/hit_optimized_over_whole.png"/> <img width="300" src="/assets/img/roman2022/hit_optimized_over_single.png"/> </div> </div>]]></content><author><name></name></author><category term="publications"/><summary type="html"><![CDATA[Blog post for RO-MAN 2022 paper]]></summary></entry></feed>