University of Rochester · University of California San Diego
Learning from Demonstrations (LfD) via Behavior Cloning works well on multiple complex tasks. However,
it requires expert demonstrations for all scenarios. We propose Learning from Interventions (LfI) with
hierarchical policies, which overcomes this limitation by using an expert overseer who intervenes only when
unsafe actions are about to be taken. Our approach accounts for the expert's reaction delay and learns
long-term behavior through sub-goal prediction.
The Thirty-Fourth AAAI Conference on Artificial Intelligence
Abstract
Learning from Demonstrations (LfD) via Behavior Cloning (BC) works well on multiple complex tasks. However, a limitation
of the typical LfD approach is that it requires expert demonstrations for all scenarios, including those in which the
algorithm is already well-trained. The recently proposed Learning from Interventions (LfI) overcomes this limitation
by using an expert overseer who intervenes only when it suspects that an unsafe action is about to be taken.
Although LfI significantly improves over LfD, the state-of-the-art LfI fails to account for delay caused by the expert's
reaction time and only learns short-term behavior. We address these limitations by 1) interpolating the expert's
interventions back in time and 2) splitting the policy into two hierarchical levels: one that generates sub-goals
for the future and another that generates actions to reach those desired sub-goals.
This sub-goal prediction forces the algorithm to learn long-term behavior while also being robust to the expert's
reaction time. Our experiments show that LfI using sub-goals in a hierarchical policy framework trains faster and
achieves better asymptotic performance than typical LfD.
Figure: Demonstration of the Learning from Interventions approach, showing how an expert overseer
intervenes when the autonomous agent is about to take unsafe actions, allowing for safe learning without
requiring demonstrations for all scenarios.
Method Overview
Figure: Top view of the map in the CARLA simulator where experiments were conducted.
The agent navigates autonomously according to high-level commands while staying in its lane and avoiding crashes.
Our approach combines Learning from Interventions with Hierarchical Reinforcement Learning. We split the policy into
two hierarchy levels: the top-level policy predicts a sub-goal to be achieved k steps ahead in the future, while the
bottom-level policy chooses the best action to achieve the sub-goal generated by the top-level policy.
We address the expert's reaction delay through an interpolation technique called Backtracking, which
allows us to use state-action pairs before and after the intervention. Since we work with images as our state space,
we generate intermediate sub-goal embeddings using Triplet Networks rather than predicting future images at pixel-level detail.
Problem Formulation
We consider a Goal-conditioned Markov Decision Process (S, A, T, R, C) where an agent interacts with an environment
in discrete time steps. At each time step t, the agent receives an observation ot ∈ O (sequence of images)
and a command ct from the expert, then takes action at.
The expert has policy π* : O × C → A and can intervene when the probability of catastrophic damage exceeds a threshold:
pc*(st−k*, at−k*) > 1 − δc. The effective policy under expert intervention is:
πeff(st, ct) = π*(st, ct) if pc*(st−k*, at−k*) > 1 − δc, and π(st, ct; θ) otherwise,
where k* is the expert's unknown reaction delay. Our goal is to improve the learned policy π(st, ct; θ)
without catastrophic failure by minimizing the difference between the desired and learned policies on the intervention data.
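To make the formulation concrete, the following Python sketch shows the intervention-gated control loop described above. The expert interface (catastrophe_prob, act) and the threshold value are hypothetical placeholders, not the paper's implementation.

def effective_action(state, command, learned_policy, expert, delta_c=0.05):
    """Return the action actually executed at time t.

    The expert overrides the learned policy only when it judges the probability
    of catastrophic damage to exceed 1 - delta_c; otherwise the learned policy
    acts. In practice the expert's judgment is based on a state observed k*
    steps earlier, which is the reaction delay the formulation models.
    `expert.catastrophe_prob` and `expert.act` are hypothetical interfaces.
    """
    proposed = learned_policy(state, command)       # a_t from pi(s_t, c_t; theta)
    if expert.catastrophe_prob(state, proposed) > 1.0 - delta_c:
        return expert.act(state, command)           # intervention: expert action a*_t
    return proposed                                 # nominal: learned action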
Traditional Learning from Demonstrations requires expert data for all scenarios, even those where the algorithm performs well.
Learning from Interventions addresses this by having an expert overseer who only intervenes when necessary, but existing
approaches have key limitations:
Expert Reaction Delay: The expert cannot react instantaneously and requires time to process and respond
Short-term Learning: Current methods only learn reactive behaviors without long-term planning
Video: Demonstration of the autonomous driving task, showing the agent navigating through
urban environments with expert interventions when unsafe actions are detected.
Hierarchical Policy Architecture
Instead of learning a direct policy network π(st, ct; θ), we divide the policy learning into
two levels of hierarchy. The top-level policy φ̂k learns to predict a sub-goal vector ĝt+k, k steps ahead,
while the bottom-level policy π learns to generate actions to achieve that sub-goal:
π(ot, ct; θ) = π(st, ĝt+k; θb)
ĝt+k := φ̂k(st, ct; θt)
Since we don't have access to ground truth sub-goals from the expert policy, we use another network
gt+k := φ(st+k; θg) that outputs the desired sub-goal embedding from an achieved state
st+k. This follows the Hindsight Experience Replay principle: any achieved state is considered a desirable
goal for past observations.
Top-level Policy: Generates sub-goals for future states (long-term planning)
Bottom-level Policy: Selects actions to reach the generated sub-goals (short-term execution)
This hierarchical structure has several advantages:
Long-term predictions are less affected by expert reaction delays
Forces the algorithm to learn long-term behavior beyond reactive responses
Enables better compensation for expert timing uncertainties
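The sketch below illustrates one way the two-level policy could be wired up in PyTorch, assuming the observation has already been encoded into a 2048-dimensional feature vector (e.g., by a ResNet-50 backbone as in the experiments). Layer sizes, module names, and the command encoding are illustrative assumptions rather than the paper's exact architecture.

import torch
import torch.nn as nn

STATE_DIM, CMD_DIM, GOAL_DIM, ACTION_DIM = 2048, 4, 128, 2   # assumed dimensions

class TopLevelPolicy(nn.Module):
    """phi_hat_k: predicts the sub-goal embedding g_hat_{t+k} from (s_t, c_t)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + CMD_DIM, 512), nn.ReLU(),
            nn.Linear(512, GOAL_DIM))

    def forward(self, state, command):
        return self.net(torch.cat([state, command], dim=-1))

class GoalEncoder(nn.Module):
    """phi: encodes an achieved future state s_{t+k} into the goal g_{t+k}."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 512), nn.ReLU(),
            nn.Linear(512, GOAL_DIM))

    def forward(self, future_state):
        return self.net(future_state)

class BottomLevelPolicy(nn.Module):
    """pi: maps (s_t, sub-goal) to an action (e.g., steering and brake)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + GOAL_DIM, 512), nn.ReLU(),
            nn.Linear(512, ACTION_DIM))

    def forward(self, state, goal):
        return self.net(torch.cat([state, goal], dim=-1))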
Triplet Network Training
We train the top-level policy and goal encoder using a Triplet Network approach. This avoids trivial solutions
that occur with direct minimization and is more suitable than autoencoders for goal prediction. The triplet loss
brings closer the embeddings ĝt+k and gt+k while pushing away ĝt+k from randomly
chosen goal embeddings.
LTriplet = (exp(d+) / (exp(d+) + exp(d−)))²
where d+ = ||φ̂k(st, ct) - φ(st+k)||
and d- = ||φ̂k(st, ct) - φ(sr)||
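A minimal PyTorch sketch of this triplet loss, assuming batched goal embeddings; the tensor shapes and the reduction to a batch mean are illustrative choices.

import torch

def triplet_loss(pred_goal, positive_goal, negative_goal):
    """Soft-ratio triplet loss from the expression above.

    pred_goal:     g_hat_{t+k} = phi_hat_k(s_t, c_t)        (anchor)
    positive_goal: phi(s_{t+k}), the achieved sub-goal      (positive)
    negative_goal: phi(s_r) for a randomly chosen state     (negative)
    All tensors are assumed to have shape (batch, goal_dim).
    """
    d_pos = torch.norm(pred_goal - positive_goal, dim=-1)
    d_neg = torch.norm(pred_goal - negative_goal, dim=-1)
    # Ratio approaches 0 when the positive is much closer than the negative.
    ratio = torch.exp(d_pos) / (torch.exp(d_pos) + torch.exp(d_neg))
    return (ratio ** 2).mean()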
Backtracking for Reaction Delay
To address the expert's reaction delay, we propose a novel interpolation technique called Backtracking.
During data collection, we maintain a queue of past M time-steps. When an expert intervenes at time t, we interpolate
the actions between at-M and the intervened action a*t to update the action queue.
The Backtracking algorithm works as follows:
Maintain a data queue Dβ of past M time-steps
When intervention occurs at time t, interpolate actions using Backtrack(a*t, at-M, j)
Update the action queue with interpolated values
Add the corrected trajectory data to the training dataset
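A minimal sketch of this queue update, assuming a simple linear blend for the Backtrack operator (the exact interpolation rule is not restated on this page); the function and variable names are illustrative.

import numpy as np

def backtrack(a_intervened, a_past, j, M):
    """Interpolated action for position j in the past-M queue (j = 0 .. M-1).

    Assumes a linear blend between the action taken M steps before the
    intervention and the expert's intervened action a*_t.
    """
    alpha = j / float(M)
    return (1.0 - alpha) * np.asarray(a_past) + alpha * np.asarray(a_intervened)

def apply_backtracking(queue, a_intervened, M):
    """Rewrite the actions in the queue of the past M time-steps.

    `queue` holds (state, action) pairs, oldest first; the corrected pairs
    are what gets added to the training dataset after an intervention.
    """
    a_past = queue[0][1]
    return [(state, backtrack(a_intervened, a_past, j, M))
            for j, (state, _) in enumerate(queue)]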
This technique allows us to:
Use state-action pairs both before and after the intervention
Better align expert feedback with the actual decision points
Improve the quality of training data despite reaction delays
Focus on critical intervention-related states rather than all demonstration data
Combined Objective Function
The complete training objective combines both bottom-level policy learning and triplet network training:
θt*, θg*, θb* = arg min(θt, θg, θb) ∑t∈D [
LBottom(π(st, ĝt+k(θt); θb), at)
+ LBottom(π(st, gt+k(θg); θb), at)
+ LTriplet(st, st+k, sr; θt, θg) ]
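The sketch below assembles this combined objective for one batch, reusing the triplet_loss function and the policy modules sketched above; using an L1 loss for the bottom-level imitation terms is an assumption for illustration.

import torch.nn.functional as F

def combined_loss(state, command, future_state, random_state, action,
                  top_policy, goal_encoder, bottom_policy):
    """Combined objective for one batch, following the expression above.

    Two bottom-level imitation terms (one conditioned on the predicted
    sub-goal, one on the encoded achieved sub-goal) plus the triplet term.
    """
    g_pred = top_policy(state, command)       # g_hat_{t+k}, depends on theta_t
    g_true = goal_encoder(future_state)       # g_{t+k},     depends on theta_g
    g_rand = goal_encoder(random_state)       # phi(s_r), random negative

    loss_pred = F.l1_loss(bottom_policy(state, g_pred), action)   # LBottom, predicted goal
    loss_true = F.l1_loss(bottom_policy(state, g_true), action)   # LBottom, achieved goal
    return loss_pred + loss_true + triplet_loss(g_pred, g_true, g_rand)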
Key Contributions
1. Novel Problem Formulation
We propose a new formulation of Learning from Interventions that incorporates the expert's reaction delay,
making it more realistic for real-world applications.
2. Hierarchical LfI Algorithm
We combine Learning from Interventions with Hierarchical Reinforcement Learning to address both reaction
delay and long-term behavior learning.
3. Backtracking Interpolation
Our interpolation technique allows effective use of state-action pairs before and after interventions,
improving data utilization.
4. Triplet Network Architecture
We propose a novel architecture using Triplet Networks to train hierarchical policies without ground-truth
sub-goals, enabling practical implementation.
Experimental Setup & Results
CARLA Simulation Environment
We evaluate our approach using the CARLA 3D urban driving simulator, which provides realistic vehicle dynamics
and environmental conditions. The agent navigates through urban environments with 80 other vehicles, following
topological commands from a human expert or planner.
Expert Policy and Data Collection
The expert policy combines a PID controller for steering with human oversight. The human expert intervenes only
in critical situations: potential crashes or lane departures. We record steering angles at,s ∈ [-1, 1]
and binary brake signals at,b ∈ {0, 1}, along with topological commands for lane-following, left turns,
and right turns.
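As a rough illustration of what one recorded time-step contains, here is a hypothetical Python record; the field and class names are illustrative, not the paper's.

import numpy as np
from dataclasses import dataclass
from enum import Enum

class Command(Enum):
    LANE_FOLLOW = 0
    LEFT_TURN = 1
    RIGHT_TURN = 2

@dataclass
class Sample:
    """One recorded time-step of driving data (field names are illustrative)."""
    observation: np.ndarray   # camera image(s) at time t
    command: Command          # topological command c_t
    steering: float           # a_{t,s} in [-1, 1]
    brake: int                # a_{t,b} in {0, 1}
    intervened: bool          # whether the expert overrode the learned policy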
Implementation Details
Our system processes 800×600 resolution images using ResNet-50 feature extraction, followed by command-conditioned
MLPs for policy execution. We use k=5 time steps for sub-goal prediction (1.25 seconds at 4 fps) and maintain
M past time-steps for backtracking interpolation.
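The following PyTorch sketch shows one plausible reading of the ResNet-50 feature extractor followed by command-conditioned MLP heads; hidden sizes, the number of commands, and the head-selection logic are assumptions for illustration, not the paper's exact network.

import torch
import torch.nn as nn
import torchvision

class CommandConditionedPolicy(nn.Module):
    """ResNet-50 features followed by one MLP head per topological command."""

    def __init__(self, num_commands=3, action_dim=2):
        super().__init__()
        self.backbone = torchvision.models.resnet50()
        self.backbone.fc = nn.Identity()                 # expose 2048-d features
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(2048, 512), nn.ReLU(),
                          nn.Linear(512, action_dim))
            for _ in range(num_commands)])

    def forward(self, images, command_idx):
        features = self.backbone(images)                 # (batch, 2048)
        # Pick the MLP head that matches each sample's command.
        return torch.stack([self.heads[int(c)](f)
                            for f, c in zip(features, command_idx)])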
We compare three data collection approaches:
Demo: Standard Behavior Cloning using all intervention and non-intervention data
CoL: Cycle-of-Learning using only intervention data
LbB: Our Learning by Backtracking approach with interpolated intervention data
And two policy representations:
Branched: Baseline approach with parallel MLPs selected by command
Sub-goal: Our hierarchical network with k=5 sub-goal prediction
Key Experimental Results
Performance Metrics
We measure success using two primary metrics: time and distance traveled without expert intervention.
These represent the duration of successful autonomous driving before requiring expert help.
Traditional Behavior Cloning shows moderate improvement initially but plateaus
Both Sub-goal and LbB individually improve performance
The combination achieves the best results for both time and distance metrics
Hierarchical policies enable longer-term planning and more effective data utilization
Data Efficiency
Our approach demonstrates superior data efficiency:
Sub-goal + LbB uses the least amount of training data per iteration
As the policy improves, fewer expert interventions are needed
Learning from Interventions focuses on critical states rather than all scenarios
Data requirements drop sharply at first and then remain consistently low
Hyperparameter Analysis
We evaluated the effect of k (sub-goal prediction horizon) and found k=5 provides optimal performance.
This represents a trade-off between:
Large k: Better long-term planning but harder for bottom-level policy to follow
Small k: More reactive behavior, similar to baseline approaches
k=5 (1.25s): Optimal balance for autonomous driving tasks
Real-world Validation
To demonstrate practical applicability, we deployed our approach on a real RC truck (1/10 scale, 13"×10"×11")
equipped with:
Nvidia TX2 embedded computer
Intel RealSense D415 central camera
Two side webcams for wide field of view
Arduino board and USB servo controller
Instead of vehicle-rich environments, we studied navigation in pedestrian-rich scenarios with command categories:
Path-following: Navigate with no pedestrians
Pedestrian-following: Follow behind a pedestrian
Confronting: Avoid hitting a confronting person
Crossing: Avoid hitting a crossing pedestrian
The real-world experiments confirmed the data efficiency improvements, showing a decrease in required interventions
across all four scenarios, validating the practical effectiveness of our hierarchical Learning from Interventions framework.
Citation
If you found this work useful in your research, please consider citing the following:
@inproceedings{bi2020learning,
title={Learning from Interventions Using Hierarchical Policies for Safe Learning},
author={Bi, Jing and Dhiman, Vikas and Xiao, Tianyou and Xu, Chenliang},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={34},
number={04},
pages={10352--10359},
year={2020}
}