University of Rochester · University of California San Diego
Learning from Demonstrations (LfD) via Behavior Cloning works well on multiple complex tasks. However,
it requires expert demonstrations for all scenarios. We propose Learning from Interventions (LfI) with
hierarchical policies, which overcomes this limitation by using an expert overseer who intervenes only when
unsafe actions are about to be taken. Our approach accounts for the expert's reaction delay and learns
long-term behavior through sub-goal prediction.
The Thirty-Fourth AAAI Conference on Artificial Intelligence
Abstract
Learning from Demonstrations (LfD) via Behavior Cloning (BC) works well on multiple complex tasks. However, a limitation
of the typical LfD approach is that it requires expert demonstrations for all scenarios, including those in which the
algorithm is already well-trained. The recently proposed Learning from Interventions (LfI) overcomes this limitation
by using an expert overseer who intervenes only when it suspects that an unsafe action is about to be taken.
Although LfI significantly improves over LfD, the state-of-the-art LfI fails to account for delay caused by the expert's
reaction time and only learns short-term behavior. We address these limitations by 1) interpolating the expert's
interventions back in time and 2) splitting the policy into two hierarchical levels: one that generates sub-goals
for the future and another that generates actions to reach those desired sub-goals.
This sub-goal prediction forces the algorithm to learn long-term behavior while also being robust to the expert's
reaction time. Our experiments show that LfI using sub-goals in a hierarchical policy framework trains faster and
achieves better asymptotic performance than typical LfD.
Figure: Demonstration of the Learning from Interventions approach, showing how an expert overseer
intervenes when the autonomous agent is about to take unsafe actions, allowing for safe learning without
requiring demonstrations for all scenarios.
Method Overview
Figure: Top view of the map in the CARLA simulator where experiments were conducted.
The agent navigates autonomously according to high-level commands while staying in its lane and avoiding crashes.
Our approach combines Learning from Interventions with Hierarchical Reinforcement Learning. We split the policy into
two hierarchy levels: the top-level policy predicts a sub-goal to be achieved k steps ahead in the future, while the
bottom-level policy chooses the best action to achieve the sub-goal generated by the top-level policy.
We address the expert's reaction delay through an interpolation technique called Backtracking, which
allows us to use state-action pairs before and after the intervention. Since we work with images as our state space,
we generate intermediate sub-goal embeddings using Triplet Networks rather than predicting future images at pixel-level detail.
Problem Formulation
We consider a Goal-conditioned Markov Decision Process (S, A, T, R, C) where an agent interacts with an environment
in discrete time steps. At each time step t, the agent receives an observation ot ∈ O (sequence of images)
and a command ct from the expert, then takes action at.
The expert has policy π* : O × C → A and can intervene when the probability of catastrophic damage exceeds a threshold:
pc*(st−k*, at−k*) > 1 − δc. The effective policy under expert intervention is:
πeff(st, ct) = π*(st, ct) if pc*(st−k*, at−k*) > 1 − δc, and π(st, ct; θ) otherwise,
where k* is the expert's unknown reaction delay. Our goal is to improve the learned policy π(st, ct; θ)
without catastrophic failure by minimizing the difference between the desired and learned policies on the intervention data.
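To make the formulation concrete, the following Python sketch shows the intervention-gated control loop described above. The expert interface (catastrophe_prob, act) and the threshold value are hypothetical placeholders, not the paper's implementation.

def effective_action(state, command, learned_policy, expert, delta_c=0.05):
    """Return the action actually executed at time t.

    The expert overrides the learned policy only when it judges the probability
    of catastrophic damage to exceed 1 - delta_c; otherwise the learned policy
    acts. In practice the expert's judgment is based on a state observed k*
    steps earlier, which is the reaction delay the formulation models.
    `expert.catastrophe_prob` and `expert.act` are hypothetical interfaces.
    """
    proposed = learned_policy(state, command)       # a_t from pi(s_t, c_t; theta)
    if expert.catastrophe_prob(state, proposed) > 1.0 - delta_c:
        return expert.act(state, command)           # intervention: expert action a*_t
    return proposed                                 # nominal: learned action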
Traditional Learning from Demonstrations requires expert data for all scenarios, even those where the algorithm performs well.
Learning from Interventions addresses this by having an expert overseer who only intervenes when necessary, but existing
approaches have key limitations:
Expert Reaction Delay: The expert cannot react instantaneously and requires time to process and respond
Short-term Learning: Current methods only learn reactive behaviors without long-term planning
Video: Demonstration of the autonomous driving task, showing the agent navigating through
urban environments with expert interventions when unsafe actions are detected.
Hierarchical Policy Architecture
Instead of learning a direct policy network π(st, ct; θ), we divide the policy learning into
two levels of hierarchy. The top-level policy φ̂k learns to predict a sub-goal vector ĝt+k, k steps ahead,
while the bottom-level policy π learns to generate actions to achieve that sub-goal:
π(ot, ct; θ) = π(st, ĝt+k; θb)
ĝt+k := φ̂k(st, ct; θt)
Since we don't have access to ground truth sub-goals from the expert policy, we use another network
gt+k := φ(st+k; θg) that outputs the desired sub-goal embedding from an achieved state
st+k. This follows the Hindsight Experience Replay principle: any achieved state is considered a desirable
goal for past observations.
Top-level Policy: Generates sub-goals for future states (long-term planning)
Bottom-level Policy: Selects actions to reach the generated sub-goals (short-term execution)
This hierarchical structure has several advantages:
Long-term predictions are less affected by expert reaction delays
Forces the algorithm to learn long-term behavior beyond reactive responses
Enables better compensation for expert timing uncertainties
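The sketch below illustrates one way the two-level policy could be wired up in PyTorch, assuming the observation has already been encoded into a 2048-dimensional feature vector (e.g., by a ResNet-50 backbone as in the experiments). Layer sizes, module names, and the command encoding are illustrative assumptions rather than the paper's exact architecture.

import torch
import torch.nn as nn

STATE_DIM, CMD_DIM, GOAL_DIM, ACTION_DIM = 2048, 4, 128, 2   # assumed dimensions

class TopLevelPolicy(nn.Module):
    """phi_hat_k: predicts the sub-goal embedding g_hat_{t+k} from (s_t, c_t)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + CMD_DIM, 512), nn.ReLU(),
            nn.Linear(512, GOAL_DIM))

    def forward(self, state, command):
        return self.net(torch.cat([state, command], dim=-1))

class GoalEncoder(nn.Module):
    """phi: encodes an achieved future state s_{t+k} into the goal g_{t+k}."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 512), nn.ReLU(),
            nn.Linear(512, GOAL_DIM))

    def forward(self, future_state):
        return self.net(future_state)

class BottomLevelPolicy(nn.Module):
    """pi: maps (s_t, sub-goal) to an action (e.g., steering and brake)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + GOAL_DIM, 512), nn.ReLU(),
            nn.Linear(512, ACTION_DIM))

    def forward(self, state, goal):
        return self.net(torch.cat([state, goal], dim=-1))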
Triplet Network Training
We train the top-level policy and goal encoder using a Triplet Network approach. This avoids trivial solutions
that occur with direct minimization and is more suitable than autoencoders for goal prediction. The triplet loss
brings closer the embeddings ĝt+k and gt+k while pushing away ĝt+k from randomly
chosen goal embeddings.
LTriplet = (exp(d+) / (exp(d+) + exp(d−)))²
where d+ = ||φ̂k(st, ct) - φ(st+k)||
and d- = ||φ̂k(st, ct) - φ(sr)||
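A minimal PyTorch sketch of this triplet loss, assuming batched goal embeddings; the tensor shapes and the reduction to a batch mean are illustrative choices.

import torch

def triplet_loss(pred_goal, positive_goal, negative_goal):
    """Soft-ratio triplet loss from the expression above.

    pred_goal:     g_hat_{t+k} = phi_hat_k(s_t, c_t)        (anchor)
    positive_goal: phi(s_{t+k}), the achieved sub-goal      (positive)
    negative_goal: phi(s_r) for a randomly chosen state     (negative)
    All tensors are assumed to have shape (batch, goal_dim).
    """
    d_pos = torch.norm(pred_goal - positive_goal, dim=-1)
    d_neg = torch.norm(pred_goal - negative_goal, dim=-1)
    # Ratio approaches 0 when the positive is much closer than the negative.
    ratio = torch.exp(d_pos) / (torch.exp(d_pos) + torch.exp(d_neg))
    return (ratio ** 2).mean()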
Backtracking for Reaction Delay
To address the expert's reaction delay, we propose a novel interpolation technique called Backtracking.
During data collection, we maintain a queue of past M time-steps. When an expert intervenes at time t, we interpolate
the actions between at-M and the intervened action a*t to update the action queue.
The Backtracking algorithm works as follows:
Maintain a data queue Dβ of past M time-steps
When intervention occurs at time t, interpolate actions using Backtrack(a*t, at-M, j)
Update the action queue with interpolated values
Add the corrected trajectory data to the training dataset
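A minimal sketch of this queue update, assuming a simple linear blend for the Backtrack operator (the exact interpolation rule is not restated on this page); the function and variable names are illustrative.

import numpy as np

def backtrack(a_intervened, a_past, j, M):
    """Interpolated action for position j in the past-M queue (j = 0 .. M-1).

    Assumes a linear blend between the action taken M steps before the
    intervention and the expert's intervened action a*_t.
    """
    alpha = j / float(M)
    return (1.0 - alpha) * np.asarray(a_past) + alpha * np.asarray(a_intervened)

def apply_backtracking(queue, a_intervened, M):
    """Rewrite the actions in the queue of the past M time-steps.

    `queue` holds (state, action) pairs, oldest first; the corrected pairs
    are what gets added to the training dataset after an intervention.
    """
    a_past = queue[0][1]
    return [(state, backtrack(a_intervened, a_past, j, M))
            for j, (state, _) in enumerate(queue)]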
This technique allows us to:
Use state-action pairs both before and after the intervention
Better align expert feedback with the actual decision points
Improve the quality of training data despite reaction delays
Focus on critical intervention-related states rather than all demonstration data
Combined Objective Function
The complete training objective combines both bottom-level policy learning and triplet network training:
θt*, θg*, θb* = arg min(θt, θg, θb) ∑t∈D [
LBottom(π(st, ĝt+k(θt); θb), at)
+ LBottom(π(st, gt+k(θg); θb), at)
+ LTriplet(st, st+k, sr; θt, θg) ]
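The sketch below assembles this combined objective for one batch, reusing the triplet_loss function and the policy modules sketched above; using an L1 loss for the bottom-level imitation terms is an assumption for illustration.

import torch.nn.functional as F

def combined_loss(state, command, future_state, random_state, action,
                  top_policy, goal_encoder, bottom_policy):
    """Combined objective for one batch, following the expression above.

    Two bottom-level imitation terms (one conditioned on the predicted
    sub-goal, one on the encoded achieved sub-goal) plus the triplet term.
    """
    g_pred = top_policy(state, command)       # g_hat_{t+k}, depends on theta_t
    g_true = goal_encoder(future_state)       # g_{t+k},     depends on theta_g
    g_rand = goal_encoder(random_state)       # phi(s_r), random negative

    loss_pred = F.l1_loss(bottom_policy(state, g_pred), action)   # LBottom, predicted goal
    loss_true = F.l1_loss(bottom_policy(state, g_true), action)   # LBottom, achieved goal
    return loss_pred + loss_true + triplet_loss(g_pred, g_true, g_rand)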
Key Contributions
1. Novel Problem Formulation
We propose a new formulation of Learning from Interventions that incorporates the expert's reaction delay,
making it more realistic for real-world applications.
2. Hierarchical LfI Algorithm
We combine Learning from Interventions with Hierarchical Reinforcement Learning to address both reaction
delay and long-term behavior learning.
3. Backtracking Interpolation
Our interpolation technique allows effective use of state-action pairs before and after interventions,
improving data utilization.
4. Triplet Network Architecture
We propose a novel architecture using Triplet Networks to train hierarchical policies without ground-truth
sub-goals, enabling practical implementation.
Experimental Setup & Results
CARLA Simulation Environment
We evaluate our approach using the CARLA 3D urban driving simulator, which provides realistic vehicle dynamics
and environmental conditions. The agent navigates through urban environments with 80 other vehicles, following
topological commands from a human expert or planner.
Expert Policy and Data Collection
The expert policy combines a PID controller for steering with human oversight. The human expert intervenes only
in critical situations: potential crashes or lane departures. We record steering angles at,s ∈ [-1, 1]
and binary brake signals at,b ∈ {0, 1}, along with topological commands for lane-following, left turns,
and right turns.
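As a rough illustration of what one recorded time-step contains, here is a hypothetical Python record; the field and class names are illustrative, not the paper's.

import numpy as np
from dataclasses import dataclass
from enum import Enum

class Command(Enum):
    LANE_FOLLOW = 0
    LEFT_TURN = 1
    RIGHT_TURN = 2

@dataclass
class Sample:
    """One recorded time-step of driving data (field names are illustrative)."""
    observation: np.ndarray   # camera image(s) at time t
    command: Command          # topological command c_t
    steering: float           # a_{t,s} in [-1, 1]
    brake: int                # a_{t,b} in {0, 1}
    intervened: bool          # whether the expert overrode the learned policy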
Implementation Details
Our system processes 800×600 resolution images using ResNet-50 feature extraction, followed by command-conditioned
MLPs for policy execution. We use k=5 time steps for sub-goal prediction (1.25 seconds at 4 fps) and maintain
M past time-steps for backtracking interpolation.
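The following PyTorch sketch shows one plausible reading of the ResNet-50 feature extractor followed by command-conditioned MLP heads; hidden sizes, the number of commands, and the head-selection logic are assumptions for illustration, not the paper's exact network.

import torch
import torch.nn as nn
import torchvision

class CommandConditionedPolicy(nn.Module):
    """ResNet-50 features followed by one MLP head per topological command."""

    def __init__(self, num_commands=3, action_dim=2):
        super().__init__()
        self.backbone = torchvision.models.resnet50()
        self.backbone.fc = nn.Identity()                 # expose 2048-d features
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(2048, 512), nn.ReLU(),
                          nn.Linear(512, action_dim))
            for _ in range(num_commands)])

    def forward(self, images, command_idx):
        features = self.backbone(images)                 # (batch, 2048)
        # Pick the MLP head that matches each sample's command.
        return torch.stack([self.heads[int(c)](f)
                            for f, c in zip(features, command_idx)])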
We compare three data collection approaches:
Demo: Standard Behavior Cloning using all intervention and non-intervention data
CoL: Cycle-of-Learning using only intervention data
LbB: Our Learning by Backtracking approach with interpolated intervention data
And two policy representations:
Branched: Baseline approach with parallel MLPs selected by command
Sub-goal: Our hierarchical network with k=5 sub-goal prediction
Key Experimental Results
Performance Metrics
We measure success using two primary metrics: time and distance traveled without expert intervention.
These represent the duration of successful autonomous driving before requiring expert help.
Traditional Behavior Cloning shows moderate improvement initially but plateaus
Both Sub-goal and LbB individually improve performance
The combination achieves the best results for both time and distance metrics
Hierarchical policies enable longer-term planning and more effective data utilization
Data Efficiency
Our approach demonstrates superior data efficiency:
Sub-goal + LbB uses the least amount of training data per iteration
As the policy improves, fewer expert interventions are needed
Learning from Interventions focuses on critical states rather than all scenarios
Data requirements drop sharply at first and then remain consistently low
Hyperparameter Analysis
We evaluated the effect of k (sub-goal prediction horizon) and found k=5 provides optimal performance.
This represents a trade-off between:
Large k: Better long-term planning but harder for bottom-level policy to follow
Small k: More reactive behavior, similar to baseline approaches
k=5 (1.25s): Optimal balance for autonomous driving tasks
Real-world Validation
To demonstrate practical applicability, we deployed our approach on a real RC truck (1/10 scale, 13"×10"×11")
equipped with:
Nvidia TX2 embedded computer
Intel RealSense D415 central camera
Two side webcams for wide field of view
Arduino board and USB servo controller
Instead of vehicle-rich environments, we studied navigation in pedestrian-rich scenarios with command categories:
Path-following: Navigate with no pedestrians
Pedestrian-following: Follow behind a pedestrian
Confronting: Avoid hitting a confronting person
Crossing: Avoid hitting a crossing pedestrian
The real-world experiments confirmed the data efficiency improvements, showing a decrease in required interventions
across all four scenarios, validating the practical effectiveness of our hierarchical Learning from Interventions framework.
Citation
If you found this work useful in your research, please consider citing the following:
@inproceedings{bi2020learning,
title={Learning from Interventions Using Hierarchical Policies for Safe Learning},
author={Bi, Jing and Dhiman, Vikas and Xiao, Tianyou and Xu, Chenliang},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={34},
number={04},
pages={10352--10359},
year={2020}
}