Learning from Interventions Using Hierarchical Policies for Safe Learning

Jing Bi1, Vikas Dhiman2, Tianyou Xiao1, Chenliang Xu1

1 University of Rochester
2 University of California San Diego

Learning from Demonstrations (LfD) via Behavior Cloning works well on many complex tasks, but it requires expert demonstrations for all scenarios. We propose Learning from Interventions (LfI) with hierarchical policies, which overcomes this limitation by using an expert overseer who intervenes only when an unsafe action is about to be taken. Our approach accounts for the expert's reaction delay and learns long-term behavior through sub-goal prediction.


AAAI 2020

The Thirty-Fourth AAAI Conference on Artificial Intelligence


Abstract

Learning from Demonstrations (LfD) via Behavior Cloning (BC) works well on multiple complex tasks. However, a limitation of the typical LfD approach is that it requires expert demonstrations for all scenarios, including those in which the algorithm is already well-trained. The recently proposed Learning from Interventions (LfI) overcomes this limitation by using an expert overseer. The expert overseer only intervenes when it suspects that an unsafe action is about to be taken.

Although LfI significantly improves over LfD, the state-of-the-art LfI fails to account for the delay caused by the expert's reaction time and only learns short-term behavior. We address these limitations by 1) interpolating the expert's interventions back in time, and 2) splitting the policy into two hierarchical levels, one that generates sub-goals for the future and another that generates actions to reach those desired sub-goals.

This sub-goal prediction forces the algorithm to learn long-term behavior while also being robust to the expert's reaction time. Our experiments show that LfI using sub-goals in a hierarchical policy framework trains faster and achieves better asymptotic performance than typical LfD.

Demonstration of the Learning from Interventions approach showing how an expert overseer intervenes when the autonomous agent is about to take unsafe actions, allowing for safe learning without requiring demonstrations for all scenarios.

Method Overview

Top view of the map in CARLA simulator where experiments were conducted. The agent navigates autonomously according to high-level commands while staying in lanes and avoiding crashes.

Our approach combines Learning from Interventions with Hierarchical Reinforcement Learning. We split the policy into two hierarchy levels: the top-level policy predicts a sub-goal to be achieved k steps ahead in the future, while the bottom-level policy chooses the best action to achieve the sub-goal generated by the top-level policy.

We address the expert's reaction delay through an interpolation technique called Backtracking, which allows us to use state-action pairs before and after the intervention. Since we work with images as our state space, we generate intermediate sub-goal embeddings using Triplet Networks rather than predicting future images at pixel-level detail.

Problem Formulation

We consider a goal-conditioned Markov Decision Process $(\mathcal{S}, \mathcal{A}, T, R, \mathcal{C})$ in which an agent interacts with an environment in discrete time steps. At each time step $t$, the agent receives an observation $o_t \in \mathcal{O}$ (a sequence of images) and a command $c_t$ from the expert, then takes an action $a_t$.

The expert has policy $\pi^* : \mathcal{O} \times \mathcal{C} \to \mathcal{A}$ and intervenes when the probability of catastrophic damage exceeds a threshold, $p^*_c(s_{t-k^*}, a_{t-k^*}) > 1 - \delta_c$. The effective policy with expert intervention is:

$$\pi_{\text{eff}}(s_t, c_t) = \begin{cases} \pi^*(s_t, c_t), & \text{if an intervention is needed} \\ \pi(s_t, c_t; \theta), & \text{otherwise,} \end{cases}$$

where $k^*$ is the expert's unknown reaction delay. Our goal is to improve the learned policy $\pi(s_t, c_t; \theta)$ without catastrophic failure by minimizing the difference between the desired and learned policies on intervention data.
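
As a minimal sketch, the intervention switch can be viewed as follows; the `expert` and `learner` interfaces and the `prob_catastrophe` method are hypothetical placeholders, not the authors' code.

```python
def effective_policy(s_t, c_t, expert, learner, delta_c):
    """Return the executed action and whether the expert intervened.

    The expert takes over when its estimated probability of catastrophic
    damage exceeds 1 - delta_c; otherwise the learned policy acts.
    """
    a_learner = learner.act(s_t, c_t)
    if expert.prob_catastrophe(s_t, a_learner) > 1.0 - delta_c:
        return expert.act(s_t, c_t), True   # intervention: corrective action is logged
    return a_learner, False
```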

Traditional Learning from Demonstrations requires expert data for all scenarios, even those where the algorithm already performs well. Learning from Interventions addresses this by having an expert overseer who intervenes only when necessary, but existing approaches ignore the expert's reaction delay and learn only short-term behavior.

Video demonstration of the autonomous driving task showing the agent navigating through urban environments with expert interventions when unsafe actions are detected.

Hierarchical Policy Architecture

Instead of learning a direct policy network $\pi(s_t, c_t; \theta)$, we divide policy learning into two levels of hierarchy. The top-level policy $\hat{\phi}_k$ learns to predict a sub-goal vector $\hat{g}_{t+k}$, $k$ steps ahead, while the bottom-level policy $\pi$ learns to generate actions that achieve that sub-goal:

$$\pi(o_t, c_t; \theta) = \pi(s_t, \hat{g}_{t+k}; \theta_b), \qquad \hat{g}_{t+k} := \hat{\phi}_k(s_t, c_t; \theta_t).$$

Since we do not have access to ground-truth sub-goals from the expert policy, we use another network $g_{t+k} := \phi(s_{t+k}; \theta_g)$ that outputs the desired sub-goal embedding from an achieved state $s_{t+k}$. This follows the Hindsight Experience Replay principle: any achieved state is considered a desirable goal for past observations.
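
A rough sketch of the three networks, assuming a PyTorch implementation; layer sizes, the flattened state-feature interface, and class names are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TopLevelPolicy(nn.Module):
    """Top-level policy (phi_hat_k): predict the sub-goal embedding g_hat_{t+k} from (s_t, c_t)."""
    def __init__(self, state_dim, num_commands, goal_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + num_commands, 256), nn.ReLU(),
            nn.Linear(256, goal_dim))

    def forward(self, s_t, c_onehot):
        return self.net(torch.cat([s_t, c_onehot], dim=-1))

class GoalEncoder(nn.Module):
    """Goal encoder (phi): embed an achieved state s_{t+k} as the hindsight sub-goal g_{t+k}."""
    def __init__(self, state_dim, goal_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(), nn.Linear(256, goal_dim))

    def forward(self, s_tk):
        return self.net(s_tk)

class BottomLevelPolicy(nn.Module):
    """Bottom-level policy (pi): map (s_t, sub-goal) to an action such as (steering, brake)."""
    def __init__(self, state_dim, goal_dim, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim))

    def forward(self, s_t, goal):
        return self.net(torch.cat([s_t, goal], dim=-1))
```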

This hierarchical structure forces the algorithm to learn long-term behavior while remaining robust to the expert's reaction time.

Triplet Network Training

We train the top-level policy and the goal encoder using a Triplet Network approach. This avoids the trivial solutions that arise with direct minimization and is more suitable than autoencoders for goal prediction. The triplet loss pulls the embeddings $\hat{g}_{t+k}$ and $g_{t+k}$ together while pushing $\hat{g}_{t+k}$ away from randomly chosen goal embeddings:

$$\mathcal{L}_{\text{Triplet}} = \left( \frac{\exp(d_+)}{\exp(d_+) + \exp(d_-)} \right)^2,$$

where $d_+ = \lVert \hat{\phi}_k(s_t, c_t) - \phi(s_{t+k}) \rVert$ and $d_- = \lVert \hat{\phi}_k(s_t, c_t) - \phi(s_r) \rVert$.
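
A minimal sketch of this loss, assuming a PyTorch implementation; the function name and tensor shapes are illustrative, not the authors' code.

```python
import torch

def soft_triplet_loss(g_pred, g_pos, g_neg):
    """Softmax-ratio triplet loss: pull g_pred toward g_pos, push it away from g_neg.

    g_pred: predicted sub-goal embeddings, shape (batch, goal_dim)
    g_pos:  embeddings of the actually achieved states s_{t+k}
    g_neg:  embeddings of randomly chosen states s_r
    """
    d_pos = torch.norm(g_pred - g_pos, dim=-1)   # d_+
    d_neg = torch.norm(g_pred - g_neg, dim=-1)   # d_-
    ratio = torch.exp(d_pos) / (torch.exp(d_pos) + torch.exp(d_neg))
    return (ratio ** 2).mean()
```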

Backtracking for Reaction Delay

To address the expert's reaction delay, we propose a novel interpolation technique called Backtracking. During data collection, we maintain a queue of the past $M$ time steps. When the expert intervenes at time $t$, we interpolate the actions between $a_{t-M}$ and the intervened action $a^*_t$ to update the action queue.

The Backtracking algorithm works as follows:

  1. Maintain a data queue $D_\beta$ of the past $M$ time steps.
  2. When an intervention occurs at time $t$, interpolate actions using $\text{Backtrack}(a^*_t, a_{t-M}, j)$, as sketched below.
  3. Update the action queue with the interpolated values.
  4. Add the corrected trajectory data to the training dataset.
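
A minimal sketch of this relabeling step, assuming simple linear interpolation over the queue; the interpolation schedule and function names are assumptions for illustration, not the paper's exact Backtrack implementation.

```python
from collections import deque
import numpy as np

def backtrack(a_star, a_old, j, M):
    # Interpolate between the action recorded M steps before the intervention
    # (a_old) and the expert's corrective action (a_star); j indexes the queue
    # position, so j = 0 keeps a_old and j = M returns a_star exactly.
    alpha = j / float(M)
    return (1.0 - alpha) * np.asarray(a_old) + alpha * np.asarray(a_star)

def relabel_on_intervention(action_queue, a_star):
    """Rewrite the queued actions after an intervention with corrective action a_star."""
    M = len(action_queue)
    a_old = action_queue[0]
    return deque(backtrack(a_star, a_old, j, M) for j in range(1, M + 1))
```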

This technique lets us use state-action pairs from both before and after an intervention as training data, improving data utilization.

Combined Objective Function

The complete training objective combines both bottom-level policy learning and triplet network training:

$$\theta^*_t,\ \theta^*_g,\ \theta^*_b = \arg\min_{\theta_t, \theta_g, \theta_b} \sum_{t \in D} \Big[ \mathcal{L}_{\text{Bottom}}\big(\pi(s_t, \hat{g}_{t+k,\theta_t}; \theta_b), a_t\big) + \mathcal{L}_{\text{Bottom}}\big(\pi(s_t, g_{t+k,\theta_g}; \theta_b), a_t\big) + \mathcal{L}_{\text{Triplet}}(s_t, s_{t+k}, s_r; \theta_t, \theta_g) \Big]$$
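
A sketch of one training step on a sampled tuple, reusing the `soft_triplet_loss` sketch above and assuming modules `top` ($\hat{\phi}_k$), `goal_enc` ($\phi$), and `bottom` ($\pi$); the mean-squared-error stand-in for $\mathcal{L}_{\text{Bottom}}$, the batching, and the optimizer setup are illustrative assumptions.

```python
import torch.nn.functional as F

def training_step(batch, top, goal_enc, bottom, optimizer):
    # batch: current state s_t, command c_t, expert action a_t,
    #        achieved state s_{t+k}, and a randomly sampled state s_r
    s_t, c_t, a_t, s_tk, s_r = batch

    g_hat = top(s_t, c_t)        # predicted sub-goal  g_hat_{t+k}
    g_true = goal_enc(s_tk)      # hindsight sub-goal  g_{t+k}
    g_rand = goal_enc(s_r)       # negative sample for the triplet loss

    loss = (F.mse_loss(bottom(s_t, g_hat), a_t)      # L_Bottom with predicted goal
            + F.mse_loss(bottom(s_t, g_true), a_t)   # L_Bottom with hindsight goal
            + soft_triplet_loss(g_hat, g_true, g_rand))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```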

Key Contributions

1. Novel Problem Formulation

We propose a new formulation of Learning from Interventions that incorporates the expert's reaction delay, making it more realistic for real-world applications.

2. Hierarchical LfI Algorithm

We combine Learning from Interventions with Hierarchical Reinforcement Learning to address both reaction delay and long-term behavior learning.

3. Backtracking Interpolation

Our interpolation technique allows effective use of state-action pairs before and after interventions, improving data utilization.

4. Triplet Network Architecture

We propose a novel architecture using Triplet Networks to train hierarchical policies without ground-truth sub-goals, enabling practical implementation.

Experimental Setup & Results

CARLA Simulation Environment

We evaluate our approach using the CARLA 3D urban driving simulator, which provides realistic vehicle dynamics and environmental conditions. The agent navigates through urban environments with 80 other vehicles, following topological commands from a human expert or planner.

Expert Policy and Data Collection

The expert policy combines a PID controller for steering with human oversight. The human expert intervenes only in critical situations: potential crashes or lane departures. We record steering angles $a_{t,s} \in [-1, 1]$ and binary brake signals $a_{t,b} \in \{0, 1\}$, along with topological commands for lane-following, left turns, and right turns.
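
For reference, a generic PID steering controller has the following shape; the gains, time step, and error signal are placeholders, not the values used for the CARLA expert.

```python
import numpy as np

class PIDSteering:
    """Generic PID controller producing a steering command in [-1, 1]."""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error):
        # error: signed lateral/heading deviation from the desired lane center
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        u = self.kp * error + self.ki * self.integral + self.kd * derivative
        return float(np.clip(u, -1.0, 1.0))
```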

Implementation Details

Our system processes 800×600 resolution images using ResNet-50 feature extraction, followed by command-conditioned MLPs for policy execution. We use k=5 time steps for sub-goal prediction (1.25 seconds at 4 fps) and maintain M past time-steps for backtracking interpolation.
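
A sketch of this perception-and-policy stack, assuming torchvision's ResNet-50 backbone and one MLP branch per topological command; layer sizes and the single-command batching are assumptions, not the exact network used in the paper.

```python
import torch
import torch.nn as nn
from torchvision import models

class CommandConditionedPolicy(nn.Module):
    """ResNet-50 features followed by one MLP head per topological command."""
    def __init__(self, num_commands=3, action_dim=2):
        super().__init__()
        backbone = models.resnet50(weights=None)   # or load a pretrained checkpoint
        backbone.fc = nn.Identity()                # expose the 2048-d feature vector
        self.backbone = backbone
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(2048, 256), nn.ReLU(),
                          nn.Linear(256, action_dim))
            for _ in range(num_commands)])

    def forward(self, image, command_idx):
        # image: (batch, 3, H, W) tensor; command_idx selects the MLP branch
        # (assumes all samples in the batch share the same command)
        features = self.backbone(image)
        return self.heads[command_idx](features)
```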

We compare three data-collection approaches: learning from demonstrations, learning from interventions, and learning from interventions with Backtracking (LbB).

We also compare two policy representations: a flat (direct) policy and our hierarchical sub-goal policy.

Key Experimental Results

Performance Metrics

We measure success using two primary metrics: time and distance traveled without expert intervention. These represent the duration of successful autonomous driving before requiring expert help.

Accuracy Results

Our combined approach (Sub-goal + LbB) significantly outperforms the baseline methods on both metrics.

Data Efficiency

Our approach demonstrates superior data efficiency, training faster and requiring fewer expert interventions than typical LfD.

Hyperparameter Analysis

We evaluated the effect of the sub-goal prediction horizon $k$ and found that $k = 5$ provides the best performance, a trade-off between capturing longer-term behavior (larger $k$) and keeping the predicted sub-goal accurate and reachable (smaller $k$).

Real-world Validation

To demonstrate practical applicability, we deployed our approach on a real RC truck (1/10 scale, 13"×10"×11") equipped with an onboard camera and onboard compute.

Instead of vehicle-rich environments, we studied navigation in pedestrian-rich settings across four command scenarios.

The real-world experiments confirmed the data-efficiency improvements: the number of required interventions decreased across all four scenarios, validating the practical effectiveness of our hierarchical Learning from Interventions framework.

Citation

If you found this work useful in your research, please consider citing the following:

@inproceedings{bi2020learning,
  title={Learning from Interventions Using Hierarchical Policies for Safe Learning},
  author={Bi, Jing and Dhiman, Vikas and Xiao, Tianyou and Xu, Chenliang},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={34},
  number={04},
  pages={10352--10359},
  year={2020}
}