
Advanced Off-Policy RL

Stone Tao

(slides prepared by Hao Su, Zhan Ling, and Stone Tao)

Spring, 2024

Contents are based on the course website.

Agenda


Key Ideas of Off-Policy RL

Off-Policy RL

Key ideas:

Recall: Q-Learning for Tabular RL

  1. Given transitions $\{(s,a,s',r)\}$ from some trajectories, how to improve the current Q-function?
    • By Temporal Difference learning, the update target for $Q(s,a)$ is
      • $r+\gamma\max_{a'} Q(s', a')$
    • Take a small step towards the target
      • $Q(s,a)\leftarrow Q(s,a)+\alpha[r+\gamma\max_{a'} Q(s', a')-Q(s,a)]$
  2. Given $Q$, how to improve policy?
    • Take the greedy policy based on the current $Q$
      • $\pi(s)=\text{argmax}_a Q(s,a)$
  3. Given $\pi$, how to generate trajectories?
    • $\epsilon$-greedy policy in the environment.
  4. Remember the three key elements above; they constitute just about any off-policy algorithm (a minimal tabular sketch follows this list).
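The three steps can be made concrete in a few lines. Below is a minimal tabular Q-learning sketch, assuming a Gymnasium-style environment with discrete observation and action spaces; the hyperparameter values and function names are illustrative, not taken from the slides.

```python
import numpy as np

def tabular_q_learning(env, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Minimal tabular Q-learning loop (assumes a Gymnasium-style discrete env)."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(num_episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # Step 3: generate data with an epsilon-greedy behavior policy.
            if np.random.rand() < epsilon:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # Step 1: TD update toward the target r + gamma * max_a' Q(s', a').
            target = r + gamma * np.max(Q[s_next]) * (not terminated)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    # Step 2: the improved policy is greedy with respect to the learned Q.
    greedy_policy = lambda state: int(np.argmax(Q[state]))
    return Q, greedy_policy
```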

Continuous Q-Learning


Continuous Deterministic Policy Network

TD-based Q Function Learning

We can still use a TD loss to learn $Q_{\theta}$. Given a transition sample $(s,a,s',r)$, regress $Q_{\theta}(s,a)$ toward the target $y=r+\gamma Q_{\theta}(s', \pi_{\phi}(s'))$, i.e., minimize $\big(Q_{\theta}(s,a)-y\big)^2$.
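As a concrete illustration, here is a minimal PyTorch sketch of this TD loss; `q_net` and `policy_net` stand for the parameterized $Q_{\theta}$ and $\pi_{\phi}$, and the batch layout is an assumption, not something specified in the slides.

```python
import torch
import torch.nn.functional as F

def critic_td_loss(q_net, policy_net, batch, gamma=0.99):
    """TD loss for a continuous-action Q-network (sketch).

    q_net(s, a) and policy_net(s) are assumed to be torch.nn.Modules;
    the batch tensors are assumed to come from a replay buffer.
    """
    s, a, r, s_next, done = batch
    with torch.no_grad():
        # Update target: r + gamma * Q_theta(s', pi_phi(s')).
        # The deterministic policy's action replaces the max over actions
        # used in tabular Q-learning.
        y = r + gamma * (1.0 - done) * q_net(s_next, policy_net(s_next))
    return F.mse_loss(q_net(s, a), y)
```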

Have We Finished? Revisit the Three Questions

  1. Given transitions $\{(s,a,s',r)\}$ from some trajectories, how to improve the current Q-function?
    • We have derived the update target for $Q(s,a)$: $r+\gamma Q_{\theta}(s', \pi_{\phi}(s'))$.
  2. Given $Q$, how to improve the policy?
    • We introduced a policy network $\pi_{\phi}$ and update it by solving $\underset{\phi}{\text{maximize}}\ Q_{\theta}(s, \pi_{\phi}(s))$.
  3. Given $\pi$, how to generate trajectories?
    • We also need exploration in continuous action space (see the sketch after this list)!
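The policy-improvement step and a simple exploration strategy can be sketched as follows (DDPG-style). Adding Gaussian noise to the deterministic action is one common choice; the noise scale and action bounds below are illustrative assumptions, not values from the slides.

```python
import torch

def actor_loss(q_net, policy_net, s):
    """Policy improvement: maximize Q_theta(s, pi_phi(s)) over phi,
    implemented as minimizing its negation (only policy_net is optimized)."""
    return -q_net(s, policy_net(s)).mean()

def explore(policy_net, s, noise_std=0.1, act_low=-1.0, act_high=1.0):
    """Collect trajectories with a perturbed deterministic policy.
    Gaussian action noise is one common exploration choice in continuous
    action spaces; noise_std and the action bounds are illustrative."""
    with torch.no_grad():
        a = policy_net(s)
        a = a + noise_std * torch.randn_like(a)
    return a.clamp(act_low, act_high)
```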

Exploration in Continuous Action Space

Deep Deterministic Policy Gradient (DDPG)

Trouble and Tricks in Practice

Tricks to Overcome Value Overestimation

Issue: Value Overestimation

Double Q-Learning

Clipped Double Q-Learning

Tricks to Address Rare Beneficial Samples

Issue: Rare Beneficial Samples in the Replay Buffer

Blind Cliffwalk

Analysis with Q-Learning

Prioritized Experience Replay


Tricks to Accelerate Reward Propagation

Slow Reward Propagation

Tricks in Value Network Architecture Design

Dueling Network

Tricks by Considering Uncertainty of Value Estimation

Stochasticity in the Environments

Value Network with Discrete Distribution

Read on Your Own

Tricks in Leveraging State-conditioned Exploration Noise

State-conditioned Exploration Noise

Noisy Nets (For Discrete Action Space)

Parameterized Squashed Gaussian Policy (For Continuous Action Space)

Off-Policy RL Frameworks in Practice

Practical Off-Policy Algorithms

Rainbow

Ablation study of tricks in Rainbow

Soft Actor-Critic (SAC)

Randomized Ensembled Double Q-Learning (REDQ)