
Advanced Policy Gradient Algorithms

Tongzhou Mu

(slides prepared by Hao Su, Shuang Liu, and Tongzhou Mu)

Spring, 2024

Agenda


Review

Taxonomy of RL Algorithms and Examples

[Diagram] Taxonomy: RL Algorithms → Model-Free RL (MC Sampling → REINFORCE; Bellman-based → Deep Q-Network) and Model-Based RL (Learn the Model; Given the Model). Highlighted here: Deep Q-Network.

Deep Value Network

Training Deep Q Network

Replay Buffer

Deep Q-Learning Algorithm

Taxonomy of RL Algorithms and Examples

[Diagram] Taxonomy: RL Algorithms → Model-Free RL (MC Sampling → REINFORCE; Bellman-based → Deep Q-Network) and Model-Based RL (Learn the Model; Given the Model). Highlighted here: REINFORCE.

First-Order Policy Optimization

Goal: Derive a way to directly estimate the gradient $\frac{\partial V^{\pi_{\theta}}(s_0)}{\partial \theta}$ from samples.

Policy Gradient Theorem (Discounted)

We will assume the discounted setting from now on.

Creating an Unbiased Estimate for PG (Cont'd)

We have shown that \begin{align*} \nabla_{\theta}V^{\pi_{\theta}, \gamma}(s_0)=\mathbb{E}\left[\sum_{t = 0}^{\infty}\gamma^t\nabla_{\theta}\ln\left(\pi_{\theta}(a_t|s_t)\right)\cdot \sum_{i = t}^{\infty} \gamma^{i - t}\cdot r_i\right] \end{align*}
We have established an MC-sampling-based method to estimate the gradient of the value w.r.t. the policy parameters! This estimate is unbiased.

REINFORCE Algorithm

The steps involved in implementing REINFORCE are as follows (a minimal sketch is given after the list):
  1. Randomly initialize a policy network that takes the state as input and returns a probability distribution over actions
  2. Use the policy to play $n$ episodes of the game. Record $(s,a,s',r)$ for each step.
  3. Working backwards through each episode, calculate the discounted return for each step
  4. Use the return $G_t=\sum_{i=t}^{\infty}\gamma^{i-t}r_i$ as the estimate of the expected reward from step $t$
  5. Adjust the weights of the policy along the gradient given by the policy gradient theorem
  6. Repeat from step 2
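
To make the recipe concrete, here is a minimal PyTorch sketch of steps 3–5 (the network architecture, the rollout format, and all hyperparameters are illustrative assumptions, not part of the slides):

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Step 1: maps a state to a categorical distribution over actions."""
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, num_actions)
        )

    def forward(self, state):
        return torch.distributions.Categorical(logits=self.net(state))

def discounted_returns(rewards, gamma):
    """Steps 3-4: compute G_t for every step, working backwards."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

def reinforce_update(policy, optimizer, episode, gamma=0.99):
    """Step 5: one gradient step on a single recorded episode.
    `episode` is assumed to be a list of (state, action, reward) tuples."""
    states, actions, rewards = zip(*episode)
    states = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s in states])
    actions = torch.as_tensor(actions)
    returns = torch.tensor(discounted_returns(rewards, gamma))
    log_probs = policy(states).log_prob(actions)
    # The gamma^t weighting from the theorem is dropped here, as in most
    # implementations (see the later slide). Minimizing the negative
    # objective is gradient ascent on the return.
    loss = -(log_probs * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```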

Practical First-Order Policy Optimization

Advanced Value Estimates

Advantage

Advantage Estimates

\begin{align*} \nabla_{\theta}V^{\pi_{\theta}, \gamma}(s_0) &= \sum_s\sum_{t = 0}^{\infty}\gamma^t\mu_t(s;s_0) \left(\sum_{a}\nabla_{\theta}\pi_{\theta}(a|s) \cdot Q^{\pi_{\theta}, \gamma}(s, a) - 0\right)\\ \text{(why?)}&= \sum_s\sum_{t = 0}^{\infty}\gamma^t\mu_t(s;s_0) \left(\sum_{a}\nabla_{\theta}\pi_{\theta}(a|s) \cdot Q^{\pi_{\theta}, \gamma}(s, a) - \sum_{a}\nabla_{\theta}\pi_{\theta}(a|s)\cdot V^{\pi_{\theta}, \gamma}(s)\right)\\ &= \sum_s\sum_{t = 0}^{\infty}\gamma^t\mu_t(s;s_0) \sum_{a}\nabla_{\theta}\pi_{\theta}(a|s) \cdot \left(Q^{\pi_{\theta}, \gamma}(s, a) - V^{\pi_{\theta}, \gamma}(s)\right)\\ &= \sum_s\sum_{t = 0}^{\infty}\gamma^t\mu_t(s;s_0)\sum_{a}\nabla_{\theta}\ln\left(\pi_{\theta}(a|s)\right) \cdot \pi_{\theta}(a|s) A^{\pi_{\theta}, \gamma}(s, a)\\ &=\mathbb{E}\left[\sum_{t = 0}^{\infty}\gamma^t\nabla_{\theta}\ln\left(\pi_{\theta}(a_t|s_t)\right)\cdot A^{\pi_{\theta}, \gamma}(s_t, a_t)\right] \end{align*}
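
Any estimator of $A^{\pi_{\theta}, \gamma}(s_t, a_t)$ can be plugged into this expression. One simple choice (an illustrative assumption here; the following slides discuss value-function estimation in more detail) is the Monte Carlo return minus a learned value baseline, $\hat{A}(s_t, a_t) = G_t - V_{\phi}(s_t)$:

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Baseline V_phi(s), trained to regress the Monte Carlo return G_t."""
    def __init__(self, state_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, state):
        return self.net(state).squeeze(-1)

def mc_advantages(value_net, states, returns):
    """A_hat(s_t, a_t) = G_t - V_phi(s_t).
    states: [T, state_dim] float tensor; returns: [T] tensor of G_t."""
    with torch.no_grad():
        baseline = value_net(states)
    # The baseline itself is fit separately, e.g. by minimizing (V_phi(s_t) - G_t)^2.
    return returns - baseline
```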

Advantage Estimates (Cont'd)

Incremental Monte Carlo Value Function Estimation

Some Additional Information

Given any advantage estimate $\hat{A}^{\pi_{\theta}, \gamma}(s_t, a_t)$, we can estimate the policy gradient by \begin{align*} \hat{\nabla}_{\theta}V^{\pi_{\theta}, \gamma}(s_0) =\mathbb{E}\left[\sum_{t = 0}^{\infty}\gamma^t\nabla_{\theta}\ln\left(\pi_{\theta}(a_t|s_t)\right)\cdot \hat{A}^{\pi_{\theta}, \gamma}(s_t, a_t)\right]. \end{align*} However, most implementations simply drop the $\gamma^t$ weighting and use \begin{align*} \hat{\nabla}_{\theta}V^{\pi_{\theta}, \gamma}(s_0) =\mathbb{E}\left[\sum_{t = 0}^{\infty}\nabla_{\theta}\ln\left(\pi_{\theta}(a_t|s_t)\right)\cdot \hat{A}^{\pi_{\theta}, \gamma}(s_t, a_t)\right]. \end{align*}
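
In code, this practical estimator is just a log-probability-weighted advantage loss. A minimal sketch, assuming `log_probs` and `advantages` are tensors gathered from on-policy rollouts:

```python
import torch

def policy_gradient_loss(log_probs, advantages):
    """The practical estimator above (no gamma^t weighting).
    log_probs:  [T] log pi_theta(a_t | s_t) for the sampled actions
    advantages: [T] advantage estimates A_hat(s_t, a_t)"""
    # Detach the advantages so gradients flow only through log pi_theta.
    return -(log_probs * advantages.detach()).mean()
```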

Efficient and Stable Policy Optimization

On-Policy RL vs. Off-Policy RL

\[ \hat{\nabla}_{\theta}V^{\pi_{\theta}, \gamma}(s_0) =\mathbb{E}_{\color{red}{\pi_{\theta}}}\left[\sum_{t = 0}^{\infty}\nabla_{\theta}\ln\left(\pi_{\theta}(a_t|s_t)\right)\cdot \hat{A}^{\pi_{\theta}, \gamma}(s_t, a_t)\right]. \tag{PGT} \]
vs.
\[ \nabla_{\theta}L(\theta)=\mathbb{E}_{\color{red}{(s,a,s')\sim \text{Replay Buffer}}}\left[\nabla_{\theta}\left\|Q_{\theta}(s,a)-\left[R(s,a,s')+\gamma\max_{a'}Q_{\theta}(s',a')\right]\right\|^2\right]\tag{TD} \]
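
For contrast, a sketch of the (TD) update on replay-buffer samples (the batch layout is an assumption, and the target uses the same $Q_{\theta}$ as in the expression above; practical DQN implementations add a separate target network):

```python
import torch
import torch.nn.functional as F

def td_loss(q_net, batch, gamma=0.99):
    """DQN-style TD loss on an off-policy minibatch from the replay buffer.
    batch: (states, actions, rewards, next_states, dones) tensors,
    with int64 actions and float 0/1 dones."""
    states, actions, rewards, next_states, dones = batch
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # treat the bootstrap target as a constant
        target = rewards + gamma * (1 - dones) * q_net(next_states).max(dim=1).values
    return F.mse_loss(q_sa, target)
```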

Make Better Use of Samples for On-Policy RL

Trust-region Method

\[ \begin{aligned} &\underset{x}{\text{minimize}}&&f(x)\\ \end{aligned} \]
  • Iterate:
    • Solve a constrained sub-problem \[ \begin{aligned} & \underset{x}{\text{minimize}} &&\tilde{f}_{x_k}(x)\\ &\text{subject to}&&D(x,x_k)\le \epsilon \end{aligned} \]
    • $x_{k+1}=x; k\leftarrow k+1$
$\tilde{f}_{x_k}(x)$ is a local approximation of $f$ near $x_k$ (e.g., a linear or quadratic Taylor expansion), and $D(x, x_k)\le \epsilon$ restricts the next step $x$ to a local region around $x_k$.
https://optimization.mccormick.northwestern.edu/index.php/Trust-region_methods
We compute the gradient (and Hessian) of $f$ only once, yet we can use them to update $x$ for multiple steps safely!
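
As a toy illustration of the sub-problem (not from the slides), here is one trust-region iteration that minimizes a quadratic model of $f$ around $x_k$ inside a ball of radius $\epsilon$, using the simple Cauchy-point step:

```python
import numpy as np

def trust_region_step(grad, hess, radius):
    """Approximately minimize the quadratic model
        m(p) = grad^T p + 0.5 * p^T hess p,  subject to ||p|| <= radius,
    along the steepest-descent direction (the Cauchy point)."""
    g_norm = np.linalg.norm(grad)
    if g_norm == 0.0:
        return np.zeros_like(grad)
    gHg = grad @ hess @ grad
    if gHg <= 0.0:
        tau = 1.0  # the model is non-convex along -grad: step to the boundary
    else:
        tau = min(g_norm**3 / (radius * gHg), 1.0)
    return -tau * (radius / g_norm) * grad

# One outer iteration: x_next = x_k + trust_region_step(grad_f(x_k), hess_f(x_k), eps)
```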

Basic Framework of Trust-region Policy Optimization (e.g., TRPO/PPO)

TRPO/PPO repeat the following procedure:

Trust-region Method for Policy Optimization

The local constrained subproblem in TRPO: \[ \begin{aligned} & \underset{\theta}{\text{maximize}} && \mathbb{E}_{\pi_{\theta_k}}\sum_{t=0}^{\infty} \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_k}(a_t|s_t)}A^{\pi_{\theta_k}, \gamma}(s_t,a_t)\\ & \text{subject to} && \mathbb{E}_{\pi_{\theta_k}}[\mathrm{KL}(\pi_{\theta_k}(s,\cdot)\|\pi_{\theta}(s,\cdot))]\le \delta \end{aligned} \] The constraint ensures that $\pi_{\theta}$ does not deviate too much from $\pi_{\theta_k}$.
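
A sketch of the two quantities in this sub-problem, estimated on a batch of rollouts collected with $\pi_{\theta_k}$ (the tensor names and shapes are assumptions for illustration):

```python
import torch

def surrogate_and_kl(new_log_probs, old_log_probs, advantages):
    """Monte Carlo estimates of the TRPO objective and KL constraint.
    new_log_probs: [T] log pi_theta(a_t|s_t)    (current parameters theta)
    old_log_probs: [T] log pi_theta_k(a_t|s_t)  (fixed, from the rollout policy)
    advantages:    [T] A^{pi_theta_k, gamma}(s_t, a_t), treated as constants"""
    ratio = torch.exp(new_log_probs - old_log_probs)  # pi_theta / pi_theta_k
    surrogate = (ratio * advantages.detach()).mean()  # maximize this
    # Crude sample estimate of E[KL(pi_theta_k || pi_theta)]; keep it <= delta.
    approx_kl = (old_log_probs - new_log_probs).mean()
    return surrogate, approx_kl
```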

Constrained Opt. $\rightarrow$ Unconstrained Opt.

Proximal Policy Optimization (PPO)

Exploration

End