
L4: DQN and REINFORCE

Tongzhou Mu

(slides prepared by Hao Su, Shuang Liu, and Tongzhou Mu)

Spring, 2024

Agenda


Review

Optimal Value Function

Optimal Policy

Bellman Optimality Equation

Key Differences between MC and TD

Pros and Cons of MC vs TD (2)

The Anatomy of an RL algorithm

From CS285, taught at UC Berkeley by Prof. Sergey Levine.

Running Q-learning on Maze

Plotted with the tools from Justin Fu's excellent playground for value-based RL.

Deep Q-Learning

Taxonomy of RL Algorithms and Examples

Taxonomy (diagram): RL Algorithms split into Model-Free RL and Model-Based RL. Model-Free RL splits into MC Sampling methods (e.g., REINFORCE) and Bellman-based methods (e.g., Deep Q-Network); Model-Based RL splits into Learn the Model and Given the Model. Highlighted in this part: Deep Q-Network.

Challenge of Representing $Q$

Deep Value Network

Training Deep Q Network

Replay Buffer

Deep Q-Learning Algorithm

Some Engineering Concerns about Deep $Q$-Learning

Some Theoretical Concerns about $Q$-Learning

Convergence of Q-Learning Algorithms

We state the facts without proof:

Unbiased Policy Gradient Estimation (REINFORCE)

Taxonomy of RL Algorithms and Examples

Taxonomy (diagram): RL Algorithms split into Model-Free RL and Model-Based RL. Model-Free RL splits into MC Sampling methods (e.g., REINFORCE) and Bellman-based methods (e.g., Deep Q-Network); Model-Based RL splits into Learn the Model and Given the Model. Highlighted in this part: REINFORCE.

First-Order Policy Optimization

First-Order Policy Optimization

Goal: Derive a way to directly estimate the gradient $\frac{\partial V^{\pi_{\theta}}(s_0)}{\partial \theta}$ from samples.

Policy Gradient Theorem (Undiscounted)

Policy Gradient Theorem (Undiscounted)

Policy Gradient Theorem (Discounted)

We will assume the discounted setting from now on.

Creating an Unbiased Estimate for PG

\[ \nabla_{\theta}V^{\pi_{\theta}, \gamma}(s_0) =\sum_{t = 0}^{\infty}\sum_s\gamma^t\mu_t(s;s_0) \sum_{a}\nabla_{\theta}\pi_{\theta}(a|s) \cdot Q^{\pi_{\theta}, \gamma}(s, a) \]
Let's say we have used $\pi_{\theta}$ to collect a rollout trajectory $\left\{(s_t, a_t, r_t)\right\}_{t = 0}^{\infty}$, where $s_t, a_t, r_t$ are random variables. Note that $\nabla_{\theta} \ln \pi_{\theta}=\frac{\nabla_{\theta} \pi_{\theta}}{\pi_{\theta}}\Rightarrow \nabla_{\theta} \pi_{\theta}=\nabla_{\theta} \ln(\pi_{\theta})\cdot\pi_{\theta}$.
\begin{align*} \nabla_{\theta}V^{\pi_{\theta}, \gamma}(s_0)&= \sum_s\sum_{t = 0}^{\infty}\gamma^t\mu_t(s;s_0)\sum_{a}\nabla_{\theta}\ln\left(\pi_{\theta}(a|s)\right) \cdot \pi_{\theta}(a|s)\, Q^{\pi_{\theta}, \gamma}(s, a)\\ \text{(absorb randomness of env. in $\mu_t$)} &=\mathbb{E}\left[\sum_{t = 0}^{\infty}\gamma^t\sum_{a}\nabla_{\theta}\ln\left(\pi_{\theta}(a|s_t)\right)\cdot\pi_{\theta}(a|s_t)\,Q^{\pi_{\theta}, \gamma}(s_t, a)\right]\\ \text{(absorb randomness of action in $\pi_{\theta}$)}&=\mathbb{E}\left[\sum_{t = 0}^{\infty}\gamma^t\nabla_{\theta}\ln\left(\pi_{\theta}(a_t|s_t)\right)\cdot Q^{\pi_{\theta}, \gamma}(s_t, a_t)\right]\\ &=\mathbb{E}\left[\sum_{t = 0}^{\infty}\gamma^t\nabla_{\theta}\ln\left(\pi_{\theta}(a_t|s_t)\right)\cdot \sum_{i = t}^{\infty} \gamma^{i - t}\cdot r_i\right] \end{align*}

Creating an Unbiased Estimate for PG (Cont'd)

We have shown that \begin{align*} \nabla_{\theta}V^{\pi_{\theta}, \gamma}(s_0)=\mathbb{E}\left[\sum_{t = 0}^{\infty}\gamma^t\nabla_{\theta}\ln\left(\pi_{\theta}(a_t|s_t)\right)\cdot \sum_{i = t}^{\infty} \gamma^{i - t}\cdot r_i\right] \end{align*}
We have thus established an MC-sampling-based method to estimate the gradient of the value function w.r.t. the policy parameters, and this estimate is unbiased.
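In practice the expectation is approximated by an empirical average over $N$ sampled trajectories. As a worked form of the estimator (our notation; rollouts assumed truncated at a finite horizon $T_n$):
\[ \nabla_{\theta}V^{\pi_{\theta}, \gamma}(s_0) \approx \frac{1}{N}\sum_{n = 1}^{N}\sum_{t = 0}^{T_n}\gamma^t\,\nabla_{\theta}\ln\left(\pi_{\theta}(a^{(n)}_t|s^{(n)}_t)\right)\cdot \sum_{i = t}^{T_n} \gamma^{i - t}\cdot r^{(n)}_i \]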

REINFORCE Algorithm

The steps involved in implementing REINFORCE are as follows (a minimal code sketch is given after the list):
  1. Randomly initialize a policy network that takes the state as input and returns a probability distribution over actions.
  2. Use the policy to play $n$ episodes of the game, recording $(s,a,s',r)$ for each step.
  3. Working backwards through each episode, calculate the discounted return of every step, $G_t=\sum_{i=t}^{\infty}\gamma^{i-t}r_i$.
  4. Adjust the weights of the policy network according to the gradient given by the policy gradient theorem.
  5. Repeat from step 2.
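Below is a minimal sketch following these steps. It is our own illustration (not the official course code) and assumes gymnasium's CartPole-v1 and PyTorch; names such as `policy_net` are ours. Following common practice, it drops the $\gamma^t$ weighting discussed in the next slide.

```python
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

# Step 1: policy network mapping state -> action logits.
policy_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
gamma = 0.99

for episode in range(500):
    # Step 2: play one episode with the current policy, recording data.
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        logits = policy_net(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        obs, reward, terminated, truncated, _ = env.step(action.item())
        log_probs.append(dist.log_prob(action))
        rewards.append(float(reward))
        done = terminated or truncated

    # Step 3: compute returns G_t backwards through the episode.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    returns = torch.as_tensor(returns)

    # Step 4: ascend the policy-gradient objective (gamma^t factor dropped).
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```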

Difference between Theory and Practice

According to what we derived from the Policy Gradient Theorem (discounted):
\begin{align*} \nabla_{\theta}V^{\pi_{\theta}, \gamma}(s_0)=\mathbb{E}\left[\sum_{t = 0}^{\infty}\gamma^t\nabla_{\theta}\ln\left(\pi_{\theta}(a_t|s_t)\right)\cdot \sum_{i = t}^{\infty} \gamma^{i - t}\cdot r_i\right] \end{align*}
However, practical REINFORCE implementations typically drop the $\gamma^t$ factor and use
\begin{align*} \nabla_{\theta}V^{\pi_{\theta}, \gamma}(s_0)\approx\mathbb{E}\left[\sum_{t = 0}^{\infty}\nabla_{\theta}\ln\left(\pi_{\theta}(a_t|s_t)\right)\cdot \sum_{i = t}^{\infty} \gamma^{i - t}\cdot r_i\right] \end{align*}
A small code sketch contrasting the two estimators is shown below. See the following articles for further discussion of this discrepancy.
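The contrast can be made concrete with a toy comparison. This is our own sketch (not from the course code), using PyTorch with placeholder tensors standing in for quantities computed from an actual rollout.

```python
import torch

# Placeholder per-step quantities for one rollout of length T:
# log_probs[t] stands in for log pi_theta(a_t | s_t),
# returns[t] stands in for G_t = sum_{i >= t} gamma^(i-t) * r_i.
T, gamma = 5, 0.99
log_probs = torch.randn(T, requires_grad=True)
returns = torch.rand(T)
discounts = torch.tensor([gamma ** t for t in range(T)])

# Estimator matching the discounted policy gradient theorem (keeps gamma^t).
loss_theory = -(discounts * log_probs * returns).sum()

# Estimator used in most practical implementations (gamma^t dropped).
loss_practice = -(log_probs * returns).sum()
```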
Read by Yourself
End