L3: Framework of Reinforcement Learning (II)

Tongzhou Mu

(slides prepared by Hao Su and Tongzhou Mu)

Spring, 2024

Contents are based on Reinforcement Learning: An Introduction by Prof. Richard S. Sutton and Prof. Andrew G. Barto, and on COMPM050/COMPGI13 taught at UCL by Prof. David Silver.

Agenda

Review

Agent-Environment Interface

Probabilistic Description of Environment: Markov Decision Processes
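As a brief recap in standard notation (not spelled out on this slide), an MDP is specified by a tuple

\[ \langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle, \qquad P(s'|s,a) = \Pr[S_{t+1}=s' \mid S_t=s, A_t=a], \]

where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $R$ the reward function, and $\gamma \in [0,1]$ the discount factor.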

Policy
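As a recap, a (stochastic) policy maps each state to a distribution over actions, written $\pi(a|s)$ as in the policy-evaluation slides later in this deck:

\[ \pi(a|s) = \Pr[A_t = a \mid S_t = s]. \]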

Maze Example: Policy

  • Arrows represent the policy $\pi(s)$ for each state $s$
  • This is the optimal policy for this Maze MDP

Value Function
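For reference, the state-value function of a policy $\pi$ is the expected discounted return starting from state $s$ (this is the quantity estimated in the policy-evaluation slides below):

\[ V_{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s \right]. \]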

Maze Example: Value Function

  • Numbers represent the value $V_\pi(s)$ of each state $s$
  • This is the value function corresponding to the optimal policy shown previously

Optimal Policy and Optimal Value Function

Optimal Value Function

Optimal Policy
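In standard notation (a recap, not reproduced from the slide), the optimal value function is the best achievable value at every state, and an optimal policy is any policy that attains it:

\[ V_{*}(s) = \max_{\pi} V_{\pi}(s), \qquad \pi_{*} \in \arg\max_{\pi} V_{\pi}(s) \ \ \text{for all } s. \]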

Bellman Optimality Equation
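Written out in the same notation as the TD formula below, the Bellman optimality equation characterizes $V_*$ recursively:

\[ V_{*}(s) = \max_{a} \mathbb{E}\left[ R_{t+1} + \gamma V_{*}(S_{t+1}) \mid S_t = s, A_t = a \right]. \]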

Solving the Bellman Optimality Equation
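When the model is known, one standard way to solve it is value iteration, which repeatedly applies the Bellman optimality backup until convergence. A minimal sketch, assuming the dynamics are given as P[s][a] = list of (prob, next_state, reward) tuples (a hypothetical representation, not from the slides):

import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.99, tol=1e-8):
    """Repeatedly apply the Bellman optimality backup until V stops changing."""
    V = np.zeros(n_states)
    while True:
        V_new = np.empty(n_states)
        for s in range(n_states):
            # V(s) <- max_a E[ r + gamma * V(s') ]
            V_new[s] = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in range(n_actions)
            )
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new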

Estimating Value Function for a Given Policy

Goal: Given a policy $\pi(a|s)$, estimate its value function $V_{\pi}$.

Monte-Carlo Policy Evaluation

Monte-Carlo Policy Evaluation

Monte-Carlo Methods
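A minimal every-visit Monte-Carlo policy evaluation sketch: roll out full episodes under $\pi$ and average the sampled returns per state. It assumes a hypothetical env whose reset() returns a hashable state and whose step(a) returns (next_state, reward, done), and a policy(s) that returns action probabilities (names are illustrative, not from the slides):

import numpy as np
from collections import defaultdict

def mc_policy_evaluation(env, policy, n_episodes=1000, gamma=0.99):
    """Every-visit MC: estimate V_pi(s) as the average of sampled returns."""
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for _ in range(n_episodes):
        # Roll out one complete episode under the given policy.
        episode = []
        s, done = env.reset(), False
        while not done:
            a = np.random.choice(len(policy(s)), p=policy(s))
            s_next, r, done = env.step(a)
            episode.append((s, r))
            s = s_next
        # Walk backwards so the discounted return G_t accumulates in one pass.
        G = 0.0
        for s, r in reversed(episode):
            G = r + gamma * G
            returns_sum[s] += G
            returns_cnt[s] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}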

Temporal-Difference Learning

Temporal-Difference Learning

\[ V_{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1} + \gamma V_{\pi}(S_{t+1}) \mid S_t = s] \]
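TD(0) turns this expectation into an incremental update: after every step, move $V(S_t)$ toward the bootstrapped target $R_{t+1} + \gamma V(S_{t+1})$. A minimal sketch under the same hypothetical env/policy interface as the MC sketch above (the learning rate alpha is an added parameter):

import numpy as np
from collections import defaultdict

def td0_policy_evaluation(env, policy, n_episodes=1000, alpha=0.1, gamma=0.99):
    """TD(0): update V(s) toward the bootstrapped target r + gamma * V(s')."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = np.random.choice(len(policy(s)), p=policy(s))
            s_next, r, done = env.step(a)
            target = r + gamma * (0.0 if done else V[s_next])
            V[s] += alpha * (target - V[s])   # alpha times the TD error
            s = s_next
    return dict(V)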

Temporal-Difference Learning

Key Differences between MC and TD

Pros and Cons of MC vs. TD

Pros and Cons of MC vs. TD

Bias/Variance Trade-Off
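Concretely, both methods regress $V(S_t)$ toward a target, but the targets differ: the MC target is the full sampled return (unbiased, high variance), while the TD target bootstraps from the current estimate (biased, much lower variance):

\[ \text{MC target: } G_t = R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots, \qquad \text{TD target: } R_{t+1} + \gamma V(S_{t+1}). \]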

Pros and Cons of MC vs. TD (2)

Bellman, or no Bellman, that is the question

We start from REINFORCE and Deep Q-Learning

Taxonomy of RL Algorithms and Examples

  • RL Algorithms
    • Model-Free RL
      • MC Sampling (example: REINFORCE)
      • Bellman based (example: Deep Q-Network)
    • Model-Based RL
      • Learn the Model
      • Given the Model

(The model-free branch and its examples are highlighted in the original diagram.)

Q-Learning for Tabular RL

Tabular RL: RL with a discrete, finite state space; convenient for algorithm development and convergence analysis
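In code, "tabular" simply means the value estimates fit in an explicit table with one entry per state (or state-action pair); a tiny illustration with made-up sizes:

import numpy as np

n_states, n_actions = 16, 4          # e.g. a small grid-world maze (illustrative sizes)
Q = np.zeros((n_states, n_actions))  # one Q-value per (state, action) entry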

The Anatomy of an RL algorithm

Slide based on CS285, taught at UC Berkeley by Prof. Sergey Levine.

Failure Example

  • Initialize Q
    • $Q(s_0,a_1)=0,Q(s_0,a_2)=0$
    • $\pi(s_0)=a_1$
  • Iteration 1: take $a_1$ and update $Q$
    • $Q(s_0,a_1)=1,Q(s_0,a_2)=0$
    • $\pi(s_0)=a_1$
  • Iteration 2: take $a_1$ and update $Q$
    • $Q(s_0,a_1)=1,Q(s_0,a_2)=0$
    • $\pi(s_0)=a_1$
  • ...
  • $Q$ stops improving because the agent is too greedy! (See the sketch right below.)
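A small simulation of this failure mode, assuming for illustration that $a_1$ always yields reward 1 while the never-tried $a_2$ would yield reward 2 (the reward for $a_2$ is made up; the slide only specifies the $Q$ values shown above):

import numpy as np

rewards = {0: 1.0, 1: 2.0}   # action 0 = a_1 (reward 1), action 1 = a_2 (reward 2, assumed)
Q = np.zeros(2)              # Q(s_0, a_1), Q(s_0, a_2)
alpha = 1.0

for it in range(5):
    a = int(np.argmax(Q))                 # purely greedy selection: always picks a_1
    Q[a] += alpha * (rewards[a] - Q[a])   # one-step update toward the observed reward
    print(f"iteration {it + 1}: took a{a + 1}, Q = {Q}")

# Q[1] stays 0 forever: a_2 is never tried, so the better action is never discovered.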

$\epsilon$-Greedy Exploration
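The standard fix is $\epsilon$-greedy action selection: with probability $\epsilon$ take a uniformly random action, otherwise act greedily with respect to the current Q. A minimal sketch:

import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """q_values: array of Q-values for the current state. Returns an action index."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))   # explore: uniform random action
    return int(np.argmax(q_values))               # exploit: current best action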

Exploration vs Exploitation

Q-Learning Algorithm
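A minimal tabular Q-learning sketch with $\epsilon$-greedy exploration, assuming integer states/actions and a hypothetical env with reset() and step(a) returning (next_state, reward, done) (the interface is illustrative, not from the slides):

import numpy as np

def q_learning(env, n_states, n_actions, n_episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Off-policy TD control: learn Q while behaving epsilon-greedily."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behavior policy
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Q-learning target bootstraps with a max over next actions (off-policy)
            target = r + gamma * (0.0 if done else np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q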

Running Q-learning on Maze

Plotted with the tools in an awesome playground for value-based RL from Justin Fu
End