
L2: Framework of Reinforcement Learning (I)

Tongzhou Mu

(slides prepared by Hao Su and Tongzhou Mu)

Spring, 2024

Contents are based on Reinforcement Learning: An Introduction by Prof. Richard S. Sutton and Prof. Andrew G. Barto, and on COMPM050/COMPGI13 taught at UCL by Prof. David Silver.

Overview of Unit 1

Today's Agenda


Examples

RL Applications

Control a humanoid in MuJoCo.
https://gym.openai.com/envs/Humanoid-v2/
Play Atari games.
https://gym.openai.com/envs/Enduro-v0/

RL Applications

Learn motor skills for legged robots.
https://www.youtube.com/watch?v=ITfBKjBH46E
Play Go.

Agent-Environment Interface

Agent-Environment Interface

  • At each step \(t\) the agent
    • Executes action \(A_t\)
    • Receives state \(S_t\)
    • Receives scalar reward \(R_t\)
  • The environment
    • Receives action \(A_t\)
    • Emits state \(S_{t+1}\)
    • Emits scalar reward \(R_{t+1}\)
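
To make this interface concrete, below is a minimal sketch of the interaction loop written against a classic OpenAI Gym-style API; the environment name, episode length, and random action choice are placeholder assumptions, not part of the slides.

import gym

# Minimal agent-environment loop (classic Gym API assumed; newer Gymnasium
# versions return (obs, info) from reset and a 5-tuple from step).
env = gym.make("CartPole-v1")           # placeholder environment
state = env.reset()                     # initial state S_0

for t in range(200):
    action = env.action_space.sample()  # stand-in for the agent's policy: A_t
    # The environment receives A_t, then emits S_{t+1} and scalar reward R_{t+1}.
    next_state, reward, done, info = env.step(action)
    state = next_state
    if done:                            # episode terminated
        break
env.close()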

RL: A Sequential Decision Making Problem

Environment Description and Learning Objective

State

Transition

Markov Property
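
For reference, the standard statement (in the notation of Sutton & Barto and Silver's course): the state \(S_t\) is Markov if the future is independent of the past given the present,

\[ \mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t]. \]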

Observation

Reward

Probabilistic Description of Environment: Markov Decision Processes

Probabilistic Description of Environment: Markov Decision Processes
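
As a reminder of the standard definition, a Markov decision process is a tuple \(\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle\): a state space, an action space, a transition model, a reward function, and a discount factor, with

\[ \mathcal{P}_{s,s'}^a = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a], \qquad \mathcal{R}_s^a = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]. \]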

Return
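
The return is the total discounted reward from time-step \(t\) (standard definition, with discount factor \(\gamma \in [0, 1]\)):

\[ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}. \]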

Episode

Learning Objective of RL

Learning Objective of RL
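
In symbols, the objective is to find a policy that maximizes the expected return (standard formulation):

\[ \pi^* = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^t R_{t+1} \right]. \]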

Data Collection in Supervised Learning and Reinforcement Learning

Relationship between Optimal Control and Reinforcement Learning

  • Optimal Control
    • Controller
    • Controlled System
    • Control Signal
    • State
    • Cost
    • Cost-to-go function
  • Reinforcement Learning
    • Agent
    • Environment
    • Action
    • State / Observation
    • Reward
    • Return

Inside an RL Agent

Major Components of an RL Agent

Model
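
A model is the agent's internal prediction of what the environment will do next; writing the learned model with hats to distinguish it from the true MDP dynamics, it typically has two parts, a transition model and a reward model:

\[ \hat{\mathcal{P}}_{s,s'}^a \approx \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a], \qquad \hat{\mathcal{R}}_s^a \approx \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]. \]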

Maze Example

  • States: Agent's location
  • Actions: N, E, S, W, stay
  • Reward: -1 per time-step
  • Termination: Reach goal
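
Below is a minimal code sketch of a maze environment with exactly these ingredients; the grid layout is a made-up placeholder, not the maze shown in the original figure.

# Sketch of a maze MDP: states are grid cells, actions are N/E/S/W/stay,
# reward is -1 per time-step, and the episode terminates at the goal 'G'.
# The layout below is a placeholder, not the maze from the slide.
GRID = [
    "S..",
    ".#.",
    "..G",
]
ACTIONS = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1), "stay": (0, 0)}

def step(state, action):
    """One transition of the maze MDP; returns (next_state, reward, done)."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    # Bumping into a wall '#' or the boundary leaves the agent in place.
    if not (0 <= nr < len(GRID) and 0 <= nc < len(GRID[0])) or GRID[nr][nc] == "#":
        nr, nc = r, c
    done = GRID[nr][nc] == "G"
    return (nr, nc), -1.0, done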

Maze Example: Model

  • Agent may have an internal model of the environment
  • Dynamics: how actions change the state
  • Rewards: how much reward from each state
  • In the right figure:
    • Grid layout represents transition model $\mathcal{P}_{s,s'}^a$
    • Numbers represent immediate reward $\mathcal{R}_s^a$ from each state $s$ (same for all $a$)

Policy
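
For reference, the standard definition: a policy is the agent's behaviour, a map from states to actions, either deterministic, \(a = \pi(s)\), or stochastic,

\[ \pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]. \]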

Maze Example: Policy

  • Arrows represent the policy $\pi(s)$ for each state $s$
  • This is the optimal policy for this Maze MDP

Value Function
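
For reference, the state-value function is the expected return starting from state \(s\) and then following policy \(\pi\):

\[ V_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]. \]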

Bellman Expectation Equation
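
The value function can be decomposed into the immediate reward plus the discounted value of the successor state (the Bellman expectation equation, stated here in its standard form):

\[ V_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma V_\pi(S_{t+1}) \mid S_t = s]. \]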

Maze Example: Value Function

  • Numbers represent value $V_\pi(s)$ of each state $s$
  • This is the value function corresponding to the optimal policy shown previously
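
Values like these can be computed by iterative policy evaluation, which repeatedly applies the Bellman expectation backup. Below is a minimal tabular sketch; the model dictionaries P and R, the deterministic policy, and the state list are hypothetical placeholders the caller would supply.

# Iterative policy evaluation for a tabular MDP with a known model.
# P[s][a] is a list of (probability, next_state) pairs; R[s][a] is the
# immediate reward. Terminal states should be excluded from `states`
# (their value is treated as 0).
def policy_evaluation(states, policy, P, R, gamma=1.0, tol=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]  # deterministic policy, as in the maze example
            # Bellman expectation backup:
            # V(s) <- R(s, a) + gamma * sum_{s'} P(s' | s, a) * V(s')
            v_new = R[s][a] + gamma * sum(p * V.get(s2, 0.0) for p, s2 in P[s][a])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V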

A Taxonomy of RL Algorithms and Examples

End