Your browser doesn't support the features required by impress.js, so you are presented with a simplified version of this presentation.

For the best experience please use the latest Chrome, Safari or Firefox browser.

Exploration

Stone Tao

(slides prepared by Hao Su, Quan Vuong, Tongzhou Mu, Zhiao Huang, and Stone Tao)

Spring, 2024

Some contents are based on Bandit Algorithms from Dr. Tor Lattimore and Prof. Csaba Szepesvári, and COMPM050/COMPGI13 taught at UCL by Prof. David Silver.

Agenda

click to jump to the section.

Exploration versus Exploitation

Motivation to Explore vs Exploit

Why Exploration is Difficult

Exploration to Escape Local Minima in Reward

Knowing what to explore is critical

Balancing Exploration and Exploitation

Multi-Armed Bandits

Multi-Armed Bandits

Multi-Armed Bandits

Regret

Regret (Formalized)

Total Regret Decomposition

Total Regret Decomposition

\[ \begin{aligned} L_{T} =\sum_{a \in \mathcal{A}} \mathbb{E}_{a\sim \pi}\left[N_{T}(a)\right] \Delta_{a} \end{aligned} \]

Desirable Total Regret Behavior

Greedy and $\epsilon$-Greedy

Decaying $\epsilon$-Greedy Algorithm

The Principle of Optimism in the Face of Uncertainty

Upper Confidence Bound

Deriving UCB1 Algorithm

Deriving UCB1 Algorithm

Deriving UCB1 Algorithm

\[\mathbb{P}\left[Q(a)>\hat{Q}_{t}(a) + U_{t}(a)\right] \leq e^{-2 N_{t}(a) U_{t}(a)^{2}}\]

Deriving UCB1 Algorithm

Theoretical Properties of UCB1 Algorithm

Intrinsic Rewards

The Perspective of Intrinsic Rewards

Count Based Exploration

Issues with Count Based Exploration

Counting via Hashing

Counting via Hashing: SimHash

Counting via Hashing: Autoencoders

Counting via Density Estimation

Density Estimation over State Space

Density Estimation over State Space

Quick Refresher on GMM

Gif from https://brilliant.org/wiki/gaussian-mixture-model/, which is also a easy resource to learn how EM works.

Computing Pseudo-Count from Density Model

We need to estimate a "pseudo-count" at any state $s$ using a density model $\rho_t(s)$ from visited states $s_{1:t}$. The purpose of the work below is to better "regulate" the density model and make it possible to get pseudo-counts from any density model.

Computing Pseudo-Count from Density Model

Pseudo-Count Performance

Intrinsic Rewards

Leveraging Model Predicton Errors for Novelty

Random Network Distillation (RND)

RND Intuition

Random Network Distillation Performance

Random Network Distillation Performance

Summary of Terms