
Exploration

Hao Su

(slides prepared in tandem with Quan Vuong, Tongzhou Mu and Zhiao Huang)

Spring 2021

Some content is based on Bandit Algorithms by Dr. Tor Lattimore and Prof. Csaba Szepesvári, and on COMPM050/COMPGI13 taught at UCL by Prof. David Silver.

Agenda

Exploration versus Exploitation

Exploration is Difficult

Balancing Exploration and Exploitation

Multi-Armed Bandits

Examples from Lecture 13

  • We can remove the state space $\{s_0, s_1, s_2\}$, since the action does not need to be conditioned on the state.
  • The MDP is now defined by the action space $\{a_1, a_2\}$ and the reward function $\mathcal{R}$:
    • $\mathcal{R}(a_1) = 1$
    • $\mathcal{R}(a_2) = 2$
  • Such a one-step MDP has a special name: "Multi-Armed Bandit" (a minimal simulation sketch follows this list).
  • It is extensively studied and widely deployed in industry.
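
As a toy illustration of the example above, a two-armed bandit can be simulated in a few lines. The Gaussian noise and arm names are assumptions for the sketch; only the arm means come from the example:

```python
import numpy as np

rng = np.random.default_rng(0)
ARM_MEANS = {"a1": 1.0, "a2": 2.0}  # means match the example above

def pull(arm: str) -> float:
    """Sample a noisy reward for the chosen arm (Gaussian noise assumed)."""
    return ARM_MEANS[arm] + rng.normal(0.0, 1.0)

# Pulling a2 forever is optimal, but the agent does not know the means
# in advance -- which is exactly what makes exploration necessary.
print(pull("a1"), pull("a2"))
```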

Multi-Armed Bandits

Regret

Total Regret Decomposition

\[ L_{T}=\sum_{a \in \mathcal{A}} \mathbb{E}\left[N_{T}(a)\right] \Delta_{a} \]
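
Here $N_{T}(a)$ is the number of times action $a$ is selected in $T$ rounds, and $\Delta_{a} = Q(a^{\star}) - Q(a)$ is the gap between the optimal action value and that of action $a$. As a sketch (the uniform policy and Gaussian rewards are assumptions, not from the slides), the decomposition can be checked numerically on the two-armed example:

```python
import numpy as np

Q = {"a1": 1.0, "a2": 2.0}                  # true action values
best = max(Q.values())
gaps = {a: best - q for a, q in Q.items()}  # Delta_a

T = 10_000
rng = np.random.default_rng(0)
counts = {a: 0 for a in Q}
for _ in range(T):
    counts[str(rng.choice(list(Q)))] += 1   # uniformly random pulls

# L_T = sum_a E[N_T(a)] * Delta_a (single-run estimate of E[N_T(a)])
L_T = sum(counts[a] * gaps[a] for a in Q)
print(L_T)  # roughly (T / 2) * 1.0: a random policy suffers linear regret
```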

Desirable Total Regret Behavior

Greedy and $\epsilon$-Greedy

Decaying $\epsilon$-Greedy Algorithm
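
A minimal sketch of decaying $\epsilon$-greedy on a $K$-armed bandit. The schedule $\epsilon_t = \min(1, c/t)$ is one common choice and an assumption here; with a schedule tuned using knowledge of the gaps, logarithmic regret is achievable:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([1.0, 2.0])
K, T, c = len(true_means), 10_000, 5.0      # c is an assumed constant

Q_hat = np.zeros(K)   # empirical mean reward per arm
N = np.zeros(K)       # pull counts

for t in range(1, T + 1):
    eps = min(1.0, c / t)                   # decaying exploration rate
    if rng.random() < eps:
        a = int(rng.integers(K))            # explore: uniform random arm
    else:
        a = int(np.argmax(Q_hat))           # exploit: current best estimate
    r = true_means[a] + rng.normal()
    N[a] += 1
    Q_hat[a] += (r - Q_hat[a]) / N[a]       # incremental mean update

print(N, Q_hat)
```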

The Principle of Optimism in the Face of Uncertainty

Upper Confidence Bound

Deriving UCB1 Algorithm

\[\mathbb{P}\left[Q(a)>\hat{Q}_{t}(a) + U_{t}(a)\right] \leq e^{-2 N_{t}(a) U_{t}(a)^{2}}\]
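
Setting the right-hand side to a failure probability $p$ and solving gives $U_{t}(a)=\sqrt{\frac{\log (1 / p)}{2 N_{t}(a)}}$; choosing, e.g., $p=t^{-4}$ yields the UCB1 bonus $U_{t}(a)=\sqrt{\frac{2 \log t}{N_{t}(a)}}$. A minimal sketch of the resulting algorithm, selecting $a_{t}=\arg\max_{a} \hat{Q}_{t}(a)+U_{t}(a)$ (the two-armed bandit setup is assumed as before):

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([1.0, 2.0])
K, T = len(true_means), 10_000

Q_hat = np.zeros(K)
N = np.zeros(K)

for t in range(1, T + 1):
    if t <= K:
        a = t - 1                             # pull each arm once first
    else:
        bonus = np.sqrt(2.0 * np.log(t) / N)  # UCB1 exploration bonus
        a = int(np.argmax(Q_hat + bonus))     # optimism under uncertainty
    r = true_means[a] + rng.normal()
    N[a] += 1
    Q_hat[a] += (r - Q_hat[a]) / N[a]

print(N)  # the optimal arm should dominate the pull counts
```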

Theoretical Properties of UCB1 Algorithm

Limitation of Exact Count-based Methods

As an Aside: Density Estimation Problem

Density Estimation over State Space

Computing Pseudo-Count from Density Model

We introduce a heuristic ("hack") to estimate a "pseudo-count" for any state $s$ using a density model $\rho_t(s)$ fit to the visited states $s_{1:t}$:
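
Following the cited paper (Bellemare et al., below), let $\rho_{t}(s)$ be the model's probability of $s$ after training on $s_{1:t}$, and let $\rho_{t}^{\prime}(s)$ be its recoding probability, i.e., the probability it would assign to $s$ after training on one more occurrence of $s$. Equating these to the empirical frequencies $\hat{N}_{t}(s) / \hat{n}$ and $(\hat{N}_{t}(s)+1) /(\hat{n}+1)$ and solving for $\hat{N}_{t}(s)$ gives

\[ \hat{N}_{t}(s)=\frac{\rho_{t}(s)\left(1-\rho_{t}^{\prime}(s)\right)}{\rho_{t}^{\prime}(s)-\rho_{t}(s)} \]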

Unifying Count-Based Exploration and Intrinsic Motivation. Bellemare et al.

Pseudo-Count Performance

Intrinsic Rewards

The Perspective of Intrinsic Rewards

Adding another term to supplement the MDP reward is a common technique. The added term is sometimes referred to as an "intrinsic reward".
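
A typical form is $r_{t}=r_{t}^{\text{ext}}+\beta\, r_{t}^{\text{int}}$. The sketch below uses a count-based bonus $r^{\text{int}}=\beta / \sqrt{\hat{N}(s)}$, one common instantiation in the spirit of the pseudo-count work above; $\beta$ and the exact bonus form are assumptions:

```python
import numpy as np

def shaped_reward(r_ext: float, pseudo_count: float, beta: float = 0.1) -> float:
    """Supplement the MDP reward with a count-based novelty bonus."""
    r_int = beta / np.sqrt(pseudo_count + 1e-8)  # large for rarely seen states
    return r_ext + r_int

print(shaped_reward(0.0, pseudo_count=1.0))   # novel state: sizable bonus
print(shaped_reward(0.0, pseudo_count=1e4))   # familiar state: tiny bonus
```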

Novelty-driven Exploration without Estimating Pseudo-count

Random Network Distillation

Exploration by Random Network Distillation. Burda, Edwards, Storkey, Klimov.
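
The idea: a randomly initialized target network is held fixed, a predictor network is trained to match its outputs on visited states, and the prediction error serves as the intrinsic reward; it is large for novel states and shrinks as states become familiar. A minimal sketch with illustrative (assumed) network sizes:

```python
import torch
import torch.nn as nn

obs_dim, feat_dim = 16, 32
target = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
predictor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
for p in target.parameters():
    p.requires_grad_(False)                 # target stays at its random init
opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def intrinsic_reward(obs: torch.Tensor) -> torch.Tensor:
    """Prediction error per state; doubles as the predictor's training loss."""
    err = (predictor(obs) - target(obs)).pow(2).mean(dim=-1)
    opt.zero_grad()
    err.mean().backward()                   # train predictor on this batch
    opt.step()                              # so familiar states lose their bonus
    return err.detach()

print(intrinsic_reward(torch.randn(4, obs_dim)))
```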

Random Network Distillation Performance

End