*This is an additional note to my previous note. Also see the new paper list on IDS, and the note on Andreas Krause’s IMSI talk.*

# 1. High-level Idea

### Decoupling *Exploitation* and *Exploration*

\(\pi_{t}^{I D S}=\arg \min _{\pi \in \mathcal{D}(\mathcal{A})}\left\{\Psi_{t}(\pi):=\frac{\Delta_{t}(\pi)^{2}}{I_{t}(\pi)}\right\}\)

As a decision-making policy, Information-Directed Sampling (IDS) is featured by its decoupling of exploitation and exploration in optimization:

**Exploitation**is governed by**immediate regret**$\Delta_{t}(\pi)$.**Exploration**is governed by**mutual information**(i.e., information gain) $I_{t}(\pi)$.

### Exploration: Decoupling `Epistemic`

and `Aleatoric`

Uncertainty

Specifically in **exploration**, IDS is more efficient than other approaches (e.g., UCB, Thompson Sampling). The reason is that, IDS only concerns about the **informative uncertainty**:

`Epistemic uncertainty`

: $P\left(\mathcal{E} \in \cdot \mid H_{t}\right)$, i.e., uncertainty of the environment given your observed history.`Aleatoric uncertainty`

: $P\left(O_{t+1}=\cdot \mid \mathcal{E}, H_{t}, A_{t}\right)$, i.e., uncertainty of the next observation conditioned on the environment.- Notice that we consider only
`epistemic uncertainty`

as the informative part of uncertainty, as it allows us to learn about the environment.

### Mutual Information: Two Perspectives

As summarized in the paper list, the two lines of work describes the exploration phase of IDS by interpreting mutual information differently:

- Van Roy’s line: $I_{t}(\mathbf{x})$ represents the
**reduction of uncertainty**on the**posterior distribution of best arm**$P\left(a^{*}=a \mid \mathcal{F}_{t-1}\right)$. - Krause’s line: $I_{t}(\mathbf{x})$ represents the
**ratio of epistemic and aleatoric uncertainty**. The key idea here is that not uncertainty $\neq$ informativeness, i.e., only the environment-related uncertainty is informative.

Either perspective, the key spirit is the same: we should get only information *relevant* to the environment.

# 2. Problem Statement

### Setting

**Action set**: $\mathcal{A}$.- An agent chooses an arm $a_{t}$ at time $t \in[1, T]$.

**Reward distribution**: $p_{a}$.- Reward $r_{a}$ are drawn from the reward distribution $p_{a}$ of the arm $a$, and the estimated reward distribution at each time $t$ is denoted as $\hat{p}_{a,t}$.

**Optimal arm**: $a^{*}=\arg \max*{a \in \mathcal{A}} \mathbb{E}*{r_{a} \sim p_{a}}\left[r_{a}\right]$.**Belief (posterior) on best arm**: $\alpha_{t}(a)=P\left(a^{*}=a \mid \mathcal{F}_{t-1}\right)$.- $\mathcal{F}_{t-1}$ is the past history including chosen arms and their observed rewards.

**Policy**: $\pi$.- This would be constructed based on the posterior $\alpha_{t}$.

### Regret

\(\mathbb{E}[\operatorname{Regret}(T)]=\underset{r_{a}^{*} \sim p_{a^{*}}}{\mathbb{E}} \sum_{t=1}^{T} r_{a^{*}}-\underset{a \sim \pi \atop r_{a, t} \sim p_{a}}{\mathbb{E}} \sum_{t=1}^{T} r_{a, t}\)

`Immediate Regret`

$\Delta_{t}(a)$

\(\Delta_{t}(a)=\underset{a^{*} \sim \alpha_{t} \atop r_{a^{*}, t} \sim \hat{p}_{a^{*}, t}}{\mathbb{E}}\left[r_{a^{*}, t} \mid \mathcal{F}_{t-1}\right]-\underset{r_{a, t} \sim \hat{p}_{a, t}}{\mathbb{E}}\left[r_{a, t} \mid \mathcal{F}_{t-1}\right]\)

`Information Gain`

$I_{t}(x)$

\(\begin{aligned} I_{t}(a) &=I\left(a_{t}^{*}, r_{a, t}\right) \\ &=\underset{r_{i} \sim \hat{p}_{a,t}}\mathbb{E}\left[H\left(a_{t}^{*}\right)-H\left(a_{t+1}^{*}\right) \mid \mathcal{F}_{t-1}, a_{t}=a, r_{a, t}=r_{i}\right] \end{aligned}\)

### Optimization

\(\pi_{t}^{I D S}=\arg \min _{\pi \in \mathcal{D}(\mathcal{A})}\left\{\Psi_{t}(\pi):=\frac{\Delta_{t}(\pi)^{2}}{I_{t}(\pi)}\right\}\)

# 3. Key Theoretical Conclusions

Fix a deterministic $\lambda \in \mathbb{R}$ and a policy $\pi=\left(\pi_{1}, \pi_{2}, \ldots\right)$ such that $\Psi_{t}\left(\pi_{t}\right) \leq \lambda$ almost surely for each $t \in{1, . ., T} .$ Then: \(\begin{aligned} \mathbb{E}(\operatorname{Regret}(T, \pi)) &=\mathbb{E} \sum_{t=1}^{T} \Delta_{t}(\pi) \\ & \leq \sqrt{\lambda} \mathbb{E} \sum_{t=1}^{T} \sqrt{g_{t}(\pi)} \\ & \leq \sqrt{\lambda T} \sqrt{\mathbb{E} \sum_{t=1}^{T} g_{t}(\pi)} \ \ \text { (Caushy-Schwardsz inequality) } \\ & \leq \sqrt{\lambda H\left(\alpha_{1}\right) T}. \end{aligned}\)

# 4. Connections to Combinatorial Bandits

Suppose $\mathcal{A} \subset{a \subset{0,1, \ldots, d}:|a| \leq m}$, and that there are random variables $\left(X_{t, i}: t \in \mathbb{N}, i \in{1, \ldots, d}\right)$ such that
\(Y_{t, a}=\left(X_{t, i}: i \in a\right) \quad \text { and } \quad R_{t, a}=\frac{1}{m} \sum_{i \in a} X_{t, i}.\)
Assume that the random variables $\left{X_{t, i}: i \in{1, \ldots, d}\right}$ are independent conditioned on $\mathcal{F}*{t}$ and $X*{t, i} \in\left[\frac{-1}{2}, \frac{1}{2}\right]$ almost surely for each $(t, i) .$ Then for all $t \in \mathbb{N}, \Psi_{t}\left(\pi_{t}^{\mathrm{IDS}}\right) \leq \frac{d}{2 m^{2}}$ almost surely. Thus:
\(\mathbb{E}[\operatorname{Regret}(T, \pi)] \leq \sqrt{\frac{d}{2 m^{2}} H\left(\alpha_{1}\right) T}.\)
We could further prove that the lower bound of this problem is of order $\sqrt{\frac{d}{m} T}$, so the bound is order optimal to a $\sqrt{\log \left(\frac{d}{m}\right)}$ factor.

# 5. Can We Do Better: Adaptive IDS?

(To be updated.) An extension could be exploration inside each inner loop.