#StackBounty: #confidence-interval #multiarmed-bandit Analysing Regret for Multi Armed Bandits

Bounty: 50

I am referring to the paper [2012] [Regret Analysis of Stochastic and NonStochastic MAB]1 to understand regret analysis of stochastic multi-armed bandits. In the paper, in Section 2.2, it says the following.

It is assumed that the distribution of rewards $X$ satisfy the following moment conditions. There exists a convex function $psi$ on the reals such that, for all $lambda geq 0$,

Equation 1:
begin{equation}
ln mathbb{E} e^{lambda(X- mathbb{E}[X])} leq psi(lambda) text{ and } ln mathbb{E} e^{lambda(mathbb{E}[X]-X)} leq psi(lambda)
end{equation}.

They use the above to construct an upper bound estimate on the mean of each arm at some fixed confidence level, and the MAB strategy is to choose the arm that looks best under this estimate. To do so, they use the Legendre-Fenchel transform $psi$, defined by

begin{equation}
psi^*(epsilon) = sup_{lambda in mathbb{R}} (lambda epsilon – psi(lambda))
end{equation}.

Question 1: Why are the authors using this transformation? Can someone explain what does $psi^*(epsilon)$ represent really?

Now, let $hat{mu_{i,s}}$ be the sample mean of rewards obtained by pulling arm $i$ for $s$ times. Note that since the rewards are i.i.d in stochastic MAB setting, we have that distribution $hat{mu}{i,s}$ is equal to $frac{1}{s} sum{t=1}^{s} X_{i,t}$

Next, the authors say, using Markov’s inequality, from Equation 1, one can obtain that

Equation 2:
begin{equation}
P(mu_i – hat{mu}_{i,s} > epsilon) leq e^{-spsi^*(epsilon)}
end{equation}

Question 2: How can I derive Equation 2 from Equation 1? Following is my attempt. Please show me how to proceed.
begin{equation}
ln mathbb{E} e^{lambda(X- mathbb{E}[X])} leq psi(lambda) text{ (from Eq 1)}\
mathbb{E} e^{lambda(X- mathbb{E}[X])} leq e^{psi(lambda)}\
frac{P(X – mathbb{E[X]} < epsilon)}{epsilon} leq e^{psi(lambda)}
end{equation}

Furthermore, authors say that in other words Equation 2 is, with probability at least $1-delta$,

Equation 3:
begin{equation}
hat{mu_{i,s}} + (psi^*)^{-1}(frac{1}{s} ln frac{1}{delta}) > mu_i
end{equation}

Question 3: Is the way I derive Equation 3 from Equation 2 correct?
begin{equation}
P(X – mathbb{E}[X] > epsilon) < delta text{ (error should be small) } \
e^{-s psi^(epsilon)} leq delta text{ (from Eq.2) } \
-s psi^
(epsilon) leq ln delta \
s psi^(epsilon) geq -ln delta \
psi^
(epsilon) geq frac{1}{s}ln frac{1}{delta}\
epsilon geq psi^{-1}(frac{1}{s}ln frac{1}{delta})\
text{We need } mu_i – hat{mu}{i,a} < epsilon. text{So,} \
mu_i – hat{mu}
{i,a} < psi^{-1}(frac{1}{s}ln frac{1}{delta})\
mu_i < hat{mu}_{i,a} + psi^{-1}(frac{1}{s}ln frac{1}{delta})
end{equation}


Get this bounty!!!

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.