# CONTEXTUAL MULTI-ARMED BANDITS WITH COMMUNICATION CONSTRAINTS

**Anonymous authors**
Paper under double-blind review

## ABSTRACT

We consider a remote Contextual Multi-Armed Bandit (CMAB) problem, in which the decision-maker observes the context and the reward, but must communicate the actions to be taken by the agents over a rate-limited communication channel. This can model, for example, a personalized ad placement application, where the content owner observes the individual visitors to its website, and hence has the context information, but must convey the ads to be shown to each visitor to a separate entity that manages the marketing content. In this Rate-Constrained CMAB (RC-CMAB) problem, the constraint on the communication rate between the decision-maker and the agents imposes a trade-off between the number of bits sent per agent and the acquired average reward. We are particularly interested in the scenario in which the number of agents and the number of possible actions are large, while the communication budget is limited. Consequently, the problem can be cast as a policy compression problem, where the distortion metric is induced by the learning objectives. We first study the fundamental information-theoretic limits of this problem by letting the number of agents go to infinity, and characterize the regret that can be achieved. In particular, we identify two distinct rate regions resulting in linear and sub-linear regret behaviour, respectively. Then, we propose a practical coding scheme, and provide numerical results for the achieved regret.

## 1 INTRODUCTION

In the last few years, synergies between Machine Learning (ML) and communication networks have attracted a lot of interest in the research community, thanks to the fruitful interplay of the two fields in emerging applications ranging from the Internet of Things (IoT) to autonomous vehicles and other edge services. In most of these applications, both the generated data and the processing power are distributed across a network of physically distant devices; thus, a reliable communication infrastructure is pivotal to run ML algorithms that can leverage the collected distributed knowledge (Park et al., 2019). To this end, many recent works have tried to redesign networks and to efficiently represent information to support distributed ML applications, where the activities of data collection, processing, learning and inference are performed in different geographical locations, and should account for limited communication, memory, or processing resources, as well as privacy constraints. In contrast to the insatiable growth in our desire to gather more data and intelligence, the available communication resources (bandwidth and power, in particular) are highly limited, and must be shared among many different devices and applications. This requires the design of highly communication-efficient distributed learning algorithms, particularly for edge applications. Information theory, in particular rate-distortion theory, has established the fundamental limits of efficient data compression with the aim of reconstructing the source signal with the highest fidelity (Cover & Thomas, 2006b). However, in the aforementioned applications, the goal is often not to reconstruct the source signal, but to make some inference based on it. This requires task-oriented compression, filtering out the information that is unnecessary for the target application, and thus decreasing the number of bits that have to be transmitted over the communication network.
This approach should target the questions of *what is the most useful information that has to be sent, and how it should be represented*, in order to meet the application requirements while consuming the minimum amount of network resources. Our goal in this paper is to investigate a theoretically grounded method to efficiently transmit data in a Contextual Multi-Armed Bandit (CMAB) problem, in which the context information is available to a decision-maker, whereas the actions are taken by a remote entity, called the controller, which controls a multitude of agents. We assume that a rate-limited communication link is available between the decision-maker and the controller to communicate, at each round, the intended actions. The controller must decide on the actions to take based on the message received over the channel, while the decision-maker observes the rewards at each round and updates its policy accordingly. This scenario can model, for example, a personalized ad placement application, where the content owner observes the individual visitors to its website, and hence has the context information, but must convey the ads to be shown to each visitor to a separate entity that manages the marketing content. This requires communicating hundreds or thousands of ads to be placed at each round, chosen from among a large set of possible ads, within the communication resource and delay constraints of the underlying communication channel, which we quantify as the number of bits available per agent. Similar communication constraints between the decision-maker and the controller arise in many other applications of CMABs (Bouneffouf & Rish, 2019).

### 1.1 RELATED WORK

Given the amount of data that is generated by machines, sensors and mobile devices, the design of distributed learning algorithms is a hot topic in the ML literature. These algorithms often impose communication constraints among agents, requiring the design of methods that allow efficient representation of the messages to be exchanged. While rate-distortion theory deals with efficient lossy transmission of signals (Cover & Thomas, 2006b), in ML applications we typically do not need to reconstruct the underlying signal, but to make some inference based on it. These applications can be modeled through distributed hypothesis testing (Berger, 1979; Ahlswede & Csiszár, 1986) and estimation (Zhang et al., 2013; Xu & Raginsky, 2017) problems under rate constraints. In parallel to the theoretical rate-distortion analysis, significant research effort has been invested in the design of practical data compression algorithms focusing on specific information sources, such as JPEG and BPG for image compression, or MPEG and H.264 for video compression. While adapting these tools to specific inference tasks is difficult, deep learning techniques have recently been employed to learn task-specific compression algorithms (Torfason et al., 2018; Jankowski et al., 2021), which achieve significant efficiency gains by bypassing image reconstruction. While the above works mainly focus on inference through supervised learning, here we consider the CMAB problem. There is a growing literature on multi-agent Reinforcement Learning (RL) problems with communication links (Foerster et al., 2016; Sukhbaatar et al., 2016; Havrylov & Titov, 2017; Lazaridou et al., 2017). These papers consider a multi-agent partially observable Markov decision process (POMDP), where the agents collaborate to solve a specific task.
In addition to the usual reward signals, agents can also exploit the available communication links to better cooperate and coordinate their actions. It has been shown that communication can help overcome the inherent non-stationarity of the multi-agent environment. Our problem can be considered as a special case of this general RL formulation, in which the state at each time is independent of the past states and actions. Moreover, we focus on a particular setting in which the communication is one-way, from the decision-maker, which observes the states and the rewards, towards the controller, which takes the actions. This formulation is different from existing results in the literature on multi-agent Multi-Armed Bandits (MABs). In Agarwal et al. (2021), each agent can pull an arm and communicate with the others. The authors do not consider the contextual case, and focus on a particular communication scheme, where each agent shares the index of the best arm according to its experience. Another related formulation is proposed in Hanna et al. (2021), where a pool of agents collaborate to solve a common MAB problem with a rate-constrained communication channel from the agents to the server. In this case, agents observe their rewards and upload them to the server, which in turn updates the policy used to instruct them. In Park & Faradonbeh (2021), the authors consider a partially observable CMAB scenario, where the agent has only partial information about the context. However, that paper does not consider any communication constraint, and the partial/noisy view of the context is generated by nature. Differently from the existing literature, our goal here is to identify the fundamental information-theoretic limits of learning with communication constraints in this particular scenario.

## 2 PROBLEM FORMULATION

### 2.1 CONTEXTUAL MULTI-ARMED BANDIT (CMAB) PROBLEM

We consider $N$ agents, which experience independent realizations of the same CMAB problem. The CMAB is a sequential decision game in which the environment imposes a probability distribution $P_S$ over a set of contexts, or states, $\mathcal{S}$, which is finite in our case. The game proceeds in rounds $t = 1, \ldots, T$. At each round $t$, and for each agent $i \in \{1, \ldots, N\}$, a realization of the state $s_{t,i} \in \mathcal{S}$ is sampled i.i.d. according to $P_S$. In the usual CMAB setting, the decision-maker would observe the states $\{s_{t,i}\}_{i=1}^N$ of the agents, and choose an action (or arm) $a_{t,i} \in \mathcal{A} = \{1, \ldots, K\}$ for each agent, where $K$ is the total number of available actions, with probability $\pi_t(a_{t,i} \mid s_{t,i})$ given by a (possibly) stochastic policy $\pi_t : \mathcal{S} \to \Delta_K$. Here $\Delta_K$ is the $K$-dimensional simplex, containing all possible distributions over the set of actions. Once the actions have been taken by all the agents, the environment returns rewards for all the agents following independent realizations of the same reward process, $r_{t,i} = r(s_{t,i}, a_{t,i}) \sim P_R(r \mid s_{t,i}, a_{t,i})$, $\forall i \in \{1, \ldots, N\}$, which depend on the state and the action of the corresponding agent. Based on the history of returns up to round $t-1$, the decision-maker can optimize its policy to maximize the sum of the rewards $G = \sum_{t=1}^{T} \sum_{i=1}^{N} r_{t,i}$. A minimal simulation of this interaction, under illustrative distributional assumptions, is sketched below.
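To make the formulation concrete, the following minimal sketch simulates this interaction for one run. The uniform state distribution, the Bernoulli rewards, and the randomly drawn policy are illustrative placeholders, not the setup used in the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

L, K, N, T = 4, 3, 10, 100          # |S|, |A|, number of agents, rounds
P_S = np.full(L, 1.0 / L)           # uniform state distribution (illustrative)
mu = rng.uniform(0.0, 1.0, (L, K))  # Bernoulli reward means, one per (state, action)

def policy(t):
    """A (possibly time-varying) stochastic policy pi_t: one distribution over A per state."""
    p = rng.uniform(size=(L, K))
    return p / p.sum(axis=1, keepdims=True)

G = 0.0                                                   # cumulative reward over rounds and agents
for t in range(T):
    pi_t = policy(t)
    s = rng.choice(L, size=N, p=P_S)                      # i.i.d. states, one per agent
    a = np.array([rng.choice(K, p=pi_t[si]) for si in s]) # actions sampled from pi_t(.|s)
    r = rng.binomial(1, mu[s, a])                         # rewards r_{t,i} ~ P_R(.|s,a)
    G += r.sum()

print(f"cumulative reward G = {G:.0f}")
```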
### 2.2 RATE-CONSTRAINED CMAB

In our modified version of the problem, the process of observing the system states is spatially separated from the process of taking the actions. The environment states, $\{s_{t,i}\}_{i=1}^N$, are observed by a central entity, called the decision-maker, which has to communicate to the controller, over a rate-constrained communication channel, the information about the actions $\{a_{t,i}\}_{i=1}^N$ the agents should take at each round $t$. The decision-maker can exploit the knowledge accumulated from the states and rewards observed by all the agents up to round $t-1$, denoted by
$$H(t-1) = \Big( \big(\{s_{1,i}\}_{i=1}^N, \{a_{1,i}\}_{i=1}^N, \{r_{1,i}\}_{i=1}^N\big), \ldots, \big(\{s_{t-1,i}\}_{i=1}^N, \{a_{t-1,i}\}_{i=1}^N, \{r_{t-1,i}\}_{i=1}^N\big) \Big) \in \mathcal{H}^{(t-1)},$$
to optimize the policy to be used at round $t$. Consequently, the problem is to communicate the action distribution, i.e., the policy $\pi_t(a \mid s_t)$, which depends on the specific state realizations observed in round $t$, to the controller within the available communication resources, while incurring the minimal impact on the performance of the learning algorithm. Specifically, the decision-maker employs a function $f_t^{(N)} : \mathcal{H}^{(t-1)} \times \mathcal{S}^N \to \{1, 2, \ldots, B\}$ to map the history up to time $t$ and the states of the agents at time $t$ to a message index to be transmitted over the channel. The controller, on the other hand, employs a function $g_t^{(N)} : \{1, 2, \ldots, B\} \to \mathcal{A}^N$ to map the received message to a set of actions for the agents. In general, both functions $f_t^{(N)}$ and $g_t^{(N)}$ can be stochastic. The average per-period regret achieved by the sequences $\{f_t^{(N)}, g_t^{(N)}\}_{t=1}^T$ is given by
$$\rho^{(N)}(T) = \frac{1}{T}\, \mathbb{E}\left[ \sum_{t=1}^{T} \sum_{i=1}^{N} \Big( r\big(s_{t,i}, a^*(s_{t,i})\big) - r\big(s_{t,i}, a_{t,i}\big) \Big) \right], \qquad (1)$$
where $a_{t,i} = g_{t,i}(m(t))$ is the action taken by agent $i$ based on the message $m(t) = f_t^{(N)}\big(H(t-1), \{s_{t,i}\}_{i=1}^N\big)$ transmitted in round $t$, $a^*(s_{t,i})$ is the action with maximum mean reward in state $s_{t,i}$, i.e., the optimal action, and the expectation is taken with respect to the state distribution $P_S$ and the reward distribution $P_R$. We say that, for a given time horizon $T$ and $N$ agents, an average regret-communication rate pair $(\rho, R)$ is achievable if there exist functions $\{f_t^{(N)}, g_t^{(N)}\}_{t=1}^T$ as defined above with rate $\frac{1}{N} \log_2 B \le R$ and regret $\rho^{(N)}(T) \le \rho$.

If a sufficiently large rate is available for communication, i.e., $R \ge \log_2 K$, then the intended action for each agent can be reliably conveyed to the controller. Otherwise, to achieve the learning goal while satisfying the rate constraint, the decision-maker must apply a lossy compression scheme, such that the action distribution adopted by the pool of agents resembles the intended policy as much as possible. A sketch of this encoder/decoder interface, together with an empirical evaluation of the regret in Eq. (1), is given below.

Figure 1: The problem of communicating policies.
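The sketch below illustrates the encoder/decoder interface just defined and an empirical version of the regret in Eq. (1). It assumes a toy setting in which the message budget $B$ equals $K^N$, so the identity mapping between action tuples and message indices suffices; the names `f_t`, `g_t` and `per_period_regret` mirror the notation above but are otherwise hypothetical, and the mean-reward matrix is assumed known only to evaluate the regret.

```python
import numpy as np
from itertools import product

K, N = 4, 3                  # actions and agents kept tiny so the message set is enumerable
B = K ** N                   # message budget large enough to index every action tuple
R = np.log2(B) / N           # rate in bits per agent; here R = log2(K)

ACTION_TUPLES = list(product(range(K), repeat=N))   # g maps a message index to an action tuple

def f_t(history, states, intended_actions):
    """Encoder f_t: here simply the index of the intended action tuple (needs R = log2 K)."""
    return ACTION_TUPLES.index(tuple(intended_actions))

def g_t(message):
    """Decoder g_t: recovers one action per agent from the message index."""
    return np.array(ACTION_TUPLES[message])

def per_period_regret(states, actions, mu):
    """Empirical Eq. (1): states, actions of shape (T, N); mu is the (|S|, K) mean-reward matrix."""
    best = mu[states].max(axis=-1)            # r(s, a*(s)) for every (t, i)
    taken = mu[states, actions]               # r(s, a_{t,i})
    return (best - taken).sum(axis=1).mean()  # (1/T) sum_t sum_i (...)

print(f"rate per agent: {R:.2f} bits")
```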
## 3 SOLUTION

In this section, we present the proposed solution to the problem. First, we briefly describe the algorithm adopted by the decision-maker to solve the CMAB. Then, we provide the communication scheme for the RC-CMAB with a particular distortion function, which is inherited from the learning task. We then identify two distinct rate regions resulting in linear and sub-linear regret, and propose a more practical coding scheme.

### 3.1 THOMPSON SAMPLING

In the proposed solution to the CMAB problem, the decision-maker adopts Thompson Sampling (TS) (Thompson, 1935). In particular, it makes use of one TS instance for each state $s \in \mathcal{S}$. Consequently, the decision-maker maintains an estimate of the distribution $p_t^{s,a}(\mu)$ of the mean reward $\mu^{s,a} \in \mathcal{D} \subseteq \mathbb{R}$ of action $a$ in state $s$ at time $t$. To take a decision in state $s_t$, the decision-maker samples $\mu_t^{s_t,a} \sim p_t^{s_t,a}$, $\forall a \in \mathcal{A}$, and takes the action $a^* = \arg\max_{a \in \mathcal{A}} \{\mu_t^{s_t,a}\}$. This procedure is repeated for each agent $i \in \{1, \ldots, N\}$. After receiving the rewards $\{r_{t,i}\}_{i=1}^N$, the decision-maker updates its beliefs on $\mu^{s,a}$, i.e., the distributions $p_t^{s,a}(\mu)$, in order to minimize the regret. We notice that this strategy induces a probability distribution $\pi_t(a \mid s)$ over the actions, given by
$$\pi_t(a \mid s) = \int_{\mathcal{D}} p_t^{s,a}(\mu) \prod_{j=1, j \neq a}^{K} P_t^{s,j}(\mu) \, d\mu,$$
where $P_t^{s,j}(\mu)$ is the Cumulative Distribution Function (CDF) of $\mu^{s,j}$, and the random variables $\mu^{s,a}$ are treated as independently distributed. However, the rate constraint imposed by the RC-CMAB formulation makes it infeasible for the decision-maker to sample, through the pool of agents, the actions directly from the true distribution $\pi_t(a \mid s)$. The agents have to use a proxy $Q_t(a \mid s)$, which is the one obtained from the received message. This problem is similar to approximate TS, where a proxy distribution is used to sample the actions, or the reward means, because the true distribution is too complex to sample from. In that case, the bottleneck is the complexity of the mean reward distributions, whereas in this work it is imposed by the limited-rate communication channel between the decision-maker and the controller. A minimal per-state TS sketch, including a Monte Carlo estimate of the induced policy $\pi_t(a \mid s)$, is given below.
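The following sketch instantiates one TS posterior per (state, action) pair with Beta-Bernoulli updates (a concrete choice made here for illustration; the analysis in Sec. 3.4 only requires canonical exponential-family rewards), and estimates the induced policy $\pi_t(a \mid s)$ by Monte Carlo as the probability that arm $a$ produces the largest posterior sample.

```python
import numpy as np

rng = np.random.default_rng(2)

L, K = 4, 3
alpha = np.ones((L, K))   # Beta posterior parameters per (state, action)
beta = np.ones((L, K))

def ts_action(s):
    """One TS decision in state s: sample a mean per arm and take the argmax."""
    samples = rng.beta(alpha[s], beta[s])
    return int(np.argmax(samples))

def ts_update(s, a, r):
    """Bernoulli reward r in {0, 1} updates the Beta posterior of arm a in state s."""
    alpha[s, a] += r
    beta[s, a] += 1 - r

def induced_policy(n_samples=5000):
    """Monte Carlo estimate of pi_t(a|s) = P(arm a has the largest posterior sample)."""
    pi = np.zeros((L, K))
    for s in range(L):
        draws = rng.beta(alpha[s], beta[s], size=(n_samples, K))
        winners = draws.argmax(axis=1)
        pi[s] = np.bincount(winners, minlength=K) / n_samples
    return pi

# A short interaction with a toy Bernoulli environment (means chosen only for illustration).
mu = np.array([[0.8, 0.4, 0.2], [0.3, 0.7, 0.1], [0.2, 0.2, 0.9], [0.5, 0.5, 0.5]])
for _ in range(200):
    s = rng.integers(L)
    a = ts_action(s)
    r = rng.binomial(1, mu[s, a])
    ts_update(s, a, r)
print(np.round(induced_policy(), 2))
```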
### 3.2 OPTIMAL SOLUTION FOR THE RC-CMAB

We model the environment as a Discrete Memoryless Source (DMS) that generates states from a finite alphabet $\mathcal{S}$ with probability $P_S$, emitting sequences of $N$ symbols $s^N = (s_1, \ldots, s_N)$, one per agent. We denote by $\hat{Q}_{s^N}(m)$ the empirical probability of state $m$ in $s^N$. We also consider the sequence of actions $a^N$, and denote by $\hat{Q}_{z^N}(m, j)$ the empirical joint probability of the pair $(m, j)$ in $z^N = ((s_1, a_1), \ldots, (s_N, a_N))$. The whole picture can be seen in Fig. 1. In the figure, the actions taken by the agents are denoted by $\hat{a}$ to indicate that they can differ from $a$. However, since we are interested in the probability distributions generating the sequences, we denote by $A$ the random variables indicating the actions at both the controller and the decision-maker side. We assume that the distribution $P_S$ is known (or accurately estimated). The decision-maker observes the realization $s^N$ of the contexts, and its task is to transmit an index $u \in \{1, \ldots, B\}$ from which the agents can generate a sequence $a^N$, such that $\hat{Q}_{s^N a^N}$ is close to $P_{SA}(s, a) = P_S(s)\pi(a \mid s)$, where closeness is measured by a distortion $\mathbb{E}[d(\hat{Q}_{S^N A^N}, P_{SA})]$, which in general is not an average of a per-letter distortion measure. The problem is thus a compression task in which the server has complete (or partial) knowledge of the states $s^N$, and wants to convey a conditional probability distribution $\pi_{A|S}$ to the agents, consuming the minimum number of bits, in such a way that the empirical distribution $\hat{Q}_{s^N a^N}$ induced by the agents' sequence is close to the joint distribution $P_{SA}$ induced by the policy.

For a distortion function $d(Q_{SA}, P_{SA})$ that is 1) nonnegative, 2) upper bounded by a constant $D_{\max}$, 3) continuous in $Q_{SA}$, and 4) convex in $Q_{SA}$, Kramer & Savari (2007) provide the rate-distortion function $R(D)$, i.e., the minimum rate $R = \frac{\log_2 B}{N}$ bits per symbol such that $\mathbb{E}[d(\hat{Q}_{S^N A^N}, P_{SA})] \le D$, in the limit as $N$ grows arbitrarily large. The solution is given by
$$R(D) = \min_{Q_{A|S} : \, d(Q_{SA}, P_{SA}) \le D} I(S; A), \qquad (2)$$
where $Q_{SA} = P_S Q_{A|S}$ is the joint probability induced by the environment distribution $P_S$ and the policy $Q_{A|S}$, which depends on the information sent by the decision-maker. As we can see, in the asymptotic limit of $N$ agents, the problem admits a single-letter solution, which also serves as a lower bound for the finite-agent scenario. The RC-CMAB model described in Sec. 2.2 fits into this framework, and in the following section we identify an appropriate distortion function and provide its characterization.

### 3.3 THE KL-DIVERGENCE AS DISTORTION FUNCTION

When applying TS to the RC-CMAB problem with limited communication rate, the decision-maker may not be able to induce the controller to take samples from the true policy $\pi_t(a \mid s)$. In Phan et al. (2019), the authors provide theoretical guidelines to construct approximate sampling policies that make the posteriors $p_t^{s,a}$ concentrate, achieving sub-linear regret. In particular, they studied the case in which the sampling distribution $Q$ differs from the target posterior $\pi$, using the $\alpha$-divergence $D_\alpha(\pi, Q)$ as the distortion measure, defined as
$$D_\alpha(\pi, Q) = \frac{1 - \int \pi(x)^\alpha Q(x)^{1-\alpha} dx}{\alpha(1 - \alpha)}. \qquad (3)$$
Theorem 1 of Phan et al. (2019) shows that, for $\alpha > 0$, the condition $D_\alpha(\pi, Q) < \delta$ with $\delta > 0$ cannot guarantee that the posterior $\pi$ concentrates in sub-linear time, even for very small values of $\delta$. On the contrary, for $\alpha \le 0$, the authors provide a scheme that can guarantee sub-linear regret. In particular, they suggest introducing an exploration term $\rho_t$: the actions are sampled from $Q_t$ with probability $1 - \rho_t$, and uniformly at random with probability $\rho_t \in o(1)$ such that $\sum_{t=1}^{\infty} \rho_t = \infty$, while $D_\alpha(\pi_t, Q_t) < \delta$ for all $t$. In this case, it is possible to obtain sub-linear regret. Consequently, to solve the proposed RC-CMAB problem, we investigate this strategy with $\alpha = 0$, which corresponds to the reverse KL-divergence $D_{KL}(Q, \pi) = \sum_{x \in \mathcal{X}} Q(x) \log \frac{Q(x)}{\pi(x)}$, when $x$ takes values in the discrete set $\mathcal{X}$. To find the optimal constrained policy $Q_t(a \mid s)$, we therefore need to solve Eq. (2) as a rate-distortion optimization problem with the reverse KL-divergence as the distortion function. Both divergences are straightforward to evaluate for discrete distributions, as in the sketch below.
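As a small illustration, the helper functions below (hypothetical names, not from the paper's code) evaluate the $\alpha$-divergence of Eq. (3) and the reverse KL-divergence for discrete distributions; a small constant is added for numerical stability.

```python
import numpy as np

def alpha_divergence(pi, q, alpha, eps=1e-12):
    """D_alpha(pi, q) of Eq. (3) for discrete distributions; alpha not in {0, 1}."""
    pi = np.clip(np.asarray(pi, dtype=float), eps, None)
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    return (1.0 - np.sum(pi ** alpha * q ** (1.0 - alpha))) / (alpha * (1.0 - alpha))

def reverse_kl(q, pi, eps=1e-12):
    """D_KL(Q || pi) = sum_x Q(x) log(Q(x) / pi(x)); the alpha -> 0 case used in the paper."""
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    pi = np.clip(np.asarray(pi, dtype=float), eps, None)
    return float(np.sum(q * np.log(q / pi)))

pi = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(alpha_divergence(pi, q, alpha=0.5), reverse_kl(q, pi))
```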
We can rewrite the optimization objective of Eq. (2) as a double minimization problem (Sec. 10.8, Cover & Thomas, 2006a):
$$R(D) = \min_{\tilde{Q}(a)} \; \min_{Q_{A|S} : \, d(Q_{SA}, P_{SA}) \le D} \; \sum_{s,a} P(s) Q(a \mid s) \log_2 \frac{Q(a \mid s)}{\tilde{Q}(a)}. \qquad (4)$$
Following Lemma 10.8.1 of Cover & Thomas (2006a), the marginal $Q^*(y) = \sum_x P(x) Q(y \mid x)$ has the property
$$Q^*(y) = \arg\min_{\tilde{Q}(y)} D_{KL}\big(P(x)Q(y \mid x) \,\|\, P(x)\tilde{Q}(y)\big), \qquad (5)$$
that is, it minimizes the KL-divergence between the joint and the product $P(x)\tilde{Q}(y)$, over all $\tilde{Q} \in \Delta_K$. This means that the $\tilde{Q}(a)$ obtained by solving Eq. (4) is indeed the marginal over the actions induced by $Q(a \mid s)$. Exploiting this formulation, it is possible to apply the iterative Blahut-Arimoto algorithm to solve the problem. In particular, since both minimization sets are sets of probability distributions and the objective is a KL-divergence, the algorithm converges to the minimum (Csiszár & Tusnády, 1984). The process is initialized by setting a random $\tilde{Q}_0(a)$, which is kept fixed to compute
$$Q_1^*(a \mid s) = \arg\min_{Q_{A|S} : \, d(Q_{SA}, P_{SA}) \le D} \; \sum_{s,a} P(s) Q(a \mid s) \log_2 \frac{Q(a \mid s)}{\tilde{Q}_0(a)}. \qquad (6)$$
From $Q_1^*(a \mid s)$, we compute the optimal $\tilde{Q}_1(a)$ by solving Eq. (5), which is simply the marginal $Q_1^*(a) = \sum_s P(s) Q_1^*(a \mid s)$. The process is iterated until convergence.

We now solve the inner minimization problem, i.e., Eq. (6) with fixed $\tilde{Q}(a)$. The constraint on the distortion keeps $Q_{A|S}$ close to $\pi_{A|S}$, which is the target distribution, whereas minimizing the mutual information pushes $Q_{A|S}$ as close as possible to the marginal $Q_A$, which is known by the agents (it is sent once at the beginning of the round and does not depend on the state realizations), in order to reduce the number of bits to be transmitted. The solution to Eq. (2) is thus a conditional distribution $Q_{A|S}$ that combines the target policy conditioned on $S$ with the marginal distribution.

We now discuss some simple corner cases. If the maximum acceptable distortion is $0$, we need $Q_{A|S} = \pi_{A|S}$, and the rate equals $\mathbb{E}_{P_S}\big[D_{KL}\big(\pi_{A|S} \,\|\, \pi_A\big)\big]$. If $S$ and $A$ are independent, i.e., the policy does not depend on the state, we have $\pi_{A|S} = \pi_A$, and the decision-maker does not need to send any bits to the agents, which can simply sample $A \sim \pi_A$. Moreover, if we can find a distribution $Q_A$ such that $\mathbb{E}_{P_S}\big[D_{KL}\big(Q_A \,\|\, \pi_{A|S}\big)\big] \le D$, then the best strategy is not to convey any message to the controller, as the constraint is already satisfied with rate $R(D) = 0$. In the general case, the problem is to find, among all conditional distributions $Q(a \mid s)$ that satisfy the constraint, the one with minimum divergence from $Q(a)$, in order to minimize the rate $\frac{\log_2 B}{N}$, i.e., the number of bits consumed per agent, for arbitrarily large $N$. We solve the problem using Lagrange multipliers, and obtain the shape of the optimal distribution
$$Q_{\gamma^*}(a \mid s) = \frac{\tilde{Q}(a)^{\gamma^*} P(a \mid s)^{1-\gamma^*}}{\sum_{a' \in \mathcal{A}} \tilde{Q}(a')^{\gamma^*} P(a' \mid s)^{1-\gamma^*}}, \qquad \forall s \in \mathcal{S}, \; a \in \mathcal{A}, \qquad (7)$$
where $\gamma^*$ is such that $\mathbb{E}_{P_S}[D_{KL}(Q_{\gamma^*} \,\|\, P)] = D$. The derivation is provided in Appendix A.1. A sketch of the resulting alternating-minimization procedure is given below.
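The following is a minimal sketch of this alternating minimization for a discrete state set with known $P(s)$ and target policy $P(a \mid s)$. The inner step uses the tilted form of Eq. (7), with a bisection on $\gamma$ to approximately meet the distortion budget $D$ (assuming the distortion is monotone in $\gamma$), and the outer step recomputes the marginal as in Eq. (5). Function names and the tiny example are illustrative, not the authors' implementation.

```python
import numpy as np

def tilted_policy(pi, q_marg, gamma, eps=1e-12):
    """Eq. (7): Q_gamma(a|s) proportional to q_marg(a)^gamma * pi(a|s)^(1-gamma)."""
    q = (q_marg[None, :] + eps) ** gamma * (pi + eps) ** (1.0 - gamma)
    return q / q.sum(axis=1, keepdims=True)

def avg_reverse_kl(P_s, q, pi, eps=1e-12):
    """E_{P_S}[ D_KL( Q(.|s) || pi(.|s) ) ]."""
    return float(np.sum(P_s[:, None] * q * np.log((q + eps) / (pi + eps))))

def rate_bits(P_s, q, q_marg, eps=1e-12):
    """I(S; A) in bits under the joint P_s(s) q(a|s)."""
    return float(np.sum(P_s[:, None] * q * np.log2((q + eps) / (q_marg[None, :] + eps))))

def blahut_arimoto_kl(P_s, pi, D, n_outer=100, n_bisect=40):
    """Alternating minimization for Eq. (2) with reverse-KL distortion (illustrative sketch)."""
    K = pi.shape[1]
    q_marg = np.full(K, 1.0 / K)                     # initialization of the marginal Q~_0(a)
    for _ in range(n_outer):
        lo, hi = 0.0, 1.0                            # gamma = 0 gives Q = pi (zero distortion)
        for _ in range(n_bisect):                    # largest gamma whose distortion is <= D
            mid = 0.5 * (lo + hi)
            if avg_reverse_kl(P_s, tilted_policy(pi, q_marg, mid), pi) <= D:
                lo = mid
            else:
                hi = mid
        q = tilted_policy(pi, q_marg, lo)
        q_marg = P_s @ q                             # Eq. (5): marginal over the actions
    return q, rate_bits(P_s, q, q_marg)

# Tiny example: 4 states, 4 actions, nearly deterministic target policy.
L = K = 4
P_s = np.full(L, 0.25)
pi = np.full((L, K), 0.01 / (K - 1)) + np.eye(L, K) * (0.99 - 0.01 / (K - 1))
Q, R = blahut_arimoto_kl(P_s, pi, D=0.1)
print(f"rate R(D=0.1) ~ {R:.3f} bits per agent")
```

In this nearly deterministic 4-state example, letting $D$ approach $0$ pushes the returned rate towards roughly the 2-bit entropy of the uniform state distribution, consistent with the behaviour discussed for the Deterministic case in Sec. 4.1.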
### 3.4 ASYMPTOTIC REGRET

To prove the results on the achievable regret, we need the additional conditions contained in Assumption 1 of Russo (2016), which states that the rewards must be distributed following canonical exponential families, and the priors used by TS over the average rewards must be uniformly bounded away from zero $\forall (s, a)$. The proofs of both Lemmas below are reported in Appendix D. In the following, we characterize the minimum rate needed to achieve sub-linear regret in all states $s \in \mathcal{S}$. We define $H(A^*)$ as the marginal entropy of the optimal arm, computed from the marginal $\pi^*(a) = \sum_s P_S(s) \pi^*(a \mid s)$ as $H(A^*) = \sum_a \pi^*(a) \log \frac{1}{\pi^*(a)}$, and we prove that it is the minimum rate required to achieve sub-linear regret.

**Lemma 3.1.** *If $R < H(A^*)$, then it is not possible to convey a policy $Q_t(s, a)$ that achieves sub-linear regret in all states $s \in \mathcal{S}$.*

The following Lemma provides the achievability part.

**Lemma 3.2.** *If $R > H(A^*)$, then achieving sub-linear regret is possible in all states $s \in \mathcal{S}$.*

The consequence of this second Lemma is that, even if the exact TS policy $\pi_t$ cannot be transmitted for every $t$, as long as sufficient rate is available, i.e., $R > H(A^*)$, it is still possible to achieve sub-linear regret. Following the notation introduced in Sec. 2.2, this implies that, as $T \to \infty$, and for all sub-linear regrets $\rho$, the regret-communication pair $(\rho, R)$ is achievable as long as $R > H(A^*)$. Moreover, we argue that the policy construction found in Sec. 3.3 can achieve better empirical performance than the scheme used to prove Lemma 3.2.

### 3.5 PRACTICAL CODING SCHEME

The above analysis characterizes an information-theoretic bound on the optimal performance, but does not provide a constructive communication scheme. To obtain a practical coding scheme, we propose a solution based on state reduction, which computes a compact state representation. In essence, the decision-maker constructs a message containing the new state representations $\hat{s}(s) \in \hat{\mathcal{S}}$, one for each agent, and sends it over the channel. Once the agents have received the message, they sample the actions according to a common policy $Q_{\hat{s}}(a \mid \hat{s})$, which is defined on the compressed state space $\hat{\mathcal{S}}$. If the rate constraint allows $B$ bits per agent, then at most $2^B$ different states can be conveyed to each agent. The idea is to group the states into $2^B = M$ clusters $\hat{s}_1, \ldots, \hat{s}_M$, minimizing $\sum_s P(s) D_{KL}\big(Q_{\hat{s}}(a \mid \hat{s}) \,\|\, \pi(a \mid s)\big)$, where $Q_{\hat{s}}(a \mid \hat{s})$ is the policy defined on the state $\hat{s}(s) \in \hat{\mathcal{S}}$.

To find the clusters and their policies we employ the well-known Lloyd algorithm, an iterative process that groups the states into $2^B$ clusters. First, knowing the policy $\pi$, the decision-maker maps each state $s_i$ to a $K$-dimensional point $\alpha^i = \pi(\cdot \mid s_i) \in \Delta_K$, obtaining $|\mathcal{S}| = L$ points $\alpha^1, \ldots, \alpha^L$. Then, it generates $2^B = M$ random points $\mu^1, \ldots, \mu^M \in \Delta_K$ as initial centroids, i.e., representative policies, and iterates over the following two steps (a sketch of the whole procedure is given below):

1. Assign to each point $\alpha^i$ the class $j^* \in \{1, \ldots, M\}$ such that $j^* = \arg\min_j D_{KL}(\mu^j \| \alpha^i)$, i.e., minimizing the divergence between the representative $\mu^{j^*}$ and the original policy, which is the point $\alpha^i$. For each cluster $j$, we denote by $\mathcal{S}_j$ the set containing the states associated with the policies in the cluster.

2. Update $\mu^1, \ldots, \mu^M$ such that $\mu^j = \arg\min_{\mu \in \Delta_K} \sum_{s \in \mathcal{S}_j} P(s) D_{KL}(\mu \| \pi(\cdot \mid s))$, which is a convex optimization problem that can again be solved with Lagrange multipliers. The solution is
$$\mu^j = \frac{1}{Z} \prod_{s \in \mathcal{S}_j} \pi(\cdot \mid s)^{\frac{P(s)}{A(\mathcal{S}_j)}}, \qquad (8)$$
where the product is element-wise, $A(\mathcal{S}_j)$ is the sum of the probabilities of the states in $\mathcal{S}_j$, i.e., $A(\mathcal{S}_j) = \sum_{s \in \mathcal{S}_j} P(s)$, and $Z$ is the normalizing factor. After computing the new centroids, we go back to step (1). The derivation of Eq. (8) is provided in Appendix A.2.

The process continues until the new solution no longer decreases the KL-divergence between the clustered and the target policy, $D_{KL}\big(Q_{\hat{s}}(\hat{s}, a) \,\|\, P(s, a)\big) = \sum_{j=1}^{M} \sum_{s \in \mathcal{S}_j} P(s) D_{KL}\big(\mu^j \,\|\, \pi(\cdot \mid s)\big)$.
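A minimal sketch of this clustering step follows, assuming the target policy is given as an $L \times K$ matrix of per-state action distributions. The assignment uses $D_{KL}(\mu^j \| \pi(\cdot \mid s))$ as in step (1), and the centroid update implements the $P(s)$-weighted geometric mean of Eq. (8). Names and the random example are illustrative.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    p, q = p + eps, q + eps
    return np.sum(p * np.log(p / q), axis=-1)

def lloyd_policy_clustering(pi, P_s, M, n_iter=50, seed=0):
    """Group L per-state policies (rows of pi) into M representative policies."""
    rng = np.random.default_rng(seed)
    L, K = pi.shape
    mu = rng.dirichlet(np.ones(K), size=M)            # random initial centroids in the simplex
    for _ in range(n_iter):
        # Step 1: assign each state to the centroid minimizing D_KL(mu_j || pi(.|s)).
        assign = np.array([[kl(mu[j], pi[s]) for j in range(M)] for s in range(L)]).argmin(axis=1)
        # Step 2: Eq. (8), a P(s)-weighted geometric mean of the policies in each cluster.
        for j in range(M):
            members = np.where(assign == j)[0]
            if len(members) == 0:
                continue                              # keep empty clusters unchanged
            w = P_s[members] / P_s[members].sum()
            log_mu = (w[:, None] * np.log(pi[members] + 1e-12)).sum(axis=0)
            mu[j] = np.exp(log_mu) / np.exp(log_mu).sum()
    distortion = float(np.sum(P_s * kl(mu[assign], pi)))
    return mu, assign, distortion

# Example: 8 states, 4 actions, compressed to M = 2^B = 4 clusters.
rng = np.random.default_rng(3)
pi = rng.dirichlet(np.ones(4), size=8)
P_s = np.full(8, 1.0 / 8)
mu, assign, d = lloyd_policy_clustering(pi, P_s, M=4)
print(assign, round(d, 3))
```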
**Observation.** Note that the controller is assumed to know the $2^{NR}$ policies from which it samples the actions of the agents. These can be transmitted at the beginning of each round. In this case, the scheme is efficient as long as $N \log_2 K \gg B L \lambda \log_2 K$, where $\lambda$ is the number of bits used to represent the values of the Probability Mass Function (PMF) $Q_{\hat{s}}(\cdot \mid \hat{s})$. For this reason, we provide a scheme where the new policy is updated not at every transmission, but only when the target $\pi$ has changed considerably. In particular, if we denote by $\pi^{\mathrm{cls}}$ the policy defined over the compressed state representation, by $\pi^{\mathrm{last}}$ the last policy used to compute $\pi^{\mathrm{cls}}$, and by $\pi$ the updated target policy, we compute and transmit $\pi^{\mathrm{cls}}$ every time $D_{KL}(\pi^{\mathrm{last}} \| \pi) > \beta$.

## 4 NUMERICAL RESULTS

In this section, we provide numerical results in support of our theoretical analysis. Additional experiments and details can be found in Appendix C.

### 4.1 RATE-DISTORTION FUNCTION

In this first experiment, we analyze the rate-distortion function, and the related optimal $Q(a \mid s)$, when the distortion is given by the KL-divergence, in three different problems in which the number of states, $L$, and the number of actions, $K$, are both equal to 16. In the first problem, the target policies $\pi(a \mid s)$ have a one-to-one relation with the states, and are generated such that $\pi(a = i \mid s = j) = 0.99$ if $j = i$, and uniform otherwise. We refer to this as the Deterministic case. In the second experiment, the target policies are generated in a similar way, but the other actions' probabilities take random values in $(0, 0.05]$, and are then normalized; we call this the Random Deterministic case. In the third experiment, the states are grouped into 4 blocks, such that $\pi(a = i \mid s = j) = 0.99$ if $\lfloor j/4 \rfloor = i$, and uniform otherwise; this case is denoted as the Block one. In all three experiments, the distribution $P_S$ is uniform over the state space $\mathcal{S}$. A sketch of how these target policies can be generated is given at the end of this subsection.

Figure 2: Rate-distortion function for the 3 different experiments (a). $\pi(a \mid s=8)$ and $Q(a \mid s=8)$ in the third experiment, when the rate $R$ is equal to 1 (b). $\pi(a \mid s=8)$ and $Q(a \mid s=8)$ in the second experiment, when the rate $R$ is equal to 1 (c).

The rate-distortion curves are reported in Fig. 2a. As we can see, in the Deterministic case, the point with zero distortion has $R(0) \sim H(P_S)$, where $H(P_S)$ is the Shannon entropy of $P_S$, as expected. Indeed, in this case, the action distributions are strongly correlated with the state, and accurate knowledge of the state is needed. In the second experiment, the starting point is similar, but the curve decreases more rapidly, due to the random values of the other actions' probabilities. In the third experiment, the zero-distortion point is achieved at $R(0) \sim 2$ bits, since, similarly to the Deterministic experiment, the action probabilities are correlated with the state realization, but are grouped into four different cases. Again, given the uniform distribution over the states, $R(0) \sim H(\mathrm{Unif}_4)$, where $\mathrm{Unif}_4$ is the uniform distribution over a set of 4 elements. In Fig. 2c, the approximate distributions for state $s = 8$ are reported for the Deterministic and Random Deterministic cases, when $R = 1$.
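For concreteness, the sketch below generates the three families of target policies following our reading of the description above; details such as how the residual probabilities are normalized in the Random Deterministic case are assumptions, and may differ from the exact construction used to produce the figures.

```python
import numpy as np

def deterministic_policy(L=16, K=16, p=0.99):
    """pi(a=i|s=j) = p if i == j, remaining mass spread uniformly over the other arms."""
    pi = np.full((L, K), (1.0 - p) / (K - 1))
    pi[np.arange(L), np.arange(L) % K] = p
    return pi

def random_deterministic_policy(L=16, K=16, p=0.99, seed=0):
    """Same dominant arm, but the other probabilities are random in (0, 0.05] and renormalized."""
    rng = np.random.default_rng(seed)
    pi = rng.uniform(1e-6, 0.05, size=(L, K))
    pi[np.arange(L), np.arange(L) % K] = 0.0
    pi *= (1.0 - p) / pi.sum(axis=1, keepdims=True)   # off-dominant mass sums to 1 - p
    pi[np.arange(L), np.arange(L) % K] = p
    return pi

def block_policy(L=16, K=16, p=0.99, block=4):
    """pi(a=i|s=j) = p if floor(j/block) == i, uniform otherwise."""
    pi = np.full((L, K), (1.0 - p) / (K - 1))
    pi[np.arange(L), np.arange(L) // block] = p
    return pi

for f in (deterministic_policy, random_deterministic_policy, block_policy):
    assert np.allclose(f().sum(axis=1), 1.0)
```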
### 4.2 CONTEXTUAL MULTI-ARMED BANDIT

We now analyze the RC-CMAB problem presented in Sec. 2, and apply the clustered policy scheme to solve it. In particular, we compare the performance of the Perfect agent, which applies TS without any rate constraint, and thus samples from the true posterior $\pi$, with the performance of the Comm, Cluster, and Marginal agents. The Comm agent uses the optimal scheme provided in Sec. 3.2; the Cluster agent implements the practical coding scheme provided in Sec. 3.5, with $B = \lceil R \rceil$ bits per agent, where $R$ is the rate adopted by the Comm agent; the Marginal agent adopts the marginal over the states, computed from the target policy $\pi$ and the environment distribution $P_S$, and serves as a lower bound on the performance. The state distribution $P_S$ is uniform over $L = 16$ states, and there are $K = 16$ actions, $N = 100$ agents, and a total of $T = 200$ rounds. The threshold for changing the clustered policy is set to $\beta = 0.2$, and the communication rate constraint to $R = 2.5$. In this experiment, for each state $s_i$, the best reward is given by the arm $a_j$ such that $i = j$, with $i, j \in \{0, \ldots, 15\}$. In particular, the reward of arm $j$ in state $s_i$ is a Bernoulli random variable with parameter $\mu_j = 0.8$ if $i = j$, whereas $\mu_j \sim \mathrm{Unif}[0, 0.75]$ if $i \neq j$. In this case, the best action response is strongly correlated with the state realization; thus, a sufficiently high rate is required to sample from the target policy $\pi$.

As can be observed from Fig. 3(a), the theoretical rate needed to transmit the target policy is related to the amount of information the agent is learning from the environment, and is computed using Eq. (2). Indeed, during the first $\sim 40$ iterations, the rate is below $R = 2.5$. As the learning process continues, the required rate for reliable transmission increases, as the mutual information between $S$ and $A$ increases. We highlight that, in order to fairly represent the 16 different actions, one would need 4 bits. Indeed, another way of looking at Eq. (2) is through the information bottleneck principle (Igl et al., 2019), which is used to constrain the mutual information between the states and the actions and to encourage exploration, and which can be exploited to reduce the required rate when training multi-agent systems. In Fig. 3(b), we report the KL-divergence between the last policy used to compute the compressed state representation, i.e., $\pi^{\mathrm{last}}$ in Sec. 3.5, and the target policy known by the decision-maker. The trend shows that, at the beginning of the learning process, it oscillates rapidly, as the target policy changes significantly between consecutive iterations. The threshold $\beta = 0.2$, indicating when a new cluster policy has to be sent, is highlighted in gray. In this experiment, the decision-maker had to send a new policy, on average, once every $\sim 7$ rounds. Fig. 4(a) reports the regret at state $s = 10$ (the other plots can be seen in Appendix C.2) for the four different agents, as defined in Eq. (1). We can see that both the Cluster and the Comm agents achieve sub-linear regret in this particular state. In Fig. 4(b), the average rewards obtained by the agents are reported for each state separately. We can see that the developed scheme is still able to achieve good performance.

Figure 3: Asymptotic rates to convey the Perfect and Comm policies, and the bits used by the Cluster agent, averaged over 5 runs (a). $D_{KL}(\pi^{\mathrm{last}} \| \pi)$, averaged over 5 runs (b).

Figure 4: Cumulative regret for the different agents, together with the performance of the Marginal agent, evaluated for state $s = 10$ (a). Average reward per state (b).
## 5 CONCLUSION

We have studied the RC-CMAB problem, in which an intelligent entity, i.e., the decision-maker, observes the contexts of $N$ parallel CMAB processes, and has to decide on the actions depending on the current contexts and the past actions and rewards. However, the actions are implemented by a controller that is connected to the decision-maker through a rate-constrained communication link. First, we cast the problem into the proper information-theoretic framework, formulating it as a policy compression problem, and provided the optimal compression scheme in the limit of an infinite number of agents, when the adopted distortion measure is the KL-divergence between the compressed policy adopted by the controller and the policy of the decision-maker. We then characterized the minimum rate needed to obtain sub-linear regret, and proved that any rate above this threshold can achieve it. Finally, we designed a practical coding scheme to transmit the actions for finite $N$, which relies on a compressed state representation. We evaluated the performance of the policy obtained through the asymptotic information-theoretic formulation, and of the one obtained through the clustering scheme, and observed a small gap between the two. We numerically showed the relation between the asymptotic rate bound and the learning phase of the agents, showing how it is possible to save communication resources when training a multi-agent system like the one considered here. We believe that this work can serve as a first step towards understanding the fundamental performance limits of multi-agent decision-making problems under communication constraints, and that it highlights the intimate relation between the communication scheme and the learning process. Ongoing work includes deriving an information-theoretic converse result on the regret performance, and generalizing the framework to reinforcement learning problems.

## REFERENCES

Mridul Agarwal, Vaneet Aggarwal, and Kamyar Azizzadenesheli. Multi-agent multi-armed bandits with limited communication. arXiv:2102.08462 [cs], 2021.

Rudolf Ahlswede and Imre Csiszár. Hypothesis testing with communication constraints. IEEE Transactions on Information Theory, 32(4):533–542, Jul. 1986.

Toby Berger. Decentralized estimation and decision theory. In IEEE 7th Spring Workshop on Information Theory, Mt. Kisco, NY, Sep. 1979.

Djallel Bouneffouf and Irina Rish. A survey on practical applications of multi-armed and contextual bandits. arXiv:1904.10040 [cs.LG], 2019.

Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, USA, 2006a. ISBN 0471241954.

Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, USA, 2006b.

Imre Csiszár and Gabor Tusnády. Information geometry and alternating minimization procedures. Statistics and Decisions, Supplement Issue, pp. 205–237, 1984.

Jakob N. Foerster, Yannis M. Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. arXiv:1605.06676 [cs], May 2016.

Osama A. Hanna, Lin F. Yang, and Christina Fragouli. Solving multi-arm bandit using a few bits of communication. In 38th International Conference on Machine Learning, 2021.

Serhii Havrylov and Ivan Titov. Emergence of language with multi-agent games: Learning to communicate with sequences of symbols. 2017.
Maximilian Igl, Kamil Ciosek, Yingzhen Li, Sebastian Tschiatschek, Cheng Zhang, Sam Devlin, and Katja Hofmann. Generalization in reinforcement learning with selective noise injection and information bottleneck. In Advances in Neural Information Processing Systems, volume 32, pp. 13956–13968, 2019.

Mikolaj Jankowski, Deniz Gündüz, and Krystian Mikolajczyk. Wireless image retrieval at the edge. IEEE Journal on Selected Areas in Communications, 39(1):89–100, 2021. doi: 10.1109/JSAC.2020.3036955.

Cem Kalkanli and Ayfer Ozgur. Asymptotic convergence of Thompson sampling. arXiv:2011.03917v1, 2020.

Gerhard Kramer and Serap A. Savari. Communicating probability distributions. IEEE Transactions on Information Theory, 53(2):518–525, 2007. doi: 10.1109/TIT.2006.889015.

Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. Multi-agent cooperation and the emergence of (natural) language. arXiv:1612.07182 [cs], March 2017.

Hongju Park and Mohamad Kazem Shirani Faradonbeh. Analysis of Thompson sampling for partially observable contextual multi-armed bandits, 2021.

Jihong Park, Sumudu Samarakoon, Mehdi Bennis, and Mérouane Debbah. Wireless network intelligence at the edge. Proceedings of the IEEE, 107(11):2204–2239, 2019.

My Phan, Yasin Abbasi Yadkori, and Justin Domke. Thompson sampling and approximate inference. In Advances in Neural Information Processing Systems, 2019.

Daniel Russo. Simple Bayesian algorithms for best arm identification. In Vitaly Feldman, Alexander Rakhlin, and Ohad Shamir (eds.), 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pp. 1417–1418, Columbia University, New York, New York, USA, 23–26 Jun 2016. PMLR. URL https://proceedings.mlr.press/v49/russo16.html.

Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus. Learning multiagent communication with backpropagation. In Proc. of 30th Int'l Conf. on Neural Information Processing Systems, NIPS'16, pp. 2252–2260, Red Hook, NY, December 2016.

William R. Thompson. On the theory of apportionment. American Journal of Mathematics, 57(2):450–456, 1935.

Róbert Torfason, Fabian Mentzer, Eiríkur Ágústsson, Michael Tschannen, Radu Timofte, and Luc Van Gool. Towards image understanding from deep compression without decoding. In International Conference on Learning Representations, 2018.

Aolin Xu and Maxim Raginsky. Information-theoretic lower bounds on Bayes risk in decentralized estimation. IEEE Transactions on Information Theory, 63(3):1580–1600, 2017. doi: 10.1109/TIT.2016.2646342.

Yuchen Zhang, John Duchi, Michael I Jordan, and Martin J Wainwright. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In Advances in Neural Information Processing Systems, volume 26, 2013.

## APPENDIX A POLICIES DERIVATION

### A.1 OPTIMAL POLICY

To solve this problem, we consider the associated Lagrangian
$$\mathcal{L}\big(Q(a \mid s), \lambda, \mu\big) = \sum_s P(s) \sum_a Q(a \mid s) \log \frac{Q(a \mid s)}{\tilde{Q}(a)} + \lambda \left( \sum_s P(s) \sum_a Q(a \mid s) \log \frac{Q(a \mid s)}{P(a \mid s)} - D \right) + \mu \sum_s P(s) \left( \sum_a Q(a \mid s) - 1 \right),$$
where the Lagrange multiplier $\lambda$ has to be optimized to meet the constraint on the divergence, whereas $\mu$ ensures that the solution is a probability distribution, i.e., that its elements sum up to one. The positivity constraints are automatically satisfied by the fact that the solution has an exponential form.
We first take the derivative of the Lagrangian with respect to $Q(a \mid s)$ and set it to zero:
$$\frac{\partial \mathcal{L}(Q(a \mid s), \lambda, \mu)}{\partial Q(a \mid s)} = P(s)\left[\log\frac{Q(a \mid s)}{\tilde{Q}(a)} + 1\right] - \frac{P(s)}{\tilde{Q}(a)} \sum_{s'} P(s') Q(a \mid s') + \lambda P(s)\left[\log\frac{Q(a \mid s)}{P(a \mid s)} + 1\right] + \mu P(s) = 0.$$
Since $\tilde{Q}(a)$ is the marginal $\sum_{s'} P(s') Q(a \mid s')$, the second and third terms cancel, and we find
$$\log \frac{Q(a \mid s)^{1+\lambda}}{\tilde{Q}(a)\, P(a \mid s)^{\lambda}} = -(\lambda + \mu).$$
Rewriting with $\gamma = \frac{1}{1+\lambda} \in [0, 1]$, we obtain $Q_\gamma(a \mid s) = e^{-\frac{\mu+\lambda}{1+\lambda}}\, \tilde{Q}(a)^{\frac{1}{1+\lambda}}\, P(a \mid s)^{\frac{\lambda}{1+\lambda}}$, where $e^{-\frac{\mu+\lambda}{1+\lambda}}$ is the normalizing factor. The solution thus has the form
$$Q_\gamma(a \mid s) = \frac{\tilde{Q}(a)^{\gamma} P(a \mid s)^{1-\gamma}}{\sum_{a' \in \mathcal{A}} \tilde{Q}(a')^{\gamma} P(a' \mid s)^{1-\gamma}}, \qquad \forall s \in \mathcal{S}, \; a \in \mathcal{A}. \qquad (9)$$
By the convexity of the KL-divergence, the solution lies on the boundary of the constraint set, i.e., where $\mathbb{E}_{P_S}[D_{KL}(Q_{\gamma^*} \| P)] = D$. Consequently, if we can numerically find $\gamma^*$ such that $\mathbb{E}_{P_S}[D_{KL}(Q_{\gamma^*} \| P)] = D$, we solve Eq. (6).

### A.2 UPDATE CENTROIDS

Again, we compute the optimal centroids by solving the Lagrangian
$$\mathcal{L}(\mu_a^j, \lambda) = \sum_{s \in \mathcal{S}_j} P(s) \sum_{a \in \mathcal{A}} \mu_a^j \log \frac{\mu_a^j}{\pi(a \mid s)} + \lambda \left( \sum_{a \in \mathcal{A}} \mu_a^j - 1 \right).$$
Taking its derivative and setting it to zero,
$$\frac{\partial \mathcal{L}(\mu_a^j, \lambda)}{\partial \mu_a^j} = \sum_{s \in \mathcal{S}_j} P(s) \left[ \log \frac{\mu_a^j}{\pi(a \mid s)} + 1 \right] + \lambda = 0,$$
we find
$$\log \mu_a^j = \sum_{s \in \mathcal{S}_j} \frac{P(s)}{A(\mathcal{S}_j)} \log \pi(a \mid s) - 1 - \frac{\lambda}{A(\mathcal{S}_j)},$$
and therefore
$$\mu^j = \frac{1}{Z} \prod_{s \in \mathcal{S}_j} \pi(\cdot \mid s)^{\frac{P(s)}{A(\mathcal{S}_j)}},$$
where $Z$ is the normalizing factor, obtaining the form expressed in Eq. (8).

## APPENDIX B PRESENTED ALGORITHMS

In this appendix, the algorithms for the theoretically-optimal and cluster policies are described.

### B.1 THEORETICAL OPTIMAL POLICY

This is the pseudocode of the algorithm at the decision-maker side, when adopting the information-theoretically inspired compressed policy. The only task of the controller is to decode the received message and to communicate the actions to the agents. We denote by $[A]$ the set $\{1, \ldots, A\}$, and by $\mathrm{Unif}_{\mathcal{B}}$ the uniform distribution over a set $\mathcal{B}$. A runnable sketch of this loop is given after the pseudocode.

**Algorithm 1: Optimal Policy**

Input: rate $R$, number of agents $N$, number of actions $K$, state distribution $P_S$.
Initialize the Thompson Sampling policy $\pi$, and the compressed policy $Q(\cdot \mid s)$ to $\mathrm{Unif}_{[K]}$.
For each $t \in \{1, \ldots, T\}$:
  1. Observe $s_t^N = (s_{t,1}, \ldots, s_{t,N})$.
  2. Sample $a_t^N = (a_{t,1}, \ldots, a_{t,N})$ from $Q(\cdot \mid s)$.
  3. Encode $a_t^N$ into $u_t = f_t(a_t^N)$ and transmit it to the controller.
  4. Observe $r_t^N = (r_{t,1}, \ldots, r_{t,N})$.
  5. Update the action posteriors $\pi(\cdot \mid s)$ using the Thompson Sampling update.
  6. Update $Q(a \mid s)$ using the Blahut-Arimoto algorithm (iterate over Eq. (5) and Eq. (6)).
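Below is a minimal runnable sketch of Algorithm 1, assuming Beta-Bernoulli TS posteriors and reusing the `blahut_arimoto_kl` helper sketched in Sec. 3.3 (both are illustrative constructions rather than the authors' code). The channel itself is abstracted away: the sampled actions are assumed to be delivered to the controller within the rate budget implied by the distortion level `D`.

```python
import numpy as np

# Assumes blahut_arimoto_kl(P_s, pi, D) from the sketch in Sec. 3.3 is in scope.
rng = np.random.default_rng(4)
L, K, N, T, D = 8, 8, 50, 100, 0.1
P_s = np.full(L, 1.0 / L)
mu_env = np.where(np.eye(L, K) > 0, 0.8, rng.uniform(0.0, 0.75, (L, K)))  # toy environment
alpha = np.ones((L, K))
beta = np.ones((L, K))

def ts_policy(n_samples=2000):
    """Monte Carlo estimate of the TS-induced policy pi_t(a|s)."""
    draws = rng.beta(alpha, beta, size=(n_samples, L, K))
    winners = draws.argmax(axis=2)                        # (n_samples, L)
    return np.stack([np.bincount(winners[:, s], minlength=K) for s in range(L)]) / n_samples

for t in range(T):
    pi_t = ts_policy()
    Q_t, rate = blahut_arimoto_kl(P_s, pi_t, D)           # compressed policy and its rate, Eq. (2)
    s = rng.choice(L, size=N, p=P_s)                      # observe s_t^N
    a = np.array([rng.choice(K, p=Q_t[si]) for si in s])  # sample a_t^N from Q(.|s) and "transmit"
    r = rng.binomial(1, mu_env[s, a])                     # observe r_t^N
    np.add.at(alpha, (s, a), r)                           # TS posterior update (handles repeated indices)
    np.add.at(beta, (s, a), 1 - r)
```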
### B.2 CLUSTER POLICY

This is the pseudocode of the algorithm at the decision-maker side, when adopting the cluster policy. The task of the controller is to sample, at time $t$ and for each agent $i$, the action $a_{t,i}$ according to the last received centroid $\mu_{j(s_{t,i})}$, whose components convey the action probabilities.

**Algorithm 2: Cluster Policy**

Input: per-agent bits $B$, number of agents $N$, number of actions $K$, state distribution $P_S$, threshold $\beta$.
Initialize the Thompson Sampling policy $\pi$.
Compute the centroids $\mu_j$ for $j = 1, \ldots, 2^B$ from $\pi$ with the Lloyd algorithm and the rule in Eq. (8).
Transmit the centroids $\mu_j$ to all agents, and set $\pi^{\mathrm{last}} = \pi$.
For each $t \in \{1, \ldots, T\}$:
  1. Observe $s_t^N = (s_{t,1}, \ldots, s_{t,N})$.
  2. If $D_{KL}(\pi^{\mathrm{last}} \| \pi) > \beta$: update the centroids $\mu_j$ from $\pi$ with the Lloyd algorithm and the rule in Eq. (8), transmit the centroids $\mu_j$ to all agents, and set $\pi^{\mathrm{last}} = \pi$.
  3. For all $s_{t,i}$, compute the index $j(s_{t,i})$ of its cluster.
  4. Transmit the vector $j_t^N = (j(s_{t,1}), \ldots, j(s_{t,N}))$.
  5. Observe $r_t^N = (r_{t,1}, \ldots, r_{t,N})$.
  6. Update the action posteriors $\pi(\cdot \mid s)$ using the Thompson Sampling update.

## APPENDIX C EXPERIMENTS

In this section, additional experiments are provided to clarify the proposed scheme and to quantify the trade-offs between rate and regret.

### C.1 STATE REPRESENTATION

In Fig. 5 we report the policy clustering results obtained when applying the coding scheme explained in Sec. 3.5, when $R = 2$ bits can be used to represent 20 randomly generated policies. The figure shows the 4 policies representative of the 4 compressed state representations. As we can see, the policy centroids found by the coding scheme try to fairly resemble the shapes of all the policies within the same cluster.

Figure 5: Clusters and their representatives with $L = 20$ and $B = 2$.

### C.2 DETERMINISTIC

In this section we report some more details on the RC-CMAB experiment presented in Sec. 4.2.

Figure 6: Best action probability for a given state, for the posterior $\pi(a^* \mid s)$ (a) and the compressed policy $Q(a^* \mid s)$ (b) for the different agents. KL-divergence between the agents' action posteriors and the target one (c).

Figure 8: Cumulative average regret $\pm$ one standard deviation for the different agents, from state $s = 0$ to state $s = 15$.

### C.3 8-BLOCK EXPERIMENT

In this second RC-CMAB experiment, the setting is similar to the one presented above, but the best action response is not a one-to-one mapping with the state. Again, $a \in \{0, \ldots, 15\}$ and $s \in \{0, \ldots, 15\}$, but the Bernoulli parameter $\mu_a(s)$ of action $a$ in state $s$ is $0.8$ if $\lfloor s/2 \rfloor = a$, and sampled uniformly in $(0, 0.75]$ otherwise. Thus, the best action responses are grouped into 8 different classes, depending on the state realization. In this case, the rate is limited to $R = 2$.

Figure 9: Block experiment: asymptotic rates to convey the Perfect and Comm policies, and the bits used by the Cluster agent, averaged over 5 runs (a). Average reward achieved in each state by the pool of agents (b).

Figure 10: Block experiment: best action probability for a given state, for the posterior $\pi(a^* \mid s)$ (a) and the compressed policy $Q(a^* \mid s)$ (b) for the different agents. KL-divergence between the agents' action posteriors and the target one (c).

Figure 12: Block experiment: cumulative average regret $\pm$ one standard deviation for the different agents, from state $s = 0$ to state $s = 15$.

### C.4 REGRET VS NUMBER OF CLASSES

In this last experiment, the environment is the same as the one described in Sec. 4.2, apart from the number of agents, which here is $N = 50$. The task is to quantify the effect of the number of per-agent bits $B$ used by the cluster policy on the achievable regret, with $B$ varying from 1 to 4. As we can see from the plots below, when $B$ is equal to 1 or 2, which means that the number of clusters is, respectively, 2 and 4, the regret is not sub-linear in several states. However, the policy with just 2 bits can achieve performance comparable to the 3- and 4-bit policies in some states, e.g., states 4 and 6.
This is due to the fact that, even in those cases without enough bits, the posteriors still converge to good solutions; thus, if a cluster contains, for example, a single state, it will convey the best policy for that state, achieving sub-linear regret. On the contrary, when a cluster contains more states, the representative policy can fail to achieve optimal performance if those states' policies are substantially different.

Figure 14: Deterministic experiment: cumulative average regret $\pm$ one standard deviation for the cluster policy, when $B \in \{1, 2, 3, 4\}$, from state $s = 0$ to state $s = 15$.

## APPENDIX D REGRET BOUND

To prove our statements, we denote by $H_q(X)$ and $I_q(X; Y)$ the marginal entropy and the mutual information w.r.t. the joint probability $q$, i.e., $H(A^*) = H_{\pi^*}(A)$. We start by proving Lemma D.1.

**Lemma D.1.** *If the exact Thompson Sampling policies $\pi_t(a \mid s)$ achieve sub-linear Bayesian regret for all states $s \in \mathcal{S}$, then $\lim_{t \to \infty} I_{\pi_t}(S; A) = \lim_{t \to \infty} H_{\pi_t}(A) = H(A^*)$.*

*Proof.* First of all, we write $I_{\pi_t}(S; A) = H_{\pi_t}(A) - H_{\pi_t}(A \mid S)$. Following Theorem 2 from Kalkanli & Ozgur (2020), if $\pi_t(a \mid s)$ achieves sub-linear regret $\forall s \in \mathcal{S}$, then $\forall s \in \mathcal{S}$,
$$\lim_{t \to \infty} \pi_t(a = a^* \mid s) = 1, \quad \text{where } a^* \text{ is the optimal arm}, \qquad \lim_{t \to \infty} \pi_t(a = a' \mid s) = 0, \quad \text{when } a' \neq a^*.$$
Consequently, in the limit $\pi_t(a \mid s)$ is a deterministic function, and thus
$$\lim_{t \to \infty} H_{\pi_t}(A \mid S) = 0.$$
Moreover, since $\pi_t(\cdot \mid s)$ converges to the indicator of the optimal arm in each state, the marginal over the actions converges to $\pi^*(a)$, so $\lim_{t \to \infty} H_{\pi_t}(A) = H(A^*)$, which concludes our proof.

Then, we prove Lemma 3.1, which is repeated below.

**Lemma 3.1.** *If $R < H(A^*)$, then it is not possible to convey a policy $Q_t(s, a)$ that achieves sub-linear regret in all states $s \in \mathcal{S}$.*

*Proof.* Following Lemma 10 in Phan et al. (2019), if $\lim_{t \to \infty} \pi_t(a^* \mid s) = 1$ and $D_{KL}(Q_t \| \pi_t) < \epsilon$, $\forall t$, for some $\epsilon > 0$, then $\lim_{t \to \infty} Q_t(a^* \mid s) = 1$, inducing $\lim_{t \to \infty} D_{KL}(Q_t \| \pi_t) = 0$. Now, for any policy $Q_t$, if the minimal rate to convey it, $\hat{R}_t = I_{Q_t}(S; A)$, is below the threshold of the optimal policy, i.e., $\hat{R}_t < H(A^*)$, then, by Eq. (2), we have
$$D_{KL}(Q_t \| \pi^*) > 0, \qquad (10)$$
which holds also in the limit. Otherwise, we would have a policy $Q_t$ s.t. $D_{KL}(Q_t \| \pi^*) = 0$, which could be conveyed at a rate strictly below that characterized by Eq. (2), contradicting the definition of the rate-distortion function. Consequently, by Lemma 10 in Phan et al. (2019) and Eq. (10), there cannot be any $\epsilon > 0$ s.t. $\lim_{t \to \infty} D_{KL}(Q_t \| \pi_t) < \epsilon$; and hence, $\lim_{t \to \infty} D_{KL}(Q_t \| \pi_t) = \infty$. However, if $\lim_{t \to \infty} D_{KL}(Q_t \| \pi_t) = \infty$, then $\exists (s, a) \in \mathcal{S} \times \mathcal{A}$ s.t. $\lim_{t \to \infty} Q_t(a \mid s) = c > 0$ with $a \neq a^*$. This implies that $Q_t(s, a)$ plays a sub-optimal arm in state $s$ with constant probability, and so it cannot achieve sub-linear regret in all states.

Finally, we provide a proof for Lemma 3.2, which we repeat below.

**Lemma 3.2.** *If $R > H(A^*)$, then achieving sub-linear regret is possible in all states $s \in \mathcal{S}$.*

*Proof.* We denote by $R^{\pi_t}$ the rate needed to convey the TS policy perfectly to the controller at time $t$, and let $\delta > 0$ be such that $R = H(A^*) + \delta$, where $R$ is the available communication rate. We now provide a scheme that guarantees sub-linear regret. First, generate the $\rho_t$ parameters as in Sec. 3.3.
As long as $R^{\pi_t} > R$, with probability $\rho_t$ play $a_t$ uniformly at random, and with probability $1 - \rho_t$ play according to a policy $Q_t(a \mid s)$ that satisfies the rate-distortion constraint $I(A; S) < R$ under $Q_t(a \mid s)$, which can therefore be transmitted to the controller, and which has a bounded reverse KL-divergence from the TS policy $\pi_t$ at time $t$. Following Lemma 14 in Russo (2016), in this way enough exploration is guaranteed for the TS policy to concentrate, and there exists a finite $t_0$ s.t. $\forall t > t_0$, $R^{\pi_t} < H(A^*) + \delta$. This means that, for the first $t_0$ rounds, both the target and the approximating policies play sub-optimal arms with non-zero probability. However, their average gap is within a constant, given that the average rewards are bounded within $[0, 1]$. Then, $\forall t > t_0$ it is possible to play the exact TS policy, leading to the optimal policy for all future steps, and hence to a sub-linear regret.