# CONTEXTUAL MULTI-ARMED BANDITS WITH COMMUNICATION CONSTRAINTS

**Anonymous authors**
Paper under double-blind review

## ABSTRACT

We consider a remote Contextual Multi-Armed Bandit (CMAB) problem, in which the decision-maker observes the context and the reward, but must communicate the actions to be taken by the agents over a rate-limited communication channel. This can model, for example, a personalized ad placement application, where the content owner observes the individual visitors to its website, and hence has the context information, but must convey the ads to be shown to each visitor to a separate entity that manages the marketing content. In this Rate-Constrained CMAB (RC-CMAB) problem, the constraint on the communication rate between the decision-maker and the agents imposes a trade-off between the number of bits sent per agent and the acquired average reward. We are particularly interested in the scenario in which the number of agents and the number of possible actions are large, while the communication budget is limited. Consequently, the problem can be cast as a policy compression problem, where the distortion metric is induced by the learning objectives. We first study the fundamental information-theoretic limits of this problem by letting the number of agents go to infinity, and characterize the regret that can be achieved. In particular, we identify two distinct rate regions resulting in linear and sub-linear regret behaviour, respectively. Then, we propose a practical coding scheme, and provide numerical results for the achieved regret.

## 1 INTRODUCTION

In the last few years, synergies between Machine Learning (ML) and communication networks have attracted a lot of interest in the research community, thanks to the fruitful interplay of the two fields in emerging applications ranging from the Internet of Things (IoT) to autonomous vehicles and other edge services. In most of these applications, both the generated data and the processing power are distributed across a network of physically distant devices; thus, a reliable communication infrastructure is pivotal to run ML algorithms that can leverage the collected distributed knowledge (Park et al., 2019). To this end, many recent works have tried to redesign networks and to efficiently represent information to support distributed ML applications, where the activities of data collection, processing, learning and inference are performed in different geographical locations, and should account for limited communication, memory, or processing resources, as well as privacy constraints. In contrast to the insatiable growth in our desire to gather more data and intelligence, the available communication resources (bandwidth and power, in particular) are highly limited, and must be shared among many different devices and applications. This requires the design of highly communication-efficient distributed learning algorithms, particularly for edge applications. Information theory, in particular rate-distortion theory, has established the fundamental limits of efficient data compression with the aim of reconstructing the source signal with the highest fidelity (Cover & Thomas, 2006b). However, in the aforementioned applications, the goal is often not to reconstruct the source signal, but to make some inference based on it. This requires task-oriented compression, filtering out the information that is unnecessary for the target application, and thus decreasing the number of bits that have to be transmitted over the communication network.
This approach should target the questions of *what is the most useful information that has to be sent, and how it should be represented*, in order to meet the application requirements while consuming the minimum amount of network resources. Our goal in this paper is to investigate a theoretically grounded method to efficiently transmit data in a Contextual Multi-Armed Bandit (CMAB) problem, in which the context information is available to a decision-maker, whereas the actions are taken by a remote entity, called the controller, which controls a multitude of agents. We assume that a rate-limited communication link is available between the decision-maker and the controller to communicate, at each round, the intended actions. The controller must decide on the actions to take based on the message received over the channel, while the decision-maker observes the rewards at each round and updates its policy accordingly. This scenario can model, for example, a personalized ad placement application, where the content owner observes the individual visitors to its website, and hence has the context information, but must convey the ads to be shown to each visitor to a separate entity that manages the marketing content. This requires communicating hundreds or thousands of ads to be placed at each round, chosen from among a large set of possible ads, within the communication resource and delay constraints of the underlying communication channel, which we quantify as the number of bits available per agent. Similar communication constraints between the decision-maker and the controller arise in many other applications of CMABs (Bouneffouf & Rish, 2019).

### 1.1 RELATED WORK

Given the amount of data that is generated by machines, sensors and mobile devices, the design of distributed learning algorithms is a hot topic in the ML literature. These algorithms often impose communication constraints among agents, requiring the design of methods that allow efficient representation of the messages to be exchanged. While rate-distortion theory deals with efficient lossy transmission of signals (Cover & Thomas, 2006b), in ML applications we typically do not need to reconstruct the underlying signal, but to make some inference based on it. These applications can be modeled through distributed hypothesis testing (Berger, 1979; Ahlswede & Csiszár, 1986) and estimation (Zhang et al., 2013; Xu & Raginsky, 2017) problems under rate constraints. In parallel to the theoretical rate-distortion analysis, significant research effort has been invested in the design of practical data compression algorithms focusing on specific information sources, such as JPEG and BPG for image compression, or MPEG and H.264 for video compression. While adapting these tools to specific inference tasks is difficult, deep learning techniques have recently been employed to learn task-specific compression algorithms (Torfason et al., 2018; Jankowski et al., 2021), which achieve significant efficiency gains by bypassing image reconstruction. While the above works mainly focus on inference through supervised learning, here we consider the CMAB problem. There is a growing literature on multi-agent Reinforcement Learning (RL) problems with communication links (Foerster et al., 2016; Sukhbaatar et al., 2016; Havrylov & Titov, 2017; Lazaridou et al., 2017). These papers consider a multi-agent partially observable Markov decision process (POMDP), where the agents collaborate to solve a specific task.
In addition to the usual reward signals, agents can also exploit the available communication links to better cooperate and coordinate their actions. It has been shown that communication can help overcome the inherent non-stationarity of the multi-agent environment. Our problem can be considered as a special case of this general RL formulation, in which the state at each time is independent of the past states and actions. Moreover, we focus on a particular setting in which the communication is one-way, from the decision-maker, which observes the states and the rewards, towards the controller, which takes the actions. This formulation is different from existing results in the literature on multi-agent Multi-Armed Bandits (MABs). In Agarwal et al. (2021), each agent can pull an arm and communicate with the others. The authors do not consider the contextual case, and focus on a particular communication scheme, where each agent shares the index of the best arm according to its experience. Another related formulation is proposed in Hanna et al. (2021), where a pool of agents collaborate to solve a common MAB problem with a rate-constrained communication channel from the agents to the server. In this case, agents observe their rewards and upload them to the server, which in turn updates the policy used to instruct them. In Park & Faradonbeh (2021), the authors consider a partially observable CMAB scenario, where the agent has only partial information about the context. However, that paper does not consider any communication constraint, and the partial/noisy view of the context is generated by nature. Differently from the existing literature, our goal here is to identify the fundamental information-theoretic limits of learning with communication constraints in this particular scenario.

## 2 PROBLEM FORMULATION

### 2.1 CONTEXTUAL MULTI-ARMED BANDIT (CMAB) PROBLEM

We consider $N$ agents, which experience independent realizations of the same CMAB problem. The CMAB is a sequential decision game in which the environment imposes a probability distribution $P_S$ over a set of contexts, or states, $\mathcal{S}$, which is finite in our case. The game proceeds in rounds $t = 1, \ldots, T$. At each round $t$, and for each agent $i \in \{1, \ldots, N\}$, a realization of the state $s_{t,i} \in \mathcal{S}$ is sampled i.i.d. according to $P_S$. In the usual CMAB setting, the decision-maker would observe the states $\{s_{t,i}\}_{i=1}^N$ of the agents, and choose an action (or arm) $a_{t,i} \in \mathcal{A} = \{1, \ldots, K\}$ for each agent, where $K$ is the total number of available actions, with probability $\pi_t(a_{t,i} \mid s_{t,i})$ given by a (possibly) stochastic policy $\pi_t : \mathcal{S} \to \Delta_K$. Here $\Delta_K$ is the $K$-dimensional simplex, containing all possible distributions over the set of actions. Once the actions have been taken by all the agents, the environment returns rewards for all the agents following independent realizations of the same reward process, $r_{t,i} = r(s_{t,i}, a_{t,i}) \sim P_R(r \mid s_{t,i}, a_{t,i})$, $\forall i \in \{1, \ldots, N\}$, which depend on the state and the action of the corresponding agent. Based on the history of returns up to round $t-1$, the decision-maker can optimize its policy to maximize the sum of the rewards $G = \sum_{t=1}^{T} \sum_{i=1}^{N} r_{t,i}$. A minimal simulation of this interaction, under illustrative distributional assumptions, is sketched below.
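To make the formulation concrete, the following minimal sketch simulates this interaction for one run. The uniform state distribution, the Bernoulli rewards, and the randomly drawn policy are illustrative placeholders, not the setup used in the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

L, K, N, T = 4, 3, 10, 100          # |S|, |A|, number of agents, rounds
P_S = np.full(L, 1.0 / L)           # uniform state distribution (illustrative)
mu = rng.uniform(0.0, 1.0, (L, K))  # Bernoulli reward means, one per (state, action)

def policy(t):
    """A (possibly time-varying) stochastic policy pi_t: one distribution over A per state."""
    p = rng.uniform(size=(L, K))
    return p / p.sum(axis=1, keepdims=True)

G = 0.0                                                   # cumulative reward over rounds and agents
for t in range(T):
    pi_t = policy(t)
    s = rng.choice(L, size=N, p=P_S)                      # i.i.d. states, one per agent
    a = np.array([rng.choice(K, p=pi_t[si]) for si in s]) # actions sampled from pi_t(.|s)
    r = rng.binomial(1, mu[s, a])                         # rewards r_{t,i} ~ P_R(.|s,a)
    G += r.sum()

print(f"cumulative reward G = {G:.0f}")
```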
### 2.2 RATE-CONSTRAINED CMAB

In our modified version of the problem, the process of observing the system states is spatially separated from the process of taking the actions. The environment states, $\{s_{t,i}\}_{i=1}^N$, are observed by a central entity, called the decision-maker, which has to communicate to the controller, over a rate-constrained communication channel, the information about the actions $\{a_{t,i}\}_{i=1}^N$ the agents should take at each round $t$. The decision-maker can exploit the knowledge accumulated from the states and rewards observed by all the agents up to round $t-1$, denoted by
$$H(t-1) = \Big( \big(\{s_{1,i}\}_{i=1}^N, \{a_{1,i}\}_{i=1}^N, \{r_{1,i}\}_{i=1}^N\big), \ldots, \big(\{s_{t-1,i}\}_{i=1}^N, \{a_{t-1,i}\}_{i=1}^N, \{r_{t-1,i}\}_{i=1}^N\big) \Big) \in \mathcal{H}^{(t-1)},$$
to optimize the policy to be used at round $t$. Consequently, the problem is to communicate the action distribution, i.e., the policy $\pi_t(a \mid s_t)$, which depends on the specific state realizations observed in round $t$, to the controller within the available communication resources, while incurring the minimal impact on the performance of the learning algorithm. Specifically, the decision-maker employs a function $f_t^{(N)} : \mathcal{H}^{(t-1)} \times \mathcal{S}^N \to \{1, 2, \ldots, B\}$ to map the history up to time $t$ and the states of the agents at time $t$ to a message index to be transmitted over the channel. The controller, on the other hand, employs a function $g_t^{(N)} : \{1, 2, \ldots, B\} \to \mathcal{A}^N$ to map the received message to a set of actions for the agents. In general, both functions $f_t^{(N)}$ and $g_t^{(N)}$ can be stochastic. The average per-period regret achieved by the sequences $\{f_t^{(N)}, g_t^{(N)}\}_{t=1}^T$ is given by
$$\rho^{(N)}(T) = \frac{1}{T}\, \mathbb{E}\left[ \sum_{t=1}^{T} \sum_{i=1}^{N} \Big( r\big(s_{t,i}, a^*(s_{t,i})\big) - r\big(s_{t,i}, a_{t,i}\big) \Big) \right], \qquad (1)$$
where $a_{t,i} = g_{t,i}(m(t))$ is the action taken by agent $i$ based on the message $m(t) = f_t^{(N)}\big(H(t-1), \{s_{t,i}\}_{i=1}^N\big)$ transmitted in round $t$, $a^*(s_{t,i})$ is the action with maximum mean reward in state $s_{t,i}$, i.e., the optimal action, and the expectation is taken with respect to the state distribution $P_S$ and the reward distribution $P_R$. We say that, for a given time horizon $T$ and $N$ agents, an average regret-communication rate pair $(\rho, R)$ is achievable if there exist functions $\{f_t^{(N)}, g_t^{(N)}\}_{t=1}^T$ as defined above with rate $\frac{1}{N} \log_2 B \le R$ and regret $\rho^{(N)}(T) \le \rho$.

If a sufficiently large rate is available for communication, i.e., $R \ge \log_2 K$, then the intended action for each agent can be reliably conveyed to the controller. Otherwise, to achieve the learning goal while satisfying the rate constraint, the decision-maker must apply a lossy compression scheme, such that the action distribution adopted by the pool of agents resembles the intended policy as much as possible. A sketch of this encoder/decoder interface, together with an empirical evaluation of the regret in Eq. (1), is given below.

Figure 1: The problem of communicating policies.
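The sketch below illustrates the encoder/decoder interface just defined and an empirical version of the regret in Eq. (1). It assumes a toy setting in which the message budget $B$ equals $K^N$, so the identity mapping between action tuples and message indices suffices; the names `f_t`, `g_t` and `per_period_regret` mirror the notation above but are otherwise hypothetical, and the mean-reward matrix is assumed known only to evaluate the regret.

```python
import numpy as np
from itertools import product

K, N = 4, 3                  # actions and agents kept tiny so the message set is enumerable
B = K ** N                   # message budget large enough to index every action tuple
R = np.log2(B) / N           # rate in bits per agent; here R = log2(K)

ACTION_TUPLES = list(product(range(K), repeat=N))   # g maps a message index to an action tuple

def f_t(history, states, intended_actions):
    """Encoder f_t: here simply the index of the intended action tuple (needs R = log2 K)."""
    return ACTION_TUPLES.index(tuple(intended_actions))

def g_t(message):
    """Decoder g_t: recovers one action per agent from the message index."""
    return np.array(ACTION_TUPLES[message])

def per_period_regret(states, actions, mu):
    """Empirical Eq. (1): states, actions of shape (T, N); mu is the (|S|, K) mean-reward matrix."""
    best = mu[states].max(axis=-1)            # r(s, a*(s)) for every (t, i)
    taken = mu[states, actions]               # r(s, a_{t,i})
    return (best - taken).sum(axis=1).mean()  # (1/T) sum_t sum_i (...)

print(f"rate per agent: {R:.2f} bits")
```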
## 3 SOLUTION

In this section, we present the proposed solution to the problem. First, we briefly describe the algorithm adopted by the decision-maker to solve the CMAB. Then, we provide the communication scheme for the RC-CMAB with a particular distortion function, which is inherited from the learning task. We then identify two distinct rate regions resulting in linear and sub-linear regret, and propose a more practical coding scheme.

### 3.1 THOMPSON SAMPLING

In the proposed solution to the CMAB problem, the decision-maker adopts Thompson Sampling (TS) (Thompson, 1935). In particular, it makes use of one TS instance for each state $s \in \mathcal{S}$. Consequently, the decision-maker maintains an estimate of the distribution $p_t^{s,a}(\mu)$ of the mean reward $\mu^{s,a} \in \mathcal{D} \subseteq \mathbb{R}$ of action $a$ in state $s$ at time $t$. To take a decision in state $s_t$, the decision-maker samples $\mu_t^{s_t,a} \sim p_t^{s_t,a}$, $\forall a \in \mathcal{A}$, and takes the action $a^* = \arg\max_{a \in \mathcal{A}} \{\mu_t^{s_t,a}\}$. This procedure is repeated for each agent $i \in \{1, \ldots, N\}$. After receiving the rewards $\{r_{t,i}\}_{i=1}^N$, the decision-maker updates its beliefs on $\mu^{s,a}$, i.e., the distributions $p_t^{s,a}(\mu)$, in order to minimize the regret. We notice that this strategy induces a probability distribution $\pi_t(a \mid s)$ over the actions, given by
$$\pi_t(a \mid s) = \int_{\mathcal{D}} p_t^{s,a}(\mu) \prod_{j=1, j \neq a}^{K} P_t^{s,j}(\mu) \, d\mu,$$
where $P_t^{s,j}(\mu)$ is the Cumulative Distribution Function (CDF) of $\mu^{s,j}$, and the random variables $\mu^{s,a}$ are treated as independently distributed. However, the rate constraint imposed by the RC-CMAB formulation makes it infeasible for the decision-maker to sample, through the pool of agents, the actions directly from the true distribution $\pi_t(a \mid s)$. The agents have to use a proxy $Q_t(a \mid s)$, which is the one obtained from the received message. This problem is similar to approximate TS, where a proxy distribution is used to sample the actions, or the reward means, because the true distribution is too complex to sample from. In that case, the bottleneck is the complexity of the mean reward distributions, whereas in this work it is imposed by the limited-rate communication channel between the decision-maker and the controller. A minimal per-state TS sketch, including a Monte Carlo estimate of the induced policy $\pi_t(a \mid s)$, is given below.
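The following sketch instantiates one TS posterior per (state, action) pair with Beta-Bernoulli updates (a concrete choice made here for illustration; the analysis in Sec. 3.4 only requires canonical exponential-family rewards), and estimates the induced policy $\pi_t(a \mid s)$ by Monte Carlo as the probability that arm $a$ produces the largest posterior sample.

```python
import numpy as np

rng = np.random.default_rng(2)

L, K = 4, 3
alpha = np.ones((L, K))   # Beta posterior parameters per (state, action)
beta = np.ones((L, K))

def ts_action(s):
    """One TS decision in state s: sample a mean per arm and take the argmax."""
    samples = rng.beta(alpha[s], beta[s])
    return int(np.argmax(samples))

def ts_update(s, a, r):
    """Bernoulli reward r in {0, 1} updates the Beta posterior of arm a in state s."""
    alpha[s, a] += r
    beta[s, a] += 1 - r

def induced_policy(n_samples=5000):
    """Monte Carlo estimate of pi_t(a|s) = P(arm a has the largest posterior sample)."""
    pi = np.zeros((L, K))
    for s in range(L):
        draws = rng.beta(alpha[s], beta[s], size=(n_samples, K))
        winners = draws.argmax(axis=1)
        pi[s] = np.bincount(winners, minlength=K) / n_samples
    return pi

# A short interaction with a toy Bernoulli environment (means chosen only for illustration).
mu = np.array([[0.8, 0.4, 0.2], [0.3, 0.7, 0.1], [0.2, 0.2, 0.9], [0.5, 0.5, 0.5]])
for _ in range(200):
    s = rng.integers(L)
    a = ts_action(s)
    r = rng.binomial(1, mu[s, a])
    ts_update(s, a, r)
print(np.round(induced_policy(), 2))
```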
### 3.2 OPTIMAL SOLUTION FOR THE RC-CMAB

We model the environment as a Discrete Memoryless Source (DMS) that generates states from a finite alphabet $\mathcal{S}$ with probability $P_S$, emitting sequences of $N$ symbols $s^N = (s_1, \ldots, s_N)$, one per agent. We denote by $\hat{Q}_{s^N}(m)$ the empirical probability of state $m$ in $s^N$. We also consider the sequence of actions $a^N$, and denote by $\hat{Q}_{z^N}(m, j)$ the empirical joint probability of the pair $(m, j)$ in $z^N = ((s_1, a_1), \ldots, (s_N, a_N))$. The whole picture can be seen in Fig. 1. In the figure, the actions taken by the agents are denoted by $\hat{a}$ to indicate that they can differ from $a$. However, since we are interested in the probability distributions generating the sequences, we denote by $A$ the random variables indicating the actions at both the controller and the decision-maker side. We assume that the distribution $P_S$ is known (or accurately estimated). The decision-maker observes the realization $s^N$ of the contexts, and its task is to transmit an index $u \in \{1, \ldots, B\}$ from which the agents can generate a sequence $a^N$, such that $\hat{Q}_{s^N a^N}$ is close to $P_{SA}(s, a) = P_S(s)\pi(a \mid s)$, where closeness is measured by a distortion $\mathbb{E}[d(\hat{Q}_{S^N A^N}, P_{SA})]$, which in general is not an average of a per-letter distortion measure. The problem is thus a compression task in which the server has complete (or partial) knowledge of the states $s^N$, and wants to convey a conditional probability distribution $\pi_{A|S}$ to the agents, consuming the minimum number of bits, in such a way that the empirical distribution $\hat{Q}_{s^N a^N}$ induced by the agents' sequence is close to the joint distribution $P_{SA}$ induced by the policy.

For a distortion function $d(Q_{SA}, P_{SA})$ that is 1) nonnegative, 2) upper bounded by a constant $D_{\max}$, 3) continuous in $Q_{SA}$, and 4) convex in $Q_{SA}$, Kramer & Savari (2007) provide the rate-distortion function $R(D)$, i.e., the minimum rate $R = \frac{\log_2 B}{N}$ bits per symbol such that $\mathbb{E}[d(\hat{Q}_{S^N A^N}, P_{SA})] \le D$, in the limit as $N$ grows arbitrarily large. The solution is given by
$$R(D) = \min_{Q_{A|S} : \, d(Q_{SA}, P_{SA}) \le D} I(S; A), \qquad (2)$$
where $Q_{SA} = P_S Q_{A|S}$ is the joint probability induced by the environment distribution $P_S$ and the policy $Q_{A|S}$, which depends on the information sent by the decision-maker. As we can see, in the asymptotic limit of $N$ agents, the problem admits a single-letter solution, which also serves as a lower bound for the finite-agent scenario. The RC-CMAB model described in Sec. 2.2 fits into this framework, and in the following section we identify an appropriate distortion function and provide its characterization.

### 3.3 THE KL-DIVERGENCE AS DISTORTION FUNCTION

When applying TS to the RC-CMAB problem with limited communication rate, the decision-maker may not be able to induce the controller to take samples from the true policy $\pi_t(a \mid s)$. In Phan et al. (2019), the authors provide theoretical guidelines to construct approximate sampling policies that make the posteriors $p_t^{s,a}$ concentrate, achieving sub-linear regret. In particular, they studied the case in which the sampling distribution $Q$ differs from the target posterior $\pi$, using the $\alpha$-divergence $D_\alpha(\pi, Q)$ as the distortion measure, defined as
$$D_\alpha(\pi, Q) = \frac{1 - \int \pi(x)^\alpha Q(x)^{1-\alpha} dx}{\alpha(1 - \alpha)}. \qquad (3)$$
Theorem 1 of Phan et al. (2019) shows that, for $\alpha > 0$, the condition $D_\alpha(\pi, Q) < \delta$ with $\delta > 0$ cannot guarantee that the posterior $\pi$ concentrates in sub-linear time, even for very small values of $\delta$. On the contrary, for $\alpha \le 0$, the authors provide a scheme that can guarantee sub-linear regret. In particular, they suggest introducing an exploration term $\rho_t$: the actions are sampled from $Q_t$ with probability $1 - \rho_t$, and uniformly at random with probability $\rho_t \in o(1)$ such that $\sum_{t=1}^{\infty} \rho_t = \infty$, while $D_\alpha(\pi_t, Q_t) < \delta$ for all $t$. In this case, it is possible to obtain sub-linear regret. Consequently, to solve the proposed RC-CMAB problem, we investigate this strategy with $\alpha = 0$, which corresponds to the reverse KL-divergence $D_{KL}(Q, \pi) = \sum_{x \in \mathcal{X}} Q(x) \log \frac{Q(x)}{\pi(x)}$, when $x$ takes values in the discrete set $\mathcal{X}$. To find the optimal constrained policy $Q_t(a \mid s)$, we therefore need to solve Eq. (2) as a rate-distortion optimization problem with the reverse KL-divergence as the distortion function. Both divergences are straightforward to evaluate for discrete distributions, as in the sketch below.
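As a small illustration, the helper functions below (hypothetical names, not from the paper's code) evaluate the $\alpha$-divergence of Eq. (3) and the reverse KL-divergence for discrete distributions; a small constant is added for numerical stability.

```python
import numpy as np

def alpha_divergence(pi, q, alpha, eps=1e-12):
    """D_alpha(pi, q) of Eq. (3) for discrete distributions; alpha not in {0, 1}."""
    pi = np.clip(np.asarray(pi, dtype=float), eps, None)
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    return (1.0 - np.sum(pi ** alpha * q ** (1.0 - alpha))) / (alpha * (1.0 - alpha))

def reverse_kl(q, pi, eps=1e-12):
    """D_KL(Q || pi) = sum_x Q(x) log(Q(x) / pi(x)); the alpha -> 0 case used in the paper."""
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    pi = np.clip(np.asarray(pi, dtype=float), eps, None)
    return float(np.sum(q * np.log(q / pi)))

pi = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(alpha_divergence(pi, q, alpha=0.5), reverse_kl(q, pi))
```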
We can rewrite the optimization objective of Eq. (2) as a double minimization problem (Sec. 10.8, Cover & Thomas, 2006a):
$$R(D) = \min_{\tilde{Q}(a)} \; \min_{Q_{A|S} : \, d(Q_{SA}, P_{SA}) \le D} \; \sum_{s,a} P(s) Q(a \mid s) \log_2 \frac{Q(a \mid s)}{\tilde{Q}(a)}. \qquad (4)$$
Following Lemma 10.8.1 of Cover & Thomas (2006a), the marginal $Q^*(y) = \sum_x P(x) Q(y \mid x)$ has the property
$$Q^*(y) = \arg\min_{\tilde{Q}(y)} D_{KL}\big(P(x)Q(y \mid x) \,\|\, P(x)\tilde{Q}(y)\big), \qquad (5)$$
that is, it minimizes the KL-divergence between the joint and the product $P(x)\tilde{Q}(y)$, over all $\tilde{Q} \in \Delta_K$. This means that the $\tilde{Q}(a)$ obtained by solving Eq. (4) is indeed the marginal over the actions induced by $Q(a \mid s)$. Exploiting this formulation, it is possible to apply the iterative Blahut-Arimoto algorithm to solve the problem. In particular, since both minimization sets are sets of probability distributions and the objective is a KL-divergence, the algorithm converges to the minimum (Csiszár & Tusnády, 1984). The process is initialized by setting a random $\tilde{Q}_0(a)$, which is kept fixed to compute
$$Q_1^*(a \mid s) = \arg\min_{Q_{A|S} : \, d(Q_{SA}, P_{SA}) \le D} \; \sum_{s,a} P(s) Q(a \mid s) \log_2 \frac{Q(a \mid s)}{\tilde{Q}_0(a)}. \qquad (6)$$
From $Q_1^*(a \mid s)$, we compute the optimal $\tilde{Q}_1(a)$ by solving Eq. (5), which is simply the marginal $Q_1^*(a) = \sum_s P(s) Q_1^*(a \mid s)$. The process is iterated until convergence.

We now solve the inner minimization problem, i.e., Eq. (6) with fixed $\tilde{Q}(a)$. The constraint on the distortion keeps $Q_{A|S}$ close to $\pi_{A|S}$, which is the target distribution, whereas minimizing the mutual information pushes $Q_{A|S}$ as close as possible to the marginal $Q_A$, which is known by the agents (it is sent once at the beginning of the round and does not depend on the state realizations), in order to reduce the number of bits to be transmitted. The solution to Eq. (2) is thus a conditional distribution $Q_{A|S}$ that combines the target policy conditioned on $S$ with the marginal distribution.

We now discuss some simple corner cases. If the maximum acceptable distortion is $0$, we need $Q_{A|S} = \pi_{A|S}$, and the rate equals $\mathbb{E}_{P_S}\big[D_{KL}\big(\pi_{A|S} \,\|\, \pi_A\big)\big]$. If $S$ and $A$ are independent, i.e., the policy does not depend on the state, we have $\pi_{A|S} = \pi_A$, and the decision-maker does not need to send any bits to the agents, which can simply sample $A \sim \pi_A$. Moreover, if we can find a distribution $Q_A$ such that $\mathbb{E}_{P_S}\big[D_{KL}\big(Q_A \,\|\, \pi_{A|S}\big)\big] \le D$, then the best strategy is not to convey any message to the controller, as the constraint is already satisfied with rate $R(D) = 0$. In the general case, the problem is to find, among all conditional distributions $Q(a \mid s)$ that satisfy the constraint, the one with minimum divergence from $Q(a)$, in order to minimize the rate $\frac{\log_2 B}{N}$, i.e., the number of bits consumed per agent, for arbitrarily large $N$. We solve the problem using Lagrange multipliers, and obtain the shape of the optimal distribution
$$Q_{\gamma^*}(a \mid s) = \frac{\tilde{Q}(a)^{\gamma^*} P(a \mid s)^{1-\gamma^*}}{\sum_{a' \in \mathcal{A}} \tilde{Q}(a')^{\gamma^*} P(a' \mid s)^{1-\gamma^*}}, \qquad \forall s \in \mathcal{S}, \; a \in \mathcal{A}, \qquad (7)$$
where $\gamma^*$ is such that $\mathbb{E}_{P_S}[D_{KL}(Q_{\gamma^*} \,\|\, P)] = D$. The derivation is provided in Appendix A.1. A sketch of the resulting alternating-minimization procedure is given below.
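The following is a minimal sketch of this alternating minimization for a discrete state set with known $P(s)$ and target policy $P(a \mid s)$. The inner step uses the tilted form of Eq. (7), with a bisection on $\gamma$ to approximately meet the distortion budget $D$ (assuming the distortion is monotone in $\gamma$), and the outer step recomputes the marginal as in Eq. (5). Function names and the tiny example are illustrative, not the authors' implementation.

```python
import numpy as np

def tilted_policy(pi, q_marg, gamma, eps=1e-12):
    """Eq. (7): Q_gamma(a|s) proportional to q_marg(a)^gamma * pi(a|s)^(1-gamma)."""
    q = (q_marg[None, :] + eps) ** gamma * (pi + eps) ** (1.0 - gamma)
    return q / q.sum(axis=1, keepdims=True)

def avg_reverse_kl(P_s, q, pi, eps=1e-12):
    """E_{P_S}[ D_KL( Q(.|s) || pi(.|s) ) ]."""
    return float(np.sum(P_s[:, None] * q * np.log((q + eps) / (pi + eps))))

def rate_bits(P_s, q, q_marg, eps=1e-12):
    """I(S; A) in bits under the joint P_s(s) q(a|s)."""
    return float(np.sum(P_s[:, None] * q * np.log2((q + eps) / (q_marg[None, :] + eps))))

def blahut_arimoto_kl(P_s, pi, D, n_outer=100, n_bisect=40):
    """Alternating minimization for Eq. (2) with reverse-KL distortion (illustrative sketch)."""
    K = pi.shape[1]
    q_marg = np.full(K, 1.0 / K)                     # initialization of the marginal Q~_0(a)
    for _ in range(n_outer):
        lo, hi = 0.0, 1.0                            # gamma = 0 gives Q = pi (zero distortion)
        for _ in range(n_bisect):                    # largest gamma whose distortion is <= D
            mid = 0.5 * (lo + hi)
            if avg_reverse_kl(P_s, tilted_policy(pi, q_marg, mid), pi) <= D:
                lo = mid
            else:
                hi = mid
        q = tilted_policy(pi, q_marg, lo)
        q_marg = P_s @ q                             # Eq. (5): marginal over the actions
    return q, rate_bits(P_s, q, q_marg)

# Tiny example: 4 states, 4 actions, nearly deterministic target policy.
L = K = 4
P_s = np.full(L, 0.25)
pi = np.full((L, K), 0.01 / (K - 1)) + np.eye(L, K) * (0.99 - 0.01 / (K - 1))
Q, R = blahut_arimoto_kl(P_s, pi, D=0.1)
print(f"rate R(D=0.1) ~ {R:.3f} bits per agent")
```

In this nearly deterministic 4-state example, letting $D$ approach $0$ pushes the returned rate towards roughly the 2-bit entropy of the uniform state distribution, consistent with the behaviour discussed for the Deterministic case in Sec. 4.1.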
### 3.4 ASYMPTOTIC REGRET

To prove the results on the achievable regret, we need the additional conditions contained in Assumption 1 of Russo (2016), which states that the rewards must be distributed following canonical exponential families, and the priors used by TS over the average rewards must be uniformly bounded away from zero $\forall (s, a)$. The proofs of both Lemmas below are reported in Appendix D. In the following, we characterize the minimum rate needed to achieve sub-linear regret in all states $s \in \mathcal{S}$. We define $H(A^*)$ as the marginal entropy of the optimal arm, computed from the marginal $\pi^*(a) = \sum_s P_S(s) \pi^*(a \mid s)$ as $H(A^*) = \sum_a \pi^*(a) \log \frac{1}{\pi^*(a)}$, and we prove that it is the minimum rate required to achieve sub-linear regret.

**Lemma 3.1.** *If $R < H(A^*)$, then it is not possible to convey a policy $Q_t(s, a)$ that achieves sub-linear regret in all states $s \in \mathcal{S}$.*

The following Lemma provides the achievability part.

**Lemma 3.2.** *If $R > H(A^*)$, then achieving sub-linear regret is possible in all states $s \in \mathcal{S}$.*

The consequence of this second Lemma is that, even if the exact TS policy $\pi_t$ cannot be transmitted for every $t$, as long as sufficient rate is available, i.e., $R > H(A^*)$, it is still possible to achieve sub-linear regret. Following the notation introduced in Sec. 2.2, this implies that, as $T \to \infty$, and for all sub-linear regrets $\rho$, the regret-communication pair $(\rho, R)$ is achievable as long as $R > H(A^*)$. Moreover, we argue that the policy construction found in Sec. 3.3 can achieve better empirical performance than the scheme used to prove Lemma 3.2.

### 3.5 PRACTICAL CODING SCHEME

The above analysis characterizes an information-theoretic bound on the optimal performance, but does not provide a constructive communication scheme. To obtain a practical coding scheme, we propose a solution based on state reduction, which computes a compact state representation. In essence, the decision-maker constructs a message containing the new state representations $\hat{s}(s) \in \hat{\mathcal{S}}$, one for each agent, and sends it over the channel. Once the agents have received the message, they sample the actions according to a common policy $Q_{\hat{s}}(a \mid \hat{s})$, which is defined on the compressed state space $\hat{\mathcal{S}}$. If the rate constraint allows $B$ bits per agent, then at most $2^B$ different states can be conveyed to each agent. The idea is to group the states into $2^B = M$ clusters $\hat{s}_1, \ldots, \hat{s}_M$, minimizing $\sum_s P(s) D_{KL}\big(Q_{\hat{s}}(a \mid \hat{s}) \,\|\, \pi(a \mid s)\big)$, where $Q_{\hat{s}}(a \mid \hat{s})$ is the policy defined on the state $\hat{s}(s) \in \hat{\mathcal{S}}$.

To find the clusters and their policies we employ the well-known Lloyd algorithm, an iterative process that groups the states into $2^B$ clusters. First, knowing the policy $\pi$, the decision-maker maps each state $s_i$ to a $K$-dimensional point $\alpha^i = \pi(\cdot \mid s_i) \in \Delta_K$, obtaining $|\mathcal{S}| = L$ points $\alpha^1, \ldots, \alpha^L$. Then, it generates $2^B = M$ random points $\mu^1, \ldots, \mu^M \in \Delta_K$ as initial centroids, i.e., representative policies, and iterates over the following two steps (a sketch of the whole procedure is given below):

1. Assign to each point $\alpha^i$ the class $j^* \in \{1, \ldots, M\}$ such that $j^* = \arg\min_j D_{KL}(\mu^j \| \alpha^i)$, i.e., minimizing the divergence between the representative $\mu^{j^*}$ and the original policy, which is the point $\alpha^i$. For each cluster $j$, we denote by $\mathcal{S}_j$ the set containing the states associated with the policies in the cluster.

2. Update $\mu^1, \ldots, \mu^M$ such that $\mu^j = \arg\min_{\mu \in \Delta_K} \sum_{s \in \mathcal{S}_j} P(s) D_{KL}(\mu \| \pi(\cdot \mid s))$, which is a convex optimization problem that can again be solved with Lagrange multipliers. The solution is
$$\mu^j = \frac{1}{Z} \prod_{s \in \mathcal{S}_j} \pi(\cdot \mid s)^{\frac{P(s)}{A(\mathcal{S}_j)}}, \qquad (8)$$
where the product is element-wise, $A(\mathcal{S}_j)$ is the sum of the probabilities of the states in $\mathcal{S}_j$, i.e., $A(\mathcal{S}_j) = \sum_{s \in \mathcal{S}_j} P(s)$, and $Z$ is the normalizing factor. After computing the new centroids, we go back to step (1). The derivation of Eq. (8) is provided in Appendix A.2.

The process continues until the new solution no longer decreases the KL-divergence between the clustered and the target policy, $D_{KL}\big(Q_{\hat{s}}(\hat{s}, a) \,\|\, P(s, a)\big) = \sum_{j=1}^{M} \sum_{s \in \mathcal{S}_j} P(s) D_{KL}\big(\mu^j \,\|\, \pi(\cdot \mid s)\big)$.
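A minimal sketch of this clustering step follows, assuming the target policy is given as an $L \times K$ matrix of per-state action distributions. The assignment uses $D_{KL}(\mu^j \| \pi(\cdot \mid s))$ as in step (1), and the centroid update implements the $P(s)$-weighted geometric mean of Eq. (8). Names and the random example are illustrative.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    p, q = p + eps, q + eps
    return np.sum(p * np.log(p / q), axis=-1)

def lloyd_policy_clustering(pi, P_s, M, n_iter=50, seed=0):
    """Group L per-state policies (rows of pi) into M representative policies."""
    rng = np.random.default_rng(seed)
    L, K = pi.shape
    mu = rng.dirichlet(np.ones(K), size=M)            # random initial centroids in the simplex
    for _ in range(n_iter):
        # Step 1: assign each state to the centroid minimizing D_KL(mu_j || pi(.|s)).
        assign = np.array([[kl(mu[j], pi[s]) for j in range(M)] for s in range(L)]).argmin(axis=1)
        # Step 2: Eq. (8), a P(s)-weighted geometric mean of the policies in each cluster.
        for j in range(M):
            members = np.where(assign == j)[0]
            if len(members) == 0:
                continue                              # keep empty clusters unchanged
            w = P_s[members] / P_s[members].sum()
            log_mu = (w[:, None] * np.log(pi[members] + 1e-12)).sum(axis=0)
            mu[j] = np.exp(log_mu) / np.exp(log_mu).sum()
    distortion = float(np.sum(P_s * kl(mu[assign], pi)))
    return mu, assign, distortion

# Example: 8 states, 4 actions, compressed to M = 2^B = 4 clusters.
rng = np.random.default_rng(3)
pi = rng.dirichlet(np.ones(4), size=8)
P_s = np.full(8, 1.0 / 8)
mu, assign, d = lloyd_policy_clustering(pi, P_s, M=4)
print(assign, round(d, 3))
```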
**Observation.** Note that the controller is assumed to know the $2^{NR}$ policies from which it samples the actions of the agents. These can be transmitted at the beginning of each round. In this case, the scheme is efficient as long as $N \log_2 K \gg B L \lambda \log_2 K$, where $\lambda$ is the number of bits used to represent the values of the Probability Mass Function (PMF) $Q_{\hat{s}}(\cdot \mid \hat{s})$. For this reason, we provide a scheme where the new policy is updated not at every transmission, but only when the target $\pi$ has changed considerably. In particular, if we denote by $\pi^{\mathrm{cls}}$ the policy defined over the compressed state representation, by $\pi^{\mathrm{last}}$ the last policy used to compute $\pi^{\mathrm{cls}}$, and by $\pi$ the updated target policy, we compute and transmit $\pi^{\mathrm{cls}}$ every time $D_{KL}(\pi^{\mathrm{last}} \| \pi) > \beta$.

## 4 NUMERICAL RESULTS

In this section, we provide numerical results in support of our theoretical analysis. Additional experiments and details can be found in Appendix C.

### 4.1 RATE-DISTORTION FUNCTION

In this first experiment, we analyze the rate-distortion function, and the related optimal $Q(a \mid s)$, when the distortion is given by the KL-divergence, in three different problems in which the number of states, $L$, and the number of actions, $K$, are both equal to 16. In the first problem, the target policies $\pi(a \mid s)$ have a one-to-one relation with the states, and are generated such that $\pi(a = i \mid s = j) = 0.99$ if $j = i$, and uniform otherwise. We refer to this as the Deterministic case. In the second experiment, the target policies are generated in a similar way, but the other actions' probabilities take random values in $(0, 0.05]$, and are then normalized; we call this the Random Deterministic case. In the third experiment, the states are grouped into 4 blocks, such that $\pi(a = i \mid s = j) = 0.99$ if $\lfloor j/4 \rfloor = i$, and uniform otherwise; this case is denoted as the Block one. In all three experiments, the distribution $P_S$ is uniform over the state space $\mathcal{S}$. A sketch of how these target policies can be generated is given at the end of this subsection.

Figure 2: Rate-distortion function for the 3 different experiments (a). $\pi(a \mid s=8)$ and $Q(a \mid s=8)$ in the third experiment, when the rate $R$ is equal to 1 (b). $\pi(a \mid s=8)$ and $Q(a \mid s=8)$ in the second experiment, when the rate $R$ is equal to 1 (c).

The rate-distortion curves are reported in Fig. 2a. As we can see, in the Deterministic case, the point with zero distortion has $R(0) \sim H(P_S)$, where $H(P_S)$ is the Shannon entropy of $P_S$, as expected. Indeed, in this case, the action distributions are strongly correlated with the state, and accurate knowledge of the state is needed. In the second experiment, the starting point is similar, but the curve decreases more rapidly, due to the random values of the other actions' probabilities. In the third experiment, the zero-distortion point is achieved at $R(0) \sim 2$ bits, since, similarly to the Deterministic experiment, the action probabilities are correlated with the state realization, but are grouped into four different cases. Again, given the uniform distribution over the states, $R(0) \sim H(\mathrm{Unif}_4)$, where $\mathrm{Unif}_4$ is the uniform distribution over a set of 4 elements. In Fig. 2c, the approximate distributions for state $s = 8$ are reported for the Deterministic and Random Deterministic cases, when $R = 1$.
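For concreteness, the sketch below generates the three families of target policies following our reading of the description above; details such as how the residual probabilities are normalized in the Random Deterministic case are assumptions, and may differ from the exact construction used to produce the figures.

```python
import numpy as np

def deterministic_policy(L=16, K=16, p=0.99):
    """pi(a=i|s=j) = p if i == j, remaining mass spread uniformly over the other arms."""
    pi = np.full((L, K), (1.0 - p) / (K - 1))
    pi[np.arange(L), np.arange(L) % K] = p
    return pi

def random_deterministic_policy(L=16, K=16, p=0.99, seed=0):
    """Same dominant arm, but the other probabilities are random in (0, 0.05] and renormalized."""
    rng = np.random.default_rng(seed)
    pi = rng.uniform(1e-6, 0.05, size=(L, K))
    pi[np.arange(L), np.arange(L) % K] = 0.0
    pi *= (1.0 - p) / pi.sum(axis=1, keepdims=True)   # off-dominant mass sums to 1 - p
    pi[np.arange(L), np.arange(L) % K] = p
    return pi

def block_policy(L=16, K=16, p=0.99, block=4):
    """pi(a=i|s=j) = p if floor(j/block) == i, uniform otherwise."""
    pi = np.full((L, K), (1.0 - p) / (K - 1))
    pi[np.arange(L), np.arange(L) // block] = p
    return pi

for f in (deterministic_policy, random_deterministic_policy, block_policy):
    assert np.allclose(f().sum(axis=1), 1.0)
```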
### 4.2 CONTEXTUAL MULTI-ARMED BANDIT

We now analyze the RC-CMAB problem presented in Sec. 2, and apply the clustered policy scheme to solve it. In particular, we compare the performance of the Perfect agent, which applies TS without any rate constraint, and thus samples from the true posterior $\pi$, with the performance of the Comm, Cluster, and Marginal agents. The Comm agent uses the optimal scheme provided in Sec. 3.2; the Cluster agent implements the practical coding scheme provided in Sec. 3.5, with $B = \lceil R \rceil$ bits per agent, where $R$ is the rate adopted by the Comm agent; the Marginal agent adopts the marginal over the states, computed from the target policy $\pi$ and the environment distribution $P_S$, and serves as a lower bound on the performance. The state distribution $P_S$ is uniform over $L = 16$ states, and there are $K = 16$ actions, $N = 100$ agents, and a total of $T = 200$ rounds. The threshold for changing the clustered policy is set to $\beta = 0.2$, and the communication rate constraint to $R = 2.5$. In this experiment, for each state $s_i$, the best reward is given by the arm $a_j$ such that $i = j$, with $i, j \in \{0, \ldots, 15\}$. In particular, the reward of arm $j$ in state $s_i$ is a Bernoulli random variable with parameter $\mu_j = 0.8$ if $i = j$, whereas $\mu_j \sim \mathrm{Unif}[0, 0.75]$ if $i \neq j$. In this case, the best action response is strongly correlated with the state realization; thus, a sufficiently high rate is required to sample from the target policy $\pi$.

As can be observed from Fig. 3(a), the theoretical rate needed to transmit the target policy is related to the amount of information the agent is learning from the environment, and is computed using Eq. (2). Indeed, during the first $\sim 40$ iterations, the rate is below $R = 2.5$. As the learning process continues, the required rate for reliable transmission increases, as the mutual information between $S$ and $A$ increases. We highlight that, in order to fairly represent the 16 different actions, one would need 4 bits. Indeed, another way of looking at Eq. (2) is through the information bottleneck principle (Igl et al., 2019), which is used to constrain the mutual information between the states and the actions and to encourage exploration, and which can be exploited to reduce the required rate when training multi-agent systems. In Fig. 3(b), we report the KL-divergence between the last policy used to compute the compressed state representation, i.e., $\pi^{\mathrm{last}}$ in Sec. 3.5, and the target policy known by the decision-maker. The trend shows that, at the beginning of the learning process, it oscillates rapidly, as the target policy changes significantly between consecutive iterations. The threshold $\beta = 0.2$, indicating when a new cluster policy has to be sent, is highlighted in gray. In this experiment, the decision-maker had to send a new policy, on average, once every $\sim 7$ rounds. Fig. 4(a) reports the regret at state $s = 10$ (the other plots can be seen in Appendix C.2) for the four different agents, as defined in Eq. (1). We can see that both the Cluster and the Comm agents achieve sub-linear regret in this particular state. In Fig. 4(b), the average rewards obtained by the agents are reported for each state separately. We can see that the developed scheme is still able to achieve good performance.

Figure 3: Asymptotic rates to convey the Perfect and Comm policies, and the bits used by the Cluster agent, averaged over 5 runs (a). $D_{KL}(\pi^{\mathrm{last}} \| \pi)$, averaged over 5 runs (b).

Figure 4: Cumulative regret for the different agents, together with the performance of the Marginal agent, evaluated for state $s = 10$ (a). Average reward per state (b).
## 5 CONCLUSION

We have studied the RC-CMAB problem, in which an intelligent entity, i.e., the decision-maker, observes the contexts of $N$ parallel CMAB processes, and has to decide on the actions depending on the current contexts and the past actions and rewards. However, the actions are implemented by a controller that is connected to the decision-maker through a rate-constrained communication link. First, we cast the problem into the proper information-theoretic framework, formulating it as a policy compression problem, and provided the optimal compression scheme in the limit of an infinite number of agents, when the adopted distortion measure is the KL-divergence between the compressed policy adopted by the controller and the policy of the decision-maker. We then characterized the minimum rate needed to obtain sub-linear regret, and proved that any rate above this threshold can achieve it. Finally, we designed a practical coding scheme to transmit the actions for finite $N$, which relies on a compressed state representation. We evaluated the performance of the policy obtained through the asymptotic information-theoretic formulation, and of the one obtained through the clustering scheme, and observed a small gap between the two. We numerically showed the relation between the asymptotic rate bound and the learning phase of the agents, showing how it is possible to save communication resources when training a multi-agent system like the one considered here. We believe that this work can serve as a first step towards understanding the fundamental performance limits of multi-agent decision-making problems under communication constraints, and that it highlights the intimate relation between the communication scheme and the learning process. Ongoing work includes deriving an information-theoretic converse result on the regret performance, and generalizing the framework to reinforcement learning problems.

## REFERENCES

Mridul Agarwal, Vaneet Aggarwal, and Kamyar Azizzadenesheli. Multi-agent multi-armed bandits with limited communication. arXiv:2102.08462 [cs], 2021.

Rudolf Ahlswede and Imre Csiszár. Hypothesis testing with communication constraints. IEEE Transactions on Information Theory, 32(4):533–542, Jul. 1986.

Toby Berger. Decentralized estimation and decision theory. In IEEE 7th Spring Workshop on Information Theory, Mt. Kisco, NY, Sep. 1979.

Djallel Bouneffouf and Irina Rish. A survey on practical applications of multi-armed and contextual bandits. arXiv:1904.10040 [cs.LG], 2019.

Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, USA, 2006a. ISBN 0471241954.

Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, USA, 2006b.

Imre Csiszár and Gabor Tusnády. Information geometry and alternating minimization procedures. Statistics and Decisions, Supplement Issue, pp. 205–237, 1984.

Jakob N. Foerster, Yannis M. Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. arXiv:1605.06676 [cs], May 2016.

Osama A. Hanna, Lin F. Yang, and Christina Fragouli. Solving multi-arm bandit using a few bits of communication. In 38th International Conference on Machine Learning, 2021.

Serhii Havrylov and Ivan Titov. Emergence of language with multi-agent games: Learning to communicate with sequences of symbols. 2017.
Maximilian Igl, Kamil Ciosek, Yingzhen Li, Sebastian Tschiatschek, Cheng Zhang, Sam Devlin, and Katja Hofmann. Generalization in reinforcement learning with selective noise injection and information bottleneck. In Advances in Neural Information Processing Systems, volume 32, pp. 13956–13968, 2019.

Mikolaj Jankowski, Deniz Gündüz, and Krystian Mikolajczyk. Wireless image retrieval at the edge. IEEE Journal on Selected Areas in Communications, 39(1):89–100, 2021. doi: 10.1109/JSAC.2020.3036955.

Cem Kalkanli and Ayfer Ozgur. Asymptotic convergence of Thompson sampling. arXiv:2011.03917v1, 2020.

Gerhard Kramer and Serap A. Savari. Communicating probability distributions. IEEE Transactions on Information Theory, 53(2):518–525, 2007. doi: 10.1109/TIT.2006.889015.

Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. Multi-agent cooperation and the emergence of (natural) language. arXiv:1612.07182 [cs], March 2017.

Hongju Park and Mohamad Kazem Shirani Faradonbeh. Analysis of Thompson sampling for partially observable contextual multi-armed bandits, 2021.

Jihong Park, Sumudu Samarakoon, Mehdi Bennis, and Mérouane Debbah. Wireless network intelligence at the edge. Proceedings of the IEEE, 107(11):2204–2239, 2019.

My Phan, Yasin Abbasi Yadkori, and Justin Domke. Thompson sampling and approximate inference. In Advances in Neural Information Processing Systems, 2019.

Daniel Russo. Simple Bayesian algorithms for best arm identification. In Vitaly Feldman, Alexander Rakhlin, and Ohad Shamir (eds.), 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pp. 1417–1418, Columbia University, New York, New York, USA, 23–26 Jun 2016. PMLR. URL https://proceedings.mlr.press/v49/russo16.html.

Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus. Learning multiagent communication with backpropagation. In Proc. of 30th Int'l Conf. on Neural Information Processing Systems, NIPS'16, pp. 2252–2260, Red Hook, NY, December 2016.

William R. Thompson. On the theory of apportionment. American Journal of Mathematics, 57(2):450–456, 1935.

Róbert Torfason, Fabian Mentzer, Eiríkur Ágústsson, Michael Tschannen, Radu Timofte, and Luc Van Gool. Towards image understanding from deep compression without decoding. In International Conference on Learning Representations, 2018.

Aolin Xu and Maxim Raginsky. Information-theoretic lower bounds on Bayes risk in decentralized estimation. IEEE Transactions on Information Theory, 63(3):1580–1600, 2017. doi: 10.1109/TIT.2016.2646342.

Yuchen Zhang, John Duchi, Michael I Jordan, and Martin J Wainwright. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In Advances in Neural Information Processing Systems, volume 26, 2013.

## APPENDIX A POLICIES DERIVATION

### A.1 OPTIMAL POLICY

To solve this problem, we consider the associated Lagrangian
$$\mathcal{L}\big(Q(a \mid s), \lambda, \mu\big) = \sum_s P(s) \sum_a Q(a \mid s) \log \frac{Q(a \mid s)}{\tilde{Q}(a)} + \lambda \left( \sum_s P(s) \sum_a Q(a \mid s) \log \frac{Q(a \mid s)}{P(a \mid s)} - D \right) + \mu \sum_s P(s) \left( \sum_a Q(a \mid s) - 1 \right),$$
where the Lagrange multiplier $\lambda$ has to be optimized to meet the constraint on the divergence, whereas $\mu$ ensures that the solution is a probability distribution, i.e., that its elements sum up to one. The positivity constraints are automatically satisfied by the fact that the solution has an exponential form.
We first take the derivative of the Lagrangian with respect to $Q(a \mid s)$ and set it to zero:
$$\frac{\partial \mathcal{L}(Q(a \mid s), \lambda, \mu)}{\partial Q(a \mid s)} = P(s)\left[\log\frac{Q(a \mid s)}{\tilde{Q}(a)} + 1\right] - \frac{P(s)}{\tilde{Q}(a)} \sum_{s'} P(s') Q(a \mid s') + \lambda P(s)\left[\log\frac{Q(a \mid s)}{P(a \mid s)} + 1\right] + \mu P(s) = 0.$$
Since $\tilde{Q}(a)$ is the marginal $\sum_{s'} P(s') Q(a \mid s')$, the second and third terms cancel, and we find
$$\log \frac{Q(a \mid s)^{1+\lambda}}{\tilde{Q}(a)\, P(a \mid s)^{\lambda}} = -(\lambda + \mu).$$
Rewriting with $\gamma = \frac{1}{1+\lambda} \in [0, 1]$, we obtain $Q_\gamma(a \mid s) = e^{-\frac{\mu+\lambda}{1+\lambda}}\, \tilde{Q}(a)^{\frac{1}{1+\lambda}}\, P(a \mid s)^{\frac{\lambda}{1+\lambda}}$, where $e^{-\frac{\mu+\lambda}{1+\lambda}}$ is the normalizing factor. The solution thus has the form
$$Q_\gamma(a \mid s) = \frac{\tilde{Q}(a)^{\gamma} P(a \mid s)^{1-\gamma}}{\sum_{a' \in \mathcal{A}} \tilde{Q}(a')^{\gamma} P(a' \mid s)^{1-\gamma}}, \qquad \forall s \in \mathcal{S}, \; a \in \mathcal{A}. \qquad (9)$$
By the convexity of the KL-divergence, the solution lies on the boundary of the constraint set, i.e., where $\mathbb{E}_{P_S}[D_{KL}(Q_{\gamma^*} \| P)] = D$. Consequently, if we can numerically find $\gamma^*$ such that $\mathbb{E}_{P_S}[D_{KL}(Q_{\gamma^*} \| P)] = D$, we solve Eq. (6).

### A.2 UPDATE CENTROIDS

Again, we compute the optimal centroids by solving the Lagrangian
$$\mathcal{L}(\mu_a^j, \lambda) = \sum_{s \in \mathcal{S}_j} P(s) \sum_{a \in \mathcal{A}} \mu_a^j \log \frac{\mu_a^j}{\pi(a \mid s)} + \lambda \left( \sum_{a \in \mathcal{A}} \mu_a^j - 1 \right).$$
Taking its derivative and setting it to zero,
$$\frac{\partial \mathcal{L}(\mu_a^j, \lambda)}{\partial \mu_a^j} = \sum_{s \in \mathcal{S}_j} P(s) \left[ \log \frac{\mu_a^j}{\pi(a \mid s)} + 1 \right] + \lambda = 0,$$
we find
$$\log \mu_a^j = \sum_{s \in \mathcal{S}_j} \frac{P(s)}{A(\mathcal{S}_j)} \log \pi(a \mid s) - 1 - \frac{\lambda}{A(\mathcal{S}_j)},$$
and therefore
$$\mu^j = \frac{1}{Z} \prod_{s \in \mathcal{S}_j} \pi(\cdot \mid s)^{\frac{P(s)}{A(\mathcal{S}_j)}},$$
where $Z$ is the normalizing factor, obtaining the form expressed in Eq. (8).

## APPENDIX B PRESENTED ALGORITHMS

In this appendix, the algorithms for the theoretically-optimal and cluster policies are described.

### B.1 THEORETICAL OPTIMAL POLICY

This is the pseudocode of the algorithm at the decision-maker side, when adopting the information-theoretically inspired compressed policy. The only task of the controller is to decode the received message and to communicate the actions to the agents. We denote by $[A]$ the set $\{1, \ldots, A\}$, and by $\mathrm{Unif}_{\mathcal{B}}$ the uniform distribution over a set $\mathcal{B}$. A runnable sketch of this loop is given after the pseudocode.

**Algorithm 1: Optimal Policy**

Input: rate $R$, number of agents $N$, number of actions $K$, state distribution $P_S$.
Initialize the Thompson Sampling policy $\pi$, and the compressed policy $Q(\cdot \mid s)$ to $\mathrm{Unif}_{[K]}$.
For each $t \in \{1, \ldots, T\}$:
  1. Observe $s_t^N = (s_{t,1}, \ldots, s_{t,N})$.
  2. Sample $a_t^N = (a_{t,1}, \ldots, a_{t,N})$ from $Q(\cdot \mid s)$.
  3. Encode $a_t^N$ into $u_t = f_t(a_t^N)$ and transmit it to the controller.
  4. Observe $r_t^N = (r_{t,1}, \ldots, r_{t,N})$.
  5. Update the action posteriors $\pi(\cdot \mid s)$ using the Thompson Sampling update.
  6. Update $Q(a \mid s)$ using the Blahut-Arimoto algorithm (iterate over Eq. (5) and Eq. (6)).
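Below is a minimal runnable sketch of Algorithm 1, assuming Beta-Bernoulli TS posteriors and reusing the `blahut_arimoto_kl` helper sketched in Sec. 3.3 (both are illustrative constructions rather than the authors' code). The channel itself is abstracted away: the sampled actions are assumed to be delivered to the controller within the rate budget implied by the distortion level `D`.

```python
import numpy as np

# Assumes blahut_arimoto_kl(P_s, pi, D) from the sketch in Sec. 3.3 is in scope.
rng = np.random.default_rng(4)
L, K, N, T, D = 8, 8, 50, 100, 0.1
P_s = np.full(L, 1.0 / L)
mu_env = np.where(np.eye(L, K) > 0, 0.8, rng.uniform(0.0, 0.75, (L, K)))  # toy environment
alpha = np.ones((L, K))
beta = np.ones((L, K))

def ts_policy(n_samples=2000):
    """Monte Carlo estimate of the TS-induced policy pi_t(a|s)."""
    draws = rng.beta(alpha, beta, size=(n_samples, L, K))
    winners = draws.argmax(axis=2)                        # (n_samples, L)
    return np.stack([np.bincount(winners[:, s], minlength=K) for s in range(L)]) / n_samples

for t in range(T):
    pi_t = ts_policy()
    Q_t, rate = blahut_arimoto_kl(P_s, pi_t, D)           # compressed policy and its rate, Eq. (2)
    s = rng.choice(L, size=N, p=P_s)                      # observe s_t^N
    a = np.array([rng.choice(K, p=Q_t[si]) for si in s])  # sample a_t^N from Q(.|s) and "transmit"
    r = rng.binomial(1, mu_env[s, a])                     # observe r_t^N
    np.add.at(alpha, (s, a), r)                           # TS posterior update (handles repeated indices)
    np.add.at(beta, (s, a), 1 - r)
```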
### B.2 CLUSTER POLICY

This is the pseudocode of the algorithm at the decision-maker side, when adopting the cluster policy. The task of the controller is to sample, at time $t$ and for each agent $i$, the action $a_{t,i}$ according to the last received centroid $\mu_{j(s_{t,i})}$, whose components convey the action probabilities.

**Algorithm 2: Cluster Policy**

Input: per-agent bits $B$, number of agents $N$, number of actions $K$, state distribution $P_S$, threshold $\beta$.
Initialize the Thompson Sampling policy $\pi$.
Compute the centroids $\mu_j$ for $j = 1, \ldots, 2^B$ from $\pi$ with the Lloyd algorithm and the rule in Eq. (8).
Transmit the centroids $\mu_j$ to all agents, and set $\pi^{\mathrm{last}} = \pi$.
For each $t \in \{1, \ldots, T\}$:
  1. Observe $s_t^N = (s_{t,1}, \ldots, s_{t,N})$.
  2. If $D_{KL}(\pi^{\mathrm{last}} \| \pi) > \beta$: update the centroids $\mu_j$ from $\pi$ with the Lloyd algorithm and the rule in Eq. (8), transmit the centroids $\mu_j$ to all agents, and set $\pi^{\mathrm{last}} = \pi$.
  3. For all $s_{t,i}$, compute the index $j(s_{t,i})$ of its cluster.
  4. Transmit the vector $j_t^N = (j(s_{t,1}), \ldots, j(s_{t,N}))$.
  5. Observe $r_t^N = (r_{t,1}, \ldots, r_{t,N})$.
  6. Update the action posteriors $\pi(\cdot \mid s)$ using the Thompson Sampling update.

## APPENDIX C EXPERIMENTS

In this section, additional experiments are provided to clarify the proposed scheme and to quantify the trade-offs between rate and regret.

### C.1 STATE REPRESENTATION

In Fig. 5 we report the policy clustering results obtained when applying the coding scheme explained in Sec. 3.5, when $R = 2$ bits can be used to represent 20 randomly generated policies. The figure shows the 4 policies representative of the 4 compressed state representations. As we can see, the policy centroids found by the coding scheme try to fairly resemble the shapes of all the policies within the same cluster.

Figure 5: Clusters and their representatives with $L = 20$ and $B = 2$.

### C.2 DETERMINISTIC

In this section we report some more details on the RC-CMAB experiment presented in Sec. 4.2.

Figure 6: Best action probability for a given state, for the posterior $\pi(a^* \mid s)$ (a) and the compressed policy $Q(a^* \mid s)$ (b) for the different agents. KL-divergence between the agents' action posteriors and the target one (c).

Figure 8: Cumulative average regret $\pm$ one standard deviation for the different agents, from state $s = 0$ to state $s = 15$.

### C.3 8-BLOCK EXPERIMENT

In this second RC-CMAB experiment, the setting is similar to the one presented above, but the best action response is not a one-to-one mapping with the state. Again, $a \in \{0, \ldots, 15\}$ and $s \in \{0, \ldots, 15\}$, but the Bernoulli parameter $\mu_a(s)$ of action $a$ in state $s$ is $0.8$ if $\lfloor s/2 \rfloor = a$, and sampled uniformly in $(0, 0.75]$ otherwise. Thus, the best action responses are grouped into 8 different classes, depending on the state realization. In this case, the rate is limited to $R = 2$.

Figure 9: Block experiment: asymptotic rates to convey the Perfect and Comm policies, and the bits used by the Cluster agent, averaged over 5 runs (a). Average reward achieved in each state by the pool of agents (b).

Figure 10: Block experiment: best action probability for a given state, for the posterior $\pi(a^* \mid s)$ (a) and the compressed policy $Q(a^* \mid s)$ (b) for the different agents. KL-divergence between the agents' action posteriors and the target one (c).

Figure 12: Block experiment: cumulative average regret $\pm$ one standard deviation for the different agents, from state $s = 0$ to state $s = 15$.

### C.4 REGRET VS NUMBER OF CLASSES

In this last experiment, the environment is the same as the one described in Sec. 4.2, apart from the number of agents, which here is $N = 50$. The task is to quantify the effect of the number of per-agent bits $B$ used by the cluster policy on the achievable regret, with $B$ varying from 1 to 4. As we can see from the plots below, when $B$ is equal to 1 or 2, which means that the number of clusters is, respectively, 2 and 4, the regret is not sub-linear in several states. However, the policy with just 2 bits can achieve performance comparable to the 3- and 4-bit policies in some states, e.g., states 4 and 6.
This is due to the fact that, even in those cases without enough bits, the posteriors still converge to good solutions; thus, if a cluster contains, for example, a single state, it will convey the best policy for that state, achieving sub-linear regret. On the contrary, when a cluster contains more states, the representative policy can fail to achieve optimal performance if those states' policies are substantially different.

Figure 14: Deterministic experiment: cumulative average regret $\pm$ one standard deviation for the cluster policy, when $B \in \{1, 2, 3, 4\}$, from state $s = 0$ to state $s = 15$.

## APPENDIX D REGRET BOUND

To prove our statements, we denote by $H_q(X)$ and $I_q(X; Y)$ the marginal entropy and the mutual information w.r.t. the joint probability $q$, i.e., $H(A^*) = H_{\pi^*}(A)$. We start by proving Lemma D.1.

**Lemma D.1.** *If the exact Thompson Sampling policies $\pi_t(a \mid s)$ achieve sub-linear Bayesian regret for all states $s \in \mathcal{S}$, then $\lim_{t \to \infty} I_{\pi_t}(S; A) = \lim_{t \to \infty} H_{\pi_t}(A) = H(A^*)$.*

*Proof.* First of all, we write $I_{\pi_t}(S; A) = H_{\pi_t}(A) - H_{\pi_t}(A \mid S)$. Following Theorem 2 from Kalkanli & Ozgur (2020), if $\pi_t(a \mid s)$ achieves sub-linear regret $\forall s \in \mathcal{S}$, then $\forall s \in \mathcal{S}$,
$$\lim_{t \to \infty} \pi_t(a = a^* \mid s) = 1, \quad \text{where } a^* \text{ is the optimal arm}, \qquad \lim_{t \to \infty} \pi_t(a = a' \mid s) = 0, \quad \text{when } a' \neq a^*.$$
Consequently, in the limit $\pi_t(a \mid s)$ is a deterministic function, and thus
$$\lim_{t \to \infty} H_{\pi_t}(A \mid S) = 0.$$
Moreover, since $\pi_t(\cdot \mid s)$ converges to the indicator of the optimal arm in each state, the marginal over the actions converges to $\pi^*(a)$, so $\lim_{t \to \infty} H_{\pi_t}(A) = H(A^*)$, which concludes our proof.

Then, we prove Lemma 3.1, which is repeated below.

**Lemma 3.1.** *If $R < H(A^*)$, then it is not possible to convey a policy $Q_t(s, a)$ that achieves sub-linear regret in all states $s \in \mathcal{S}$.*

*Proof.* Following Lemma 10 in Phan et al. (2019), if $\lim_{t \to \infty} \pi_t(a^* \mid s) = 1$ and $D_{KL}(Q_t \| \pi_t) < \epsilon$, $\forall t$, for some $\epsilon > 0$, then $\lim_{t \to \infty} Q_t(a^* \mid s) = 1$, inducing $\lim_{t \to \infty} D_{KL}(Q_t \| \pi_t) = 0$. Now, for any policy $Q_t$, if the minimal rate to convey it, $\hat{R}_t = I_{Q_t}(S; A)$, is below the threshold of the optimal policy, i.e., $\hat{R}_t < H(A^*)$, then, by Eq. (2), we have
$$D_{KL}(Q_t \| \pi^*) > 0, \qquad (10)$$
which holds also in the limit. Otherwise, we would have a policy $Q_t$ s.t. $D_{KL}(Q_t \| \pi^*) = 0$, which could be conveyed at a rate strictly below that characterized by Eq. (2), contradicting the definition of the rate-distortion function. Consequently, by Lemma 10 in Phan et al. (2019) and Eq. (10), there cannot be any $\epsilon > 0$ s.t. $\lim_{t \to \infty} D_{KL}(Q_t \| \pi_t) < \epsilon$; and hence, $\lim_{t \to \infty} D_{KL}(Q_t \| \pi_t) = \infty$. However, if $\lim_{t \to \infty} D_{KL}(Q_t \| \pi_t) = \infty$, then $\exists (s, a) \in \mathcal{S} \times \mathcal{A}$ s.t. $\lim_{t \to \infty} Q_t(a \mid s) = c > 0$ with $a \neq a^*$. This implies that $Q_t(s, a)$ plays a sub-optimal arm in state $s$ with constant probability, and so it cannot achieve sub-linear regret in all states.

Finally, we provide a proof for Lemma 3.2, which we repeat below.

**Lemma 3.2.** *If $R > H(A^*)$, then achieving sub-linear regret is possible in all states $s \in \mathcal{S}$.*

*Proof.* We denote by $R^{\pi_t}$ the rate needed to convey the TS policy perfectly to the controller at time $t$, and let $\delta > 0$ be such that $R = H(A^*) + \delta$, where $R$ is the available communication rate. We now provide a scheme that guarantees sub-linear regret. First, generate the $\rho_t$ parameters as in Sec. 3.3.
As long as $R^{\pi_t} > R$, with probability $\rho_t$ play $a_t$ uniformly at random, and with probability $1 - \rho_t$ play according to a policy $Q_t(a \mid s)$ that satisfies the rate-distortion constraint $I(A; S) < R$ under $Q_t(a \mid s)$, which can therefore be transmitted to the controller, and which has a bounded reverse KL-divergence from the TS policy $\pi_t$ at time $t$. Following Lemma 14 in Russo (2016), in this way enough exploration is guaranteed for the TS policy to concentrate, and there exists a finite $t_0$ s.t. $\forall t > t_0$, $R^{\pi_t} < H(A^*) + \delta$. This means that, for the first $t_0$ rounds, both the target and the approximating policies play sub-optimal arms with non-zero probability. However, their average gap is within a constant, given that the average rewards are bounded within $[0, 1]$. Then, $\forall t > t_0$ it is possible to play the exact TS policy, leading to the optimal policy for all future steps, and hence to a sub-linear regret.