Under review as a conference paper at ICLR 2022

SHAPED REWARDS BIAS EMERGENT LANGUAGE

Anonymous authors
Paper under double-blind review

ABSTRACT

One of the primary characteristics of emergent phenomena is that they are determined by the basic properties of the system whence they emerge as opposed to explicitly designed constraints. Reinforcement learning is often used to elicit such phenomena, which specifically arise from the pressure to maximize reward. We distinguish two types of rewards. The first is the base reward, which is motivated directly by the task being solved. The second is shaped rewards, which are designed specifically to make the task easier to learn by introducing biases in the learning process. The inductive bias which shaped rewards introduce is problematic for emergent language experimentation because it biases the object of study: the emergent language. The fact that shaped rewards are intentionally designed conflicts with the basic premise of emergent phenomena arising from basic principles. In this paper, we use a simple sender-receiver navigation game to demonstrate how shaped rewards can 1) explicitly bias the semantics of the learned language, 2) significantly change the entropy of the learned language, and 3) mask the potential effects of other environmental variables of interest.

1 INTRODUCTION

In emergent language research, the goal is to study language as it emerges from the inherent properties of the environment, language, and agents. One pitfall for such experiments, though, is that the language simply mirrors design choices of the environment or of the experimental setting more generally. For example, Bullard et al. (2021) introduce a method for discovering optimal languages for communication between independently trained agents, yet rather than emerging from basic principles, the learned language is the result of an intentionally designed search algorithm. Reinforcement learning is a common tool in this field for observing the emergence of language out of a reward maximization pressure. One such design choice which can obscure these emergent properties is adding shaped rewards on top of the base reward of the reinforcement learning environment (Wiewiora, 2010). The base reward of the environment derives directly from succeeding at the task in question. The difficulty with relying solely on the base reward is that if the task is especially long or complicated, the agent may only receive a reward infrequently, which makes for a difficult learning problem. In such a case, the base reward is considered sparse. This motivates shaped rewards, which are inserted at intermediate steps based on domain knowledge in order to introduce an inductive bias towards good solutions. For example, the base reward in chess would simply be winning or losing the game. A shaped reward could then be given for taking the opponent's material while not losing your own. While this shaped reward is often a good heuristic, it can lead to local optima; for example, it discourages strategies which would sacrifice individual pieces in order to win the whole game. While local optima present a problem for maximizing reward, the biases introduced by shaped rewards present a unique problem for emergent language, which we will highlight in this paper.
For emergent language research, the inductive bias which shaped rewards introduce is especially problematic because it exerts a significant influence on the learned language whose emergent properties are the object of study. This influence can comprise 1) biasing the semantics of the language, 2) changing a property of the whole language (e.g., language entropy), or 3) masking the influence of some other environmental parameter on the language. For example, some emergent language works incorporate shaped rewards into their environments without accounting for the biases they may introduce (Mordatch & Abbeel, 2018; Brandizzi et al., 2021; Harding Graesser et al., 2019). From an engineering and design perspective, tweaking the system to achieve a desired result is standard practice, but from a scientific and experimental perspective, these additional shaped rewards serve as potential confounding factors and hinder accurate observations of the emergent phenomena.

Figure 1: (a) In the centerward environment, the receiver must navigate to the center of the world. (b) The architecture of our two-agent, asymmetric system: the sender perceptron observes the receiver's (x, y) location and passes a message through a Gumbel-Softmax bottleneck to the receiver perceptron, which outputs an (x, y) action.

We study this by introducing a simple navigation game with continuous states and actions where a sender passes a one-word message to the receiver (illustrated in Figures 1a and 1b). Within this environment, our experiments look at the entropy and semantics of the emergent language in the absence and presence of shaped rewards. In the course of these experiments, we find that the reinforcement learning algorithm's experience buffer size has a significant impact on the entropy of the learned language and potentially explains our experimental findings. To this end, we introduce a mathematical model based on the Chinese restaurant process for understanding the effect of experience buffer size on emergent language more generally. We highlight the following contributions in our paper:

• Demonstrating basic ways in which shaped rewards can undesirably influence emergent language experiments
• Presenting a mathematical model for understanding the role of experience buffer size in the entropy of emergent language

2 RELATED WORK

2.1 TAXONOMY

We intentionally design our environment to study shaped rewards and the entropy of language, which requires it to differ from prior art in specific ways. To elucidate this, we create a taxonomy of emergent language environments based on whether or not the environment has multi-step episodes and the presence of a trivially optimal language (defined below). The taxonomy is given in Table 1 and a brief description of each environment is given in Appendix A. Generally speaking, the motivation for shaped rewards in a given environment is sparsity of the base reward, which requires a multi-step, multi-utterance environment. Thus, our experiments naturally require a multi-step environment.

We consider an environment to have a trivially optimal language if the information which needs to be communicated from sender to receiver can be perfectly represented in the emergent language. Such a language most frequently arises when the communicated information is derived from a small number of categorical variables encoded in the observation.
For example, in an environment where the sender must specify an element of the set {red, green, blue} × {square, circle} using messages from the set {r, g, b} × {s, c}, a trivial language is one where color maps to a unique letter in the first position and shape maps to a unique letter in the second position. Other environments have no trivially optimal languages. For example, if the sender must communicate an element of {1, 2, ..., 100} using messages from the set {a, b, c}, there is no trivially optimal language since the sender can at best partition the set of integers to reach an optimal but imperfect solution.

Table 1: Summary of related work in terms of emergent language configurations used.

Paper                     | Task               | Trivial? | Multi-step?
Havrylov & Titov (2017)   | image signalling   | no       | no
Kottur et al. (2017)      | dialog Q&A         | yes      | yes
Mordatch & Abbeel (2018)  | goal specification | yes      | yes
Lazaridou et al. (2018)   | image signalling   | yes      | no
Kharitonov et al. (2020)  | vector signalling  | yes      | no
Chaabouni et al. (2021)   | color signalling   | no       | no
This paper                | navigation         | no       | yes

Kharitonov et al. (2020) give evidence that there is an entropy minimization pressure inherent in emergent language settings. Building on this, Chaabouni et al. (2021) explicitly look at the tradeoff between entropy and reward—higher entropy languages have the potential to convey more information at the expense of being more difficult to learn. This tradeoff disappears if a trivially optimal language is learned since there is no further reward maximization pressure. Language entropy greater than this minimum, then, does not emerge from the reward maximization pressure. Although such an environment does not preclude studying entropy, we choose to use an environment where the information being communicated is fully continuous so that an increase in entropy can always translate to an increase in reward. This leads to a smooth tradeoff between entropy and reward which is illustrated in Figure 2.

Figure 2: Maximum reward for a given entropy. When a trivially optimal language exists (a), the reward plateaus at a global maximum where further increases in entropy do not increase reward. The environment in this paper has no such trivial language and is more similar to (b).

2.2 SPECIFIC APPROACHES

The environment and agent configuration of this paper are most closely related to Chaabouni et al. (2021), who test the balance between entropy and semantic precision in a two-agent color discrimination task. Although the color space comprises 330 distinct colors, the environment facilitates languages which cover a (learned) region of nearby colors. In this way, there is no trivially optimal, one-to-one language as the task is inherently fuzzy. Havrylov & Titov (2017) use a signalling game with natural images which lacks a trivially optimal language, but using natural images results in many uncontrolled variables, lessening the ability of the experiments to make basic, first-principles claims.

Shaped rewards have been explored previously in emergent language primarily as an inductive bias for encouraging some property, such as compositionality, to emerge (Hazra et al., 2021; Jaques et al., 2019; Eccles et al., 2019). This paper, instead, focuses on the negative aspects of inductive biases introduced by shaped rewards regarding how they can hinder empirical investigation of emergent languages.
Superficially, the environment used in this paper bears resemblance to Mordatch & Abbeel (2018) and Kajić et al. (2020) as these both deal with navigation. In both of these environments, the agents communicate about a discrete set of actions or goal locations, whereas the agents in this paper communicate about a continuous action. Neither of these papers specifically investigates the effects of shaped rewards on the languages learned.

3 METHODS

3.1 ENVIRONMENT

In this paper we use a simple 2-dimensional navigation environment with two closely related tasks. A sender agent observes the position of a receiver agent, sends a message to the receiver, and the receiver takes an action. In the centerward task (illustrated in Figure 1a), the receiver is initialized uniformly at random within a circle and must navigate towards a circular goal region at the center. In the edgeward task, the receiver is initialized uniformly at random within a circle and must navigate to a goal region comprising the entire area outside of the circle. The centerward environment is the more realistic of the two. The edgeward environment, on the other hand, admits a greater variety of languages, since moving consistently in any one direction eventually reaches the edge; therefore, learning to move in a variety of evenly spaced directions solves the task more quickly but is not strictly necessary. There are no obstacles or walls in the environment. The receiver's location and action are continuous variables stored as floating-point values. If the receiver does not reach the goal region within a certain number of steps, the episode ends with no reward given.

3.2 AGENT ARCHITECTURE

Our architecture comprises two agents, conceptually speaking, but in practice, they are a single neural network. The sender is a disembodied agent which observes the location of the receiver and passes a message in order to guide it towards the goal. The receiver is an agent which receives the message as its only input and takes an action based solely on that message (i.e., it is "blind"). The sender and receiver are randomly initialized at the start of training, trained together, and tested together.

The architecture of the agents is illustrated in Figure 1b. The observation of the sender is a pair of floating-point values representing the receiver's location. The sender itself is a 2-layer perceptron with tanh activations. The output of the second layer is passed to a Gumbel-Softmax bottleneck layer (Maddison et al., 2017; Jang et al., 2017) which enables learning a discrete, one-hot representation, as an information bottleneck (Tishby et al., 2000). The activations of this layer can be thought of as the "words" or "lexicon" of the emergent language. At evaluation time, the bottleneck layer functions deterministically as an argmax layer, emitting one-hot vectors. The receiver is a 1-layer perceptron which receives the output of the Gumbel-Softmax layer as input. The output is a pair of floating-point values which determine the action of the agent. The action is clamped to a maximum step size.

3.3 OPTIMIZATION

Optimization is performed using stochastic gradient descent as a part of proximal policy optimization (PPO) (Schulman et al., 2017). Specifically, we use the implementation of PPO provided by Stable Baselines 3 with our neural networks implemented in PyTorch (Raffin et al., 2019; Paszke et al., 2019).
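To make this setup concrete, the following is a minimal PyTorch sketch of the sender-receiver network described in Section 3.2. It is our own illustration, not the authors' released code: layer sizes and the Gumbel-Softmax temperature follow the defaults in Appendix B, the PPO value head is omitted, and the maximum step size and the per-coordinate clamp are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SenderReceiver(nn.Module):
    """Sender (2-layer perceptron) -> Gumbel-Softmax bottleneck -> receiver (1-layer perceptron)."""

    def __init__(self, n_words: int = 64, max_step: float = 1.0, temperature: float = 1.5):
        super().__init__()
        # Sender: maps the receiver's (x, y) location to logits over bottleneck units ("words").
        self.sender = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, n_words))
        # Receiver: maps a one-hot word to an (x, y) action.
        self.receiver = nn.Linear(n_words, 2)
        self.temperature = temperature
        self.max_step = max_step

    def forward(self, location: torch.Tensor) -> torch.Tensor:
        logits = self.sender(location)
        if self.training:
            # Straight-through Gumbel-Softmax: one-hot forward pass, differentiable backward pass.
            word = F.gumbel_softmax(logits, tau=self.temperature, hard=True)
        else:
            # Deterministic argmax at evaluation time.
            word = F.one_hot(logits.argmax(dim=-1), num_classes=logits.shape[-1]).float()
        action = self.receiver(word)
        # Clamp the action to a maximum step size (per coordinate in this sketch).
        return action.clamp(-self.max_step, self.max_step)

# Example: one forward pass for a receiver located at (0.3, -0.7).
model = SenderReceiver()
action = model(torch.tensor([[0.3, -0.7]]))
```

Whether hard (straight-through) sampling is used during training is not stated in the paper; hard=True is one common choice that keeps the message one-hot end to end.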
Using a Gumbel-Softmax bottleneck layer allows for end-to-end backpropagation, making optimization faster and more consistent than using a backpropagation-free method like REINFORCE (Kharitonov et al., 2020; Williams, 1992).

We will give a basic explanation of how PPO (and related algorithms) works, as it is necessary to connect it to the mathematical model presented in Section 4. PPO has two stages: sampling and optimizing. During the sampling stage, the algorithm runs the agents in the environment and stores the states, actions, and rewards in an experience buffer (or rollout buffer). The optimization stage then comprises performing gradient descent on the agents (i.e., the neural network policy) using the data from the experience buffer. The next iteration starts with the updated agents and an empty experience buffer.

3.4 REWARDS

We make use of two different rewards in our configuration, a base reward and a shaped reward. The base reward is simply a positive reward of 1 given if the receiver reaches the goal region before the episode ends and no reward otherwise. The shaped reward, given at every timestep, is the decrease in distance to the goal. If the goal region is centered at (0, 0), the standard shaped reward for the centerward environment is given by Equation 1; we also use a trivially biased version of the reward, specified in Equation 2, which only takes horizontal distance into account.

r_t = \sqrt{x_{t-1}^2 + y_{t-1}^2} - \sqrt{x_t^2 + y_t^2}    (1)

r'_t = \sqrt{x_{t-1}^2} - \sqrt{x_t^2}    (2)

For the edgeward environment, we use the opposite of r_t as the goal is to move away from the center.

The interplay between base and shaped rewards is important to understand in the larger context of how reinforcement learning problems are structured. The base rewards are well-motivated and directly correspond to the ultimate aim of the task, but their sparsity can make it difficult for the agents to learn to succeed. Shaped rewards facilitate learning by using expert knowledge to form an inductive bias, yet they present a drawback for traditional reinforcement learning and emergent language. Within reinforcement learning, where the goal is to train the best-performing agent, shaped rewards can lead to the agent finding local optima if better solutions are excluded by the inductive bias. Within emergent language, the problem is more nuanced as the goal is primarily to study a wide range of emergent properties of the language learned within the environment. While base rewards have a natural connection to the environment and task, shaped rewards introduce a reward signal which is not intrinsically connected with the task and environment, even if it is a good heuristic.
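As a concrete reference, Equations 1 and 2 (and their edgeward negation) amount to only a few lines. The sketch below is our own transcription; the biased and edgeward flags are illustrative names, not anything from the paper's code.

```python
import math

def shaped_reward(prev_xy, curr_xy, biased: bool = False, edgeward: bool = False) -> float:
    """Per-step shaped reward with the goal centered at the origin.

    Standard reward (Eq. 1): decrease in Euclidean distance to the center.
    Trivially biased reward (Eq. 2): decrease in horizontal distance only.
    Edgeward task: the sign flips, since the goal is to move away from the center.
    """
    (x0, y0), (x1, y1) = prev_xy, curr_xy
    if biased:
        r = abs(x0) - abs(x1)                        # Eq. 2: sqrt(x0^2) - sqrt(x1^2)
    else:
        r = math.hypot(x0, y0) - math.hypot(x1, y1)  # Eq. 1
    return -r if edgeward else r
```

The base reward of 1 for reaching the goal region would be added separately when the episode terminates successfully.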
4 EXPLANATORY MODEL

We argue that the results presented in Sections 5.2 and 5.3 can be explained by an effective change in the size of PPO's experience buffer. In order to illustrate this, we introduce a simple mathematical model based on the Chinese restaurant process (Blei, 2007; Aldous, 1985). While the model does not exactly match our experimental setup, the key similarities allow us to reason about our results as well as potential future experiments.

The Chinese restaurant process is an iterative stochastic process which yields a probability distribution over the positive integers. The key similarity between the Chinese restaurant process and our learning setup is that they are self-reinforcing processes; that is, when a given value is selected in one iteration of the process, the probability that that value is chosen again in subsequent iterations increases. We generalize one aspect of the Chinese restaurant process in Section 4.2 to better match the sampling-optimization procedure of PPO. The primary simplification which this model makes is that it does not take into account the "meaning" of actions and the effects they have within the environment. For example, every successful agent in the centerward environment will use at least three distinct words nearly equiprobably so as to span the 2-dimensional space, whereas no such lower bound exists in the stochastic process.

4.1 CHINESE RESTAURANT PROCESS

As the name suggests, a useful analogy for the Chinese restaurant process starts with a restaurant with infinitely many empty tables, indexed by the positive integers, which can hold an unbounded number of customers. As each customer walks in, they sit at a populated table with a probability proportional to the number of people already at that table. The customer instead sits at a new table with probability proportional to a hyperparameter α, which modulates the concentration of the final distribution. The decision the customer makes is equivalent to sampling from a categorical distribution where the unnormalized weights are the customer counts along with the weight, α, for the new table. The pseudocode for the Chinese restaurant process is given in Algorithm 1 for β = 1. By analogy to the neural networks representing our agents, we can view the tables as bottleneck units and the customers choosing a table as parameter updates which reinforce the use of that unit in accordance with the reward. Mordatch & Abbeel (2018) implicitly assume this when they introduce a reward corresponding to the probability that the emergent lexicon is generated by a Chinese restaurant process (a Dirichlet process, in their words).

The self-reinforcing property can be expressed informally as: more popular tables get more new customers, keeping them popular. A higher α means that customers are more likely to sit at a new table, so the distribution over tables will be more spread out in expectation. The distribution stabilizes as the number of iterations goes to infinity, as an individual new customer has a diminishing effect on the relative sizes of the weights.

Algorithm 1 Expectation Chinese Restaurant Process
 1  assert type(alpha) is float and alpha > 0
 2  assert type(n_iters) is int and n_iters >= 0
 3  assert type(beta) is int and beta > 0
 4
 5  def sample_categorical_alpha(weights):
 6      w_alpha = weights.copy()
 7      k = num_nonzero(weights)
 8      w_alpha[k + 1] = alpha  # weight alpha for the first unused table
 9      return sample_categorical(w_alpha / sum(w_alpha))
10
11  weights = array([1, 0, 0, ...])
12  for _ in range(n_iters):
13      addend = array([0, ...])
14      for _ in range(beta):
15          i = sample_categorical_alpha(weights)
16          addend[i] += 1 / beta
17      weights += addend
18  return weights / sum(weights)

4.2 EXPECTATION CHINESE RESTAURANT PROCESS

The key difference between how the Chinese restaurant process and PPO work is the relationship between sampling (i.e., simulating episodes) and updating the weights/parameters. In each iteration, the regular Chinese restaurant process draws a sample based on its weights and updates those weights immediately.
In PPO, the agent will populate the experience buffer with a number of steps (on the order of 100 to 1000) in the environment before performing gradient descent with that buffer to update the parameters. As a result, the parameter update is performed based on a weighting across multiple bottleneck units according to how often they were used in the episodes recorded in the experience buffer. Thus, to appropriately generalize the Chinese restaurant process, we introduce the expectation Chinese restaurant process. In this process, we add a hyperparameter β which is a positive integer describing how many samples we take from the distribution before updating the weights; the updates are normalized by β so the sum of all weights still only increases by 1 per iteration. The restaurant analogy breaks down here, as we would have to say that in each iteration, β customers simultaneously and independently make a decision, get shrunk to 1/β-th their size, and then sit at their table of choice. The pseudocode for the expectation Chinese restaurant process is given in Algorithm 1.

5 EXPERIMENTS

Each run of an experiment starts by training a sender and receiver for a fixed number of timesteps for a range of independent variable values. The trained models are then evaluated by initializing 3000 episodes at evenly distributed locations using Vogel's method (Vogel, 1979). In most settings, the agents are able to achieve a 100% success rate during training and evaluation; we remove from consideration any models which do not. All models for our experiments use 2^6 = 64 bottleneck units, which translates to a maximum entropy of 6 bits. Hyperparameters are given in Appendix B.

5.1 BIASED SEMANTICS

In our first experiment we demonstrate how shaped rewards which are trivially biased directly distort the semantics of the language, that is, the action associated with each bottleneck unit. We compare three settings: no shaped reward, the standard shaped reward, and the trivially biased shaped reward. We visualize the semantics of the language with so-called "sea star plots" in Figure 3. Each arm of the sea star is the action taken by the receiver in response to a single bottleneck unit, with opacity representing the frequency of use.

Figure 3: Sea star plots for three different settings in the edgeward navigation environment: (a) no shaped rewards, (b) standard shaped reward, (c) biased shaped reward. Each "sea star" corresponds to an independent language learned in the given setting.

Figure 4: (a) Histogram of language entropy (bits) with and without the shaped reward. (b) Entropy (bits) versus world radius with no shaped rewards. (c) Entropy (bits) versus world radius with the shaped reward.

In the setting with no shaped rewards, we see learned actions (i.e., the meanings of the messages) featuring 2 to 4 arms pointing in a variety of directions. Since the standard shaped reward takes both dimensions into account, we do not see any bias in the direction of the learned actions. With the trivially biased reward, though, we see that the learned languages exclusively favor actions near the horizontal axis. In this setting, nothing explicitly prevents the agents from learning vertical actions, but the fact that only the horizontal dimension receives the shaped reward makes horizontal actions easier to learn.
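For concreteness, each arm of a sea star can be read directly off the trained receiver by feeding it every one-hot message. The sketch below uses an untrained linear layer as a stand-in for a trained receiver, and the variable names are our own.

```python
import torch

n_words = 64
receiver = torch.nn.Linear(n_words, 2)  # stand-in for the trained 1-layer receiver

# One arm per bottleneck unit: the (x, y) action taken in response to that one-hot message.
with torch.no_grad():
    arms = receiver(torch.eye(n_words))  # shape: (n_words, 2)

# The opacity of each arm reflects how often that unit is used across evaluation episodes.
```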
5.2 CHANGING THE DISTRIBUTION OF ENTROPY

Naturally, a shaped reward which favors certain actions over others will bias the semantics of the language. Thus, our second experiment investigates more closely the effect that shaped rewards without this explicit bias can have. Specifically, we investigate the distribution of language entropies in the two environments. By entropy we are specifically referring to the Shannon entropy (measured in bits) of the bottleneck units as used in the trained agents' language (as averaged over 3000 quasirandomly initialized episodes). Entropy is an important aspect of language as it represents the upper bound on the information that the language can convey. Languages with higher entropy can yield more precise meaning in their utterances, yet this comes at the cost of being more difficult to learn or acquire, as they need a greater variety of training examples to be learned.

To investigate the distribution of language entropies, we look at a histogram showing the Shannon entropy of languages belonging to environments with and without shaped rewards. The distributions are computed from 2000 independent runs for each reward setting. This is shown in Figure 4a. The presence of shaped rewards shifts the distribution upwards, demonstrating that even a shaped reward which is free of a trivial bias can still bias the emergent language. A potential explanation of these results is discussed and illustrated in Section 5.4.

Figure 5: (a) Entropy of the expectation Chinese restaurant process (ECRP) as β is swept (see Section 5.4). (b) Entropy (bits) versus rollout buffer size with no shaped rewards. (c) Entropy (bits) versus rollout buffer size with the shaped reward.

5.3 MASKING ENVIRONMENTAL PARAMETERS

In our final primary experiment, we demonstrate how shaped rewards can mitigate the influence of environmental parameters on the entropy of the learned language. This is an issue insofar as the presence of shaped rewards makes it difficult to observe an emergent property of interest. Specifically, we look at how the standard shaped reward hides the effect of world radius on entropy in our centerward environment.

In Figures 4b and 4c, we plot the language entropies against different world radii. In both settings, we observe that entropy decreases as the world radius increases, but the setting with no shaped rewards shows a much more rapid decrease in entropy. We offer one possible explanation for this effect in Section 5.4. When the only reward signal agents have access to is the base reward, they can only accomplish the task by finding the goal randomly at first; as the size of the environment increases, the chance of finding the goal with random movements decreases and the agent pair often fails to learn a fully successful language at the highest world radii.

5.4 EXPERIENCE BUFFER SIZE

As an explanatory experiment, we demonstrate how changing the size of the PPO experience buffer has a significant impact on the entropy of the emergent language. We compare this with the effects we would expect to see according to the model presented in Section 4, i.e., the expectation Chinese restaurant process. In turn, we use this to explain one mechanism by which shaped rewards can have the observed effects on entropy shown by the previous experiments.
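Before turning to the results, the β sweep discussed below is easy to simulate directly from Algorithm 1. The following self-contained NumPy sketch is our own implementation: a finite, growable table list stands in for the infinite array, and since the weights are fixed within an iteration, the sampling probabilities are computed once per iteration; α = 5 and 1000 iterations follow Appendix B.

```python
import numpy as np

def expectation_crp(alpha: float, beta: int, n_iters: int, rng: np.random.Generator) -> np.ndarray:
    """Run the expectation Chinese restaurant process and return the normalized table weights."""
    weights = [1.0]  # one occupied table to start
    for _ in range(n_iters):
        # Unnormalized weights, with alpha for the first unused table.
        probs = np.array(weights + [alpha])
        probs /= probs.sum()
        addend = np.zeros(len(weights) + 1)
        for _ in range(beta):
            i = rng.choice(len(probs), p=probs)
            addend[i] += 1.0 / beta
        if addend[-1] > 0:
            weights.append(0.0)  # a new table was opened this iteration
        else:
            addend = addend[:-1]
        weights = [w + a for w, a in zip(weights, addend)]
    w = np.asarray(weights)
    return w / w.sum()

def entropy_bits(p: np.ndarray) -> float:
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
for beta in (1, 10, 100, 1000):
    dist = expectation_crp(alpha=5.0, beta=beta, n_iters=1000, rng=rng)
    print(f"beta={beta:4d}  entropy={entropy_bits(dist):.2f} bits")
```

Averaging the entropy over many independent runs per value of β is how one would produce a curve of the kind plotted in Figure 5a.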
In Figure 5a we show the effect of a logarithmic sweep of β on the entropy of the expectation Chinese restaurant process. We first observe that increasing β reduces the variance between distributions yielded by the process since, as β increases, the individual updates made in each iteration are also reduced in variance. In fact, in the limiting case as β → ∞, the process will always yield the same distribution, as the update will just be the expectation of sampling from the categorical distribution described by the (normalized) weights (plus α). The second effect is that increasing β will decrease the concentration (i.e., increase the entropy), on average, of the distribution yielded from the process. The intuition behind this is that since each update is less concentrated, the distribution as a whole will be less concentrated as the probability mass will be spread out.

These results can be used to explain, in part, both the effect of shaped rewards and world radius on entropy. First, though, we must establish a correspondence between the expectation Chinese restaurant process and the PPO learning process. An iteration of the process described in Algorithm 1 consists of sampling from the modified categorical distribution (Line 15) and incrementing the weights (Line 17). In PPO, the sampling corresponds to populating the experience buffer with steps associated with a reward, and the increment operation is analogous to PPO performing gradient descent on the agents using the buffer. Thus, β is analogously increased for PPO when the number of successful episodes per iteration increases, which depends both on the size of the experience buffer and on the environmental factors affecting the frequency of success.

In Figures 5b and 5c, we directly vary the size of the experience buffer in our environments with and without shaped rewards. Both environments replicate the correlation between β/buffer size and entropy, though the decrease in variance is less distinct as buffer size increases. Having established this correlation, we can offer a potential explanation for the experiments involving world radius as well as the distribution of entropies between the environments with and without shaped rewards. The shaped reward effectively increases β since it assigns a reward signal to every step, whereas the base-reward-only environment requires a successful episode. This effect is exacerbated when the world radius is increased: the base-reward-only environment yields rewards less frequently in the beginning because randomly finding the goal is less likely. This effectively decreases β, which corresponds to a lower entropy and higher variance.

6 CONCLUSION

We have, then, demonstrated the pitfalls that shaped rewards present for emergent language research: directly biasing the learned semantics of the language, changing the distribution of an emergent property of the language (i.e., entropy), and masking the emergent effects of other environmental variables. These experiments were performed with a novel navigation-based emergent language environment. This environment allows for shaped rewards through multi-step episodes and avoids a trivially optimal language by employing a continuous state and action space. In addition to this, we introduced the expectation Chinese restaurant process both to explain our own experimental results and to provide a foundation for future models of emergent language.

The limitations of this work can be illustrated through the future directions that could be taken. First, studying a variety of environments would further characterize what biases shaped rewards introduce into emergent languages.
Second, increasing the complexity of environments would greatly expand the range of both emergent properties and types of shaped rewards which could be studied. For example, a rich environment like chess presents many emergent strategies such as the valuation of positions, the balance between offense and defense, and the favoring of certain pieces; shaped rewards could then take the form of rewarding stochastic strategies, evaluating positions with traditional chess engines, or assigning pieces explicit point values. Furthermore, while we only use the expectation Chinese restaurant process as an explanatory model, further work could design experiments to demonstrate its predictive power.

The studies presented in this paper are both exploratory and anticipatory in nature since emergent language research has yet to tackle environments difficult enough to require shaped rewards. Nevertheless, the field will follow reinforcement learning in tackling progressively more difficult tasks which will present further opportunities for, or even require, shaped rewards. When this occurs, accounting for the biases which are inherent to shaped rewards is imperative to preserving the integrity of emergent language experiments. This work, then, prepares researchers in the field for these future challenges.

REPRODUCIBILITY STATEMENT

The code used in association with this paper is located at https://example.com/repo-name (in a ZIP file for review). Reproduction instructions are located in the README.md file. Experiments were performed on a 20-thread Intel Core i9-9900X server on which they take less than 24 hours to run. No random seeds are recorded or provided as the training process is stable and similar results should be observed for any random seed.

REFERENCES

David J. Aldous. Exchangeability and related topics. In P. L. Hennequin (ed.), École d'Été de Probabilités de Saint-Flour XIII — 1983, pp. 1–198, Berlin, Heidelberg, 1985. Springer Berlin Heidelberg. ISBN 978-3-540-39316-0.

David Blei. The Chinese restaurant process, 2007. URL https://www.cs.princeton.edu/courses/archive/fall07/cos597C/scribe/20070921.pdf.

Nicolo' Brandizzi, Davide Grossi, and Luca Iocchi. RLupus: Cooperation through emergent communication in the werewolf social deduction game, 2021.

Kalesha Bullard, Douwe Kiela, Franziska Meier, Joelle Pineau, and Jakob Foerster. Quasi-equivalence discovery for zero-shot emergent communication, 2021.

Rahma Chaabouni, Eugene Kharitonov, Emmanuel Dupoux, and Marco Baroni. Communicating artificial neural networks develop efficient color-naming systems. Proceedings of the National Academy of Sciences, 118(12), Mar 2021. ISSN 0027-8424, 1091-6490. doi: 10.1073/pnas.2016569118. URL https://www.pnas.org/content/118/12/e2016569118.

Tom Eccles, Yoram Bachrach, Guy Lever, Angeliki Lazaridou, and Thore Graepel. Biases for emergent communication in multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://papers.nips.cc/paper/2019/hash/fe5e7cb609bdbe6d62449d61849c38b0-Abstract.html.

Laura Harding Graesser, Kyunghyun Cho, and Douwe Kiela. Emergent linguistic phenomena in multi-agent communication games. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3700–3710. Association for Computational Linguistics, Nov 2019.
doi: 10.18653/v1/D19-1384. URL https://aclanthology.org/D19-1384.

Serhii Havrylov and Ivan Titov. Emergence of language with multi-agent games: Learning to communicate with sequences of symbols. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30, pp. 2149–2159. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/70222949cc0db89ab32c9969754d4758-Paper.pdf.

Rishi Hazra, Sonu Dixit, and Sayambhu Sen. Zero-shot generalization using intrinsically motivated compositional emergent protocols. Visually Grounded Interaction and Language Workshop, NAACL, 2021.

Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. In Proceedings of the 2017 International Conference on Learning Representations (ICLR), 2017. URL https://openreview.net/forum?id=rkE3y85ee.

Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Caglar Gulcehre, Pedro Ortega, DJ Strouse, Joel Z. Leibo, and Nando De Freitas. Social influence as intrinsic motivation for multi-agent deep reinforcement learning. In International Conference on Machine Learning, pp. 3040–3049. PMLR, May 2019. URL http://proceedings.mlr.press/v97/jaques19a.html.

Ivana Kajić, Eser Aygün, and Doina Precup. Learning to cooperate: Emergent communication in multi-agent navigation. In 42nd Annual Meeting of the Cognitive Science Society, pp. 1993–1999, Toronto, ON, 2020. Cognitive Science Society.

Eugene Kharitonov, Rahma Chaabouni, Diane Bouchacourt, and Marco Baroni. Entropy minimization in emergent languages. In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 5220–5230. PMLR, 13–18 Jul 2020. URL http://proceedings.mlr.press/v119/kharitonov20a.html.

Satwik Kottur, José Moura, Stefan Lee, and Dhruv Batra. Natural language does not emerge "naturally" in multi-agent dialog. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017. doi: 10.18653/v1/d17-1321. URL http://dx.doi.org/10.18653/v1/D17-1321.

Angeliki Lazaridou, Karl Moritz Hermann, Karl Tuyls, and Stephen Clark. Emergence of linguistic communication from referential games with symbolic and pixel input. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HJGv1Z-AW.

Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. In Proceedings of the 2017 International Conference on Learning Representations (ICLR), 2017. URL https://openreview.net/forum?id=S1jE5L5gl.

Igor Mordatch and Pieter Abbeel. Emergence of grounded compositional language in multi-agent populations, 2018.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019.
URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.

Antonin Raffin, Ashley Hill, Maximilian Ernestus, Adam Gleave, Anssi Kanervisto, and Noah Dormann. Stable Baselines3. https://github.com/DLR-RM/stable-baselines3, 2019.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.

Helmut Vogel. A better way to construct the sunflower head. Mathematical Biosciences, 44(3):179–189, 1979. ISSN 0025-5564. doi: https://doi.org/10.1016/0025-5564(79)90080-4. URL https://www.sciencedirect.com/science/article/pii/0025556479900804.

Eric Wiewiora. Reward Shaping, pp. 863–865. Springer US, Boston, MA, 2010. ISBN 978-0-387-30164-8. doi: 10.1007/978-0-387-30164-8_731. URL https://doi.org/10.1007/978-0-387-30164-8_731.

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.

A ENVIRONMENTS OF RELATED WORK

Havrylov & Titov (2017) Standard signalling game where the observations are natural images from Microsoft COCO. There is no trivially optimal language since the information being communicated is simply which image is being shown. There are natural image classes (e.g., cat vs. dog), but they are not necessarily the features of the images which the agents need to communicate. The standard signalling game is single-step.

Kottur et al. (2017) "Task and talk" dialog game where one agent must ask questions of the other agent to learn the attributes of an object. Specifically, the questioner has a set of attributes it must learn (the objective) about the object that only the answerer can see. The desired language has the questioner using a unique message to specify a property and the answerer responding with a unique message to specify the value for that property, such that messages are always in one-to-one correspondence with what they are communicating. Such a language is trivially optimal. Multi-round dialog environments are inherently multi-step.

Mordatch & Abbeel (2018) Collaborative navigation game where multiple agents tell each other where to move in a 2D environment. The environment consists of agents and landmarks distinguished by color. Each agent is given a "goal"; for example, the blue agent might receive "red agent go to green landmark" (represented as a categorical feature vector) which the blue agent then has to communicate to the red agent. The agents have a vocabulary large enough to assign a unique "word" to each concept being expressed in one-to-one correspondence, yielding a trivially optimal language. Each episode consists of multiple timesteps, each of which can have its own utterance.

Lazaridou et al. (2018) An image-based signalling game using rendered images from MuJoCo. The information being communicated is the shape and color of an object, where the sets of shapes and colors are both small. Since the number of words available is at least as big as the cardinalities of the sets and the sender is able to use multiple words per utterance, it is possible to construct a trivially optimal language. The standard signalling game is single-step.

Kharitonov et al. (2020) A binary vector-based signalling game with shared context.
The goal of the signalling game is to communicate the bits of a binary vector which are not shared by the sender and receiver. The messages consist of a single symbol where the number of unique symbols is greater than the number of combinations of bits to be communicated; thus there is a trivially optimal language where one unique symbol is assigned to each possible combination of unshared bits. The standard signalling game is single-step.

Chaabouni et al. (2021) A signalling game with 330 different colors represented as a real vector in CIELAB color space. The game is set up so that colors which are nearby never appear in the same distractor set, which encourages solutions which can cover some arbitrary region of colors. Due to this "fuzzy" concept of solution (i.e., not using 330 distinct words for each color), we consider this environment not to have a trivially optimal solution. The standard signalling game is single-step.

B HYPERPARAMETERS

B.1 DEFAULT CONFIGURATION

Environment
• Type: centerward
• World radius: 9
• Goal radius: 1
• Max steps per episode: 3 × world radius

Agent Architecture
• Bottleneck size: 64
• Architecture (sender is layers 1–3, receiver is layer 5; bottleneck size is N):
  1. Linear w/ bias: 2 in, 32 out
  2. Tanh activation
  3. Linear w/ bias: 32 in, N out
  4. Gumbel-Softmax: N in, N out
  5. Linear w/ bias: N in, 2 out (action) and 1 out (value)
• Bottleneck (Gumbel-Softmax) temperature: 1.5
• Weight initialization: U(−√(1/n), √(1/n)), where n is the input size of the layer (PyTorch 1.10 default)

Optimization
• Reinforcement learning algorithm: proximal policy optimization
  – Default hyperparameters used unless otherwise noted: https://stable-baselines3.readthedocs.io/en/v1.0/modules/ppo.html
• Training steps: 1 × 10^5
• Evaluation episodes: 3 × 10^3
• Learning rate: 3 × 10^−3
• Experience buffer size: 1024
• Batch size: 256
• Temporal discount factor (γ): 0.9

B.2 EXPERIMENT-SPECIFIC CONFIGURATIONS

Note that a logarithmic sweep from x to y (inclusive) with n steps is defined by Equation 3.

\left\{ x \cdot (y/x)^{i/(n-1)} \;\middle|\; i \in \{0, 1, \ldots, n-1\} \right\}    (3)

Biased Semantics
• Type: edgeward
• World radius: 8
• Goal radius: 8
• Experience buffer size: 256
• Batch size: 256

Changing the Distribution of Entropy
• Number of independent runs per configuration: 2000
• Experience buffer size: 256
• Batch size: 256

Masking Environmental Parameters
• World radius: logarithmic sweep from 2 to 40 with 300 steps

Experience Buffer Size
• Experience buffer size: logarithmic sweep from 32 to 4096 with 200 steps (floor function applied as experience buffer size is integer-valued)
• Training steps: 2 × 10^5

Expectation Chinese Restaurant Process
This is not an emergent language experiment and just consists of running the mathematical model.
• α: 5
• β: logarithmic sweep from 1 to 1000 with 10 000 steps (floor function applied, though plotted without floor function)
• Number of iterations per run: 1000
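For reference, the logarithmic sweep of Equation 3 can be generated in a couple of lines. This is our own sketch; numpy.geomspace(x, y, n) yields the same values for n ≥ 2.

```python
import numpy as np

def log_sweep(x: float, y: float, n: int) -> np.ndarray:
    # Equation 3: { x * (y / x) ** (i / (n - 1)) : i in {0, ..., n - 1} }
    i = np.arange(n)
    return x * (y / x) ** (i / (n - 1))

# Example: the world-radius sweep from 2 to 40 with 300 steps
# (a floor function is applied wherever integer values are required).
radii = log_sweep(2, 40, 300)
```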