Under review as a conference paper at ICLR 2022

SHAPED REWARDS BIAS EMERGENT LANGUAGE

Anonymous authors
Paper under double-blind review

ABSTRACT

One of the primary characteristics of emergent phenomena is that they are determined by the basic properties of the system whence they emerge as opposed to explicitly designed constraints. Reinforcement learning is often used to elicit such phenomena, which specifically arise from the pressure to maximize reward. We distinguish two types of rewards. The first is the base reward, which is motivated directly by the task being solved. The second is shaped rewards, which are designed specifically to make the task easier to learn by introducing biases in the learning process. The inductive bias which shaped rewards introduce is problematic for emergent language experimentation because it biases the object of study: the emergent language. The fact that shaped rewards are intentionally designed conflicts with the basic premise of emergent phenomena arising from basic principles. In this paper, we use a simple sender-receiver navigation game to demonstrate how shaped rewards can 1) explicitly bias the semantics of the learned language, 2) significantly change the entropy of the learned language, and 3) mask the potential effects of other environmental variables of interest.

1 INTRODUCTION

In emergent language research, the goal is to study language as it emerges from the inherent properties of the environment, language, and agents. One pitfall for such experiments, though, is that the language simply mirrors design choices of the environment or of the experimental setting more generally. For example, Bullard et al. (2021) introduce a method for discovering optimal languages for communication between independently trained agents, yet rather than emerging from basic principles, the learned language is the result of an intentionally designed search algorithm. Reinforcement learning is a common tool in this field for observing the emergence of language out of a reward maximization pressure. One such design choice which can obscure these emergent properties is adding shaped rewards on top of the base reward of the reinforcement learning environment (Wiewiora, 2010). The base reward of the environment derives directly from succeeding at the task in question. The difficulty with relying solely on the base reward is that if the task is especially long or complicated, the agent may only receive a reward infrequently, which makes for a difficult learning problem. In such a case, the base reward is considered sparse. This motivates shaped rewards, which are inserted at intermediate steps based on domain knowledge in order to introduce an inductive bias towards good solutions. For example, the base reward in chess would simply be winning or losing the game. A shaped reward could then be given for taking the opponent's material while not losing your own. While this shaped reward is often a good heuristic, it can lead to local optima; for example, it discourages strategies which would sacrifice individual pieces in order to win the whole game. While local optima present a problem for maximizing reward, the biases introduced by shaped rewards present a unique problem for emergent language, which we will highlight in this paper.
For emergent language research, the inductive bias which shaped rewards introduce is especially problematic because it exerts a significant influence on the learned language whose emergent properties are the object of study. This influence can comprise 1) biasing the semantics of the language, 2) changing a property of the whole language (e.g., language entropy), or 3) masking the influence of some other environmental parameter on the language. For example, some emergent language works incorporate shaped rewards into their environments without accounting for the biases they may introduce (Mordatch & Abbeel, 2018; Brandizzi et al., 2021; Harding Graesser et al., 2019). From an engineering and design perspective, tweaking the system to achieve a desired result is standard practice, but from a scientific and experimental perspective, these additional shaped rewards serve as potential confounding factors and hinder accurate observations of the emergent phenomena.

Figure 1: (a) In the centerward environment, the receiver must navigate to the center of the world. (b) The architecture of our two-agent, asymmetric system: the sender perceptron observes the receiver's (x, y) location and passes a message through a Gumbel-Softmax bottleneck to the receiver perceptron, which outputs an (x, y) action.

We study this by introducing a simple navigation game with continuous states and actions where a sender passes a one-word message to the receiver (illustrated in Figures 1a and 1b). Within this environment, our experiments look at the entropy and semantics of the emergent language in the absence and presence of shaped rewards. In the course of these experiments, we find that the reinforcement learning algorithm's experience buffer size has a significant impact on the entropy of the learned language and potentially explains our experimental findings. To this end, we introduce a mathematical model based on the Chinese restaurant process for understanding the effect of experience buffer size on emergent language more generally. We highlight the following contributions in our paper:

• Demonstrating basic ways in which shaped rewards can undesirably influence emergent language experiments
• Presenting a mathematical model for understanding the role of experience buffer size in the entropy of emergent language

2 RELATED WORK

2.1 TAXONOMY

We intentionally design our environment to study shaped rewards and the entropy of language, which requires it to differ from prior art in specific ways. To elucidate this, we create a taxonomy of emergent language environments based on whether or not the environment has multi-step episodes and the presence of a trivially optimal language (defined below). The taxonomy is given in Table 1 and a brief description of each environment is given in Appendix A. Generally speaking, the motivation for shaped rewards in a given environment is sparsity of the base reward, which requires a multi-step, multi-utterance environment. Thus, our experiments naturally require a multi-step environment.

We consider an environment to have a trivially optimal language if the information which needs to be communicated from sender to receiver can be perfectly represented in the emergent language. Such a language most frequently arises when the communicated information is derived from a small number of categorical variables encoded in the observation.
For example, in an environment where the sender must specify an element of the set {red, green, blue} × {square, circle} using messages from the set {r, g, b} × {s, c}, a trivial language is one where color maps to a unique letter in the first position and shape maps to a unique letter in the second position. Other environments have no trivially optimal languages. For example, if the sender must communicate an element of {1, 2, ..., 100} using messages from the set {a, b, c}, there is no trivially optimal language since the sender can at best partition the set of integers to reach an optimal but imperfect solution.

Table 1: Summary of related work in terms of emergent language configurations used.

Paper                     | Task               | Trivial? | Multi-step?
Havrylov & Titov (2017)   | image signalling   | no       | no
Kottur et al. (2017)      | dialog Q&A         | yes      | yes
Mordatch & Abbeel (2018)  | goal specification | yes      | yes
Lazaridou et al. (2018)   | image signalling   | yes      | no
Kharitonov et al. (2020)  | vector signalling  | yes      | no
Chaabouni et al. (2021)   | color signalling   | no       | no
This paper                | navigation         | no       | yes

Kharitonov et al. (2020) give evidence that there is an entropy minimization pressure inherent in emergent language settings. Building on this, Chaabouni et al. (2021) explicitly look at the tradeoff between entropy and reward—higher entropy languages have the potential to convey more information at the expense of being more difficult to learn. This tradeoff disappears if a trivially optimal language is learned since there is no further reward maximization pressure. Language entropy greater than this minimum, then, does not emerge from the reward maximization pressure. Although such an environment does not preclude studying entropy, we choose to use an environment where the information being communicated is fully continuous so that an increase in entropy can always translate to an increase in reward. This leads to a smooth tradeoff between entropy and reward which is illustrated in Figure 2.

Figure 2: Maximum reward for a given entropy. When a trivially optimal language exists (a), the reward plateaus at a global maximum where further increases in entropy do not increase reward. The environment in this paper has no such trivial language and is more similar to (b).

2.2 SPECIFIC APPROACHES

The environment and agent configuration of this paper are most closely related to Chaabouni et al. (2021), who test the balance between entropy and semantic precision in a two-agent color discrimination task. Although the color space comprises 330 distinct colors, the environment facilitates languages which cover a (learned) region of nearby colors. In this way, there is no trivially optimal, one-to-one language as the task is inherently fuzzy. Havrylov & Titov (2017) use a signalling game with natural images which lacks a trivially optimal language, but using natural images results in many uncontrolled variables, lessening the ability of the experiments to make basic, first-principles claims.

Shaped rewards have been explored previously in emergent language primarily as an inductive bias for encouraging some property, such as compositionality, to emerge (Hazra et al., 2021; Jaques et al., 2019; Eccles et al., 2019). This paper, instead, focuses on the negative aspects of inductive biases introduced by shaped rewards regarding how they can hinder empirical investigation of emergent languages.
Superficially, the environment used in this paper bears resemblance to Mordatch & Abbeel (2018) and Kajić et al. (2020) as these both deal with navigation. In both of these environments, the agents communicate about a discrete set of actions or goal locations, whereas the agents in this paper communicate about a continuous action. Neither of these papers specifically investigates the effects of shaped rewards on the languages learned.

3 METHODS

3.1 ENVIRONMENT

In this paper we use a simple 2-dimensional navigation environment with two closely related tasks. A sender agent observes the position of a receiver agent, sends a message to the receiver, and the receiver takes an action. In the centerward task (illustrated in Figure 1a), the receiver is initialized uniformly at random within a circle and must navigate towards a circular goal region at the center. In the edgeward task, the receiver is initialized uniformly at random within a circle and must navigate to a goal region comprising the entire area outside of the circle. The centerward environment is the more realistic of the two. The edgeward environment, on the other hand, admits a greater variety of languages, since moving consistently in any one direction eventually reaches the edge; therefore, learning to move in a variety of evenly spaced directions solves the task more quickly but is not strictly necessary. There are no obstacles or walls in the environment. The receiver's location and action are continuous variables stored as floating-point values. If the receiver does not reach the goal region within a certain number of steps, the episode ends with no reward given.

3.2 AGENT ARCHITECTURE

Our architecture comprises two agents, conceptually speaking, but in practice, they are a single neural network. The sender is a disembodied agent which observes the location of the receiver and passes a message in order to guide it towards the goal. The receiver is an agent which receives the message as its only input and takes an action based solely on that message (i.e., it is "blind"). The sender and receiver are randomly initialized at the start of training, trained together, and tested together.

The architecture of the agents is illustrated in Figure 1b. The observation of the sender is a pair of floating-point values representing the receiver's location. The sender itself is a 2-layer perceptron with tanh activations. The output of the second layer is passed to a Gumbel-Softmax bottleneck layer (Maddison et al., 2017; Jang et al., 2017) which enables learning a discrete, one-hot representation, as an information bottleneck (Tishby et al., 2000). The activations of this layer can be thought of as the "words" or "lexicon" of the emergent language. At evaluation time, the bottleneck layer functions deterministically as an argmax layer, emitting one-hot vectors. The receiver is a 1-layer perceptron which receives the output of the Gumbel-Softmax layer as input. The output is a pair of floating-point values which determine the action of the agent. The action is clamped to a maximum step size.

3.3 OPTIMIZATION

Optimization is performed using stochastic gradient descent as a part of proximal policy optimization (PPO) (Schulman et al., 2017). Specifically, we use the implementation of PPO provided by Stable Baselines 3 with our neural networks implemented in PyTorch (Raffin et al., 2019; Paszke et al., 2019).
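To make this setup concrete, the following is a minimal PyTorch sketch of the sender-receiver network described in Section 3.2. It is our own illustration, not the authors' released code: layer sizes and the Gumbel-Softmax temperature follow the defaults in Appendix B, the PPO value head is omitted, and the maximum step size and the per-coordinate clamp are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SenderReceiver(nn.Module):
    """Sender (2-layer perceptron) -> Gumbel-Softmax bottleneck -> receiver (1-layer perceptron)."""

    def __init__(self, n_words: int = 64, max_step: float = 1.0, temperature: float = 1.5):
        super().__init__()
        # Sender: maps the receiver's (x, y) location to logits over bottleneck units ("words").
        self.sender = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, n_words))
        # Receiver: maps a one-hot word to an (x, y) action.
        self.receiver = nn.Linear(n_words, 2)
        self.temperature = temperature
        self.max_step = max_step

    def forward(self, location: torch.Tensor) -> torch.Tensor:
        logits = self.sender(location)
        if self.training:
            # Straight-through Gumbel-Softmax: one-hot forward pass, differentiable backward pass.
            word = F.gumbel_softmax(logits, tau=self.temperature, hard=True)
        else:
            # Deterministic argmax at evaluation time.
            word = F.one_hot(logits.argmax(dim=-1), num_classes=logits.shape[-1]).float()
        action = self.receiver(word)
        # Clamp the action to a maximum step size (per coordinate in this sketch).
        return action.clamp(-self.max_step, self.max_step)

# Example: one forward pass for a receiver located at (0.3, -0.7).
model = SenderReceiver()
action = model(torch.tensor([[0.3, -0.7]]))
```

Whether hard (straight-through) sampling is used during training is not stated in the paper; hard=True is one common choice that keeps the message one-hot end to end.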
Using a Gumbel-Softmax bottleneck layer allows for end-to-end backpropagation, making optimization faster and more consistent than using a backpropagation-free method like REINFORCE (Kharitonov et al., 2020; Williams, 1992).

We will give a basic explanation of how PPO (and related algorithms) works, as it is necessary to connect it to the mathematical model presented in Section 4. PPO has two stages: sampling and optimizing. During the sampling stage, the algorithm runs the agents in the environment and stores the states, actions, and rewards in an experience buffer (or rollout buffer). The optimization stage then comprises performing gradient descent on the agents (i.e., the neural network policy) using the data from the experience buffer. The next iteration starts with the updated agents and an empty experience buffer.

3.4 REWARDS

We make use of two different rewards in our configuration, a base reward and a shaped reward. The base reward is simply a positive reward of 1 given if the receiver reaches the goal region before the episode ends and no reward otherwise. The shaped reward, given at every timestep, is the decrease in distance to the goal. If the goal region is centered at (0, 0), the standard shaped reward for the centerward environment is given by Equation 1; we also use a trivially biased version of the reward, specified in Equation 2, which only takes horizontal distance into account.

r_t = \sqrt{x_{t-1}^2 + y_{t-1}^2} - \sqrt{x_t^2 + y_t^2}    (1)

r'_t = \sqrt{x_{t-1}^2} - \sqrt{x_t^2}    (2)

For the edgeward environment, we use the opposite of r_t as the goal is to move away from the center.

The interplay between base and shaped rewards is important to understand in the larger context of how reinforcement learning problems are structured. The base rewards are well-motivated and directly correspond to the ultimate aim of the task, but their sparsity can make it difficult for the agents to learn to succeed. Shaped rewards facilitate learning by using expert knowledge to form an inductive bias, yet they present a drawback for traditional reinforcement learning and emergent language. Within reinforcement learning, where the goal is to train the best-performing agent, shaped rewards can lead to the agent finding local optima if better solutions are excluded by the inductive bias. Within emergent language, the problem is more nuanced as the goal is primarily to study a wide range of emergent properties of the language learned within the environment. While base rewards have a natural connection to the environment and task, shaped rewards introduce a reward signal which is not intrinsically connected with the task and environment, even if it is a good heuristic.
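As a concrete reference, Equations 1 and 2 (and their edgeward negation) amount to only a few lines. The sketch below is our own transcription; the biased and edgeward flags are illustrative names, not anything from the paper's code.

```python
import math

def shaped_reward(prev_xy, curr_xy, biased: bool = False, edgeward: bool = False) -> float:
    """Per-step shaped reward with the goal centered at the origin.

    Standard reward (Eq. 1): decrease in Euclidean distance to the center.
    Trivially biased reward (Eq. 2): decrease in horizontal distance only.
    Edgeward task: the sign flips, since the goal is to move away from the center.
    """
    (x0, y0), (x1, y1) = prev_xy, curr_xy
    if biased:
        r = abs(x0) - abs(x1)                        # Eq. 2: sqrt(x0^2) - sqrt(x1^2)
    else:
        r = math.hypot(x0, y0) - math.hypot(x1, y1)  # Eq. 1
    return -r if edgeward else r
```

The base reward of 1 for reaching the goal region would be added separately when the episode terminates successfully.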
4 EXPLANATORY MODEL

We argue that the results presented in Sections 5.2 and 5.3 can be explained by an effective change in the size of PPO's experience buffer. In order to illustrate this, we introduce a simple mathematical model based on the Chinese restaurant process (Blei, 2007; Aldous, 1985). While the model does not exactly match our experimental setup, the key similarities allow us to reason about our results as well as potential future experiments.

The Chinese restaurant process is an iterative stochastic process which yields a probability distribution over the positive integers. The key similarity between the Chinese restaurant process and our learning setup is that they are self-reinforcing processes; that is, when a given value is selected in one iteration of the process, the probability that that value is chosen again in subsequent iterations increases. We generalize one aspect of the Chinese restaurant process in Section 4.2 to better match the sampling-optimization procedure of PPO. The primary simplification which this model makes is that it does not take into account the "meaning" of actions and the effects they have within the environment. For example, every successful agent in the centerward environment will use at least three distinct words nearly equiprobably so as to span the 2-dimensional space, whereas no such lower bound exists in the stochastic process.

4.1 CHINESE RESTAURANT PROCESS

As the name suggests, a useful analogy for the Chinese restaurant process starts with a restaurant with infinitely many empty tables, indexed by the positive integers, which can hold an unbounded number of customers. As each customer walks in, they sit at a populated table with a probability proportional to the number of people already at that table. The customer instead sits at a new table with probability proportional to a hyperparameter α, which modulates the concentration of the final distribution. The decision the customer makes is equivalent to sampling from a categorical distribution where the unnormalized weights are the customer counts along with the weight, α, for the new table. The pseudocode for the Chinese restaurant process is given in Algorithm 1 for β = 1. By analogy to the neural networks representing our agents, we can view the tables as bottleneck units and the customers choosing a table as parameter updates which reinforce the use of that unit in accordance with the reward. Mordatch & Abbeel (2018) implicitly assume this when they introduce a reward corresponding to the probability that the emergent lexicon is generated by a Chinese restaurant process (a Dirichlet process, in their words).

The self-reinforcing property can be expressed informally as: more popular tables get more new customers, keeping them popular. A higher α means that customers are more likely to sit at a new table, so the distribution over tables will be more spread out in expectation. The distribution stabilizes as the number of iterations goes to infinity, as an individual new customer has a diminishing effect on the relative sizes of the weights.

Algorithm 1 Expectation Chinese Restaurant Process
 1  assert type(alpha) is float and alpha > 0
 2  assert type(n_iters) is int and n_iters >= 0
 3  assert type(beta) is int and beta > 0
 4
 5  def sample_categorical_alpha(weights):
 6      w_alpha = weights.copy()
 7      k = num_nonzero(weights)
 8      w_alpha[k + 1] = alpha  # weight alpha for the first unused table
 9      return sample_categorical(w_alpha / sum(w_alpha))
10
11  weights = array([1, 0, 0, ...])
12  for _ in range(n_iters):
13      addend = array([0, ...])
14      for _ in range(beta):
15          i = sample_categorical_alpha(weights)
16          addend[i] += 1 / beta
17      weights += addend
18  return weights / sum(weights)

4.2 EXPECTATION CHINESE RESTAURANT PROCESS

The key difference between how the Chinese restaurant process and PPO work is the relationship between sampling (i.e., simulating episodes) and updating the weights/parameters. In each iteration, the regular Chinese restaurant process draws a sample based on its weights and updates those weights immediately.
In PPO, the agent will populate the experience buffer with a number of steps (on the order of 100 to 1000) in the environment before performing gradient descent with that buffer to update the parameters. As a result, the parameter update is performed based on a weighting across multiple bottleneck units according to how often they were used in the episodes recorded in the experience buffer. Thus, to appropriately generalize the Chinese restaurant process, we introduce the expectation Chinese restaurant process. In this process, we add a hyperparameter β which is a positive integer describing how many samples we take from the distribution before updating the weights; the updates are normalized by β so the sum of all weights still only increases by 1 per iteration. The restaurant analogy breaks down here, as we would have to say that in each iteration, β customers simultaneously and independently make a decision, get shrunk to 1/β-th their size, and then sit at their table of choice. The pseudocode for the expectation Chinese restaurant process is given in Algorithm 1.

5 EXPERIMENTS

Each run of an experiment starts by training a sender and receiver for a fixed number of timesteps for a range of independent variable values. The trained models are then evaluated by initializing 3000 episodes at evenly distributed locations using Vogel's method (Vogel, 1979). In most settings, the agents are able to achieve a 100% success rate during training and evaluation; we remove from consideration any models which do not. All models for our experiments use 2^6 = 64 bottleneck units, which translates to a maximum entropy of 6 bits. Hyperparameters are given in Appendix B.

5.1 BIASED SEMANTICS

In our first experiment we demonstrate how shaped rewards which are trivially biased directly distort the semantics of the language, that is, the action associated with each bottleneck unit. We compare three settings: no shaped reward, the standard shaped reward, and the trivially biased shaped reward. We visualize the semantics of the language with so-called "sea star plots" in Figure 3. Each arm of the sea star is the action taken by the receiver in response to a single bottleneck unit, with opacity representing the frequency of use.

Figure 3: Sea star plots for three different settings in the edgeward navigation environment: (a) no shaped rewards, (b) standard shaped reward, (c) biased shaped reward. Each "sea star" corresponds to an independent language learned in the given setting.

Figure 4: (a) Histogram of language entropy (bits) with and without the shaped reward. (b) Entropy (bits) versus world radius with no shaped rewards. (c) Entropy (bits) versus world radius with the shaped reward.

In the setting with no shaped rewards, we see learned actions (i.e., the meanings of the messages) featuring 2 to 4 arms pointing in a variety of directions. Since the standard shaped reward takes both dimensions into account, we do not see any bias in the direction of the learned actions. With the trivially biased reward, though, we see that the learned languages exclusively favor actions near the horizontal axis. In this setting, nothing explicitly prevents the agents from learning vertical actions, but the fact that only the horizontal dimension receives the shaped reward makes horizontal actions easier to learn.
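For concreteness, each arm of a sea star can be read directly off the trained receiver by feeding it every one-hot message. The sketch below uses an untrained linear layer as a stand-in for a trained receiver, and the variable names are our own.

```python
import torch

n_words = 64
receiver = torch.nn.Linear(n_words, 2)  # stand-in for the trained 1-layer receiver

# One arm per bottleneck unit: the (x, y) action taken in response to that one-hot message.
with torch.no_grad():
    arms = receiver(torch.eye(n_words))  # shape: (n_words, 2)

# The opacity of each arm reflects how often that unit is used across evaluation episodes.
```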
5.2 CHANGING THE DISTRIBUTION OF ENTROPY

Naturally, a shaped reward which favors certain actions over others will bias the semantics of the language. Thus, our second experiment investigates more closely the effect that shaped rewards without this explicit bias can have. Specifically, we investigate the distribution of language entropies in the two environments. By entropy we are specifically referring to the Shannon entropy (measured in bits) of the bottleneck units as used in the trained agents' language (as averaged over 3000 quasirandomly initialized episodes). Entropy is an important aspect of language as it represents the upper bound on the information that the language can convey. Languages with higher entropy can yield more precise meaning in their utterances, yet this comes at the cost of being more difficult to learn or acquire, as they need a greater variety of training examples to be learned.

To investigate the distribution of language entropies, we look at a histogram showing the Shannon entropy of languages belonging to environments with and without shaped rewards. The distributions are computed from 2000 independent runs for each reward setting. This is shown in Figure 4a. The presence of shaped rewards shifts the distribution upwards, demonstrating that even a shaped reward which is free of a trivial bias can still bias the emergent language. A potential explanation of these results is discussed and illustrated in Section 5.4.

Figure 5: (a) Entropy of the expectation Chinese restaurant process (ECRP) as β is swept (see Section 5.4). (b) Entropy (bits) versus rollout buffer size with no shaped rewards. (c) Entropy (bits) versus rollout buffer size with the shaped reward.

5.3 MASKING ENVIRONMENTAL PARAMETERS

In our final primary experiment, we demonstrate how shaped rewards can mitigate the influence of environmental parameters on the entropy of the learned language. This is an issue insofar as the presence of shaped rewards makes it difficult to observe an emergent property of interest. Specifically, we look at how the standard shaped reward hides the effect of world radius on entropy in our centerward environment.

In Figures 4b and 4c, we plot the language entropies against different world radii. In both settings, we observe that entropy decreases as the world radius increases, but the setting with no shaped rewards shows a much more rapid decrease in entropy. We offer one possible explanation for this effect in Section 5.4. When the only reward signal agents have access to is the base reward, they can only accomplish the task by finding the goal randomly at first; as the size of the environment increases, the chance of finding the goal with random movements decreases and the agent pair often fails to learn a fully successful language at the highest world radii.

5.4 EXPERIENCE BUFFER SIZE

As an explanatory experiment, we demonstrate how changing the size of the PPO experience buffer has a significant impact on the entropy of the emergent language. We compare this with the effects we would expect to see according to the model presented in Section 4, i.e., the expectation Chinese restaurant process. In turn, we use this to explain one mechanism by which shaped rewards can have the observed effects on entropy shown by the previous experiments.
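Before turning to the results, the β sweep discussed below is easy to simulate directly from Algorithm 1. The following self-contained NumPy sketch is our own implementation: a finite, growable table list stands in for the infinite array, and since the weights are fixed within an iteration, the sampling probabilities are computed once per iteration; α = 5 and 1000 iterations follow Appendix B.

```python
import numpy as np

def expectation_crp(alpha: float, beta: int, n_iters: int, rng: np.random.Generator) -> np.ndarray:
    """Run the expectation Chinese restaurant process and return the normalized table weights."""
    weights = [1.0]  # one occupied table to start
    for _ in range(n_iters):
        # Unnormalized weights, with alpha for the first unused table.
        probs = np.array(weights + [alpha])
        probs /= probs.sum()
        addend = np.zeros(len(weights) + 1)
        for _ in range(beta):
            i = rng.choice(len(probs), p=probs)
            addend[i] += 1.0 / beta
        if addend[-1] > 0:
            weights.append(0.0)  # a new table was opened this iteration
        else:
            addend = addend[:-1]
        weights = [w + a for w, a in zip(weights, addend)]
    w = np.asarray(weights)
    return w / w.sum()

def entropy_bits(p: np.ndarray) -> float:
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
for beta in (1, 10, 100, 1000):
    dist = expectation_crp(alpha=5.0, beta=beta, n_iters=1000, rng=rng)
    print(f"beta={beta:4d}  entropy={entropy_bits(dist):.2f} bits")
```

Averaging the entropy over many independent runs per value of β is how one would produce a curve of the kind plotted in Figure 5a.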
In Figure 5a we show the effect of a logarithmic sweep of β on the entropy of the expectation Chinese restaurant process. We first observe that increasing β reduces the variance between distributions yielded by the process since, as β increases, the individual updates made in each iteration are also reduced in variance. In fact, in the limiting case as β → ∞, the process will always yield the same distribution, as the update will just be the expectation of sampling from the categorical distribution described by the (normalized) weights (plus α). The second effect is that increasing β will decrease the concentration (i.e., increase the entropy), on average, of the distribution yielded from the process. The intuition behind this is that since each update is less concentrated, the distribution as a whole will be less concentrated as the probability mass will be spread out.

These results can be used to explain, in part, both the effect of shaped rewards and world radius on entropy. First, though, we must establish a correspondence between the expectation Chinese restaurant process and the PPO learning process. An iteration of the process described in Algorithm 1 consists of sampling from the modified categorical distribution (Line 15) and incrementing the weights (Line 17). In PPO, the sampling corresponds to populating the experience buffer with steps associated with a reward, and the increment operation is analogous to PPO performing gradient descent on the agents using the buffer. Thus, β is analogously increased for PPO when the number of successful episodes per iteration increases, which depends both on the size of the experience buffer and on the environmental factors affecting the frequency of success.

In Figures 5b and 5c, we directly vary the size of the experience buffer in our environments with and without shaped rewards. Both environments replicate the correlation between β/buffer size and entropy, though the decrease in variance is less distinct as buffer size increases. Having established this correlation, we can offer a potential explanation for the experiments involving world radius as well as the distribution of entropies between the environments with and without shaped rewards. The shaped reward effectively increases β since it assigns a reward signal to every step, whereas the base-reward-only environment requires a successful episode. This effect is exacerbated when the world radius is increased: the base-reward-only environment yields rewards less frequently in the beginning because randomly finding the goal is less likely. This effectively decreases β, which corresponds to a lower entropy and higher variance.

6 CONCLUSION

We have, then, demonstrated the pitfalls that shaped rewards present for emergent language research: directly biasing the learned semantics of the language, changing the distribution of an emergent property of the language (i.e., entropy), and masking the emergent effects of other environmental variables. These experiments were performed with a novel navigation-based emergent language environment. This environment allows for shaped rewards through multi-step episodes and avoids a trivially optimal language by employing a continuous state and action space. In addition to this, we introduced the expectation Chinese restaurant process both to explain our own experimental results and to provide a foundation for future models of emergent language.

The limitations of this work can be illustrated through the future directions that could be taken. First, studying a variety of environments would further characterize what biases shaped rewards introduce into emergent languages.
Second, increasing the complexity of environments would greatly expand the range of both emergent properties and types of shaped rewards which could be studied. For example, a rich environment like chess presents many emergent strategies such as the valuation of positions, the balance between offense and defense, and the favoring of certain pieces; shaped rewards could then take the form of rewarding stochastic strategies, evaluating positions with traditional chess engines, or assigning pieces explicit point values. Furthermore, while we only use the expectation Chinese restaurant process as an explanatory model, further work could design experiments to demonstrate its predictive power.

The studies presented in this paper are both exploratory and anticipatory in nature since emergent language research has yet to tackle environments difficult enough to require shaped rewards. Nevertheless, the field will follow reinforcement learning in tackling progressively more difficult tasks which will present further opportunities for, or even require, shaped rewards. When this occurs, accounting for the biases which are inherent to shaped rewards is imperative to preserving the integrity of emergent language experiments. This work, then, prepares researchers in the field for these future challenges.

REPRODUCIBILITY STATEMENT

The code used in association with this paper is located at https://example.com/repo-name (in a ZIP file for review). Reproduction instructions are located in the README.md file. Experiments were performed on a 20-thread Intel Core i9-9900X server on which they take less than 24 hours to run. No random seeds are recorded or provided as the training process is stable and similar results should be observed for any random seed.

REFERENCES

David J. Aldous. Exchangeability and related topics. In P. L. Hennequin (ed.), École d'Été de Probabilités de Saint-Flour XIII — 1983, pp. 1–198, Berlin, Heidelberg, 1985. Springer Berlin Heidelberg. ISBN 978-3-540-39316-0.

David Blei. The Chinese restaurant process, 2007. URL https://www.cs.princeton.edu/courses/archive/fall07/cos597C/scribe/20070921.pdf.

Nicolo' Brandizzi, Davide Grossi, and Luca Iocchi. RLupus: Cooperation through emergent communication in the werewolf social deduction game, 2021.

Kalesha Bullard, Douwe Kiela, Franziska Meier, Joelle Pineau, and Jakob Foerster. Quasi-equivalence discovery for zero-shot emergent communication, 2021.

Rahma Chaabouni, Eugene Kharitonov, Emmanuel Dupoux, and Marco Baroni. Communicating artificial neural networks develop efficient color-naming systems. Proceedings of the National Academy of Sciences, 118(12), Mar 2021. ISSN 0027-8424, 1091-6490. doi: 10.1073/pnas.2016569118. URL https://www.pnas.org/content/118/12/e2016569118.

Tom Eccles, Yoram Bachrach, Guy Lever, Angeliki Lazaridou, and Thore Graepel. Biases for emergent communication in multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://papers.nips.cc/paper/2019/hash/fe5e7cb609bdbe6d62449d61849c38b0-Abstract.html.

Laura Harding Graesser, Kyunghyun Cho, and Douwe Kiela. Emergent linguistic phenomena in multi-agent communication games. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3700–3710. Association for Computational Linguistics, Nov 2019.
doi: 10.18653/v1/D19-1384. URL https://aclanthology.org/D19-1384.

Serhii Havrylov and Ivan Titov. Emergence of language with multi-agent games: Learning to communicate with sequences of symbols. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30, pp. 2149–2159. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/70222949cc0db89ab32c9969754d4758-Paper.pdf.

Rishi Hazra, Sonu Dixit, and Sayambhu Sen. Zero-shot generalization using intrinsically motivated compositional emergent protocols. Visually Grounded Interaction and Language Workshop, NAACL, 2021.

Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. In Proceedings of the 2017 International Conference on Learning Representations (ICLR), 2017. URL https://openreview.net/forum?id=rkE3y85ee.

Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Caglar Gulcehre, Pedro Ortega, DJ Strouse, Joel Z. Leibo, and Nando De Freitas. Social influence as intrinsic motivation for multi-agent deep reinforcement learning. In International Conference on Machine Learning, pp. 3040–3049. PMLR, May 2019. URL http://proceedings.mlr.press/v97/jaques19a.html.

Ivana Kajić, Eser Aygün, and Doina Precup. Learning to cooperate: Emergent communication in multi-agent navigation. In 42nd Annual Meeting of the Cognitive Science Society, pp. 1993–1999, Toronto, ON, 2020. Cognitive Science Society.

Eugene Kharitonov, Rahma Chaabouni, Diane Bouchacourt, and Marco Baroni. Entropy minimization in emergent languages. In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 5220–5230. PMLR, 13–18 Jul 2020. URL http://proceedings.mlr.press/v119/kharitonov20a.html.

Satwik Kottur, José Moura, Stefan Lee, and Dhruv Batra. Natural language does not emerge "naturally" in multi-agent dialog. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017. doi: 10.18653/v1/d17-1321. URL http://dx.doi.org/10.18653/v1/D17-1321.

Angeliki Lazaridou, Karl Moritz Hermann, Karl Tuyls, and Stephen Clark. Emergence of linguistic communication from referential games with symbolic and pixel input. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HJGv1Z-AW.

Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. In Proceedings of the 2017 International Conference on Learning Representations (ICLR), 2017. URL https://openreview.net/forum?id=S1jE5L5gl.

Igor Mordatch and Pieter Abbeel. Emergence of grounded compositional language in multi-agent populations, 2018.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019.
URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.

Antonin Raffin, Ashley Hill, Maximilian Ernestus, Adam Gleave, Anssi Kanervisto, and Noah Dormann. Stable Baselines3. https://github.com/DLR-RM/stable-baselines3, 2019.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.

Helmut Vogel. A better way to construct the sunflower head. Mathematical Biosciences, 44(3):179–189, 1979. ISSN 0025-5564. doi: https://doi.org/10.1016/0025-5564(79)90080-4. URL https://www.sciencedirect.com/science/article/pii/0025556479900804.

Eric Wiewiora. Reward Shaping, pp. 863–865. Springer US, Boston, MA, 2010. ISBN 978-0-387-30164-8. doi: 10.1007/978-0-387-30164-8_731. URL https://doi.org/10.1007/978-0-387-30164-8_731.

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.

A ENVIRONMENTS OF RELATED WORK

Havrylov & Titov (2017) Standard signalling game where the observations are natural images from Microsoft COCO. There is no trivially optimal language since the information being communicated is simply which image is being shown. There are natural image classes (e.g., cat vs. dog), but they are not necessarily the features of the images which the agents need to communicate. The standard signalling game is single-step.

Kottur et al. (2017) "Task and talk" dialog game where one agent must ask questions of the other agent to learn the attributes of an object. Specifically, the questioner has a set of attributes it must learn (the objective) about the object that only the answerer can see. The desired language has the questioner using a unique message to specify a property and the answerer responding with a unique message to specify the value for that property, such that messages are always in one-to-one correspondence with what they are communicating. Such a language is trivially optimal. Multi-round dialog environments are inherently multi-step.

Mordatch & Abbeel (2018) Collaborative navigation game where multiple agents tell each other where to move in a 2D environment. The environment consists of agents and landmarks distinguished by color. Each agent is given a "goal"; for example, the blue agent might receive "red agent go to green landmark" (represented as a categorical feature vector) which the blue agent then has to communicate to the red agent. The agents have a vocabulary large enough to assign a unique "word" to each concept being expressed in one-to-one correspondence, yielding a trivially optimal language. Each episode consists of multiple timesteps, each of which can have its own utterance.

Lazaridou et al. (2018) An image-based signalling game using rendered images from MuJoCo. The information being communicated is the shape and color of an object, where the sets of shapes and colors are both small. Since the number of words available is at least as big as the cardinalities of the sets and the sender is able to use multiple words per utterance, it is possible to construct a trivially optimal language. The standard signalling game is single-step.

Kharitonov et al. (2020) A binary vector-based signalling game with shared context.
The goal of the signalling game is to communicate the bits of a binary vector which are not shared by the sender and receiver. The messages consist of a single symbol where the number of unique symbols is greater than the number of combinations of bits to be communicated; thus there is a trivially optimal language where one unique symbol is assigned to each possible combination of unshared bits. The standard signalling game is single-step.

Chaabouni et al. (2021) A signalling game with 330 different colors represented as a real vector in CIELAB color space. The game is set up so that colors which are nearby never appear in the same distractor set, which encourages solutions which can cover some arbitrary region of colors. Due to this "fuzzy" concept of solution (i.e., not using 330 distinct words for each color), we consider this environment not to have a trivially optimal solution. The standard signalling game is single-step.

B HYPERPARAMETERS

B.1 DEFAULT CONFIGURATION

Environment
• Type: centerward
• World radius: 9
• Goal radius: 1
• Max steps per episode: 3 × world radius

Agent Architecture
• Bottleneck size: 64
• Architecture (sender is layers 1–3, receiver is layer 5; bottleneck size is N):
  1. Linear w/ bias: 2 in, 32 out
  2. Tanh activation
  3. Linear w/ bias: 32 in, N out
  4. Gumbel-Softmax: N in, N out
  5. Linear w/ bias: N in, 2 out (action) and 1 out (value)
• Bottleneck (Gumbel-Softmax) temperature: 1.5
• Weight initialization: U(−√(1/n), √(1/n)), where n is the input size of the layer (PyTorch 1.10 default)

Optimization
• Reinforcement learning algorithm: proximal policy optimization
  – Default hyperparameters used unless otherwise noted: https://stable-baselines3.readthedocs.io/en/v1.0/modules/ppo.html
• Training steps: 1 × 10^5
• Evaluation episodes: 3 × 10^3
• Learning rate: 3 × 10^−3
• Experience buffer size: 1024
• Batch size: 256
• Temporal discount factor (γ): 0.9

B.2 EXPERIMENT-SPECIFIC CONFIGURATIONS

Note that a logarithmic sweep from x to y (inclusive) with n steps is defined by Equation 3.

\left\{ x \cdot (y/x)^{i/(n-1)} \;\middle|\; i \in \{0, 1, \ldots, n-1\} \right\}    (3)

Biased Semantics
• Type: edgeward
• World radius: 8
• Goal radius: 8
• Experience buffer size: 256
• Batch size: 256

Changing the Distribution of Entropy
• Number of independent runs per configuration: 2000
• Experience buffer size: 256
• Batch size: 256

Masking Environmental Parameters
• World radius: logarithmic sweep from 2 to 40 with 300 steps

Experience Buffer Size
• Experience buffer size: logarithmic sweep from 32 to 4096 with 200 steps (floor function applied as experience buffer size is integer-valued)
• Training steps: 2 × 10^5

Expectation Chinese Restaurant Process
This is not an emergent language experiment and just consists of running the mathematical model.
• α: 5
• β: logarithmic sweep from 1 to 1000 with 10 000 steps (floor function applied, though plotted without floor function)
• Number of iterations per run: 1000
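For reference, the logarithmic sweep of Equation 3 can be generated in a couple of lines. This is our own sketch; numpy.geomspace(x, y, n) yields the same values for n ≥ 2.

```python
import numpy as np

def log_sweep(x: float, y: float, n: int) -> np.ndarray:
    # Equation 3: { x * (y / x) ** (i / (n - 1)) : i in {0, ..., n - 1} }
    i = np.arange(n)
    return x * (y / x) ** (i / (n - 1))

# Example: the world-radius sweep from 2 to 40 with 300 steps
# (a floor function is applied wherever integer values are required).
radii = log_sweep(2, 40, 300)
```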