# TOWARDS LEARNING TO SPEAK AND HEAR THROUGH MULTI-AGENT COMMUNICATION OVER A CONTINUOUS ACOUSTIC CHANNEL
**Anonymous authors**
Paper under double-blind review
ABSTRACT
While multi-agent reinforcement learning has been used as an effective means to
study emergent communication between agents, existing work has focused almost
exclusively on communication with discrete symbols. Human communication
often takes place (and emerged) over a continuous acoustic channel; human infants
acquire language in large part through continuous signalling with their caregivers.
We therefore ask: Are we able to observe emergent language between agents with
a continuous communication channel trained through reinforcement learning? And
if so, what is the impact of channel characteristics on the emerging language? We
propose an environment and training methodology to serve as a means to carry out
an initial exploration of these questions. We use a simple messaging environment
where a “speaker” agent needs to convey a concept to a “listener”. The Speaker
is equipped with a vocoder that maps symbols to a continuous waveform; this
is passed over a lossy continuous channel, and the Listener needs to map the
continuous signal to the concept. Using deep Q-learning, we show that basic
compositionality emerges in the learned language representations. We find that
noise is essential in the communication channel when conveying unseen concept
combinations. And we show that we can ground the emergent communication by
introducing a caregiver predisposed to “hearing” or “speaking” English. Finally,
we describe how our platform serves as a starting point for future work that uses a
combination of deep reinforcement learning and multi-agent systems to study our
questions of continuous signalling in language learning and emergence.
1 INTRODUCTION
Reinforcement learning (RL) is increasingly being used as a tool to study language emergence
(Mordatch & Abbeel, 2017; Lazaridou et al., 2018; Eccles et al., 2019; Chaabouni et al., 2020;
Lazaridou & Baroni, 2020). By allowing multiple agents to communicate with each other while
solving a common task, a communication protocol needs to be established. The resulting protocol can
be studied to see if it adheres to properties of human language, such as compositionality (Kirby, 2001;
Geffen Lan et al., 2020; Andreas, 2020; Resnick et al., 2020). The tasks and environments themselves
can also be studied, to see what types of constraints are necessary for human-like language to
emerge (Steels, 1997). Referential games are often used for this purpose (Kajic et al., 2020; Havrylov
& Titov, 2017; Yuan et al., 2020). While these studies open up the possibility of using computational
models to investigate how language emerged and how language is acquired through interaction with
an environment and other agents, most RL studies consider communication using discrete symbols.
Spoken language instead operates and presumably emerged over a continuous acoustic channel.
Human infants acquire their native language by being exposed to speech audio in their environments (Kuhl, 2005); by interacting and communicating with their caregivers using continuous signals,
infants can observe the consequences of their communicative attempts (e.g. through parental responses) that may guide the process of language acquisition (see e.g. Howard & Messum (2014)
for discussion). Continuous signalling is challenging since an agent needs to be able to deal with
different acoustic environments and noise introduced by the lossy channel. These intricacies are lost
when agents communicate directly with discrete symbols. This raises the question: Are we able
to observe emergent language between agents with a continuous communication channel, trained
through RL? This paper is our first step towards answering this larger research question.

Figure 1: Environment setup showing a Speaker communicating to a Listener over a lossy acoustic
communication channel f.
Earlier work has considered models of human language acquisition using continuous signalling
between a simulated infant and caregiver (Oudeyer, 2005; Steels & Belpaeme, 2005). But these
models often rely on heuristic approaches and older neural modelling techniques, making them
difficult to extend; e.g. it isn’t easy to directly incorporate other environmental rewards or interactions
between multiple agents. More recent RL approaches would make this possible, but as noted, have
mainly focused on discrete communication. Our work here tries to bridge the disconnect between
recent contributions in multi-agent reinforcement learning (MARL) and earlier literature in language
acquisition and modelling (Moulin-Frier & Oudeyer, 2021).
One recent exception that does use continuous signalling within a modern RL framework is the work
of Gao et al. (2020). In their setup, a Student agent is exposed to a large collection of unlabelled
speech audio, from which it builds up a dictionary of possible spoken words. The Student can then
select segmented words from its dictionary to play back to a Teacher, which uses a trained automatic
speech recognition (ASR) model to classify the words and execute a movement command in a discrete
environment. The Student is then rewarded for moving towards a goal position. We also propose a
Student-Teacher setup, but importantly, our agents can generate their own unique audio waveforms
rather than just segmenting and repeating words exactly from past observations. Moreover, in our
setup an agent is not required to use a pretrained ASR system for “listening”.
Concretely, we propose the environment illustrated in Figure 1, which is an extension of a referential
signalling game used in several previous studies (Lewis, 1969; Lazaridou et al., 2018; Chaabouni
et al., 2020; Rita et al., 2020). Here s represents one out of a set of possible concepts the Speaker must
communicate to a Listener agent. Taking this concept as input, the Speaker produces a waveform as
output, which passes over a (potentially lossy) acoustic channel. The Listener “hears” the utterance
from the Speaker. Taking the waveform as input, the Listener produces output ŝ. This output is the
Listener’s interpretation of the concept that the Speaker agent tried to communicate. The agents must
develop a common communication protocol such that s = ˆs. This process encapsulates one of the
core goals of human language: conveying meaning through communication (Dor, 2014). To train the
agents, we use deep Q-learning (Mnih et al., 2013).
Our bigger goal is to explore the question of whether and how language emerges when using RL
to train agents that communicate via continuous acoustic signals. Our proposed environment and
training methodology serves as a means to perform such an exploration, and the goal of the paper is to
showcase the capabilities of the platform. Concretely, we illustrate that a valid protocol is established
between agents communicating freely, that basic compositionality emerges when agents need to
communicate a combination of two concepts, that channel noise affects generalisation, and that one
agent will act accordingly when the other is made to “hear” or “speak” English. At the end of the
paper, we also discuss questions that can be tackled in the future using the groundwork laid here.
Figure 2: Example interaction of each component and the environment in a single round: the Speaker
agent (a Q-network, or a dictionary lookup in the grounded setting) produces a phone sequence such
as (d, a, ʊ, n), the synthesiser (eSpeak or Festival) converts it to an audio waveform, the channel
applies noise, time/pitch warping and time masking, and the Listener agent (a Q-network, or DTW in
the grounded setting) receives the resulting mel-spectrogram.
2 ENVIRONMENT
We base our environment on the referential signalling game from Chaabouni et al. (2020) and Rita
et al. (2020)—which itself is based on Lewis (1969) and Lazaridou et al. (2018)—where a sender
must convey a message to a receiver. In our case, communication takes place between a Speaker and
a Listener over a continuous acoustic channel, instead of sending symbols directly (Figure 1). In each
game round, a Speaker agent is tasked with conveying a single concept. The Speaker needs to explain
this concept using a speech waveform which is transmitted over a noisy communication channel,
and then received by a Listener agent. The Listener agent then classifies its understanding of the
Speaker’s concept. If the Speaker’s target concept matches the classified concept from the Listener,
the agents are rewarded. The Speaker is then presented with another concept and the cycle repeats.
Formally, in each episode, the environment generates s, a one-hot encoded vector representing one
of N target concepts from a set S. The Speaker receives s and generates a sequence of phones
c = (c1, c2, . . . , cM), each ct ∈ P representing a phone from a predefined phonetic alphabet P. The
phone sequence is then converted into a waveform wraw, an audio signal sampled at 16 kHz. For
this we use a trained text-to-speech model (Black & Lenzo, 2000; Duddington, 2006). A channel
noise function is then applied to the generated waveform, and the result win = f (wraw) is presented
as input to the Listener. The Listener converts the input waveform to a mel-scale spectrogram:
a sequence of vectors over time representing the frequency content of an audio signal scaled to
mimic human frequency perception (Davis & Mermelstein, 1980). Taking the mel-spectrogram
sequence X = (x1, x2, . . ., xT ) of T acoustic frames as input, the Listener agent outputs a vector ˆs
representing its predicted concept. The agents are both rewarded if the predicted concept is equal to the
target concept, i.e. s = ŝ.
To make the environment a bit more concrete, we present a brief example in Figure 2. For illustrative
purposes, consider a set of concepts S = {up, down, left, right}. The state representation for down
would be s = [0, 1, 0, 0]^⊤. A possible phone sequence generated by the Speaker would be c =
(d, a, ʊ, n, EOS).[1] This would be synthesised, passed through the channel, and then be interpreted by
the Listener agent. If the Listener's prediction is ŝ = [0, 1, 0, 0]^⊤, then it selected the correct concept
of down. The environment would then reward the agents accordingly:
$$r = \begin{cases} 1 & \text{if } s = \hat{s} \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$
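To make the round structure concrete in code, below is a minimal Python sketch of a single game round. The `speaker`, `synthesiser`, `channel` and `listener` objects and their methods are hypothetical stand-ins for the components described above, not the actual implementation.

```python
import numpy as np

def play_round(speaker, synthesiser, channel, listener, num_concepts=4):
    """Minimal sketch of one communication round (hypothetical interfaces)."""
    # Environment samples a one-hot target concept s.
    target = np.random.randint(num_concepts)
    s = np.eye(num_concepts)[target]

    # Speaker maps the concept to a phone sequence, e.g. ("d", "a", "ʊ", "n").
    phones = speaker.generate(s)

    # Phones are synthesised to a 16 kHz waveform and passed over the lossy channel.
    w_raw = synthesiser(phones)
    w_in = channel(w_raw)

    # Listener maps the (noisy) waveform to a predicted concept index.
    prediction = listener.predict(w_in)

    # Reward from Equation (1): 1 if the prediction matches the target.
    reward = 1.0 if prediction == target else 0.0
    return s, phones, prediction, reward
```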
In our environment we have modelled the task of the Speaker agent as a discrete problem. Despite
this, the combination of both agents and their environment is a continuous communication task; in
our communication channel, we apply continuous signal transforms which can be motivated by real
acoustic environments. The Listener also needs to take in and process a noisy acoustic signal. It is
true that the Speaker outputs a discrete sequence; what we have done here is to equip the Speaker with
1 SOS and EOS respectively represent the start-of-sequence and end-of-sequence tokens.
articulatory capabilities so that these do not need to be learned by the model. There are studies that
consider how articulation can be learned (Howard & Messum, 2014; Asada, 2016; Rasilo & Räsänen,
2017), but none of these do so in an RL environment; they instead use a form of imitation learning. In
Section 5 we discuss how future work could consider learning the articulation process itself within
our environment, and the challenges involved in doing so.
3 LEARNING TO SPEAK AND HEAR USING RL
To train our agents, we use deep Q-learning (Mnih et al., 2013). For the Speaker agent, this means
predicting the action-value of phone sequences. The Listener agent predicts the value of selecting
each classification target ˆs ∈S.
3.1 SPEAKER MODEL
The Speaker agent is tasked with generating a sequence of phones c describing a concept or idea.
The model architecture is shown in Figure 3. The target concept is represented by the one-hot input
state s. We use gated recurrent unit (GRU) based sequence generation as the core of the Speaker
agent, which generates a sequence of Q-values: a distribution over the phone set P at each output step,
from step 1 to M. The input state s is embedded as the initial hidden state h0 of the GRU. The output
phone of each GRU step is embedded as input to the next GRU step.[2] We also make use of start-of-sequence
(SOS) and end-of-sequence (EOS) tokens, which are appended to the phone-set.
These allow the Speaker to generate arbitrary-length phone sequences up to a maximum length of M.
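A minimal PyTorch sketch of such a Speaker Q-network is given below. Layer sizes follow Section 4.1; the exact way s initialises the two-layer GRU and the handling of the EOS token are assumptions.

```python
import torch
import torch.nn as nn

class SpeakerQNetwork(nn.Module):
    """Sketch of the Speaker agent's Q-network: GRU-based generation of per-phone
    action-values (sizes follow Section 4.1; other details are assumptions)."""

    def __init__(self, num_concepts, num_phones, hidden_size=256, max_len=5):
        super().__init__()
        self.state_embed = nn.Linear(num_concepts, hidden_size)   # s -> initial hidden state
        self.phone_embed = nn.Embedding(num_phones, hidden_size)  # previous phone -> GRU input
        self.gru = nn.GRU(hidden_size, hidden_size, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_size, num_phones)              # Q-values over the phone-set
        self.max_len = max_len

    def forward(self, s, sos_index=0):
        batch = s.size(0)
        # The concept embedding initialises the hidden state of both GRU layers (an assumption).
        h = self.state_embed(s).unsqueeze(0).repeat(2, 1, 1)
        prev = torch.full((batch,), sos_index, dtype=torch.long, device=s.device)
        q_values, phones = [], []
        for _ in range(self.max_len):
            x = self.phone_embed(prev).unsqueeze(1)
            out, h = self.gru(x, h)
            q = self.out(out.squeeze(1))
            prev = q.argmax(dim=-1)   # greedy phone; no gradient flows through the argmax
            q_values.append(q)
            phones.append(prev)
        # In practice generation would stop once the EOS token is produced (omitted here).
        return torch.stack(q_values, dim=1), torch.stack(phones, dim=1)
```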
3.2 LISTENER MODEL
The Listener agent's task may be viewed as classification, with the full model architecture illustrated
in Figure 4. The model is roughly based on that of Amodei et al. (2016). Given an input mel-spectrogram
_X, the Listener generates a set of state-action values. These action-values represent the expected_
reward for each classification vector ˆs.
We first apply a set of convolutional layers over the input mel-spectrogram, keeping the size of the
time-axis consistent throughout. We then flatten the convolution outputs over the filters and feature
axis, resulting in a single vector per time step. We process each vector through a bidirectional GRU,
feeding the final hidden state through a linear layer to arrive at our final action-value predictions. An
argmax of these action-values gives us a greedy prediction for ˆs.
2No gradients flow through the argmax: this connection indicates to the network which phone was selected
at the previous GRU step.
Figure 3: The Speaker agent generates an arbitrary-length sequence of action-values given an input
concept represented by s. At each output step, an embedding of the previously selected phone is
passed through the GRU and a linear layer to produce logits (action-values) over the phone-set, and
an argmax selects the phone fed to the next step.
Figure 4: The Listener agent Q-network (CNN, flatten, GRU, linear, argmax) generates action-values
given an input mel-spectrogram X.
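A minimal PyTorch sketch of the Listener Q-network in Figure 4 is shown below, assuming the layer sizes of Section 4.1; the exact flattening of the convolutional features and the pooling of the recurrent states are assumptions.

```python
import torch
import torch.nn as nn

class ListenerQNetwork(nn.Module):
    """Sketch of the Listener agent: a CNN over the mel-spectrogram, a bidirectional
    GRU, and a linear layer producing one Q-value per concept (details assumed)."""

    def __init__(self, num_concepts, n_mels=128, channels=64, hidden_size=256):
        super().__init__()
        conv, in_ch = [], 1
        for _ in range(4):  # four 3x3 convolutional layers, zero-padded to keep dimensions
            conv += [nn.Conv2d(in_ch, channels, kernel_size=3, padding=1), nn.ReLU()]
            in_ch = channels
        self.cnn = nn.Sequential(*conv)
        self.gru = nn.GRU(channels * n_mels, hidden_size, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_size, num_concepts)

    def forward(self, X):                            # X: (batch, time, n_mels)
        z = self.cnn(X.unsqueeze(1))                 # (batch, channels, time, n_mels)
        z = z.permute(0, 2, 1, 3).flatten(2)         # (batch, time, channels * n_mels)
        _, h = self.gru(z)                           # h: (layers * directions, batch, hidden)
        final = torch.cat([h[-2], h[-1]], dim=-1)    # last layer, both directions
        return self.out(final)                       # Q-values over concepts
```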
3.3 DEEP Q-LEARNING
The Q-network of the Speaker agent generates a sequence of phones c in every communication round
until the EOS token is reached. The sequence of phones may be seen as predicting an action sequence
per environment step, while standard RL generally only predicts a single action per step. To train
such a Q-network, we therefore modify the general gradient-descent update equation from Sutton &
Barto (1998). Since we only have a single communication round, we update the model parameters θ
as follows:
$$\theta \leftarrow \theta + \alpha \left[ r - \frac{1}{M} \sum_{m=1}^{M} \hat{q}_m(S, A; \theta) \right] \nabla \hat{q}(S, A; \theta), \qquad (2)$$
where the reward r is given in (1), S is the environment state, A is the action, α is the learning rate,
and ˆq = (ˆq1, ˆq2, . . ., ˆqM ). For the Speaker, ˆqm is the value of performing the action cm at output m.
For the Speaker, the environment state would be the desired concept S = s and the actions would be
_A = c = (c1, c2, ..., cM_ ), the output of the network in Figure 3.
The Listener is also trained using (2), but here this corresponds to the more standard case where the
agent produces a single action, i.e. M = 1. Concretely, for the Listener this action is A = ˆs, the
output of the network in Figure 4. The Listener’s environment is the mel-spectrogram S = X. The
Speaker and Listener each have their own independent learner and replay buffer (Mnih et al., 2013).
A replay buffer is a storage buffer that keeps track of the observed environment states, actions and
rewards. The replay buffer is then sampled when updating the agent’s Q-networks through gradient
descent with (2). We may see this two-agent environment as multi-agent deep Q-learning (Tampuu
et al., 2017), and therefore have to take careful consideration of the non-stationary replay buffer: we
limit the maximum replay buffer size to twice the batch size. This ensures that the agent learns only
from its most recent experiences.
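The following is one way such a training step could be realised with gradient descent. The replay-buffer entry format and the `q_values_for` helper are hypothetical, and the squared-error loss is an assumed surrogate whose gradient recovers an update of the same form as (2).

```python
import random
from collections import deque
import torch

def q_loss(q_taken: torch.Tensor, reward: float) -> torch.Tensor:
    """q_taken holds the Q-values q̂_1..q̂_M of the actions actually taken in a round.
    Minimising this loss regresses their mean towards the round reward r, matching
    the form of the update in Equation (2)."""
    return 0.5 * (reward - q_taken.mean()) ** 2

# Non-stationarity: cap the replay buffer at twice the batch size so each agent
# only learns from its most recent experiences.
batch_size = 128
replay_buffer = deque(maxlen=2 * batch_size)

def train_step(agent, optimiser):
    if len(replay_buffer) < batch_size:
        return
    batch = random.sample(list(replay_buffer), batch_size)
    losses = []
    for state, actions, reward in batch:
        q_taken = agent.q_values_for(state, actions)  # hypothetical helper
        losses.append(q_loss(q_taken, reward))
    loss = torch.stack(losses).mean()
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
```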
4 EXPERIMENTS
4.1 IMPLEMENTATION
The lossy communication channel has Gaussian white noise with a signal-to-noise ratio (SNR) of
30 dB, unless otherwise stated. During training, the channel applies Gaussian-sampled time stretch
and pitch shift using Librosa (McFee et al., 2021), with variance 0.4 and 0.3, respectively. The
channel also masks up to 15% of the mel-spectrogram time-axis during training. We train our agents
with an ϵ-greedy exploration, where ϵ is decayed exponentially from 0.1 to 0 over the training steps.
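A sketch of what the channel function f could look like is given below. The means of the warp distributions, the clamping of the stretch rate, and the contiguous mel-domain time mask are assumptions rather than the exact implementation.

```python
import numpy as np
import librosa

def channel(w_raw, sr=16000, snr_db=30.0, stretch_var=0.4, pitch_var=0.3):
    """Sketch of the lossy channel f: Gaussian-sampled time stretch and pitch shift
    followed by additive white Gaussian noise at a given SNR."""
    rate = max(0.5, np.random.normal(1.0, np.sqrt(stretch_var)))   # clamp is an assumption
    n_steps = np.random.normal(0.0, np.sqrt(pitch_var))
    w = librosa.effects.time_stretch(w_raw, rate=rate)
    w = librosa.effects.pitch_shift(w, sr=sr, n_steps=n_steps)

    signal_power = np.mean(w ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return w + np.random.normal(0.0, np.sqrt(noise_power), size=w.shape)

def time_mask(mel, max_fraction=0.15):
    """Mask up to 15% of the mel-spectrogram time-axis during training.
    mel: (n_mels, frames), as returned by librosa; a single contiguous block
    is masked here, which is an assumption about the masking scheme."""
    n_frames = mel.shape[1]
    width = np.random.randint(0, int(max_fraction * n_frames) + 1)
    start = np.random.randint(0, max(1, n_frames - width + 1))
    masked = mel.copy()
    masked[:, start:start + width] = 0.0
    return masked
```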
Figure 5: Results for unconstrained communication, comparing acoustic and discrete communication
over training episodes. (a) Mean evaluation reward of the Listener agent interpreting a single concept
over 20 runs. (b) Mean evaluation reward of the Listener agent interpreting two concepts in each
round, for both training and unseen code combinations. The agents are evaluated every 100 training
episodes over 20 runs. Shading indicates the bootstrapped 95% confidence interval.

We use eSpeak (Duddington, 2006) as our speech synthesiser. eSpeak is a parametric text-to-speech
software package that uses formant synthesis to generate audio from phone sequences. Festival (Black
& Lenzo, 2000) was also tested, although eSpeak is favoured for its simpler phone scheme and
multi-language support. We use eSpeak's full English phone-set of 164 unique phones and phonetic
modifiers. The standard maximum number of phones the Speaker is allowed to generate in each
communication round is M = 5, including the EOS token. All GRUs have 2 layers with a hidden
layer size of 256. All Speaker agent embeddings (Section 3.1) are also 256-dimensional. The Listener
(Section 3.2) uses 4 convolutional layers, each with 64 filters and a kernel width and height of 3.
The input to the first convolutional layer is a sequence of 128-dimensional mel-spectrogram vectors
extracted every 32 ms. We apply zero padding of size 1 at each layer to retain the input dimensions.
Additional experimental details are given in Appendix A.
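For reference, the Listener front-end described above can be approximated with librosa as follows; this is a sketch, and any settings beyond the 128 mel bands and 32 ms hop are assumptions.

```python
import librosa

def mel_spectrogram(w, sr=16000, n_mels=128, hop_ms=32):
    """Sketch of the Listener front-end: 128-dimensional mel vectors extracted
    every 32 ms (a hop of 512 samples at 16 kHz)."""
    hop = int(sr * hop_ms / 1000)  # 512 samples
    mel = librosa.feature.melspectrogram(y=w, sr=sr, n_mels=n_mels, hop_length=hop)
    return mel.T  # (frames, n_mels), one vector per time step for the Listener
```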
4.2 UNCONSTRAINED COMMUNICATION OF SINGLE CONCEPTS
**Motivation** We first verify that the environment works as expected and that a valid communication
protocol emerges when no constraints are applied to the agents.
**Setup** The Speaker and Listener agents are trained simultaneously here, as described in Section 3.3.
The agents are tasked with communicating 16 unique concepts. We compare our acoustic communication to a discrete baseline based on RIAL (Foerster et al., 2017). In this setup, the CNN of the
Listener agent is replaced by an embedding network, allowing the discrete symbols of the Speaker to
be directly interpreted by the Listener. The Speaker's discrete alphabet size in this setup is equal to the
phonetic alphabet size of 164. Improvements have been made to RIAL (e.g. Eccles et al., 2019;
Chaabouni et al., 2020), although RIAL itself proves sufficient as a comparison to our proposed
acoustic communication setting.
**Findings** Figure 5a shows the mean evaluation reward of the Listener agent over training steps.
(This is also an indication of the Speaker’s performance, since without successful coordination
between the two agents, no reward is given to either.) The agents achieve a final mean reward of 0.917
after 5000 training episodes, successfully developing a valid communication protocol for roughly
15 out of the total of 16 concepts.[3] This is comparable to the performance of the purely discrete
communication which reaches a mean evaluation reward of 0.959. What does the communication
sound like? Since there are no constraints placed on communication, the agents can easily coordinate
to use arbitrary phone sequences to communicate distinct concepts. The interested reader can listen
to generated samples.[4] We next consider a more involved setting in order to study composition and
generalisation.
4.3 UNCONSTRAINED COMMUNICATION GENERALISING TO MULTIPLE CONCEPTS
**Motivation** To study composition and generalisation, we perform an experiment based on Kirby
(2001), who used an iterated learning model (ILM) to convey two separate meanings (a and b) in a
single string. This ILM was able to generate structured compositional mappings from meaning to
strings. For example, in one result a0 → q and b0 → da. The combination of the two
meanings was therefore (a0, b0) → qda. Similarly, (a1, b0) → bguda with a1 → bgu. Motivated
by this, we test the generalisation capabilities of continuous signalling in our environment.

3 The maximum evaluation reward in all experiments is 1.0.
4 Audio samples for all experiments are available at https://iclr2022-1504.github.io/samples/.

Table 1: Mean evaluation reward of the two-concept experiments with varying channel noise. The
results without a lossy communication channel are also shown. The 95% confidence interval for all
values falls within 0.01.

| Average SNR (dB) | Training Codes | Unseen Codes |
|------------------|----------------|--------------|
| no channel       | **0.966**      | 0.386        |
| 40               | 0.878          | 0.389        |
| 30               | 0.931          | 0.402        |
| 20               | 0.895          | **0.413**    |
| 10               | 0.731          | 0.361        |
| 0                | 0.654          | 0.366        |

Table 2: Output sequences from a trained Speaker. Each entry corresponds to a combination of two
concepts, s1 and s2, respectively. The bold combinations were unseen during training.

| s2 \ s1 | 0      | 1      | 2      | 3      |
|---------|--------|--------|--------|--------|
| 0       | nnLGGx | DLLççç | nsspxx | nnssss |
| 1       | jLLeee | @@ööee | wwwxxx | sss@@@ |
| 2       | jjLL:: | DpLLj: | Dwppçx | enGsss |
| 3       | jjL::: | GDDp:: | Gjxxxp | Gss::: |
**Setup** Rather than conveying a single concept in each episode, we now ask the agents to convey two
concepts. The target concept s and predicted concept ˆs now become s1, s2 and ˆs1, ˆs2, respectively.
We also make sure that some concept combinations are never seen during training. We then see if the
agents are still able to convey these concept combinations at test time, indicating how well the agents
generalise to novel inputs. The reward model is adjusted accordingly, with the agents receiving 0.5
for each concept correctly identified by the Listener. Here s1 can take on 4 distinct concepts while s2
can take on another 4 concepts. Out of the 16 total combinations, we make sure that 4 are never seen
during training. The unseen combinations are chosen such that there remains an even distribution of
individual unseen concepts. We also increase the maximum phone length to M = 7. To encourage
compositionality (Kottur et al., 2017), we limit the size of the phonetic alphabet to 16.
As an example, you can think of s1 as indicating an item from the set of concepts S1 =
{up, down, left, right} while s2 indicates an item from S2 = {fast, medium, regular, slow}, and
we want the agents to communicate concept combinations such as up+fast. Some combinations, such
as right+slow, are never given as the target concept combination during training (but e.g. right+fast
and left+slow would be), and we see if the agents can generalise to these unseen combinations at test
time and how they do it.
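A small sketch of the adjusted reward and a possible held-out split is given below; the reward values follow the description above, but the particular held-out combinations are hypothetical (chosen only to illustrate the even-distribution constraint) and may differ from those used in the experiments.

```python
def two_concept_reward(s1, s2, s1_hat, s2_hat):
    """0.5 per concept correctly identified by the Listener, so r is 0, 0.5 or 1."""
    return 0.5 * float(s1 == s1_hat) + 0.5 * float(s2 == s2_hat)

# Hypothetical held-out combinations: one per value of each concept, so individual
# unseen concepts are evenly distributed across the 4 held-out pairs.
held_out = {(0, 3), (1, 2), (2, 1), (3, 0)}
training_pairs = [(a, b) for a in range(4) for b in range(4) if (a, b) not in held_out]
```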
**Findings: Quantitative** The results are shown in Figure 5b. We see the mean evaluation reward of
the acoustic Listener agent reaches 0.931 on the training concept combinations. This is slightly lower
than the discrete case which reaches a mean of 0.965. The acoustic communication agents achieve a
mean evaluation reward of 0.402 on the unseen combinations, indicating that they are usually able
to successfully communicate at least one of the two concepts. The discrete agents do marginally
better on unseen combinations, with slightly higher variance. The chance-level baseline for this task
would receive a mean reward of 0.25. The performance on the unseen combinations is thus better
than random.
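For reference, with 4 possible values per concept and 0.5 reward per correctly identified concept, a Listener guessing each concept independently and uniformly at random obtains

$$\mathbb{E}[r] = 0.5 \cdot \tfrac{1}{4} + 0.5 \cdot \tfrac{1}{4} = 0.25.$$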
Table 1 shows the mean evaluation reward of the same two-concept experiments, but now with
varying degrees of channel noise expressed in SNR.[5] The goal here is to evaluate how the channel
influences the generalisation of the agents to unseen input combinations. In the no-channel case, the
Speaker output is directly input to the Listener agent, without any time stretching or pitch shifting.
The no-channel case does best on the training codes, as expected, but does not generalise as well to
unseen input combinations. We find that increasing channel noise decreases performance on the
training codes and increases generalisation performance on unseen codes, up to a point where both
decrease. This is an early indication that the channel specifically influences generalisation.
Lazaridou et al. (2018) reported the structural similarity of the emergent communication in terms of
Spearman ρ correlation between the input and message space, known as topographic similarity or
_topism (Brighton & Kirby, 2006). Chaabouni et al. (2020) extended this metric by introducing two_
new metrics. Positional disentanglement (posdis) measures the positional contribution of symbols to
meaning. Bag-of-symbols disentanglement (bosdis) measures distinct symbol meaning, but does so in
a permutation-invariant way. We record all 3 metrics for the case where the average SNR
is 30 dB, taking measurements between the input space and the sequence of discrete phones. The
results are shown in Table 3. For topism, we average 0.265, which is comparable to the results of
Lazaridou et al. (2018). For posdis and bosdis, we average 0.103 and 0.116, respectively. This falls
within the lower end of the results of Chaabouni et al. (2020). All three metrics yield similar results
for both acoustic and discrete communication.

5 The SNR is calculated based on the average energy in a signal generated by eSpeak.

Table 3: Compositionality metrics of the unconstrained multi-concept Speaker agents. The mean
evaluation metrics and 95% confidence bounds are shown.

|                | topism         | posdis         | bosdis         |
|----------------|----------------|----------------|----------------|
| acoustic comm. | 0.265 (±0.041) | 0.103 (±0.015) | 0.116 (±0.018) |
| discrete comm. | 0.244 (±0.032) | 0.087 (±0.017) | 0.118 (±0.017) |
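As a rough illustration of the first of these metrics, topographic similarity can be computed as the Spearman correlation between pairwise distances in the meaning space and in the message space. The sketch below assumes Hamming distance over concept tuples and edit distance over phone sequences, which may differ from the exact distances used in the cited works; posdis and bosdis are not shown.

```python
import numpy as np
from itertools import combinations
from scipy.stats import spearmanr

def edit_distance(a, b):
    """Levenshtein distance between two phone sequences."""
    d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    d[:, 0] = np.arange(len(a) + 1)
    d[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1,
                          d[i - 1, j - 1] + (a[i - 1] != b[j - 1]))
    return d[len(a), len(b)]

def topographic_similarity(meanings, messages):
    """Spearman correlation between pairwise meaning distances (Hamming) and
    pairwise message distances (edit distance)."""
    pairs = list(combinations(range(len(meanings)), 2))
    meaning_d = [sum(x != y for x, y in zip(meanings[i], meanings[j])) for i, j in pairs]
    message_d = [edit_distance(messages[i], messages[j]) for i, j in pairs]
    return spearmanr(meaning_d, message_d).correlation
```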
**Findings: Qualitative** Table 2 shows examples of the sequences produced by a trained Speaker
agent for each concept combination, with the phone units written using the international phonetic
alphabet. Ideally, we would want each row and each column to affect the phonetic sequence in
a unique way. This would indicate that the agents have learnt a compositional language protocol,
combining phonetic segments together to create a sequence in which the Listener can distinguish
the individual component concepts. We see this type of behaviour to some degree in our Speaker
samples, such as the [x] phones for s1 = 2 or the repeated [s] sound when s1 = 3. This indicates at
least some level of compositionality in the learned communication. More qualitatively, the realisation
from eSpeak of [L] sounds very similar to [n] for s2 = 0. (We refer the reader to the sample page
linked in Section 4.2.)
The bold phone sequences in Table 2 were unseen during training. The agents correctly classified one
combination (s1, s2 = 3, 0) out of the 4 unseen combinations. For the other 3 unseen combinations,
the agents correctly predicted at least one of s1 and s2. These sequences also show some degree of
compositionality, such as the [jL] sequence where s1 = 0. We should note that the agents are never
specifically encouraged to develop any sort of compositionality in this experiment. They could, for
example, use a unique single phone for each of the 16 concept combinations.
4.4 GROUNDING EMERGENT COMMUNICATION
**Motivation** Although the Speaker uses an English phone-set, up to this point there has been no
reason for the agents to actually learn to use English words to convey the concepts. In this subsection,
either the Speaker or Listener is predisposed to speak or hear English words, and the other agent needs
to act accordingly. One scientific motivation for this setting is that it can be used to study how an infant
learns language from a caregiver (Kuhl, 2005). To study this computationally, several studies have
looked at cognitive models of early vocal development through infant-caregiver interaction; Asada
(2016) provides a comprehensive review. Most of these studies, however, considered the problem of
learning to vocalise (Howard & Messum, 2014; Moulin-Frier et al., 2015; Rasilo & Räsänen, 2017),
which limits the types of interactions and environmental rewards that can be incorporated into the
model. We instead simplify the vocalisation process by using an existing synthesiser, but this allows
us to use modern MARL techniques to study continuous signalling.
We first give the Listener agent the infant role, and the Speaker will be the caregiver. This mimics the
setting where an infant learns to identify words spoken by a caregiver. Later, we reverse the roles,
having the Speaker agent assume the infant role. This represents an infant learning to speak their first
words while their caregiver responds to recognised words. Since here one agent (the caregiver) has an
explicit notion of the meaning of a word, this process can be described as “grounding” from the other
agent’s perspective (the infant).
**Setup** We first consider a setting where we have a single set of 4 concepts S =
_{up, down, left, right}. While this is similar to the examples given in preceding sections, here_
the agents will be required to use actual English words to convey these concepts. In the setting where
the Listener acts as an infant, the caregiver Speaker agent speaks English words; the Speaker consists
simply of a dictionary lookup for the pronunciation of the word, which is then generated by eSpeak.
In the setting where the Speaker takes on the role of the infant, the Listener is now a static entity that
can recognise English words; we make use of a dynamic time warping (DTW) system that matches
the incoming waveform to a set of reference words and selects the closest one as its output label.
50 reference words are generated by eSpeak. The action-space of the Speaker agent is very large
(|P|^M), and would be near impossible to explore entirely. Therefore, we provide guidance: with
probability ϵ (Section 4.1), the correct ground-truth phonetic sequence for s is chosen. We also consider
the two-concept combination setting of Section 4.3 where either the Speaker or Listener now hears or
speaks actual English words; DTW is too slow for the static Listener in this case, so here we first
train the Listener in the infant role and then fix it as the caregiver when training the Speaker.
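A sketch of such a DTW-based static Listener is shown below; the MFCC features, the distance metric, and the length normalisation are assumptions rather than the exact configuration used.

```python
import numpy as np
import librosa

def dtw_listener(w_in, references, sr=16000):
    """Sketch of the static DTW Listener: compare the incoming waveform against
    eSpeak-generated reference words and return the label of the closest match."""
    query = librosa.feature.mfcc(y=w_in, sr=sr, n_mfcc=13)
    best_label, best_cost = None, np.inf
    for label, ref_wav in references.items():
        ref = librosa.feature.mfcc(y=ref_wav, sr=sr, n_mfcc=13)
        D, _ = librosa.sequence.dtw(X=query, Y=ref, metric="euclidean")
        cost = D[-1, -1] / (query.shape[1] + ref.shape[1])  # length-normalised alignment cost
        if cost < best_cost:
            best_label, best_cost = label, cost
    return best_label
```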
**Findings: Grounding the Listener** Here the Listener is trained while the Speaker is a fixed
caregiver. The Listener agent reached a mean evaluation reward of 1.0, indicating the agent learnt
to correctly classify all 4 target words 100% of the time (full graphs given in Appendix B.1). The
Listener agent was also tested with a vocabulary size of 50, consisting of the 50 most common English
words including the original up, down, left, and right. With this setup, the Listener still reached a
mean evaluation reward of 0.934.
**Findings: Grounding the Speaker** We now ground the Speaker agent by swapping its role to that
of the infant. The Speaker agent reaches a mean evaluation reward of 0.983 over 20 runs, indicating it
is generally able to articulate all of the 4 target words. Table 4 gives samples of one of the experiment
runs and compares them to the eSpeak ground truth phonetic descriptions. Although appearing very
different to the ground truth, the audio generated by eSpeak from the predicted phone sequences is
qualitatively similar. The reader can confirm this for themselves by listening to the generated samples
(again we refer the reader to the sample page linked in Section 4.2).
**Findings: Grounding generalisation in communicating two concepts** Analogous to Section 4.3, we now have infant and caregiver agents in a setting with two concepts, specifically
_S1 = {up, down, left, right} and S2 = {fast, medium, regular, slow}. Here, these sets don’t simply_
serve as an example as in Section 4.3, but the Speaker would now actually say “up” when it is the
caregiver and the Listener will now actually be pretrained to recognise the word “up” when it is
the caregiver. 4 combinations are unseen during training: up-slow, down-regular, left-medium, and
_right-fast. Again we consider both role combinations of infant and caregiver. Figure 6a shows the_
results when training a two-word Listener agent. The agent reaches a mean evaluation reward of 1.0
for the training codes and 0.952 for the unseen code combinations. This indicates that the Listener
agent learns near-optimal generalisation. As mentioned above, for the case where the Speaker is the
infant, the DTW-based fixed Listener was found to be impractical. Thus, we use a static Listener agent
pre-trained to classify 50 concepts for each of s1 and s2. This amounts to 2500 unique input combinations.
The results of the two-word Speaker agent are shown in Figure 6b. The Speaker agent does not
perform as well as the Listener agent, reaching a mean evaluation reward of 0.719 for the training
word combinations and 0.425 for the unseen.
We have replicated the experiments in this subsection using the Afrikaans version of eSpeak, reaching
similar performance to English. This shows our results are not language specific.
Table 4: The target word, its ground truth phonetic description, and the trained Speaker agent's
predicted phonetic description.

| Target word | Ground truth | Predicted phones |
|-------------|--------------|------------------|
| up          | 2p           | 2vb              |
| down        | daUn         | daU              |
| left        | lEft         | lE               |
| right       | ôaIt         | ôaISjn           |

Figure 6: Evaluation results of the grounded two-word Speaker and Listener agents during training.
(a) Mean evaluation reward of the two-word Listener agent over 20 training runs. (b) Mean evaluation
reward of the two-word Speaker agent over 20 training runs. The mean evaluation reward of the
unseen word combinations is also shown.

5 DISCUSSION

The work we have presented here has gone further than Gao et al. (2020), which only allowed
segmented template words to be generated: our Speaker agent has the ability to generate unique
audio waveforms. On the other hand, our Speaker can only generate sequences based on a fixed
phone-set (which is then passed over a continuous acoustic channel). This is in contrast to earlier
work (Howard & Messum, 2014; Asada, 2016; Rasilo & Räsänen, 2017) that considered a Speaker
that learns a full articulation model in an effort to come as close as possible in imitating an utterance
from a caregiver; this allows a Speaker to generate arbitrary learnt units. We have thus gone further
than Gao et al. (2020) but not as far as these older studies. Nevertheless, our approach has the benefit
that it is formulated in a modern MARL setting: it can therefore be easily extended. Future work can
consider whether articulation can be learnt as part of our model, possibly using imitation
learning to guide the agent's exploration of the very large action-space of articulatory movements.
In the experiments carried out in this study, we only considered a single communication round. We
also referred to our setup as multi-agent; this is accurate, but it could be extended even further so that a
single agent has both a speaking and a listening module, and these composite agents then communicate
with one another. Future work could therefore consider multi-round communication games between 2
or more agents. Such games would extend our work to the full MARL problem, where agents would
need to “speak” to and “hear” each other to solve a common task.
Finally, in terms of future work, we saw in Section 4.3 the importance of the channel for generalisation.
Adding white noise is, however, not a good enough simulation of real-life acoustic channels.
But our approach could be extended with real background noise and more accurate models of
environmental dynamics. This could form the basis for a computational investigation of the effect of
real acoustic channels in language learning and emergence.
We reflect on our initial research question: Are we able to observe emergent language between agents
with a continuous acoustic communication channel trained through RL? This work has laid only a first
foundation for answering this larger question. We have showcased the capability of an environment
and training approach that will serve as a means of further exploration in answering the question.
ETHICS STATEMENT
We currently do not identify any obvious reasons to have ethical concerns about this work. Ethical
considerations will be taken into account in the future if some of the models are compared to
data from human studies or trials.
REPRODUCIBILITY STATEMENT
We provide all model and experimental details in Section 4.1, and additional details in Appendix A.
The information given should provide enough details to reproduce these results. Finally, our code
will be released on GitHub with an open-source license upon acceptance.
REFERENCES
D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro,
Q. Cheng, G. Chen, J. Chen, J. Chen, Z. Chen, M. Chrzanowski, A. Coates, G. Diamos, K. Ding,
N. Du, E. Elsen, J. Engel, W. Fang, L. Fan, C. Fougner, L. Gao, C. Gong, A. Hannun, T. Han,
L. V. Johannes, B. Jiang, C. Ju, B. Jun, P. LeGresley, L. Lin, J. Liu, Y. Liu, W. Li, X. Li, D. Ma,
S. Narang, A. Ng, S. Ozair, Y. Peng, R. Prenger, S. Qian, Z. Quan, J. Raiman, V. Rao, S. Satheesh,
D. Seetapun, S. Sengupta, K. Srinet, A. Sriram, H. Tang, L. Tang, C. Wang, J. Wang, K. Wang,
Y. Wang, Z. Wang, Z. Wang, S. Wu, L. Wei, B. Xiao, W. Xie, Y. Xie, D. Yogatama, B. Yuan,
J. Zhan, and Z. Zhu. Deep Speech 2: End-to-end speech recognition in English and Mandarin. In
_Proc. ICML, pp. 173–182, 2016._
J. Andreas. Good-enough compositional data augmentation. In Proc. ACL, 2020.
M. Asada. Modeling early vocal development through infant–caregiver interaction: A review. IEEE
_Transactions on Cognitive and Developmental Systems, pp. 128–138, 2016._
A. Black and K. Lenzo. Building voices in the festival speech synthesis system. unpublished document,
[2000. URL http://www.cstr.ed.ac.uk/projects/festival/docs/festvox/.](http://www.cstr.ed.ac.uk/projects/festival/docs/festvox/)
H. Brighton and S. Kirby. Understanding linguistic evolution by visualizing the emergence of
topographic mappings. Artificial Life, 2006.
R. Chaabouni, E. Kharitonov, D. Bouchacourt, E. Dupoux, and M. Baroni. Compositionality and
generalization in emergent languages. In Proc. ACL, pp. 4427–4442, 2020.
S. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word
recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal
_Processing, pp. 357–366, 1980._
D. Dor. The instruction of imagination: language and its evolution as a communication technology,
pp. 105–125. Princeton University Press, 2014.
[J. Duddington. eSpeak text to speech, 2006. URL http://espeak.sourceforge.net/.](http://espeak.sourceforge.net/)
T. Eccles, Y. Bachrach, G. Lever, A. Lazaridou, and T. Graepel. Biases for emergent communication
in multi-agent reinforcement learning. In Proc. NeurIPS, 2019.
J. Foerster, N. Nardelli, G. Farquhar, T. Afouras, P. H. S. Torr, P. Kohli, and S. Whiteson. Stabilising
Experience Replay for Deep Multi-Agent Reinforcement Learning. Proc. ICML, 2017.
S. Gao, W. Hou, T. Tanaka, and T. Shinozaki. Spoken language acquisition based on reinforcement
learning and word unit segmentation. In Proc. ICASSP, pp. 6149–6153, 2020.
N. Geffen Lan, E. Chemla, and S. Steinert-Threlkeld. On the Spontaneous Emergence of Discrete
and Compositional Signals. In Proc. ACL, pp. 4794–4800, 2020.
S. Havrylov and I. Titov. Emergence of language with multi-agent games: Learning to communicate
with sequences of symbols. In Proc. NeurIPS, 2017.
I. S. Howard and P. Messum. Learning to pronounce first words in three languages: An investigation
of caregiver and infant behavior using a computational model of an infant. PLOS ONE, pp. 1–21,
2014.
I. Kajić, E. Aygün, and D. Precup. Learning to cooperate: Emergent communication in multi-agent
navigation. arXiv e-prints, 2020.
S. Kirby. Spontaneous evolution of linguistic structure: an iterated learning model of the emergence
of regularity and irregularity. IEEE Transactions on Evolutionary Computation, pp. 102–110,
2001.
S. Kottur, J. Moura, S. Lee, and D. Batra. Natural language does not emerge ‘naturally’ in multi-agent
dialog. In Proc. EMNLP, 2017.
P. K. Kuhl. Early language acquisition: cracking the speech code. Nature Reviews Neuroscience, pp.
831–843, 2005.
A. Lazaridou and M. Baroni. Emergent multi-agent communication in the deep learning era. CoRR,
2020.
A. Lazaridou, K. Hermann, K. Tuyls, and S. Clark. Emergence of linguistic communication from
referential games with symbolic and pixel input. Proc. ICLR, 2018.
D. Lewis. Convention. Blackwell, 1969.
B. McFee, A. Metsai, M. McVicar, S. Balke, C. Thomé, C. Raffel, F. Zalkow, A. Malek, Dana, K. Lee,
O. Nieto, D. Ellis, J. Mason, E. Battenberg, S. Seyfarth, R. Yamamoto, viktorandreevichmorozov,
K. Choi, J. Moore, R. Bittner, S. Hidaka, Z. Wei, nullmightybofo, D. Hereñú, F.-R. Stöter,
P. Friesch, A. Weiss, M. Vollrath, T. Kim, and Thassilo. librosa/librosa: 0.8.1rc2, 2021. URL
[https://doi.org/10.5281/zenodo.4792298.](https://doi.org/10.5281/zenodo.4792298)
V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller.
Playing Atari with deep reinforcement learning. In NIPS Deep Learning Workshop, 2013.
I. Mordatch and P. Abbeel. Emergence of grounded compositional language in multi-agent populations. In Proc. AAAI, 2017.
C. Moulin-Frier and P.-Y. Oudeyer. Multi-Agent Reinforcement Learning as a Computational Tool
for Language Evolution Research: Historical Context and Future Challenges. In Proc. AAAI, 2021.
C. Moulin-Frier, J. Diard, J.-L. Schwartz, and P. Bessière. COSMO (“communicating about objects
using sensory–motor operations”): A Bayesian modeling framework for studying speech
communication and the emergence of phonological systems. Journal of Phonetics, pp. 5–41, 2015.
P.-Y. Oudeyer. The self-organization of speech sounds. Journal of Theoretical Biology, pp. 435–449,
2005.
H. Rasilo and O. Räsänen. An online model for vowel imitation learning. _Speech Communication,_
pp. 1–23, 2017.
C. Resnick, A. Gupta, J. Foerster, A. Dai, and K. Cho. Capacity, bandwidth, and compositionality in
emergent language learning. In Proc. AAMAS, 2020.
M. Rita, R. Chaabouni, and E. Dupoux. “LazImpa”: Lazy and impatient neural agents learn to
communicate efficiently. In Proc. ACL, pp. 335–343, 2020.
L. Steels. The synthetic modeling of language origins. Evolution of Communication, pp. 1–34, 1997.
L. Steels and T. Belpaeme. Coordinating perceptually grounded categories through language: A case
study for colour. Behavioral and Brain Sciences, pp. 469–489, 2005.
R. S. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, and R. Vicente. Multiagent
cooperation and competition with deep reinforcement learning. PLOS ONE, pp. 1–15, 2017.
L. Yuan, Z. Fu, J. Shen, L. Xu, J. Shen, and S.-C. Zhu. Emergence of pragmatics from referential
game between theory of mind agents. In Proc. NeurIPS, 2020.
APPENDICES
A EXPERIMENT DETAILS
A.1 GENERAL EXPERIMENTAL SETUP
Here we provide the general setup for all experimentation.
| Parameter           | Value                 |
|---------------------|-----------------------|
| Optimiser           | Adam                  |
| Batch size          | 128                   |
| Replay size         | 256                   |
| Training episodes   | 5000                  |
| Evaluation interval | 100                   |
| Evaluation episodes | 25                    |
| Runs (varying seed) | 20                    |
| GPU                 | Nvidia RTX 2080 Super |
| Time (per run)      | ≈ 30 minutes          |
A.2 EXPERIMENT PARAMETERS
Here we provide specific details on a per-experiment basis. The phone sequence length M in the
grounded experiments is chosen such that the full ground truth phonetic pronunciation could be made
by the Speaker agent.
| Experiment                   | Agent    | Learning rate | Phone length (M) | GRU hidden size |
|------------------------------|----------|---------------|------------------|-----------------|
| Unconstrained Single-Concept | Speaker  | 1 × 10^−4     | 5                | 256             |
|                              | Listener | 5 × 10^−5     | -                | 256             |
| Unconstrained Multi-Concept  | Speaker  | 1 × 10^−5     | 7                | 512             |
|                              | Listener | 5 × 10^−5     | -                | 512             |
| Grounded Single-Concept      | Speaker  | 1 × 10^−4     | 6                | 256             |
|                              | Listener | 5 × 10^−5     | -                | 256             |
| Grounded Multi-Concept       | Speaker  | 1 × 10^−5     | 16               | 512             |
|                              | Listener | 5 × 10^−5     | -                | 512             |
B RESULTS
B.1 GROUNDING EMERGENT COMMUNICATION
Figure 7: Evaluation results of the grounded Speaker and Listener agent during training. (a) Mean
evaluation reward of the Listener agent over 20 training runs. (b) Mean evaluation reward of the
Speaker agent over 20 training runs. The x-axis shows the training episode. Shading indicates the
bootstrapped 95% confidence interval.