# TOWARDS LEARNING TO SPEAK AND HEAR THROUGH MULTI-AGENT COMMUNICATION OVER A CONTINUOUS ACOUSTIC CHANNEL

**Anonymous authors** Paper under double-blind review

ABSTRACT

While multi-agent reinforcement learning has been used as an effective means to study emergent communication between agents, existing work has focused almost exclusively on communication with discrete symbols. Human communication often takes place (and emerged) over a continuous acoustic channel; human infants acquire language in large part through continuous signalling with their caregivers. We therefore ask: Are we able to observe emergent language between agents with a continuous communication channel trained through reinforcement learning? And if so, what is the impact of channel characteristics on the emerging language? We propose an environment and training methodology to serve as a means to carry out an initial exploration of these questions. We use a simple messaging environment in which a "Speaker" agent needs to convey a concept to a "Listener". The Speaker is equipped with a vocoder that maps symbols to a continuous waveform; this is passed over a lossy continuous channel, and the Listener needs to map the continuous signal back to the concept. Using deep Q-learning, we show that basic compositionality emerges in the learned language representations. We find that noise in the communication channel is essential for conveying unseen concept combinations. And we show that we can ground the emergent communication by introducing a caregiver predisposed to "hearing" or "speaking" English. Finally, we describe how our platform serves as a starting point for future work that uses a combination of deep reinforcement learning and multi-agent systems to study our questions of continuous signalling in language learning and emergence.

1 INTRODUCTION

Reinforcement learning (RL) is increasingly being used as a tool to study language emergence (Mordatch & Abbeel, 2017; Lazaridou et al., 2018; Eccles et al., 2019; Chaabouni et al., 2020; Lazaridou & Baroni, 2020). By allowing multiple agents to communicate with each other while solving a common task, a communication protocol needs to be established. The resulting protocol can be studied to see whether it adheres to properties of human language, such as compositionality (Kirby, 2001; Geffen Lan et al., 2020; Andreas, 2020; Resnick et al., 2020). The tasks and environments themselves can also be studied, to see what types of constraints are necessary for human-like language to emerge (Steels, 1997). Referential games are often used for this purpose (Kajic et al., 2020; Havrylov & Titov, 2017; Yuan et al., 2020). While these studies open up the possibility of using computational models to investigate how language emerged and how language is acquired through interaction with an environment and other agents, most RL studies consider communication using discrete symbols. Spoken language instead operates, and presumably emerged, over a continuous acoustic channel. Human infants acquire their native language by being exposed to speech audio in their environments (Kuhl, 2005); by interacting and communicating with their caregivers using continuous signals, infants can observe the consequences of their communicative attempts (e.g. through parental responses), which may guide the process of language acquisition (see e.g. Howard & Messum (2014) for discussion).
Continuous signalling is challenging since an agent needs to be able to deal with different acoustic environments and noise introduced by the lossy channel. These intricacies are lost when agents communicate directly with discrete symbols. This raises the question: Are we able to observe emergent language between agents with a continuous communication channel, trained through RL? This paper is our first step towards answering this larger research question.

Figure 1: Environment setup showing a Speaker communicating to a Listener over a lossy acoustic communication channel f.

Earlier work has considered models of human language acquisition using continuous signalling between a simulated infant and caregiver (Oudeyer, 2005; Steels & Belpaeme, 2005). But these models often rely on heuristic approaches and older neural modelling techniques, making them difficult to extend; e.g. it is not easy to directly incorporate other environmental rewards or interactions between multiple agents. More recent RL approaches would make this possible but, as noted, have mainly focused on discrete communication. Our work here tries to bridge the disconnect between recent contributions in multi-agent reinforcement learning (MARL) and earlier literature in language acquisition and modelling (Moulin-Frier & Oudeyer, 2021).

One recent exception that does use continuous signalling within a modern RL framework is the work of Gao et al. (2020). In their setup, a Student agent is exposed to a large collection of unlabelled speech audio, from which it builds up a dictionary of possible spoken words. The Student can then select segmented words from its dictionary to play back to a Teacher, which uses a trained automatic speech recognition (ASR) model to classify the words and execute a movement command in a discrete environment. The Student is then rewarded for moving towards a goal position. We also propose a Student-Teacher setup, but importantly, our agents can generate their own unique audio waveforms rather than only segmenting and repeating words exactly from past observations. Moreover, in our setup an agent is not required to use a pretrained ASR system for "listening".

Concretely, we propose the environment illustrated in Figure 1, which is an extension of a referential signalling game used in several previous studies (Lewis, 1969; Lazaridou et al., 2018; Chaabouni et al., 2020; Rita et al., 2020). Here s represents one out of a set of possible concepts the Speaker must communicate to a Listener agent. Taking this concept as input, the Speaker produces a waveform as output, which passes over a (potentially lossy) acoustic channel. The Listener "hears" the utterance from the Speaker: taking the waveform as input, the Listener produces an output ŝ. This output is the Listener's interpretation of the concept that the Speaker agent tried to communicate. The agents must develop a common communication protocol such that s = ŝ. This process encapsulates one of the core goals of human language: conveying meaning through communication (Dor, 2014). To train the agents, we use deep Q-learning (Mnih et al., 2013).

Our bigger goal is to explore the question of whether and how language emerges when using RL to train agents that communicate via continuous acoustic signals. Our proposed environment and training methodology serve as a means to perform such an exploration, and the goal of this paper is to showcase the capabilities of the platform.
Concretely, we illustrate that a valid protocol is established between agents communicating freely, that basic compositionality emerges when agents need to communicate a combination of two concepts, that channel noise affects generalisation, and that one agent will act accordingly when the other is made to "hear" or "speak" English. At the end of the paper, we also discuss questions that can be tackled in the future using the groundwork laid here.

Figure 2: Example interaction of each component and the environment in a single round (Speaker Q-network → eSpeak/Festival synthesiser → channel with noise, time/pitch warping and time masking → Listener Q-network or DTW).

2 ENVIRONMENT

We base our environment on the referential signalling game of Chaabouni et al. (2020) and Rita et al. (2020), itself based on Lewis (1969) and Lazaridou et al. (2018), where a sender must convey a message to a receiver. In our case, communication takes place between a Speaker and a Listener over a continuous acoustic channel, instead of sending symbols directly (Figure 1). In each game round, a Speaker agent is tasked with conveying a single concept. The Speaker needs to express this concept as a speech waveform, which is transmitted over a noisy communication channel and then received by a Listener agent. The Listener agent then classifies its understanding of the Speaker's concept. If the Speaker's target concept matches the concept classified by the Listener, the agents are rewarded. The Speaker is then presented with another concept and the cycle repeats.

Formally, in each episode, the environment generates s, a one-hot encoded vector representing one of N target concepts from a set S. The Speaker receives s and generates a sequence of phones c = (c1, c2, ..., cM), each cm representing a phone from a predefined phonetic alphabet P. The phone sequence is then converted into a waveform wraw, an audio signal sampled at 16 kHz. For this we use a trained text-to-speech model (Black & Lenzo, 2000; Duddington, 2006). A channel noise function f is then applied to the generated waveform, and the result win = f(wraw) is presented as input to the Listener. The Listener converts the input waveform to a mel-scale spectrogram: a sequence of vectors over time representing the frequency content of an audio signal, scaled to mimic human frequency perception (Davis & Mermelstein, 1980). Taking the mel-spectrogram sequence X = (x1, x2, ..., xT) of T acoustic frames as input, the Listener agent outputs a vector ŝ representing its predicted concept. Both agents are rewarded if the predicted concept is equal to the target concept, s = ŝ.

To make the environment more concrete, we present a brief example in Figure 2. For illustrative purposes, consider a set of concepts S = {up, down, left, right}. The state representation for down would be s = [0, 1, 0, 0]⊤. A possible phone sequence generated by the Speaker would be c = (d, a, ʊ, n, EOS).[1] This would be synthesised, passed through the channel, and then interpreted by the Listener agent. If the Listener's prediction is ŝ = [0, 1, 0, 0]⊤, then it selected the correct concept of down. The environment would then reward the agents accordingly:

$$r = \begin{cases} 1 & \text{if } s = \hat{s} \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$
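Since the implementation is not reproduced here, the following minimal Python sketch only makes the round structure of the game and the reward in (1) explicit; the function and argument names (`play_round`, `speaker`, `synthesiser`, `channel`, `listener`) are placeholders of our own, not the paper's code.

```python
import numpy as np

# Hypothetical names throughout; any concrete speaker/listener/channel must be supplied.
CONCEPTS = ["up", "down", "left", "right"]      # the set S, here N = 4
PHONES = ["d", "a", "ʊ", "n", "EOS"]            # stand-in for the full alphabet P

def one_hot(index: int, size: int) -> np.ndarray:
    s = np.zeros(size)
    s[index] = 1.0
    return s

def play_round(speaker, synthesiser, channel, listener, concept_idx: int) -> float:
    """One episode of the referential game, rewarded as in Eq. (1)."""
    s = one_hot(concept_idx, len(CONCEPTS))      # target concept s
    phones = speaker(s)                          # c = (c1, ..., cM), ending with EOS
    w_raw = synthesiser(phones)                  # 16 kHz waveform from the phone sequence
    w_in = channel(w_raw)                        # lossy channel: w_in = f(w_raw)
    s_hat = listener(w_in)                       # Listener's predicted concept index
    return 1.0 if s_hat == concept_idx else 0.0  # reward r in Eq. (1)
```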
In our environment we have modelled the task of the Speaker agent as a discrete problem. Despite this, the combination of both agents and their environment is a continuous communication task: in our communication channel we apply continuous signal transforms that can be motivated by real acoustic environments, and the Listener needs to take in and process a noisy acoustic signal. It is true that the Speaker outputs a discrete sequence; what we have done here is to equip the Speaker with articulatory capabilities so that these do not need to be learned by the model. There are studies that consider how articulation can be learned (Howard & Messum, 2014; Asada, 2016; Rasilo & Räsänen, 2017), but none of these do so in an RL environment, relying instead on a form of imitation learning. In Section 5 we discuss how future work could consider learning the articulation process itself within our environment, and the challenges involved in doing so.

[1] SOS and EOS respectively represent the start-of-sequence and end-of-sequence tokens.

3 LEARNING TO SPEAK AND HEAR USING RL

To train our agents, we use deep Q-learning (Mnih et al., 2013). For the Speaker agent, this means predicting the action-value of phone sequences. The Listener agent predicts the value of selecting each classification target ŝ ∈ S.

3.1 SPEAKER MODEL

The Speaker agent is tasked with generating a sequence of phones c describing a concept or idea. The model architecture is shown in Figure 3. The target concept is represented by the one-hot input state s. We use gated recurrent unit (GRU) based sequence generation as the core of the Speaker agent, which generates a sequence of Q-values, a distribution over the phones P per output step from 1 to M. The input state s is embedded as the initial hidden state h0 of the GRU. The output phone of each GRU step is embedded as input to the next step.[2] We also make use of start-of-sequence (SOS) and end-of-sequence (EOS) tokens appended to the phone-set. These allow the Speaker to generate arbitrary-length phone sequences up to a maximum length of M.

[2] No gradients flow through the argmax: this connection indicates to the network which phone was selected at the previous GRU step.

Figure 3: The Speaker agent generates an arbitrary-length sequence of action-values given an input concept represented by s (at each step: embedding → GRU → linear → logits → argmax).

3.2 LISTENER MODEL

The Listener agent may be viewed as performing a classification task, with the full model architecture illustrated in Figure 4. The model is roughly based on that of Amodei et al. (2016). Given an input mel-spectrogram X, the Listener generates a set of state-action values. These action-values represent the expected reward for each classification vector ŝ. We first apply a set of convolutional layers over the input mel-spectrogram, keeping the size of the time axis consistent throughout. We then flatten the convolution outputs over the filter and feature axes, resulting in a single vector per time step. We process each vector through a bidirectional GRU, feeding the final hidden state through a linear layer to arrive at our final action-value predictions. An argmax of these action-values gives us a greedy prediction for ŝ.

Figure 4: The Listener agent Q-network generates action-values given an input mel-spectrogram X (mel-spectrogram → CNN → flatten → GRU → linear → argmax).
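To make the two architectures concrete, the following is a minimal PyTorch sketch of the Speaker and Listener Q-networks as we read Sections 3.1, 3.2 and 4.1. The class names, the index-based embedding of s, the fixed-length greedy decoding loop without explicit EOS termination, and the ReLU activations are our own simplifications, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

class SpeakerQNet(nn.Module):
    """Generates per-step Q-values over phones given a concept index."""
    def __init__(self, n_concepts, n_phones, hidden=256, layers=2, max_len=5):
        super().__init__()
        self.state_embed = nn.Embedding(n_concepts, hidden)   # s -> initial hidden state h0
        self.phone_embed = nn.Embedding(n_phones, hidden)     # previous phone -> GRU input
        self.gru = nn.GRU(hidden, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, n_phones)                 # per-step Q-values over P
        self.layers, self.max_len = layers, max_len

    def forward(self, concept_idx, sos_idx):
        batch = concept_idx.shape[0]
        h = self.state_embed(concept_idx).unsqueeze(0).repeat(self.layers, 1, 1)
        prev = torch.full((batch,), sos_idx, dtype=torch.long)
        q_seq, phones = [], []
        for _ in range(self.max_len):                           # EOS handling omitted
            x = self.phone_embed(prev).unsqueeze(1)             # (batch, 1, hidden)
            out, h = self.gru(x, h)
            q = self.out(out.squeeze(1))                        # (batch, n_phones)
            prev = q.argmax(dim=-1)                             # greedy phone; no gradient flows here
            q_seq.append(q)
            phones.append(prev)
        return torch.stack(q_seq, dim=1), torch.stack(phones, dim=1)

class ListenerQNet(nn.Module):
    """Maps a mel-spectrogram to Q-values over the concept set S."""
    def __init__(self, n_concepts, n_mels=128, channels=64, hidden=256):
        super().__init__()
        convs, in_ch = [], 1
        for _ in range(4):                                      # 4 conv layers, 3x3, padding 1
            convs += [nn.Conv2d(in_ch, channels, kernel_size=3, padding=1), nn.ReLU()]
            in_ch = channels
        self.cnn = nn.Sequential(*convs)
        self.gru = nn.GRU(channels * n_mels, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_concepts)

    def forward(self, mel):                                     # mel: (batch, T, n_mels)
        x = self.cnn(mel.unsqueeze(1))                          # (batch, 64, T, n_mels)
        x = x.permute(0, 2, 1, 3).flatten(2)                    # one vector per time step
        _, h = self.gru(x)                                      # final hidden states, both directions
        return self.out(torch.cat([h[0], h[1]], dim=-1))        # Q-values over concepts
```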
3.3 DEEP Q-LEARNING

The Q-network of the Speaker agent generates a sequence of phones c in every communication round until the EOS token is reached. The sequence of phones may be seen as predicting an action sequence per environment step, whereas standard RL generally predicts only a single action per step. To train such a Q-network, we therefore modify the general gradient-descent update equation from Sutton & Barto (1998). Since we only have a single communication round, we update the model parameters θ as follows:

$$\theta \leftarrow \theta + \alpha \left[ r - \frac{1}{M} \sum_{m=1}^{M} \hat{q}_m(S, A; \theta) \right] \nabla \hat{q}(S, A; \theta), \qquad (2)$$

where the reward r is given in (1), S is the environment state, A is the action, α is the learning rate, and q̂ = (q̂1, q̂2, ..., q̂M). For the Speaker, q̂m is the value of performing the action cm at output m; the environment state is the desired concept S = s and the actions are A = c = (c1, c2, ..., cM), the output of the network in Figure 3. The Listener is also trained using (2), but here this corresponds to the more standard case where the agent produces a single action, i.e. M = 1. Concretely, for the Listener this action is A = ŝ, the output of the network in Figure 4, and the Listener's environment state is the mel-spectrogram S = X.

The Speaker and Listener each have their own independent learner and replay buffer (Mnih et al., 2013). A replay buffer is a storage buffer that keeps track of the observed environment states, actions and rewards; it is sampled when updating the agents' Q-networks through gradient descent with (2). We may see this two-agent environment as multi-agent deep Q-learning (Tampuu et al., 2017), and therefore have to take careful consideration of the non-stationary replay buffer: we limit the maximum replay buffer size to twice the batch size. This ensures that each agent learns only from its most recent experiences.
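One way to realise the update in (2) with an automatic-differentiation framework is to regress the mean of the selected per-step action-values onto the episode reward. The sketch below does this for the Speaker, reusing the `SpeakerQNet` sketched above; it is our reading of (2), up to constant factors and the exact placement of the gradient term, and not the authors' released code.

```python
import torch
import torch.nn.functional as F

def speaker_update(speaker, optimizer, concept_idx, phone_idx_seq, reward, sos_idx):
    """One gradient step approximating (2) for the Speaker.

    concept_idx:   (batch,) long tensor of target concepts, sampled from the replay buffer
    phone_idx_seq: (batch, M) long tensor of the phones actually emitted in those episodes
    reward:        (batch,) float tensor of episode rewards r
    """
    q_seq, _ = speaker(concept_idx, sos_idx)                             # (batch, M, |P|)
    q_taken = q_seq.gather(-1, phone_idx_seq.unsqueeze(-1)).squeeze(-1)  # q̂_m of the taken actions
    loss = F.mse_loss(q_taken.mean(dim=1), reward)                       # ( r - (1/M) Σ_m q̂_m )²
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The Listener update is the same with M = 1, and, as described in Section 3.3, the replay buffer from which `phone_idx_seq` and `reward` are sampled would be capped at twice the batch size.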
4 EXPERIMENTS

4.1 IMPLEMENTATION

The lossy communication channel adds Gaussian white noise at a signal-to-noise ratio (SNR) of 30 dB, unless otherwise stated. During training, the channel also applies Gaussian-sampled time stretching and pitch shifting using Librosa (McFee et al., 2021), with variances of 0.4 and 0.3, respectively, and masks up to 15% of the mel-spectrogram time axis. We train our agents with ϵ-greedy exploration, where ϵ is decayed exponentially from 0.1 to 0 over the training steps.

We use eSpeak (Duddington, 2006) as our speech synthesiser. eSpeak is a parametric text-to-speech software package that uses formant synthesis to generate audio from phone sequences. Festival (Black & Lenzo, 2000) was also tested, but eSpeak is favoured for its simpler phone scheme and multi-language support. We use eSpeak's full English phone-set of 164 unique phones and phonetic modifiers. The standard maximum number of phones the Speaker is allowed to generate in each communication round is M = 5, including the EOS token.

All GRUs have 2 layers with a hidden layer size of 256, and all Speaker agent embeddings (Section 3.1) are also 256-dimensional. The Listener (Section 3.2) uses 4 convolutional layers, each with 64 filters and a kernel width and height of 3; we apply zero padding of size 1 at each layer to retain the input dimensions. The input to the first convolutional layer is a sequence of 128-dimensional mel-spectrogram vectors extracted every 32 ms. Additional experimental details are given in Appendix A.
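As a concrete illustration of the channel and acoustic front-end described above, the following sketch applies the training-time transforms with Librosa. The exact way the warping factors are sampled, the use of log-mel (dB) features, and the interpretation of "every 32 ms" as the hop length are assumptions on our part rather than details taken from the paper.

```python
import numpy as np
import librosa

SR = 16000  # the paper uses 16 kHz audio

def lossy_channel(w_raw, snr_db=30.0, stretch_var=0.4, pitch_var=0.3, rng=np.random):
    """Training-time channel: Gaussian-sampled time/pitch warping plus white noise at a given SNR."""
    rate = max(0.1, 1.0 + rng.randn() * np.sqrt(stretch_var))          # assumed sampling scheme
    w = librosa.effects.time_stretch(w_raw, rate=rate)
    w = librosa.effects.pitch_shift(w, sr=SR, n_steps=rng.randn() * np.sqrt(pitch_var))
    signal_power = np.mean(w ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))             # white noise at snr_db
    return w + rng.randn(len(w)) * np.sqrt(noise_power)

def listener_features(w_in, max_mask_frac=0.15, rng=np.random, train=True):
    """128-dimensional mel-spectrogram frames every 32 ms, with optional time masking."""
    mel = librosa.feature.melspectrogram(y=w_in, sr=SR, n_mels=128, hop_length=int(0.032 * SR))
    mel = librosa.power_to_db(mel).T                                    # (T, 128) frames
    if train:
        width = rng.randint(0, int(max_mask_frac * mel.shape[0]) + 1)   # mask up to 15% of frames
        start = rng.randint(0, max(1, mel.shape[0] - width))
        mel[start:start + width] = 0.0
    return mel
```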
4.2 UNCONSTRAINED COMMUNICATION OF SINGLE CONCEPTS

**Motivation** We first verify that the environment works as expected and that a valid communication protocol emerges when no constraints are applied to the agents.

**Setup** The Speaker and Listener agents are trained simultaneously, as described in Section 3.3. The agents are tasked with communicating 16 unique concepts. We compare our acoustic communication to a discrete baseline based on RIAL (Foerster et al., 2017). In this setup, the CNN of the Listener agent is replaced by an embedding network, allowing the discrete symbols of the Speaker to be interpreted directly by the Listener. The Speaker's discrete alphabet size in this setup is equal to the phonetic alphabet size of 164. Improvements have been made to RIAL (e.g. Eccles et al., 2019; Chaabouni et al., 2020), although RIAL itself proves sufficient as a comparison to our proposed acoustic communication setting.

Figure 5: Results for unconstrained communication: (a) mean evaluation reward of the Listener agent interpreting a single concept; (b) mean evaluation reward of the Listener agent interpreting two concepts in each round. The agents are evaluated every 100 training episodes over 20 runs; shading indicates the bootstrapped 95% confidence interval.

**Findings** Figure 5a shows the mean evaluation reward of the Listener agent over training steps. (This is also an indication of the Speaker's performance, since without successful coordination between the two agents, no reward is given to either.) The agents achieve a final mean reward of 0.917 after 5000 training episodes, successfully developing a valid communication protocol for roughly 15 out of the total of 16 concepts.[3] This is comparable to the purely discrete communication setting, which reaches a mean evaluation reward of 0.959. What does the communication sound like? Since there are no constraints placed on communication, the agents can easily coordinate to use arbitrary phone sequences to communicate distinct concepts. The interested reader can listen to generated samples.[4] We next consider a more involved setting in order to study composition and generalisation.

[3] The maximum evaluation reward in all experiments is 1.0.
[4] Audio samples for all experiments are available at https://iclr2022-1504.github.io/samples/.

4.3 UNCONSTRAINED COMMUNICATION GENERALISING TO MULTIPLE CONCEPTS

**Motivation** To study composition and generalisation, we perform an experiment based on Kirby (2001), who used an iterated learning model (ILM) to convey two separate meanings (a and b) in a single string. This ILM was able to generate structured compositional mappings from meanings to strings. For example, in one result they found a0 → q and b0 → da; the combination of the two meanings was therefore (a0, b0) → qda. Similarly, (a1, b0) → bguda, with a1 → bgu. Motivated by this, we test the generalisation capabilities of continuous signalling in our environment.

**Setup** Rather than conveying a single concept in each episode, we now ask the agents to convey two concepts. The target concept s and predicted concept ŝ become (s1, s2) and (ŝ1, ŝ2), respectively. We also make sure that some concept combinations are never seen during training; we then test whether the agents are still able to convey these combinations at test time, indicating how well they generalise to novel inputs. The reward model is adjusted accordingly, with the agents receiving 0.5 for each concept correctly identified by the Listener. Here s1 can take on 4 distinct concepts and s2 another 4. Out of the 16 total combinations, 4 are never seen during training; the unseen combinations are chosen such that there remains an even distribution of individual unseen concepts. We also increase the maximum phone length to M = 7, and, to encourage compositionality (Kottur et al., 2017), we limit the size of the phonetic alphabet to 16. As an example, one can think of s1 as indicating an item from the set of concepts S1 = {up, down, left, right} while s2 indicates an item from S2 = {fast, medium, regular, slow}, and we want the agents to communicate concept combinations such as up+fast. Some combinations, such as right+slow, are never given as the target combination during training (but e.g. right+fast and left+slow would be), and we test whether the agents can generalise to these unseen combinations at test time and how they do it.

**Findings: Quantitative** The results are shown in Figure 5b. The mean evaluation reward of the acoustic Listener agent reaches 0.931 on the training concept combinations. This is slightly lower than the discrete case, which reaches a mean of 0.965. The acoustic communication agents achieve a mean evaluation reward of 0.402 on the unseen combinations, indicating that they are usually able to successfully communicate at least one of the two concepts. The discrete agents do marginally better on unseen combinations, with slightly higher variance. A chance-level baseline for this task would receive a mean reward of 0.25, so the performance on the unseen combinations is better than random.

Table 1 shows the mean evaluation reward of the same two-concept experiments, but now with varying degrees of channel noise expressed as an SNR.[5] The goal here is to evaluate how the channel influences the generalisation of the agents to unseen input combinations. In the no-channel case, the Speaker output is input directly to the Listener agent, without any time stretching or pitch shifting. The no-channel case does best on the training codes, as expected, but does not generalise as well to unseen input combinations. We find that increasing channel noise decreases performance on the training codes and increases generalisation performance on the unseen codes, up to a point where both decrease. This is an early indication that the channel specifically influences generalisation.

[5] The SNR is calculated based on the average energy in a signal generated by eSpeak.

Table 1: Mean evaluation reward of the two-concept experiments with varying channel noise. The result without a lossy communication channel is also shown. The 95% confidence interval for all values falls within 0.01.

| Average SNR (dB) | Training codes | Unseen codes |
|---|---|---|
| no channel | **0.966** | 0.386 |
| 40 | 0.878 | 0.389 |
| 30 | 0.931 | 0.402 |
| 20 | 0.895 | **0.413** |
| 10 | 0.731 | 0.361 |
| 0 | 0.654 | 0.366 |

Lazaridou et al. (2018) reported the structural similarity of the emergent communication in terms of the Spearman ρ correlation between the input and message space, known as topographic similarity or topism (Brighton & Kirby, 2006). Chaabouni et al. (2020) extended this metric by introducing two new metrics: positional disentanglement (posdis) measures the positional contribution of symbols to meaning, while bag-of-symbols disentanglement (bosdis) measures distinct symbol meaning in a permutation-invariant way. We record all three metrics for the case where the average SNR is 30 dB, taking measurements between the input space and the sequence of discrete phones. The results are shown in Table 3. For topism, we average 0.265, which is comparable to the results of Lazaridou et al. (2018). For posdis and bosdis, we average 0.103 and 0.116, respectively, which falls within the lower end of the results of Chaabouni et al. (2020). All three metrics yield similar results for both acoustic and discrete communication.

Table 3: Compositionality metrics of the unconstrained multi-concept Speaker agents. The mean evaluation metrics and 95% confidence bounds are shown.

| | topism | posdis | bosdis |
|---|---|---|---|
| acoustic comm. | 0.265 (±0.041) | 0.103 (±0.015) | 0.116 (±0.018) |
| discrete comm. | 0.244 (±0.032) | 0.087 (±0.017) | 0.118 (±0.017) |
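For reference, topographic similarity can be computed as the Spearman ρ between pairwise distances in the meaning space and in the message space. The sketch below uses Hamming distance over concept tuples and Levenshtein distance over phone sequences; the paper does not state which distance functions it uses, so these choices are assumptions (posdis and bosdis are omitted).

```python
from itertools import combinations
from scipy.stats import spearmanr

def edit_distance(a, b):
    """Plain Levenshtein distance between two phone sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def topographic_similarity(meanings, messages):
    """Spearman correlation between pairwise meaning distances and message distances.

    meanings: list of concept tuples, e.g. (s1, s2); messages: list of phone sequences.
    """
    meaning_d, message_d = [], []
    for i, j in combinations(range(len(meanings)), 2):
        meaning_d.append(sum(u != v for u, v in zip(meanings[i], meanings[j])))  # Hamming
        message_d.append(edit_distance(messages[i], messages[j]))
    return spearmanr(meaning_d, message_d).correlation
```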
**Findings: Qualitative** Table 2 shows examples of the sequences produced by a trained Speaker agent for each concept combination, with the phone units written using the International Phonetic Alphabet. Ideally, we would want each row and each column to affect the phonetic sequence in a unique way. This would indicate that the agents have learnt a compositional language protocol, combining phonetic segments to create a sequence from which the Listener can distinguish the individual component concepts. We see this type of behaviour to some degree in our Speaker samples, such as the [x] phones for s1 = 2 or the repeated [s] sound when s1 = 3. This indicates at least some level of compositionality in the learned communication. More qualitatively, the realisation from eSpeak of [L] sounds very similar to [n] for s2 = 0. (We refer the reader to the sample page, linked in Section 4.2.) The bold phone sequences in Table 2 were unseen during training. The agents correctly classified one of the 4 unseen combinations, (s1, s2) = (3, 0); for the other 3 unseen combinations, the agents predicted at least one of s1 or s2 correctly. These sequences also show some degree of compositionality, such as the [jL] sequence where s1 = 0. We should note that the agents are never specifically encouraged to develop any sort of compositionality in this experiment. They could, for example, use a unique single phone for each of the 16 concept combinations.

Table 2: Output sequences from a trained Speaker. Each entry corresponds to a combination of two concepts, s1 and s2, respectively. The bold combinations were unseen during training.

| s2 \ s1 | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | nnLGGx | DLLççç | nsspxx | nnssss |
| 1 | jLLeee | @@ööee | wwwxxx | sss@@@ |
| 2 | jjLL:: | DpLLj: | Dwppçx | enGsss |
| 3 | jjL::: | GDDp:: | Gjxxxp | Gss::: |

4.4 GROUNDING EMERGENT COMMUNICATION

**Motivation** Although the Speaker uses an English phone-set, up to this point there has been no reason for the agents to actually learn to use English words to convey the concepts.
In this subsection, either the Speaker or Listener is predisposed to speak or hear English words, and the other agent needs to act accordingly. One scientific motivation for this setting is that it can be used to study how an infant learns language from a caregiver (Kuhl, 2005). To study this computationally, several studies have looked at cognitive models of early vocal development through infant-caregiver interaction; Asada (2016) provides a comprehensive review. Most of these studies, however, considered the problem of learning to vocalise (Howard & Messum, 2014; Moulin-Frier et al., 2015; Rasilo & Ras¨ anen, 2017),¨ which limits the types of interactions and environmental rewards that can be incorporated into the model. We instead simplify the vocalisation process by using an existing synthesiser, but this allows us to use modern MARL techniques to study continuous signalling. We first give the Listener agent the infant role, and the Speaker will be the caregiver. This mimics the setting where an infant learns to identify words spoken by a caregiver. Later, we reverse the roles, having the Speaker agent assume the infant role. This represents an infant learning to speak their first words and their caregiver responds to recognised words. Since here one agent (the caregiver) has an explicit notion of the meaning of a word, this process can be described as “grounding” from the other agent’s perspective (the infant). **Setup** We first consider a setting where we have a single set of 4 concepts S = _{up, down, left, right}. While this is similar to the examples given in preceding sections, here_ the agents will be required to use actual English words to convey these concepts. In the setting where the Listener acts as an infant, the caregiver Speaker agent speaks English words; the Speaker consists simply of a dictionary lookup for the pronunciation of the word, which is then generated by eSpeak. ----- In the setting where the Speaker takes on the role of the infant, the Listener is now a static entity that can recognise English words; we make use of a dynamic time warping (DTW) system that matches the incoming waveform to a set of reference words and selects the closest one as its output label. 50 reference words are generated by eSpeak. The action-space of the Speaker agent is very large (|P|[M] ), and would be near impossible to explore entirely. Therefore, we provide guidance: with probability ϵ (Section 4.1), choose the correct ground truth phonetic sequence for s. We also consider the two-concept combination setting of Section 4.3 where either the Speaker or Listener now hears or speaks actual English words; DTW is too slow for the static Listener in this case, so here we first train the Listener in the infant role and then fix it as the caregiver when training the Speaker. **Findings: Grounding the Listener** Here the Listener is trained while the Speaker is a fixed caregiver. The Listener agent reached a mean evaluation reward of 1.0, indicating the agent learnt to correctly classify all 4 target words 100% of the time (full graphs given in Appendix B.1). The Listener agent was also tested with a vocabulary size of 50, consisting of the 50 most common English words including the original up, down, left, and right. With this setup, the Listener still reached a mean evaluation reward of 0.934. **Findings: Grounding the Speaker** We now ground the Speaker agent by swapping its role to that of the infant. 
The Speaker agent reaches a mean evaluation reward of 0.983 over 20 runs, indicating it is generally able to articulate all 4 target words. Table 4 gives samples from one of the experiment runs and compares them to the eSpeak ground-truth phonetic descriptions. Although the predicted sequences appear very different from the ground truth, the audio generated by eSpeak from these phone sequences sounds qualitatively similar. The reader can confirm this by listening to the generated samples (again, see the sample page linked in Section 4.2).

Table 4: The target word, ground-truth phonetic description, and the trained Speaker agent's predicted phonetic description.

| Target word | Ground truth | Predicted phones |
|---|---|---|
| up | 2p | 2vb |
| down | daUn | daU |
| left | lEft | lE |
| right | ɹaIt | ɹaISjn |

**Findings: Grounding generalisation in communicating two concepts** Analogous to Section 4.3, we now have infant and caregiver agents in a setting with two concepts, specifically S1 = {up, down, left, right} and S2 = {fast, medium, regular, slow}. Here these sets do not simply serve as an example as in Section 4.3: the Speaker now actually says "up" when it is the caregiver, and the Listener is actually pretrained to recognise the word "up" when it is the caregiver. Four combinations are unseen during training: up-slow, down-regular, left-medium, and right-fast. Again we consider both role assignments of infant and caregiver. Figure 6a shows the results when training a two-word Listener agent. The agent reaches a mean evaluation reward of 1.0 for the training codes and 0.952 for the unseen code combinations, indicating that the Listener agent learns near-optimal generalisation. As mentioned above, for the case where the Speaker is the infant, the DTW-based fixed Listener was found to be impractical. We therefore use a static Listener agent pretrained to classify 50 concepts for each of s1 and s2, giving a total of 2500 unique input combinations. The results of the two-word Speaker agent are shown in Figure 6b. The Speaker agent does not perform as well as the Listener agent, reaching a mean evaluation reward of 0.719 for the training word combinations and 0.425 for the unseen ones. We have replicated the experiments in this subsection using the Afrikaans version of eSpeak, reaching similar performance to English; this shows that our results are not language specific.

Figure 6: Evaluation results of the grounded two-word Speaker and Listener agents during training: (a) mean evaluation reward of the two-word Listener agent over 20 training runs; (b) mean evaluation reward of the two-word Speaker agent over 20 training runs. The mean evaluation reward on the unseen word combinations is also shown.

5 DISCUSSION

The work we have presented here goes further than Gao et al. (2020), which only allowed segmented template words to be generated: our Speaker agent has the ability to generate unique audio waveforms. On the other hand, our Speaker can only generate sequences based on a fixed
phone-set (which is then passed over a continuous acoustic channel). This is in contrast to earlier work (Howard & Messum, 2014; Asada, 2016; Rasilo & Ras¨ anen, 2017) that considered a Speaker¨ that learns a full articulation model in an effort to come as close as possible in imitating an utterance from a caregiver; this allows a Speaker to generate arbitrary learnt units. We have thus gone further than Gao et al. (2020) but not as far as these older studies. Nevertheless, our approach has the benefit that it is formulated in a modern MARL setting: it can therefore be easily extended. Future work can therefore consider whether articulation can be learnt as part of our model – possibly using imitation learning to guide the agent’s exploration of the very large action-space of articulatory movements. In the experiments carried out in this study, we only considered a single communication round. We also referred to our setup as multi-agent, which is accurate but could be extended even further where a single agent has both a speaking and listening module, and these composed agents then communicate with one another. Future work could therefore consider multi-round communication games between 2 or more agents. Such games would extend our work to the full MARL problem, where agents would need to “speak” to and “hear” each other to solve a common task. Finally, in terms of future work, we saw in Section 4.3 the importance of the channel for generalisation. Adding white noise is, however, not a good enough simulation of real-life channel acoustic channels. But our approach could be extended with real background noise and more accurate models of environmental dynamics. This could form the basis for a computational investigation of the effect of real acoustic channels in language learning and emergence. We reflect on our initial research question: Are we able to observe emergent language between agents with a continuous acoustic communication channel trained through RL? This work has laid only a first foundation for answering this larger question. We have showcased the capability of a environment and training approach which will serve as a means of further exploration in answering the question. ETHICS STATEMENT We currently do not identify any obvious reasons to have ethical concerns about this work. Ethical considerations will be made taken into account in the future if some of the models are compared to data from human studies or trials. REPRODUCIBILITY STATEMENT We provide all model and experimental details in Section 4.1, and additional details in Appendix A. The information given should provide enough details to reproduce these results. Finally, our code will be released on GitHub with an open-source license upon acceptance. ----- REFERENCES D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, J. Chen, J. Chen, Z. Chen, M. Chrzanowski, A. Coates, G. Diamos, K. Ding, N. Du, E. Elsen, J. Engel, W. Fang, L. Fan, C. Fougner, L. Gao, C. Gong, A. Hannun, T. Han, L. V. Johannes, B. Jiang, C. Ju, B. Jun, P. LeGresley, L. Lin, J. Liu, Y. Liu, W. Li, X. Li, D. Ma, S. Narang, A. Ng, S. Ozair, Y. Peng, R. Prenger, S. Qian, Z. Quan, J. Raiman, V. Rao, S. Satheesh, D. Seetapun, S. Sengupta, K. Srinet, A. Sriram, H. Tang, L. Tang, C. Wang, J. Wang, K. Wang, Y. Wang, Z. Wang, Z. Wang, S. Wu, L. Wei, B. Xiao, W. Xie, Y. Xie, D. Yogatama, B. Yuan, J. Zhan, and Z. Zhu. Deep Speech 2: End-to-end speech recognition in English and Mandarin. In _Proc. 
ICML, pp. 173–182, 2016._ J. Andreas. Good-enough compositional data augmentation. In Proc. ACL, 2020. M. Asada. Modeling early vocal development through infant–caregiver interaction: A review. IEEE _Transactions on Cognitive and Developmental Systems, pp. 128–138, 2016._ A. Black and K. Lenzo. Building voices in the festival speech synthesis system. unpublished document, [2000. URL http://www.cstr.ed.ac.uk/projects/festival/docs/festvox/.](http://www.cstr.ed.ac.uk/projects/festival/docs/festvox/) H. Brighton and S. Kirby. Understanding linguistic evolution by visualizing the emergence of topographic mappings. Artificial Life, 2006. R. Chaabouni, E. Kharitonov, D. Bouchacourt, E. Dupoux, and M. Baroni. Compositionality and generalization in emergent languages. In Proc. ACL, pp. 4427–4442, 2020. S. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal _Processing, pp. 357–366, 1980._ D. Dor. The instruction of imagination: language and its evolution as a communication technology, pp. 105–125. Princeton University Press, 2014. [J. Duddington. eSpeak text to speech, 2006. URL http://espeak.sourceforge.net/.](http://espeak.sourceforge.net/) T. Eccles, Y. Bachrach, G. Lever, A. Lazaridou, and T. Graepel. Biases for emergent communication in multi-agent reinforcement learning. In Proc. NeurIPS, 2019. J. Foerster, N. Nardelli, G. Farquhar, T. Afouras, P. H. S. Torr, P. Kohli, and S. Whiteson. Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning. Proc. ICML, 2017. S. Gao, W. Hou, T. Tanaka, and T. Shinozaki. Spoken language acquisition based on reinforcement learning and word unit segmentation. In Proc. ICASSP, pp. 6149–6153, 2020. N. Geffen Lan, E. Chemla, and S. Steinert-Threlkeld. On the Spontaneous Emergence of Discrete and Compositional Signals. In Proc. ACL, pp. 4794–4800, 2020. S. Havrylov and I. Titov. Emergence of language with multi-agent games: Learning to communicate with sequences of symbols. In Proc. NeurIPS, 2017. I. S. Howard and P. Messum. Learning to pronounce first words in three languages: An investigation of caregiver and infant behavior using a computational model of an infant. PLOS ONE, pp. 1–21, 2014. I. Kajic, E. Aygun, and D. Precup. Learning to cooperate: Emergent communication in multi-agent¨ navigation. arXiv e-prints, 2020. S. Kirby. Spontaneous evolution of linguistic structure: an iterated learning model of the emergence of regularity and irregularity. IEEE Transactions on Evolutionary Computation, pp. 102–110, 2001. S. Kottur, J. Moura, S. Lee, and D. Batra. Natural language does not emerge ‘naturally’ in multi-agent dialog. In Proc. EMNLP, 2017. ----- P. K. Kuhl. Early language acquisition: cracking the speech code. Nature Reviews Neuroscience, pp. 831–843, 2005. A. Lazaridou and M. Baroni. Emergent multi-agent communication in the deep learning era. CoRR, 2020. A. Lazaridou, K. Hermann, K. Tuyls, and S. Clark. Emergence of linguistic communication from referential games with symbolic and pixel input. Proc. ICLR, 2018. D. Lewis. Convention. Blackwell, 1969. B. McFee, A. Metsai, M. McVicar, S. Balke, C. Thome, C. Raffel, F. Zalkow, A. Malek, Dana, K. Lee,´ O. Nieto, D. Ellis, J. Mason, E. Battenberg, S. Seyfarth, R. Yamamoto, viktorandreevichmorozov, K. Choi, J. Moore, R. Bittner, S. Hidaka, Z. Wei, nullmightybofo, D. Heren˜u, F.-R. St´ oter,¨ P. Friesch, A. Weiss, M. Vollrath, T. Kim, and Thassilo. 
librosa/librosa: 0.8.1rc2, 2021. URL [https://doi.org/10.5281/zenodo.4792298.](https://doi.org/10.5281/zenodo.4792298) V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. In NIPS Deep Learning Workshop, 2013. I. Mordatch and P. Abbeel. Emergence of grounded compositional language in multi-agent populations. In Proc. AAAI, 2017. C. Moulin-Frier and P.-Y. Oudeyer. Multi-Agent Reinforcement Learning as a Computational Tool for Language Evolution Research: Historical Context and Future Challenges. In Proc. AAAI, 2021. C. Moulin-Frier, J. Diard, J.-L. Schwartz, and P. Bessiere. Cosmo (“communicating about ob-` jects using sensory–motor operations”): A bayesian modeling framework for studying speech communication and the emergence of phonological systems. Journal of Phonetics, pp. 5–41, 2015. P.-Y. Oudeyer. The self-organization of speech sounds. Journal of Theoretical Biology, pp. 435–449, 2005. H. Rasilo and O. Ras¨ anen. An online model for vowel imitation learning.¨ _Speech Communication,_ pp. 1–23, 2017. C. Resnick, A. Gupta, J. Foerster, A. Dai, and K. Cho. Capacity, bandwidth, and compositionality in emergent language learning. In Proc. AAMAS, 2020. M. Rita, R. Chaabouni, and E. Dupoux. “LazImpa”: Lazy and impatient neural agents learn to communicate efficiently. In Proc. ACL, pp. 335–343, 2020. L. Steels. The synthetic modeling of language origins. Evolution of Communication, pp. 1–34, 1997. L. Steels and T. Belpaeme. coordinating perceptually grounded categories through language: a case study for colour. Behavioral and Brain Sciences, pp. 469–489, 2005. R. S. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998. A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, and R. Vicente. Multiagent cooperation and competition with deep reinforcement learning. PLOS ONE, pp. 1–15, 2017. L. Yuan, Z. Fu, J. Shen, L. Xu, J. Shen, and S.-C. Zhu. Emergence of pragmatics from referential game between theory of mind agents. In Proc. NeurIPS, 2020. ----- APPENDICES A EXPERIMENT DETAILS A.1 GENERAL EXPERIMENTAL SETUP Here we provide the general setup for all experimentation. **Parameter** **Value** _Optimiser_ Adam _Batch Size_ 128 _Replay size_ 256 _Training Episodes_ 5000 _Evaluation interval_ 100 _Evaluation episodes_ 25 _Runs (varying seed)_ 20 _GPU_ Nvidia RTX 2080 Super _Time (per run)_ _≈_ 30 minutes A.2 EXPERIMENT PARAMETERS Here we provide specific details on a per-experiment basis. The phone sequence length M in the grounded experiments is chosen such that the full ground truth phonetic pronunciation could be made by the speaker agent. **Experiment** **Agent** **Learning Rate** **Phone length (M** **)** **GRU hidden size** Unconstrained Single-Concept Speaker 1 × 10[−][4] 5 256 Listener 5 × 10[−][5] - 256 Unconstrained Multi-Concept Speaker 1 × 10[−][5] 7 512 Listener 5 × 10[−][5] - 512 Grounded Single-Concept Speaker 1 × 10[−][4] 6 256 Listener 5 × 10[−][5] - 256 Grounded Multi-Concept Speaker 1 × 10[−][5] 16 512 Listener 5 × 10[−][5] - 512 B RESULTS B.1 GROUNDING EMERGENT COMMUNICATION 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 500 1000 1500 2000 2500 3000 Training Episode 1000 2000 3000 4000 5000 Training Episode (a) Mean evaluation reward of Listener agent over 20 training runs. (b) Mean evaluation reward of Speaker agent over 20 training runs. Figure 7: Evaluation results of the grounded Speaker and Listener agent during training. 
Shading indicates the bootstrapped 95% confidence interval.