|
# ONLINE AD HOC TEAMWORK UNDER PARTIAL OBSERVABILITY |
|
|
|
**Pengjie Gu[1], Mengchen Zhao[2,∗], Jianye Hao[3,2], Bo An[1]**
|
|
|
School of Computer Science and Engineering, Nanyang Technological University, Singapore[1] |
|
|
|
Noah’s Ark Lab, Huawei[2] |
|
|
|
College of Intelligence and Computing, Tianjin University[3] |
|
{pengjie.gu, boan}@ntu.edu.sg, {zhaomengchen, haojianye}@huawei.com |
|
|
|
ABSTRACT |
|
|
|
Autonomous agents often need to work together as a team to accomplish complex cooperative |
|
tasks. Due to privacy and other realistic constraints, agents might need to collaborate with |
|
previously unknown teammates on the fly. This problem is known as ad hoc teamwork, which |
|
remains a core research challenge. Prior works usually rely heavily on strong assumptions such as full observability and fixed, predefined teammate types. This paper relaxes these assumptions
|
with a novel reinforcement learning framework called ODITS, which allows the autonomous |
|
agent to adapt to arbitrary teammates in an online fashion. Instead of limiting teammates to
|
a finite set of predefined types, ODITS automatically learns latent variables of teammates’ |
|
behaviors to infer how to cooperate with new teammates effectively. To overcome partial |
|
observability, we introduce an information-based regularizer to derive proxy representations of |
|
the learned variables from local observations. Extensive experimental results show that ODITS |
|
significantly outperforms various baselines in widely used ad hoc teamwork tasks. |
|
|
|
1 INTRODUCTION |
|
|
|
Recently, autonomous agents including robots and software agents are being widely deployed in different
|
environments. In many tasks, they are increasingly required to cooperate with other unknown teammates on
|
the fly. For example, in case of search and rescue tasks in a disaster, due to privacy or lack of time, deployed |
|
robots need to interact with robots from other companies or laboratories, whose coordination protocols might not |
|
be explicitly provided in advance (Barrett and Stone, 2015). Besides, in the domain of game AI (Yannakakis, |
|
2012), virtual agents are required to assist different agents controlled by human players. To effectively complete |
|
these tasks, autonomous agents must show high adaptation ability to collaborate with intrinsically diverse and |
|
unknown teammates. This problem is known in the literature as ad hoc teamwork (Stone et al., 2010). |
|
|
|
Existing approaches on ad hoc teamwork usually assume that all teammates’ behaviors are categorized into |
|
several predefined and fixed types, which correspond to different coordination strategies (Barrett and Stone,
|
2015; Durugkar et al., 2020; Mirsky et al., 2020). Then, by reasoning over the type of interacting teammates, the |
|
agent switches its behavior to the corresponding policy. If the types are correctly recognized and the strategies are |
|
effective, the agent would accomplish the given cooperation task well. However, defining sufficiently descriptive |
|
types of teammates requires prior domain knowledge, especially in uncertain and complex environments. For |
|
example, in human-AI collaboration in Hanabi (Bard et al., 2020), there is often a wide variety of cooperative behaviors shown by human players. It is challenging for predefined types to cover all possible human players'
|
behaviors. Further, teammates’ strategies might be rapidly evolving throughout the entire teamwork. If the agent |
|
assumes that teammates’ behavioral types are static and cannot adapt to current teammates’ behaviors in an |
|
*online* fashion, teamwork would suffer from serious miscoordination (Ravula et al., 2019; Chen et al., 2020).
|
Rescue and search tasks are an essential class of such examples (Ravula et al., 2019). On the other hand, existing |
|
techniques (Barrett and Stone, 2015; Albrecht and Stone, 2017; Chen et al., 2020; Ravula et al., 2019) try to |
|
utilize Bayesian posteriors over teammate types to obtain optimal responses. To effectively compute posteriors, |
|
they usually assume that the agent could always know other teammates’ observations and actions. However, this |
|
|
|
_∗Corresponding author_ |
|
|
|
|
|
|
|
|
assumption is unrealistic in partially observable environments, in which each agent is not aware of other agents'
|
observations. |
|
|
|
To address the issues mentioned above, this paper introduces an adaptive reinforcement learning framework called |
|
**Online aDaptation via Inferred Teamwork Situations (ODITS)**. Our key insight is that teamwork performance
|
is jointly affected by the autonomous agent and other teammates’ behaviors. So, the agent’s optimal behavior |
|
depends on the current teamwork situation, which indicates the influence on the environmental dynamics caused |
|
by other teammates. If the agent identifies the current teamwork situation in an online fashion, it can choose |
|
actions accordingly to ensure effective coordination. To this end, we introduce a multimodal representation
|
learning framework (Suzuki et al., 2016; Yin et al., 2017). It automatically encodes the core knowledge about |
|
the teamwork situations into a latent probabilistic variable. We show that without any prior knowledge, after |
|
learning from the interactive experience with given teammates, the latent variable is sufficiently descriptive to |
|
provide information about how to coordinate with new teammates’ behaviors. To overcome partial observability, |
|
we propose an information-based proxy encoder to implicitly infer the learned variables from local observations. |
|
Then, the autonomous agent adapts to new teammates’ behaviors dynamically and quickly by conditioning its |
|
policy on the inferred variables. |
|
|
|
Instead of limiting teammates to several predefined and fixed types, ODITS considers a mechanism of how
|
an agent should adapt to teammates’ behaviors online. It automatically learns continuous representations of |
|
teammates’ behaviors to infer how to coordinate with current teammates’ actions effectively. Without domain |
|
knowledge on current environments, it enables effective ad hoc teamwork performance and fast adaptation to |
|
varying teammates, which the agent might not thoroughly observe under partial observability. In our experimental |
|
evaluation, by interacting with a small set of given teammates, the trained agents could robustly collaborate with |
|
diverse new teammates. Compared with various type-based baselines, ODITS achieves superior ad hoc teamwork
|
performance. Moreover, our ablations show both the necessity of learning latent variables of teamwork situations |
|
and inferring the proxy representations of learned variables. |
|
|
|
2 RELATED WORKS |
|
|
|
**Ad Hoc Teamwork.** The core challenge of achieving cooperative ad hoc teamwork is to develop an adaptive
|
policy robust to various unknown teammates’ behaviors (Stone et al., 2010). Existing type-based approaches try |
|
to predefine types of teammates and choose policies accordingly to cooperate with unknown teammates (Chen |
|
et al., 2020; Ravula et al., 2019; Durugkar et al., 2020; Mirsky et al., 2020; Barrett and Stone, 2015). Specifically,
|
PLASTIC (Barrett and Stone, 2015) infers types of teammates by computing Bayesian posteriors over all types. |
|
ConvCPD (Ravula et al., 2019) extends this work by introducing a mechanism to detect the change point of the |
|
current teammate’s type. AATEAM (Chen et al., 2020) proposes an attention-based architecture to infer types in |
|
real time by extracting the temporal correlations from the state history. The drawback of these approaches is that |
|
finite and fixed types might not cover all possible situations in complex environments. One recent work avoids |
|
predefining teammates’ types by leveraging graph neural networks (GNNs) to estimate the joint action value of |
|
an ad hoc team (Rahman et al., 2021). However, this work requires all teammates’ observations as input, which |
|
might not always be available in the real world. |
|
|
|
**Agent Modeling.** By modeling teammates' behaviors, agent modeling approaches aim to provide auxiliary
|
information, such as teammates’ goals or future actions, for decision-making (He et al., 2016; Albrecht and Stone, |
|
2018). For example, MeLIBA conditions the ad hoc agent’s policy on a belief over teammates, which is updated |
|
following the Bayesian rule (Zintgraf et al., 2021). However, existing agent models require the full observations |
|
of teammates as input (Raileanu et al.; Grover et al.; Tacchetti et al., 2019). If the agent cannot always observe |
|
other teammates’ information (e.g. observations and actions), those approaches would fail to give an accurate |
|
prediction about teammates' information. A recent work proposes to use a VAE to model fixed-policy opponents
|
under partial observability. However, it does not generalize to the ad hoc setting where the teammates can be |
|
non-stationary (Papoudakis and Albrecht, 2020). There are also some works studying how to generate diverse agent policies, which benefits the training of the ad hoc agent (Canaan et al., 2019).
|
|
|
**Multi-agent Reinforcement Learning (MARL).** Cooperative MARL (Foerster et al., 2017) with centralized
|
training and decentralized execution (Oliehoek et al., 2008) (CTDE) is relevant to this work. Related approaches |
|
(Sunehag et al., 2018; Rashid et al., 2018) utilize value function factorization to overcome the limitations of |
|
both joint and independent learning paradigms simultaneously. However, these algorithms assume that the |
|
developed team is fixed and closed. The team configuration (e.g., team size, team formation, and goals) is |
|
|
|
|
|
|
|
|
unchanged, and agents will not meet other agents without pre-coordination. Several extended works improve |
|
the generalization ability for complex team configurations by leveraging other insights, like learning dynamic |
|
roles (Wang et al., 2021; 2020), randomized entity-wise factorization (Iqbal et al., 2020), and training regimes based on game-theoretic principles (Muller et al., 2020). However, intrinsically, these approaches usually focus
|
on co-training a group of highly-coupled agents instead of an autonomous agent that can adapt to non-stationary |
|
teammates. |
|
|
|
3 BACKGROUND |
|
|
|
**Problem Formalization.** Our aim is to develop a single autonomous agent, which we refer to as the *ad hoc agent*, that can effectively cooperate with various teammates under partial observability without pre-coordination, such as joint training. While we focus on training a single agent in this work, similar approaches can be applied
|
to construct an ad hoc team. |
|
|
|
Figure 1: Visualization of the Dec-POMDP with an additional teammate set $\Gamma$. The ad hoc agent receives only a partial observation; unknown teammates are sampled from the teammate set; the joint action $\mathbf{a}$, formed by the ad hoc agent's action $a^i$ and the teammates' actions $a^{-i}$, is executed in the environment, which returns a shared reward $R$.
|
|
|
To evaluate the ad hoc agent’s ability to cooperate with unknown teammates, we formally define the problem as |
|
a decentralized Partially observable Markov Decision Process (Dec-POMDP) (Oliehoek et al., 2008) with an |
|
additional assumption about the set of teammates’ possible policies Γ. |
|
|
|
It can be represented as a tuple $\langle N, S, A, O, T, P, R, O, \Gamma \rangle$, where $N$ denotes the number of agents required by the task and $s \in S$ denotes the global state of the environment. The joint action $\mathbf{a} \in A^N$ is formed by all agents' independent actions $a^i \in A$, where $i$ is the index of the agent. In the environment, each agent only has access to its partial observation $o^i \in O$ drawn according to the observation function $O(s, i)$, and it has an observation-action history $\tau^i \in T \equiv (O \times A)^*$. $P(s' \mid s, \mathbf{a})$ denotes the probability that taking joint action $\mathbf{a}$ in state $s$ results in a transition to state $s'$. $R(s, \mathbf{a})$ is the reward function that maps a state $s$ and a joint action $\mathbf{a}$ to a team reward $r \in \mathbb{R}$. $\Gamma$ represents a pool of various policies, which can be pretrained or predefined to exhibit cooperative behaviors. Without loss of generality, we denote by $\pi_i$ the policy of the ad hoc agent and by $\pi_{-i}$ the joint policy of all other agents.
|
|
|
Fig. 1 shows the detailed schematics of this problem. Note that the original group size of the teammates can be arbitrary. In a Dec-POMDP with an additional teammate set $\Gamma$, the objective of the ad hoc agent is to maximize the expected team return when it teams up with $N-1$ arbitrary teammates sampled from $\Gamma$, though it has no prior knowledge about those teammates. Therefore, the ad hoc agent's optimal policy $\pi_i^*$ is required to maximize the joint action value $Q^{\pi_i}(s, a^i, a^{-i})$, which indicates the expected cumulative team reward over different ad hoc teamwork:
|
|
|
|
|
$$Q^{\pi_i}(s, a^i, a^{-i}) = \mathbb{E}_{a^i_{t=1:+\infty}\sim\pi_i,\; a^{-i}_{t=1:+\infty}\sim\pi_{-i},\; \pi_{-i}\sim\Gamma}\left[\sum_{t=0}^{+\infty}\gamma^t r_t \;\middle|\; s_0 = s,\; \mathbf{a}_0 = \mathbf{a},\; P\right] \tag{1}$$

$$Q^{\pi_i^*}(s, a^i, a^{-i}) \ge Q^{\pi_i}(s, a^i, a^{-i}), \quad \forall\, \pi_i, s, a^i, a^{-i} \tag{2}$$
|
|
|
|
|
**Marginal Utility** is defined to measure the contribution of an ad hoc agent to the whole team utility (Genter and
|
Stone, 2011). It represents the increase (or decrease) in a team’s utility when an ad hoc agent is added to the |
|
team. Given teammates’ actions a[−][i], there is a relationship between the marginal utility and the team utility |
|
|
|
|
|
----- |
|
|
|
(denoted by the joint action value) as follow: |
|
|
|
arg max (3) |
|
_a[i][ u][i][(][s, a][i][,][ a][−][i][) = arg max]a[i][ Q][π][i]_ [(][s, a][i][,][ a][−][i][)] |
|
|
|
where u[i](s, a[i], a[−][i]) denotes the marginal utility when the ad hoc agent chooses the action a[i] under the state s. |
|
Note that the marginal utility is not necessarily equal to the Q-value (Sunehag et al., 2018). The ad hoc agent |
|
chooses the action which maximizes the marginal utility to ensure the maximal team utility. |
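To make the objective in Eqs. (1)–(3) concrete, the following minimal sketch estimates the expected team return of a fixed ad hoc policy by pairing it with teammates sampled from $\Gamma$ and rolling out episodes. The environment interface (`reset`/`step`), the policy callables, and all names are illustrative assumptions rather than part of ODITS.

```python
import random

def estimate_team_return(env, ad_hoc_policy, teammate_pool, gamma=0.99,
                         n_episodes=32, max_steps=100):
    """Monte-Carlo estimate of the objective in Eq. (1): the expected
    discounted team return when the ad hoc agent teams up with teammates
    drawn from the pool Gamma (hypothetical env/policy interface)."""
    returns = []
    for _ in range(n_episodes):
        teammates = random.choice(teammate_pool)      # pi_{-i} ~ Gamma
        obs = env.reset()
        episode_return, discount = 0.0, 1.0
        for _ in range(max_steps):
            a_i = ad_hoc_policy(obs["ad_hoc"])        # a^i from pi_i
            a_minus_i = teammates(obs["teammates"])   # a^{-i} from pi_{-i}
            obs, team_reward, done = env.step(a_i, a_minus_i)
            episode_return += discount * team_reward  # accumulate gamma^t r_t
            discount *= gamma
            if done:
                break
        returns.append(episode_return)
    return sum(returns) / len(returns)
```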
|
|
|
4 ODITS LEARNING FRAMEWORK |
|
|
|
Our approach addresses the ad hoc teamwork problem |
|
with a novel probabilistic framework ODITS. In this |
|
section, we first introduce the overall architecture of

this framework and then present a detailed description
|
of all modules in this framework. |
|
|
|
4.1 OVERVIEW |
|
|
|
|
|
ODITS aims to estimate the ad hoc agent’s marginal |
|
utility for choosing corresponding actions to maximize the team utility. To achieve an adaptive policy
|
to unknown teammates, we model the marginal utility |
|
as a conditional function on the inferred latent variable, which implicitly represents the current teamwork |
|
situation. ODITS jointly optimizes the marginal utility function and the latent variable by two learning |
|
objectives in an end-to-end fashion. Fig.2 shows the |
|
detailed schematics of ODITS. It splits the team into |
|
two parts: teammates and the ad hoc agent. |
|
|
|
|
|
|
|
Figure 2: Schematics of ODITS. The diagram shows the teamwork situation encoder $f$ (taking $(s_t, a^{-i}_t)$ as input), the teamwork situation decoder $g$, and the integrating network $G$ that produces the joint action value and is trained with the Q loss; these components are only available in the training phase. On the ad hoc agent's side, the proxy encoder $f^*$ maps $b^i_t$ to $z^i_t$, the proxy decoder $g^*$ generates parameters of the marginal utility network $M$, which outputs $u^i(\tau^i_t, \cdot\,; z^i_t)$, and the MI loss aligns $z^i_t$ with $c^i_t$.
|
|
|
|
|
First, we regard other teammates as a part of environmental dynamics perceived by the ad hoc agent. Since |
|
different combinations of teammates lead to diverse and complex dynamics, we expect to learn a latent variable to |
|
describe the core information of teammates’ behaviors implicitly. To do this, we introduce a teamwork situation |
|
**encoder** $f$ to learn the variable. Then, a loss function (Q loss), an integrating network $G$, and a **teamwork situation decoder** $g$ are jointly proposed to regularize the information embedded in the learned variable $c^i_t$.

For the ad hoc agent, we expect to condition its policy on the learned variable $c^i_t$. However, the partial observability impedes the direct access to $c^i_t$. Thus, we introduce a **proxy encoder** $f^*$ to infer a proxy representation $z^i_t$ of $c^i_t$ from local observations. We force $z^i_t$ to be informationally consistent with $c^i_t$ by an information-based loss function (MI loss). Then, we train a **marginal utility network** $M$ to estimate the ad hoc agent's conditional marginal utility $\hat{u}^i(\tau^i_t, a^i_t; z^i_t) \approx u^i(s_t, a^i_t, a^{-i}_t)$. For conditional behavior, a part of the parameters of $M$ are generated by the proxy decoder $g^*$.
|
|
|
Similar to the CTDE scenario (Oliehoek et al., 2008), we relax the partial observability in the training phase. |
|
ODITS is granted access to the global state $s_t$ and other teammates' actions $a^{-i}_t$ during training. During execution, $G$, $f$, and $g$ are removed; the ad hoc agent chooses the action which maximizes the conditional marginal utility function $\hat{u}^i(\tau^i_t, a^i_t; z^i_t)$.
|
|
|
4.2 LEARNING TO REPRESENT TEAMWORK SITUATIONS |
|
|
|
For adaptive behaviors, we expect to condition the ad hoc agent’s policy on other teammates. However, unknown |
|
teammates show complex behaviors. Directly conditioning the policy on them might lead to a volatile policy. To |
|
address this issue, we aim to embed the teammates’ information into a compact but descriptive representation. |
|
To model the concept clearly, we formally define teamwork situation: |
|
|
|
**Definition 1 (teamwork situation)** *At each time step $t$, the ad hoc agent is in the teamwork situation $c^i_t \in \mathcal{C}$, which is the current underlying teamwork state yielded by the environment state $s_t$ and other teammates' actions $a^{-i}_t$. It reflects the high-level semantics about the teammates' behaviors.*
|
|
|
|
|
|
|
|
Though different teammates generate diverse state-action trajectories, we assume that they can cause similar |
|
teamwork situations at some time, and the ad hoc agent’s action would affect their transitions. When the current |
|
teamwork situation is identified, the ad hoc agent can choose the action accordingly to ensure online adaptation. |
|
|
|
**Teamwork Situation Encoder $f$.** To model the uncertainty of unknown teammates, we encode teamwork situations in a stochastic embedding space $\mathcal{C}$. Thus, any teamwork situation can be represented as a latent probabilistic variable $c^i$ that is drawn from a multivariate Gaussian distribution $\mathcal{N}(\mu_{c^i}, \sigma_{c^i})$. To enable the dependency revealed in the definition, we use a trainable neural network $f$ to learn the parameters of the Gaussian distribution of $c^i$:

$$(\mu_{c^i}, \sigma_{c^i}) = f(s, a^{-i}; \theta_f), \qquad c^i \sim \mathcal{N}(\mu_{c^i}, \sigma_{c^i}) \tag{4}$$

where $\theta_f$ are the parameters of $f$.
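As a rough illustration of Eq. (4), the sketch below implements a Gaussian encoder in PyTorch; the hidden size, the softplus parameterization of the standard deviation, and the class name are our own assumptions, not the exact architecture reported in the Supplementary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal

class TeamworkSituationEncoder(nn.Module):
    """Sketch of the encoder f in Eq. (4): maps the global state s_t and the
    teammates' joint action a^{-i}_t to the mean and standard deviation of a
    Gaussian over the latent teamwork situation c^i_t (sizes are illustrative)."""
    def __init__(self, state_dim, teammate_action_dim, latent_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + teammate_action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),
        )

    def forward(self, state, teammate_actions):
        mu, sigma_raw = self.net(torch.cat([state, teammate_actions], dim=-1)).chunk(2, dim=-1)
        sigma = F.softplus(sigma_raw) + 1e-4       # keep sigma_{c^i} strictly positive
        dist = Normal(mu, sigma)
        c = dist.rsample()                         # reparameterized sample of c^i_t
        return c, dist
```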
|
|
|
**Regularizing Information Embedded in $c^i$.** We introduce a set of modules to jointly force $c^i$ to be sufficiently descriptive for reflecting the current teamwork situation. If $c^i_t$ is able to capture the core knowledge about other teammates' current behaviors, we can predict the joint action value $Q^{\pi_i}(s_t, a^i_t, a^{-i}_t)$ according to $c^i_t$ and the ad hoc agent's marginal utility $u^i_t$¹. Thus, we propose an integrating network $G$ for generating the joint action value's estimation $G(u^i_t, c^i_t) \approx Q^{\pi_i}(s_t, a^i_t, a^{-i}_t)$. We adopt a modified asynchronous Q-learning loss function (Q loss) (Mnih et al., 2016) as the optimization objective:
|
|
|
|
|
_Q = E(uit[,c][i]t[,r][t][)][∼D]_ |
|
_L_ |
|
|
|
|
|
|
|
[(rt + γ max _G¯(u[i]t+1[, c]t[i]+1[)][ −]_ _[G][(][u]t[i][, c]t[i][)]][2]_ |
|
_a[i]t+1_ |
|
|
|
|
|
(5) |
|
|
|
|
|
where _G[¯] is a periodically updated target network. The expectation is estimated with uniform samples from the_ |
|
replay buffer D, which saves the interactive experience with training teammates. |
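A minimal sketch of how the TD objective in Eq. (5) could be computed in PyTorch is given below; the batch layout, the way the maximizing next-step utility is obtained, and all names are assumptions made for illustration.

```python
import torch

def q_loss(G, G_target, batch, gamma=0.99):
    """Squared TD error of Eq. (5).  `G` is the integrating network,
    `G_target` its periodically updated copy, and `batch` is assumed to hold
    tensors sampled from the replay buffer D."""
    q = G(batch["u_t"], batch["c_t"])                      # G(u^i_t, c^i_t)
    with torch.no_grad():
        # The max over a^i_{t+1} is assumed to be resolved by evaluating the
        # candidate utilities at t+1 beforehand and storing the best one.
        target = batch["r_t"] + gamma * G_target(batch["u_next_max"], batch["c_next"])
    return ((target - q) ** 2).mean()
```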
|
|
|
**Integrating Network $G$.** One simple approach for integrating $c^i$ with $u^i$ is to formulate $G$ as an MLP that maps their concatenation into the joint value estimation. We instead propose to map $c^i$ into the parameters of $G$ by a hypernetwork (Ha et al., 2016), which we refer to as the teamwork situation decoder $g$. Then, $G$ maps the ad hoc agent's utility $u^i$ into the value estimation. This alternative design changes the procedure of information integration. The decoder provides multiplicative integration to aggregate information. By contrast, the concatenation-based operation only provides additive integration, leading to a poor information integration ability (see Supplementary). We also empirically show that multiplicative integration stabilizes the training procedure and improves teamwork performance. In addition, we expect a monotonic relationship between $G$ and the marginal utility $u^i_t$: $\frac{\partial G}{\partial u^i_t} \ge 0$. Given any $c^i_t$, an increase in the ad hoc agent's marginal utility results in an improved joint action value. To achieve this property, we force $\theta_G \ge 0$.
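The sketch below shows one way such a hypernetwork-based integrating network could be realized: a hypernetwork standing in for the teamwork situation decoder $g$ maps $c^i_t$ to the weights and biases of $G$, and taking absolute values of the generated weights enforces $\partial G / \partial u^i_t \ge 0$. The two-layer structure and layer sizes are our own assumptions.

```python
import torch
import torch.nn as nn

class IntegratingNetwork(nn.Module):
    """Sketch of G with multiplicative integration: a hypernetwork (playing the
    role of the teamwork situation decoder g) generates the parameters of G
    from c^i_t, and the generated weights are made non-negative so that G is
    monotone in the ad hoc agent's marginal utility u^i_t."""
    def __init__(self, latent_dim, hidden=32):
        super().__init__()
        self.w1 = nn.Linear(latent_dim, hidden)   # g: c -> first-layer weights
        self.b1 = nn.Linear(latent_dim, hidden)   # g: c -> first-layer bias
        self.w2 = nn.Linear(latent_dim, hidden)   # g: c -> second-layer weights
        self.b2 = nn.Linear(latent_dim, 1)        # g: c -> output bias

    def forward(self, u, c):
        # u: (batch, 1) marginal utility, c: (batch, latent_dim) teamwork situation
        w1 = torch.abs(self.w1(c))                # non-negative weights => dG/du >= 0
        h = torch.relu(u * w1 + self.b1(c))
        w2 = torch.abs(self.w2(c))
        return (h * w2).sum(dim=-1, keepdim=True) + self.b2(c)   # estimate of the joint action value
```

Because the generated weights scale $u^i_t$ directly, this realizes the multiplicative integration discussed above; replacing the gating with a plain concatenation would fall back to additive integration.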
|
|
|
4.3 LEARNING CONDITIONAL MARGINAL UTILITY FUNCTION UNDER PARTIAL OBSERVABILITY |
|
|
|
Clearly, the marginal utility of the ad hoc agent depends on other teammates' behaviors. Distinct behaviors result in different marginal utilities. Here, we formalize the marginal utility network $M$ as a deep recurrent Q-network (DRQN) (Hausknecht and Stone, 2015) parameterized by $\theta_M$. To enable adaptation, we force the final layers' parameters of $M$ to condition on the learned variable $c^i_t$.
|
|
|
**Proxy Encoder $f^*$.** Because of partial observability, the teamwork situation encoder $f$ is not available during execution. Thus, we introduce a proxy encoder $f^*$ to estimate $c^i_t$ from the local transition data $b^i_t = (o^i_t, r_{t-1}, a^i_{t-1}, o^i_{t-1})$. We assume that $b^i_t$ can partly reflect the current teamwork situation since the transition implicitly indicates the underlying dynamics, which are primarily influenced by other teammates' behaviors. We denote the estimate of $c^i_t$ as $z^i_t$. Then, $z^i_t$ is fed into a proxy decoder $g^*(z^i_t; \theta_{g^*})$ parameterized by $\theta_{g^*}$ to generate the parameters $\theta_M$ of $M$, enabling the marginal utility function to condition on the proxy representation $z^i_t$. Similar to $c^i$, we encode $z^i$ into a stochastic embedding space:

$$(\mu_{z^i}, \sigma_{z^i}) = f^*(b^i; \theta_{f^*}), \qquad z^i \sim \mathcal{N}(\mu_{z^i}, \sigma_{z^i}) \tag{6}$$

where $\theta_{f^*}$ are the parameters of $f^*$.
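As an illustration of how $f^*$ and $M$ fit together, the sketch below encodes the local transition into a Gaussian proxy variable and lets a hypernetwork (standing in for the proxy decoder $g^*$) generate the final layer of a recurrent utility network. Dimensions, the GRU cell, and all names are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal

class ProxyEncoder(nn.Module):
    """Sketch of f* in Eq. (6): encodes the local transition
    b^i_t = (o^i_t, r_{t-1}, a^i_{t-1}, o^i_{t-1}) into a Gaussian over z^i_t."""
    def __init__(self, transition_dim, latent_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(transition_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * latent_dim))

    def forward(self, b):
        mu, sigma_raw = self.net(b).chunk(2, dim=-1)
        dist = Normal(mu, F.softplus(sigma_raw) + 1e-4)
        return dist.rsample(), dist


class MarginalUtilityNetwork(nn.Module):
    """Sketch of M: a recurrent utility network whose final layer is
    generated from z^i_t by a hypernetwork standing in for the proxy decoder g*."""
    def __init__(self, obs_action_dim, n_actions, latent_dim, hidden=64):
        super().__init__()
        self.gru = nn.GRUCell(obs_action_dim, hidden)
        self.decoder_w = nn.Linear(latent_dim, hidden * n_actions)  # g*: z -> weights
        self.decoder_b = nn.Linear(latent_dim, n_actions)           # g*: z -> biases
        self.n_actions, self.hidden = n_actions, hidden

    def forward(self, obs_action, h, z):
        h = self.gru(obs_action, h)                                  # summarize tau^i_t
        w = self.decoder_w(z).view(-1, self.hidden, self.n_actions)
        utilities = torch.bmm(h.unsqueeze(1), w).squeeze(1) + self.decoder_b(z)
        return utilities, h                                          # u^i(tau, a; z) for every action a
```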
|
|
|
|
|
¹ $u^i_t$ is a shorthand notation of $u^i(\tau^i_t, a^i_t; z^i_t)$.
|
|
|
|
|
|
|
|
**Regularizing Information Embedded in $z^i$.** To make $z^i_t$ identifiable, we expect $z^i_t$ to be informatively consistent with $c^i_t$. Thus, we introduce an information-based loss function $\mathcal{L}_{MI}$ to maximize the conditional mutual information $I(z^i_t; c^i_t \mid b^i_t)$ between the proxy variables and the true variables. However, estimating and maximizing mutual information is often infeasible. We introduce a variational distribution $q_\xi(z^i_t \mid c^i_t, b^i_t)$ parameterized by $\xi$ to derive a tractable lower bound for the mutual information (Alemi et al., 2017):

$$I(z^i_t; c^i_t \mid b^i_t) \ge \mathbb{E}_{z^i_t, c^i_t, b^i_t}\left[\log \frac{q_\xi(z^i_t \mid c^i_t, b^i_t)}{p(z^i_t \mid b^i_t)}\right] \tag{7}$$

where $p(z^i_t \mid b^i_t)$ is the Gaussian distribution $\mathcal{N}(\mu_{z^i}, \sigma_{z^i})$. This lower bound can be rewritten as a loss function to be minimized:

$$\mathcal{L}_{MI}(\theta_{f^*}, \xi) = \mathbb{E}_{(b^i_t, s_t, a^{-i}_t)\sim\mathcal{D}}\left[D_{KL}\left[p(z^i_t \mid b^i_t)\,\|\,q_\xi(z^i_t \mid c^i_t, b^i_t)\right]\right] \tag{8}$$

where $\mathcal{D}$ is the replay buffer and $D_{KL}[\cdot\|\cdot]$ is the KL divergence operator. The detailed derivation can be found in the Supplementary.
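Since both distributions are Gaussian, Eq. (8) reduces to a KL divergence that can be computed in closed form. The sketch below assumes the proxy distribution produced by $f^*$ and a variational posterior head that outputs a mean and standard deviation; both names are illustrative.

```python
import torch
from torch.distributions import Normal, kl_divergence

def mi_loss(proxy_dist, variational_head, c, b):
    """Sketch of L_MI in Eq. (8): KL[ p(z^i_t | b^i_t) || q_xi(z^i_t | c^i_t, b^i_t) ].
    `proxy_dist` is the Normal produced by the proxy encoder f*, and
    `variational_head` is assumed to map (c, b) to the mean/std of q_xi."""
    mu_q, sigma_q = variational_head(torch.cat([c, b], dim=-1))
    q_dist = Normal(mu_q, sigma_q)
    return kl_divergence(proxy_dist, q_dist).sum(dim=-1).mean()
```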
|
|
|
|
|
4.4 OVERALL OPTIMIZATION OBJECTIVE |
|
|
|
**Algorithm 1 ODITS Training** |
|
|
|
**Require:** Batch of training teammates' behavioral policies $\{\pi^{-i}_j\}^{tr}_{j=1,2,\cdots,J}$; learning rate $\alpha$; scaling factor $\lambda$.
1: initialize the replay buffer $\mathcal{D}$
2: **while** not done **do**
3:   **for** $k = 1, \cdots, K$ **do**
4:     Sample the teammates' policies $\pi^{-i}_j$ from $\{\pi^{-i}_j\}^{tr}_{j=1,2,\cdots,J}$
5:     Sample data $D_k = \{(s_t, a^i_t, a^{-i}_t, r_t)\}_{t=1,\cdots,T}$ using the ad hoc agent's policy $\pi_i$ and $\pi^{-i}$
6:     Add $D_k$ into $\mathcal{D}$
7:   **for** steps in training steps **do**
8:     Sample one trajectory $D \sim \mathcal{D}$
9:     **for** $t = 1, \cdots, T-1$ **do**
10:      Compute $(\mu_{c^i_t}, \sigma_{c^i_t}) = f(s_t, a^{-i}_t)$ and sample $c^i_t \sim \mathcal{N}(\mu_{c^i_t}, \sigma_{c^i_t})$
11:      Compute $(\mu_{z^i_t}, \sigma_{z^i_t}) = f^*(b^i_t)$ and sample $z^i_t \sim \mathcal{N}(\mu_{z^i_t}, \sigma_{z^i_t})$
12:      Compute $u^i_t(\tau^i_t, a^i_t; z^i_t)$ and $G(u^i_t; c^i_t)$
13:      Compute $\mathcal{L}_Q$, $\mathcal{L}_{MI}$
14:      $\theta \leftarrow \theta + \alpha \nabla_\theta(\mathcal{L}_Q)$
15:      $\theta_{f^*} \leftarrow \theta_{f^*} + \lambda \cdot \alpha \nabla_{\theta_{f^*}}(\mathcal{L}_{MI})$
16:      $\xi \leftarrow \xi + \lambda \cdot \alpha \nabla_\xi(\mathcal{L}_{MI})$
|
|
|
|
|
**Algorithm 2 ODITS Testing** |
|
|
|
**Require:** Testing teammates' behavioral policies $\pi^{-i}$.
1: **for** $t = 1, \cdots, T$ **do**
2:   Generate teammates' actions $a^{-i}_t \sim \pi^{-i}$
3:   Compute $(\mu_{z^i_t}, \sigma_{z^i_t}) = f^*(b^i_t)$ and sample $z^i_t \sim \mathcal{N}(\mu_{z^i_t}, \sigma_{z^i_t})$
4:   Take the action $a^i_t$ that maximizes $u^i(\tau^i_t, a^i_t; z^i_t)$
|
|
|
Finally, the overall objective becomes:

$$\mathcal{L}(\theta) = \mathcal{L}_Q(\theta) + \lambda \mathcal{L}_{MI}(\theta_{f^*}, \xi) \tag{9}$$

where $\theta = (\theta_f, \theta_g, \theta_M, \theta_{f^*}, \theta_{g^*}, \xi)$ and $\lambda$ is the scaling factor.
|
|
|
During the training phase, the ad hoc agent interacts with different training teammates for collecting transition |
|
data into the replay buffer D. Then, samples from D are fed into the framework for updating all parameters |
|
|
|
|
|
Figure 3: Performance comparison across various scenarios for Predator Prey (top panel; average number of captured preys in 2d1y, 4d2y, and 8d4y) and Save the City (bottom panel; average number of completed buildings in 2a2b, 4a3b, and 6a4b).
|
|
|
|
|
by gradients induced by the overall loss. During execution, the ad hoc agent conditions its behavior on the |
|
inferred teamwork situations by choosing actions to maximize the conditional utility function $u^i(\tau^i_t, a^i_t; z^i_t)$. We summarize our training procedure and testing procedure in Algorithm 1 and Algorithm 2.
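Pulling the pieces together, a single gradient update following Eq. (9) might look like the sketch below. It reuses the component sketches above, simplifies the handling of the recurrent state and next-step quantities, and uses a single optimizer over all parameters, which is a simplification of Algorithm 1 (where the MI gradients are scaled separately); all batch keys are assumptions.

```python
import torch

def training_step(batch, encoder_f, proxy_f, posterior_q, utility_M,
                  integrator_G, target_G, optimizer, lam=1e-3, gamma=0.99):
    """One update following Eq. (9): L = L_Q + lambda * L_MI (illustrative)."""
    c, _ = encoder_f(batch["state"], batch["teammate_actions"])       # c^i_t
    z, z_dist = proxy_f(batch["local_transition"])                    # z^i_t
    utilities, _ = utility_M(batch["obs_action"], batch["gru_state"], z)
    u_taken = utilities.gather(1, batch["action"])                    # action: long tensor (batch, 1)
    q = integrator_G(u_taken, c)                                      # G(u^i_t, c^i_t)

    with torch.no_grad():                                             # TD target of Eq. (5)
        c_next, _ = encoder_f(batch["next_state"], batch["next_teammate_actions"])
        z_next, _ = proxy_f(batch["next_local_transition"])
        u_next, _ = utility_M(batch["next_obs_action"], batch["next_gru_state"], z_next)
        best_u_next = u_next.max(dim=1, keepdim=True).values
        target = batch["reward"] + gamma * target_G(best_u_next, c_next)
    l_q = ((target - q) ** 2).mean()

    mu_q, sigma_q = posterior_q(torch.cat([c, batch["local_transition"]], dim=-1))
    l_mi = torch.distributions.kl_divergence(
        z_dist, torch.distributions.Normal(mu_q, sigma_q)).sum(dim=-1).mean()

    loss = l_q + lam * l_mi                                           # Eq. (9)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```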
|
|
|
5 EXPERIMENTS |
|
|
|
|
|
We now empirically evaluate ODITS on various new and existing domains. All experiments in this paper are |
|
carried out with 4 different random seeds, and results are shown with a 95% confidence interval over the standard
|
deviation. In the following description, we refer to the teammates that interact with the ad hoc agent during |
|
the training phase as the training teammates, and refer to those teammates with unknown policies as testing |
|
teammates. The "teammate types" correspond to the policy types of teammates. All experimental results
|
illustrate the average teamwork performance when the ad hoc agent cooperates with different testing teammates. |
|
Additional experiments, further experiment details, and implementation details of all models can be found in the
|
Supplementary. |
|
|
|
5.1 PREDATOR PREY |
|
|
|
|
|
**Configurations.** In this environment, $m$ homogeneous predators try to capture $n$ randomly-moving preys in a
|
7 × 7 grid world. Each predator has six actions: the moving actions in four directions, the capturing action, and |
|
waiting at a cell. Due to partial observability, each predator can only access the environmental information within |
|
two cells nearby. Besides, there are two obstacles at random locations. Episodes are 40 steps long. The predators |
|
get a team reward of 500 if two or more predators are capturing the same prey at the same time, and they are |
|
penalized by -10 if only one of them tries to capture a prey. Here, we adopt three different settings to verify

the ad hoc agent's ability to cooperate with different numbers of teammates. They are 2 predators and 1 prey
|
(2d1y), 4 predators and 2 preys (4d2y) and 8 predators and 4 preys (8d4y), respectively. |
|
|
|
We compare our methods with three type-based baselines: AATEAM (Chen et al., 2020), ConvCPD (Ravula et al., |
|
2019), PLASTIC (Barrett and Stone, 2015). Note that these approaches assume that the ad hoc agent has full |
|
visibility of the environment. To apply them in partially observable settings, we replace the full state information
|
used in them with partial observations of the ad hoc agent. Furthermore, we also compare two other strategies: |
|
(i) Random: The ad hoc agent chooses actions randomly. (ii) Combined: The ad hoc agent utilizes a DQN |
|
algorithm to learn a single policy using the data collected from all possible teammates. This intuitive baseline |
|
provides the view of treating the problem as a vanilla single-agent learning problem, where the agent ignores the |
|
differences between its teammates. |
|
|
|
Before training all algorithms, we first require a teammate set that consists of various behavioral policies of |
|
teammates. Instead of crafting several teammates’ cooperative policies by hand, we expect to train a set of distinct |
|
policies automatically. Therefore, we first utilize 5 different MARL algorithms (e.g. VDN (Sunehag et al., 2018) |
|
and QMIX (Rashid et al., 2018)) to develop several teams of agents. To ensure diversity, we use different random |
|
seeds for each algorithm and save the corresponding models at 3 different checkpoints (3 million steps, 4 million |
|
|
|
|
|
|
|
|
steps, and 5 million steps). Then, we manually select 15 different policies showing distinct policy representations |
|
(Grover et al.) from all developed models. Finally, we randomly sample 8 policies as the training set and the
|
other 7 policies as the testing set. During training, we define 8 teammate types that correspond to 8 policies in |
|
the training set for the type-based approaches. Then, algorithms would develop their models according to the |
|
interactive experience with training teammates. For all algorithms, agents are trained for 4.5 million time steps. |
|
The number of captured preys when the ad hoc agent cooperates with testing teammates throughout training is |
|
reported. See Supplementary for further settings. |
|
|
|
**Results.** The top panel of Fig. 3 reports the results across 3 scenarios. We first observe that ODITS achieves
|
superior results on the number of captured preys across a varying number of teammates, verifying its effectiveness. |
|
ODITS also tends to give more consistent results than other methods across different difficulties. The other 3 |
|
type-based baselines and ODITS show better results than random and combined policies, indicating that they can |
|
indeed lead to adaptive behaviors to different teammates. Furthermore, the random strategy captures some preys |
|
on 4d2y and 8d4y, but completely fails on 2d1y. This indicates that without the cooperative behaviors of the |
|
ad hoc agent, other teammates can also coordinate with each other to achieve the given goal. The combined policy |
|
shows worse results than the random policy on two scenarios (4d2y and 8d4y). This might be because the |
|
combined policy shows behaviors that conflict with other teammates. As the number of teammates increases,

the accumulating conflicts lead to serious miscoordination.
|
|
|
5.2 SAVE THE CITY |
|
|
|
**Configurations.** This is a grid world resource allocation task presented in (Iqbal et al., 2020). |
|
|
|
In this task, there are 3 distinct types of agents, and their goal is to complete |
|
the construction of all buildings on the map while preventing them from burning |
|
down. Each agent has 8 actions: stay in place, move to the next cell in one of |
|
the four cardinal directions, put out the fire, and build. We set the agents to get |
|
a team reward of 100 if they have completed a building and to be penalized by -500
|
when one building is burned down. Agent types include firefighters (20x speedup |
|
over the base rate in reducing fires), builders (20x speedup in building), or

generalists (5x speedup in both, as well as 2x speedup in moving). Buildings also
|
consist of two varieties: fast-burning and slow-burning, where the fast-burning |
|
buildings burn four times faster. In our experiments, each agent can only access |
|
the environmental information within four cells nearby. We adopt three different |
|
scenarios here to verify all methods. They are 2 agents and 2 buildings (2a2b),
|
4 agents and 3 buildings (4a3b), 6 agents and 4 buildings (6a4b). |
|
|
|
Figure 4: ODITS vs. QMIX (panels: Predator Prey 4d2y and Save the City 4a3b).

Similar to the training settings in Predator Prey, we select 15 distinct behavioral policies for the teammate set and randomly partition them into 8 training policies and 7 testing policies. For all algorithms, agents are trained for 4.5 million time steps. The number of completed buildings when the ad hoc agent cooperates with testing teammates throughout training is reported. See Supplementary for further settings.
|
|
|
**Results.** The bottom panel of Fig. 3 reports the results across 3 scenarios. We first observe that ODITS |
|
outperforms other baselines, verifying its effectiveness. Since the setting forces all agents in the environment to be
|
heterogeneous, the results also underpin the robustness of ODITS. Interestingly, we find that the combined policy |
|
reveals better performance than other type-based approaches. This result is not consistent with that in Predator |
|
Prey. Our intuition is that the requirement of cooperative behaviors in Save the City is less than that in Predator |
|
Prey. Actually, one agent in Save the City can complete buildings individually without the strong necessity of |
|
cooperating with other teammates’ behaviors. By contrast, one predator cannot capture prey by itself. As a result, |
|
the combined policy learns a universal and effective policy by learning from more interactive experience with |
|
different teammates, while type-based approaches fail because developing distinct cooperative behaviors leads to |
|
the instability of the ad hoc agent. This hypothesis is also empirically demonstrated in our ablations. |
|
|
|
5.3 COMPARISON WITH MARL |
|
|
|
In order to compare the performance of ODITS and MARL, we implement the commonly used algorithm |
|
QMIX (Rashid et al., 2018) as our baseline. Similar to the training procedure of ODITS, we fix one agent and |
|
train it with teammates randomly sampled from a pool consisting of 8 policies. The gradients for updating the |
|
|
|
|
|
|
|
|
teammates' policies are blocked but the mixing network is updated as in the original implementation of QMIX.
|
Figure 4 shows the comparison of ODITS and QMIX on Predator Prey 4d2y and Save the City 4a3b. In both |
|
environments, QMIX performs significantly worse than ODITS. This is not quite surprising because MARL |
|
algorithms usually assume that all the teammates are fixed. Therefore, although trained with multiple teammates, |
|
the agent under the QMIX framework does not learn to cooperate with an impromptu team. |
|
|
|
5.4 ABLATIONS
|
|
|
|
|
We perform several ablations on the Predator Prey 4d2y and Save |
|
the City 4a3b to try and determine the importance of each component of ODITS. |
|
|
|
|
|
|
|
|
|
**Adaptive Behaviors.** We first consider removing the information-
|
based loss $\mathcal{L}_{MI}$ from the overall learning objective (denoted as

**w/o info.**). Fig. 5 shows that without $\mathcal{L}_{MI}$ regularizing the informa-

tion embedded in $z^i_t$, ODITS induces worse teamwork performance.
|
This indicates that improving the mutual information between the |
|
proxy variable and the true variable indeed results in better representations of teamwork situations. We next consider how the inferred variables of teamwork situations affect the ad hoc agent's adaptive behaviors. We remove the proxy encoder and set $z^i_t$ as a fixed and randomly generated vector (denoted as **w/o infer.**). As shown in Fig. 5, conditioning on a random signal leads to a further drop in performance, indicating that irrelevant signals cannot promote the ad hoc agent to develop adaptive policies.

Figure 5: Ablations for different components (panels: Predator Prey 4d2y and Save the City 4a3b).
|
|
|
**Integrating Mechanism.** We remove the teamwork situation encoder as well as $\mathcal{L}_{MI}$ from the framework and
|
feed a vector filled with 1 into the teamwork situation decoder (labeled with **w/o integ.**). This setting prevents

ODITS from integrating the ad hoc agent's marginal utility with the information of teammates' behaviors.
|
Compared with w/o info., it brings a larger drop in teamwork performance. One intuition is that predicting the |
|
joint-action value plays an essential role in estimating the marginal utility. Suppose the integrating network has |
|
no information on other teammates’ behaviors. In that case, it cannot accurately predict the joint-action value, |
|
resulting in instability in marginal utility estimation. Despite the empirical evidence supporting this argument, |
|
however, it would be interesting to develop further theoretical insights into this training regime in future work. |
|
We finally consider the additive integration mechanism mentioned in Section 4.2 (labeled with **additive**). We

observe that although additive integration shows excellent performance in Save the City, it suffers from poor
|
performance in Predator Prey, indicating that multiplicative integration provides a more stable and effective |
|
ability to integrate information from teammates and the ad hoc agent. Interestingly, we also find that most |
|
ablations get worse results in Predator Prey than in Save the City. We believe that the different levels

of cooperative requirements in the two environments result in this phenomenon. A prey is captured only when two

nearby predators are simultaneously capturing it. By contrast, a burning building can be constructed by an

individual agent. Therefore, removing mechanisms that promote cooperative behaviors leads to worse
|
performance in Predator Prey. |
|
|
|
6 CONCLUSIONS |
|
|
|
This paper proposes a novel adaptive reinforcement learning algorithm called ODITS to address the challenging |
|
ad hoc teamwork problem. Without the need to predefine types of teammates, ODITS automatically learns |
|
compact but descriptive variables to infer how to coordinate with previously unknown teammates’ behaviors. To |
|
overcome partial observability, we introduce an information-based regularizer to estimate proxy representations
|
of learned variables from local observations. Experimental results show that ODITS obtains superior performance |
|
compared to various baselines on several complex ad hoc teamwork benchmarks. |
|
|
|
ACKNOWLEDGMENTS |
|
|
|
This research was supported by the National Research Foundation, Singapore under its AI Singapore Programme |
|
(AISG Award No: AISG-RP-2019-0013), National Satellite of Excellence in Trustworthy Software Systems |
|
(Award No: NSOETSS2019-01), and NTU. |
|
|
|
|
|
|
|
|
REFERENCES |
|
|
|
S. V. Albrecht and P. Stone. Reasoning about hypothetical agent behaviours and their parameters. In Proceedings |
|
_of the International Conference on Autonomous Agents and Multiagent Systems, 2017._ |
|
|
|
S. V. Albrecht and P. Stone. Autonomous agents modelling other agents: A comprehensive survey and open |
|
problems. Artificial Intelligence, 258:66–95, 2018. |
|
|
|
A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy. Deep variational information bottleneck. In International |
|
_Conference on Learning Representations, 2017._ |
|
|
|
N. Bard, J. N. Foerster, S. Chandar, N. Burch, M. Lanctot, H. F. Song, E. Parisotto, V. Dumoulin, S. Moitra, |
|
E. Hughes, I. Dunning, S. Mourad, H. Larochelle, M. G. Bellemare, and M. Bowling. The hanabi challenge: |
|
A new frontier for ai research. Artificial Intelligence, 280:103216, 2020. |
|
|
|
S. Barrett and P. Stone. Cooperating with unknown teammates in complex domains: A robot soccer case study |
|
of ad hoc teamwork. In Proceedings of the AAAI Conference on Artificial Intelligence, 2015.
|
|
|
R. Canaan, J. Togelius, A. Nealen, and S. Menzel. Diverse agents for ad-hoc cooperation in hanabi. arXiv |
|
_preprint arXiv:1907.03840, 2019._ |
|
|
|
S. Chen, E. Andrejczuk, Z. Cao, and J. Zhang. Aateam: Achieving the ad hoc teamwork by employing the |
|
attention mechanism. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7095–7102, 2020.
|
|
|
I. Durugkar, E. Liebman, and P. Stone. Balancing individual preferences and shared objectives in multiagent |
|
reinforcement learning. In Proceedings of the International Joint Conference on Artificial Intelligence, 2020. |
|
|
|
J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson. Counterfactual multi-agent policy gradients. |
|
In Proceedings of the AAAI Conference on Artificial Intelligence, 2017.
|
|
|
K. Genter, N. Agmon, and P. Stone. Role-based ad hoc teamwork. In Workshops at the AAAI Conference on
|
_Artificial Intelligence, 2011._ |
|
|
|
A. Grover, M. Al-Shedivat, J. K. Gupta, Y. Burda, and H. Edwards. Learning policy representations in multiagent |
|
systems. In Proceedings of the International Conference on Machine Learning, pages 1802–1811.
|
|
|
D. Ha, A. Dai, and Q. V. Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016. |
|
|
|
M. Hausknecht and P. Stone. Deep recurrent q-learning for partially observable mdps. In Proceedings of the |
|
_AAAI Conference on Artificial Intelligence, 2015._ |
|
|
|
H. He, J. Boyd-Graber, K. Kwok, and H. Daumé. Opponent modeling in deep reinforcement learning. In |
|
_Proceedings of the International Conference on Machine Learning, pages 1804–1813, 2016._ |
|
|
|
S. Iqbal, C. A. S. de Witt, B. Peng, W. Böhmer, S. Whiteson, and F. Sha. Randomized entity-wise factorization |
|
for multi-agent reinforcement learning. arXiv preprint arXiv:2006.04222, 2020. |
|
|
|
A. Mahajan, T. Rashid, M. Samvelyan, and S. Whiteson. Maven: Multi-agent variational exploration. In |
|
_Advances in Neural Information Processing Systems, pages 7613–7624, 2019._ |
|
|
|
R. Mirsky, W. Macke, A. Wang, H. Yedidsion, and P. Stone. A penny for your thoughts: The value of |
|
communication in ad hoc teamwork. In Proceedings of the International Joint Conference on Artificial |
|
_Intelligence, 2020._ |
|
|
|
V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous |
|
methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, |
|
pages 1928–1937, 2016. |
|
|
|
|
|
|
P. Muller, S. Omidshafiei, M. Rowland, K. Tuyls, J. Perolat, S. Liu, D. Hennes, L. Marris, M. Lanctot, E. Hughes, |
|
Z. Wang, G. Lever, N. Heess, T. Graepel, and R. Munos. A generalized training approach for multiagent |
|
learning. In International Conference on Learning Representations, 2020. |
|
|
|
F. A. Oliehoek, M. T. Spaan, N. Vlassis, and S. Whiteson. Exploiting locality of interaction in factored decpomdps. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, |
|
pages 517–524, 2008. |
|
|
|
G. Papoudakis and S. V. Albrecht. Variational autoencoders for opponent modeling in multi-agent systems. In |
|
_AAAI Workshop on Reinforcement Learning in Games, 2020._ |
|
|
|
M. A. Rahman, N. Hopner, F. Christianos, and S. V. Albrecht. Towards open ad hoc teamwork using graph-based |
|
policy learning. In Proceedings of the International Conference on Machine Learning, pages 8776–8786, |
|
2021. |
|
|
|
R. Raileanu, E. Denton, A. Szlam, and R. Fergus. Modeling others using oneself in multi-agent reinforcement |
|
learning. In Proceedings of the International Conference on Machine Learning, pages 4257–4266. |
|
|
|
R. Raileanu, M. Goldstein, A. Szlam, and R. Fergus. Fast adaptation to new environments via policy-dynamics |
|
value functions. In Proceedings of the International Conference on Machine Learning, pages 7920–7931, |
|
2020. |
|
|
|
T. Rashid, M. Samvelyan, C. S. de Witt, G. Farquhar, J. Foerster, and S. Whiteson. Qmix: Monotonic value |
|
function factorisation for deep multi-agent reinforcement learning. In Proceedings of the International |
|
_Conference on Machine Learning, pages 1228–1236, 2018._ |
|
|
|
M. Ravula, S. Alkobi, and P. Stone. Ad hoc teamwork with behavior switching agents. In Proceedings of the |
|
_International Joint Conference on Artificial Intelligence, 2019._ |
|
|
|
M. Samvelyan, T. Rashid, C. S. de Witt, G. Farquhar, N. Nardelli, T. G. Rudner, C.-M. Hung, P. H. Torr, |
|
J. Foerster, and S. Whiteson. The starcraft multi-agent challenge. arXiv preprint arXiv:1902.04043, 2019. |
|
|
|
P. Stone, G. A. Kaminka, S. Kraus, and J. S. Rosenschein. Ad hoc autonomous agent teams: Collaboration |
|
without pre-coordination. In Proceedings of the AAAI Conference on Artificial Intelligence, 2010. |
|
|
|
P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. F. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. |
|
Leibo, K. Tuyls, et al. Value-decomposition networks for cooperative multi-agent learning based on team |
|
reward. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, pages |
|
2085–2087, 2018. |
|
|
|
M. Suzuki, K. Nakayama, and Y. Matsuo. Joint multimodal learning with deep generative models. arXiv preprint |
|
_arXiv:1611.01891, 2016._ |
|
|
|
A. Tacchetti, H. F. Song, P. A. M. Mediano, V. Zambaldi, J. Kramár, N. C. Rabinowitz, T. Graepel, M. Botvinick, |
|
and P. W. Battaglia. Relational forward models for multi-agent learning. In International Conference on |
|
_Learning Representations, 2019._ |
|
|
|
M. Tan. Multi-agent reinforcement learning independent vs. cooperative agents. In Proceedings of the Interna_tional Conference on Machine Learning, 1993._ |
|
|
|
T. Wang, H. Dong, V. Lesser, and C. Zhang. Roma: Multi-agent reinforcement learning with emergent roles. In |
|
_Proceedings of the International Conference on Machine Learning, 2020._ |
|
|
|
T. Wang, T. Gupta, A. Mahajan, B. Peng, S. Whiteson, and C. Zhang. Rode: Learning roles to decompose |
|
multi-agent tasks. International Conference on Learning Representations, 2021. |
|
|
|
G. N. Yannakakis. Game ai revisited. In Proceedings of the 9th conference on Computing Frontiers, pages |
|
285–292, 2012. |
|
|
|
H. Yin, F. Melo, A. Billard, and A. Paiva. Associate latent encodings in learning from demonstrations. In |
|
_Proceedings of the AAAI Conference on Artificial Intelligence, 2017._ |
|
|
|
L. Zintgraf, S. Devlin, K. Ciosek, S. Whiteson, and K. Hofmann. Deep interactive bayesian reinforcement |
|
learning via meta-learning. arXiv preprint arXiv:2101.03864, 2021. |
|
|
|
|
|
|
A APPENDIX |
|
|
|
A.1 MATHEMATICAL DERIVATION |
|
|
|
A.1.1 ADDITIVE INTEGRATION VS. MULTIPLICATIVE INTEGRATION IN INTEGRATING NETWORK. |
|
|
|
Concretely, let us consider a simple example in which $G$ is a one-layer network. The information aggregation of

additive integration (abbreviated as $G_{add}$) could be written as follows:
|
|
|
$$G_{add} = \mathcal{F}(W_u u^i + W_c c^i + b) \tag{10}$$

where $W_u$ and $W_c$ are the weight vectors of $u^i$ and $c^i$, respectively; $b$ is the bias vector; $\mathcal{F}$ is the corresponding activation function.
|
|
|
By comparison, the multiplicative integration (abbreviated as $G_{mul}$) could be written as follows:

$$G_{mul} = \mathcal{F}(W_c(c^i) W_u u^i + b(c^i)) \tag{11}$$

where $W_c(c^i)$ and $b(c^i)$ are functions of $c^i$ that generate the weight and bias vectors.
|
|
|
Compared with $G_{add}$, $G_{mul}$ changes the integrating network from first order to second order, while introducing no extra parameters. The information of the teamwork situation directly scales the information of the ad hoc agent through the term $W_c(c^i) W_u u^i$, rather than via a more subtle bias $W_u u^i + W_c c^i$.
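For concreteness, the two forms can be written as the following minimal PyTorch modules; dimensions and the ReLU activation are illustrative assumptions. $G_{mul}$ lets $c^i$ generate a gate and bias that act directly on the utility term, whereas $G_{add}$ only sums the two streams.

```python
import torch
import torch.nn as nn

class AdditiveIntegration(nn.Module):
    """G_add of Eq. (10): u^i and c^i only interact through a sum before the activation."""
    def __init__(self, latent_dim, hidden=32):
        super().__init__()
        self.w_u = nn.Linear(1, hidden, bias=False)   # W_u
        self.w_c = nn.Linear(latent_dim, hidden)      # W_c and bias b

    def forward(self, u, c):
        return torch.relu(self.w_u(u) + self.w_c(c))


class MultiplicativeIntegration(nn.Module):
    """G_mul of Eq. (11): c^i generates a gate W_c(c^i) and bias b(c^i) that scale W_u u^i."""
    def __init__(self, latent_dim, hidden=32):
        super().__init__()
        self.w_u = nn.Linear(1, hidden, bias=False)   # W_u
        self.gate = nn.Linear(latent_dim, hidden)     # W_c(c^i)
        self.bias = nn.Linear(latent_dim, hidden)     # b(c^i)

    def forward(self, u, c):
        return torch.relu(self.gate(c) * self.w_u(u) + self.bias(c))
```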
|
|
|
Further, $G_{mul}$ also brings about advantages over $G_{add}$ as it alters the integrating network's gradient properties. In the additive integration, the gradient $\frac{\partial G_{add}}{\partial c^i}$ can be computed as follows:

$$\frac{\partial G_{add}}{\partial c^i} = W_c \mathcal{F}' \tag{12}$$

where $\mathcal{F}' = \mathcal{F}'(W_u u^i + W_c c^i + b)$. This equation reveals that the gradient heavily depends on the matrix $W_c$, but $W_u$ and $u^i$ play a limited role: they only come in through the derivative $\mathcal{F}'$, mixed with $W_u u^i$. By comparison, $\frac{\partial G_{mul}}{\partial c^i}$ is as follows:

$$\frac{\partial G_{mul}}{\partial c^i} = \left[W_u u^i W_c' + b'\right]\mathcal{F}' \tag{13}$$

where $W_c' = W_c'(c^i)$ and $b' = b'(c^i)$. In this equation, $W_u u^i$ is directly involved in the gradient computation by gating $W_c'$, hence it is more capable of altering the updates of the learning procedure. This naturally leads to better information integration.
|
|
|
A.1.2 MUTUAL INFORMATION LOSS FUNCTION $\mathcal{L}_{MI}$
|
|
|
To force the proxy representation of the learned latent variable to be informationally consistent with the true
|
latent variable, we propose to maximize the mutual information between the proxy representations and the latent |
|
variables. In this paper, we introduce a posterior estimator and derive a tractable lower bound of the mutual |
|
information term: |
|
|
|
|
|
_t[|][c][i]t[, b][i]t[)]_ |
|
|
|
_I(zt[i][;][ c][i]t[|][b][i]t[) =][ E]zt[i][,c][i]t[,b][i]t_ log _[p][(]p[z]([i]zt[i][|][b][i]t[)]_ |
|
|
|
|
|
|
|
_t[|][c][i]t[, b][i]t[)]_ |
|
|
|
= Ezti[,c][i]t[,b][i]t log _[q][ξ]p[(][z](z[i]_ _t[i][|][b][i]t[)]_ |
|
|
|
|
|
|
|
+ Ecit[,b][i]t [[][D][KL][(][p][(][z]t[i][|][c][i]t[, b][i]t[)][||][q][xi][(][z]t[i][|][c][i]t[, b][i]t[))]] |
|
|
|
_t[|][c][i]t[, b][i]t[)]_ |
|
|
|
_≥_ Ezti[,c][i]t[,b][i]t log _[q][ξ]p[(][z](z[i]_ _t[i][|][b][i]t[)]_ |
|
|
|
|
|
|
|
where the last inequality holds because of the non-negativity of the KL divergence. Then it follows that: |
|
|
|
12 |
|
|
|
|
|
(14) |
|
|
|
|
|
----- |
|
|
|
$$
\begin{aligned}
\mathbb{E}_{z^i_t, c^i_t, b^i_t}\left[\log \frac{q_\xi(z^i_t \mid c^i_t, b^i_t)}{p(z^i_t \mid b^i_t)}\right]
&= \mathbb{E}_{z^i_t, c^i_t, b^i_t}\left[\log q_\xi(z^i_t \mid c^i_t, b^i_t)\right] - \mathbb{E}_{z^i_t, b^i_t}\left[\log p(z^i_t \mid b^i_t)\right] \\
&= \mathbb{E}_{z^i_t, c^i_t, b^i_t}\left[\log q_\xi(z^i_t \mid c^i_t, b^i_t)\right] + \mathbb{E}_{b^i_t}\left[H(z^i_t \mid b^i_t)\right] \\
&= \mathbb{E}_{c^i_t, b^i_t}\left[\int p(z^i_t \mid c^i_t, b^i_t) \log q_\xi(z^i_t \mid c^i_t, b^i_t)\, dz^i_t\right] + \mathbb{E}_{b^i_t}\left[H(z^i_t \mid b^i_t)\right]
\end{aligned} \tag{15}
$$
|
|
|
|
|
The proxy encoder is conditioned on the transition data. Given the transitions, the distribution of the proxy representations $p(z^i_t)$ is independent of the local histories. Thus, we have

$$I(z^i_t; c^i_t \mid b^i_t) \ge -\mathbb{E}_{c^i_t, b^i_t}\left[CE\left[p(z^i_t \mid c^i_t, b^i_t)\,\|\,q_\xi(z^i_t \mid c^i_t, b^i_t)\right]\right] + \mathbb{E}_{b^i_t}\left[H(z^i_t \mid b^i_t)\right] \tag{16}$$

where $CE[\cdot\|\cdot]$ denotes the cross entropy and $H(\cdot)$ the entropy. In practice, we sample data from the replay buffer $\mathcal{D}$ and minimize

$$
\begin{aligned}
\mathcal{L}_{MI}(\theta_{f^*}, \xi) &= \mathbb{E}_{(c^i_t, b^i_t)\sim\mathcal{D}}\left[CE\left[p(z^i_t \mid c^i_t, b^i_t)\,\|\,q_\xi(z^i_t \mid c^i_t, b^i_t)\right]\right] - \mathbb{E}_{b^i_t}\left[H(z^i_t \mid b^i_t)\right] \\
&= \mathbb{E}_{(b^i_t, s_t, a^{-i}_t)\sim\mathcal{D}}\left[D_{KL}\left[p(z^i_t \mid b^i_t)\,\|\,q_\xi(z^i_t \mid c^i_t, b^i_t)\right]\right]
\end{aligned} \tag{17}
$$
|
|
|
A.2 ARCHITECTURE, HYPERPARAMETERS, AND INFRASTRUCTURE |
|
|
|
A.2.1 ODITS |
|
|
|
Details of the neural network architectures used by ODITS in all environments are provided in Fig. 7. We control the scale of the loss function by controlling the optimization procedure. It is conducted using RMSprop with a learning rate of $5 \times 10^{-4}$, $\alpha$ of 0.99, and with no momentum or weight decay. For the lambda value, we search over $\{1e{-}5, 1e{-}4, 5e{-}4, 1e{-}3, 5e{-}3, 1e{-}2\}$ and finally adopt $\lambda_{MI} = 1e{-}3$, $\lambda_{MI} = 1e{-}3$, and $\lambda_{MI} = 5e{-}4$ for Modified Coin Game, Predator Prey, and Save the City, respectively, since they induce the best performance compared with other values. For the dimension of the latent variables $z^i_t$ and $c^i_t$, we search over $\{1, 2, 3, 5, 10\}$ and finally adopt $|z| = 10$ in Save the City and $|z| = 1$ in the other environments. In addition, we set $|c| = |z|$. For exploration, we use $\epsilon$-greedy with $\epsilon$ annealed from 1.0 to 0.05 over 50,000 time steps and kept constant for the rest of the training. Batches of 128 episodes are sampled from the replay buffer, and all components in the framework are trained together in an end-to-end fashion.

Figure 6: Performance comparison of two algorithms on the Modified Coin Game.
|
A.2.2 BASELINES |
|
|
|
We compare ODITS with 5 baselines. For the random strategy, we force the ad hoc agent to choose its action at |
|
each time step randomly. |
|
|
|
For the combined strategy, a three-layer DQN is trained to get the policy of the ad hoc agent. Details of DQN’s |
|
architecture are illustrated in Fig. 7. We sample a type of training teammates at each training episode and collect |
|
the interacting experience data into the replay buffer. The optimization procedure and the exploration scheme |
|
are the same as those of ODITS. We set the batch size of samples as 128 and the target update interval as 200 |
|
episodes. |
|
|
|
For the other three type-based baselines, we assume that each training policy of teammates corresponds to a |
|
predefined type of teammates. So, the number of predefined types equals the number of training policies.

We construct a set of three-layer DQNs to learn the policies for all training types. Each DQN only learns from the

corresponding teammates' interaction experience. Training settings for these DQNs are the same as those
|
used in the combined strategy. Furthermore, to apply these baselines in partially observable environments, we |
|
replace the state information used in them with partial observations. For PLASTIC (Barrett and Stone, 2015), |
|
|
|
|
|
|
[Figure 7 depicts the layer configurations of the teamwork situation encoder, proxy encoder, teamwork situation decoder, proxy decoder, integrating network, marginal utility network, variational posterior estimator, the DQN baseline, and the CPD network used by ConvCPD: the encoders use FC layers (128 units, then $2\times|c|$ or $2\times|z|$ units) with BatchNormalization and LeakyReLU; the marginal utility network uses FC and GRU layers of 64 units with an output layer of $|A|$ units whose parameters are generated by the proxy decoder; the DQN baseline uses two FC layers of 64 units with ReLU and an FC output of $|A|$ units; the CPD network uses a $40\times3\times3$ convolution, $2\times2$ max-pooling, FC layers, and a Softmax output.]

Figure 7: Architecture details of ODITS and baselines.
|
|
|
For PLASTIC (Barrett and Stone, 2015), we set the parameter η used in the UpdateBelief function to 0.2. For ConvCPD (Ravula et al., 2019), we follow the implementation details mentioned in the original paper. We construct the architecture of the Change Point Detection (CPD) Network as illustrated in Fig. 7, where ntp is the number of predefined types, nt is the number of time steps considered by ConvCPD, and nclasses = ntp × (ntp − 1) + 1 is the number of classes of change points. The ConvCPD network is trained with 1000 × nclasses samples, in which each class has 1000 samples (batch size = 64, learning rate = 0.01, decay = 0.1, optimizer = SGD). For AATEAM (Chen et al., 2020), we
|
follow its proposed implementation details. The number of hidden layers for GRU is 5. The maximum length of |
|
a state sequence that attention networks process is 60. The rate of dropout is 0.1. During training, the default |
|
learning rate is 0.01, and the batch size is 32. |
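To make the role of η concrete, here is a hedged sketch of a polynomial-weights style belief update in the spirit of PLASTIC's UpdateBelief; the exact loss definition in the original algorithm may differ, and `action_likelihoods` is an assumed interface giving each type model's probability for the teammate's observed action.

```python
# Hedged sketch of a PLASTIC-style belief update with eta = 0.2: beliefs shrink in
# proportion to each type model's prediction loss, then are renormalized.
# Assumption: action_likelihoods[i] approximates P(observed teammate action | type i).
import numpy as np

def update_belief(beliefs: np.ndarray, action_likelihoods: np.ndarray, eta: float = 0.2):
    loss = 1.0 - action_likelihoods            # loss of each type model on this step
    beliefs = beliefs * (1.0 - eta * loss)     # down-weight poorly predicting types
    return beliefs / beliefs.sum()             # renormalize to a distribution

beliefs = np.full(3, 1 / 3)                    # uniform prior over 3 predefined types
print(update_belief(beliefs, np.array([0.9, 0.5, 0.1])))
```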
|
|
|
A.3 MODIFIED COIN GAME |
|
|
|
To show the difference between ODITS and type-based approaches, we introduce a simple modified coin game. The game takes place on a 7 × 7 map which contains 6 coins of 3 different colors (2 coins of each color). The team's aim is to collect only two kinds of coins (correct coins, with a reward of 100 each) and avoid collecting the third kind (false coins, with a reward of -200 each). The policies of the teammates are predefined and illustrated as ordered color pairs in Fig. 8 (left): 2 training types and 2 testing types. For example, the first training type (red → green) indicates that this teammate collects red and green coins, and that it collects red coins first. Therefore, although the correct coins of the first training type (red → green) and the second testing type (green → red) are the same, they are different policies, since their paths for collecting coins are clearly different. Each agent has five actions: move up, down, left, right, or pass. Once an agent steps on a coin, that
|
|
|
|
|
|
coin disappears from the cell. The game ends after 20 steps. To maximize the team return, the ad hoc agent must infer which coins its current teammate desires and collect as many of them as possible.
|
|
|
Here, we adopt one state-of-the-art type-based approach (AATEAM (Chen et al., 2020)) as the baseline. Fig. 6 shows the testing performance. We observe that ODITS shows superior performance and converges quickly, while AATEAM shows an unstable curve. We believe this discrepancy results from the key difference between our method and type-based approaches: the baseline struggles to cooperate with new types of teammates. For example, when the baseline agent interacts with the teammate of the second testing type (green → red) and observes that the teammate is collecting green coins in the early stage, it switches its own policy to the one corresponding to the second training type of teammate (green → blue), so it collects green coins and blue coins (false coins) simultaneously, leading to poor teamwork performance. By contrast, ODITS generalizes easily to the testing types of teammates. During training, ODITS learns how to cooperate with the teammate according to its current behavior instead of its type. If it observes that its teammate is collecting one kind of coin, it will collect the same kind of coin, and this knowledge is automatically embedded in c and z.
|
|
|
A.4 DETAILS OF EXPERIMENTS |
|
|
|
Figure 8: Illustration of Modified Coin Game (left: the ad hoc agent, its teammate, coins, and the policies of the training and testing types), Predator Prey (middle: predators, prey, obstacles, observation range, and successful versus failed capturing), and Save the City (right: firefighters, builders, generalists, fast- and slow-burning buildings, and completion progress).
|
|
|
**Modified Coin Game.** In this environment, the ad hoc agent needs to cooperate with its teammate to collect target coins, which depend on the current teammate's type. The teammates' types are illustrated as two ordered colors of coins in Fig. 8. The teammate follows a heuristic strategy and has complete visibility of the entire environmental state. It first moves along the axis for which it has the smallest distance to the coin of its first color. If it collides with obstacles or boundaries, it first chooses random actions for 3 steps and then moves toward the target coin again. If there is no coin of its first color left, it moves to the coin of its second color following the same strategy. The ad hoc agent can only access the information within 3 grids of itself. We use one-hot encoding to represent the different entities in each grid and concatenate them into a vector to construct the observation of the ad hoc agent. The game ends after 20 steps and takes place on a 7 × 7 grid containing 6 coins of 3 different colors (2 coins of each color). Each agent has five actions: move up, down, left, right, or pass. Once an agent steps on a coin, that coin disappears from the grid. This game requires agents to collect as many target coins as possible, as indicated by the teammate's type. Each correctly collected coin gives a reward of 100, while each false coin gives a reward of -200. As a result, the ad hoc agent needs to infer its teammate's goal from the teammate's behavior and move to the right coins while avoiding the false ones.
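The teammate heuristic above can be summarized by the following simplified sketch; collision handling and the 3-step random recovery are omitted, and all names and the coordinate convention are illustrative.

```python
# Simplified sketch of the heuristic teammate: move axis-wise toward the nearest coin
# of the preferred color, falling back to the second color when the first runs out.
def teammate_action(agent_xy, coins_by_color, color_order):
    """coins_by_color maps color -> list of (x, y); color_order is e.g. ['red', 'green']."""
    for color in color_order:                      # prefer the first color, then the second
        coins = coins_by_color.get(color, [])
        if coins:
            tx, ty = min(coins, key=lambda c: abs(c[0] - agent_xy[0]) + abs(c[1] - agent_xy[1]))
            dx, dy = tx - agent_xy[0], ty - agent_xy[1]
            if dx == 0 and dy == 0:
                return "pass"
            # move along the axis with the smaller remaining distance first
            if 0 < abs(dx) <= abs(dy) or dy == 0:
                return "right" if dx > 0 else "left"
            return "down" if dy > 0 else "up"
    return "pass"                                  # no target coins left
```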
|
|
|
**Predator Prey.** In this environment, m homogeneous predators try to capture n randomly moving prey in a 7 × 7 grid world. Each predator has six actions: the four moving actions, the capturing action, and waiting at a grid. Besides, there are two obstacles at random locations. Due to partial observability, each predator can only access the environmental information within two grids of itself. The information of each grid is embedded into a one-hot vector representing the different entities: obstacles, blank grids, predators, and prey. Episodes are 40 steps long. The predators get a team reward of 500 if two or more of them capture the same prey simultaneously, and they are penalized by -10 if only one of them tries to capture a prey. Here, we adopt three different scenarios to verify the ad hoc agent's ability to cooperate with different numbers of teammates: 2 predators and 1 prey (2d1y), 4 predators and 2 prey (4d2y), and 8 predators and 4 prey (8d4y).
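The capture reward rule can be summarized by the short sketch below; `capture_attempts` is an assumed bookkeeping structure (prey id mapped to the number of predators currently capturing it), not part of the original environment code.

```python
# Hedged sketch of the team reward: +500 when at least two predators capture the same
# prey simultaneously, -10 when a single predator attempts a capture alone.
def team_reward(capture_attempts: dict) -> int:
    reward = 0
    for n_predators in capture_attempts.values():
        if n_predators >= 2:
            reward += 500        # successful cooperative capture
        elif n_predators == 1:
            reward -= 10         # failed solo capture attempt
    return reward

print(team_reward({0: 2, 1: 1}))  # 500 - 10 = 490
```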
|
|
|
In this environment, to simulate complex cooperative behaviors, we utilize 5 algorithms (VDN (Sunehag et al., |
|
2018), MAVEN (Mahajan et al., 2019), ROMA (Wang et al., 2020), QMIX (Rashid et al., 2018), IQL (Tan, |
|
|
|
|
|
|
Figure 9: t-SNE plots of the learned teammates' policy representations for Predator Prey (top: 2d1y, 4d2y, 8d4y) and Save the City (bottom: 2a2b, 4a3b, 6a4b); the legend distinguishes training, testing, and unselected policies. For each scenario, we first develop 60 candidate policies by using existing MARL open-source implementations. Then, we train the self-supervised policy encoder of (Raileanu et al., 2020) on the collected behavioral trajectories {(o_t^i, a_t^i)} of all candidates to represent their policies. Finally, we select 15 policies whose average policy embeddings are distinct from each other, and split them into 8 training policies and 7 testing policies for each scenario.
|
|
|
|
|
1993)) to train the candidate policies of teammates using their open-source implementations based on PyMARL |
|
(Samvelyan et al., 2019). We use 4 different random seeds for each algorithm and save the corresponding models |
|
at 3 different training steps (3 million, 4 million, and 5 million). As a result, we get 60 candidate teammates’ |
|
policies for each scenario. To ensure diversity of cooperative behaviors, we visualize policy representations |
|
of all candidate models by using the open-source implementation mentioned in (Raileanu et al., 2020). This |
|
mechanism encodes the agent's behavioral trajectories {(o_t^i, a_t^i)} into an embedding space, in which different
|
policies generate different representations. Then, we manually select 15 distinct policies and randomly split them |
|
into 8 training policies and 7 testing policies for each scenario. We illustrate the policy representations in Fig. 9. |
|
During training or testing, in each episode, we sample one policy from the policy set and equip a team of agents |
|
with the policy. Then, we replace one random agent in the team with the ad hoc agent to construct an ad hoc |
|
team. |
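The per-episode team construction described above roughly corresponds to the following sketch; the function and variable names are illustrative rather than taken from the authors' code.

```python
# Minimal sketch of ad hoc team assembly: sample one teammate policy per episode,
# equip the whole team with it, then swap one random slot for the ad hoc agent.
import random

def build_ad_hoc_team(policy_set, team_size, ad_hoc_agent):
    policy = random.choice(policy_set)                # one teammate policy per episode
    team = [policy] * team_size                       # every teammate runs this policy
    team[random.randrange(team_size)] = ad_hoc_agent  # replace a random agent
    return team
```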
|
|
|
**Save the City.** This is a 14 × 14 grid-world resource allocation task presented in (Iqbal et al., 2020). In this task, there are 3 distinct types of agents, and their goal is to complete the construction of all buildings on the map while preventing them from burning down. Each agent has 8 actions: stay in place, move to the next grid in one of the four cardinal directions, put out the fire, and build. The agents receive a team reward of 100 when they complete a building and are penalized by -500 when a building burns down. Agent types include firefighters (20x speedup over the base rate in reducing fires), builders (20x speedup in building), and generalists (5x speedup in both, as well as a 2x speedup in moving). Buildings also come in two varieties:
|
|
|
|
|
|
fast-burning and slow-burning, where the fast-burning buildings burn four times faster. In our experiments, each agent can only access the environmental information within four grids of itself. The observation of each agent contains the information (type, position, and completion progress of the building) of each entity in the environment. If an entity is not in the sight range of the agent, the observation vector of this entity is filled with 0. We adopt three different scenarios here to verify all methods: 2 agents and 2 buildings (2a2b), 4 agents and 3 buildings (4a3b), and 6 agents and 4 buildings (6a4b). Similar to Predator Prey, we also manually selected 15 teammates' policies for each scenario; their policy representations are illustrated in Fig. 9. Furthermore, since each team has a random combination of agent types, teammates' behaviors in this environment show a greater diversity. In addition, all buildings' types are randomly assigned in each episode.
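The observation layout described above could be assembled as in the following sketch; the exact feature ordering, the square sight range, and the entity record format are assumptions for illustration.

```python
# Hedged sketch of the Save the City observation: one fixed-length block per entity
# (type one-hot, relative position, completion progress), zeroed when out of range.
import numpy as np

def build_observation(agent_xy, entities, n_types, sight_range=4):
    """entities: list of dicts with keys 'type', 'xy', 'progress' (illustrative format)."""
    blocks = []
    for e in entities:
        block = np.zeros(n_types + 3)              # type one-hot + (dx, dy) + progress
        dx, dy = e["xy"][0] - agent_xy[0], e["xy"][1] - agent_xy[1]
        if max(abs(dx), abs(dy)) <= sight_range:   # visible (square sight range assumed)
            block[e["type"]] = 1.0
            block[n_types:n_types + 2] = (dx, dy)
            block[n_types + 2] = e["progress"]
        blocks.append(block)                       # invisible entities stay all-zero
    return np.concatenate(blocks)
```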
|
|
|
B LIMITATIONS AND FUTURE WORKS |
|
|
|
In this paper, the experiments were carried out with relatively small teams (2-8 agents) and with agent behavior largely coming from pre-trained RL policies. In this case, the diversity of teammates' policies can easily be verified by visualizing the policy representations in latent space. However, as team size increases, there might be more complex and diverse teamwork situations, which places higher demands on training speed and on the scalability of the architecture. In addition, the diversity of the policy set used for training has considerable influence on performance, yet the characterization and quantification of this influence remain unexplored. We leave these issues for future work.
|
|
|