|
# ONLINE AD HOC TEAMWORK UNDER PARTIAL OBSERVABILITY |
|
|
|
**Pengjie Gu[1], Mengchen Zhao[2,∗], Jianye Hao[3,2], Bo An[1]**
|
|
|
School of Computer Science and Engineering, Nanyang Technological University, Singapore[1] |
|
|
|
Noah’s Ark Lab, Huawei[2] |
|
|
|
College of Intelligence and Computing, Tianjin University[3] |
|
{pengjie.gu, boan}@ntu.edu.sg, {zhaomengchen, haojianye}@huawei.com |
|
|
|
ABSTRACT |
|
|
|
Autonomous agents often need to work together as a team to accomplish complex cooperative |
|
tasks. Due to privacy and other realistic constraints, agents might need to collaborate with |
|
previously unknown teammates on the fly. This problem is known as ad hoc teamwork, which |
|
remains a core research challenge. Prior works usually rely heavily on strong assumptions such as full observability and fixed, predefined teammate types. This paper relaxes these assumptions
|
with a novel reinforcement learning framework called ODITS, which allows the autonomous |
|
agent to adapt to arbitrary teammates in an online fashion. Instead of limiting teammates to
|
a finite set of predefined types, ODITS automatically learns latent variables of teammates’ |
|
behaviors to infer how to cooperate with new teammates effectively. To overcome partial |
|
observability, we introduce an information-based regularizer to derive proxy representations of |
|
the learned variables from local observations. Extensive experimental results show that ODITS |
|
significantly outperforms various baselines in widely used ad hoc teamwork tasks. |
|
|
|
1 INTRODUCTION |
|
|
|
Recently, autonomous agents including robots and software agents are being widely deployed in different
|
environments. In many tasks, they are increasingly required to cooperate with other unknown teammates on
|
the fly. For example, in case of search and rescue tasks in a disaster, due to privacy or lack of time, deployed |
|
robots need to interact with robots from other companies or laboratories, whose coordination protocols might not |
|
be explicitly provided in advance (Barrett and Stone, 2015). Besides, in the domain of game AI (Yannakakis, |
|
2012), virtual agents are required to assist different agents controlled by human players. To effectively complete |
|
these tasks, autonomous agents must show high adaptation ability to collaborate with intrinsically diverse and |
|
unknown teammates. This problem is known in the literature as ad hoc teamwork (Stone et al., 2010). |
|
|
|
Existing approaches on ad hoc teamwork usually assume that all teammates’ behaviors are categorized into |
|
several predefined and fixed types, which correspond to different coordination strategies (Barrett and Stone,
|
2015; Durugkar et al., 2020; Mirsky et al., 2020). Then, by reasoning over the type of interacting teammates, the |
|
agent switches its behavior to the corresponding policy. If the types are correctly recognized and the strategies are |
|
effective, the agent would accomplish the given cooperation task well. However, defining sufficiently descriptive |
|
types of teammates requires prior domain knowledge, especially in uncertain and complex environments. For |
|
example, in human-AI collaboration in Hanabi (Bard et al., 2020), there is often a wide variety of cooperative behaviors shown by human players. It is challenging for predefined types to cover all possible human players'
|
behaviors. Further, teammates’ strategies might be rapidly evolving throughout the entire teamwork. If the agent |
|
assumes that teammates’ behavioral types are static and cannot adapt to current teammates’ behaviors in an |
|
*online* fashion, teamwork would suffer from serious miscoordination (Ravula et al., 2019; Chen et al., 2020).
|
Rescue and search tasks are an essential class of such examples (Ravula et al., 2019). On the other hand, existing |
|
techniques (Barrett and Stone, 2015; Albrecht and Stone, 2017; Chen et al., 2020; Ravula et al., 2019) try to |
|
utilize Bayesian posteriors over teammate types to obtain optimal responses. To effectively compute posteriors, |
|
they usually assume that the agent could always know other teammates’ observations and actions. However, this |
|
|
|
_∗Corresponding author_ |
|
|
|
|
|
|
|
|
assumption is unrealistic in partially observable environments, in which each agent is not aware of other agents'
|
observations. |
|
|
|
To address the issues mentioned above, this paper introduces an adaptive reinforcement learning framework called |
|
**Online aDaptation via Inferred Teamwork Situations (ODITS)**. Our key insight is that teamwork performance
|
is jointly affected by the autonomous agent and other teammates’ behaviors. So, the agent’s optimal behavior |
|
depends on the current teamwork situation, which indicates the influence on the environmental dynamics caused |
|
by other teammates. If the agent identifies the current teamwork situation in an online fashion, it can choose |
|
actions accordingly to ensure effective coordination. To this end, we introduce a multimodal representation
|
learning framework (Suzuki et al., 2016; Yin et al., 2017). It automatically encodes the core knowledge about |
|
the teamwork situations into a latent probabilistic variable. We show that without any prior knowledge, after |
|
learning from the interactive experience with given teammates, the latent variable is sufficiently descriptive to |
|
provide information about how to coordinate with new teammates’ behaviors. To overcome partial observability, |
|
we propose an information-based proxy encoder to implicitly infer the learned variables from local observations. |
|
Then, the autonomous agent adapts to new teammates’ behaviors dynamically and quickly by conditioning its |
|
policy on the inferred variables. |
|
|
|
Instead of limiting teammates to several predefined and fixed types, ODITS considers a mechanism of how
|
an agent should adapt to teammates’ behaviors online. It automatically learns continuous representations of |
|
teammates’ behaviors to infer how to coordinate with current teammates’ actions effectively. Without domain |
|
knowledge on current environments, it enables effective ad hoc teamwork performance and fast adaptation to |
|
varying teammates, which the agent might not thoroughly observe under partial observability. In our experimental |
|
evaluation, by interacting with a small set of given teammates, the trained agents could robustly collaborate with |
|
diverse new teammates. Compared with various type-based baselines, ODITS achieves superior ad hoc teamwork
|
performance. Moreover, our ablations show both the necessity of learning latent variables of teamwork situations |
|
and inferring the proxy representations of learned variables. |
|
|
|
2 RELATED WORKS |
|
|
|
**Ad Hoc Teamwork.** The core challenge of achieving cooperative ad hoc teamwork is to develop an adaptive
|
policy robust to various unknown teammates’ behaviors (Stone et al., 2010). Existing type-based approaches try |
|
to predefine types of teammates and choose policies accordingly to cooperate with unknown teammates (Chen |
|
et al., 2020; Ravula et al., 2019; Durugkar et al., 2020; Mirsky et al., 2020; Barrett and Stone, 2015). Specifically,
|
PLASTIC (Barrett and Stone, 2015) infers types of teammates by computing Bayesian posteriors over all types. |
|
ConvCPD (Ravula et al., 2019) extends this work by introducing a mechanism to detect the change point of the |
|
current teammate’s type. AATEAM (Chen et al., 2020) proposes an attention-based architecture to infer types in |
|
real time by extracting the temporal correlations from the state history. The drawback of these approaches is that |
|
finite and fixed types might not cover all possible situations in complex environments. One recent work avoids |
|
predefining teammates’ types by leveraging graph neural networks (GNNs) to estimate the joint action value of |
|
an ad hoc team (Rahman et al., 2021). However, this work requires all teammates’ observations as input, which |
|
might not always be available in the real world. |
|
|
|
**Agent Modeling.** By modeling teammates' behaviors, agent modeling approaches aim to provide auxiliary
|
information, such as teammates’ goals or future actions, for decision-making (He et al., 2016; Albrecht and Stone, |
|
2018). For example, MeLIBA conditions the ad hoc agent’s policy on a belief over teammates, which is updated |
|
following the Bayesian rule (Zintgraf et al., 2021). However, existing agent models require the full observations |
|
of teammates as input (Raileanu et al.; Grover et al.; Tacchetti et al., 2019). If the agent cannot always observe |
|
other teammates’ information (e.g. observations and actions), those approaches would fail to give an accurate |
|
prediction about teammates' information. A recent work proposes to use a VAE to model fixed-policy opponents
|
under partial observability. However, it does not generalize to the ad hoc setting where the teammates can be |
|
non-stationary (Papoudakis and Albrecht, 2020). There are also some works studying how to generate diverse agent policies, which benefits the training of the ad hoc agent (Canaan et al., 2019).
|
|
|
**Multi-agent Reinforcement Learning (MARL).** Cooperative MARL (Foerster et al., 2017) with centralized
|
training and decentralized execution (Oliehoek et al., 2008) (CTDE) is relevant to this work. Related approaches |
|
(Sunehag et al., 2018; Rashid et al., 2018) utilize value function factorization to overcome the limitations of |
|
both joint and independent learning paradigms simultaneously. However, these algorithms assume that the |
|
developed team is fixed and closed. The team configuration (e.g., team size, team formation, and goals) is |
|
|
|
|
|
|
|
|
unchanged, and agents will not meet other agents without pre-coordination. Several extended works improve |
|
the generalization ability for complex team configurations by leveraging other insights, like learning dynamic |
|
roles (Wang et al., 2021; 2020), randomized entity-wise factorization (Iqbal et al., 2020), and training regimes based on game-theoretic principles (Muller et al., 2020). However, intrinsically, these approaches usually focus
|
on co-training a group of highly-coupled agents instead of an autonomous agent that can adapt to non-stationary |
|
teammates. |
|
|
|
3 BACKGROUND |
|
|
|
**Problem Formalization.** Our aim is to develop a single autonomous agent, which we refer to as the *ad hoc agent*, that can effectively cooperate with various teammates under partial observability without pre-coordination, such as joint training. While we focus on training a single agent in this work, similar approaches can be applied
|
to construct an ad hoc team. |
|
|
|
Figure 1: Visualization of the Dec-POMDP with an additional teammate set $\Gamma$. The ad hoc agent receives only a partial observation; unknown teammates are sampled from the teammate set; the joint action $\mathbf{a}$, formed by the ad hoc agent's action $a^i$ and the teammates' actions $a^{-i}$, is executed in the environment, which returns a shared reward $R$.
|
|
|
To evaluate the ad hoc agent’s ability to cooperate with unknown teammates, we formally define the problem as |
|
a decentralized Partially observable Markov Decision Process (Dec-POMDP) (Oliehoek et al., 2008) with an |
|
additional assumption about the set of teammates’ possible policies Γ. |
|
|
|
It can be represented as a tuple $\langle N, S, A, O, T, P, R, O, \Gamma \rangle$, where $N$ denotes the number of agents required by the task and $s \in S$ denotes the global state of the environment. The joint action $\mathbf{a} \in A^N$ is formed by all agents' independent actions $a^i \in A$, where $i$ is the index of the agent. In the environment, each agent only has access to its partial observation $o^i \in O$ drawn according to the observation function $O(s, i)$, and it has an observation-action history $\tau^i \in T \equiv (O \times A)^*$. $P(s' \mid s, \mathbf{a})$ denotes the probability that taking joint action $\mathbf{a}$ in state $s$ results in a transition to state $s'$. $R(s, \mathbf{a})$ is the reward function that maps a state $s$ and a joint action $\mathbf{a}$ to a team reward $r \in \mathbb{R}$. $\Gamma$ represents a pool of various policies, which can be pretrained or predefined to exhibit cooperative behaviors. Without loss of generality, we denote by $\pi_i$ the policy of the ad hoc agent and by $\pi_{-i}$ the joint policy of all other agents.
|
|
|
Fig. 1 shows the detailed schematics of this problem. Note that the original group size of the teammates can be arbitrary. In a Dec-POMDP with an additional teammate set $\Gamma$, the objective of the ad hoc agent is to maximize the expected team return when it teams up with $N-1$ arbitrary teammates sampled from $\Gamma$, though it has no prior knowledge about those teammates. Therefore, the ad hoc agent's optimal policy $\pi_i^*$ is required to maximize the joint action value $Q^{\pi_i}(s, a^i, a^{-i})$, which indicates the expected cumulative team reward over different ad hoc teamwork:
|
|
|
|
|
$$Q^{\pi_i}(s, a^i, a^{-i}) = \mathbb{E}_{a^i_{t=1:+\infty}\sim\pi_i,\; a^{-i}_{t=1:+\infty}\sim\pi_{-i},\; \pi_{-i}\sim\Gamma}\left[\sum_{t=0}^{+\infty}\gamma^t r_t \;\middle|\; s_0 = s,\; \mathbf{a}_0 = \mathbf{a},\; P\right] \tag{1}$$

$$Q^{\pi_i^*}(s, a^i, a^{-i}) \ge Q^{\pi_i}(s, a^i, a^{-i}), \quad \forall\, \pi_i, s, a^i, a^{-i} \tag{2}$$
|
|
|
|
|
**Marginal Utility** is defined to measure the contribution of an ad hoc agent to the whole team utility (Genter and
|
Stone, 2011). It represents the increase (or decrease) in a team’s utility when an ad hoc agent is added to the |
|
team. Given teammates’ actions a[−][i], there is a relationship between the marginal utility and the team utility |
|
|
|
|
|
----- |
|
|
|
(denoted by the joint action value) as follow: |
|
|
|
arg max (3) |
|
_a[i][ u][i][(][s, a][i][,][ a][−][i][) = arg max]a[i][ Q][π][i]_ [(][s, a][i][,][ a][−][i][)] |
|
|
|
where u[i](s, a[i], a[−][i]) denotes the marginal utility when the ad hoc agent chooses the action a[i] under the state s. |
|
Note that the marginal utility is not necessarily equal to the Q-value (Sunehag et al., 2018). The ad hoc agent |
|
chooses the action which maximizes the marginal utility to ensure the maximal team utility. |
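To make the objective in Eqs. (1)–(3) concrete, the following minimal sketch estimates the expected team return of a fixed ad hoc policy by pairing it with teammates sampled from $\Gamma$ and rolling out episodes. The environment interface (`reset`/`step`), the policy callables, and all names are illustrative assumptions rather than part of ODITS.

```python
import random

def estimate_team_return(env, ad_hoc_policy, teammate_pool, gamma=0.99,
                         n_episodes=32, max_steps=100):
    """Monte-Carlo estimate of the objective in Eq. (1): the expected
    discounted team return when the ad hoc agent teams up with teammates
    drawn from the pool Gamma (hypothetical env/policy interface)."""
    returns = []
    for _ in range(n_episodes):
        teammates = random.choice(teammate_pool)      # pi_{-i} ~ Gamma
        obs = env.reset()
        episode_return, discount = 0.0, 1.0
        for _ in range(max_steps):
            a_i = ad_hoc_policy(obs["ad_hoc"])        # a^i from pi_i
            a_minus_i = teammates(obs["teammates"])   # a^{-i} from pi_{-i}
            obs, team_reward, done = env.step(a_i, a_minus_i)
            episode_return += discount * team_reward  # accumulate gamma^t r_t
            discount *= gamma
            if done:
                break
        returns.append(episode_return)
    return sum(returns) / len(returns)
```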
|
|
|
4 ODITS LEARNING FRAMEWORK |
|
|
|
Our approach addresses the ad hoc teamwork problem |
|
with a novel probabilistic framework ODITS. In this |
|
section, we first introduce the overall architecture of

this framework and then present a detailed description
|
of all modules in this framework. |
|
|
|
4.1 OVERVIEW |
|
|
|
|
|
ODITS aims to estimate the ad hoc agent’s marginal |
|
utility for choosing corresponding actions to maximize the team utility. To achieve an adaptive policy
|
to unknown teammates, we model the marginal utility |
|
as a conditional function on the inferred latent variable, which implicitly represents the current teamwork |
|
situation. ODITS jointly optimizes the marginal utility function and the latent variable by two learning |
|
objectives in an end-to-end fashion. Fig.2 shows the |
|
detailed schematics of ODITS. It splits the team into |
|
two parts: teammates and the ad hoc agent. |
|
|
|
|
|
|
|
Figure 2: Schematics of ODITS. The diagram shows the teamwork situation encoder $f$ (taking $(s_t, a^{-i}_t)$ as input), the teamwork situation decoder $g$, and the integrating network $G$ that produces the joint action value and is trained with the Q loss; these components are only available in the training phase. On the ad hoc agent's side, the proxy encoder $f^*$ maps $b^i_t$ to $z^i_t$, the proxy decoder $g^*$ generates parameters of the marginal utility network $M$, which outputs $u^i(\tau^i_t, \cdot\,; z^i_t)$, and the MI loss aligns $z^i_t$ with $c^i_t$.
|
|
|
|
|
First, we regard other teammates as a part of environmental dynamics perceived by the ad hoc agent. Since |
|
different combinations of teammates lead to diverse and complex dynamics, we expect to learn a latent variable to |
|
describe the core information of teammates’ behaviors implicitly. To do this, we introduce a teamwork situation |
|
**encoder** $f$ to learn the variable. Then, a loss function (Q loss), an integrating network $G$, and a **teamwork situation decoder** $g$ are jointly proposed to regularize the information embedded in the learned variable $c^i_t$.

For the ad hoc agent, we expect to condition its policy on the learned variable $c^i_t$. However, the partial observability impedes the direct access to $c^i_t$. Thus, we introduce a **proxy encoder** $f^*$ to infer a proxy representation $z^i_t$ of $c^i_t$ from local observations. We force $z^i_t$ to be informationally consistent with $c^i_t$ by an information-based loss function (MI loss). Then, we train a **marginal utility network** $M$ to estimate the ad hoc agent's conditional marginal utility $\hat{u}^i(\tau^i_t, a^i_t; z^i_t) \approx u^i(s_t, a^i_t, a^{-i}_t)$. For conditional behavior, a part of the parameters of $M$ are generated by the proxy decoder $g^*$.
|
|
|
Similar to the CTDE scenario (Oliehoek et al., 2008), we relax the partial observability in the training phase. |
|
ODITS is granted access to the global state $s_t$ and other teammates' actions $a^{-i}_t$ during training. During execution, $G$, $f$, and $g$ are removed; the ad hoc agent chooses the action which maximizes the conditional marginal utility function $\hat{u}^i(\tau^i_t, a^i_t; z^i_t)$.
|
|
|
4.2 LEARNING TO REPRESENT TEAMWORK SITUATIONS |
|
|
|
For adaptive behaviors, we expect to condition the ad hoc agent’s policy on other teammates. However, unknown |
|
teammates show complex behaviors. Directly conditioning the policy on them might lead to a volatile policy. To |
|
address this issue, we aim to embed the teammates’ information into a compact but descriptive representation. |
|
To model the concept clearly, we formally define teamwork situation: |
|
|
|
**Definition 1 (teamwork situation)** *At each time step $t$, the ad hoc agent is in the teamwork situation $c^i_t \in \mathcal{C}$, which is the current underlying teamwork state yielded by the environment state $s_t$ and other teammates' actions $a^{-i}_t$. It reflects the high-level semantics about the teammates' behaviors.*
|
|
|
|
|
|
|
|
Though different teammates generate diverse state-action trajectories, we assume that they can cause similar |
|
teamwork situations at some time, and the ad hoc agent’s action would affect their transitions. When the current |
|
teamwork situation is identified, the ad hoc agent can choose the action accordingly to ensure online adaptation. |
|
|
|
**Teamwork Situation Encoder $f$.** To model the uncertainty of unknown teammates, we encode teamwork situations in a stochastic embedding space $\mathcal{C}$. Thus, any teamwork situation can be represented as a latent probabilistic variable $c^i$ that is drawn from a multivariate Gaussian distribution $\mathcal{N}(\mu_{c^i}, \sigma_{c^i})$. To enable the dependency revealed in the definition, we use a trainable neural network $f$ to learn the parameters of the Gaussian distribution of $c^i$:

$$(\mu_{c^i}, \sigma_{c^i}) = f(s, a^{-i}; \theta_f), \qquad c^i \sim \mathcal{N}(\mu_{c^i}, \sigma_{c^i}) \tag{4}$$

where $\theta_f$ are the parameters of $f$.
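As a rough illustration of Eq. (4), the sketch below implements a Gaussian encoder in PyTorch; the hidden size, the softplus parameterization of the standard deviation, and the class name are our own assumptions, not the exact architecture reported in the Supplementary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal

class TeamworkSituationEncoder(nn.Module):
    """Sketch of the encoder f in Eq. (4): maps the global state s_t and the
    teammates' joint action a^{-i}_t to the mean and standard deviation of a
    Gaussian over the latent teamwork situation c^i_t (sizes are illustrative)."""
    def __init__(self, state_dim, teammate_action_dim, latent_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + teammate_action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),
        )

    def forward(self, state, teammate_actions):
        mu, sigma_raw = self.net(torch.cat([state, teammate_actions], dim=-1)).chunk(2, dim=-1)
        sigma = F.softplus(sigma_raw) + 1e-4       # keep sigma_{c^i} strictly positive
        dist = Normal(mu, sigma)
        c = dist.rsample()                         # reparameterized sample of c^i_t
        return c, dist
```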
|
|
|
**Regularizing Information Embedded in $c^i$.** We introduce a set of modules to jointly force $c^i$ to be sufficiently descriptive for reflecting the current teamwork situation. If $c^i_t$ is able to capture the core knowledge about other teammates' current behaviors, we can predict the joint action value $Q^{\pi_i}(s_t, a^i_t, a^{-i}_t)$ according to $c^i_t$ and the ad hoc agent's marginal utility $u^i_t$¹. Thus, we propose an integrating network $G$ for generating the joint action value's estimation $G(u^i_t, c^i_t) \approx Q^{\pi_i}(s_t, a^i_t, a^{-i}_t)$. We adopt a modified asynchronous Q-learning loss function (Q loss) (Mnih et al., 2016) as the optimization objective:
|
|
|
|
|
_Q = E(uit[,c][i]t[,r][t][)][∼D]_ |
|
_L_ |
|
|
|
|
|
|
|
[(rt + γ max _G¯(u[i]t+1[, c]t[i]+1[)][ −]_ _[G][(][u]t[i][, c]t[i][)]][2]_ |
|
_a[i]t+1_ |
|
|
|
|
|
(5) |
|
|
|
|
|
where _G[¯] is a periodically updated target network. The expectation is estimated with uniform samples from the_ |
|
replay buffer D, which saves the interactive experience with training teammates. |
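A minimal sketch of how the TD objective in Eq. (5) could be computed in PyTorch is given below; the batch layout, the way the maximizing next-step utility is obtained, and all names are assumptions made for illustration.

```python
import torch

def q_loss(G, G_target, batch, gamma=0.99):
    """Squared TD error of Eq. (5).  `G` is the integrating network,
    `G_target` its periodically updated copy, and `batch` is assumed to hold
    tensors sampled from the replay buffer D."""
    q = G(batch["u_t"], batch["c_t"])                      # G(u^i_t, c^i_t)
    with torch.no_grad():
        # The max over a^i_{t+1} is assumed to be resolved by evaluating the
        # candidate utilities at t+1 beforehand and storing the best one.
        target = batch["r_t"] + gamma * G_target(batch["u_next_max"], batch["c_next"])
    return ((target - q) ** 2).mean()
```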
|
|
|
**Integrating Network $G$.** One simple approach for integrating $c^i$ with $u^i$ is to formulate $G$ as an MLP that maps their concatenation into the joint value estimation. We instead propose to map $c^i$ into the parameters of $G$ by a hypernetwork (Ha et al., 2016), which we refer to as the teamwork situation decoder $g$. Then, $G$ maps the ad hoc agent's utility $u^i$ into the value estimation. This alternative design changes the procedure of information integration. The decoder provides multiplicative integration to aggregate information. By contrast, the concatenation-based operation only provides additive integration, leading to a poor information integration ability (see Supplementary). We also empirically show that multiplicative integration stabilizes the training procedure and improves teamwork performance. In addition, we expect a monotonic relationship between $G$ and the marginal utility $u^i_t$: $\frac{\partial G}{\partial u^i_t} \ge 0$. Given any $c^i_t$, an increase in the ad hoc agent's marginal utility results in an improved joint action value. To achieve this property, we force $\theta_G \ge 0$.
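The sketch below shows one way such a hypernetwork-based integrating network could be realized: a hypernetwork standing in for the teamwork situation decoder $g$ maps $c^i_t$ to the weights and biases of $G$, and taking absolute values of the generated weights enforces $\partial G / \partial u^i_t \ge 0$. The two-layer structure and layer sizes are our own assumptions.

```python
import torch
import torch.nn as nn

class IntegratingNetwork(nn.Module):
    """Sketch of G with multiplicative integration: a hypernetwork (playing the
    role of the teamwork situation decoder g) generates the parameters of G
    from c^i_t, and the generated weights are made non-negative so that G is
    monotone in the ad hoc agent's marginal utility u^i_t."""
    def __init__(self, latent_dim, hidden=32):
        super().__init__()
        self.w1 = nn.Linear(latent_dim, hidden)   # g: c -> first-layer weights
        self.b1 = nn.Linear(latent_dim, hidden)   # g: c -> first-layer bias
        self.w2 = nn.Linear(latent_dim, hidden)   # g: c -> second-layer weights
        self.b2 = nn.Linear(latent_dim, 1)        # g: c -> output bias

    def forward(self, u, c):
        # u: (batch, 1) marginal utility, c: (batch, latent_dim) teamwork situation
        w1 = torch.abs(self.w1(c))                # non-negative weights => dG/du >= 0
        h = torch.relu(u * w1 + self.b1(c))
        w2 = torch.abs(self.w2(c))
        return (h * w2).sum(dim=-1, keepdim=True) + self.b2(c)   # estimate of the joint action value
```

Because the generated weights scale $u^i_t$ directly, this realizes the multiplicative integration discussed above; replacing the gating with a plain concatenation would fall back to additive integration.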
|
|
|
4.3 LEARNING CONDITIONAL MARGINAL UTILITY FUNCTION UNDER PARTIAL OBSERVABILITY |
|
|
|
Clearly, the marginal utility of the ad hoc agent depends on other teammates' behaviors. Distinct behaviors result in different marginal utilities. Here, we formalize the marginal utility network $M$ as a deep recurrent Q-network (DRQN) (Hausknecht and Stone, 2015) parameterized by $\theta_M$. To enable adaptation, we force the final layers' parameters of $M$ to condition on the learned variable $c^i_t$.
|
|
|
**Proxy Encoder $f^*$.** Because of partial observability, the teamwork situation encoder $f$ is not available during execution. Thus, we introduce a proxy encoder $f^*$ to estimate $c^i_t$ from the local transition data $b^i_t = (o^i_t, r_{t-1}, a^i_{t-1}, o^i_{t-1})$. We assume that $b^i_t$ can partly reflect the current teamwork situation since the transition implicitly indicates the underlying dynamics, which are primarily influenced by other teammates' behaviors. We denote the estimate of $c^i_t$ as $z^i_t$. Then, $z^i_t$ is fed into a proxy decoder $g^*(z^i_t; \theta_{g^*})$ parameterized by $\theta_{g^*}$ to generate the parameters $\theta_M$ of $M$, enabling the marginal utility function to condition on the proxy representation $z^i_t$. Similar to $c^i$, we encode $z^i$ into a stochastic embedding space:

$$(\mu_{z^i}, \sigma_{z^i}) = f^*(b^i; \theta_{f^*}), \qquad z^i \sim \mathcal{N}(\mu_{z^i}, \sigma_{z^i}) \tag{6}$$

where $\theta_{f^*}$ are the parameters of $f^*$.
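As an illustration of how $f^*$ and $M$ fit together, the sketch below encodes the local transition into a Gaussian proxy variable and lets a hypernetwork (standing in for the proxy decoder $g^*$) generate the final layer of a recurrent utility network. Dimensions, the GRU cell, and all names are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal

class ProxyEncoder(nn.Module):
    """Sketch of f* in Eq. (6): encodes the local transition
    b^i_t = (o^i_t, r_{t-1}, a^i_{t-1}, o^i_{t-1}) into a Gaussian over z^i_t."""
    def __init__(self, transition_dim, latent_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(transition_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * latent_dim))

    def forward(self, b):
        mu, sigma_raw = self.net(b).chunk(2, dim=-1)
        dist = Normal(mu, F.softplus(sigma_raw) + 1e-4)
        return dist.rsample(), dist


class MarginalUtilityNetwork(nn.Module):
    """Sketch of M: a recurrent utility network whose final layer is
    generated from z^i_t by a hypernetwork standing in for the proxy decoder g*."""
    def __init__(self, obs_action_dim, n_actions, latent_dim, hidden=64):
        super().__init__()
        self.gru = nn.GRUCell(obs_action_dim, hidden)
        self.decoder_w = nn.Linear(latent_dim, hidden * n_actions)  # g*: z -> weights
        self.decoder_b = nn.Linear(latent_dim, n_actions)           # g*: z -> biases
        self.n_actions, self.hidden = n_actions, hidden

    def forward(self, obs_action, h, z):
        h = self.gru(obs_action, h)                                  # summarize tau^i_t
        w = self.decoder_w(z).view(-1, self.hidden, self.n_actions)
        utilities = torch.bmm(h.unsqueeze(1), w).squeeze(1) + self.decoder_b(z)
        return utilities, h                                          # u^i(tau, a; z) for every action a
```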
|
|
|
|
|
¹ $u^i_t$ is a shorthand notation of $u^i(\tau^i_t, a^i_t; z^i_t)$.
|
|
|
|
|
|
|
|
**Regularizing Information Embedded in $z^i$.** To make $z^i_t$ identifiable, we expect $z^i_t$ to be informatively consistent with $c^i_t$. Thus, we introduce an information-based loss function $\mathcal{L}_{MI}$ to maximize the conditional mutual information $I(z^i_t; c^i_t \mid b^i_t)$ between the proxy variables and the true variables. However, estimating and maximizing mutual information is often infeasible. We introduce a variational distribution $q_\xi(z^i_t \mid c^i_t, b^i_t)$ parameterized by $\xi$ to derive a tractable lower bound for the mutual information (Alemi et al., 2017):

$$I(z^i_t; c^i_t \mid b^i_t) \ge \mathbb{E}_{z^i_t, c^i_t, b^i_t}\left[\log \frac{q_\xi(z^i_t \mid c^i_t, b^i_t)}{p(z^i_t \mid b^i_t)}\right] \tag{7}$$

where $p(z^i_t \mid b^i_t)$ is the Gaussian distribution $\mathcal{N}(\mu_{z^i}, \sigma_{z^i})$. This lower bound can be rewritten as a loss function to be minimized:

$$\mathcal{L}_{MI}(\theta_{f^*}, \xi) = \mathbb{E}_{(b^i_t, s_t, a^{-i}_t)\sim\mathcal{D}}\left[D_{KL}\left[p(z^i_t \mid b^i_t)\,\|\,q_\xi(z^i_t \mid c^i_t, b^i_t)\right]\right] \tag{8}$$

where $\mathcal{D}$ is the replay buffer and $D_{KL}[\cdot\|\cdot]$ is the KL divergence operator. The detailed derivation can be found in the Supplementary.
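Since both distributions are Gaussian, Eq. (8) reduces to a KL divergence that can be computed in closed form. The sketch below assumes the proxy distribution produced by $f^*$ and a variational posterior head that outputs a mean and standard deviation; both names are illustrative.

```python
import torch
from torch.distributions import Normal, kl_divergence

def mi_loss(proxy_dist, variational_head, c, b):
    """Sketch of L_MI in Eq. (8): KL[ p(z^i_t | b^i_t) || q_xi(z^i_t | c^i_t, b^i_t) ].
    `proxy_dist` is the Normal produced by the proxy encoder f*, and
    `variational_head` is assumed to map (c, b) to the mean/std of q_xi."""
    mu_q, sigma_q = variational_head(torch.cat([c, b], dim=-1))
    q_dist = Normal(mu_q, sigma_q)
    return kl_divergence(proxy_dist, q_dist).sum(dim=-1).mean()
```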
|
|
|
|
|
4.4 OVERALL OPTIMIZATION OBJECTIVE |
|
|
|
**Algorithm 1 ODITS Training** |
|
|
|
**Require:** Batch of training teammates' behavioral policies $\{\pi^{-i}_j\}^{tr}_{j=1,2,\cdots,J}$; learning rate $\alpha$; scaling factor $\lambda$.
1: initialize the replay buffer $\mathcal{D}$
2: **while** not done **do**
3:   **for** $k = 1, \cdots, K$ **do**
4:     Sample the teammates' policies $\pi^{-i}_j$ from $\{\pi^{-i}_j\}^{tr}_{j=1,2,\cdots,J}$
5:     Sample data $D_k = \{(s_t, a^i_t, a^{-i}_t, r_t)\}_{t=1,\cdots,T}$ using the ad hoc agent's policy $\pi_i$ and $\pi^{-i}$
6:     Add $D_k$ into $\mathcal{D}$
7:   **for** steps in training steps **do**
8:     Sample one trajectory $D \sim \mathcal{D}$
9:     **for** $t = 1, \cdots, T-1$ **do**
10:      Compute $(\mu_{c^i_t}, \sigma_{c^i_t}) = f(s_t, a^{-i}_t)$ and sample $c^i_t \sim \mathcal{N}(\mu_{c^i_t}, \sigma_{c^i_t})$
11:      Compute $(\mu_{z^i_t}, \sigma_{z^i_t}) = f^*(b^i_t)$ and sample $z^i_t \sim \mathcal{N}(\mu_{z^i_t}, \sigma_{z^i_t})$
12:      Compute $u^i_t(\tau^i_t, a^i_t; z^i_t)$ and $G(u^i_t; c^i_t)$
13:      Compute $\mathcal{L}_Q$, $\mathcal{L}_{MI}$
14:      $\theta \leftarrow \theta + \alpha \nabla_\theta(\mathcal{L}_Q)$
15:      $\theta_{f^*} \leftarrow \theta_{f^*} + \lambda \cdot \alpha \nabla_{\theta_{f^*}}(\mathcal{L}_{MI})$
16:      $\xi \leftarrow \xi + \lambda \cdot \alpha \nabla_\xi(\mathcal{L}_{MI})$
|
|
|
|
|
**Algorithm 2 ODITS Testing** |
|
|
|
**Require:** Testing teammates' behavioral policies $\pi^{-i}$.
1: **for** $t = 1, \cdots, T$ **do**
2:   Generate teammates' actions $a^{-i}_t \sim \pi^{-i}$
3:   Compute $(\mu_{z^i_t}, \sigma_{z^i_t}) = f^*(b^i_t)$ and sample $z^i_t \sim \mathcal{N}(\mu_{z^i_t}, \sigma_{z^i_t})$
4:   Take the action $a^i_t$ that maximizes $u^i(\tau^i_t, a^i_t; z^i_t)$
|
|
|
Finally, the overall objective becomes:

$$\mathcal{L}(\theta) = \mathcal{L}_Q(\theta) + \lambda \mathcal{L}_{MI}(\theta_{f^*}, \xi) \tag{9}$$

where $\theta = (\theta_f, \theta_g, \theta_M, \theta_{f^*}, \theta_{g^*}, \xi)$ and $\lambda$ is the scaling factor.
|
|
|
During the training phase, the ad hoc agent interacts with different training teammates for collecting transition |
|
data into the replay buffer D. Then, samples from D are fed into the framework for updating all parameters |
|
|
|
|
|
Figure 3: Performance comparison across various scenarios for Predator Prey (top panel; average number of captured preys in 2d1y, 4d2y, and 8d4y) and Save the City (bottom panel; average number of completed buildings in 2a2b, 4a3b, and 6a4b).
|
|
|
|
|
by gradients induced by the overall loss. During execution, the ad hoc agent conditions its behavior on the |
|
inferred teamwork situations by choosing actions to maximize the conditional utility function $u^i(\tau^i_t, a^i_t; z^i_t)$. We summarize our training procedure and testing procedure in Algorithm 1 and Algorithm 2.
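Pulling the pieces together, a single gradient update following Eq. (9) might look like the sketch below. It reuses the component sketches above, simplifies the handling of the recurrent state and next-step quantities, and uses a single optimizer over all parameters, which is a simplification of Algorithm 1 (where the MI gradients are scaled separately); all batch keys are assumptions.

```python
import torch

def training_step(batch, encoder_f, proxy_f, posterior_q, utility_M,
                  integrator_G, target_G, optimizer, lam=1e-3, gamma=0.99):
    """One update following Eq. (9): L = L_Q + lambda * L_MI (illustrative)."""
    c, _ = encoder_f(batch["state"], batch["teammate_actions"])       # c^i_t
    z, z_dist = proxy_f(batch["local_transition"])                    # z^i_t
    utilities, _ = utility_M(batch["obs_action"], batch["gru_state"], z)
    u_taken = utilities.gather(1, batch["action"])                    # action: long tensor (batch, 1)
    q = integrator_G(u_taken, c)                                      # G(u^i_t, c^i_t)

    with torch.no_grad():                                             # TD target of Eq. (5)
        c_next, _ = encoder_f(batch["next_state"], batch["next_teammate_actions"])
        z_next, _ = proxy_f(batch["next_local_transition"])
        u_next, _ = utility_M(batch["next_obs_action"], batch["next_gru_state"], z_next)
        best_u_next = u_next.max(dim=1, keepdim=True).values
        target = batch["reward"] + gamma * target_G(best_u_next, c_next)
    l_q = ((target - q) ** 2).mean()

    mu_q, sigma_q = posterior_q(torch.cat([c, batch["local_transition"]], dim=-1))
    l_mi = torch.distributions.kl_divergence(
        z_dist, torch.distributions.Normal(mu_q, sigma_q)).sum(dim=-1).mean()

    loss = l_q + lam * l_mi                                           # Eq. (9)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```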
|
|
|
5 EXPERIMENTS |
|
|
|
|
|
We now empirically evaluate ODITS on various new and existing domains. All experiments in this paper are |
|
carried out with 4 different random seeds, and results are shown with a 95% confidence interval over the standard
|
deviation. In the following description, we refer to the teammates that interact with the ad hoc agent during |
|
the training phase as the training teammates, and refer to those teammates with unknown policies as testing |
|
teammates. The "teammate types" correspond to the policy types of teammates. All experimental results
|
illustrate the average teamwork performance when the ad hoc agent cooperates with different testing teammates. |
|
Additional experiments, further experiment details, and implementation details of all models can be found in the
|
Supplementary. |
|
|
|
5.1 PREDATOR PREY |
|
|
|
|
|
**Configurations.** In this environment, $m$ homogeneous predators try to capture $n$ randomly-moving preys in a
|
7 × 7 grid world. Each predator has six actions: the moving actions in four directions, the capturing action, and |
|
waiting at a cell. Due to partial observability, each predator can only access the environmental information within |
|
two cells nearby. Besides, there are two obstacles at random locations. Episodes are 40 steps long. The predators |
|
get a team reward of 500 if two or more predators are capturing the same prey at the same time, and they are |
|
penalized by -10 if only one of them tries to capture a prey. Here, we adopt three different settings to verify

the ad hoc agent's ability to cooperate with different numbers of teammates. They are 2 predators and 1 prey
|
(2d1y), 4 predators and 2 preys (4d2y) and 8 predators and 4 preys (8d4y), respectively. |
|
|
|
We compare our methods with three type-based baselines: AATEAM (Chen et al., 2020), ConvCPD (Ravula et al., |
|
2019), PLASTIC (Barrett and Stone, 2015). Note that these approaches assume that the ad hoc agent has full |
|
visibility of the environment. To apply them in partially observable settings, we replace the full state information
|
used in them with partial observations of the ad hoc agent. Furthermore, we also compare two other strategies: |
|
(i) Random: The ad hoc agent chooses actions randomly. (ii) Combined: The ad hoc agent utilizes a DQN |
|
algorithm to learn a single policy using the data collected from all possible teammates. This intuitive baseline |
|
provides the view of treating the problem as a vanilla single-agent learning problem, where the agent ignores the |
|
differences between its teammates. |
|
|
|
Before training all algorithms, we first require a teammate set that consists of various behavioral policies of |
|
teammates. Instead of crafting several teammates’ cooperative policies by hand, we expect to train a set of distinct |
|
policies automatically. Therefore, we first utilize 5 different MARL algorithms (e.g. VDN (Sunehag et al., 2018) |
|
and QMIX (Rashid et al., 2018)) to develop several teams of agents. To ensure diversity, we use different random |
|
seeds for each algorithm and save the corresponding models at 3 different checkpoints (3 million steps, 4 million |
|
|
|
|
|
|
|
|
steps, and 5 million steps). Then, we manually select 15 different policies showing distinct policy representations |
|
(Grover et al.) from all developed models. Finally, we randomly sample 8 policies as the training set and the
|
other 7 policies as the testing set. During training, we define 8 teammate types that correspond to 8 policies in |
|
the training set for the type-based approaches. Then, algorithms would develop their models according to the |
|
interactive experience with training teammates. For all algorithms, agents are trained for 4.5 million time steps. |
|
The number of captured preys when the ad hoc agent cooperates with testing teammates throughout training is |
|
reported. See Supplementary for further settings. |
|
|
|
**Results.** The top panel of Fig. 3 reports the results across 3 scenarios. We first observe that ODITS achieves
|
superior results on the number of captured preys across a varying number of teammates, verifying its effectiveness. |
|
ODITS also tends to give more consistent results than other methods across different difficulties. The other 3 |
|
type-based baselines and ODITS show better results than random and combined policies, indicating that they can |
|
indeed lead to adaptive behaviors to different teammates. Furthermore, the random strategy captures some preys |
|
on 4d2y and 8d4y, but completely fails on 2d1y. This indicates that without the cooperative behaviors of the |
|
ad hoc agent, other teammates can also coordinate with each other to achieve the given goal. The combined policy |
|
shows worse results than the random policy on two scenarios (4d2y and 8d4y). This might be because the |
|
combined policy shows behaviors that conflict with other teammates. As the number of teammates increases,

the accumulating conflicts lead to serious miscoordination.
|
|
|
5.2 SAVE THE CITY |
|
|
|
**Configurations.** This is a grid world resource allocation task presented in (Iqbal et al., 2020). |
|
|
|
In this task, there are 3 distinct types of agents, and their goal is to complete |
|
the construction of all buildings on the map while preventing them from burning |
|
down. Each agent has 8 actions: stay in place, move to the next cell in one of |
|
the four cardinal directions, put out the fire, and build. We set the agents to get |
|
a team reward of 100 if they have completed a building and to be penalized by -500
|
when one building is burned down. Agent types include firefighters (20x speedup |
|
over the base rate in reducing fires), builders (20x speedup in building), or

generalists (5x speedup in both, as well as 2x speedup in moving). Buildings also
|
consist of two varieties: fast-burning and slow-burning, where the fast-burning |
|
buildings burn four times faster. In our experiments, each agent can only access |
|
the environmental information within four cells nearby. We adopt three different |
|
scenarios here to verify all methods. They are 2 agents and 2 buildings (2a2b),
|
4 agents and 3 buildings (4a3b), 6 agents and 4 buildings (6a4b). |
|
|
|
Figure 4: ODITS vs. QMIX (panels: Predator Prey 4d2y and Save the City 4a3b).

Similar to the training settings in Predator Prey, we select 15 distinct behavioral policies for the teammate set and randomly partition them into 8 training policies and 7 testing policies. For all algorithms, agents are trained for 4.5 million time steps. The number of completed buildings when the ad hoc agent cooperates with testing teammates throughout training is reported. See Supplementary for further settings.
|
|
|
**Results.** The bottom panel of Fig. 3 reports the results across 3 scenarios. We first observe that ODITS |
|
outperforms other baselines, verifying its effectiveness. Since the setting forces all agents in the environment to be
|
heterogeneous, the results also underpin the robustness of ODITS. Interestingly, we find that the combined policy |
|
reveals better performance than other type-based approaches. This result is not consistent with that in Predator |
|
Prey. Our intuition is that the requirement of cooperative behaviors in Save the City is less than that in Predator |
|
Prey. Actually, one agent in Save the City can complete buildings individually without the strong necessity of |
|
cooperating with other teammates’ behaviors. By contrast, one predator cannot capture prey by itself. As a result, |
|
the combined policy learns a universal and effective policy by learning from more interactive experience with |
|
different teammates, while type-based approaches fail because developing distinct cooperative behaviors leads to |
|
the instability of the ad hoc agent. This hypothesis is also empirically demonstrated in our ablations. |
|
|
|
5.3 COMPARISON WITH MARL |
|
|
|
In order to compare the performance of ODITS and MARL, we implement the commonly used algorithm |
|
QMIX (Rashid et al., 2018) as our baseline. Similar to the training procedure of ODITS, we fix one agent and |
|
train it with teammates randomly sampled from a pool consisting of 8 policies. The gradients for updating the |
|
|
|
|
|
|
|
|
teammates' policies are blocked but the mixing network is updated as in the original implementation of QMIX.
|
Figure 4 shows the comparison of ODITS and QMIX on Predator Prey 4d2y and Save the City 4a3b. In both |
|
environments, QMIX performs significantly worse than ODITS. This is not quite surprising because MARL |
|
algorithms usually assume that all the teammates are fixed. Therefore, although trained with multiple teammates, |
|
the agent under the QMIX framework does not learn to cooperate with an impromptu team. |
|
|
|
5.4 ABLATIONS
|
|
|
|
|
We perform several ablations on the Predator Prey 4d2y and Save |
|
the City 4a3b to try and determine the importance of each component of ODITS. |
|
|
|
|
|
|
|
|
|
**Adaptive Behaviors.** We first consider removing the information-
|
based loss $\mathcal{L}_{MI}$ from the overall learning objective (denoted as

**w/o info.**). Fig. 5 shows that without $\mathcal{L}_{MI}$ regularizing the informa-

tion embedded in $z^i_t$, ODITS induces worse teamwork performance.
|
This indicates that improving the mutual information between the |
|
proxy variable and the true variable indeed results in better representations of teamwork situations. We next consider how the inferred variables of teamwork situations affect the ad hoc agent's adaptive behaviors. We remove the proxy encoder and set $z^i_t$ as a fixed and randomly generated vector (denoted as **w/o infer.**). As shown in Fig. 5, conditioning on a random signal leads to a further drop in performance, indicating that irrelevant signals cannot promote the ad hoc agent to develop adaptive policies.

Figure 5: Ablations for different components (panels: Predator Prey 4d2y and Save the City 4a3b).
|
|
|
**Integrating Mechanism.** We remove the teamwork situation encoder as well as $\mathcal{L}_{MI}$ from the framework and
|
feed a vector filled with 1 into the teamwork situation decoder (labeled with **w/o integ.**). This setting prevents

ODITS from integrating the ad hoc agent's marginal utility with the information of teammates' behaviors.
|
Compared with w/o info., it brings a larger drop in teamwork performance. One intuition is that predicting the |
|
joint-action value plays an essential role in estimating the marginal utility. Suppose the integrating network has |
|
no information on other teammates’ behaviors. In that case, it cannot accurately predict the joint-action value, |
|
resulting in instability in marginal utility estimation. Despite the empirical evidence supporting this argument, |
|
however, it would be interesting to develop further theoretical insights into this training regime in future work. |
|
We finally consider the additive integration mechanism mentioned in Section 4.2 (labeled with **additive**). We

observe that although additive integration shows excellent performance in Save the City, it suffers from poor
|
performance in Predator Prey, indicating that multiplicative integration provides a more stable and effective |
|
ability to integrate information from teammates and the ad hoc agent. Interestingly, we also find that most |
|
ablations get worse results in Predator Prey than in Save the City. We believe that the different levels

of cooperative requirements in the two environments result in this phenomenon. A prey is captured only when two

nearby predators are simultaneously capturing it. By contrast, a burning building can be constructed by an

individual agent. Therefore, removing mechanisms that promote cooperative behaviors leads to worse
|
performance in Predator Prey. |
|
|
|
6 CONCLUSIONS |
|
|
|
This paper proposes a novel adaptive reinforcement learning algorithm called ODITS to address the challenging |
|
ad hoc teamwork problem. Without the need to predefine types of teammates, ODITS automatically learns |
|
compact but descriptive variables to infer how to coordinate with previously unknown teammates’ behaviors. To |
|
overcome partial observability, we introduce an information-based regularizer to estimate proxy representations
|
of learned variables from local observations. Experimental results show that ODITS obtains superior performance |
|
compared to various baselines on several complex ad hoc teamwork benchmarks. |
|
|
|
ACKNOWLEDGMENTS |
|
|
|
This research was supported by the National Research Foundation, Singapore under its AI Singapore Programme |
|
(AISG Award No: AISG-RP-2019-0013), National Satellite of Excellence in Trustworthy Software Systems |
|
(Award No: NSOETSS2019-01), and NTU. |
|
|
|
|
|
|
|
|
REFERENCES |
|
|
|
S. V. Albrecht and P. Stone. Reasoning about hypothetical agent behaviours and their parameters. In Proceedings |
|
_of the International Conference on Autonomous Agents and Multiagent Systems, 2017._ |
|
|
|
S. V. Albrecht and P. Stone. Autonomous agents modelling other agents: A comprehensive survey and open |
|
problems. Artificial Intelligence, 258:66–95, 2018. |
|
|
|
A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy. Deep variational information bottleneck. In International |
|
_Conference on Learning Representations, 2017._ |
|
|
|
N. Bard, J. N. Foerster, S. Chandar, N. Burch, M. Lanctot, H. F. Song, E. Parisotto, V. Dumoulin, S. Moitra, |
|
E. Hughes, I. Dunning, S. Mourad, H. Larochelle, M. G. Bellemare, and M. Bowling. The hanabi challenge: |
|
A new frontier for ai research. Artificial Intelligence, 280:103216, 2020. |
|
|
|
S. Barrett and P. Stone. Cooperating with unknown teammates in complex domains: A robot soccer case study |
|
of ad hoc teamwork. In Proceedings of the AAAI Conference on Artificial Intelligence, 2015.
|
|
|
R. Canaan, J. Togelius, A. Nealen, and S. Menzel. Diverse agents for ad-hoc cooperation in hanabi. arXiv |
|
_preprint arXiv:1907.03840, 2019._ |
|
|
|
S. Chen, E. Andrejczuk, Z. Cao, and J. Zhang. Aateam: Achieving the ad hoc teamwork by employing the |
|
attention mechanism. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7095–7102, 2020.
|
|
|
I. Durugkar, E. Liebman, and P. Stone. Balancing individual preferences and shared objectives in multiagent |
|
reinforcement learning. In Proceedings of the International Joint Conference on Artificial Intelligence, 2020. |
|
|
|
J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson. Counterfactual multi-agent policy gradients. |
|
In Proceedings of the AAAI Conference on Artificial Intelligence, 2017.
|
|
|
K. Genter, N. Agmon, and P. Stone. Role-based ad hoc teamwork. In Workshops at the AAAI Conference on
|
_Artificial Intelligence, 2011._ |
|
|
|
A. Grover, M. Al-Shedivat, J. K. Gupta, Y. Burda, and H. Edwards. Learning policy representations in multiagent |
|
systems. In Proceedings of the International Conference on Machine Learning, pages 1802–1811.
|
|
|
D. Ha, A. Dai, and Q. V. Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016. |
|
|
|
M. Hausknecht and P. Stone. Deep recurrent q-learning for partially observable mdps. In Proceedings of the |
|
_AAAI Conference on Artificial Intelligence, 2015._ |
|
|
|
H. He, J. Boyd-Graber, K. Kwok, and H. Daumé. Opponent modeling in deep reinforcement learning. In |
|
_Proceedings of the International Conference on Machine Learning, pages 1804–1813, 2016._ |
|
|
|
S. Iqbal, C. A. S. de Witt, B. Peng, W. Böhmer, S. Whiteson, and F. Sha. Randomized entity-wise factorization |
|
for multi-agent reinforcement learning. arXiv preprint arXiv:2006.04222, 2020. |
|
|
|
A. Mahajan, T. Rashid, M. Samvelyan, and S. Whiteson. Maven: Multi-agent variational exploration. In |
|
_Advances in Neural Information Processing Systems, pages 7613–7624, 2019._ |
|
|
|
R. Mirsky, W. Macke, A. Wang, H. Yedidsion, and P. Stone. A penny for your thoughts: The value of |
|
communication in ad hoc teamwork. In Proceedings of the International Joint Conference on Artificial |
|
_Intelligence, 2020._ |
|
|
|
V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous |
|
methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, |
|
pages 1928–1937, 2016. |
|
|
|
|
|
|
P. Muller, S. Omidshafiei, M. Rowland, K. Tuyls, J. Perolat, S. Liu, D. Hennes, L. Marris, M. Lanctot, E. Hughes, |
|
Z. Wang, G. Lever, N. Heess, T. Graepel, and R. Munos. A generalized training approach for multiagent |
|
learning. In International Conference on Learning Representations, 2020. |
|
|
|
F. A. Oliehoek, M. T. Spaan, N. Vlassis, and S. Whiteson. Exploiting locality of interaction in factored decpomdps. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, |
|
pages 517–524, 2008. |
|
|
|
G. Papoudakis and S. V. Albrecht. Variational autoencoders for opponent modeling in multi-agent systems. In |
|
_AAAI Workshop on Reinforcement Learning in Games, 2020._ |
|
|
|
M. A. Rahman, N. Hopner, F. Christianos, and S. V. Albrecht. Towards open ad hoc teamwork using graph-based |
|
policy learning. In Proceedings of the International Conference on Machine Learning, pages 8776–8786, |
|
2021. |
|
|
|
R. Raileanu, E. Denton, A. Szlam, and R. Fergus. Modeling others using oneself in multi-agent reinforcement |
|
learning. In Proceedings of the International Conference on Machine Learning, pages 4257–4266. |
|
|
|
R. Raileanu, M. Goldstein, A. Szlam, and R. Fergus. Fast adaptation to new environments via policy-dynamics |
|
value functions. In Proceedings of the International Conference on Machine Learning, pages 7920–7931, |
|
2020. |
|
|
|
T. Rashid, M. Samvelyan, C. S. de Witt, G. Farquhar, J. Foerster, and S. Whiteson. Qmix: Monotonic value |
|
function factorisation for deep multi-agent reinforcement learning. In Proceedings of the International |
|
_Conference on Machine Learning, pages 1228–1236, 2018._ |
|
|
|
M. Ravula, S. Alkobi, and P. Stone. Ad hoc teamwork with behavior switching agents. In Proceedings of the |
|
_International Joint Conference on Artificial Intelligence, 2019._ |
|
|
|
M. Samvelyan, T. Rashid, C. S. de Witt, G. Farquhar, N. Nardelli, T. G. Rudner, C.-M. Hung, P. H. Torr, |
|
J. Foerster, and S. Whiteson. The starcraft multi-agent challenge. arXiv preprint arXiv:1902.04043, 2019. |
|
|
|
P. Stone, G. A. Kaminka, S. Kraus, and J. S. Rosenschein. Ad hoc autonomous agent teams: Collaboration |
|
without pre-coordination. In Proceedings of the AAAI Conference on Artificial Intelligence, 2010. |
|
|
|
P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. F. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. |
|
Leibo, K. Tuyls, et al. Value-decomposition networks for cooperative multi-agent learning based on team |
|
reward. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, pages |
|
2085–2087, 2018. |
|
|
|
M. Suzuki, K. Nakayama, and Y. Matsuo. Joint multimodal learning with deep generative models. arXiv preprint |
|
_arXiv:1611.01891, 2016._ |
|
|
|
A. Tacchetti, H. F. Song, P. A. M. Mediano, V. Zambaldi, J. Kramár, N. C. Rabinowitz, T. Graepel, M. Botvinick, |
|
and P. W. Battaglia. Relational forward models for multi-agent learning. In International Conference on |
|
_Learning Representations, 2019._ |
|
|
|
M. Tan. Multi-agent reinforcement learning independent vs. cooperative agents. In Proceedings of the Interna_tional Conference on Machine Learning, 1993._ |
|
|
|
T. Wang, H. Dong, V. Lesser, and C. Zhang. Roma: Multi-agent reinforcement learning with emergent roles. In |
|
_Proceedings of the International Conference on Machine Learning, 2020._ |
|
|
|
T. Wang, T. Gupta, A. Mahajan, B. Peng, S. Whiteson, and C. Zhang. Rode: Learning roles to decompose |
|
multi-agent tasks. International Conference on Learning Representations, 2021. |
|
|
|
G. N. Yannakakis. Game ai revisited. In Proceedings of the 9th conference on Computing Frontiers, pages |
|
285–292, 2012. |
|
|
|
H. Yin, F. Melo, A. Billard, and A. Paiva. Associate latent encodings in learning from demonstrations. In |
|
_Proceedings of the AAAI Conference on Artificial Intelligence, 2017._ |
|
|
|
L. Zintgraf, S. Devlin, K. Ciosek, S. Whiteson, and K. Hofmann. Deep interactive bayesian reinforcement |
|
learning via meta-learning. arXiv preprint arXiv:2101.03864, 2021. |
|
|
|
|
|
|
A APPENDIX |
|
|
|
A.1 MATHEMATICAL DERIVATION |
|
|
|
A.1.1 ADDITIVE INTEGRATION VS. MULTIPLICATIVE INTEGRATION IN INTEGRATING NETWORK. |
|
|
|
Concretely, let us consider a simple example in which $G$ is a one-layer network. The information aggregation of

additive integration (abbreviated as $G_{add}$) could be written as follows:
|
|
|
$$G_{add} = \mathcal{F}(W_u u^i + W_c c^i + b) \tag{10}$$

where $W_u$ and $W_c$ are the weight vectors of $u^i$ and $c^i$, respectively; $b$ is the bias vector; $\mathcal{F}$ is the corresponding activation function.
|
|
|
By comparison, the multiplicative integration (abbreviated as $G_{mul}$) could be written as follows:

$$G_{mul} = \mathcal{F}(W_c(c^i) W_u u^i + b(c^i)) \tag{11}$$

where $W_c(c^i)$ and $b(c^i)$ are functions of $c^i$ that generate the weight and bias vectors.
|
|
|
Compared with $G_{add}$, $G_{mul}$ changes the integrating network from first order to second order, while introducing no extra parameters. The information of the teamwork situation directly scales the information of the ad hoc agent through the term $W_c(c^i) W_u u^i$, rather than via a more subtle bias $W_u u^i + W_c c^i$.
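For concreteness, the two forms can be written as the following minimal PyTorch modules; dimensions and the ReLU activation are illustrative assumptions. $G_{mul}$ lets $c^i$ generate a gate and bias that act directly on the utility term, whereas $G_{add}$ only sums the two streams.

```python
import torch
import torch.nn as nn

class AdditiveIntegration(nn.Module):
    """G_add of Eq. (10): u^i and c^i only interact through a sum before the activation."""
    def __init__(self, latent_dim, hidden=32):
        super().__init__()
        self.w_u = nn.Linear(1, hidden, bias=False)   # W_u
        self.w_c = nn.Linear(latent_dim, hidden)      # W_c and bias b

    def forward(self, u, c):
        return torch.relu(self.w_u(u) + self.w_c(c))


class MultiplicativeIntegration(nn.Module):
    """G_mul of Eq. (11): c^i generates a gate W_c(c^i) and bias b(c^i) that scale W_u u^i."""
    def __init__(self, latent_dim, hidden=32):
        super().__init__()
        self.w_u = nn.Linear(1, hidden, bias=False)   # W_u
        self.gate = nn.Linear(latent_dim, hidden)     # W_c(c^i)
        self.bias = nn.Linear(latent_dim, hidden)     # b(c^i)

    def forward(self, u, c):
        return torch.relu(self.gate(c) * self.w_u(u) + self.bias(c))
```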
|
|
|
Further, $G_{mul}$ also brings about advantages over $G_{add}$ as it alters the integrating network's gradient properties. In the additive integration, the gradient $\frac{\partial G_{add}}{\partial c^i}$ can be computed as follows:

$$\frac{\partial G_{add}}{\partial c^i} = W_c \mathcal{F}' \tag{12}$$

where $\mathcal{F}' = \mathcal{F}'(W_u u^i + W_c c^i + b)$. This equation reveals that the gradient heavily depends on the matrix $W_c$, but $W_u$ and $u^i$ play a limited role: they only come in through the derivative $\mathcal{F}'$, mixed with $W_u u^i$. By comparison, $\frac{\partial G_{mul}}{\partial c^i}$ is as follows:

$$\frac{\partial G_{mul}}{\partial c^i} = \left[W_u u^i W_c' + b'\right]\mathcal{F}' \tag{13}$$

where $W_c' = W_c'(c^i)$ and $b' = b'(c^i)$. In this equation, $W_u u^i$ is directly involved in the gradient computation by gating $W_c'$, hence it is more capable of altering the updates of the learning procedure. This naturally leads to better information integration.
|
|
|
A.1.2 MUTUAL INFORMATION LOSS FUNCTION $\mathcal{L}_{MI}$
|
|
|
To force the proxy representation of the learned latent variable to be informationally consistent with the true
|
latent variable, we propose to maximize the mutual information between the proxy representations and the latent |
|
variables. In this paper, we introduce a posterior estimator and derive a tractable lower bound of the mutual |
|
information term: |
|
|
|
|
|
_t[|][c][i]t[, b][i]t[)]_ |
|
|
|
_I(zt[i][;][ c][i]t[|][b][i]t[) =][ E]zt[i][,c][i]t[,b][i]t_ log _[p][(]p[z]([i]zt[i][|][b][i]t[)]_ |
|
|
|
|
|
|
|
_t[|][c][i]t[, b][i]t[)]_ |
|
|
|
= Ezti[,c][i]t[,b][i]t log _[q][ξ]p[(][z](z[i]_ _t[i][|][b][i]t[)]_ |
|
|
|
|
|
|
|
+ Ecit[,b][i]t [[][D][KL][(][p][(][z]t[i][|][c][i]t[, b][i]t[)][||][q][xi][(][z]t[i][|][c][i]t[, b][i]t[))]] |
|
|
|
_t[|][c][i]t[, b][i]t[)]_ |
|
|
|
_≥_ Ezti[,c][i]t[,b][i]t log _[q][ξ]p[(][z](z[i]_ _t[i][|][b][i]t[)]_ |
|
|
|
|
|
|
|
where the last inequality holds because of the non-negativity of the KL divergence. Then it follows that: |
|
|
|
12 |
|
|
|
|
|
(14) |
|
|
|
|
|
----- |
|
|
|
$$
\begin{aligned}
\mathbb{E}_{z^i_t, c^i_t, b^i_t}\left[\log \frac{q_\xi(z^i_t \mid c^i_t, b^i_t)}{p(z^i_t \mid b^i_t)}\right]
&= \mathbb{E}_{z^i_t, c^i_t, b^i_t}\left[\log q_\xi(z^i_t \mid c^i_t, b^i_t)\right] - \mathbb{E}_{z^i_t, b^i_t}\left[\log p(z^i_t \mid b^i_t)\right] \\
&= \mathbb{E}_{z^i_t, c^i_t, b^i_t}\left[\log q_\xi(z^i_t \mid c^i_t, b^i_t)\right] + \mathbb{E}_{b^i_t}\left[H(z^i_t \mid b^i_t)\right] \\
&= \mathbb{E}_{c^i_t, b^i_t}\left[\int p(z^i_t \mid c^i_t, b^i_t) \log q_\xi(z^i_t \mid c^i_t, b^i_t)\, dz^i_t\right] + \mathbb{E}_{b^i_t}\left[H(z^i_t \mid b^i_t)\right]
\end{aligned} \tag{15}
$$
|
|
|
|
|
The proxy encoder is conditioned on the transition data. Given the transitions, the distribution of the proxy representations $p(z^i_t)$ is independent of the local histories. Thus, we have

$$I(z^i_t; c^i_t \mid b^i_t) \ge -\mathbb{E}_{c^i_t, b^i_t}\left[CE\left[p(z^i_t \mid c^i_t, b^i_t)\,\|\,q_\xi(z^i_t \mid c^i_t, b^i_t)\right]\right] + \mathbb{E}_{b^i_t}\left[H(z^i_t \mid b^i_t)\right] \tag{16}$$

where $CE[\cdot\|\cdot]$ denotes the cross entropy and $H(\cdot)$ the entropy. In practice, we sample data from the replay buffer $\mathcal{D}$ and minimize

$$
\begin{aligned}
\mathcal{L}_{MI}(\theta_{f^*}, \xi) &= \mathbb{E}_{(c^i_t, b^i_t)\sim\mathcal{D}}\left[CE\left[p(z^i_t \mid c^i_t, b^i_t)\,\|\,q_\xi(z^i_t \mid c^i_t, b^i_t)\right]\right] - \mathbb{E}_{b^i_t}\left[H(z^i_t \mid b^i_t)\right] \\
&= \mathbb{E}_{(b^i_t, s_t, a^{-i}_t)\sim\mathcal{D}}\left[D_{KL}\left[p(z^i_t \mid b^i_t)\,\|\,q_\xi(z^i_t \mid c^i_t, b^i_t)\right]\right]
\end{aligned} \tag{17}
$$
|
|
|
A.2 ARCHITECTURE, HYPERPARAMETERS, AND INFRASTRUCTURE |
|
|
|
A.2.1 ODITS |
|
|
|
Details of the neural network architectures used by ODITS in all environments are provided in Fig. 7. We control the scale of the loss function by controlling the optimization procedure. It is conducted using RMSprop with a learning rate of $5 \times 10^{-4}$, $\alpha$ of 0.99, and with no momentum or weight decay. For the lambda value, we search over $\{1e{-}5, 1e{-}4, 5e{-}4, 1e{-}3, 5e{-}3, 1e{-}2\}$ and finally adopt $\lambda_{MI} = 1e{-}3$, $\lambda_{MI} = 1e{-}3$, and $\lambda_{MI} = 5e{-}4$ for Modified Coin Game, Predator Prey, and Save the City, respectively, since they induce the best performance compared with other values. For the dimension of the latent variables $z^i_t$ and $c^i_t$, we search over $\{1, 2, 3, 5, 10\}$ and finally adopt $|z| = 10$ in Save the City and $|z| = 1$ in the other environments. In addition, we set $|c| = |z|$. For exploration, we use $\epsilon$-greedy with $\epsilon$ annealed from 1.0 to 0.05 over 50,000 time steps and kept constant for the rest of the training. Batches of 128 episodes are sampled from the replay buffer, and all components in the framework are trained together in an end-to-end fashion.

Figure 6: Performance comparison of two algorithms on the Modified Coin Game.
|
A.2.2 BASELINES |
|
|
|
We compare ODITS with 5 baselines. For the random strategy, we force the ad hoc agent to choose its action at |
|
each time step randomly. |
|
|
|
For the combined strategy, a three-layer DQN is trained to get the policy of the ad hoc agent. Details of DQN’s |
|
architecture are illustrated in Fig. 7. We sample a type of training teammates at each training episode and collect |
|
the interacting experience data into the replay buffer. The optimization procedure and the exploration scheme |
|
are the same as those of ODITS. We set the batch size of samples as 128 and the target update interval as 200 |
|
episodes. |
|
|
|
For the other three type-based baselines, we assume that each training policy of teammates corresponds to a |
|
predefined type of teammates. So, the number of predefined types equals the number of training policies.

We construct a set of three-layer DQNs to learn the policies for all training types. Each DQN only learns from the

corresponding teammates' interaction experience. Training settings for these DQNs are the same as those
|
used in the combined strategy. Furthermore, to apply these baselines in partially observable environments, we |
|
replace the state information used in them with partial observations. For PLASTIC (Barrett and Stone, 2015), |
|
|
|
|
|
|
[Figure 7 depicts the layer configurations of the teamwork situation encoder, proxy encoder, teamwork situation decoder, proxy decoder, integrating network, marginal utility network, variational posterior estimator, the DQN baseline, and the CPD network used by ConvCPD: the encoders use FC layers (128 units, then $2\times|c|$ or $2\times|z|$ units) with BatchNormalization and LeakyReLU; the marginal utility network uses FC and GRU layers of 64 units with an output layer of $|A|$ units whose parameters are generated by the proxy decoder; the DQN baseline uses two FC layers of 64 units with ReLU and an FC output of $|A|$ units; the CPD network uses a $40\times3\times3$ convolution, $2\times2$ max-pooling, FC layers, and a Softmax output.]

Figure 7: Architecture details of ODITS and baselines.
|
|
|
For PLASTIC (Barrett and Stone, 2015), we set the parameter η used in the UpdateBelief function to 0.2. For ConvCPD (Ravula et al., 2019), we follow the implementation details mentioned in the original paper. We construct the architecture of the Change Point Detection (CPD) Network as illustrated in Fig. 7, where ntp is the number of predefined types, nt is the number of time steps considered by ConvCPD, and nclasses = ntp × (ntp − 1) + 1 is the number of classes of change points. The ConvCPD network is trained with 1000 × nclasses samples, in which each class has 1000 samples (batch size = 64, learning rate = 0.01, decay = 0.1, optimizer = SGD). For AATEAM (Chen et al., 2020), we
|
follow its proposed implementation details. The number of hidden layers for GRU is 5. The maximum length of |
|
a state sequence that attention networks process is 60. The rate of dropout is 0.1. During training, the default |
|
learning rate is 0.01, and the batch size is 32. |
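To make the role of η concrete, here is a hedged sketch of a polynomial-weights style belief update in the spirit of PLASTIC's UpdateBelief; the exact loss definition in the original algorithm may differ, and `action_likelihoods` is an assumed interface giving each type model's probability for the teammate's observed action.

```python
# Hedged sketch of a PLASTIC-style belief update with eta = 0.2: beliefs shrink in
# proportion to each type model's prediction loss, then are renormalized.
# Assumption: action_likelihoods[i] approximates P(observed teammate action | type i).
import numpy as np

def update_belief(beliefs: np.ndarray, action_likelihoods: np.ndarray, eta: float = 0.2):
    loss = 1.0 - action_likelihoods            # loss of each type model on this step
    beliefs = beliefs * (1.0 - eta * loss)     # down-weight poorly predicting types
    return beliefs / beliefs.sum()             # renormalize to a distribution

beliefs = np.full(3, 1 / 3)                    # uniform prior over 3 predefined types
print(update_belief(beliefs, np.array([0.9, 0.5, 0.1])))
```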
|
|
|
A.3 MODIFIED COIN GAME |
|
|
|
To show the difference between ODITS and type-based approaches, we introduce a simple modified coin game. The game takes place on a 7 × 7 map which contains 6 coins of 3 different colors (2 coins of each color). The team's aim is to collect only two kinds of coins (correct coins, with a reward of 100 each) and avoid collecting the third kind (false coins, with a reward of -200 each). The policies of the teammates are predefined and illustrated as ordered color pairs in Fig. 8 (left): 2 training types and 2 testing types. For example, the first training type (red → green) indicates that this teammate collects red and green coins, and that it collects red coins first. Therefore, although the correct coins of the first training type (red → green) and the second testing type (green → red) are the same, they are different policies, since their paths for collecting coins are clearly different. Each agent has five actions: move up, down, left, right, or pass. Once an agent steps on a coin, that
|
|
|
|
|
|
coin disappears from the cell. The game ends after 20 steps. To maximize the team return, the ad hoc agent must infer which coins its current teammate desires and collect as many of them as possible.
|
|
|
Here, we adopt one state-of-the-art type-based approach (AATEAM (Chen et al., 2020)) as the baseline. Fig. 6 shows the testing performance. We observe that ODITS shows superior performance and converges quickly, while AATEAM shows an unstable curve. We believe this discrepancy results from the key difference between our method and type-based approaches: the baseline struggles to cooperate with new types of teammates. For example, when the baseline agent interacts with the teammate of the second testing type (green → red) and observes that the teammate is collecting green coins in the early stage, it switches its own policy to the one corresponding to the second training type of teammate (green → blue), so it collects green coins and blue coins (false coins) simultaneously, leading to poor teamwork performance. By contrast, ODITS generalizes easily to the testing types of teammates. During training, ODITS learns how to cooperate with the teammate according to its current behavior instead of its type. If it observes that its teammate is collecting one kind of coin, it will collect the same kind of coin, and this knowledge is automatically embedded in c and z.
|
|
|
A.4 DETAILS OF EXPERIMENTS |
|
|
|
Figure 8: Illustration of Modified Coin Game (left: the ad hoc agent, its teammate, coins, and the policies of the training and testing types), Predator Prey (middle: predators, prey, obstacles, observation range, and successful versus failed capturing), and Save the City (right: firefighters, builders, generalists, fast- and slow-burning buildings, and completion progress).
|
|
|
**Modified Coin Game.** In this environment, the ad hoc agent needs to cooperate with its teammate to collect target coins, which depend on the current teammate's type. The teammates' types are illustrated as two ordered colors of coins in Fig. 8. The teammate follows a heuristic strategy and has complete visibility of the entire environmental state. It first moves along the axis for which it has the smallest distance to the coin of its first color. If it collides with obstacles or boundaries, it first chooses random actions for 3 steps and then moves toward the target coin again. If there is no coin of its first color left, it moves to the coin of its second color following the same strategy. The ad hoc agent can only access the information within 3 grids of itself. We use one-hot encoding to represent the different entities in each grid and concatenate them into a vector to construct the observation of the ad hoc agent. The game ends after 20 steps and takes place on a 7 × 7 grid containing 6 coins of 3 different colors (2 coins of each color). Each agent has five actions: move up, down, left, right, or pass. Once an agent steps on a coin, that coin disappears from the grid. This game requires agents to collect as many target coins as possible, as indicated by the teammate's type. Each correctly collected coin gives a reward of 100, while each false coin gives a reward of -200. As a result, the ad hoc agent needs to infer its teammate's goal from the teammate's behavior and move to the right coins while avoiding the false ones.
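The teammate heuristic above can be summarized by the following simplified sketch; collision handling and the 3-step random recovery are omitted, and all names and the coordinate convention are illustrative.

```python
# Simplified sketch of the heuristic teammate: move axis-wise toward the nearest coin
# of the preferred color, falling back to the second color when the first runs out.
def teammate_action(agent_xy, coins_by_color, color_order):
    """coins_by_color maps color -> list of (x, y); color_order is e.g. ['red', 'green']."""
    for color in color_order:                      # prefer the first color, then the second
        coins = coins_by_color.get(color, [])
        if coins:
            tx, ty = min(coins, key=lambda c: abs(c[0] - agent_xy[0]) + abs(c[1] - agent_xy[1]))
            dx, dy = tx - agent_xy[0], ty - agent_xy[1]
            if dx == 0 and dy == 0:
                return "pass"
            # move along the axis with the smaller remaining distance first
            if 0 < abs(dx) <= abs(dy) or dy == 0:
                return "right" if dx > 0 else "left"
            return "down" if dy > 0 else "up"
    return "pass"                                  # no target coins left
```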
|
|
|
**Predator Prey.** In this environment, m homogeneous predators try to capture n randomly moving prey in a 7 × 7 grid world. Each predator has six actions: the four moving actions, the capturing action, and waiting at a grid. Besides, there are two obstacles at random locations. Due to partial observability, each predator can only access the environmental information within two grids of itself. The information of each grid is embedded into a one-hot vector representing the different entities: obstacles, blank grids, predators, and prey. Episodes are 40 steps long. The predators get a team reward of 500 if two or more of them capture the same prey simultaneously, and they are penalized by -10 if only one of them tries to capture a prey. Here, we adopt three different scenarios to verify the ad hoc agent's ability to cooperate with different numbers of teammates: 2 predators and 1 prey (2d1y), 4 predators and 2 prey (4d2y), and 8 predators and 4 prey (8d4y).
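The capture reward rule can be summarized by the short sketch below; `capture_attempts` is an assumed bookkeeping structure (prey id mapped to the number of predators currently capturing it), not part of the original environment code.

```python
# Hedged sketch of the team reward: +500 when at least two predators capture the same
# prey simultaneously, -10 when a single predator attempts a capture alone.
def team_reward(capture_attempts: dict) -> int:
    reward = 0
    for n_predators in capture_attempts.values():
        if n_predators >= 2:
            reward += 500        # successful cooperative capture
        elif n_predators == 1:
            reward -= 10         # failed solo capture attempt
    return reward

print(team_reward({0: 2, 1: 1}))  # 500 - 10 = 490
```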
|
|
|
In this environment, to simulate complex cooperative behaviors, we utilize 5 algorithms (VDN (Sunehag et al., |
|
2018), MAVEN (Mahajan et al., 2019), ROMA (Wang et al., 2020), QMIX (Rashid et al., 2018), IQL (Tan, |
|
|
|
|
|
|
Figure 9: t-SNE plots of the learned teammates' policy representations for Predator Prey (top: 2d1y, 4d2y, 8d4y) and Save the City (bottom: 2a2b, 4a3b, 6a4b); the legend distinguishes training, testing, and unselected policies. For each scenario, we first develop 60 candidate policies by using existing MARL open-source implementations. Then, we train the self-supervised policy encoder of (Raileanu et al., 2020) on the collected behavioral trajectories {(o_t^i, a_t^i)} of all candidates to represent their policies. Finally, we select 15 policies whose average policy embeddings are distinct from each other, and split them into 8 training policies and 7 testing policies for each scenario.
|
|
|
|
|
1993)) to train the candidate policies of teammates using their open-source implementations based on PyMARL |
|
(Samvelyan et al., 2019). We use 4 different random seeds for each algorithm and save the corresponding models |
|
at 3 different training steps (3 million, 4 million, and 5 million). As a result, we get 60 candidate teammates’ |
|
policies for each scenario. To ensure diversity of cooperative behaviors, we visualize policy representations |
|
of all candidate models by using the open-source implementation mentioned in (Raileanu et al., 2020). This |
|
mechanism encodes the agent's behavioral trajectories {(o_t^i, a_t^i)} into an embedding space, in which different
|
policies generate different representations. Then, we manually select 15 distinct policies and randomly split them |
|
into 8 training policies and 7 testing policies for each scenario. We illustrate the policy representations in Fig. 9. |
|
During training or testing, in each episode, we sample one policy from the policy set and equip a team of agents |
|
with the policy. Then, we replace one random agent in the team with the ad hoc agent to construct an ad hoc |
|
team. |
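The per-episode team construction described above roughly corresponds to the following sketch; the function and variable names are illustrative rather than taken from the authors' code.

```python
# Minimal sketch of ad hoc team assembly: sample one teammate policy per episode,
# equip the whole team with it, then swap one random slot for the ad hoc agent.
import random

def build_ad_hoc_team(policy_set, team_size, ad_hoc_agent):
    policy = random.choice(policy_set)                # one teammate policy per episode
    team = [policy] * team_size                       # every teammate runs this policy
    team[random.randrange(team_size)] = ad_hoc_agent  # replace a random agent
    return team
```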
|
|
|
**Save the City.** This is a 14 × 14 grid-world resource allocation task presented in (Iqbal et al., 2020). In this task, there are 3 distinct types of agents, and their goal is to complete the construction of all buildings on the map while preventing them from burning down. Each agent has 8 actions: stay in place, move to the next grid in one of the four cardinal directions, put out the fire, and build. The agents receive a team reward of 100 when they complete a building and are penalized by -500 when a building burns down. Agent types include firefighters (20x speedup over the base rate in reducing fires), builders (20x speedup in building), and generalists (5x speedup in both, as well as a 2x speedup in moving). Buildings also come in two varieties:
|
|
|
|
|
|
fast-burning and slow-burning, where the fast-burning buildings burn four times faster. In our experiments, each agent can only access the environmental information within four grids of itself. The observation of each agent contains the information (type, position, and completion progress of the building) of each entity in the environment. If an entity is not in the sight range of the agent, the observation vector of this entity is filled with 0. We adopt three different scenarios here to verify all methods: 2 agents and 2 buildings (2a2b), 4 agents and 3 buildings (4a3b), and 6 agents and 4 buildings (6a4b). Similar to Predator Prey, we also manually selected 15 teammates' policies for each scenario; their policy representations are illustrated in Fig. 9. Furthermore, since each team has a random combination of agent types, teammates' behaviors in this environment show a greater diversity. In addition, all buildings' types are randomly assigned in each episode.
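The observation layout described above could be assembled as in the following sketch; the exact feature ordering, the square sight range, and the entity record format are assumptions for illustration.

```python
# Hedged sketch of the Save the City observation: one fixed-length block per entity
# (type one-hot, relative position, completion progress), zeroed when out of range.
import numpy as np

def build_observation(agent_xy, entities, n_types, sight_range=4):
    """entities: list of dicts with keys 'type', 'xy', 'progress' (illustrative format)."""
    blocks = []
    for e in entities:
        block = np.zeros(n_types + 3)              # type one-hot + (dx, dy) + progress
        dx, dy = e["xy"][0] - agent_xy[0], e["xy"][1] - agent_xy[1]
        if max(abs(dx), abs(dy)) <= sight_range:   # visible (square sight range assumed)
            block[e["type"]] = 1.0
            block[n_types:n_types + 2] = (dx, dy)
            block[n_types + 2] = e["progress"]
        blocks.append(block)                       # invisible entities stay all-zero
    return np.concatenate(blocks)
```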
|
|
|
B LIMITATIONS AND FUTURE WORKS |
|
|
|
In this paper, the experiments were carried out with relatively small teams (2-8 agents) and with agent behavior largely coming from pre-trained RL policies. In this case, the diversity of teammates' policies can easily be verified by visualizing the policy representations in latent space. However, as team size increases, there might be more complex and diverse teamwork situations, which places higher demands on training speed and on the scalability of the architecture. In addition, the diversity of the policy set used for training has considerable influence on performance, yet the characterization and quantification of this influence remain unexplored. We leave these issues for future work.
|
|
|