# ONLINE AD HOC TEAMWORK UNDER PARTIAL OBSERVABILITY

Pengjie Gu¹, Mengchen Zhao²*, Jianye Hao³,², Bo An¹
¹School of Computer Science and Engineering, Nanyang Technological University, Singapore
²Noah's Ark Lab, Huawei
³College of Intelligence and Computing, Tianjin University
{pengjie.gu, boan}@ntu.edu.sg, {zhaomengchen, haojianye}@huawei.com
*Corresponding author

ABSTRACT

Autonomous agents often need to work together as a team to accomplish complex cooperative tasks. Due to privacy and other realistic constraints, agents might need to collaborate with previously unknown teammates on the fly. This problem is known as ad hoc teamwork, and it remains a core research challenge. Prior works usually rely on strong assumptions such as full observability and fixed, predefined teammate types. This paper relaxes these assumptions with a novel reinforcement learning framework called ODITS, which allows the autonomous agent to adapt to arbitrary teammates in an online fashion. Instead of limiting teammates to a finite set of predefined types, ODITS automatically learns latent variables of teammates' behaviors to infer how to cooperate with new teammates effectively. To overcome partial observability, we introduce an information-based regularizer to derive proxy representations of the learned variables from local observations. Extensive experimental results show that ODITS significantly outperforms various baselines on widely used ad hoc teamwork tasks.

1 INTRODUCTION

Autonomous agents, including robots and software agents, are being widely deployed in diverse environments. In many tasks, they are increasingly required to cooperate with unknown teammates on the fly. For example, in search and rescue after a disaster, due to privacy constraints or lack of time, deployed robots may need to interact with robots from other companies or laboratories whose coordination protocols are not explicitly provided in advance (Barrett and Stone, 2015). Likewise, in game AI (Yannakakis, 2012), virtual agents are required to assist different agents controlled by human players. To complete these tasks effectively, autonomous agents must adapt quickly so that they can collaborate with intrinsically diverse and unknown teammates. This problem is known in the literature as ad hoc teamwork (Stone et al., 2010).

Existing approaches to ad hoc teamwork usually assume that all teammates' behaviors fall into several predefined and fixed types, each corresponding to a different coordination strategy (Barrett and Stone, 2015; Durugkar et al., 2020; Mirsky et al., 2020). By reasoning over the type of the interacting teammates, the agent switches to the corresponding policy. If the types are recognized correctly and the strategies are effective, the agent accomplishes the cooperative task well. However, defining sufficiently descriptive teammate types requires prior domain knowledge, especially in uncertain and complex environments. For example, in human-AI collaboration in Hanabi (Bard et al., 2020), human players exhibit a wide variety of cooperative behaviors, and it is challenging for predefined types to cover all of them. Further, teammates' strategies might evolve rapidly throughout the teamwork.
If the agent assumes that teammates' behavioral types are static and cannot adapt to its current teammates' behaviors in an online fashion, the team will suffer from serious miscoordination (Ravula et al., 2019; Chen et al., 2020); search and rescue tasks are a representative class of such examples (Ravula et al., 2019). On the other hand, existing techniques (Barrett and Stone, 2015; Albrecht and Stone, 2017; Chen et al., 2020; Ravula et al., 2019) use Bayesian posteriors over teammate types to compute optimal responses. To compute these posteriors, they usually assume that the agent always knows the other teammates' observations and actions. This assumption is unrealistic in partially observable environments, where each agent cannot see the other agents' observations.

To address these issues, this paper introduces an adaptive reinforcement learning framework called Online aDaptation via Inferred Teamwork Situations (ODITS). Our key insight is that teamwork performance is jointly determined by the autonomous agent and its teammates' behaviors. The agent's optimal behavior therefore depends on the current teamwork situation, which captures the influence of the other teammates on the environmental dynamics. If the agent identifies the current teamwork situation in an online fashion, it can choose actions accordingly to ensure effective coordination. To this end, we introduce a multimodal representation learning framework (Suzuki et al., 2016; Yin et al., 2017) that automatically encodes the core knowledge about teamwork situations into a latent probabilistic variable. We show that, without any prior knowledge, after learning from interactive experience with a set of given teammates, the latent variable is sufficiently descriptive to indicate how to coordinate with new teammates' behaviors. To overcome partial observability, we propose an information-based proxy encoder that implicitly infers the learned variables from local observations. The autonomous agent then adapts to new teammates' behaviors quickly by conditioning its policy on the inferred variables.

Instead of limiting teammates to several predefined and fixed types, ODITS provides a mechanism for adapting to teammates' behaviors online. It automatically learns continuous representations of teammates' behaviors to infer how to coordinate effectively with the current teammates' actions. Without domain knowledge of the current environment, it enables effective ad hoc teamwork and fast adaptation to varying teammates, which the agent might not be able to observe fully under partial observability. In our experimental evaluation, agents trained by interacting with a small set of given teammates can robustly collaborate with diverse new teammates. Compared with various type-based baselines, ODITS achieves superior ad hoc teamwork performance. Moreover, our ablations show the necessity of both learning latent variables of teamwork situations and inferring proxy representations of the learned variables.

2 RELATED WORKS

**Ad Hoc Teamwork.** The core challenge of cooperative ad hoc teamwork is to develop an adaptive policy that is robust to the behaviors of various unknown teammates (Stone et al., 2010).
Existing type-based approaches predefine types of teammates and choose policies accordingly to cooperate with unknown teammates (Chen et al., 2020; Ravula et al., 2019; Durugkar et al., 2020; Mirsky et al., 2020; Barrett and Stone, 2015). Specifically, PLASTIC (Barrett and Stone, 2015) infers teammate types by computing Bayesian posteriors over all types. ConvCPD (Ravula et al., 2019) extends this work with a mechanism that detects change points in the current teammate's type. AATEAM (Chen et al., 2020) proposes an attention-based architecture that infers types in real time by extracting temporal correlations from the state history. The drawback of these approaches is that a finite, fixed set of types may not cover all possible situations in complex environments. One recent work avoids predefining teammate types by leveraging graph neural networks (GNNs) to estimate the joint action value of an ad hoc team (Rahman et al., 2021). However, it requires all teammates' observations as input, which may not be available in the real world.

**Agent Modeling.** By modeling teammates' behaviors, agent modeling approaches aim to provide auxiliary information, such as teammates' goals or future actions, for decision-making (He et al., 2016; Albrecht and Stone, 2018). For example, MeLIBA conditions the ad hoc agent's policy on a belief over teammates, which is updated following the Bayesian rule (Zintgraf et al., 2021). However, existing agent models require full observations of teammates as input (Raileanu et al.; Grover et al.; Tacchetti et al., 2019). If the agent cannot always observe the teammates' information (e.g., their observations and actions), these approaches fail to give accurate predictions about the teammates. A recent work uses a VAE to model fixed-policy opponents under partial observability, but it does not generalize to the ad hoc setting where teammates can be non-stationary (Papoudakis and Albrecht, 2020). Other works study how to generate diverse agent policies, which benefits the training of ad hoc agents (Canaan et al., 2019).

**Multi-agent Reinforcement Learning (MARL).** Cooperative MARL (Foerster et al., 2017) with centralized training and decentralized execution (CTDE) (Oliehoek et al., 2008) is relevant to this work. Related approaches (Sunehag et al., 2018; Rashid et al., 2018) use value function factorization to overcome the limitations of both joint and independent learning. However, these algorithms assume that the team is fixed and closed: the team configuration (e.g., team size, formation, and goals) is unchanged, and agents never meet other agents without pre-coordination. Several extensions improve generalization to complex team configurations by leveraging other insights, such as learning dynamic roles (Wang et al., 2021; 2020), randomized entity-wise factorization (Iqbal et al., 2020), and a training regime based on game-theoretic principles (Muller et al., 2020). Intrinsically, however, these approaches focus on co-training a group of tightly coupled agents rather than a single autonomous agent that can adapt to non-stationary teammates.

3 BACKGROUND

**Problem Formalization.** Our aim is to develop a single autonomous agent, which we refer to as the ad hoc agent, that can cooperate effectively with various teammates under partial observability and without pre-coordination such as joint training.
While we focus on training a single agent in this work, similar approaches can be applied to construct an entire ad hoc team.

Figure 1: Visualization of the Dec-POMDP with an additional teammate set Γ. The ad hoc agent receives a partial observation, unknown teammates are sampled from the teammate set, and the joint action yields a shared reward from the environment.

To evaluate the ad hoc agent's ability to cooperate with unknown teammates, we formally define the problem as a decentralized partially observable Markov decision process (Dec-POMDP) (Oliehoek et al., 2008) with an additional assumption about the set $\Gamma$ of teammates' possible policies. It is represented as a tuple $\langle N, \mathcal{S}, \mathcal{A}, \mathcal{T}, P, R, \mathcal{O}, O, \Gamma \rangle$, where $N$ denotes the number of agents required by the task and $s \in \mathcal{S}$ denotes the global state of the environment. The joint action $\mathbf{a} \in \mathcal{A}^N$ is formed by all agents' independent actions $a^i \in \mathcal{A}$, where $i$ is the index of the agent. Each agent only has access to its partial observation $o^i \in \mathcal{O}$, drawn according to the observation function $O(s, i)$, and it maintains an observation-action history $\tau^i \in \mathcal{T} \equiv (\mathcal{O} \times \mathcal{A})^*$. $P(s' \mid s, \mathbf{a})$ denotes the probability that taking joint action $\mathbf{a}$ in state $s$ results in a transition to state $s'$. $R(s, \mathbf{a})$ is the reward function that maps a state $s$ and a joint action $\mathbf{a}$ to a team reward $r \in \mathbb{R}$. $\Gamma$ represents a pool of policies, which can be pretrained or predefined to exhibit cooperative behaviors. Without loss of generality, we denote by $\pi_i$ the policy of the ad hoc agent and by $\pi_{-i}$ the joint policy of all other agents. Fig. 1 shows the schematics of this problem. Note that the size of the teammate group can be arbitrary.

In a Dec-POMDP with an additional teammate set $\Gamma$, the objective of the ad hoc agent is to maximize the expected team return when it teams up with $N-1$ arbitrary teammates sampled from $\Gamma$, even though it has no prior knowledge about those teammates. Therefore, the ad hoc agent's optimal policy $\pi_i^*$ must maximize the joint action value $Q^{\pi_i}(s, a^i, \mathbf{a}^{-i})$, which is the expected cumulative team reward over different ad hoc teams:

$$Q^{\pi_i}(s, a^i, \mathbf{a}^{-i}) = \mathbb{E}_{a^i_t \sim \pi_i,\; \mathbf{a}^{-i}_t \sim \pi_{-i},\; \pi_{-i} \sim \Gamma} \left[ \sum_{t=0}^{+\infty} \gamma^t r_t \;\middle|\; s_0 = s,\; \mathbf{a}_0 = \mathbf{a},\; P \right] \quad (1)$$

$$Q^{\pi_i^*}(s, a^i, \mathbf{a}^{-i}) \ge Q^{\pi_i}(s, a^i, \mathbf{a}^{-i}), \quad \forall \pi_i, s, a^i, \mathbf{a}^{-i} \quad (2)$$

**Marginal Utility** is defined to measure the contribution of an ad hoc agent to the whole team utility (Genter and Stone, 2011). It represents the increase (or decrease) in a team's utility when an ad hoc agent is added to the team. Given teammates' actions $\mathbf{a}^{-i}$, the marginal utility and the team utility (denoted by the joint action value) are related as follows:

$$\arg\max_{a^i} u^i(s, a^i, \mathbf{a}^{-i}) = \arg\max_{a^i} Q^{\pi_i}(s, a^i, \mathbf{a}^{-i}) \quad (3)$$

where $u^i(s, a^i, \mathbf{a}^{-i})$ denotes the marginal utility when the ad hoc agent chooses action $a^i$ in state $s$. Note that the marginal utility is not necessarily equal to the Q-value (Sunehag et al., 2018). The ad hoc agent chooses the action that maximizes the marginal utility to ensure maximal team utility.
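To make the relationship in Eq. (3) concrete, the following minimal sketch shows how an ad hoc agent with access to an estimate of its marginal utility would pick its action greedily. This is our own illustration, not the authors' code: the names `marginal_utility` and `candidate_actions` and the toy utility are placeholders.

```python
import numpy as np

def greedy_ad_hoc_action(marginal_utility, state, teammate_actions, candidate_actions):
    """Pick the ad hoc agent's action by maximizing u^i(s, a^i, a^-i).

    marginal_utility: callable (state, action, teammate_actions) -> float, an estimate of u^i.
    By Eq. (3), the argmax of u^i coincides with the argmax of the joint value Q^{pi_i}.
    """
    utilities = np.array([marginal_utility(state, a, teammate_actions) for a in candidate_actions])
    return candidate_actions[int(np.argmax(utilities))]

# Toy usage with a hand-crafted utility (for illustration only).
if __name__ == "__main__":
    actions = [0, 1, 2, 3, 4, 5]
    fake_u = lambda s, a, a_others: -abs(a - a_others[0])   # prefers matching the first teammate
    best = greedy_ad_hoc_action(fake_u, state=None, teammate_actions=[3], candidate_actions=actions)
    print("chosen action:", best)  # -> 3
```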
4 ODITS LEARNING FRAMEWORK

Our approach addresses the ad hoc teamwork problem with a novel probabilistic framework, ODITS. In this section, we first introduce the overall architecture of the framework and then describe each of its modules in detail.

4.1 OVERVIEW

ODITS aims to estimate the ad hoc agent's marginal utility and to choose the corresponding actions that maximize the team utility. To obtain a policy that adapts to unknown teammates, we model the marginal utility as a function conditioned on an inferred latent variable that implicitly represents the current teamwork situation. ODITS jointly optimizes the marginal utility function and the latent variable with two learning objectives in an end-to-end fashion. Fig. 2 shows the schematics of ODITS; it splits the team into two parts: the other teammates and the ad hoc agent.

Figure 2: Schematics of ODITS. The teamwork situation encoder f and decoder g, the integrating network G, and the joint action value (trained with the Q loss) are only available in the training phase. The proxy encoder f* produces z^i_t from the local information b^i_t, the proxy decoder g* parameterizes the marginal utility network M, and an MI loss ties z^i_t to c^i_t.

First, we regard the other teammates as part of the environmental dynamics perceived by the ad hoc agent. Since different combinations of teammates lead to diverse and complex dynamics, we learn a latent variable that implicitly describes the core information of the teammates' behaviors. To do this, we introduce a **teamwork situation encoder** $f$ to learn the variable. A loss function (Q loss), an **integrating network** $G$, and a **teamwork situation decoder** $g$ are then jointly used to regularize the information embedded in the learned variable $c^i_t$.

For the ad hoc agent, we want to condition its policy on the learned variable $c^i_t$. However, partial observability prevents direct access to $c^i_t$. Thus, we introduce a **proxy encoder** $f^*$ to infer a proxy representation $z^i_t$ of $c^i_t$ from local observations. We force $z^i_t$ to be informationally consistent with $c^i_t$ through an information-based loss function (MI loss). We then train a **marginal utility network** $M$ to estimate the ad hoc agent's conditional marginal utility $\hat{u}^i(\tau^i_t, a^i_t; z^i_t) \approx u^i(s_t, a^i_t, \mathbf{a}^{-i}_t)$. For conditional behavior, part of the parameters of $M$ are generated by the **proxy decoder** $g^*$. Similar to the CTDE scenario (Oliehoek et al., 2008), we relax partial observability in the training phase: ODITS is granted access to the global state $s_t$ and the other teammates' actions $\mathbf{a}^{-i}_t$ during training. During execution, $G$, $f$ and $g$ are removed, and the ad hoc agent chooses the action that maximizes the conditional marginal utility function $\hat{u}^i(\tau^i_t, a^i_t; z^i_t)$.

4.2 LEARNING TO REPRESENT TEAMWORK SITUATIONS

For adaptive behavior, we want to condition the ad hoc agent's policy on the other teammates. However, unknown teammates show complex behaviors, and conditioning the policy on them directly might make the policy volatile. To address this issue, we embed the teammates' information into a compact but descriptive representation. To make the concept precise, we formally define the teamwork situation:

**Definition 1 (teamwork situation)** At each time step $t$, the ad hoc agent is in a teamwork situation $c^i_t \in \mathcal{C}$, which is the current underlying teamwork state yielded by the environment state $s_t$ and the other teammates' actions $\mathbf{a}^{-i}_t$. It reflects the high-level semantics of the teammates' behaviors.
Though different teammates generate diverse state-action trajectories, we assume that they can induce similar teamwork situations at certain times, and that the ad hoc agent's actions affect the transitions between these situations. Once the current teamwork situation is identified, the ad hoc agent can choose actions accordingly to achieve online adaptation.

**Teamwork Situation Encoder f.** To model the uncertainty of unknown teammates, we encode teamwork situations into a stochastic embedding space $\mathcal{C}$. Any teamwork situation can thus be represented as a latent probabilistic variable $c^i$ drawn from a multivariate Gaussian distribution $\mathcal{N}(\mu_{c^i}, \sigma_{c^i})$. To capture the dependency stated in the definition, we use a trainable neural network $f$ to learn the parameters of the Gaussian distribution of $c^i$:

$$(\mu_{c^i}, \sigma_{c^i}) = f(s, \mathbf{a}^{-i}; \theta_f), \qquad c^i \sim \mathcal{N}(\mu_{c^i}, \sigma_{c^i}) \quad (4)$$

where $\theta_f$ are the parameters of $f$.

**Regularizing the Information Embedded in $c^i$.** We introduce a set of modules that jointly force $c^i$ to be sufficiently descriptive of the current teamwork situation. If $c^i_t$ captures the core knowledge about the other teammates' current behaviors, we should be able to predict the joint action value $Q^{\pi_i}(s_t, a^i_t, \mathbf{a}^{-i}_t)$ from $c^i_t$ and the ad hoc agent's marginal utility $u^i_t$ (we write $u^i_t$ as a shorthand for $u^i(\tau^i_t, a^i_t; z^i_t)$). Thus, we propose an integrating network $G$ that estimates the joint action value, $G(u^i_t, c^i_t) \approx Q^{\pi_i}(s_t, a^i_t, \mathbf{a}^{-i}_t)$. We adopt a modified asynchronous Q-learning loss (Q loss) (Mnih et al., 2016) as the optimization objective:

$$\mathcal{L}_Q = \mathbb{E}_{(u^i_t, c^i_t, r_t) \sim \mathcal{D}} \Big[ \big( r_t + \gamma \max_{a^i_{t+1}} \bar{G}(u^i_{t+1}, c^i_{t+1}) - G(u^i_t, c^i_t) \big)^2 \Big] \quad (5)$$

where $\bar{G}$ is a periodically updated target network. The expectation is estimated with uniform samples from the replay buffer $\mathcal{D}$, which stores the interactive experience with the training teammates.

**Integrating Network G.** One simple way to integrate $c^i$ with $u^i$ is to formulate $G$ as an MLP that maps their concatenation to the joint value estimate. We instead map $c^i$ to the parameters of $G$ via a hypernetwork (Ha et al., 2016), which we refer to as the teamwork situation decoder $g$; $G$ then maps the ad hoc agent's utility $u^i$ to the value estimate. This design changes how information is integrated: the decoder provides multiplicative integration, whereas the concatenation-based operation only provides additive integration and thus a weaker ability to integrate information (see Supplementary). We also show empirically that multiplicative integration stabilizes training and improves teamwork performance. In addition, we require monotonicity between $G$ and the marginal utility $u^i_t$: $\frac{\partial G}{\partial u^i_t} \ge 0$. Given any $c^i_t$, an increase in the ad hoc agent's marginal utility then results in an improved joint action value. To achieve this property, we constrain the weights of $G$ to be non-negative, $\theta_G \ge 0$.
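A minimal PyTorch sketch of how the encoder $f$ (Eq. 4) and the hypernetwork-based integrating network $G$ could be wired is given below. The layer sizes and activations are our own illustrative assumptions rather than the paper's exact architecture (the full architecture is described in Fig. 7 of the appendix); the `abs()` on the generated weights is one way to enforce $\theta_G \ge 0$, the same trick used by QMIX-style mixing networks to guarantee monotonicity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TeamworkSituationEncoder(nn.Module):
    """f (Eq. 4): maps (global state, teammates' joint action) to a diagonal Gaussian over c^i."""
    def __init__(self, state_dim, action_dim, latent_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2 * latent_dim),
        )

    def forward(self, state, teammate_actions):
        mu, log_sigma = self.net(torch.cat([state, teammate_actions], dim=-1)).chunk(2, dim=-1)
        sigma = F.softplus(log_sigma) + 1e-4                  # keep the std strictly positive
        c = mu + sigma * torch.randn_like(sigma)              # reparameterized sample of c^i
        return c, mu, sigma

class IntegratingNetwork(nn.Module):
    """G: maps the scalar marginal utility u^i to a joint-value estimate, with its weights
    generated from c^i by a hypernetwork (the teamwork situation decoder g)."""
    def __init__(self, latent_dim, hidden_dim=32):
        super().__init__()
        self.w1 = nn.Linear(latent_dim, hidden_dim)           # generates first-layer weights from c^i
        self.b1 = nn.Linear(latent_dim, hidden_dim)
        self.w2 = nn.Linear(latent_dim, hidden_dim)           # generates second-layer weights from c^i
        self.b2 = nn.Linear(latent_dim, 1)

    def forward(self, u, c):
        # u: (batch, 1) marginal utility; c: (batch, latent_dim) teamwork situation sample.
        w1 = torch.abs(self.w1(c))                            # abs() enforces dG/du >= 0
        w2 = torch.abs(self.w2(c))
        hidden = F.elu(w1 * u + self.b1(c))
        return (hidden * w2).sum(dim=-1, keepdim=True) + self.b2(c)

# Toy forward pass with made-up sizes.
enc = TeamworkSituationEncoder(state_dim=10, action_dim=6, latent_dim=3)
G = IntegratingNetwork(latent_dim=3)
c, _, _ = enc(torch.randn(4, 10), torch.randn(4, 6))
print(G(torch.randn(4, 1), c).shape)                          # -> torch.Size([4, 1])
```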
4.3 LEARNING THE CONDITIONAL MARGINAL UTILITY FUNCTION UNDER PARTIAL OBSERVABILITY

The marginal utility of the ad hoc agent clearly depends on the other teammates' behaviors: distinct behaviors result in different marginal utilities. We formalize the marginal utility network $M$ as a deep recurrent Q-network (DRQN) (Hausknecht and Stone, 2015) parameterized by $\theta_M$. To enable adaptation, we force the parameters of the final layers of $M$ to be conditioned on the learned variable $c^i_t$.

**Proxy Encoder f*.** Because of partial observability, the teamwork situation encoder $f$ is not available during execution. Thus, we introduce a proxy encoder $f^*$ to estimate $c^i_t$ from the local transition data $b^i_t = (o^i_t, r_{t-1}, a^i_{t-1}, o^i_{t-1})$. We assume that $b^i_t$ partly reflects the current teamwork situation, since the transition implicitly indicates the underlying dynamics, which are primarily influenced by the other teammates' behaviors. We denote the estimate of $c^i_t$ by $z^i_t$. Then $z^i_t$ is fed into a proxy decoder $g^*(z^i_t; \theta_{g^*})$ parameterized by $\theta_{g^*}$ to generate the parameters $\theta_M$ of $M$, enabling the marginal utility function to condition on the proxy representation $z^i_t$. Similar to $c^i$, we encode $z^i$ in a stochastic embedding space:

$$(\mu_{z^i}, \sigma_{z^i}) = f^*(b^i; \theta_{f^*}), \qquad z^i \sim \mathcal{N}(\mu_{z^i}, \sigma_{z^i}) \quad (6)$$

where $\theta_{f^*}$ are the parameters of $f^*$.

**Regularizing the Information Embedded in $z^i$.** To make $z^i_t$ identifiable, we require $z^i_t$ to be informationally consistent with $c^i_t$. We therefore introduce an information-based loss function $\mathcal{L}_{MI}$ that maximizes the conditional mutual information $I(z^i_t; c^i_t \mid b^i_t)$ between the proxy variables and the true variables. Since estimating and maximizing mutual information directly is often infeasible, we introduce a variational distribution $q_\xi(z^i_t \mid c^i_t, b^i_t)$ parameterized by $\xi$ to derive a tractable lower bound (Alemi et al., 2017):

$$I(z^i_t; c^i_t \mid b^i_t) \ge \mathbb{E}_{z^i_t, c^i_t, b^i_t} \left[ \log \frac{q_\xi(z^i_t \mid c^i_t, b^i_t)}{p(z^i_t \mid b^i_t)} \right] \quad (7)$$

where $p(z^i_t \mid b^i_t)$ is the Gaussian distribution $\mathcal{N}(\mu_{z^i}, \sigma_{z^i})$. This lower bound can be rewritten as a loss function to be minimized:

$$\mathcal{L}_{MI}(\theta_{f^*}, \xi) = \mathbb{E}_{(b^i_t, s_t, \mathbf{a}^{-i}_t) \sim \mathcal{D}} \big[ D_{KL}\big[ p(z^i_t \mid b^i_t) \,\|\, q_\xi(z^i_t \mid c^i_t, b^i_t) \big] \big] \quad (8)$$

where $\mathcal{D}$ is the replay buffer and $D_{KL}[\cdot \| \cdot]$ is the KL divergence operator. The detailed derivation can be found in the Supplementary.
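Because both $p(z^i_t \mid b^i_t)$ (from the proxy encoder) and $q_\xi(z^i_t \mid c^i_t, b^i_t)$ (from the variational estimator) are diagonal Gaussians, Eq. (8) reduces to a closed-form KL divergence. The sketch below is our illustration of that computation; the function names are ours, and the stand-in encoders in the toy check are not the paper's modules.

```python
import torch

def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """KL( N(mu_p, diag(sigma_p^2)) || N(mu_q, diag(sigma_q^2)) ), summed over latent dims."""
    var_p, var_q = sigma_p.pow(2), sigma_q.pow(2)
    kl = torch.log(sigma_q / sigma_p) + (var_p + (mu_p - mu_q).pow(2)) / (2.0 * var_q) - 0.5
    return kl.sum(dim=-1)

def mi_loss(proxy_encoder, variational_estimator, b, c):
    """L_MI (Eq. 8): KL between the proxy posterior p(z|b) and the variational
    distribution q_xi(z|c,b), averaged over a batch sampled from the replay buffer.
    Both callables are assumed to return a (mu, sigma) pair."""
    mu_p, sigma_p = proxy_encoder(b)                # p(z | b), uses only local information
    mu_q, sigma_q = variational_estimator(c, b)     # q_xi(z | c, b), available only in training
    return gaussian_kl(mu_p, sigma_p, mu_q, sigma_q).mean()

# Toy check with stand-in encoders returning (mu, sigma).
b, c = torch.randn(8, 4), torch.randn(8, 3)
proxy = lambda b: (b[:, :2], torch.ones(b.shape[0], 2))
q_est = lambda c, b: (torch.zeros(b.shape[0], 2), torch.ones(b.shape[0], 2))
print(float(mi_loss(proxy, q_est, b, c)))
```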
4.4 OVERALL OPTIMIZATION OBJECTIVE

In the end, the overall objective becomes:

$$\mathcal{L}(\theta) = \mathcal{L}_Q(\theta) + \lambda \mathcal{L}_{MI}(\theta_{f^*}, \xi) \quad (9)$$

where $\theta = (\theta_f, \theta_g, \theta_M, \theta_{f^*}, \theta_{g^*}, \xi)$ and $\lambda$ is a scaling factor.

During the training phase, the ad hoc agent interacts with different training teammates and collects transition data into the replay buffer $\mathcal{D}$. Samples from $\mathcal{D}$ are then fed into the framework to update all parameters with gradients induced by the overall loss. During execution, the ad hoc agent conditions its behavior on the inferred teamwork situations by choosing actions that maximize the conditional utility function $\hat{u}^i(\tau^i_t, a^i_t; z^i_t)$. We summarize the training and testing procedures in Algorithm 1 and Algorithm 2.

**Algorithm 1 ODITS Training**
Require: batch of training teammates' behavioral policies $\{\pi_j^{-i}\}^{tr}_{j=1,\dots,J}$; learning rate $\alpha$; scaling factor $\lambda$.
1: Initialize the replay buffer $\mathcal{D}$
2: while not done do
3:   for $k = 1, \dots, K$ do
4:     Sample the teammates' policies $\pi_j^{-i}$ from $\{\pi_j^{-i}\}^{tr}_{j=1,\dots,J}$
5:     Sample data $D_k = \{(s_t, a^i_t, \mathbf{a}^{-i}_t, r_t)\}_{t=1,\dots,T}$ using the ad hoc agent's policy $\pi_i$ and $\pi^{-i}$
6:     Add $D_k$ to $\mathcal{D}$
7:   for each training step do
8:     Sample one trajectory $D \sim \mathcal{D}$
9:     for $t = 1, \dots, T-1$ do
10:      Compute $(\mu_{c^i_t}, \sigma_{c^i_t}) = f(s_t, \mathbf{a}^{-i}_t)$ and sample $c^i_t \sim \mathcal{N}(\mu_{c^i_t}, \sigma_{c^i_t})$
11:      Compute $(\mu_{z^i_t}, \sigma_{z^i_t}) = f^*(b^i_t)$ and sample $z^i_t \sim \mathcal{N}(\mu_{z^i_t}, \sigma_{z^i_t})$
12:      Compute $u^i_t(\tau^i_t, a^i_t; z^i_t)$ and $G(u^i_t; c^i_t)$
13:      Compute $\mathcal{L}_Q$ and $\mathcal{L}_{MI}$
14:      $\theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}_Q$
15:      $\theta_{f^*} \leftarrow \theta_{f^*} - \lambda \alpha \nabla_{\theta_{f^*}} \mathcal{L}_{MI}$
16:      $\xi \leftarrow \xi - \lambda \alpha \nabla_\xi \mathcal{L}_{MI}$

**Algorithm 2 ODITS Testing**
Require: testing teammates' behavioral policies $\pi^{-i}$.
1: for $t = 1, \dots, T$ do
2:   Generate teammates' actions $\mathbf{a}^{-i}_t \sim \pi^{-i}$
3:   Compute $(\mu_{z^i_t}, \sigma_{z^i_t}) = f^*(b^i_t)$ and sample $z^i_t \sim \mathcal{N}(\mu_{z^i_t}, \sigma_{z^i_t})$
4:   Take the action $a^i_t$ that maximizes $u^i(\tau^i_t, a^i_t; z^i_t)$
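For concreteness, the following self-contained toy script mirrors the inner update of Algorithm 1 (lines 8-16): sample $c^i_t$ from $f$, sample $z^i_t$ from $f^*$, compute the conditional utility and $G(u, c)$, and take one optimization step on $\mathcal{L}_Q + \lambda \mathcal{L}_{MI}$. All module sizes, the random data, and the simplified regression target in place of the TD target are our own stand-ins; only the structure of the update follows the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, STATE, OBS, LATENT, N_ACTIONS = 32, 10, 6, 3, 5

f_enc   = nn.Linear(STATE, 2 * LATENT)        # teamwork situation encoder f (training only)
f_proxy = nn.Linear(OBS,   2 * LATENT)        # proxy encoder f*
q_var   = nn.Linear(STATE + OBS, 2 * LATENT)  # variational estimator q_xi (stand-in input: (s, b))
g_hyper = nn.Linear(LATENT, N_ACTIONS)        # stand-in for the proxy decoder g*
M_body  = nn.Linear(OBS, N_ACTIONS)           # stand-in for the DRQN body of M
G_w     = nn.Linear(LATENT, 1)                # integrating-network weight generated from c

params = (list(f_enc.parameters()) + list(f_proxy.parameters()) + list(q_var.parameters())
          + list(g_hyper.parameters()) + list(M_body.parameters()) + list(G_w.parameters()))
opt = torch.optim.RMSprop(params, lr=5e-4)

def gaussian(p):
    mu, log_sig = p.chunk(2, dim=-1)
    sig = F.softplus(log_sig) + 1e-4
    return mu, sig, mu + sig * torch.randn_like(sig)

# Dummy batch "sampled from the replay buffer".
s, b, r = torch.randn(B, STATE), torch.randn(B, OBS), torch.randn(B, 1)
a = torch.randint(0, N_ACTIONS, (B, 1))

mu_c, sig_c, c = gaussian(f_enc(s))                 # c^i_t from global information
mu_z, sig_z, z = gaussian(f_proxy(b))               # z^i_t from local information
mu_q, sig_q, _ = gaussian(q_var(torch.cat([s, b], -1)))

u_all = M_body(b) + g_hyper(z)                      # utilities conditioned on z (toy conditioning)
u = u_all.gather(1, a)                              # u^i_t for the taken action
q_hat = torch.abs(G_w(c)) * u                       # monotonic multiplicative integration G(u, c)

loss_q = F.mse_loss(q_hat, r)                       # stand-in for the TD objective of Eq. (5)
kl = (torch.log(sig_q / sig_z) + (sig_z**2 + (mu_z - mu_q)**2) / (2 * sig_q**2) - 0.5).sum(-1).mean()
loss = loss_q + 1e-3 * kl                           # Eq. (9) with lambda = 1e-3
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```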
5 EXPERIMENTS

We now empirically evaluate ODITS on a range of new and existing domains. All experiments are carried out with 4 different random seeds, and results are shown with a 95% confidence interval over the standard deviation. In the following, we refer to the teammates that interact with the ad hoc agent during the training phase as training teammates and to the teammates with unknown policies as testing teammates; "teammate types" refers to the policy types of teammates. All experimental results report the average teamwork performance when the ad hoc agent cooperates with different testing teammates. Additional experiments, further experimental details, and implementation details of all models can be found in the Supplementary.

5.1 PREDATOR PREY

**Configurations.** In this environment, $m$ homogeneous predators try to capture $n$ randomly moving prey in a 7×7 grid world. Each predator has six actions: moving in one of four directions, capturing, and waiting at a cell. Due to partial observability, each predator can only access the environmental information within two cells of itself. There are also two obstacles at random locations. Episodes are 40 steps long. The predators get a team reward of 500 if two or more of them capture the same prey at the same time, and they are penalized with -10 if only one of them tries to capture a prey. We adopt three settings to test the ad hoc agent's ability to cooperate with different numbers of teammates: 2 predators and 1 prey (2d1y), 4 predators and 2 prey (4d2y), and 8 predators and 4 prey (8d4y).

We compare our method with three type-based baselines: AATEAM (Chen et al., 2020), ConvCPD (Ravula et al., 2019), and PLASTIC (Barrett and Stone, 2015). Note that these approaches assume the ad hoc agent has full visibility of the environment; to apply them in partially observed settings, we replace the full state information they use with the partial observations of the ad hoc agent. We also compare two other strategies: (i) Random: the ad hoc agent chooses actions randomly; (ii) Combined: the ad hoc agent uses a DQN algorithm to learn a single policy from the data collected with all possible teammates. This intuitive baseline treats the problem as vanilla single-agent learning, where the agent ignores the differences between its teammates.

Before training, we first require a teammate set consisting of various behavioral policies. Instead of hand-crafting cooperative teammate policies, we train a set of distinct policies automatically. We first use 5 different MARL algorithms (e.g., VDN (Sunehag et al., 2018) and QMIX (Rashid et al., 2018)) to develop several teams of agents. To ensure diversity, we use different random seeds for each algorithm and save the corresponding models at 3 different checkpoints (3, 4, and 5 million steps). We then manually select 15 policies with distinct policy representations (Grover et al.) from all developed models and randomly sample 8 of them as the training set and the other 7 as the testing set. During training, we define 8 teammate types corresponding to the 8 training policies for the type-based approaches; each algorithm then develops its model from interactive experience with the training teammates. All algorithms are trained for 4.5 million time steps. We report the number of prey captured when the ad hoc agent cooperates with the testing teammates throughout training. See the Supplementary for further settings.

Figure 3: Performance comparison across various scenarios for Predator Prey (top panel) and Save the City (bottom panel).

**Results.** The top panel of Fig. 3 reports the results across the 3 scenarios. We first observe that ODITS captures the most prey across varying numbers of teammates, verifying its effectiveness. ODITS also tends to give more consistent results than the other methods across different difficulties. The 3 type-based baselines and ODITS all perform better than the random and combined policies, indicating that they indeed lead to behaviors that adapt to different teammates. Furthermore, the random strategy captures some prey on 4d2y and 8d4y but completely fails on 2d1y. This indicates that, even without cooperative behavior from the ad hoc agent, the other teammates can coordinate with each other to achieve the goal. The combined policy performs worse than the random policy on two scenarios (4d2y and 8d4y). This might be because the combined policy exhibits behaviors that conflict with those of the other teammates.
As the number of teammates increases, the accumulating effect of these conflicts leads to serious miscoordination.

5.2 SAVE THE CITY

**Configurations.** This is a grid-world resource allocation task presented in (Iqbal et al., 2020). There are 3 distinct types of agents, and their goal is to complete the construction of all buildings on the map while preventing them from burning down. Each agent has 8 actions: stay in place, move to the next cell in one of the four cardinal directions, put out a fire, and build. Agents get a team reward of 100 when they complete a building and are penalized with -500 when a building burns down. Agent types include firefighters (20x speedup over the base rate in reducing fires), builders (20x speedup in building), and generalists (5x speedup in both, plus 2x speedup in moving). Buildings come in two varieties, fast-burning and slow-burning, where fast-burning buildings burn four times faster. In our experiments, each agent can only access the environmental information within four cells of itself. We adopt three scenarios: 2 agents and 2 buildings (2a2b), 4 agents and 3 buildings (4a3b), and 6 agents and 4 buildings (6a4b). Similar to the training settings in Predator Prey, we select 15 distinct behavioral policies for the teammate set and randomly partition them into 8 training policies and 7 testing policies. For all algorithms, agents are trained for 4.5 million time steps. We report the number of completed buildings when the ad hoc agent cooperates with the testing teammates throughout training. See the Supplementary for further settings.

**Results.** The bottom panel of Fig. 3 reports the results across the 3 scenarios. We first observe that ODITS outperforms the other baselines, verifying its effectiveness. Since this setting forces all agents in the environment to be heterogeneous, the results also underpin the robustness of ODITS. Interestingly, the combined policy performs better than the type-based approaches, which is inconsistent with the Predator Prey results. Our intuition is that Save the City demands less cooperative behavior than Predator Prey: an agent in Save the City can complete buildings individually without strongly needing to coordinate with its teammates, whereas a single predator cannot capture prey by itself. As a result, the combined policy learns a universal and effective policy from more interactive experience with different teammates, while type-based approaches suffer because developing distinct cooperative behaviors destabilizes the ad hoc agent. This hypothesis is also supported empirically by our ablations.

5.3 COMPARISON WITH MARL

To compare ODITS with MARL, we implement the commonly used algorithm QMIX (Rashid et al., 2018) as a baseline. Similar to the training procedure of ODITS, we fix one agent and train it with teammates randomly sampled from a pool of 8 policies. The gradients for updating the teammates' policies are blocked, but the mixing network is updated as in the original implementation of QMIX.

Figure 4: ODITS vs. QMIX on Predator Prey 4d2y and Save the City 4a3b.

Figure 4 shows the comparison of ODITS and QMIX on Predator Prey 4d2y and Save the City 4a3b. In both environments, QMIX performs significantly worse than ODITS.
This is not quite surprising, because MARL algorithms usually assume that all teammates are fixed. Therefore, although it is trained with multiple teammates, the agent under the QMIX framework does not learn to cooperate with an impromptu team.

5.4 ABLATIONS

We perform several ablations on Predator Prey 4d2y and Save the City 4a3b to determine the importance of each component of ODITS.

Figure 5: Ablations for different components on Predator Prey 4d2y and Save the City 4a3b.

**Adaptive Behaviors.** We first remove the information-based loss $\mathcal{L}_{MI}$ from the overall learning objective (denoted w/o info.). Fig. 5 shows that without $\mathcal{L}_{MI}$ regularizing the information embedded in $z^i_t$, ODITS yields worse teamwork performance. This indicates that increasing the mutual information between the proxy variable and the true variable indeed produces better representations of teamwork situations. We next consider how the inferred teamwork situation variables affect the ad hoc agent's adaptive behavior. We remove the proxy encoder and set $z^i_t$ to a fixed, randomly generated vector (denoted w/o infer.). As shown in Fig. 5, conditioning on a random signal leads to a further drop in performance, indicating that irrelevant signals cannot help the ad hoc agent develop adaptive policies.

**Integrating Mechanism.** We remove the teamwork situation encoder as well as $\mathcal{L}_{MI}$ from the framework and feed a vector of ones into the teamwork situation decoder (labeled w/o integ.). In this setting, ODITS no longer integrates the ad hoc agent's marginal utility with information about the teammates' behaviors. Compared with w/o info., this causes a larger drop in teamwork performance. One intuition is that predicting the joint action value plays an essential role in estimating the marginal utility: if the integrating network has no information about the other teammates' behaviors, it cannot accurately predict the joint action value, which destabilizes the marginal utility estimation. Despite the empirical evidence supporting this argument, it would be interesting to develop further theoretical insight into this training regime in future work. Finally, we consider the additive integration mechanism mentioned in Section 4.2 (labeled additive). Although additive integration performs well in Save the City, it performs poorly in Predator Prey, indicating that multiplicative integration provides a more stable and effective way to integrate information from the teammates and the ad hoc agent.

Interestingly, we also find that most ablations hurt more in Predator Prey than in Save the City. We believe this is due to the different levels of cooperation required by the two environments: a prey is captured only when two nearby predators capture it simultaneously, whereas a burning building can be constructed by an individual agent. Therefore, removing mechanisms that promote cooperative behavior leads to worse performance in Predator Prey.

6 CONCLUSIONS

This paper proposes a novel adaptive reinforcement learning algorithm called ODITS to address the challenging ad hoc teamwork problem. Without the need to predefine teammate types, ODITS automatically learns compact but descriptive variables to infer how to coordinate with previously unknown teammates' behaviors.
To overcome partial observability, we introduce an information-based regularizer to estimate proxy representations of the learned variables from local observations. Experimental results show that ODITS obtains superior performance compared to various baselines on several complex ad hoc teamwork benchmarks.

ACKNOWLEDGMENTS

This research was supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG-RP-2019-0013), National Satellite of Excellence in Trustworthy Software Systems (Award No: NSOETSS2019-01), and NTU.

REFERENCES

S. V. Albrecht and P. Stone. Reasoning about hypothetical agent behaviours and their parameters. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, 2017.

S. V. Albrecht and P. Stone. Autonomous agents modelling other agents: A comprehensive survey and open problems. Artificial Intelligence, 258:66–95, 2018.

A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy. Deep variational information bottleneck. In International Conference on Learning Representations, 2017.

N. Bard, J. N. Foerster, S. Chandar, N. Burch, M. Lanctot, H. F. Song, E. Parisotto, V. Dumoulin, S. Moitra, E. Hughes, I. Dunning, S. Mourad, H. Larochelle, M. G. Bellemare, and M. Bowling. The Hanabi challenge: A new frontier for AI research. Artificial Intelligence, 280:103216, 2020.

S. Barrett and P. Stone. Cooperating with unknown teammates in complex domains: A robot soccer case study of ad hoc teamwork. In Proceedings of the AAAI Conference on Artificial Intelligence, 2015.

R. Canaan, J. Togelius, A. Nealen, and S. Menzel. Diverse agents for ad-hoc cooperation in Hanabi. arXiv preprint arXiv:1907.03840, 2019.

S. Chen, E. Andrejczuk, Z. Cao, and J. Zhang. AATEAM: Achieving the ad hoc teamwork by employing the attention mechanism. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7095–7102, 2020.

I. Durugkar, E. Liebman, and P. Stone. Balancing individual preferences and shared objectives in multiagent reinforcement learning. In Proceedings of the International Joint Conference on Artificial Intelligence, 2020.

J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, 2017.

K. Genter and P. Stone. Role-based ad hoc teamwork. In Workshops at the AAAI Conference on Artificial Intelligence, 2011.

A. Grover, M. Al-Shedivat, J. K. Gupta, Y. Burda, and H. Edwards. Learning policy representations in multiagent systems. In Proceedings of the International Conference on Machine Learning, pages 1802–1811.

D. Ha, A. Dai, and Q. V. Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.

M. Hausknecht and P. Stone. Deep recurrent Q-learning for partially observable MDPs. In Proceedings of the AAAI Conference on Artificial Intelligence, 2015.

H. He, J. Boyd-Graber, K. Kwok, and H. Daumé. Opponent modeling in deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, pages 1804–1813, 2016.

S. Iqbal, C. A. S. de Witt, B. Peng, W. Böhmer, S. Whiteson, and F. Sha. Randomized entity-wise factorization for multi-agent reinforcement learning. arXiv preprint arXiv:2006.04222, 2020.

A. Mahajan, T. Rashid, M. Samvelyan, and S. Whiteson. MAVEN: Multi-agent variational exploration. In Advances in Neural Information Processing Systems, pages 7613–7624, 2019.
R. Mirsky, W. Macke, A. Wang, H. Yedidsion, and P. Stone. A penny for your thoughts: The value of communication in ad hoc teamwork. In Proceedings of the International Joint Conference on Artificial Intelligence, 2020.

V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, pages 1928–1937, 2016.

P. Muller, S. Omidshafiei, M. Rowland, K. Tuyls, J. Perolat, S. Liu, D. Hennes, L. Marris, M. Lanctot, E. Hughes, Z. Wang, G. Lever, N. Heess, T. Graepel, and R. Munos. A generalized training approach for multiagent learning. In International Conference on Learning Representations, 2020.

F. A. Oliehoek, M. T. Spaan, N. Vlassis, and S. Whiteson. Exploiting locality of interaction in factored Dec-POMDPs. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, pages 517–524, 2008.

G. Papoudakis and S. V. Albrecht. Variational autoencoders for opponent modeling in multi-agent systems. In AAAI Workshop on Reinforcement Learning in Games, 2020.

M. A. Rahman, N. Hopner, F. Christianos, and S. V. Albrecht. Towards open ad hoc teamwork using graph-based policy learning. In Proceedings of the International Conference on Machine Learning, pages 8776–8786, 2021.

R. Raileanu, E. Denton, A. Szlam, and R. Fergus. Modeling others using oneself in multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning, pages 4257–4266.

R. Raileanu, M. Goldstein, A. Szlam, and R. Fergus. Fast adaptation to new environments via policy-dynamics value functions. In Proceedings of the International Conference on Machine Learning, pages 7920–7931, 2020.

T. Rashid, M. Samvelyan, C. S. de Witt, G. Farquhar, J. Foerster, and S. Whiteson. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning, pages 1228–1236, 2018.

M. Ravula, S. Alkobi, and P. Stone. Ad hoc teamwork with behavior switching agents. In Proceedings of the International Joint Conference on Artificial Intelligence, 2019.

M. Samvelyan, T. Rashid, C. S. de Witt, G. Farquhar, N. Nardelli, T. G. Rudner, C.-M. Hung, P. H. Torr, J. Foerster, and S. Whiteson. The StarCraft multi-agent challenge. arXiv preprint arXiv:1902.04043, 2019.

P. Stone, G. A. Kaminka, S. Kraus, and J. S. Rosenschein. Ad hoc autonomous agent teams: Collaboration without pre-coordination. In Proceedings of the AAAI Conference on Artificial Intelligence, 2010.

P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. F. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls, et al. Value-decomposition networks for cooperative multi-agent learning based on team reward. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, pages 2085–2087, 2018.

M. Suzuki, K. Nakayama, and Y. Matsuo. Joint multimodal learning with deep generative models. arXiv preprint arXiv:1611.01891, 2016.

A. Tacchetti, H. F. Song, P. A. M. Mediano, V. Zambaldi, J. Kramár, N. C. Rabinowitz, T. Graepel, M. Botvinick, and P. W. Battaglia. Relational forward models for multi-agent learning. In International Conference on Learning Representations, 2019.

M. Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the International Conference on Machine Learning, 1993.
T. Wang, H. Dong, V. Lesser, and C. Zhang. ROMA: Multi-agent reinforcement learning with emergent roles. In Proceedings of the International Conference on Machine Learning, 2020.

T. Wang, T. Gupta, A. Mahajan, B. Peng, S. Whiteson, and C. Zhang. RODE: Learning roles to decompose multi-agent tasks. In International Conference on Learning Representations, 2021.

G. N. Yannakakis. Game AI revisited. In Proceedings of the 9th Conference on Computing Frontiers, pages 285–292, 2012.

H. Yin, F. Melo, A. Billard, and A. Paiva. Associate latent encodings in learning from demonstrations. In Proceedings of the AAAI Conference on Artificial Intelligence, 2017.

L. Zintgraf, S. Devlin, K. Ciosek, S. Whiteson, and K. Hofmann. Deep interactive Bayesian reinforcement learning via meta-learning. arXiv preprint arXiv:2101.03864, 2021.

A APPENDIX

A.1 MATHEMATICAL DERIVATION

A.1.1 ADDITIVE VS. MULTIPLICATIVE INTEGRATION IN THE INTEGRATING NETWORK

Concretely, consider a simple example in which $G$ is a one-layer network. The information aggregation of additive integration (abbreviated $G_{add}$) can be written as:

$$G_{add} = \mathcal{F}(W_u u^i + W_c c^i + b) \quad (10)$$

where $W_u$ and $W_c$ are the weight vectors of $u^i$ and $c^i$, respectively, $b$ is the bias vector, and $\mathcal{F}$ is the activation function. By comparison, multiplicative integration (abbreviated $G_{mul}$) can be written as:

$$G_{mul} = \mathcal{F}(W_c(c^i) W_u u^i + b(c^i)) \quad (11)$$

where $W_c(c^i)$ and $b(c^i)$ are functions of $c^i$ that generate the weight vectors. Compared with $G_{add}$, $G_{mul}$ changes the integrating network from first order to second order while introducing no extra parameters. The teamwork situation information directly scales the ad hoc agent's information through the term $W_c(c^i) W_u u^i$, rather than via the more subtle additive form $W_u u^i + W_c c^i$. Moreover, $G_{mul}$ has an advantage over $G_{add}$ in its gradient properties. For additive integration, the gradient $\frac{\partial G_{add}}{\partial c^i}$ is:

$$\frac{\partial G_{add}}{\partial c^i} = W_c \mathcal{F}' \quad (12)$$

where $\mathcal{F}' = \mathcal{F}'(W_u u^i + W_c c^i + b)$. This shows that the gradient depends heavily on the matrix $W_c$, while $W_u$ and $u^i$ play a limited role: they only enter through the derivative $\mathcal{F}'$, mixed with $W_u u^i$. By comparison, $\frac{\partial G_{mul}}{\partial c^i}$ is:

$$\frac{\partial G_{mul}}{\partial c^i} = \big[ W_u u^i W_c' + b' \big] \mathcal{F}' \quad (13)$$

where $W_c' = W_c'(c^i)$ and $b' = b'(c^i)$. Here $W_u u^i$ is directly involved in the gradient computation by gating $W_c'$, and is hence more capable of shaping the updates of the learning procedure. This naturally leads to better information integration.
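To make the contrast concrete, here is a small PyTorch sketch of the two one-layer integrators in Eqs. (10)-(11). Layer sizes and the ELU activation are our own illustrative choices; only the structural difference (additive shift vs. hypernetwork-generated, multiplicative weights) follows the derivation above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveIntegration(nn.Module):
    """G_add (Eq. 10): c^i only shifts the pre-activation additively."""
    def __init__(self, latent_dim, hidden_dim=32):
        super().__init__()
        self.W_u = nn.Linear(1, hidden_dim, bias=False)
        self.W_c = nn.Linear(latent_dim, hidden_dim)     # contributes W_c c^i + b

    def forward(self, u, c):
        return F.elu(self.W_u(u) + self.W_c(c))

class MultiplicativeIntegration(nn.Module):
    """G_mul (Eq. 11): c^i generates the weights that scale W_u u^i (second-order interaction)."""
    def __init__(self, latent_dim, hidden_dim=32):
        super().__init__()
        self.W_u = nn.Linear(1, hidden_dim, bias=False)
        self.gen_w = nn.Linear(latent_dim, hidden_dim)   # W_c(c^i)
        self.gen_b = nn.Linear(latent_dim, hidden_dim)   # b(c^i)

    def forward(self, u, c):
        return F.elu(self.gen_w(c) * self.W_u(u) + self.gen_b(c))

# Both take the scalar marginal utility u^i and a teamwork-situation sample c^i.
u, c = torch.randn(4, 1), torch.randn(4, 8)
print(AdditiveIntegration(8)(u, c).shape, MultiplicativeIntegration(8)(u, c).shape)
```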
A.1.2 MUTUAL INFORMATION LOSS FUNCTION $\mathcal{L}_{MI}$

To force the proxy representation of the learned latent variable to be informationally consistent with the true latent variable, we propose to maximize the mutual information between the proxy representations and the latent variables. We introduce a posterior estimator and derive a tractable lower bound of the mutual information term:

$$I(z^i_t; c^i_t \mid b^i_t) = \mathbb{E}_{z^i_t, c^i_t, b^i_t}\left[\log \frac{p(z^i_t \mid c^i_t, b^i_t)}{p(z^i_t \mid b^i_t)}\right] = \mathbb{E}_{z^i_t, c^i_t, b^i_t}\left[\log \frac{q_\xi(z^i_t \mid c^i_t, b^i_t)}{p(z^i_t \mid b^i_t)}\right] + \mathbb{E}_{c^i_t, b^i_t}\big[D_{KL}(p(z^i_t \mid c^i_t, b^i_t) \,\|\, q_\xi(z^i_t \mid c^i_t, b^i_t))\big] \ge \mathbb{E}_{z^i_t, c^i_t, b^i_t}\left[\log \frac{q_\xi(z^i_t \mid c^i_t, b^i_t)}{p(z^i_t \mid b^i_t)}\right] \quad (14)$$

where the last inequality holds because of the non-negativity of the KL divergence. It then follows that:

$$\mathbb{E}_{z^i_t, c^i_t, b^i_t}\left[\log \frac{q_\xi(z^i_t \mid c^i_t, b^i_t)}{p(z^i_t \mid b^i_t)}\right] = \mathbb{E}_{z^i_t, c^i_t, b^i_t}\big[\log q_\xi(z^i_t \mid c^i_t, b^i_t)\big] - \mathbb{E}_{z^i_t, b^i_t}\big[\log p(z^i_t \mid b^i_t)\big] = \mathbb{E}_{z^i_t, c^i_t, b^i_t}\big[\log q_\xi(z^i_t \mid c^i_t, b^i_t)\big] + \mathbb{E}_{b^i_t}\big[H(z^i_t \mid b^i_t)\big] = \mathbb{E}_{c^i_t, b^i_t}\left[\int p(z^i_t \mid c^i_t, b^i_t) \log q_\xi(z^i_t \mid c^i_t, b^i_t)\, dz^i_t\right] + \mathbb{E}_{b^i_t}\big[H(z^i_t \mid b^i_t)\big] \quad (15)$$

The proxy encoder is conditioned only on the transition data. Given the transitions, the distribution of the proxy representations $p(z^i_t)$ is independent of the local histories. Thus, we have

$$I(z^i_t; c^i_t \mid b^i_t) \ge -\mathbb{E}_{c^i_t, b^i_t}\big[CE[p(z^i_t \mid c^i_t, b^i_t) \,\|\, q_\xi(z^i_t \mid c^i_t, b^i_t)]\big] + \mathbb{E}_{b^i_t}\big[H(z^i_t \mid b^i_t)\big] \quad (16)$$

where $CE[\cdot \| \cdot]$ denotes the cross-entropy operator. In practice, we sample data from the replay buffer $\mathcal{D}$ and minimize

$$\mathcal{L}_{MI}(\theta_{f^*}, \xi) = \mathbb{E}_{(c^i_t, b^i_t) \sim \mathcal{D}}\big[CE[p(z^i_t \mid c^i_t, b^i_t) \,\|\, q_\xi(z^i_t \mid c^i_t, b^i_t)]\big] - \mathbb{E}_{b^i_t}\big[H(z^i_t \mid b^i_t)\big] = \mathbb{E}_{(b^i_t, s_t, \mathbf{a}^{-i}_t) \sim \mathcal{D}}\big[D_{KL}[p(z^i_t \mid b^i_t) \,\|\, q_\xi(z^i_t \mid c^i_t, b^i_t)]\big] \quad (17)$$

A.2 ARCHITECTURE, HYPERPARAMETERS, AND INFRASTRUCTURE

A.2.1 ODITS

Details of the neural network architectures used by ODITS in all environments are provided in Fig. 7. We control the scale of the loss function by controlling the optimization procedure, which uses RMSprop with a learning rate of 5 × 10⁻⁴, α = 0.99, and no momentum or weight decay. For the scaling factor λ, we search over {1e-5, 1e-4, 5e-4, 1e-3, 5e-3, 1e-2} and adopt λ_MI = 1e-3, 1e-3, and 5e-4 for Modified Coin Game, Predator Prey, and Save the City, respectively, since these values induce the best performance. For the dimension of the latent variables $z^i_t$ and $c^i_t$, we search over {1, 2, 3, 5, 10} and adopt |z| = 10 in Save the City and |z| = 1 in the other environments; in addition, we set |c| = |z|. For exploration, we use ε-greedy with ε annealed from 1.0 to 0.05 over 50,000 time steps and kept constant for the rest of training. Batches of 128 episodes are sampled from the replay buffer, and all components of the framework are trained together in an end-to-end fashion.
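The settings above can be condensed into a small configuration sketch for quick reference. The names (e.g., `ODITS_CONFIG`, `epsilon_at`) and the assumption of a linear ε schedule are ours; any hyperparameter not stated in the text (such as the target-update interval for ODITS itself) is intentionally left out.

```python
# Condensed restatement of the A.2.1 hyperparameters (names are illustrative).
ODITS_CONFIG = {
    "optimizer": "RMSprop",
    "learning_rate": 5e-4,
    "rmsprop_alpha": 0.99,
    "momentum": 0.0,
    "weight_decay": 0.0,
    "lambda_mi": {               # scaling factor for L_MI, chosen per environment
        "modified_coin_game": 1e-3,
        "predator_prey": 1e-3,
        "save_the_city": 5e-4,
    },
    "latent_dim": {              # |z| = |c|
        "save_the_city": 10,
        "default": 1,
    },
    "epsilon_greedy": {"start": 1.0, "end": 0.05, "anneal_steps": 50_000},
    "batch_size_episodes": 128,
}

def epsilon_at(step, cfg=ODITS_CONFIG["epsilon_greedy"]):
    """Epsilon at a given step, assuming a linear anneal over `anneal_steps`."""
    frac = min(step / cfg["anneal_steps"], 1.0)
    return cfg["start"] + frac * (cfg["end"] - cfg["start"])

print(epsilon_at(0), epsilon_at(25_000), epsilon_at(1_000_000))  # 1.0, 0.525, 0.05
```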
A.2.2 BASELINES

We compare ODITS with 5 baselines. For the random strategy, the ad hoc agent chooses its action at each time step uniformly at random. For the combined strategy, a three-layer DQN is trained to obtain the ad hoc agent's policy; details of the DQN architecture are shown in Fig. 7. We sample a type of training teammate at each training episode and collect the interaction experience into the replay buffer. The optimization procedure and exploration scheme are the same as those of ODITS. We set the batch size to 128 and the target update interval to 200 episodes. For the three type-based baselines, we assume that each training policy of the teammates corresponds to a predefined teammate type, so the number of predefined types equals the number of training policies. We construct a set of three-layer DQNs to learn the policies for all training types; each DQN learns only from the interaction experience with the corresponding teammates. Training settings for these DQNs are the same as those used for the combined strategy. Furthermore, to apply these baselines in partially observable environments, we replace the state information they use with partial observations.

Figure 7: Architecture details of ODITS and the baselines (teamwork situation encoder and decoder, proxy encoder and decoder, integrating network, marginal utility network, variational posterior estimator, baseline DQN, and CPD network).

For PLASTIC (Barrett and Stone, 2015), we set the parameter η used in the UpdateBelief function to 0.2. For ConvCPD (Ravula et al., 2019), we follow the implementation details in the original paper. We construct the architecture of the Change Point Detection (CPD) network as illustrated in Fig. 7, where $n_{tp}$ is the number of predefined types, $n_t$ is the number of time steps considered by ConvCPD, and $n_{classes} = n_{tp} \times (n_{tp} - 1) + 1$ is the number of classes of change points. The ConvCPD algorithm is trained with $n_{classes} \times 1000$ samples, with 1000 samples per class (batch size = 64, learning rate = 0.01, decay = 0.1, optimizer = SGD). For AATEAM (Chen et al., 2020), we follow its proposed implementation details: the GRU has 5 hidden layers, the maximum length of a state sequence processed by the attention networks is 60, the dropout rate is 0.1, and during training the default learning rate is 0.01 with a batch size of 32.

A.3 MODIFIED COIN GAME

To show the difference between ODITS and type-based approaches, we introduce a simple Modified Coin Game. The game takes place on a 7×7 map containing 6 coins of 3 different colors (2 coins of each color). The aim of the team is to collect only two kinds of coins (correct coins, with a reward of 100) while avoiding the other kind (false coins, with a reward of -200). The teammates' policies are predefined and illustrated as ordered colors in Fig. 8 (left): 2 training types and 2 testing types.
A.3 MODIFIED COIN GAME

To show the difference between ODITS and type-based approaches, we introduce a simple modified coin game. The game takes place on a 7 × 7 map containing 6 coins of 3 different colors (2 coins of each color). The aim of the team is to collect only two kinds of coins (correct coins, with a reward of 100 each) and to avoid collecting the remaining kind (false coins, with a reward of −200 each). The policies of the teammates are predefined and illustrated in the order of colors in Fig. 8 (left), with 2 training types and 2 testing types. For example, the first training type (red → green) indicates that the policy of this teammate is to collect red and green coins, and that it collects red coins first. Therefore, although the correct coins of the first training type (red → green) and the second testing type (green → red) are the same, they are different policies, since their paths to collect coins are clearly different. Each agent has five actions: move up, down, left, right, or pass. Once an agent steps on a coin, that coin disappears from the cell. The game ends after 20 steps. To maximize the team return, the ad hoc agent must infer which coins its current teammate desires and collect as many of them as possible.

Here, we adopt one state-of-the-art type-based approach, AATEAM (Chen et al., 2020), as the baseline. Fig. 6 shows the testing performance. We observe that ODITS achieves superior performance and converges quickly, while AATEAM shows an unstable curve. We believe this discrepancy results from the key difference between our method and type-based approaches: the baseline struggles to cooperate with new types of teammates. For example, when the baseline agent interacts with a teammate of the second testing type (green → red) and observes that the teammate is collecting green coins at the start, it switches its own policy to the one corresponding to the second training type (green → blue), so it collects green coins and blue coins (false coins) simultaneously, leading to poor teamwork performance. By contrast, ODITS generalizes easily to the testing types of teammates. During training, ODITS learns how to cooperate with a teammate according to its current behavior instead of its type: if it observes that its teammate is collecting one kind of coin, it collects the same kind, and this knowledge is automatically embedded in c and z.

A.4 DETAILS OF EXPERIMENTS

Figure 8: Illustration of Modified Coin Game (left), Predator Prey (middle), and Save the City (right).

**Modified Coin Game.** In this environment, the ad hoc agent needs to cooperate with its teammate to collect the target coins, which depend on the current teammate's type. The teammates' types are illustrated as two ordered colors of coins in Fig. 8. The teammate follows a heuristic strategy and has complete visibility of the entire environmental state. It first moves along the axis for which it has the smallest distance to the coin of its first color. If it collides with obstacles or boundaries, it first chooses random actions for 3 steps and then resumes moving toward the target coin. If there is no coin of its first color left, it moves to the coins of its second color following the same strategy.
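For illustration, the snippet below sketches the heuristic teammate described above; the grid encoding, helper names, and the tie-breaking between axes are simplifying assumptions of ours rather than the exact environment code.

```python
# Hedged sketch of the heuristic teammate in the Modified Coin Game. Grid
# encoding, helper names, and tie-breaking are illustrative assumptions.
import random

ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0), "pass": (0, 0)}


def teammate_action(pos, coins_by_color, color_order, random_steps_left):
    """pos: (x, y) of the teammate; coins_by_color: dict color -> list of (x, y);
    color_order: e.g. ("red", "green"); random_steps_left: > 0 after a collision."""
    if random_steps_left > 0:
        # After colliding with an obstacle or boundary, act randomly for 3 steps.
        return random.choice(list(ACTIONS))
    # Target the first color that still has coins, otherwise the second color.
    targets = next((coins_by_color[c] for c in color_order if coins_by_color[c]), [])
    if not targets:
        return "pass"
    # Move along the axis with the smallest remaining distance to the nearest coin.
    tx, ty = min(targets, key=lambda c: abs(c[0] - pos[0]) + abs(c[1] - pos[1]))
    dx, dy = tx - pos[0], ty - pos[1]
    if dx != 0 and (abs(dx) <= abs(dy) or dy == 0):
        return "right" if dx > 0 else "left"
    if dy != 0:
        return "down" if dy > 0 else "up"
    return "pass"
```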
The ad hoc agent can only access the environmental information within 3 grids of itself. We use one-hot coding to represent the different entities in each grid and concatenate them into a vector to construct the observation of the ad hoc agent. The game ends after 20 steps and takes place on a 7 × 7 grid containing 6 coins of 3 different colors (2 coins of each color). Each agent has five actions: move up, down, left, right, or pass. Once an agent steps on a coin, that coin disappears from the grid. This game requires agents to collect as many target coins as possible, where the targets are indicated by the teammate's type. Each correctly collected coin gives a reward of 100, while each false coin gives a reward of −200. As a result, the ad hoc agent needs to infer its teammate's goals from the teammate's behavior and move toward the right coins while avoiding the false ones.

**Predator Prey.** In this environment, m homogeneous predators try to capture n randomly moving prey in a 7 × 7 grid world. Each predator has six actions: moving in one of the four directions, capturing, and waiting at a grid. Besides, there are two obstacles at random locations. Due to partial observability, each predator can only access the environmental information within two grids of itself. The information of each grid is embedded into a one-hot vector representing the different entities: obstacles, blank grids, predators, and prey. Episodes are 40 steps long. The predators get a team reward of 500 if two or more predators capture the same prey simultaneously, and they receive a penalty of −10 if only one of them tries to capture a prey. We adopt three different scenarios to verify the ad hoc agent's ability to cooperate with different numbers of teammates: 2 predators and 1 prey (2d1y), 4 predators and 2 prey (4d2y), and 8 predators and 4 prey (8d4y).

Figure 9: t-SNE plots of the learned teammates' policy representations for Predator Prey (top panel) and Save the City (bottom panel) in 6 scenarios (2d1y, 4d2y, 8d4y, 2a2b, 4a3b, 6a4b); markers distinguish training, testing, and unselected policies. For each scenario, we first develop 60 candidate policies using existing MARL open-source implementations. Then, we train the self-supervised policy encoder mentioned in (Raileanu et al., 2020) on the collected behavioral trajectories $\{(o_t^i, a_t^i)\}$ of all candidates to represent their policies. Finally, we select 15 policies whose average policy embeddings are distinct from each other, and split them into 8 training policies and 7 testing policies for each scenario.

In this environment, to simulate complex cooperative behaviors, we utilize 5 algorithms (VDN (Sunehag et al., 2018), MAVEN (Mahajan et al., 2019), ROMA (Wang et al., 2020), QMIX (Rashid et al., 2018), and IQL (Tan, 1993)) to train the candidate policies of teammates, using their open-source implementations based on PyMARL (Samvelyan et al., 2019). We use 4 different random seeds for each algorithm and save the corresponding models at 3 different training steps (3 million, 4 million, and 5 million). As a result, we obtain 60 candidate teammate policies for each scenario. To ensure diversity of cooperative behaviors, we visualize the policy representations of all candidate models using the open-source implementation mentioned in (Raileanu et al., 2020). This mechanism encodes an agent's behavioral trajectories $\{(o_t^i, a_t^i)\}$ into an embedding space in which different policies generate different representations.
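As a rough illustration of this mechanism, the sketch below encodes (observation, action) trajectories with a GRU and trains the resulting embedding to discriminate which candidate policy produced each trajectory; it is a simplified stand-in for, not a reproduction of, the encoder of Raileanu et al. (2020).

```python
# Hedged sketch of a trajectory-based policy encoder: a GRU over concatenated
# (observation, action) pairs whose final hidden state serves as the policy
# embedding, trained to classify which candidate policy generated the
# trajectory. An illustrative simplification, not the method of Raileanu et al.
import torch
import torch.nn as nn


class PolicyEncoder(nn.Module):
    def __init__(self, obs_dim, act_dim, embed_dim, n_policies):
        super().__init__()
        self.gru = nn.GRU(obs_dim + act_dim, embed_dim, batch_first=True)
        self.classifier = nn.Linear(embed_dim, n_policies)

    def forward(self, obs_seq, act_seq):
        # obs_seq: (batch, T, obs_dim); act_seq: (batch, T, act_dim) one-hot actions.
        _, h = self.gru(torch.cat([obs_seq, act_seq], dim=-1))
        embedding = h.squeeze(0)          # (batch, embed_dim) policy embedding
        return embedding, self.classifier(embedding)


# Training step: push embeddings of different candidate policies apart by
# asking the classifier to identify the generating policy.
encoder = PolicyEncoder(obs_dim=32, act_dim=5, embed_dim=16, n_policies=60)
obs_seq, act_seq = torch.randn(8, 20, 32), torch.randn(8, 20, 5)
policy_ids = torch.randint(0, 60, (8,))
embedding, logits = encoder(obs_seq, act_seq)
loss = nn.functional.cross_entropy(logits, policy_ids)
loss.backward()
```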
Then, we manually select 15 distinct policies and randomly split them into 8 training policies and 7 testing policies for each scenario. We illustrate the policy representations in Fig. 9. During training or testing, in each episode we sample one policy from the policy set and equip a team of agents with that policy. We then replace one random agent in the team with the ad hoc agent to construct an ad hoc team.

**Save the City.** This is a 14 × 14 grid-world resource allocation task presented in (Iqbal et al., 2020). In this task, there are 3 distinct types of agents, and their goal is to complete the construction of all buildings on the map while preventing them from burning down. Each agent has 8 actions: stay in place, move to the next grid in one of the four cardinal directions, put out fire, and build. The agents receive a team reward of 100 when they complete a building and a penalty of −500 when a building burns down. Agent types include firefighters (20× speedup over the base rate in reducing fires), builders (20× speedup in building), and generalists (5× speedup in both, as well as a 2× speedup in moving). Buildings also come in two varieties, fast-burning and slow-burning, where fast-burning buildings burn four times faster. In our experiments, each agent can only access the environmental information within four grids of itself. The observation of each agent contains the information (type, position, and completion progress of the building) of each entity in the environment. If an entity is not within the sight range of the agent, the observation vector for this entity is filled with 0. We adopt three different scenarios here to verify all methods: 2 agents and 2 buildings (2a2b), 4 agents and 3 buildings (4a3b), and 6 agents and 4 buildings (6a4b). Similar to Predator Prey, we also manually select 15 teammate policies for each scenario and illustrate their policy representations in Fig. 9. Furthermore, since each team has a random combination of agent types, teammates' behaviors in this environment show greater diversity. In addition, all buildings' types are randomly assigned in each episode.

B LIMITATIONS AND FUTURE WORKS

In this paper, the experiments were carried out with relatively small teams (2-8 agents) and with agent behavior largely coming from pre-trained RL policies. In this setting, the diversity of the teammates' policies can be easily verified by visualizing the policy representations in latent space. However, as the team size increases, there may be more complex and diverse teamwork situations, which places higher demands on training speed and on the scalability of the architecture. In addition, the diversity of the policy sets used for training has a considerable influence on performance, yet the characterization and quantification of this influence remain unexplored. We leave these issues for future work.