# MULTI-AGENT REINFORCEMENT LEARNING WITH SHARED RESOURCE FOR INVENTORY MANAGEMENT

**Anonymous authors**
Paper under double-blind review

ABSTRACT

We consider the inventory management (IM) problem for a single store with a large number of SKUs (stock keeping units), where we need to make replenishment decisions for each SKU to balance its supply and demand. SKUs should cooperate with each other to maximize profits, while also competing for shared resources, e.g., warehouse space, budget, etc. The co-existence of cooperative and competitive behaviors makes IM a complicated game; hence, IM can be naturally modelled as a multi-agent reinforcement learning (MARL) problem. In the IM problem, we find that agents only interact indirectly with each other through shared resources, e.g., warehouse space. To formally model MARL problems with the above structure, we propose the shared-resource stochastic game along with an efficient algorithm to learn policies, particularly for a large number of agents. By leveraging the shared-resource structure, our method can greatly reduce model complexity and accelerate the learning procedure compared with standard MARL algorithms, as shown by extensive experiments.

1 INTRODUCTION

The inventory management (IM) problem has long been one of the most important applications in the supply-chain industry (Nahmias & Smith, 1993). Its main purpose is to maintain a balance between the supply and demand of stock keeping units (SKUs) in a supply chain by optimizing the replenishment decisions of each SKU. Besides increasing profits and reducing operational costs, efficient IM can even give rise to better services to customers. However, it is quite a challenging task in practice, especially when there are many SKUs involved in the supply chain. In particular, while all SKUs should cooperate with each other to achieve high profits, they also need to compete for shared resources, e.g., warehouse space, budget, etc. Such co-existence of cooperation and competition renders IM a complicated game that is hard to address.

Traditional methods usually reduce IM to solving dynamic programming problems. However, these approaches often rely on unrealistic assumptions such as i.i.d. demand, deterministic leading time, etc. Moreover, as the state space grows exponentially with key factors like leading time and the number of SKUs (Gijsbrechts et al., 2019), the corresponding dynamic programming problems become intractable due to the curse of dimensionality. Because of these limitations, many approaches based on approximate dynamic programming have been proposed to solve IM problems in different settings (Halman et al., 2009; Fang et al., 2013; Chen & Yang, 2019). While these approaches perform well in certain scenarios, they heavily rely on problem-specific expertise or assumptions, e.g., the zero or one period leading time assumption in (Halman et al., 2009), hence can hardly generalize to other settings. In contrast, reinforcement learning (RL) based methods, with fast inference speed, can generalize to various scenarios in a data-driven manner. However, it is usually too costly to train a global policy that makes decisions for all SKUs, since training efficiency is notably curtailed by the large global state and action space (Jiang & Agarwal, 2018).
To further address the training efficiency issue, it is natural to adopt the multi-agent reinforcement learning (MARL) paradigm, where each SKU is regarded as an agent whose state and action spaces are localized and only contain information relevant to itself. There are currently two popular paradigms to train MARL agents in the literature: independent learning (Tan, 1993) and joint action learning (Lowe et al., 2017). Despite their success in many scenarios, these two MARL paradigms also exhibit certain weaknesses that restrain their effectiveness in solving IM problems. On one hand, if applying independent learning, the policy training of one agent simply treats all other agents as part of the stochastic environment, hence is hard to converge due to the non-stationarity of the environment. On the other hand, joint action learning usually learns a centralized critic conditioned on the joint action and state spaces of all agents, which can easily become intractable with an increasing number of SKUs. Furthermore, it could be quite time-consuming to sample data from the joint simulator for a large number of SKUs, since it usually involves much computation on many internal variables caused by complex agent interactions.

To address these challenges, we take a closer look at the IM problem and find that there exists a special structure that can be leveraged to design a more effective MARL paradigm. In particular, each agent in IM only interacts with others through the shared resource, e.g., warehouse space. We introduce an auxiliary variable to represent the whole inventory level, implying the available shared resource, for all SKUs, and refer to this variable as the context. From the MARL perspective, one agent can be influenced by other agents only through such context. The context dynamics actually reflect the collective behaviors of all agents, and, conditioned on the context dynamics, one agent's state transition and reward function are independent of other agents. In this way, leveraging the context as an additional input to each agent's policy/value functions enables us to both avoid the non-stationarity problem caused by independent learning and mitigate the intractable centralized critic learning caused by exploding state and action spaces. Moreover, introducing context dynamics inspires us to build a local simulator for each agent in order to facilitate more efficient policy learning.

Based on this structure with context dynamics, we propose a shared-resource stochastic game to model the IM problem. Specifically, we introduce a surrogate objective function to optimize the return of agent i conditioned on all other agents, denoted by −i. Here, we make two approximations to obtain the surrogate objective function: 1) rearranging the sampling process by first sampling the context dynamics and then sampling local states/actions/rewards for each agent i; 2) using context dynamics sampled by previous policies. Based on the above surrogate objective, our algorithm consists of two iterative learning procedures: 1) obtaining context dynamics from the joint simulator; 2) updating the policy of each agent with data sampled from its respective local simulator conditioned on the collective context dynamics. By decoupling each agent from all others with a separate training procedure, our method can greatly reduce model complexity and accelerate the learning procedure, as shown by extensive experiments.
It is worthwhile to mention that our method is not limited to IM, but can also be applied to many other applications with a shared-resource structure, e.g., portfolio management (Ye et al., 2020), smart grid scheduling (Remani et al., 2019), etc. Our contributions are summarized as follows:

- We propose the shared-resource stochastic game to capture the problem structure in IM, namely that agents only interact with each other through a shared resource.
- We propose a novel algorithm that leverages the shared-resource structure to solve the IM problem efficiently.
- Extensive experiments demonstrate that our method outperforms existing MARL algorithms in terms of both profits and computing efficiency.

2 BACKGROUND

2.1 STOCHASTIC GAMES

We build our work on stochastic games (SGs) (Shapley, 1953), since each SKU in the IM problem has its own profit (reward) to optimize. A stochastic game is defined by a tuple $\langle \mathcal{N}, \mathcal{S}, \{\mathcal{A}^i\}_{i\in\mathcal{N}}, \mathcal{T}, \{R^i\}_{i\in\mathcal{N}}, \gamma \rangle$, where $\mathcal{N} = \{1, \cdots, n\}$ denotes the set of $n > 1$ agents, $\mathcal{S}$ denotes the state space observed by all agents, and $\mathcal{A}^i$ denotes the action space of agent $i$. Let $\mathcal{A} := \mathcal{A}^1 \times \cdots \times \mathcal{A}^n$; then $\mathcal{T}: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ denotes the transition probability from any state $s \in \mathcal{S}$ to any state $s' \in \mathcal{S}$ after taking a joint action $a \in \mathcal{A}$; $R^i: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the reward function that determines the immediate reward received by agent $i$ for a transition from $(s, a)$ to $s'$; and $\gamma \in [0, 1)$ is the discount factor. We write the joint policy of the other agents as $\pi^{-i} = \prod_{j \in -i} \pi^j$. Each agent $i$ optimizes its policy $\pi^i: \mathcal{S} \to \Delta(\mathcal{A}^i)$ to maximize its own long-term reward, which is conditioned on the other agents' behavior, defined as

$$\max_{\pi^i}\ \eta^i(\pi^i, \pi^{-i}) = \mathbb{E}_{(s_t, a^i_t, a^{-i}_t) \sim \mathcal{T}, \pi^i, \pi^{-i}}\Big[\sum_{t=0}^{\infty} \gamma^t r^i_t\Big]. \qquad (1)$$

We will illustrate the shared-resource structure of the IM problem in Section 2.2, which motivates us to propose the shared-resource stochastic game as a special case of the stochastic game to capture such structure in Section 3.1.

2.2 INVENTORY MANAGEMENT WITH SHARED RESOURCE

While a typical IM setting involves a multi-echelon supply network including stores, warehouses and factories, we simplify the setting to ease our presentation. In the following, we focus on scenarios with one store and multiple SKUs. We further assume that there is an upstream warehouse that can fulfill requirements from the store perfectly. Our objective is to learn high-quality replenishing policies for each SKU in the store, particularly when there are a large number of SKUs. As replenishing decisions for SKUs in stores should directly account for the consumption behaviors of customers and the competition from other SKUs for limited resources like space and budget, they are more challenging to optimize compared to SKUs in warehouses. It is worthwhile to mention that, due to the flexibility of RL algorithms, our method can also be applied to more complex settings with multiple echelons, fluctuating supply, non-deterministic leading time, etc.

Similar to previous work, we follow the multi-agent RL (MARL) framework with decentralized agents, each of which manages the inventory of one SKU in the store. We assume that the store sells $n$ SKUs, all of which share a common space that can store up to $I_{\max}$ units at the same time. Replenishing decisions of each SKU are made at discretized time steps, which are days in our paper.
For each time step $t$ and SKU $i$, let $\dot I^i_t \in \mathbb{Z}$ denote the units of $i$ that are in stock. Hence, the following constraint shall hold for all time steps:

$$\sum_{i=1}^{n} \dot I^i_t \le I_{\max}, \quad \forall t \ge 0. \qquad (2)$$

At any time step $t$, the agent corresponding to SKU $i$ may place a replenishment order to request $O^i_t \in \mathbb{Z}$ units of products from its upstream warehouse. These replenishment orders cannot be fulfilled instantly, but will take several time steps, referred to as the leading time, before the products are delivered to the store. Let $L_i$ denote the leading time of SKU $i$ and $T^i_t \in \mathbb{Z}$ its total units in transit at time step $t$. Meanwhile, demands from customers $D^i_t$ may consume the inventory of SKU $i$ and cause an actual sale of $S^i_t \in \mathbb{Z}$ units. Due to the possibility of out-of-stock, $S^i_t$ may be less than $D^i_t$. Formally, the dynamics of these variables can be described as follows:

$$S^i_t = \min\big(D^i_t, \dot I^i_t\big) \qquad (3)$$

$$T^i_{t+1} = T^i_t - O^i_{t-L_i+1} + O^i_{t+1} \qquad (4)$$

$$\hat I^i_t = \dot I^i_t - S^i_t + O^i_{t-L_i+1} \qquad (5)$$

$$\rho = \begin{cases} 0 & \text{if } \sum_{i=1}^{n} \hat I^i_t \le I_{\max} \\[4pt] \dfrac{\sum_{i=1}^{n} \hat I^i_t - I_{\max}}{\sum_{i=1}^{n} O^i_{t-L_i+1}} & \text{otherwise} \end{cases} \qquad (6)$$

$$\dot I^i_{t+1} = \dot I^i_t - S^i_t + \big\lfloor (1-\rho)\, O^i_{t-L_i+1} \big\rfloor \qquad (7)$$

As mentioned before, since all SKUs share a common space, the storage may overflow when ordered products arrive. In this paper, we assume that the excess units are discarded proportionally according to $\rho$ defined in Eq. 6, which corresponds to the overflow ratio if we accept all arriving units without considering the space constraint. To calculate $\rho$, we also introduce an extra variable $\hat I^i_t$, which we call the afterstate of $\dot I^i_t$. Intuitively, it denotes the units of SKU $i$ in stock at the end of the $t$-th time step if we omit the capacity constraint. We note that other manners of resolving space overflows, e.g., prioritizing all SKUs, are possible; the algorithm that we introduce in the following section also applies to these settings. Undoubtedly, such overflow behaviors cause extra operational cost, which should be avoided as much as possible in our replenishing decisions. The intermediate profit $P^i_t$ of the $i$-th SKU is calculated according to the following equation:

$$P^i_t = p_i S^i_t - q_i O^i_t - o\, \mathbb{I}\big[O^i_t > 0\big] - h \dot I^i_t \qquad (8)$$

where $p_i$ and $q_i$ are the unit sale price and unit procurement price of the $i$-th SKU, respectively, and $o$ and $h$ are the order cost and unit holding cost, respectively. $\mathbb{I}[\cdot]$ is an indicator function which equals one when the condition is true and zero otherwise. The order cost reflects the fixed transportation cost or the order processing cost, and is incurred whenever the order quantity is non-zero. For convenience, we summarize all notations in Table 2 in Appendix B, where we also give an example with 2 SKUs in Fig. 4 to further illustrate the whole procedure.
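To make the transition dynamics concrete, the following is a minimal sketch of one joint simulator step implementing Eqs. 3–8 for all SKUs. It is not the general-purpose simulator used in our experiments; the explicit `pipeline` list tracking in-transit orders (which stands in for the bookkeeping of Eq. 4, up to a one-step indexing convention) and all function and variable names are illustrative assumptions.

```python
import numpy as np

def joint_step(stock, pipeline, orders, demand, I_max, p, q, o, h):
    """One joint transition over all n SKUs, following Eqs. 3-8.

    stock: (n,) on-hand units.  pipeline[i] lists the not-yet-arrived orders of
    SKU i, oldest first, so pipeline[i][0] is the quantity arriving now.
    """
    sales = np.minimum(demand, stock)                                    # Eq. 3
    arriving = np.array([pl[0] for pl in pipeline])
    afterstate = stock - sales + arriving                                # Eq. 5 (capacity ignored)
    excess = afterstate.sum() - I_max
    rho = 0.0 if excess <= 0 else excess / max(arriving.sum(), 1)        # Eq. 6
    next_stock = stock - sales + np.floor((1 - rho) * arriving).astype(int)  # Eq. 7
    # Eq. 4: drop the arrived order from each pipeline and append today's order.
    pipeline = [pl[1:] + [orders[i]] for i, pl in enumerate(pipeline)]
    in_transit = np.array([sum(pl) for pl in pipeline])
    profit = p * sales - q * orders - o * (orders > 0) - h * stock       # Eq. 8, per SKU
    return next_stock, in_transit, pipeline, profit
```

An episode alternates this joint step with the agents' order decisions; the case $\rho > 0$ is exactly the overflow situation in which arriving units are discarded proportionally.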
3 METHODOLOGY

3.1 SHARED-RESOURCE STOCHASTIC GAME

In this section, we show that the IM problem introduced in Section 2.2 can be formulated as a shared-resource stochastic game, where each agent is only influenced by other agents through a shared resource pool.

We define a shared-resource stochastic game as a tuple $\langle \mathcal{N}, \{\mathcal{S}^i\}_{i\in\mathcal{N}}, \mathcal{C}, \{\mathcal{A}^i\}_{i\in\mathcal{N}}, \mathcal{T}, \{R^i\}_{i\in\mathcal{N}}, \gamma \rangle$, where $\mathcal{N} = \{1, \cdots, n\}$ denotes the set of $n > 1$ agents, $\mathcal{S}^i$ denotes the state space of agent $i$, $\mathcal{C}$ denotes the context space observed by all agents, and $\mathcal{A}^i$ denotes the action space of agent $i$. Here, the context represents the occupied capacity of the shared resource. Let $\mathcal{S} := \mathcal{S}^1 \times \cdots \times \mathcal{S}^n$ and $\mathcal{A} := \mathcal{A}^1 \times \cdots \times \mathcal{A}^n$; then $\mathcal{T}: \mathcal{S} \times \mathcal{C} \times \mathcal{A} \to \Delta(\mathcal{S} \times \mathcal{C})$ denotes the transition probability, which can be decomposed as follows. The context is affected by all agents, i.e., $c_{t+1} \sim P_c(\cdot \mid c_t, s_t, a_t)$. Since we are dealing with resources, the transition function of the context usually has some additive structure with respect to all agents, which we will illustrate later in Section 3.2. Given the context $c_{t+1}$, the state transition of each agent is independent of the other agents, i.e., $s^i_{t+1} \sim P^i_s(\cdot \mid s^i_t, a^i_t, c_{t+1})$. The reward function is also independent of the other agents given the context, $r^i_t \sim R^i(s^i_t, a^i_t, c_{t+1})$. We refer to $P^i_s$ and $R^i$ as local transition and reward functions. $\gamma \in [0, 1)$ is the discount factor. Each agent $i$ optimizes its own policy $\pi^i: \mathcal{S}^i \times \mathcal{C} \to \Delta(\mathcal{A}^i)$ to maximize the expected long-term reward conditioned on the other agents' policies $\pi^{-i}$, defined as

$$\max_{\pi^i}\ \eta^i(\pi^i, \pi^{-i}) = \mathbb{E}_{(s_t, c_t, a^i_t, a^{-i}_t) \sim \mathcal{T}, \pi^i, \pi^{-i}}\Big[\sum_{t=0}^{\infty} \gamma^t r^i_t\Big] \qquad (9)$$

Given the above definition, an IM problem can be formulated as a shared-resource stochastic game by letting $r^i_t = P^i_t$, $a^i_t = O^i_t$, and $c_t = \sum_{i=1}^{n} \dot I^i_t$ for every SKU $i$ and time step $t$. Moreover, the state of each SKU $i$ is a concatenation of information such as $T^i_t$, $p_i$, its historical actions and demands, etc. A detailed description of the state space can be found in Appendix E.3.

3.2 SURROGATE OBJECTIVE WITH CONTEXT DYNAMICS

We now introduce how to optimize the return of agent $i$ conditioned on the other agents' behaviors. Roughly speaking, the context dynamics reflect the collective behaviors of the other agents, so it is possible to approximately estimate the objective function for each agent $i$ based only on its local transition dynamics, its local reward function, and the context dynamics. First, we give a detailed description of the transition model of the shared resource $c$ from each agent's perspective. Given such dynamics for the shared resource, we then show how to approximate the objective function in Eq. 9 by rearranging the sampling process. Finally, we replace the policies for sampling the context dynamics with old policies from the previous iteration to accelerate the decentralized training.

3.2.1 CONTEXT DYNAMICS AND LOCAL SIMULATOR

We use $\dot c^i_t$ to represent the amount of resource occupied by agent $i$ at time step $t$ and $\dot c_t$ the total amount of occupied resource. We further let $\dot c^{-i}_t$ denote the amount of resource occupied by all
Due to the additive structure, we have] s[i]t[,][ ˙]c[i]t[, and][ a][i]t[,][ P] [(ˆ]c[i]t _[|][ s]t[i][, a][i]t[ ˆ][,]c[ ˙]ct[i]t = ˆ[)][ represents how the]c[i]t_ [+ ˆ]c[−]t _[i][. Then by]_ applying a resource overflow resolution procedure as in Eq. 6, we obtain ˙c[i]t+1 [representing state of] the resource in the next step. We use notation ct to represent (ˆct 1, ˙ct). We refer (s[i], c[i]) as the state for agent i, and c[−][i] as the _−_ context. For each agent, we have the following sampling process T _[i]_ given the context dynamics of _c[−][i],_ _c[i]t+1_ _[∼]_ _[P]c[ i][(][· |][ s][i]t[, a][i]t[, c][i]t[, c]t[−][i][)]_ _s[i]t+1_ _[∼]_ _[P]s[ i][(][· |][ s][i]t[, a][i]t[, c][i]t+1[, c][−]t+1[i]_ [)] (11) _rt[i]_ _[∼]_ _[R][i][(][s]t[i][, a]t[i][, c]t[i]+1[, c][−]t+1[i]_ [)][.] Given the context dynamics for c[−][i], we can build a local simulator for agent i according to the above equations. To ease our presentation, we only consider one kind of shared resource i.e., warehouse spaces in the above formulation. However, it can be extended to support multiple resources as long as these resources have the additive structure as in Eq. 10. 3.2.2 SURROGATE LOSS FUNCTION To leverage local simulators, we rearrange the sampling process for evaluating η(π[i], π[−][i]) as follows. First, we follow the regular sampling process in Eq. 9, and get samples of c[−]t _[i][. Then, we re-sample]_ (s[i]t[, a][i]t[, c][i]t[)][ based on samples of][ c]t[−][i] following Eq. 11. It is worth noting that we make an approximation here. Given samples for c[−]t _[i][, we actually need to inference the posterior of][ (][s]t[i][, a][i]t[, c][i]t[)][.]_ However, since we consider scenarios with lots of agents, it is reasonable to assume that each agent _i has limited influence on its context c[−][i]. Therefore, we can assume that c[−]t_ _[i]_ has limited influence on the inference of (s[i]t[, a][i]t[, c][i]t[)][ and sample the latter directly according to Eq.][ 11][. By rearranging the] sampling process, we obtain the following surrogate objective, _maxπi ˜η[i](π[i], π[−][i]) = E(c−t_ _i)_ _,π[i],π[−][i]_ [E][(][s][i]t[,a][i]t[,c][i]t[)][∼T][ i][,π][i] [[] _∼T_ _γ[t]rt[i][]][.]_ (12) _t=0_ X In practice, it is quite costly to sample from the joint dynamics T, but much cheaper to sample from the local dynamics T _[i]. To leverage data samples efficiently, we propose to use the samples_ from previous policies in our objective function, which is a common trick in reinforcement learning. Specifically, we use samples of c[−]t _[i]_ collected by policies (πold[i] _[, π]old[−][i]_ [)][ of the previous iteration, and] further rewrite the surrogate objective function as follows: _maxπi ˆη[i](π[i], πold[i]_ _[, π]old[−][i]_ [) =][ E](c[−]t _[i])∼T,πold[i]_ _[,π]old[−][i]_ [E][(][s]t[i][,a][i]t[,c][i]t[)][∼T][ i][,π][i] [[] _γ[t]rt[i][]][.]_ (13) _t=0_ X As long as current policies stay close to previous ones, we can adopt above surrogate objective function to improve sample efficiency. 3.3 ALGORITHM In the following, we will present details about our proposed algorithm, which is referred as Contextaware Decentralized PPO (CD-PPO). We call it context-aware as the context c[−]t _[i]_ is a part of input to train policy and value functions of each agent. 
Such a context-aware approach can avoid the nonstationary problem occurred in independent learning methods (which does not consider dynamics of others during training), since the context reflects other agents’ collective behaviors which can ----- Figure 1: Our algorithm consists of two iterative learning procedures: 1. Get context dynamics _{c[−]t_ _[i][}]t[T]=0_ [from the joint simulator and previous policies][ π]old[i] [,][ π]old[−][i] [. 2. Train policy][ π][i][ with data] sampled from the local simulator conditioned on context dynamics {c[−]t _[i][}]t[T]=0[.]_ impact dynamics of each individual agent. In the meanwhile, our method can mitigate the intractable centralized critic learning by avoiding using exploding joint state and action space, i.e., (st, at, ct), as input of the critic. As shown in Algorithm 1, our algorithm consists of two iterative learning procedures: 1) obtaining context dynamics by running all agents in the joint environment (Algorithm 2); 2) updating policy of each agent using data sampled from its local simulator conditioned on the context dynamics (Algorithm 3). Moreover, our algorithm follows a decentralized training paradigm with an acceptable communication cost of the context dynamics. We refer readers to Appendix C for a detailed description. It is worth noting that a naive approach is to optimize policies by Eq. 9 with data sampled from the joint simulator. Nonetheless, we find that it is empirically more time-consuming to sample one step in the joint simulator than letting each of the n local simulators sample once. One major reason lies in that agent interactions take most of the simulation time. Our method, with the advantage of leveraging local simulators to simplify interactions, are henceforth much cheaper to sample. We refer readers to Appendix D for more discussions on the benefit of leveraging local simulators. **Algorithm 1 Context-aware Decentralized PPO** Given the joint simulator Envjoint and local simulators Env[i]local[}][n]i=1 _{_ Initialize policies π[i] and value functions V _[i]_ for i = 1, . . ., n **for M epochs do** // Collect context dynamics via running joint simulation **for{c[−]t k[1][}] = 1t[T]=0[, . . .,], 2 . . ., K[ {][c]t[−] do[n]}t[T]=0** _[←]_ **[GetContextDynamics][(][Env][joint][,][ {][π][i][}]i[n]=1[)][ (Algorithm][ 2][)]** **for all agents i do** // Set capacity trajectory by context dynamics Env// Train policy by running simulation in the corresponding local environment[i]local[.][set c trajectory][(][{][c]t[−][i][}]t[T]=0[)] _π[i], V_ _[i]_ _←_ **DecentralizedPPO(Env[i]local[, π][i][, V][ i][)][ (Algorithm][ 3][)]** **end for** **end for** Evaluate policies _π[i]_ _i=1_ [on joint simulator Env][joint] _{_ _}[n]_ **end for** ----- 4 EXPERIMENTS We evaluate the performance of CD-PPO in three IM problems, which contain 5, 50, and 100 SKUs, respectively. By changing space size of the warehouse, we further testify how CD-PPO performs comparing to a series of baselines under different levels of competition. On these settings, we demonstrate that our algorithm can achieve comparable results as SOTA baselines, but is more sample-efficient. 4.1 EXPERIMENT SETUPS Our experiment is conducted on a simulator which can support both retailing and replenishment for multiple SKUs in one store. Instead of sampling demands from some hypothetical distributions, we instead directly use authentic demand data from Kaggle (Makridakis et al., 2020), which contains sales history of five years (2011-2016) for various SKUs in Walmart. 
4 EXPERIMENTS

We evaluate the performance of CD-PPO in three IM problems, which contain 5, 50, and 100 SKUs, respectively. By changing the space size of the warehouse, we further test how CD-PPO performs compared to a series of baselines under different levels of competition. In these settings, we demonstrate that our algorithm achieves results comparable to SOTA baselines while being more sample-efficient.

4.1 EXPERIMENT SETUPS

Our experiments are conducted on a simulator which supports both retailing and replenishment for multiple SKUs in one store. Instead of sampling demands from hypothetical distributions, we directly use authentic demand data from Kaggle (Makridakis et al., 2020), which contains five years (2011-2016) of sales history for various SKUs in Walmart. We randomly choose 155 SKUs from the data and use the sales of the first four years as our training set, while the sales of the remaining year serve as the testing set. For all other information, e.g., price, cost, leading time, etc., that is necessary to instantiate our simulator but not included in the dataset, we randomly sample values from reasonable ranges. The evaluation metric is the total profit in dollars. For all results that we present, we run each algorithm four times with different random seeds and report the average performance with standard deviations. The source code as well as instructions to reproduce our results can be found in [https://anonymous.4open.science/r/replenishment-marl-baselines-75F4.](https://anonymous.4open.science/r/replenishment-marl-baselines-75F4)

Note that the simulator we use is developed for general purposes and contains extra details like replenishment order fulfillment scheduling, transportation management, etc.; hence, it requires substantial computation resources to simulate IM problems with many SKUs (> 100). This also hinders us from extending our experiments to cases with more than 100 SKUs on a single machine. We leave it as future work to further test our algorithm in a distributed environment where simulating large-scale IM problems is possible.

We compare our method CD-PPO with the strong model-free on-policy MARL algorithms MAPPO (Yu et al., 2021) and IPPO (de Witt et al., 2020) on the 5-SKUs environment. We do not show the performance of COMA (Foerster et al., 2018) or value decomposition methods such as QMIX (Rashid et al., 2018) due to their poor performance (i.e., negative profit). For the 50-SKUs and 100-SKUs scenarios, we can only run IPPO, since MAPPO fails quickly as it suffers from the huge joint state-action space for training a centralized critic. Besides RL approaches, we also compare with the base-stock policy, a well-known policy from the OR community that is widely adopted in practice (Kapuscinski & Tayur, 1999; Hubbs et al., 2020); more details can be found in Appendix F.

Our implementation is based on the EPyMARL (Papoudakis et al., 2021) framework, which contains MAPPO, IPPO and several commonly used MARL algorithms. As all agents are homogeneous in IM, we let the parameters of the policy network and the critic network be shared amongst all agents. The two networks are both two-layer MLPs with hidden size 64. In our experiments, the actor network maps the agent-specific state to a categorical distribution over a discrete action space {0, 1/3, 2/3, 1, 4/3, 5/3, 2, 5/2, 3, 4, 5, 6, 7, 9, 12}, such that the real replenishment quantity is obtained by multiplying the action with the agent's mean sales of the past two weeks. In addition, to encourage policies to accommodate diverse context dynamics, we propose two methods to augment the context dynamics: 1) add noise to randomly chosen items with a predefined probability; 2) replace randomly chosen items with predicted values coming from a context generator model, which is also guided by the predefined probability. Besides, we also use the data collected in Algorithm 2 for policy training. More details about the algorithm can be found in Appendix C.
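As a concrete illustration of the action space described above, the snippet below maps a discrete action index to a replenishment quantity. It is a minimal sketch under the stated setup (the fixed multiplier set and the two-week sales average); rounding to an integer order is our assumption.

```python
import numpy as np

# Discrete multipliers used by the actor's categorical head.
ACTION_MULTIPLIERS = [0, 1/3, 2/3, 1, 4/3, 5/3, 2, 5/2, 3, 4, 5, 6, 7, 9, 12]

def action_to_order_quantity(action_idx: int, recent_sales: np.ndarray) -> int:
    """Convert a discrete action into a replenishment order O_t^i.

    recent_sales: daily sales of this SKU over the past two weeks. The order is the
    chosen multiplier times the mean daily sales; rounding to whole units is an
    assumption, since orders are integer quantities.
    """
    sales_mean = float(np.mean(recent_sales)) if len(recent_sales) else 0.0
    return int(round(ACTION_MULTIPLIERS[action_idx] * sales_mean))
```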
4.2 MAIN RESULTS

We evaluate CD-PPO, MAPPO and IPPO on the 5-, 50- and 100-SKUs environments with different sizes of warehouse space. IPPO and MAPPO only use individual rewards rather than the team reward (the summation of all individual rewards) to train critics. For a fair comparison, we also train IPPO and MAPPO with information related to the shared resource. The training curves for the 5-SKUs scenario are shown in Figure 2. It is straightforward to observe that CD-PPO converges to the same performance as the other methods. In particular, CD-PPO is more sample-efficient due to its local simulations running in parallel. To evaluate our algorithm on larger scenarios, we also run CD-PPO and IPPO (with the same settings as before) on 50 and 100 SKUs and record the number of data samples needed to reach the median performance of the baselines. The results are summarized in Table 1 and Figure 3, where "N" and "C" denote the number of SKUs and the maximum capacity of the shared resource, respectively. More details about the implementation and the hyper-parameters we used can be found in Appendix C.

Table 1: Profit comparison on different scenarios of the 50- and 100-SKUs environments.

|Env Scenario|CD-PPO(Ours)|IPPO-IR(w/o context)|IPPO-IR|IPPO(w/o context)|IPPO|Base-stock(Dynamic)|
|---|---|---|---|---|---|---|
|N50-C500|310.81 ± 76.46|235.09 ± 60.61|250.03 ± 58.38|164.43 ± 143.01|366.74 ± 89.58|−408.14|
|N50-C2000|694.87 ± 174.184|689.27 ± 48.92|545.86 ± 459.71|−1373.29 ± 870.03|−1102.97 ± 1115.69|42.71|
|N100-C1000|660.28 ± 149.94|−2106.98 ± 315.38|−1126.42 ± 409.83|−1768.19 ± 1063.61|−669.83 ± 1395.92|−22.05|
|N100-C4000|1297.75 ± 124.52|−2223.11 ± 2536.00|148.00 ± 1017.47|−6501.42 ± 6234.06|−6019.28 ± 9056.49|493.32|

Figure 2: Training curves of different algorithms in the 5-SKUs environment. "IR" means the environment only provides individual rewards, and "w/o context" denotes that the algorithm does not use information related to the shared resource as part of its inputs. The x-axis† also counts samples from local simulation for CD-PPO.

† Specifically, for one interaction in the global environment with N agents, we count it as N data samples; for one interaction in the local simulator (based on context trajectories) with one agent, we count it as 1 data sample.

Figure 3: Average number of samples needed by different algorithms in the 100-SKUs environment to reach the median performance of the baselines. The lower the value, the higher the sample efficiency of the algorithm.

As demonstrated in Figure 2, Figure 3 and Table 1, CD-PPO produces results comparable to other strong baselines and even continues improving as the training proceeds. Moreover, our algorithm curtails non-stationarity effects, as centralized training does, and at the same time can scale up to scenarios with a large number of agents. In contrast, traditional MARL methods with the CTDE paradigm cannot even start running on the 50-SKUs and 100-SKUs environments, since the input of their critics is too large to fit in memory. IPPO, on the other hand, runs successfully, but lacks stability under different levels of warehouse space. The full results, together with more ablation studies on how the capacity of the shared resource affects the performance of CD-PPO and on the influence of the augmentation of context dynamics, can be found in Appendix G.
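For reference, the base-stock baseline reported in Table 1 follows the classical order-up-to rule. The sketch below is a generic version of that rule, not necessarily the exact dynamic variant detailed in Appendix F; the target-level input is an assumption.

```python
def base_stock_order(target_level: int, on_hand: int, in_transit: int) -> int:
    """Classical base-stock (order-up-to) rule: replenish the gap between a target
    inventory position and the current position (on-hand plus in-transit units)."""
    inventory_position = on_hand + in_transit
    return max(0, target_level - inventory_position)
```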
5 RELATED WORK

In this section, we introduce relevant prior work, including studies of the IM problem and commonly used training paradigms in MARL. More detailed descriptions of other related work (e.g., constrained/weakly-coupled MDPs, model-based MARL and mean-field MARL) are given in Appendix A.

5.1 INVENTORY MANAGEMENT

Since the pioneering work of (Fukuda, 1964), many approaches have been proposed to solve different variants of IM problems, using either exact (Goldberg et al., 2016; Huh et al., 2009) or approximate (Halman et al., 2009; Fang et al., 2013; Chen & Yang, 2019) dynamic programming. As reinforcement learning based approaches are our main focus, we only present related work of this branch in the following; interested readers are referred to (Gijsbrechts et al., 2019) for an overview of traditional approaches for IM. The attempt to apply reinforcement learning to inventory management has a long history, see for instance (Giannoccaro & Pontrandolfo, 2002; Jiang & Sheng, 2009; Kara & Dogan, 2018; Barat et al., 2019; Gijsbrechts et al., 2019; Oroojlooyjadid et al., 2017; 2020). However, as their main focus is to deal with challenges like volatile customer demands, bullwhip effects, etc., these works are restricted to simplified scenarios with only a single SKU. While these approaches are able to outperform traditional approaches in such scenarios, they overlook the system constraints and the coordination among SKUs imposed by shared resources. Exceptions are two recent works (Barat et al., 2019; Sultana et al., 2020), where more realistic scenarios containing multiple SKUs are considered. In (Barat et al., 2019), the main contribution is a framework that supports efficient deployment of RL algorithms in real systems; as an example, the authors also introduce a centralized algorithm for solving IM problems. In contrast, a decentralized algorithm is proposed in (Sultana et al., 2020) to solve IM problems with not only multiple SKUs but also multiple echelons. In both works, the training algorithm is the advantage actor critic (A2C) (Wu et al., 2018).

5.2 TRAINING PARADIGM IN MARL

MARL algorithms generally fall between two frameworks: centralized and decentralized learning. There are two lines of fully decentralized training: independent learning (IL) methods (Tan, 1993; de Witt et al., 2020) and decentralized training with communication (Zhang et al., 2018; Sukhbaatar et al., 2016; Peng et al., 2017). In IL, each agent learns independently to optimize its own reward and perceives the other agents as part of the environment. Other fully decentralized methods usually build a direct communication channel to share messages amongst agents, so as to avoid the non-stationarity issue in the MARL framework. Some fully centralized approaches (Claus & Boutilier, 1998) assume a cooperative game and directly extend single-agent RL algorithms by learning a single policy that produces the joint actions of all agents simultaneously. Another type of centralized method is value decomposition (VD), which typically represents the joint Q-function as a function of agents' local Q-functions (Sunehag et al., 2017; Son et al., 2019; Rashid et al., 2018) and has been considered a gold standard in MARL.
In contrast to the previous methods, centralized training decentralized execution (CTDE) allows sharing of information during training, while policies are only conditioned on the agents' local observations, enabling decentralized execution. The main category of CTDE algorithms is centralized policy gradient methods, in which each agent consists of a decentralized actor and a centralized critic that is optimized based on information shared between the agents. Representative studies of CTDE are MADDPG (Lowe et al., 2017), COMA (Foerster et al., 2018) and MAPPO (Yu et al., 2021), etc.

6 CONCLUSION

In this paper, we address the inventory management problem for a single store with a large number of SKUs. Our method is based on the shared-resource structure, where agents only interact indirectly with each other through the shared resource. By leveraging such structure, our method can outperform existing MARL algorithms in terms of both final performance and computation efficiency. It is worth mentioning that our method is not limited to IM, but is also applicable to a wide range of real-world applications with a shared-resource structure. In real-world applications, we usually need to deal with thousands of agents, which poses a challenge for existing MARL algorithms. To address such challenges, we need to develop efficient algorithms that are compatible with distributed training. In this paper, we take a first step towards developing efficient and practical MARL algorithms for real-world applications with a shared-resource structure, and we will continue to address the above challenges arising in real-world applications in our future work.

REFERENCES

Pritee Agrawal, Pradeep Varakantham, and William Yeoh. Scalable greedy algorithms for task/resource constrained multi-agent stochastic planning. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI 2016), New York, July 9, volume 15, pp. 10–16, 2016. A.2

Souvik Barat, Harshad Khadilkar, Hardik Meisheri, Vinay Kulkarni, Vinita Baniwal, Prashant Kumar, and Monika Gajrani. Actor based simulation for closed loop control of supply chain using reinforcement learning. In Proceedings of the 18th International Conference on Autonomous Agents and Multiagent Systems, pp. 1802–1804, 2019. 5.1

Shalabh Bhatnagar and K Lakshmanan. An online actor–critic algorithm with function approximation for constrained markov decision processes. Journal of Optimization Theory and Applications, 153(3):688–708, 2012. A.2

Craig Boutilier and Tyler Lu. Budget allocation using weakly coupled, constrained markov decision processes. In Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence (UAI-16), pp. 52–61, New York, NY, 2016. A.2

René Carmona, Mathieu Laurière, and Zongjun Tan. Model-free mean-field reinforcement learning: mean-field mdp and mean-field q-learning. arXiv preprint arXiv:1910.12802, 2019. A.1

Wenbo Chen and Huixiao Yang. A heuristic based on quadratic approximation for dual sourcing problem with general lead times and supply capacity uncertainty. IISE Transactions, 51(9):943–956, 2019. ISSN 24725862. doi: 10.1080/24725854.2018.1537532. 1, 5.1

Yu Fan Chen, Miao Liu, Michael Everett, and Jonathan P How. Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 285–292. IEEE, 2017. A.1

Caroline Claus and Craig Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems.
AAAI/IAAI, 1998(746-752):2, 1998. 5.2 Christian Schroeder de Witt, Tarun Gupta, Denys Makoviichuk, Viktor Makoviychuk, Philip HS Torr, Mingfei Sun, and Shimon Whiteson. Is independent learning all you need in the starcraft multi-agent challenge? arXiv preprint arXiv:2011.09533, 2020. 4.1, 5.2 Raghuram Bharadwaj Diddigi, Sai Koti Reddy Danda, Shalabh Bhatnagar, et al. Actor-critic algorithms for constrained multi-agent reinforcement learning. arXiv preprint arXiv:1905.02907, 2019. A.2 Roel Dobbe, David Fridovich-Keil, and Claire Tomlin. Fully decentralized policies for multi-agent systems: An information theoretic approach. arXiv preprint arXiv:1707.06334, 2017. A.1 Dmitri A Dolgov and Edmund H Durfee. Resource allocation among agents with mdp-induced preferences. Journal of Artificial Intelligence Research, 27:505–549, 2006. A.2 Jiarui Fang, Lei Zhao, Jan C. Fransoo, and Tom Van Woensel. Sourcing strategies in supply risk management: An approximate dynamic programming approach. _Computers & Opera-_ _tions Research, 40(5):1371–1382, 2013. ISSN 0305-0548. doi: https://doi.org/10.1016/j.cor._ 2012.08.016. [URL https://www.sciencedirect.com/science/article/pii/](https://www.sciencedirect.com/science/article/pii/S030505481200189X) [S030505481200189X. 1, 5.1](https://www.sciencedirect.com/science/article/pii/S030505481200189X) Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial _Intelligence, volume 32, 2018. 4.1, 5.2, A.1_ Michael Fowler, Pratap Tokekar, T Charles Clancy, and Ryan K Williams. Constrained-action pomdps for multi-agent intelligent knowledge distribution. In 2018 IEEE International Con_ference on Robotics and Automation (ICRA), pp. 3701–3708. IEEE, 2018. A.2_ Yoichiro Fukuda. Optimal Policies for the Inventory Problem with Negotiable Leadtime. Manage_ment Science, 10(4):690–708, 1964. ISSN 0025-1909. doi: 10.1287/mnsc.10.4.690. 5.1_ ----- Ilaria Giannoccaro and Pierpaolo Pontrandolfo. Inventory management in supply chains: A reinforcement learning approach. International Journal of Production Economics, 78(2):153–161, 2002. ISSN 09255273. doi: 10.1016/S0925-5273(00)00156-0. 5.1 Joren Gijsbrechts, Robert N. Boute, Jan Albert Van Mieghem, and Dennis Zhang. Can Deep Reinforcement Learning Improve Inventory Management? Performance and Implementation of Dual Sourcing-Mode Problems. SSRN Electronic Journal, pp. 1–33, 2019. ISSN 1556-5068. doi: 10.2139/ssrn.3302881. 1, 5.1 David A Goldberg, Dmitriy A Katz-Rogozhnikov, Yingdong Lu, Mayank Sharma, and Mark S Squillante. Asymptotic optimality of constant-order policies for lost sales inventory models with large lead times. Mathematics of Operations Research, 41(3):898–913, 2016. 5.1 Jayesh K Gupta, Maxim Egorov, and Mykel Kochenderfer. Cooperative multi-agent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent _Systems, pp. 66–83. Springer, 2017. A.1_ Nir Halman, Diego Klabjan, Mohamed Mostagir, Jim Orlin, and David Simchi-Levi. A fully polynomial-time approximation scheme for single-item stochastic inventory control with discrete demand. Mathematics of Operations Research, 34:674–685, 2009. 1, 5.1 Christian D. Hubbs, Hector D. Perez, Owais Sarwar, Nikolaos V. Sahinidis, Ignacio E. Grossmann, and John M. Wassick. Or-gym: A reinforcement learning library for operations research problems, 2020. 4.1, F Woonghee Tim Huh, Ganesh Janakiraman, John A. 
Muckstadt, and Paat Rusmevichientong. Asymptotic optimality of order-up-to policies in lost sales inventory systems. Management Sci_[ence, 55(3):404–420, 2009. ISSN 00251909, 15265501. URL http://www.jstor.org/](http://www.jstor.org/stable/40539156)_ [stable/40539156. 5.1](http://www.jstor.org/stable/40539156) Chengzhi Jiang and Zhaohan Sheng. Case-based reinforcement learning for dynamic inventory control in a multi-agent supply-chain system. Expert Systems with Applications, 36(3 PART 2): 6520–6526, 2009. ISSN 09574174. doi: 10.1016/j.eswa.2008.07.036. 5.1 Nan Jiang and Alekh Agarwal. Open problem: The dependence of sample complexity lower bounds on planning horizon. In Conference On Learning Theory, pp. 3395–3398. PMLR, 2018. 1 Roman Kapuscinski and Sridhar Tayur. Optimal Policies and Simulation-Based Optimization for _Capacitated Production Inventory Systems, pp. 7–40. Springer US, Boston, MA, 1999. ISBN 978-_ [1-4615-4949-9. doi: 10.1007/978-1-4615-4949-9 2. URL https://doi.org/10.1007/](https://doi.org/10.1007/978-1-4615-4949-9_2) [978-1-4615-4949-9_2. 4.1](https://doi.org/10.1007/978-1-4615-4949-9_2) Ahmet Kara and Ibrahim Dogan. Reinforcement learning approaches for specifying ordering policies of perishable inventory systems. Expert Systems with Applications, 91:150–158, 2018. ISSN 09574174. doi: 10.1016/j.eswa.2017.08.046. 5.1 Orr Krupnik, Igor Mordatch, and Aviv Tamar. Multi-agent reinforcement learning with multi-step generative models. In Conference on Robot Learning, pp. 776–790. PMLR, 2020. A.3 Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent reinforcement learning in sequential social dilemmas. arXiv preprint arXiv:1702.03037, 2017. A.1 Shihui Li, Yi Wu, Xinyue Cui, Honghua Dong, Fei Fang, and Stuart Russell. Robust multi-agent reinforcement learning via minimax deep deterministic policy gradient. In Proceedings of the _AAAI Conference on Artificial Intelligence, volume 33, pp. 4213–4220, 2019. A.1_ Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actorcritic for mixed cooperative-competitive environments. arXiv preprint arXiv:1706.02275, 2017. 1, 5.2, A.1 Xueguang Lyu, Yuchen Xiao, Brett Daley, and Christopher Amato. Contrasting centralized and decentralized critics in multi-agent reinforcement learning. arXiv preprint arXiv:2102.04402, 2021. A.1 ----- Anuj Mahajan, Tabish Rashid, Mikayel Samvelyan, and Shimon Whiteson. Maven: Multi-agent variational exploration. arXiv preprint arXiv:1910.07483, 2019. A.1 S. Makridakis, E. Spiliotis, and V. Assimakopoulos. The m5 accuracy competition: Results, findings and conclusions. International Journal of Forecasting, 36(1):224–227, 2020. 4.1 Andrei Marinescu, Ivana Dusparic, Adam Taylor, Vinny Cahill, and Siobh Clarke. P-marl: Prediction-based multi-agent reinforcement learning for non-stationary environments. In AAMAS, pp. 1897–1898, 2015. A.3 Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. PMLR, 2016. A.1, C.2 Steven Nahmias and Stephen A Smith. Mathematical models of retailer inventory systems: A review. _Perspectives in operations management, pp. 249–278, 1993. 1_ Frans A Oliehoek and Christopher Amato. _A concise introduction to decentralized POMDPs._ Springer, 2016. A.2 Afshin Oroojlooyjadid, M Nazari, Lawrence Snyder, and Martin Tak´aˇc. 
A deep q-network for the beer game: A reinforcement learning algorithm to solve inventory optimization problems. arXiv _preprint arXiv:1708.05924, 2017. 5.1_ Afshin Oroojlooyjadid, Lawrence V. Snyder, and Martin Tak´aˇc. Applying deep learning to the newsvendor problem. IISE Transactions, 52(4):444–463, 2020. ISSN 24725862. doi: 10.1080/ 24725854.2019.1632502. 5.1 Georgios Papoudakis, Filippos Christianos, Lukas Sch¨afer, and Stefano V Albrecht. Benchmarking multi-agent deep reinforcement learning algorithms in cooperative tasks. In Thirty-fifth Confer_ence on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021._ [URL https://openreview.net/forum?id=cIrPX-Sn5n. 4.1, E.1](https://openreview.net/forum?id=cIrPX-Sn5n) Young Joon Park, Yoon Sang Cho, and Seoung Bum Kim. Multi-agent reinforcement learning with approximate model learning for competitive games. PloS one, 14(9):e0222215, 2019. A.3 Bei Peng, Tabish Rashid, Christian A Schroeder de Witt, Pierre-Alexandre Kamienny, Philip HS Torr, Wendelin B¨ohmer, and Shimon Whiteson. Facmac: Factored multi-agent centralised policy gradients. arXiv preprint arXiv:2003.06709, 2020. A.1 Peng Peng, Ying Wen, Yaodong Yang, Quan Yuan, Zhenkun Tang, Haitao Long, and Jun Wang. Multiagent bidirectionally-coordinated nets: Emergence of human-level coordination in learning to play starcraft combat games. arXiv preprint arXiv:1703.10069, 2017. 5.2 Tabish Rashid, Mikayel Samvelyan, Christian Schroeder, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 4295–4304. PMLR, 2018. 4.1, 5.2, A.1 Tabish Rashid, Gregory Farquhar, Bei Peng, and Shimon Whiteson. Weighted qmix: Expanding monotonic value function factorisation. arXiv e-prints, pp. arXiv–2006, 2020. A.1 T. Remani, E. A. Jasmin, and T. P.Imthias Ahamed. Residential Load Scheduling with Renewable Generation in the Smart Grid: A Reinforcement Learning Approach. IEEE Systems Journal, 13 (3):3283–3294, 2019. ISSN 19379234. doi: 10.1109/JSYST.2018.2855689. 1 John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. Highdimensional continuous control using generalized advantage estimation. In Proceedings of the _International Conference on Learning Representations (ICLR), 2016. C.2_ John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. A.1 ----- Lloyd S Shapley. Stochastic games. Proceedings of the national academy of sciences, 39(10): 1095–1100, 1953. 2.1 Kyunghwan Son, Daewoo Kim, Wan Ju Kang, David Earl Hostallero, and Yung Yi. Qtran: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In Interna_tional Conference on Machine Learning, pp. 5887–5896. PMLR, 2019. 5.2, A.1_ Sriram Srinivasan, Marc Lanctot, Vinicius Zambaldi, Julien P´erolat, Karl Tuyls, R´emi Munos, and Michael Bowling. Actor-critic policy optimization in partially observable multiagent environments. arXiv preprint arXiv:1810.09026, 2018. A.1 DJ Strouse, Max Kleiman-Weiner, Josh Tenenbaum, Matt Botvinick, and David Schwab. Learning to share and hide intentions using information regularization. arXiv preprint arXiv:1808.02093, 2018. A.1 Sainbayar Sukhbaatar, Rob Fergus, et al. Learning multiagent communication with backpropagation. Advances in neural information processing systems, 29:2244–2252, 2016. 
5.2 Nazneen N Sultana, Hardik Meisheri, Vinita Baniwal, Somjit Nath, Balaraman Ravindran, and Harshad Khadilkar. Reinforcement learning for multi-product multi-node inventory management in supply chains. arXiv preprint arXiv:2006.04037, 2020. 5.1 Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z Leibo, Karl Tuyls, et al. Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296, 2017. 5.2, A.1 Ming Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings _of the tenth international conference on machine learning, pp. 330–337, 1993. 1, 5.2_ Jianhong Wang, Yuan Zhang, Tae-Kyun Kim, and Yunjie Gu. Shapley q-value: a local reward approach to solve global reward games. In Proceedings of the AAAI Conference on Artificial _Intelligence, volume 34, pp. 7285–7292, 2020a. A.1_ Tonghan Wang, Jianhao Wang, Chongyi Zheng, and Chongjie Zhang. Learning nearly decomposable value functions via communication minimization. arXiv preprint arXiv:1910.05366, 2019. A.1, A.2 Tonghan Wang, Heng Dong, Victor Lesser, and Chongjie Zhang. Roma: Multi-agent reinforcement learning with emergent roles. arXiv preprint arXiv:2003.08039, 2020b. A.1 Yuhuai Wu, Elman Mansimov, Shun Liao, Alec Radford, and John Schulman. Openai baselines: [Acktr & a2c. https://openai.com/blog/baselines-acktr-a2c/, 2018. 5.1](https://openai.com/blog/baselines-acktr-a2c/) Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. Mean field multiagent reinforcement learning. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th _International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning_ _Research, pp. 5567–5576, Stockholmsm¨assan, Stockholm Sweden, 10–15 Jul 2018. PMLR. A.1_ Yunan Ye, Hengzhi Pei, Boxin Wang, Pin Yu Chen, Yada Zhu, Jun Xiao, and Bo Li. Reinforcementlearning based portfolio management with augmented asset movement prediction states. AAAI _2020 - 34th AAAI Conference on Artificial Intelligence, pp. 1112–1119, 2020. ISSN 2159-5399._ doi: 10.1609/aaai.v34i01.5462. 1 Chao Yu, Akash Velu, Eugene Vinitsky, Yu Wang, Alexandre Bayen, and Yi Wu. The surprising effectiveness of mappo in cooperative, multi-agent games. _arXiv preprint arXiv:2103.01955,_ 2021. 4.1, 5.2, A.1, C.1 Kaiqing Zhang, Zhuoran Yang, Han Liu, Tong Zhang, and Tamer Basar. Fully decentralized multiagent reinforcement learning with networked agents. In International Conference on Machine _Learning, pp. 5872–5881. PMLR, 2018. 5.2_ Ruohan Zhang, Yue Yu, Mahmoud El Chamie, Behc¸et Ac¸ikmese, and Dana H Ballard. Decisionmaking policies for heterogeneous autonomous multi-agent systems with safety constraints. In _IJCAI, pp. 546–553, 2016. A.2_ ----- Weinan Zhang, Xihuai Wang, Jian Shen, and Ming Zhou. Model-based multi-agent policy optimization with adaptive opponent-wise rollouts. arXiv preprint arXiv:2105.03363, 2021. A.3 ----- A RELATED WORK A.1 TRAINING PARADIGM IN MARL In this section we will provide a more detailed description of related work for training paradigm used in MARL. From the perspective of training schemes, it can be devided into three categories: decentralized training decentralized execution (DTDE), centralized training centralized execution (CTCE) and centralized training decentralized execution (CTDE). Recent deep MARL works often use the CTDE or CTCE training pradigm. 
The CTCE paradigm allows the straightforward application of single-agent approaches such as actor-critic (Mnih et al., 2016) or policy gradient algorithms (Schulman et al., 2017) to multi-agent problems. The representative work on CTCE is (Gupta et al., 2017), which represents the centralized executor as a set of independent sub-policies so that the agents' individual action distributions are captured rather than the joint action distribution over all agents.

Value-based CTDE approaches, also known as value decomposition methods (Peng et al., 2020; Mahajan et al., 2019; Rashid et al., 2020; 2018; Son et al., 2019; Sunehag et al., 2017; Wang et al., 2020b; 2019), mainly focus on how centrally learned value functions can be reasonably decoupled into decentralized ones, and have shown promising results. Policy-gradient-based CTDE methods, on the other hand, rely heavily on centralized critics. One of the first works utilizing a centralized critic was COMA (Foerster et al., 2018), a framework adopting a centralized critic with a counterfactual baseline. Regarding convergence properties, COMA establishes that the overall effect of a decentralized policy gradient with a centralized critic reduces to a single-agent actor-critic approach, which ensures convergence under assumptions similar to those of A2C. Concurrently with COMA, MADDPG (Lowe et al., 2017) proposed to use a dedicated centralized critic for each agent in semi-competitive domains, demonstrating compelling empirical results in continuous-action environments. Recently, MAPPO (Yu et al., 2021), an on-policy policy-gradient multi-agent reinforcement learning algorithm, achieved results comparable to the state of the art on a variety of cooperative multi-agent challenges. Despite its on-policy nature, MAPPO is competitive with ubiquitous off-policy methods such as MADDPG, QMIX, and RODE in terms of final performance, and in the vast majority of cases it is comparable to off-policy methods in terms of sample efficiency. In addition, much follow-up work inspired by MADDPG, COMA, or MAPPO also adopts centralized-critic baselines, e.g., M3DDPG (Li et al., 2019) and SQDDPG (Wang et al., 2020a). Mean-field Q-learning (Yang et al., 2018; Carmona et al., 2019) takes a different approach from the CTDE-based methods: it employs a mean-field approximation over the joint action space in order to address the scalability issue of the prior methods.

Contrary to CTDE, in the DTDE paradigm each agent has an associated policy that maps local observations to a distribution over individual actions. No information is shared between agents, so each agent learns independently. DTDE has been applied to cooperative navigation tasks (Chen et al., 2017; Strouse et al., 2018), to partially observable domains (Dobbe et al., 2017; Srinivasan et al., 2018), and to social dilemmas (Leibo et al., 2017). For more comparisons of centralized and decentralized critics, please see (Lyu et al., 2021).

In this paper, we design a decentralized training paradigm that avoids the flaws of the traditional training paradigms proposed in the literature. The fundamental drawback of the DTDE paradigm is that the environment appears non-stationary from a single agent's viewpoint, because agents neither have access to the knowledge of others nor perceive the joint action.
Some studies report that DTDE scales poorly with the number of agents due to the extra sample complexity added to the learning problem (Gupta et al., 2017). An obvious flaw of CTDE/CTCE is that the state-action space grows exponentially with the number of agents. Even though there have been attempts to factor the joint model into individual policies for each agent to address this so-called curse of dimensionality, CTDE/CTCE methods still have to take at least the joint state over all agents as input to approximate a global value function that guides the centralized critics or decentralized policies. Consequently, these traditional training schemes still do not scale well to large numbers of agents when the system state is the combination of every agent's local state. In contrast, our approach trains agents independently with a learned dynamics model of the utilization trend of the shared resource, which gives each agent enough information to learn a good policy (we explain this in the following sections). At the same time, we also improve the efficiency of data sampling, since we do not always use the original joint simulator containing all agents for data collection. Instead, we mainly run light-weight local simulators that embed the learned dynamics model, which significantly reduces the cost of the data-collection process, especially when running the joint simulator (as in inventory management) is expensive.

A.2 MDP SCENARIOS IN MARL

To our knowledge, only a few related variants of the MDP setting have been studied under the MARL framework. Dec-POMDP (Oliehoek & Amato, 2016) is the most common setting in MARL research, especially for fully cooperative tasks in which agents share a global team reward rather than individual rewards. In the Constrained MDP setting (C-MDP; Bhatnagar & Lakshmanan, 2012; Wang et al., 2019; Diddigi et al., 2019), the system is subject to constraints such as penalties for illegal actions in autonomous driving (Zhang et al., 2016; Fowler et al., 2018), resource-allocation constraints in scheduling tasks (Agrawal et al., 2016; Dolgov & Durfee, 2006), or limited communication bandwidth among agents (Fowler et al., 2018). Similar to C-MDP, the Weakly-Coupled Constrained MDP (WC-C-MDP; Boutilier & Lu, 2016) considers budget (or other resource) allocation in sequential decision problems involving a large number of concurrently running sub-processes whose only interaction is their consumption of the budget. Different from these scenarios, we focus on the situation where agents can only obtain individual rewards from the environment, so the team reward is not observable. Moreover, compared with C-MDP and WC-C-MDP, the penalty for overstocking in our setting depends on the future stock levels of all agents, whereas in C-MDP and WC-C-MDP the corresponding cost function is based only on historical states of the system. As a result, we cannot simply optimize the combined objective of the long-term sum of return and penalty.

A.3 MODEL-BASED MARL

For model-based MARL, there is relatively little literature. P-MARL (Marinescu et al., 2015) proposed an approach that integrates prediction and pattern-change detection into MARL and thus minimises the effect of non-stationarity in the environment. The environment is modelled as a time series, with future estimates provided by prediction techniques.
Learning is based on the predicted environment behaviour, with agents employing this knowledge to improve their performance in real time. Park et al. (2019) proposed a centralized auxiliary prediction network that models the environment dynamics in order to alleviate the non-stationarity problem. Krupnik et al. (2020) built a centralized multi-step generative model with a disentangled variational auto-encoder to predict the environment dynamics and the opponent actions, and then performed trajectory planning. AORPO (Zhang et al., 2021) is a Dyna-style method in which each agent builds a multi-agent environment model consisting of a dynamics model and multiple opponent models, and trains its policy with data generated both from adaptive opponent-wise rollouts and from interaction with the real environment. To our knowledge, unlike previous work, our approach is the first to accelerate learning in MARL by modelling only part of the dynamics of the whole system. Our algorithm is also a Dyna-style method like AORPO, but modelling only the shared-resource dynamics significantly reduces the difficulty of model learning and makes the learning process more efficient.

A.4 MEAN-FIELD MARL

B DETAILS FOR INVENTORY MANAGEMENT PROBLEM

We summarize all notation for Section 2.2 in Table 2. We also give an example with 2 SKUs in Fig. 4 to further illustrate the whole procedure of the inventory dynamics.

Figure 4: A diagram illustrating the inventory dynamics over two time steps.

Table 2: Notations.

|Notation|Explanation|
|---|---|
|İ_t^i|Units in stock of SKU i at the t-th time step|
|Î_t^i|Units in stock of SKU i at the end of time step t if nothing is discarded|
|O_t^i|Order quantity of the i-th SKU at the t-th time step|
|D_t^i|Demand of the i-th SKU at the t-th time step|
|S_t^i|Sale quantity of the i-th SKU at the t-th time step|
|T_t^i|Units in transit of the i-th SKU at the t-th time step|
|P_t^i|Profit generated on the i-th SKU at the t-th time step|
|p^i|Unit sales price of the i-th SKU|
|q^i|Unit procurement cost of the i-th SKU|
|o|Unit order cost|
|h|Unit holding cost per time step|

C ALGORITHM DETAILS

Here we provide the pseudocode of our algorithm with context augmentation in Algorithm 4; the details of its sub-algorithms are introduced in the following subsections. Note that all pseudocode assumes RNN-based networks.

C.1 GET CONTEXT DYNAMICS

Our algorithm follows the algorithmic structure of Multi-Agent PPO (Yu et al., 2021), in which we maintain two separate networks for π_θ and V_φ(s). The first stage of our algorithm only runs the policies in the original joint environment to collect episodes of context dynamics, i.e., the trajectories of c_t^i and c_t^{-i}. This process is similar to that of other traditional MARL methods. Note that we can also save the transitions of each agent for training the policy and value-function networks. In the pseudocode for joint sampling (Algorithm 2), we only record the part concerning the context dynamics, which is used to train the context model for the next stage. After collecting the context dynamics, we train an LSTM model f_c as a surrogate predictor that provides extra augmentation of the collected dynamics in the next stage. In detail, we split the collected context trajectories into sub-sequences of length 8, and the training objective is to minimize the mean-squared error of the predicted capacity on the next day given the dynamics of the previous L days:

$$\min_{\omega} \left( f_c(c_{t-L}, \ldots, c_t; \omega) - c_{t+1} \right)^2 \qquad (14)$$
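As an illustration of Eq. 14, the following is a minimal PyTorch sketch of such a capacity-context predictor; the module and function names (`ContextLSTM`, `train_context_model`) and the training-loop details are illustrative assumptions rather than the implementation in our codebase.

```python
import torch
import torch.nn as nn


class ContextLSTM(nn.Module):
    """Predict the next-step shared-resource context from the previous L steps (Eq. 14)."""

    def __init__(self, context_dim: int = 1, hidden_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(context_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, context_dim)

    def forward(self, window: torch.Tensor) -> torch.Tensor:
        # window: (batch, L, context_dim); return the prediction for the next step
        out, _ = self.lstm(window)
        return self.head(out[:, -1])


def train_context_model(model, trajectories, L=8, epochs=10, lr=1e-3):
    """Minimise (f_c(c_{t-L..t}; w) - c_{t+1})^2 over sliding windows of the recorded context."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for traj in trajectories:                 # traj: tensor of shape (T, context_dim)
            for t in range(traj.shape[0] - L):
                window = traj[t:t + L].unsqueeze(0)   # the previous L context values
                target = traj[t + L].unsqueeze(0)     # the next-step context value
                loss = loss_fn(model(window), target)
                opt.zero_grad()
                loss.backward()
                opt.step()
    return model
```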
C.2 DECENTRALIZED PPO

With the collected context dynamics of the shared resource, the second stage is straightforward: sample from the local simulator of each agent and then train the policy and critic with the resulting data. The main difference between our training paradigm and traditional MARL methods under the CTDE structure is that we directly sample local observations in separate simulators in which only one agent exists, rather than in the joint simulator in which all agents interact with each other. In other words, in the new local simulator there is only one SKU in the entire store, and the trend of available capacity is simulated entirely according to the given context dynamics. In practice, we initialize new instances of the original inventory environment in parallel, each configured to contain only a specific SKU i and a fixed context trajectory. For the fixed context trajectory, we use the subtraction result {c_t^{-i}}_{t=0}^{T} with two augmentations: 1) add noise to some entries with a predefined probability; 2) replace some entries with values predicted by the trained context model, also with a predefined probability. We then run the policy in the local simulators to generate episodes under the embedded context dynamics and put them into a shared replay buffer, since all transitions are homogeneous and the parameters are shared across all policies. Decentralized training then proceeds on this shared replay buffer of transitions collected from the local simulators.

We consider a variant of the advantage function based on decentralized learning, where each agent learns an agent-specific, state-based critic V_φ(s_t^i), parameterised by φ, and uses Generalized Advantage Estimation (GAE; Schulman et al., 2016) with discount factor γ = 0.99 and λ = 0.95. We also add an entropy regularization term to the final policy loss (Mnih et al., 2016). For each agent i, the advantage estimate is

$$A^i_t = \sum_{l=0}^{h} (\gamma\lambda)^l \, \delta^i_{t+l} \qquad (15)$$

where δ^i_t = r^i_t(s^i_t, a^i_t) + γ V_φ(s^i_{t+1}) − V_φ(s^i_t) is the TD error at time step t, h is the number of estimation steps, and r^i_t(s^i_t, a^i_t) is the individual reward provided by the local simulator. The final policy loss for each agent i becomes

$$L^i(\theta) = \mathbb{E}_{s^i_t, a^i_t \sim \mathcal{T}_{\mathrm{local}}(c^{-i}_t)} \left[ \min\!\left( \frac{\pi_\theta(a^i_t \mid s^i_t)}{\pi_{\theta_{\mathrm{old}}}(a^i_t \mid s^i_t)} A^i_t,\; \mathrm{clip}\!\left( \frac{\pi_\theta(a^i_t \mid s^i_t)}{\pi_{\theta_{\mathrm{old}}}(a^i_t \mid s^i_t)}, 1-\epsilon, 1+\epsilon \right) A^i_t \right) \right] \qquad (16)$$

For training the value function, in addition to clipping the policy updates, our method also uses value clipping to restrict the update of the critic of each agent i to be smaller than ϵ, in the same fashion as PPO-style implementations:

$$L^i(\phi) = \mathbb{E}_{s^i_t \sim \mathcal{T}_{\mathrm{local}}(c^{-i}_t)} \left[ \max\!\left( \left( V_\phi(s^i_t) - \hat{V}^i_t \right)^2,\; \left( V_{\phi_{\mathrm{old}}}(s^i_t) + \mathrm{clip}\!\left( V_\phi(s^i_t) - V_{\phi_{\mathrm{old}}}(s^i_t), -\epsilon, +\epsilon \right) - \hat{V}^i_t \right)^2 \right) \right] \qquad (17)$$

where φ_old are the parameters before the update and V̂^i_t = A^i_t + V_φ(s^i_t). This restricts the update of the value function to within the trust region and therefore helps us to avoid overfitting to the most recent batch of data.
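To make Eqs. 15–17 concrete, here is a minimal PyTorch sketch of the advantage estimation and the clipped losses; the function names and tensor shapes are assumptions made for illustration, not the exact code used in our experiments.

```python
import torch


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (Eq. 15) for one agent's trajectory.

    `values` must contain T + 1 entries (including the bootstrap value of the final state).
    """
    T = rewards.shape[0]
    adv = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error delta_t^i
        gae = delta + gamma * lam * gae
        adv[t] = gae
    return adv


def clipped_policy_loss(logp_new, logp_old, adv, eps=0.2):
    """PPO clipped surrogate objective (Eq. 16), returned as a loss to minimise."""
    ratio = torch.exp(logp_new - logp_old)
    return -torch.min(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv).mean()


def clipped_value_loss(v_new, v_old, returns, eps=0.2):
    """Value clipping (Eq. 17): keep the critic update within a trust region around v_old."""
    v_clipped = v_old + torch.clamp(v_new - v_old, -eps, eps)
    return torch.max((v_new - returns) ** 2, (v_clipped - returns) ** 2).mean()
```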
For each agent, the overall learning loss becomes

$$L(\theta, \phi) = \sum_{i=1}^{N} \left( L^i(\theta) + \lambda_{\mathrm{critic}} L^i(\phi) + \lambda_{\mathrm{entropy}} H(\pi^i) \right) \qquad (18)$$

All networks are trained in a decentralized way, since their inputs are local variables that stem from the light-weight local simulators. As mentioned before, at this learning stage there are no interactions between any two agents. Although this resembles independent learning, we point out that we use the global context simulated from the joint environment, which is essentially different from independent learning methods: they do not consider this kind of global information, which in our case is generated by the joint simulator and then kept fixed in the local simulators. Our decentralized training has several advantages. First, the local simulator runs efficiently because of its simple single-agent transition function. Second, this paradigm avoids the non-stationarity issue of traditional MARL methods, since there is no interaction among agents and thus no need to account for the influence of other agents. Third, we can use more diverse context trajectories to expose agents to various levels of available storage, which improves the generalization of the trained networks. Fourth, this training paradigm is easy to extend to large-scale distributed training by running parallel simulations, whose communication cost is acceptable for modern distributed training frameworks.

If the critic and actor networks are RNNs, the loss functions additionally sum over time, and the networks are trained via Backpropagation Through Time (BPTT). Pseudocode for local sampling with the recurrent version of the policy networks is shown in Algorithm 3.

**Algorithm 2 GetContextDynamics**
INPUT: policies {π^i_{θ_i}}_{i=1}^{n} and the joint simulator Env_joint
(Optional) Initialize ω, the parameters of the context model f_c
Set data buffer D = {}
for bs = 1 to batch_size do
    τ = [] (empty list)
    Initialize h^1_{0,π}, ..., h^n_{0,π} (actor RNN states)
    for t = 1 to T do
        for all agents i do
            p^i_t, h^i_{t,π} = π^i(s^i_t, c_t, h^i_{t−1,π}; θ_i)
            a^i_t ∼ p^i_t
        end for
        Execute actions a_t, observe r_t, s_{t+1}
        τ += [s_t, c_t, h_{t,π}, a_t, r_t, s_{t+1}, c_{t+1}]
    end for
    // Split the recorded context trajectory into chunks of length L
    for l = 0, 1, ..., T//L do
        D = D ∪ (c[lL : (l+1)L])
    end for
end for
// (Optional) Train the context model for augmentation
if the context model needs to be trained then
    for mini-batch k = 1, ..., K_1 do
        b ← random mini-batch from D with all agent data
        Adam update of ω on b using the capacity-context objective in Eq. 14
    end for
end if
OUTPUT: context dynamics {c^{-i}_t}_{t=1}^{T} for i = 1, ..., n; (Optional) f_ω
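As an informal companion to Algorithm 2, the sketch below shows the joint-sampling loop and the chunking of the recorded context; the environment and policy interfaces (`env.reset`, `env.step`, `pi.act`) are hypothetical stand-ins for the joint simulator, not the interface of the released codebase.

```python
import numpy as np


def get_context_dynamics(env, policies, batch_size, T, chunk_len=8):
    """Roll out the joint simulator and record the shared-context trajectory (Algorithm 2).

    Returns overlapping context chunks of length chunk_len + 1, which can be used directly
    to fit the context model f_c from Eq. 14 (inputs plus next-step target).
    """
    chunks = []
    for _ in range(batch_size):
        obs, context = env.reset()                  # per-agent observations, shared context
        contexts = [context]
        for _ in range(T):
            actions = [pi.act(o, context) for pi, o in zip(policies, obs)]
            obs, context, rewards, done = env.step(actions)
            contexts.append(context)
            if done:
                break
        traj = np.asarray(contexts, dtype=float)
        # Split the recorded context trajectory into fixed-length chunks
        for start in range(len(traj) - chunk_len):
            chunks.append(traj[start:start + chunk_len + 1])
    return np.stack(chunks)
```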
D LOCAL SAMPLING VS. JOINT SAMPLING

Our algorithm follows a decentralized training paradigm, which is compatible with distributed training frameworks: we can run local simulations in parallel, and the communication cost of the context dynamics is acceptable. With distributed training, our method can be much more efficient than learning from the joint simulator directly. One may argue that we could also learn from the joint simulator efficiently by implementing a distributed/parallelized joint simulator. While this can indeed improve sampling efficiency in many scenarios, its usefulness is limited, particularly for systems with a large number of agents. In a typical multi-agent system, interactions occur frequently among all agents; indeed, one advantage of multi-agent systems is precisely their ability to model complex interactions among agents. However, such interactions markedly reduce the efficiency gains brought by parallelism, because they require the involved agents to synchronize, which usually consumes a lot of time on waiting and communication. For instance, in IM problems all agents must be synchronized at each time step in order to compute ρ in Eq. 6 before moving to the next time step. In contrast, our local sampling approach removes such costly interactions by exploiting the special structure of shared-resource stochastic games: there is no need for agents to synchronize before moving to the next time step in the local simulators. As a result, our method can be much more efficient in practice than learning only from the joint simulator.

**Algorithm 3 DecentralizedPPO**
// Generate data for agent i with the corresponding context dynamics
INPUT: local simulator Env^i_local, policy π^i_{θ_i}, and value function V^i_{φ_i}
Set data buffer D = {}
Initialize h^i_{0,π} (actor RNN state) and h^i_{0,V} (critic RNN state)
τ = [] (empty list)
for t = 1 to T do
    p^i_t, h^i_{t,π} = π^i(s^i_t, c^i_t, h^i_{t−1,π}; θ_i)
    a^i_t ∼ p^i_t
    v^i_t, h^i_{t,V} = V^i(s^i_t, c^i_t, h^i_{t−1,V}; φ_i)
    Execute action a^i_t in Env^i_local, then observe r^i_t, c^i_{t+1}, s^i_{t+1}
    τ += [s^i_t, c^i_t, a^i_t, h^i_{t,π}, h^i_{t,V}, s^i_{t+1}, c^i_{t+1}]
end for
Compute advantage estimates Â via GAE on τ (Eq. 15)
Compute rewards-to-go R̂ on τ
// Split trajectory τ into chunks of length L and store them in D
for l = 0, 1, ..., T//L do
    D = D ∪ (τ[lL : (l+1)L], Â[lL : (l+1)L], R̂[lL : (l+1)L])
end for
for mini-batch k = 1, ..., K_2 do
    b ← random mini-batch from D
    for each data chunk in the mini-batch b do
        Update the RNN hidden states for π^i and V^i from the first hidden state in the data chunk
    end for
    Compute the overall loss according to Eq. 16 to Eq. 18
    Adam update θ_i on L^i(θ_i) and H with data b
    Adam update φ_i on L^i(φ_i) with data b
end for
OUTPUT: policy π^i_{θ_i} and value function V^i_{φ_i}

**Algorithm 4 Context-aware Decentralized PPO with Context Augmentation**
Given the joint simulator Env_joint and local simulators {Env^i_local}_{i=1}^{n}
Initialize policies π^i and value functions V^i for i = 1, ..., n
Initialize the context model f_ω and the augmentation probability p_aug
for M epochs do
    // Collect context dynamics by running the joint simulation
    {c^{-1}_t}_{t=0}^{T}, ..., {c^{-n}_t}_{t=0}^{T}, f_ω ← GetContextDynamics(Env_joint, {π^i}_{i=1}^{n}, f_ω) (Algorithm 2)
    for k = 1, 2, ..., K do
        for all agents i do
            // Set the capacity trajectory using the augmented context dynamics
            Env^i_local.set_c_trajectory(aug({c^{-i}_t}_{t=0}^{T}, f_ω, p_aug))
            // Train the policy by running simulation in the corresponding local environment
            π^i, V^i ← DecentralizedPPO(Env^i_local, π^i, V^i) (Algorithm 3)
        end for
    end for
    Evaluate the policies {π^i}_{i=1}^{n} on the joint simulator Env_joint
end for
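The aug(·) step in Algorithm 4 can be sketched as follows, assuming a trained context model with a one-step `predict` interface; the perturbation scale and the even split between noise injection and model replacement are illustrative choices rather than the exact scheme used in our experiments.

```python
import numpy as np


def augment_context(context_traj, context_model, p_aug, noise_scale=0.05, window=8):
    """Perturb a recorded context trajectory {c_t^{-i}} before fixing it in a local simulator.

    Each entry is, with probability p_aug, either jittered with Gaussian noise or replaced by
    the context model's one-step prediction from the preceding `window` steps.
    """
    traj = np.array(context_traj, dtype=float)
    for t in range(window, len(traj)):
        if np.random.rand() >= p_aug:
            continue                                   # keep the original value
        if context_model is None or np.random.rand() < 0.5:
            traj[t] += np.random.normal(scale=noise_scale * (abs(traj[t]) + 1e-8))
        else:
            traj[t] = context_model.predict(traj[t - window:t])   # hypothetical interface
    return traj
```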
E TRAINING DETAILS

E.1 THE CODEBASE

As part of this work we extended the well-known EPyMARL codebase (Papoudakis et al., 2021), which already includes several commonly used algorithms, supports many environments, and allows flexible tuning of implementation details, to integrate our simulator and algorithm. This makes it convenient to compare our algorithm with the other baselines. All code for our new codebase is publicly available on Anonymous GitHub under the following link: [https://anonymous.4open.science/r/replenishment-marl-baselines-75F4](https://anonymous.4open.science/r/replenishment-marl-baselines-75F4)

E.2 HYPERPARAMETER DETAILS

Table 3 presents the hyperparameters used in our algorithm for the 5-SKU environments.

Table 3: Hyperparameters used in CD-PPO.

|Hyperparameter|Value|
|---|---|
|runner|ParallelRunner|
|batch size run|10|
|decoupled training|True|
|use individual envs|True|
|max individual envs|5|
|decoupled iterations|1|
|train with joint data|True|
|context perturbation prob|1.0|
|hidden dimension|64|
|learning rate|0.00025|
|reward standardisation|True|
|network type|FC|
|entropy coefficient|0.01|
|target update|200 (hard)|
|n-step|5|

E.3 DETAILS OF STATES AND REWARDS

Table 4 shows the features of the state for the MARL agent corresponding to the i-th SKU at the t-th time step. It is worth noting that we use the profit P_t^i generated on the i-th SKU at the t-th time step, divided by 10^6, as the individual reward of the i-th agent at the t-th time step. For team-reward methods, we simply sum up all the individual rewards, which corresponds to the daily profit of the whole store at the t-th time step, divided by 10^6.

Table 4: Features of the state.

| |Features|
|---|---|
|State: storage information|Storage capacity C|
|State: inventory information|Quantity of products in stock I_t^i; quantity of products in transit T_t^i|
|State: history information|Sale history in the last 21 days S_{t−21}^i, ..., S_t^i; replenishment history in the last 21 days O_{t−21}^i, ..., O_{t−1}^i|
|State: product information|Unit sales price p^i; unit procurement cost q^i|
|Context: global storage utilization|Current total storage level of the store, Σ_j I_t^j|
|Context: global unloading level|Current total unloading level of the store, Σ_j O_{t−L_j}^j|
|Context: global excess level|Current total excess level of the store, ρ × Σ_j O_{t−L_j}^j|
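As a rough illustration of how a per-SKU observation from Table 4 could be assembled, consider the sketch below; the `sku`/`store` objects and their attributes are hypothetical helpers introduced only for this example and are not part of the released codebase.

```python
import numpy as np


def build_observation(sku, store, t, hist_len=21):
    """Concatenate the Table 4 features for SKU i at time step t into a flat vector."""
    local_features = [
        store.capacity,                      # storage capacity C
        sku.in_stock[t],                     # units in stock I_t^i
        sku.in_transit[t],                   # units in transit T_t^i
        *sku.sales[t - hist_len:t],          # sale history over the last 21 steps
        *sku.orders[t - hist_len:t],         # replenishment history over the last 21 steps
        sku.price,                           # unit sales price p^i
        sku.cost,                            # unit procurement cost q^i
    ]
    context_features = [
        store.total_stock(t),                # current total storage level of the store
        store.total_unloading(t),            # current total unloading level of the store
        store.total_excess(t),               # current total excess level of the store
    ]
    return np.asarray(local_features + context_features, dtype=np.float32)
```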
F BASE-STOCK POLICY

**Algorithm 5 Base-stock Policy**
Input: {D_t^i}_{t=t_1}^{t_2}, v, τ, IM parameters
Output: Z_i, {O_t^i}_{t=t_3}^{t_4}
// Description: base-stock policy for a single SKU i
// Z_i: base-stock level for SKU i
// {O_t^i}_{t=t_3}^{t_4}: base-stock replenishment decisions for SKU i, from t_3 to t_4
// {D_t^i}_{t=t_1}^{t_2}: demand series of SKU i used to infer Z_i, from t_1 to t_2
// v: v ≥ 1, v ∈ R, hyper-parameter controlling the storage utilization level
// τ: τ ∈ N+, hyper-parameter controlling the replenishing interval
// IM parameters: including the leading time L_i, the storage capacity C, etc.
// Solve the following mixed-integer program with the dual simplex method:
Z_i ← argmax_{Z_i} Σ_{t=t_1}^{t_2} P_t^i
    s.t.  S_t^i = min(D_t^i, I_t^i)
          P_t^i = p^i S_t^i − q^i O_t^i − h I_t^i
          I_{t+1}^i = I_t^i − S_t^i + O_{t−L_i+1}^i
          T_{t+1}^i = T_t^i − O_{t−L_i+1}^i + O_t^i
          O_t^i = max(0, Z_i − I_t^i − T_t^i),   t_1 ≤ t ≤ t_2
// Replenishing policy deduction:
O_t^i = max(0, Z_i − I_t^i − T_t^i),   t_3 ≤ t ≤ t_4
O_t^i = min(O_t^i, vC − Σ_{j=1}^{n} (I_t^j + T_t^j)),   t_3 ≤ t ≤ t_4
O_t^i = O_t^i · 1[t mod τ = 0],   t_3 ≤ t ≤ t_4
return Z_i, {O_t^i}_{t=t_3}^{t_4}

In addition to the related MARL baselines, we also include the well-known "Base-stock" algorithm from the OR community as a non-RL baseline. The pseudocode of the base-stock policy is given in Algorithm 5, where Z_i, called the base-stock level of agent i, is computed by solving a mixed-integer programming problem. Z_i is then used to guide the replenishment of agent i periodically. We note that the base-stock policy cannot handle complex IM variants such as coordinated replenishment (multiple SKUs sharing storage capacity), order costs, and stochastic VLTs; these realistic constraints are exactly what we use to test the MARL algorithms. As a result, Base-stock policies may constantly overflow the warehouse capacity when storage is tight, in which case incoming products are discarded proportionally, as explained in Section 2.2. This explains Base-stock's poor performance in some of the environments. Base-stock uses linear programming to work out the base-stock levels and thus the replenishing policy. "Static" means the base-stock levels are computed from the demand data of the training set and then kept fixed while making decisions on the test set. "Dynamic" updates its base-stock levels on a regular time cycle. "Oracle" directly accesses the whole test set to calculate its base-stock levels; it is essentially a cheating version used to show the upper limit of the Base-stock policy. In practice, we conduct a grid search to find proper v and τ. We refer readers to Hubbs et al. (2020) for a detailed description; our implementation of the Base-stock baselines is also inspired by it.
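The replenishment rule deduced at the end of Algorithm 5 amounts to the following order-up-to logic, assuming the base-stock level Z_i has already been solved for; the function is a sketch whose argument names mirror the pseudocode rather than our actual implementation.

```python
def base_stock_order(Z_i, in_stock, in_transit, total_inventory, capacity, t, v=1.0, tau=1):
    """Order-up-to-Z_i replenishment for SKU i with capacity clipping and interval control."""
    order = max(0.0, Z_i - in_stock - in_transit)          # O_t = max(0, Z_i - I_t - T_t)
    order = min(order, v * capacity - total_inventory)     # respect the (scaled) storage capacity
    if t % tau != 0:                                       # replenish only every tau time steps
        order = 0.0
    return max(order, 0.0)
```

Here `total_inventory` stands for the sum of stock and in-transit units over all SKUs, matching the capacity constraint in the pseudocode.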
G ADDITIONAL RESULTS

G.1 THE FULL RESULTS ON ALL ENVIRONMENTS

Here we present all experimental results in Table 5 and display the training curves of the algorithms in the 50-SKU scenarios in Figure 5 and Figure 6. We also report the results of the OR method designed for the IM problem, namely Base-stock, in Table 5. Please note that the VLT (i.e., leading time) of each SKU in N50 and N100 is stochastic during simulation: the specified VLTs are modelled as exponential distributions in our simulator, and the real lead time of each procurement can be longer than specified due to a lack of upstream vehicles (distribution capacity is also modelled in the simulator, though it is not the focus of this paper). Hence the results on N50 and N100 all involve stochastic VLTs.

Table 5: Profit comparison on all environments (profit in 10k dollars).

|Env Scenario|CD-PPO (ours)|IPPO-IR (w/o context)|MAPPO-IR (w/o context)|IPPO-IR|MAPPO-IR|IPPO (w/o context)|MAPPO (w/o context)|IPPO|MAPPO|Base-stock (Static)|Base-stock (Dynamic)|Base-stock (Oracle)|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|N5-C50|40.58 ± 6.02|40.37 ± 4.89|39.32 ± 15.53|43.33 ± 3.30|54.87 ± 9.26|74.11 ± 1.55|49.24 ± 1.32|63.22 ± 13.75|48.49 ± 1.89|17.4834|33.9469|38.6207|
|N5-C100|99.21 ± 1.91|92.41 ± 2.78|94.70 ± 18.84|91.38 ± 3.57|97.69 ± 14.41|97.89 ± 6.65|74.71 ± 1.51|92.90 ± 13.36|71.57 ± 3.14|48.8944|80.8602|97.7010|
|N50-C500|310.81 ± 76.46|235.09 ± 60.61|N/A|250.03 ± 58.38|N/A|164.43 ± 143.01|N/A|366.74 ± 89.58|N/A|−430.0810|−408.1434|−397.831|
|N50-C2000|694.87 ± 174.184|689.27 ± 48.92|N/A|545.86 ± 459.71|N/A|−1373.29 ± 870.03|N/A|−1102.97 ± 1115.69|N/A|−15.5912|42.7092|1023.6574|
|N100-C1000|660.28 ± 149.94|−2106.98 ± 315.38|N/A|−1126.42 ± 409.83|N/A|−1768.19 ± 1063.61|N/A|−669.83 ± 1395.92|N/A|−173.39|−22.05|91.17|
|N100-C4000|1297.75 ± 124.52|−2223.11 ± 2536.00|N/A|148.00 ± 1017.47|N/A|−6501.42 ± 6234.06|N/A|−6019.28 ± 9056.49|N/A|410.59|493.32|755.47|

Table 6: Average number of samples (in 10k) needed by different algorithms to reach the median performance of the baselines on all environments.

|Env Scenario|CD-PPO (ours)|IPPO-IR (w/o context)|MAPPO-IR (w/o context)|IPPO-IR|MAPPO-IR|IPPO (w/o context)|MAPPO (w/o context)|IPPO|MAPPO|
|---|---|---|---|---|---|---|---|---|---|
|N5-C50|∞|∞|∞|∞|708.10|588.63|1671.10|708.10|1671.10|
|N5-C100|522.80|∞|711.97|∞|987.27|806.56|∞|1298.40|∞|
|N50-C500|5484.49|∞|N/A|∞|N/A|9195.23|N/A|12802.89|N/A|
|N50-C2000|2996.87|3138.49|N/A|8483.04|N/A|∞|N/A|∞|N/A|
|N100-C1000|47.14|∞|N/A|∞|N/A|539.26|N/A|191.88|N/A|
|N100-C4000|60.07|∞|N/A|127.57|N/A|∞|N/A|1151.57|N/A|

It is worth noting that, across all the environments we experimented on, CD-PPO enjoys higher sample efficiency, as shown in Table 6. In environments where the storage capacity is extremely tight, such as N5-C50, CD-PPO does not perform as well as IPPO. We reason that these tight situations favour IPPO (with team reward), which can quickly learn how to adjust each agent's behaviour to improve the team reward, whereas CD-PPO, owing to its decentralized training paradigm, struggles to learn a highly cooperative policy that copes with the strong resource limits while meeting stochastic customer demand. Nevertheless, CD-PPO outperforms its IR (w/o context) baselines and produces performance comparable to the IR (with context) algorithms. On the other hand, IPPO (with team reward) struggles to learn a good policy in the 50-agent environments, since the joint action space is large relative to the single team-reward signal, and it fails on N50-C2000, whose state space is even larger. In summary, CD-PPO has clear strengths in terms of sample efficiency, and we will continue to address the above challenge in future work.

Figure 5: Training curves on N50-C500.

Figure 6: Training curves on N50-C2000.

G.2 ABLATION STUDIES FOR CONTEXT AUGMENTATION

In this section, we seek a better way to augment the context dynamics in order to boost the performance of our algorithm.
We consider the following two questions:

**Q1: Which is the better way to augment the original data, adding noise or using a deep prediction model?**

To answer this question, we run experiments on environment N5-C100 with our algorithm CD-PPO, using either context trajectories generated by a deep LSTM prediction model or the original dynamics trajectories perturbed with random normal noise, each with 3 different seeds. As shown in Figure 7, the runs with model-generated dynamics enjoy a smaller standard deviation and better final performance. This may be because the diversity of the model-generated trajectories surpasses that of random perturbation.

Figure 7: Training curves of CD-PPO with different augmentation methods.

**Q2: Does dynamics augmentation improve the performance of the algorithm? If so, how much should we perturb the original data?**

We run similar experiments on environment N5-C100 with CD-PPO, in which the local simulator is fed a mixture of original dynamics data and LSTM-generated data, with the ratio of perturbed dynamics data varying from 0% to 100%. We find that the algorithm achieves the best performance when fully generated data are used, as shown in Figure 8.

Figure 8: Training curves of CD-PPO with varied ratios of augmented data.