|
# MULTI-AGENT REINFORCEMENT LEARNING WITH SHARED RESOURCE FOR INVENTORY MANAGEMENT |
|
|
|
**Anonymous authors** |
|
Paper under double-blind review |
|
|
|
ABSTRACT |
|
|
|
In this paper, we consider the inventory management (IM) problem for a single store with a large number of SKUs (stock keeping units), where we need to make replenishment decisions for each SKU to balance its supply and demand. The SKUs must cooperate with each other to maximize profits, while also competing for shared resources, e.g., warehouse space and budget. This co-existence of cooperative and competitive behaviors makes IM a complicated game, which can be naturally modelled as a multi-agent reinforcement learning (MARL) problem. In the IM problem, we find that agents interact with each other only indirectly, through shared resources such as warehouse space. To formally model MARL problems with this structure, we propose the shared-resource stochastic game, along with an efficient algorithm to learn policies, particularly for a large number of agents. By leveraging the shared-resource structure, our method greatly reduces model complexity and accelerates the learning procedure compared with standard MARL algorithms, as shown by extensive experiments.
|
|
|
1 INTRODUCTION |
|
|
|
The inventory management (IM) problem has long been one of the most important applications in the supply-chain industry (Nahmias & Smith, 1993). Its main purpose is to maintain a balance between the supply and demand of stock keeping units (SKUs) in a supply chain by optimizing the replenishment decisions for each SKU. Besides increasing profit and reducing operational cost, efficient IM can also lead to better service for customers. However, it is a challenging task in practice, especially when many SKUs are involved in the supply chain. In particular, while all SKUs should cooperate with each other to achieve high profits, they also need to compete for shared resources, e.g., warehouse space and budget. Such co-existence of cooperation and competition renders IM a complicated game that is hard to address.
|
|
|
Traditional methods usually reduce IM to dynamic programming problems. However, these approaches often rely on unrealistic assumptions such as i.i.d. demand, deterministic leading time, etc. Moreover, as the state space grows exponentially with key factors such as the leading time and the number of SKUs (Gijsbrechts et al., 2019), the corresponding dynamic programming problems become intractable due to the curse of dimensionality. Because of these limitations, many approaches based on approximate dynamic programming have been proposed to solve IM problems in different settings (Halman et al., 2009; Fang et al., 2013; Chen & Yang, 2019). While these approaches perform well in certain scenarios, they heavily rely on problem-specific expertise or assumptions, e.g., the zero- or one-period leading time assumption in Halman et al. (2009), and hence can hardly generalize to other settings. In contrast, reinforcement learning (RL) based methods, with fast inference speed, can generalize to various scenarios in a data-driven manner. However, it is usually too costly to train a single global policy that makes decisions for all SKUs, since training efficiency is notably curtailed by the large global state and action spaces (Jiang & Agarwal, 2018). To address this training efficiency issue, it is natural to adopt the multi-agent reinforcement learning (MARL) paradigm, where each SKU is regarded as an agent whose state and action spaces are localized and only contain information relevant to itself.
|
|
|
There are currently two popular paradigms for training MARL in the literature: independent learning (Tan, 1993) and joint action learning (Lowe et al., 2017). Despite their success in many scenarios, these two MARL paradigms also exhibit certain weaknesses that restrain their effectiveness in solving IM problems. On the one hand, under independent learning, the policy training of one agent simply treats all other agents as part of the stochastic environment, and hence is hard to converge due to the non-stationarity of the environment. On the other hand, joint action learning usually learns a centralized critic conditioned on the joint action and state spaces of all agents, which easily becomes intractable as the number of SKUs increases. Furthermore, it can be quite time-consuming to sample data from a joint simulator for a large number of SKUs, since doing so involves heavy computation on many internal variables caused by complex agent interactions.
|
|
|
To address these challenges, we take a closer look at the IM problem and find that there exists a special structure that can be leveraged to design a more effective MARL paradigm. In particular, each agent in IM only interacts with others through the shared resource, e.g., warehouse space. We introduce an auxiliary variable that represents the overall inventory level of all SKUs, and thus implies the available shared resource; we refer to this variable as the context. From the MARL perspective, one agent can be influenced by other agents only through this context. The context dynamics reflect the collective behavior of all agents, and, conditioned on the context dynamics, each agent's state transition and reward function are independent of the other agents. In this way, using the context as an additional input to each agent's policy/value functions enables us to both avoid the non-stationarity problem caused by independent learning and avoid the intractable centralized-critic learning caused by exploding state and action spaces. Moreover, the context dynamics allow us to build a local simulator for each agent, which facilitates more efficient policy learning.
|
|
|
Based on this structure with context dynamics, we propose a shared-resource stochastic game to model the IM problem. Specifically, we introduce a surrogate objective function to optimize the return of agent i conditioned on all other agents, denoted by −i. We make two approximations to obtain this surrogate objective: 1) rearranging the sampling process by first sampling the context dynamics and then sampling the local state/action/reward for each agent i; 2) using context dynamics sampled by previous policies. Based on the surrogate objective, our algorithm consists of two iterative learning procedures: 1) obtaining context dynamics from the joint simulator, and 2) updating the policy of each agent with data sampled from its respective local simulator conditioned on the collective context dynamics. By decoupling each agent from all others with a separate training procedure, our method greatly reduces model complexity and accelerates the learning procedure, as shown by extensive experiments. It is worth mentioning that our method is not limited to IM, but can also be applied to many other applications with a shared-resource structure, e.g., portfolio management (Ye et al., 2020), smart-grid scheduling (Remani et al., 2019), etc.
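As a rough illustration of these two iterative procedures, the sketch below alternates between collecting context dynamics from a joint rollout and training each agent against the recorded context in its own local simulator. The names `joint_sim`, `local_sim_factory`, and `update_policy` are hypothetical placeholders for illustration only, not the actual implementation described in this paper.

```python
def train(joint_sim, local_sim_factory, policies, update_policy, num_iters, horizon):
    """Sketch of the two-phase learning loop (all interfaces are hypothetical)."""
    for _ in range(num_iters):
        # Phase 1: roll out all agents in the joint simulator and record the
        # context dynamics, i.e., the trajectory of shared-resource occupancy.
        context_traj = joint_sim.rollout(policies, horizon)

        # Phase 2: update each agent separately in its local simulator, which
        # replays the recorded context instead of simulating the other agents.
        for i, policy in enumerate(policies):
            local_sim = local_sim_factory(agent_id=i, context=context_traj)
            trajectories = local_sim.rollout(policy, horizon)
            update_policy(policy, trajectories)  # e.g., a standard policy-gradient update
    return policies
```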
|
|
|
Our contributions are summarized as follows: |
|
|
|
- We propose the shared-resource stochastic game to capture the problem structure of IM, namely that agents only interact with each other through a shared resource.

- We propose a novel algorithm that leverages the shared-resource structure to solve the IM problem efficiently.

- Extensive experiments demonstrate that our method outperforms existing MARL algorithms in terms of both profit and computational efficiency.
|
|
|
2 BACKGROUND |
|
|
|
2.1 STOCHASTIC GAMES |
|
|
|
We build our work on stochastic games (SGs) (Shapley, 1953), since each SKU in the IM problem has its own profit (reward) to optimize. A stochastic game is defined by a tuple $\left\langle \mathcal{N}, \mathcal{S}, \{\mathcal{A}^i\}_{i\in\mathcal{N}}, \mathcal{T}, \{R^i\}_{i\in\mathcal{N}}, \gamma \right\rangle$, where $\mathcal{N} = \{1, \cdots, n\}$ denotes the set of $n > 1$ agents, $\mathcal{S}$ denotes the state space observed by all agents, and $\mathcal{A}^i$ denotes the action space of agent $i$. Let $\mathcal{A} := \mathcal{A}^1 \times \cdots \times \mathcal{A}^n$; then $\mathcal{T} : \mathcal{S} \times \mathcal{A} \rightarrow \Delta(\mathcal{S})$ denotes the transition probability from any state $s \in \mathcal{S}$ to any state $s' \in \mathcal{S}$ after taking a joint action $a \in \mathcal{A}$; $R^i : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}$ is the reward function that determines the immediate reward received by agent $i$ for a transition from $(s, a)$ to $s'$; and $\gamma \in [0, 1)$ is the discount factor. We write the joint policy of the other agents as $\pi^{-i} = \prod_{j \in -i} \pi^j$. Each agent $i$ optimizes its policy $\pi^i : \mathcal{S} \rightarrow \Delta(\mathcal{A}^i)$ to maximize its own long-term reward, which is conditioned on the other agents' behavior, defined as

$$\max_{\pi^i} \ \eta^i(\pi^i, \pi^{-i}) = \mathbb{E}_{(s_t, a^i_t, a^{-i}_t) \sim \mathcal{T}, \pi^i, \pi^{-i}} \left[ \sum_{t=0}^{\infty} \gamma^t r^i_t \right]. \quad (1)$$
|
We will illustrate the shared-resource structure of the IM problem in Section 2.2, which motivates us to propose the shared-resource stochastic game, a special case of the stochastic game that captures this structure, in Section 3.1.
|
|
|
2.2 INVENTORY MANAGEMENT WITH SHARED RESOURCE |
|
|
|
While a typical IM setting involves a multi-echelon supply network including stores, warehouses, and factories, we simplify the setting to ease our presentation. In the following, we focus on scenarios with one store and multiple SKUs. We further assume that there is an upstream warehouse that can perfectly fulfill the requirements of the store. Our objective is to learn a high-quality replenishing policy for each SKU in the store, particularly when there are a large number of SKUs. Since replenishing decisions for SKUs in stores must directly account for customer consumption behavior and for competition from other SKUs over limited resources such as space and budget, they are more challenging to optimize than those for SKUs in warehouses. It is worth mentioning that, due to the flexibility of RL algorithms, our method can also be applied to more complex settings with multiple echelons, fluctuating supply, non-deterministic leading time, etc.
|
|
|
Similar to previous work, we follow the multi-agent RL (MARL) framework with decentralized agents, each of which manages the inventory of one SKU in the store. We assume that the store sells $n$ SKUs, all of which share a common space that can store up to $I_{\max}$ units at the same time. Replenishing decisions for each SKU are made at discretized time steps, which are days in this paper.
|
|
|
For each time step $t$ and SKU $i$, let $\dot{I}^i_t \in \mathbb{Z}$ denote the units of SKU $i$ that are in stock. Hence, the following constraint shall hold for all time steps:

$$\sum_{i=1}^{n} \dot{I}^i_t \leq I_{\max}, \quad \forall t \geq 0. \quad (2)$$
|
|
|
At any time step $t$, the agent corresponding to SKU $i$ may place a replenishment order to request $O^i_t \in \mathbb{Z}$ units of products from its upstream warehouse. These replenishment orders cannot be fulfilled instantly, but take several time steps, referred to as the leading time, before the products are delivered to the store. Let $L_i$ denote the leading time of SKU $i$ and $T^i_t \in \mathbb{Z}$ its total units in transit at time step $t$. Meanwhile, demand from customers $D^i_t$ may consume the inventory of SKU $i$ and cause an actual sale of $S^i_t \in \mathbb{Z}$ units. Due to the possibility of being out of stock, $S^i_t$ may be less than $D^i_t$. Formally, the dynamics of these variables can be described as follows:

$$S^i_t = \min\left(D^i_t, \dot{I}^i_t\right) \quad (3)$$

$$T^i_{t+1} = T^i_t - O^i_{t-L_i+1} + O^i_{t+1} \quad (4)$$

$$\hat{I}^i_t = \dot{I}^i_t - S^i_t + O^i_{t-L_i+1} \quad (5)$$

$$\rho = \begin{cases} 0 & \text{if } \sum_{i=1}^{n} \hat{I}^i_t \leq I_{\max} \\ \dfrac{\sum_{i=1}^{n} \hat{I}^i_t - I_{\max}}{\sum_{i=1}^{n} O^i_{t-L_i+1}} & \text{otherwise} \end{cases} \quad (6)$$

$$\dot{I}^i_{t+1} = \dot{I}^i_t - S^i_t + \left\lfloor (1-\rho)\, O^i_{t-L_i+1} \right\rfloor \quad (7)$$
|
As mentioned before, because all SKUs share a common space, the storage may overflow when ordered products arrive. In this paper, we assume that the excess units are discarded proportionally according to $\rho$ defined in Eq. (6), which corresponds to the overflow ratio if we were to accept all arriving units without considering the space constraint. To calculate $\rho$, we also introduce an extra variable $\hat{I}^i_t$, which we call the afterstate of $\dot{I}^i_t$. Intuitively, it denotes the units of SKU $i$ in stock at the end of the $t$-th time step if we ignore the capacity constraint. We note that other ways of resolving space overflows, e.g., prioritizing the SKUs, are possible; the algorithm that we introduce in the following section also applies to these settings. Undoubtedly, such overflows incur extra operational cost, which should be avoided as much as possible by our replenishing decisions.
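To make these dynamics concrete, the following sketch implements one store transition following Eqs. (3), (5), (6), and (7); the in-transit bookkeeping of Eq. (4) is tracked implicitly through a per-SKU order pipeline. The function name `store_step` and its interface are our own illustration under these assumptions, not the paper's implementation.

```python
import math

def store_step(stock, pipelines, demand, I_max):
    """One store transition, as a sketch of Eqs. (3) and (5)-(7).

    stock[i]     : units of SKU i currently in stock (İ_t^i)
    pipelines[i] : outstanding orders of SKU i, oldest first; the oldest entry
                   is O^i_{t-L_i+1}, the order arriving at this step, and the
                   list's sum corresponds to the in-transit quantity T_t^i of Eq. (4)
    demand[i]    : customer demand D_t^i
    """
    n = len(stock)
    sales = [min(demand[i], stock[i]) for i in range(n)]                 # Eq. (3)
    arriving = [pipelines[i].pop(0) for i in range(n)]                   # O^i_{t-L_i+1}
    afterstate = [stock[i] - sales[i] + arriving[i] for i in range(n)]   # Eq. (5)

    # Eq. (6): fraction of arriving units discarded when the capacity overflows
    overflow = sum(afterstate) - I_max
    rho = 0.0 if overflow <= 0 or sum(arriving) == 0 else overflow / sum(arriving)

    # Eq. (7): accept only a (1 - rho) fraction of each arriving order (floored)
    next_stock = [stock[i] - sales[i] + math.floor((1 - rho) * arriving[i])
                  for i in range(n)]
    return next_stock, sales, rho
```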
|
|
|
The intermediate profit $P^i_t$ of the $i$-th SKU is calculated according to the following equation:

$$P^i_t = p_i S^i_t - q_i O^i_t - o\, \mathbb{I}\left[O^i_t > 0\right] - h \dot{I}^i_t \quad (8)$$

where $p_i$ and $q_i$ are the unit sale price and the unit procurement price of the $i$-th SKU, respectively, and $o$ and $h$ are the order cost and the unit holding cost, respectively. $\mathbb{I}[\cdot]$ is an indicator function that equals one when the condition is true and zero otherwise. The order cost reflects the fixed transportation cost or order-processing cost, and is incurred whenever the order quantity is non-zero.
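A direct translation of Eq. (8) is sketched below; the function name `step_profit` and its argument names are hypothetical and chosen only for illustration.

```python
def step_profit(p, q, o, h, sale, order_qty, stock):
    """Per-step profit P_t^i of one SKU following Eq. (8), as a sketch.

    p, q   : unit sale price p_i and unit procurement price q_i
    o, h   : fixed order cost and unit holding cost
    sale, order_qty, stock : S_t^i, O_t^i, and İ_t^i at time step t
    """
    order_fee = o if order_qty > 0 else 0   # o * I[O_t^i > 0]
    return p * sale - q * order_qty - order_fee - h * stock
```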
|
|
|
For convenience, we summarize all notations in Table 2 in Appendix B, where we also give an |
|
example with 2 SKUs in Fig. 4 to further illustrate the whole procedure. |
|
|
|
3 METHODOLOGY |
|
|
|
3.1 SHARED-RESOURCE STOCHASTIC GAME |
|
|
|
In this section, we show that the IM problem introduced in Section 2.2 can be formulated as a shared-resource stochastic game, where each agent is only influenced by other agents through a shared resource pool. We define a shared-resource stochastic game as a tuple $\left\langle \mathcal{N}, \{\mathcal{S}^i\}_{i\in\mathcal{N}}, \mathcal{C}, \{\mathcal{A}^i\}_{i\in\mathcal{N}}, \mathcal{T}, \{R^i\}_{i\in\mathcal{N}}, \gamma \right\rangle$, where $\mathcal{N} = \{1, \cdots, n\}$ denotes the set of $n > 1$ agents, $\mathcal{S}^i$ denotes the state space of agent $i$, $\mathcal{C}$ denotes the context space observed by all agents, and $\mathcal{A}^i$ denotes the action space of agent $i$. Here, the context represents the occupied capacity of the shared resource. Let $\mathcal{S} := \mathcal{S}^1 \times \cdots \times \mathcal{S}^n$ and $\mathcal{A} := \mathcal{A}^1 \times \cdots \times \mathcal{A}^n$; then $\mathcal{T} : \mathcal{S} \times \mathcal{C} \times \mathcal{A} \rightarrow \Delta(\mathcal{S} \times \mathcal{C})$ denotes the transition probability, which can be decomposed as follows. The context is affected by all agents, i.e., $c_{t+1} \sim P_c(\cdot \mid c_t, s_t, a_t)$. Since we are dealing with a resource, the transition function of the context usually has some additive structure with respect to all agents, which we will illustrate later in Section 3.2. Given the context $c_{t+1}$, the state transition function of each agent is independent of the other agents, i.e., $s^i_{t+1} \sim P^i_s(\cdot \mid s^i_t, a^i_t, c_{t+1})$. The reward function is also independent of the other agents given the context, $r^i_t \sim R^i(s^i_t, a^i_t, c_{t+1})$. We refer to $P^i_s$ and $R^i$ as the local transition and reward functions, and $\gamma \in [0, 1)$ is the discount factor. Each agent $i$ optimizes its own policy $\pi^i : \mathcal{S}^i \times \mathcal{C} \rightarrow \Delta(\mathcal{A}^i)$ to maximize the expected long-term reward conditioned on the other agents' policies $\pi^{-i}$, defined as