# MULTI-AGENT REINFORCEMENT LEARNING WITH SHARED RESOURCE FOR INVENTORY MANAGEMENT

**Anonymous authors**
Paper under double-blind review

ABSTRACT

We consider the inventory management (IM) problem for a single store with a large number of SKUs (stock keeping units), where we need to make replenishment decisions for each SKU to balance its supply and demand. SKUs should cooperate with each other to maximize profits, while also competing for shared resources, e.g., warehouse space, budget, etc. The co-existence of cooperative and competitive behaviors makes IM a complicated game; hence, IM can be naturally modelled as a multi-agent reinforcement learning (MARL) problem. In the IM problem, we find that agents only interact indirectly with each other through shared resources, e.g., warehouse space. To formally model MARL problems with the above structure, we propose the shared-resource stochastic game along with an efficient algorithm to learn policies, particularly for a large number of agents. By leveraging the shared-resource structure, our method can greatly reduce model complexity and accelerate the learning procedure compared with standard MARL algorithms, as shown by extensive experiments.

1 INTRODUCTION

The inventory management (IM) problem has long been one of the most important applications in the supply-chain industry (Nahmias & Smith, 1993). Its main purpose is to maintain a balance between the supply and demand of stock keeping units (SKUs) in a supply chain by optimizing the replenishment decisions of each SKU. Besides increasing profits and reducing operational costs, efficient IM can even give rise to better services to customers. However, it is quite a challenging task in practice, especially when there are many SKUs involved in the supply chain. In particular, while all SKUs should cooperate with each other to achieve high profits, they also need to compete for shared resources, e.g., warehouse space, budget, etc. Such co-existence of cooperation and competition renders IM a complicated game that is hard to address.

Traditional methods usually reduce IM to solving dynamic programming problems. However, these approaches often rely on unrealistic assumptions such as i.i.d. demand, deterministic leading time, etc. Moreover, as the state space grows exponentially with key factors like leading time and the number of SKUs (Gijsbrechts et al., 2019), the corresponding dynamic programming problems become intractable due to the curse of dimensionality. Because of these limitations, many approaches based on approximate dynamic programming have been proposed to solve IM problems in different settings (Halman et al., 2009; Fang et al., 2013; Chen & Yang, 2019). While these approaches perform well in certain scenarios, they heavily rely on problem-specific expertise or assumptions, e.g., the zero or one period leading time assumption in (Halman et al., 2009), hence can hardly generalize to other settings. In contrast, reinforcement learning (RL) based methods, with fast inference speed, can generalize to various scenarios in a data-driven manner. However, it is usually too costly to train a global policy that makes decisions for all SKUs, since training efficiency is notably curtailed by the large global state and action space (Jiang & Agarwal, 2018).
To further address the training efficiency issue, it is natural to adopt the multi-agent reinforcement learning (MARL) paradigm, where each SKU is regarded as an agent whose state and action spaces are localized and only contain information relevant to itself. There are currently two popular paradigms to train MARL agents in the literature: independent learning (Tan, 1993) and joint action learning (Lowe et al., 2017). Despite their success in many scenarios, these two MARL paradigms also exhibit certain weaknesses that restrain their effectiveness in solving IM problems. On one hand, if applying independent learning, the policy training of one agent simply treats all other agents as part of the stochastic environment, hence is hard to converge due to the non-stationarity of the environment. On the other hand, joint action learning usually learns a centralized critic conditioned on the joint action and state spaces of all agents, which can easily become intractable with an increasing number of SKUs. Furthermore, it could be quite time-consuming to sample data from the joint simulator for a large number of SKUs, since it usually involves much computation on many internal variables caused by complex agent interactions.

To address these challenges, we take a closer look at the IM problem and find that there exists a special structure that can be leveraged to design a more effective MARL paradigm. In particular, each agent in IM only interacts with others through the shared resource, e.g., warehouse space. We introduce an auxiliary variable to represent the whole inventory level, implying the available shared resource, for all SKUs, and refer to this variable as the context. From the MARL perspective, one agent can be influenced by other agents only through such context. The context dynamics actually reflect the collective behaviors of all agents, and, conditioned on the context dynamics, one agent's state transition and reward function are independent of other agents. In this way, leveraging the context as an additional input to each agent's policy/value functions enables us to both avoid the non-stationarity problem caused by independent learning and mitigate the intractable centralized critic learning caused by exploding state and action spaces. Moreover, introducing context dynamics inspires us to build a local simulator for each agent in order to facilitate more efficient policy learning.

Based on this structure with context dynamics, we propose a shared-resource stochastic game to model the IM problem. Specifically, we introduce a surrogate objective function to optimize the return of agent i conditioned on all other agents, denoted by −i. Here, we make two approximations to obtain the surrogate objective function: 1) rearranging the sampling process by first sampling the context dynamics and then sampling local states/actions/rewards for each agent i; 2) using context dynamics sampled by previous policies. Based on the above surrogate objective, our algorithm consists of two iterative learning procedures: 1) obtaining context dynamics from the joint simulator; 2) updating the policy of each agent with data sampled from its respective local simulator conditioned on the collective context dynamics. By decoupling each agent from all others with a separate training procedure, our method can greatly reduce model complexity and accelerate the learning procedure, as shown by extensive experiments.
It is worthwhile to mention that our method is not limited to IM, but can also be applied to many other applications with a shared-resource structure, e.g., portfolio management (Ye et al., 2020), smart grid scheduling (Remani et al., 2019), etc. Our contributions are summarized as follows:

- We propose the shared-resource stochastic game to capture the problem structure in IM, namely that agents only interact with each other through a shared resource.
- We propose a novel algorithm that leverages the shared-resource structure to solve the IM problem efficiently.
- Extensive experiments demonstrate that our method outperforms existing MARL algorithms in terms of both profits and computing efficiency.

2 BACKGROUND

2.1 STOCHASTIC GAMES

We build our work on stochastic games (SGs) (Shapley, 1953), since each SKU in the IM problem has its own profit (reward) to optimize. A stochastic game is defined by a tuple $\langle \mathcal{N}, \mathcal{S}, \{\mathcal{A}^i\}_{i\in\mathcal{N}}, \mathcal{T}, \{R^i\}_{i\in\mathcal{N}}, \gamma \rangle$, where $\mathcal{N} = \{1, \cdots, n\}$ denotes the set of $n > 1$ agents, $\mathcal{S}$ denotes the state space observed by all agents, and $\mathcal{A}^i$ denotes the action space of agent $i$. Let $\mathcal{A} := \mathcal{A}^1 \times \cdots \times \mathcal{A}^n$; then $\mathcal{T}: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ denotes the transition probability from any state $s \in \mathcal{S}$ to any state $s' \in \mathcal{S}$ after taking a joint action $a \in \mathcal{A}$; $R^i: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the reward function that determines the immediate reward received by agent $i$ for a transition from $(s, a)$ to $s'$; and $\gamma \in [0, 1)$ is the discount factor. We write the joint policy of the other agents as $\pi^{-i} = \prod_{j \in -i} \pi^j$. Each agent $i$ optimizes its policy $\pi^i: \mathcal{S} \to \Delta(\mathcal{A}^i)$ to maximize its own long-term reward, which is conditioned on the other agents' behavior, defined as

$$\max_{\pi^i}\ \eta^i(\pi^i, \pi^{-i}) = \mathbb{E}_{(s_t, a^i_t, a^{-i}_t) \sim \mathcal{T}, \pi^i, \pi^{-i}}\Big[\sum_{t=0}^{\infty} \gamma^t r^i_t\Big]. \qquad (1)$$

We will illustrate the shared-resource structure of the IM problem in Section 2.2, which motivates us to propose the shared-resource stochastic game as a special case of the stochastic game to capture such structure in Section 3.1.

2.2 INVENTORY MANAGEMENT WITH SHARED RESOURCE

While a typical IM setting involves a multi-echelon supply network including stores, warehouses and factories, we simplify the setting to ease our presentation. In the following, we focus on scenarios with one store and multiple SKUs. We further assume that there is an upstream warehouse that can fulfill requirements from the store perfectly. Our objective is to learn high-quality replenishing policies for each SKU in the store, particularly when there are a large number of SKUs. As replenishing decisions for SKUs in stores should directly account for the consumption behaviors of customers and the competition from other SKUs for limited resources like space and budget, they are more challenging to optimize compared to SKUs in warehouses. It is worthwhile to mention that, due to the flexibility of RL algorithms, our method can also be applied to more complex settings with multiple echelons, fluctuating supply, non-deterministic leading time, etc.

Similar to previous work, we follow the multi-agent RL (MARL) framework with decentralized agents, each of which manages the inventory of one SKU in the store. We assume that the store sells $n$ SKUs, all of which share a common space that can store up to $I_{\max}$ units at the same time. Replenishing decisions of each SKU are made at discretized time steps, which are days in our paper.
For each time step $t$ and SKU $i$, let $\dot I^i_t \in \mathbb{Z}$ denote the units of $i$ that are in stock. Hence, the following constraint shall hold for all time steps:

$$\sum_{i=1}^{n} \dot I^i_t \le I_{\max}, \quad \forall t \ge 0. \qquad (2)$$

At any time step $t$, the agent corresponding to SKU $i$ may place a replenishment order to request $O^i_t \in \mathbb{Z}$ units of products from its upstream warehouse. These replenishment orders cannot be fulfilled instantly, but will take several time steps, referred to as the leading time, before the products are delivered to the store. Let $L_i$ denote the leading time of SKU $i$ and $T^i_t \in \mathbb{Z}$ its total units in transit at time step $t$. Meanwhile, demands from customers $D^i_t$ may consume the inventory of SKU $i$ and cause an actual sale of $S^i_t \in \mathbb{Z}$ units. Due to the possibility of out-of-stock, $S^i_t$ may be less than $D^i_t$. Formally, the dynamics of these variables can be described as follows:

$$S^i_t = \min\big(D^i_t, \dot I^i_t\big) \qquad (3)$$

$$T^i_{t+1} = T^i_t - O^i_{t-L_i+1} + O^i_{t+1} \qquad (4)$$

$$\hat I^i_t = \dot I^i_t - S^i_t + O^i_{t-L_i+1} \qquad (5)$$

$$\rho = \begin{cases} 0 & \text{if } \sum_{i=1}^{n} \hat I^i_t \le I_{\max} \\[4pt] \dfrac{\sum_{i=1}^{n} \hat I^i_t - I_{\max}}{\sum_{i=1}^{n} O^i_{t-L_i+1}} & \text{otherwise} \end{cases} \qquad (6)$$

$$\dot I^i_{t+1} = \dot I^i_t - S^i_t + \big\lfloor (1-\rho)\, O^i_{t-L_i+1} \big\rfloor \qquad (7)$$

As mentioned before, since all SKUs share a common space, the storage may overflow when ordered products arrive. In this paper, we assume that the excess units are discarded proportionally according to $\rho$ defined in Eq. 6, which corresponds to the overflow ratio if we accept all arriving units without considering the space constraint. To calculate $\rho$, we also introduce an extra variable $\hat I^i_t$, which we call the afterstate of $\dot I^i_t$. Intuitively, it denotes the units of SKU $i$ in stock at the end of the $t$-th time step if we omit the capacity constraint. We note that other manners of resolving space overflows, e.g., prioritizing all SKUs, are possible; the algorithm that we introduce in the following section also applies to these settings. Undoubtedly, such overflow behaviors cause extra operational cost, which should be avoided as much as possible in our replenishing decisions. The intermediate profit $P^i_t$ of the $i$-th SKU is calculated according to the following equation:

$$P^i_t = p_i S^i_t - q_i O^i_t - o\, \mathbb{I}\big[O^i_t > 0\big] - h \dot I^i_t \qquad (8)$$

where $p_i$ and $q_i$ are the unit sale price and unit procurement price of the $i$-th SKU, respectively, and $o$ and $h$ are the order cost and unit holding cost, respectively. $\mathbb{I}[\cdot]$ is an indicator function which equals one when the condition is true and zero otherwise. The order cost reflects the fixed transportation cost or the order processing cost, and is incurred whenever the order quantity is non-zero. For convenience, we summarize all notations in Table 2 in Appendix B, where we also give an example with 2 SKUs in Fig. 4 to further illustrate the whole procedure.
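To make the transition dynamics concrete, the following is a minimal sketch of one joint simulator step implementing Eqs. 3–8 for all SKUs. It is not the general-purpose simulator used in our experiments; the explicit `pipeline` list tracking in-transit orders (which stands in for the bookkeeping of Eq. 4, up to a one-step indexing convention) and all function and variable names are illustrative assumptions.

```python
import numpy as np

def joint_step(stock, pipeline, orders, demand, I_max, p, q, o, h):
    """One joint transition over all n SKUs, following Eqs. 3-8.

    stock: (n,) on-hand units.  pipeline[i] lists the not-yet-arrived orders of
    SKU i, oldest first, so pipeline[i][0] is the quantity arriving now.
    """
    sales = np.minimum(demand, stock)                                    # Eq. 3
    arriving = np.array([pl[0] for pl in pipeline])
    afterstate = stock - sales + arriving                                # Eq. 5 (capacity ignored)
    excess = afterstate.sum() - I_max
    rho = 0.0 if excess <= 0 else excess / max(arriving.sum(), 1)        # Eq. 6
    next_stock = stock - sales + np.floor((1 - rho) * arriving).astype(int)  # Eq. 7
    # Eq. 4: drop the arrived order from each pipeline and append today's order.
    pipeline = [pl[1:] + [orders[i]] for i, pl in enumerate(pipeline)]
    in_transit = np.array([sum(pl) for pl in pipeline])
    profit = p * sales - q * orders - o * (orders > 0) - h * stock       # Eq. 8, per SKU
    return next_stock, in_transit, pipeline, profit
```

An episode alternates this joint step with the agents' order decisions; the case $\rho > 0$ is exactly the overflow situation in which arriving units are discarded proportionally.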
3 METHODOLOGY

3.1 SHARED-RESOURCE STOCHASTIC GAME

In this section, we show that the IM problem introduced in Section 2.2 can be formulated as a shared-resource stochastic game, where each agent is only influenced by other agents through a shared resource pool.

We define a shared-resource stochastic game as a tuple $\langle \mathcal{N}, \{\mathcal{S}^i\}_{i\in\mathcal{N}}, \mathcal{C}, \{\mathcal{A}^i\}_{i\in\mathcal{N}}, \mathcal{T}, \{R^i\}_{i\in\mathcal{N}}, \gamma \rangle$, where $\mathcal{N} = \{1, \cdots, n\}$ denotes the set of $n > 1$ agents, $\mathcal{S}^i$ denotes the state space of agent $i$, $\mathcal{C}$ denotes the context space observed by all agents, and $\mathcal{A}^i$ denotes the action space of agent $i$. Here, the context represents the occupied capacity of the shared resource. Let $\mathcal{S} := \mathcal{S}^1 \times \cdots \times \mathcal{S}^n$ and $\mathcal{A} := \mathcal{A}^1 \times \cdots \times \mathcal{A}^n$; then $\mathcal{T}: \mathcal{S} \times \mathcal{C} \times \mathcal{A} \to \Delta(\mathcal{S} \times \mathcal{C})$ denotes the transition probability, which can be decomposed as follows. The context is affected by all agents, i.e., $c_{t+1} \sim P_c(\cdot \mid c_t, s_t, a_t)$. Since we are dealing with resources, the transition function of the context usually has some additive structure with respect to all agents, which we will illustrate later in Section 3.2. Given the context $c_{t+1}$, the state transition of each agent is independent of the other agents, i.e., $s^i_{t+1} \sim P^i_s(\cdot \mid s^i_t, a^i_t, c_{t+1})$. The reward function is also independent of the other agents given the context, $r^i_t \sim R^i(s^i_t, a^i_t, c_{t+1})$. We refer to $P^i_s$ and $R^i$ as local transition and reward functions. $\gamma \in [0, 1)$ is the discount factor. Each agent $i$ optimizes its own policy $\pi^i: \mathcal{S}^i \times \mathcal{C} \to \Delta(\mathcal{A}^i)$ to maximize the expected long-term reward conditioned on the other agents' policies $\pi^{-i}$, defined as

$$\max_{\pi^i}\ \eta^i(\pi^i, \pi^{-i}) = \mathbb{E}_{(s_t, c_t, a^i_t, a^{-i}_t) \sim \mathcal{T}, \pi^i, \pi^{-i}}\Big[\sum_{t=0}^{\infty} \gamma^t r^i_t\Big] \qquad (9)$$

Given the above definition, an IM problem can be formulated as a shared-resource stochastic game by letting $r^i_t = P^i_t$, $a^i_t = O^i_t$, and $c_t = \sum_{i=1}^{n} \dot I^i_t$ for every SKU $i$ and time step $t$. Moreover, the state of each SKU $i$ is a concatenation of information such as $T^i_t$, $p_i$, its historical actions and demands, etc. A detailed description of the state space can be found in Appendix E.3.

3.2 SURROGATE OBJECTIVE WITH CONTEXT DYNAMICS

We now introduce how to optimize the return of agent $i$ conditioned on the other agents' behaviors. Roughly speaking, the context dynamics reflect the collective behaviors of the other agents, so it is possible to approximately estimate the objective function for each agent $i$ based only on its local transition dynamics, its local reward function, and the context dynamics. First, we give a detailed description of the transition model of the shared resource $c$ from each agent's perspective. Given such dynamics for the shared resource, we then show how to approximate the objective function in Eq. 9 by rearranging the sampling process. Finally, we replace the policies for sampling the context dynamics with old policies from the previous iteration to accelerate the decentralized training.

3.2.1 CONTEXT DYNAMICS AND LOCAL SIMULATOR

We use $\dot c^i_t$ to represent the amount of resource occupied by agent $i$ at time step $t$ and $\dot c_t$ the total amount of occupied resource. We further let $\dot c^{-i}_t$ denote the amount of resource occupied by all
Due to the additive structure, we have] s[i]t[,][ ˙]c[i]t[, and][ a][i]t[,][ P] [(ˆ]c[i]t _[|][ s]t[i][, a][i]t[ ˆ][,]c[ ˙]ct[i]t = ˆ[)][ represents how the]c[i]t_ [+ ˆ]c[−]t _[i][. Then by]_ applying a resource overflow resolution procedure as in Eq. 6, we obtain ˙c[i]t+1 [representing state of] the resource in the next step. We use notation ct to represent (ˆct 1, ˙ct). We refer (s[i], c[i]) as the state for agent i, and c[−][i] as the _−_ context. For each agent, we have the following sampling process T _[i]_ given the context dynamics of _c[−][i],_ _c[i]t+1_ _[∼]_ _[P]c[ i][(][· |][ s][i]t[, a][i]t[, c][i]t[, c]t[−][i][)]_ _s[i]t+1_ _[∼]_ _[P]s[ i][(][· |][ s][i]t[, a][i]t[, c][i]t+1[, c][−]t+1[i]_ [)] (11) _rt[i]_ _[∼]_ _[R][i][(][s]t[i][, a]t[i][, c]t[i]+1[, c][−]t+1[i]_ [)][.] Given the context dynamics for c[−][i], we can build a local simulator for agent i according to the above equations. To ease our presentation, we only consider one kind of shared resource i.e., warehouse spaces in the above formulation. However, it can be extended to support multiple resources as long as these resources have the additive structure as in Eq. 10. 3.2.2 SURROGATE LOSS FUNCTION To leverage local simulators, we rearrange the sampling process for evaluating η(π[i], π[−][i]) as follows. First, we follow the regular sampling process in Eq. 9, and get samples of c[−]t _[i][. Then, we re-sample]_ (s[i]t[, a][i]t[, c][i]t[)][ based on samples of][ c]t[−][i] following Eq. 11. It is worth noting that we make an approximation here. Given samples for c[−]t _[i][, we actually need to inference the posterior of][ (][s]t[i][, a][i]t[, c][i]t[)][.]_ However, since we consider scenarios with lots of agents, it is reasonable to assume that each agent _i has limited influence on its context c[−][i]. Therefore, we can assume that c[−]t_ _[i]_ has limited influence on the inference of (s[i]t[, a][i]t[, c][i]t[)][ and sample the latter directly according to Eq.][ 11][. By rearranging the] sampling process, we obtain the following surrogate objective, _maxπi ˜η[i](π[i], π[−][i]) = E(c−t_ _i)_ _,π[i],π[−][i]_ [E][(][s][i]t[,a][i]t[,c][i]t[)][∼T][ i][,π][i] [[] _∼T_ _γ[t]rt[i][]][.]_ (12) _t=0_ X In practice, it is quite costly to sample from the joint dynamics T, but much cheaper to sample from the local dynamics T _[i]. To leverage data samples efficiently, we propose to use the samples_ from previous policies in our objective function, which is a common trick in reinforcement learning. Specifically, we use samples of c[−]t _[i]_ collected by policies (πold[i] _[, π]old[−][i]_ [)][ of the previous iteration, and] further rewrite the surrogate objective function as follows: _maxπi ˆη[i](π[i], πold[i]_ _[, π]old[−][i]_ [) =][ E](c[−]t _[i])∼T,πold[i]_ _[,π]old[−][i]_ [E][(][s]t[i][,a][i]t[,c][i]t[)][∼T][ i][,π][i] [[] _γ[t]rt[i][]][.]_ (13) _t=0_ X As long as current policies stay close to previous ones, we can adopt above surrogate objective function to improve sample efficiency. 3.3 ALGORITHM In the following, we will present details about our proposed algorithm, which is referred as Contextaware Decentralized PPO (CD-PPO). We call it context-aware as the context c[−]t _[i]_ is a part of input to train policy and value functions of each agent. 
Such a context-aware approach can avoid the nonstationary problem occurred in independent learning methods (which does not consider dynamics of others during training), since the context reflects other agents’ collective behaviors which can ----- Figure 1: Our algorithm consists of two iterative learning procedures: 1. Get context dynamics _{c[−]t_ _[i][}]t[T]=0_ [from the joint simulator and previous policies][ π]old[i] [,][ π]old[−][i] [. 2. Train policy][ π][i][ with data] sampled from the local simulator conditioned on context dynamics {c[−]t _[i][}]t[T]=0[.]_ impact dynamics of each individual agent. In the meanwhile, our method can mitigate the intractable centralized critic learning by avoiding using exploding joint state and action space, i.e., (st, at, ct), as input of the critic. As shown in Algorithm 1, our algorithm consists of two iterative learning procedures: 1) obtaining context dynamics by running all agents in the joint environment (Algorithm 2); 2) updating policy of each agent using data sampled from its local simulator conditioned on the context dynamics (Algorithm 3). Moreover, our algorithm follows a decentralized training paradigm with an acceptable communication cost of the context dynamics. We refer readers to Appendix C for a detailed description. It is worth noting that a naive approach is to optimize policies by Eq. 9 with data sampled from the joint simulator. Nonetheless, we find that it is empirically more time-consuming to sample one step in the joint simulator than letting each of the n local simulators sample once. One major reason lies in that agent interactions take most of the simulation time. Our method, with the advantage of leveraging local simulators to simplify interactions, are henceforth much cheaper to sample. We refer readers to Appendix D for more discussions on the benefit of leveraging local simulators. **Algorithm 1 Context-aware Decentralized PPO** Given the joint simulator Envjoint and local simulators Env[i]local[}][n]i=1 _{_ Initialize policies π[i] and value functions V _[i]_ for i = 1, . . ., n **for M epochs do** // Collect context dynamics via running joint simulation **for{c[−]t k[1][}] = 1t[T]=0[, . . .,], 2 . . ., K[ {][c]t[−] do[n]}t[T]=0** _[←]_ **[GetContextDynamics][(][Env][joint][,][ {][π][i][}]i[n]=1[)][ (Algorithm][ 2][)]** **for all agents i do** // Set capacity trajectory by context dynamics Env// Train policy by running simulation in the corresponding local environment[i]local[.][set c trajectory][(][{][c]t[−][i][}]t[T]=0[)] _π[i], V_ _[i]_ _←_ **DecentralizedPPO(Env[i]local[, π][i][, V][ i][)][ (Algorithm][ 3][)]** **end for** **end for** Evaluate policies _π[i]_ _i=1_ [on joint simulator Env][joint] _{_ _}[n]_ **end for** ----- 4 EXPERIMENTS We evaluate the performance of CD-PPO in three IM problems, which contain 5, 50, and 100 SKUs, respectively. By changing space size of the warehouse, we further testify how CD-PPO performs comparing to a series of baselines under different levels of competition. On these settings, we demonstrate that our algorithm can achieve comparable results as SOTA baselines, but is more sample-efficient. 4.1 EXPERIMENT SETUPS Our experiment is conducted on a simulator which can support both retailing and replenishment for multiple SKUs in one store. Instead of sampling demands from some hypothetical distributions, we instead directly use authentic demand data from Kaggle (Makridakis et al., 2020), which contains sales history of five years (2011-2016) for various SKUs in Walmart. 
4 EXPERIMENTS

We evaluate the performance of CD-PPO in three IM problems, which contain 5, 50, and 100 SKUs, respectively. By changing the space size of the warehouse, we further test how CD-PPO performs compared to a series of baselines under different levels of competition. In these settings, we demonstrate that our algorithm achieves results comparable to SOTA baselines while being more sample-efficient.

4.1 EXPERIMENT SETUPS

Our experiments are conducted on a simulator which supports both retailing and replenishment for multiple SKUs in one store. Instead of sampling demands from hypothetical distributions, we directly use authentic demand data from Kaggle (Makridakis et al., 2020), which contains five years (2011-2016) of sales history for various SKUs in Walmart. We randomly choose 155 SKUs from the data and use the sales of the first four years as our training set, while the sales of the remaining year serve as the testing set. For all other information, e.g., price, cost, leading time, etc., that is necessary to instantiate our simulator but not included in the dataset, we randomly sample values from reasonable ranges. The evaluation metric is the total profit in dollars. For all results that we present, we run each algorithm four times with different random seeds and report the average performance with standard deviations. The source code as well as instructions to reproduce our results can be found in [https://anonymous.4open.science/r/replenishment-marl-baselines-75F4.](https://anonymous.4open.science/r/replenishment-marl-baselines-75F4)

Note that the simulator we use is developed for general purposes and contains extra details like replenishment order fulfillment scheduling, transportation management, etc.; hence, it requires substantial computation resources to simulate IM problems with many SKUs (> 100). This also hinders us from extending our experiments to cases with more than 100 SKUs on a single machine. We leave it as future work to further test our algorithm in a distributed environment where simulating large-scale IM problems is possible.

We compare our method CD-PPO with the strong model-free on-policy MARL algorithms MAPPO (Yu et al., 2021) and IPPO (de Witt et al., 2020) on the 5-SKUs environment. We do not show the performance of COMA (Foerster et al., 2018) or value decomposition methods such as QMIX (Rashid et al., 2018) due to their poor performance (i.e., negative profit). For the 50-SKUs and 100-SKUs scenarios, we can only run IPPO, since MAPPO fails quickly as it suffers from the huge joint state-action space for training a centralized critic. Besides RL approaches, we also compare with the base-stock policy, a well-known policy from the OR community that is widely adopted in practice (Kapuscinski & Tayur, 1999; Hubbs et al., 2020); more details can be found in Appendix F.

Our implementation is based on the EPyMARL (Papoudakis et al., 2021) framework, which contains MAPPO, IPPO and several commonly used MARL algorithms. As all agents are homogeneous in IM, we let the parameters of the policy network and the critic network be shared amongst all agents. The two networks are both two-layer MLPs with hidden size 64. In our experiments, the actor network maps the agent-specific state to a categorical distribution over a discrete action space {0, 1/3, 2/3, 1, 4/3, 5/3, 2, 5/2, 3, 4, 5, 6, 7, 9, 12}, such that the real replenishment quantity is obtained by multiplying the action with the agent's mean sales of the past two weeks. In addition, to encourage policies to accommodate diverse context dynamics, we propose two methods to augment the context dynamics: 1) add noise to randomly chosen items with a predefined probability; 2) replace randomly chosen items with predicted values coming from a context generator model, which is also guided by the predefined probability. Besides, we also use the data collected in Algorithm 2 for policy training. More details about the algorithm can be found in Appendix C.
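As a concrete illustration of the action space described above, the snippet below maps a discrete action index to a replenishment quantity. It is a minimal sketch under the stated setup (the fixed multiplier set and the two-week sales average); rounding to an integer order is our assumption.

```python
import numpy as np

# Discrete multipliers used by the actor's categorical head.
ACTION_MULTIPLIERS = [0, 1/3, 2/3, 1, 4/3, 5/3, 2, 5/2, 3, 4, 5, 6, 7, 9, 12]

def action_to_order_quantity(action_idx: int, recent_sales: np.ndarray) -> int:
    """Convert a discrete action into a replenishment order O_t^i.

    recent_sales: daily sales of this SKU over the past two weeks. The order is the
    chosen multiplier times the mean daily sales; rounding to whole units is an
    assumption, since orders are integer quantities.
    """
    sales_mean = float(np.mean(recent_sales)) if len(recent_sales) else 0.0
    return int(round(ACTION_MULTIPLIERS[action_idx] * sales_mean))
```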
4.2 MAIN RESULTS

We evaluate CD-PPO, MAPPO and IPPO on the 5-, 50- and 100-SKUs environments with different sizes of warehouse space. IPPO and MAPPO only use individual rewards rather than the team reward (the summation of all individual rewards) to train critics. For a fair comparison, we also train IPPO and MAPPO with information related to the shared resource. The training curves for the 5-SKUs scenario are shown in Figure 2. It is straightforward to observe that CD-PPO converges to the same performance as the other methods. In particular, CD-PPO is more sample-efficient due to its local simulations running in parallel. To evaluate our algorithm on larger scenarios, we also run CD-PPO and IPPO (with the same settings as before) on 50 and 100 SKUs and record the number of data samples needed to reach the median performance of the baselines. The results are summarized in Table 1 and Figure 3, where "N" and "C" denote the number of SKUs and the maximum capacity of the shared resource, respectively. More details about the implementation and the hyper-parameters we used can be found in Appendix C.

Table 1: Profit comparison on different scenarios of the 50- and 100-SKUs environments.

|Env Scenario|CD-PPO(Ours)|IPPO-IR(w/o context)|IPPO-IR|IPPO(w/o context)|IPPO|Base-stock(Dynamic)|
|---|---|---|---|---|---|---|
|N50-C500|310.81 ± 76.46|235.09 ± 60.61|250.03 ± 58.38|164.43 ± 143.01|366.74 ± 89.58|−408.14|
|N50-C2000|694.87 ± 174.184|689.27 ± 48.92|545.86 ± 459.71|−1373.29 ± 870.03|−1102.97 ± 1115.69|42.71|
|N100-C1000|660.28 ± 149.94|−2106.98 ± 315.38|−1126.42 ± 409.83|−1768.19 ± 1063.61|−669.83 ± 1395.92|−22.05|
|N100-C4000|1297.75 ± 124.52|−2223.11 ± 2536.00|148.00 ± 1017.47|−6501.42 ± 6234.06|−6019.28 ± 9056.49|493.32|

Figure 2: Training curves of different algorithms in the 5-SKUs environment. "IR" means the environment only provides individual rewards, and "w/o context" denotes that the algorithm does not use information related to the shared resource as part of its inputs. The x-axis† also counts samples from local simulation for CD-PPO.

† Specifically, for one interaction in the global environment with N agents, we count it as N data samples; for one interaction in the local simulator (based on context trajectories) with one agent, we count it as 1 data sample.

Figure 3: Average number of samples needed by different algorithms in the 100-SKUs environment to reach the median performance of the baselines. The lower the value, the higher the sample efficiency of the algorithm.

As demonstrated in Figure 2, Figure 3 and Table 1, CD-PPO produces results comparable to other strong baselines and even continues improving as the training proceeds. Moreover, our algorithm curtails non-stationarity effects, as centralized training does, and at the same time can scale up to scenarios with a large number of agents. In contrast, traditional MARL methods with the CTDE paradigm cannot even start running on the 50-SKUs and 100-SKUs environments, since the input of their critics is too large to fit in memory. IPPO, on the other hand, runs successfully, but lacks stability under different levels of warehouse space. The full results, together with more ablation studies on how the capacity of the shared resource affects the performance of CD-PPO and on the influence of the augmentation of context dynamics, can be found in Appendix G.
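For reference, the base-stock baseline reported in Table 1 follows the classical order-up-to rule. The sketch below is a generic version of that rule, not necessarily the exact dynamic variant detailed in Appendix F; the target-level input is an assumption.

```python
def base_stock_order(target_level: int, on_hand: int, in_transit: int) -> int:
    """Classical base-stock (order-up-to) rule: replenish the gap between a target
    inventory position and the current position (on-hand plus in-transit units)."""
    inventory_position = on_hand + in_transit
    return max(0, target_level - inventory_position)
```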
5 RELATED WORK

In this section, we introduce relevant prior work, including studies of the IM problem and commonly used training paradigms in MARL. More detailed descriptions of other related work (e.g., constrained/weakly-coupled MDPs, model-based MARL and mean-field MARL) are given in Appendix A.

5.1 INVENTORY MANAGEMENT

Since the pioneering work of (Fukuda, 1964), many approaches have been proposed to solve different variants of IM problems, using either exact (Goldberg et al., 2016; Huh et al., 2009) or approximate (Halman et al., 2009; Fang et al., 2013; Chen & Yang, 2019) dynamic programming. As reinforcement learning based approaches are our main focus, we only present related work of this branch in the following; interested readers are referred to (Gijsbrechts et al., 2019) for an overview of traditional approaches for IM. The attempt to apply reinforcement learning to inventory management has a long history, see for instance (Giannoccaro & Pontrandolfo, 2002; Jiang & Sheng, 2009; Kara & Dogan, 2018; Barat et al., 2019; Gijsbrechts et al., 2019; Oroojlooyjadid et al., 2017; 2020). However, as their main focus is to deal with challenges like volatile customer demands, bullwhip effects, etc., these works are restricted to simplified scenarios with only a single SKU. While these approaches are able to outperform traditional approaches in such scenarios, they overlook the system constraints and the coordination among SKUs imposed by shared resources. Exceptions are two recent works (Barat et al., 2019; Sultana et al., 2020), where more realistic scenarios containing multiple SKUs are considered. In (Barat et al., 2019), the main contribution is a framework that supports efficient deployment of RL algorithms in real systems; as an example, the authors also introduce a centralized algorithm for solving IM problems. In contrast, a decentralized algorithm is proposed in (Sultana et al., 2020) to solve IM problems with not only multiple SKUs but also multiple echelons. In both works, the training algorithm is the advantage actor critic (A2C) (Wu et al., 2018).

5.2 TRAINING PARADIGM IN MARL

MARL algorithms generally fall between two frameworks: centralized and decentralized learning. There are two lines of fully decentralized training: independent learning (IL) methods (Tan, 1993; de Witt et al., 2020) and decentralized training with communication (Zhang et al., 2018; Sukhbaatar et al., 2016; Peng et al., 2017). In IL, each agent learns independently to optimize its own reward and perceives the other agents as part of the environment. Other fully decentralized methods usually build a direct communication channel to share messages amongst agents, so as to avoid the non-stationarity issue in the MARL framework. Some fully centralized approaches (Claus & Boutilier, 1998) assume a cooperative game and directly extend single-agent RL algorithms by learning a single policy that produces the joint actions of all agents simultaneously. Another type of centralized method is value decomposition (VD), which typically represents the joint Q-function as a function of agents' local Q-functions (Sunehag et al., 2017; Son et al., 2019; Rashid et al., 2018) and has been considered a gold standard in MARL.
In contrast to the previous methods, centralized training decentralized execution (CTDE) allows sharing of information during training, while policies are only conditioned on the agents' local observations, enabling decentralized execution. The main category of CTDE algorithms is centralized policy gradient methods, in which each agent consists of a decentralized actor and a centralized critic that is optimized based on information shared between the agents. Representative studies of CTDE are MADDPG (Lowe et al., 2017), COMA (Foerster et al., 2018) and MAPPO (Yu et al., 2021), etc.

6 CONCLUSION

In this paper, we address the inventory management problem for a single store with a large number of SKUs. Our method is based on the shared-resource structure, where agents only interact indirectly with each other through the shared resource. By leveraging such structure, our method can outperform existing MARL algorithms in terms of both final performance and computation efficiency. It is worth mentioning that our method is not limited to IM, but is also applicable to a wide range of real-world applications with a shared-resource structure. In real-world applications, we usually need to deal with thousands of agents, which poses a challenge for existing MARL algorithms. To address such challenges, we need to develop efficient algorithms that are compatible with distributed training. In this paper, we take a first step towards developing efficient and practical MARL algorithms for real-world applications with a shared-resource structure, and we will continue to address the above challenges arising in real-world applications in our future work.

REFERENCES

Pritee Agrawal, Pradeep Varakantham, and William Yeoh. Scalable greedy algorithms for task/resource constrained multi-agent stochastic planning. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI 2016), New York, July 9, volume 15, pp. 10–16, 2016. A.2

Souvik Barat, Harshad Khadilkar, Hardik Meisheri, Vinay Kulkarni, Vinita Baniwal, Prashant Kumar, and Monika Gajrani. Actor based simulation for closed loop control of supply chain using reinforcement learning. In Proceedings of the 18th International Conference on Autonomous Agents and Multiagent Systems, pp. 1802–1804, 2019. 5.1

Shalabh Bhatnagar and K Lakshmanan. An online actor–critic algorithm with function approximation for constrained markov decision processes. Journal of Optimization Theory and Applications, 153(3):688–708, 2012. A.2

Craig Boutilier and Tyler Lu. Budget allocation using weakly coupled, constrained markov decision processes. In Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence (UAI-16), pp. 52–61, New York, NY, 2016. A.2

René Carmona, Mathieu Laurière, and Zongjun Tan. Model-free mean-field reinforcement learning: mean-field mdp and mean-field q-learning. arXiv preprint arXiv:1910.12802, 2019. A.1

Wenbo Chen and Huixiao Yang. A heuristic based on quadratic approximation for dual sourcing problem with general lead times and supply capacity uncertainty. IISE Transactions, 51(9):943–956, 2019. ISSN 24725862. doi: 10.1080/24725854.2018.1537532. 1, 5.1

Yu Fan Chen, Miao Liu, Michael Everett, and Jonathan P How. Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 285–292. IEEE, 2017. A.1

Caroline Claus and Craig Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems.
AAAI/IAAI, 1998(746-752):2, 1998. 5.2 Christian Schroeder de Witt, Tarun Gupta, Denys Makoviichuk, Viktor Makoviychuk, Philip HS Torr, Mingfei Sun, and Shimon Whiteson. Is independent learning all you need in the starcraft multi-agent challenge? arXiv preprint arXiv:2011.09533, 2020. 4.1, 5.2 Raghuram Bharadwaj Diddigi, Sai Koti Reddy Danda, Shalabh Bhatnagar, et al. Actor-critic algorithms for constrained multi-agent reinforcement learning. arXiv preprint arXiv:1905.02907, 2019. A.2 Roel Dobbe, David Fridovich-Keil, and Claire Tomlin. Fully decentralized policies for multi-agent systems: An information theoretic approach. arXiv preprint arXiv:1707.06334, 2017. A.1 Dmitri A Dolgov and Edmund H Durfee. Resource allocation among agents with mdp-induced preferences. Journal of Artificial Intelligence Research, 27:505–549, 2006. A.2 Jiarui Fang, Lei Zhao, Jan C. Fransoo, and Tom Van Woensel. Sourcing strategies in supply risk management: An approximate dynamic programming approach. _Computers & Opera-_ _tions Research, 40(5):1371–1382, 2013. ISSN 0305-0548. doi: https://doi.org/10.1016/j.cor._ 2012.08.016. [URL https://www.sciencedirect.com/science/article/pii/](https://www.sciencedirect.com/science/article/pii/S030505481200189X) [S030505481200189X. 1, 5.1](https://www.sciencedirect.com/science/article/pii/S030505481200189X) Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial _Intelligence, volume 32, 2018. 4.1, 5.2, A.1_ Michael Fowler, Pratap Tokekar, T Charles Clancy, and Ryan K Williams. Constrained-action pomdps for multi-agent intelligent knowledge distribution. In 2018 IEEE International Con_ference on Robotics and Automation (ICRA), pp. 3701–3708. IEEE, 2018. A.2_ Yoichiro Fukuda. Optimal Policies for the Inventory Problem with Negotiable Leadtime. Manage_ment Science, 10(4):690–708, 1964. ISSN 0025-1909. doi: 10.1287/mnsc.10.4.690. 5.1_ ----- Ilaria Giannoccaro and Pierpaolo Pontrandolfo. Inventory management in supply chains: A reinforcement learning approach. International Journal of Production Economics, 78(2):153–161, 2002. ISSN 09255273. doi: 10.1016/S0925-5273(00)00156-0. 5.1 Joren Gijsbrechts, Robert N. Boute, Jan Albert Van Mieghem, and Dennis Zhang. Can Deep Reinforcement Learning Improve Inventory Management? Performance and Implementation of Dual Sourcing-Mode Problems. SSRN Electronic Journal, pp. 1–33, 2019. ISSN 1556-5068. doi: 10.2139/ssrn.3302881. 1, 5.1 David A Goldberg, Dmitriy A Katz-Rogozhnikov, Yingdong Lu, Mayank Sharma, and Mark S Squillante. Asymptotic optimality of constant-order policies for lost sales inventory models with large lead times. Mathematics of Operations Research, 41(3):898–913, 2016. 5.1 Jayesh K Gupta, Maxim Egorov, and Mykel Kochenderfer. Cooperative multi-agent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent _Systems, pp. 66–83. Springer, 2017. A.1_ Nir Halman, Diego Klabjan, Mohamed Mostagir, Jim Orlin, and David Simchi-Levi. A fully polynomial-time approximation scheme for single-item stochastic inventory control with discrete demand. Mathematics of Operations Research, 34:674–685, 2009. 1, 5.1 Christian D. Hubbs, Hector D. Perez, Owais Sarwar, Nikolaos V. Sahinidis, Ignacio E. Grossmann, and John M. Wassick. Or-gym: A reinforcement learning library for operations research problems, 2020. 4.1, F Woonghee Tim Huh, Ganesh Janakiraman, John A. 
Muckstadt, and Paat Rusmevichientong. Asymptotic optimality of order-up-to policies in lost sales inventory systems. Management Sci_[ence, 55(3):404–420, 2009. ISSN 00251909, 15265501. URL http://www.jstor.org/](http://www.jstor.org/stable/40539156)_ [stable/40539156. 5.1](http://www.jstor.org/stable/40539156) Chengzhi Jiang and Zhaohan Sheng. Case-based reinforcement learning for dynamic inventory control in a multi-agent supply-chain system. Expert Systems with Applications, 36(3 PART 2): 6520–6526, 2009. ISSN 09574174. doi: 10.1016/j.eswa.2008.07.036. 5.1 Nan Jiang and Alekh Agarwal. Open problem: The dependence of sample complexity lower bounds on planning horizon. In Conference On Learning Theory, pp. 3395–3398. PMLR, 2018. 1 Roman Kapuscinski and Sridhar Tayur. Optimal Policies and Simulation-Based Optimization for _Capacitated Production Inventory Systems, pp. 7–40. Springer US, Boston, MA, 1999. ISBN 978-_ [1-4615-4949-9. doi: 10.1007/978-1-4615-4949-9 2. URL https://doi.org/10.1007/](https://doi.org/10.1007/978-1-4615-4949-9_2) [978-1-4615-4949-9_2. 4.1](https://doi.org/10.1007/978-1-4615-4949-9_2) Ahmet Kara and Ibrahim Dogan. Reinforcement learning approaches for specifying ordering policies of perishable inventory systems. Expert Systems with Applications, 91:150–158, 2018. ISSN 09574174. doi: 10.1016/j.eswa.2017.08.046. 5.1 Orr Krupnik, Igor Mordatch, and Aviv Tamar. Multi-agent reinforcement learning with multi-step generative models. In Conference on Robot Learning, pp. 776–790. PMLR, 2020. A.3 Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent reinforcement learning in sequential social dilemmas. arXiv preprint arXiv:1702.03037, 2017. A.1 Shihui Li, Yi Wu, Xinyue Cui, Honghua Dong, Fei Fang, and Stuart Russell. Robust multi-agent reinforcement learning via minimax deep deterministic policy gradient. In Proceedings of the _AAAI Conference on Artificial Intelligence, volume 33, pp. 4213–4220, 2019. A.1_ Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actorcritic for mixed cooperative-competitive environments. arXiv preprint arXiv:1706.02275, 2017. 1, 5.2, A.1 Xueguang Lyu, Yuchen Xiao, Brett Daley, and Christopher Amato. Contrasting centralized and decentralized critics in multi-agent reinforcement learning. arXiv preprint arXiv:2102.04402, 2021. A.1 ----- Anuj Mahajan, Tabish Rashid, Mikayel Samvelyan, and Shimon Whiteson. Maven: Multi-agent variational exploration. arXiv preprint arXiv:1910.07483, 2019. A.1 S. Makridakis, E. Spiliotis, and V. Assimakopoulos. The m5 accuracy competition: Results, findings and conclusions. International Journal of Forecasting, 36(1):224–227, 2020. 4.1 Andrei Marinescu, Ivana Dusparic, Adam Taylor, Vinny Cahill, and Siobh Clarke. P-marl: Prediction-based multi-agent reinforcement learning for non-stationary environments. In AAMAS, pp. 1897–1898, 2015. A.3 Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. PMLR, 2016. A.1, C.2 Steven Nahmias and Stephen A Smith. Mathematical models of retailer inventory systems: A review. _Perspectives in operations management, pp. 249–278, 1993. 1_ Frans A Oliehoek and Christopher Amato. _A concise introduction to decentralized POMDPs._ Springer, 2016. A.2 Afshin Oroojlooyjadid, M Nazari, Lawrence Snyder, and Martin Tak´aˇc. 
A deep q-network for the beer game: A reinforcement learning algorithm to solve inventory optimization problems. arXiv _preprint arXiv:1708.05924, 2017. 5.1_ Afshin Oroojlooyjadid, Lawrence V. Snyder, and Martin Tak´aˇc. Applying deep learning to the newsvendor problem. IISE Transactions, 52(4):444–463, 2020. ISSN 24725862. doi: 10.1080/ 24725854.2019.1632502. 5.1 Georgios Papoudakis, Filippos Christianos, Lukas Sch¨afer, and Stefano V Albrecht. Benchmarking multi-agent deep reinforcement learning algorithms in cooperative tasks. In Thirty-fifth Confer_ence on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021._ [URL https://openreview.net/forum?id=cIrPX-Sn5n. 4.1, E.1](https://openreview.net/forum?id=cIrPX-Sn5n) Young Joon Park, Yoon Sang Cho, and Seoung Bum Kim. Multi-agent reinforcement learning with approximate model learning for competitive games. PloS one, 14(9):e0222215, 2019. A.3 Bei Peng, Tabish Rashid, Christian A Schroeder de Witt, Pierre-Alexandre Kamienny, Philip HS Torr, Wendelin B¨ohmer, and Shimon Whiteson. Facmac: Factored multi-agent centralised policy gradients. arXiv preprint arXiv:2003.06709, 2020. A.1 Peng Peng, Ying Wen, Yaodong Yang, Quan Yuan, Zhenkun Tang, Haitao Long, and Jun Wang. Multiagent bidirectionally-coordinated nets: Emergence of human-level coordination in learning to play starcraft combat games. arXiv preprint arXiv:1703.10069, 2017. 5.2 Tabish Rashid, Mikayel Samvelyan, Christian Schroeder, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 4295–4304. PMLR, 2018. 4.1, 5.2, A.1 Tabish Rashid, Gregory Farquhar, Bei Peng, and Shimon Whiteson. Weighted qmix: Expanding monotonic value function factorisation. arXiv e-prints, pp. arXiv–2006, 2020. A.1 T. Remani, E. A. Jasmin, and T. P.Imthias Ahamed. Residential Load Scheduling with Renewable Generation in the Smart Grid: A Reinforcement Learning Approach. IEEE Systems Journal, 13 (3):3283–3294, 2019. ISSN 19379234. doi: 10.1109/JSYST.2018.2855689. 1 John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. Highdimensional continuous control using generalized advantage estimation. In Proceedings of the _International Conference on Learning Representations (ICLR), 2016. C.2_ John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. A.1 ----- Lloyd S Shapley. Stochastic games. Proceedings of the national academy of sciences, 39(10): 1095–1100, 1953. 2.1 Kyunghwan Son, Daewoo Kim, Wan Ju Kang, David Earl Hostallero, and Yung Yi. Qtran: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In Interna_tional Conference on Machine Learning, pp. 5887–5896. PMLR, 2019. 5.2, A.1_ Sriram Srinivasan, Marc Lanctot, Vinicius Zambaldi, Julien P´erolat, Karl Tuyls, R´emi Munos, and Michael Bowling. Actor-critic policy optimization in partially observable multiagent environments. arXiv preprint arXiv:1810.09026, 2018. A.1 DJ Strouse, Max Kleiman-Weiner, Josh Tenenbaum, Matt Botvinick, and David Schwab. Learning to share and hide intentions using information regularization. arXiv preprint arXiv:1808.02093, 2018. A.1 Sainbayar Sukhbaatar, Rob Fergus, et al. Learning multiagent communication with backpropagation. Advances in neural information processing systems, 29:2244–2252, 2016. 
5.2 Nazneen N Sultana, Hardik Meisheri, Vinita Baniwal, Somjit Nath, Balaraman Ravindran, and Harshad Khadilkar. Reinforcement learning for multi-product multi-node inventory management in supply chains. arXiv preprint arXiv:2006.04037, 2020. 5.1 Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z Leibo, Karl Tuyls, et al. Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296, 2017. 5.2, A.1 Ming Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings _of the tenth international conference on machine learning, pp. 330–337, 1993. 1, 5.2_ Jianhong Wang, Yuan Zhang, Tae-Kyun Kim, and Yunjie Gu. Shapley q-value: a local reward approach to solve global reward games. In Proceedings of the AAAI Conference on Artificial _Intelligence, volume 34, pp. 7285–7292, 2020a. A.1_ Tonghan Wang, Jianhao Wang, Chongyi Zheng, and Chongjie Zhang. Learning nearly decomposable value functions via communication minimization. arXiv preprint arXiv:1910.05366, 2019. A.1, A.2 Tonghan Wang, Heng Dong, Victor Lesser, and Chongjie Zhang. Roma: Multi-agent reinforcement learning with emergent roles. arXiv preprint arXiv:2003.08039, 2020b. A.1 Yuhuai Wu, Elman Mansimov, Shun Liao, Alec Radford, and John Schulman. Openai baselines: [Acktr & a2c. https://openai.com/blog/baselines-acktr-a2c/, 2018. 5.1](https://openai.com/blog/baselines-acktr-a2c/) Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. Mean field multiagent reinforcement learning. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th _International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning_ _Research, pp. 5567–5576, Stockholmsm¨assan, Stockholm Sweden, 10–15 Jul 2018. PMLR. A.1_ Yunan Ye, Hengzhi Pei, Boxin Wang, Pin Yu Chen, Yada Zhu, Jun Xiao, and Bo Li. Reinforcementlearning based portfolio management with augmented asset movement prediction states. AAAI _2020 - 34th AAAI Conference on Artificial Intelligence, pp. 1112–1119, 2020. ISSN 2159-5399._ doi: 10.1609/aaai.v34i01.5462. 1 Chao Yu, Akash Velu, Eugene Vinitsky, Yu Wang, Alexandre Bayen, and Yi Wu. The surprising effectiveness of mappo in cooperative, multi-agent games. _arXiv preprint arXiv:2103.01955,_ 2021. 4.1, 5.2, A.1, C.1 Kaiqing Zhang, Zhuoran Yang, Han Liu, Tong Zhang, and Tamer Basar. Fully decentralized multiagent reinforcement learning with networked agents. In International Conference on Machine _Learning, pp. 5872–5881. PMLR, 2018. 5.2_ Ruohan Zhang, Yue Yu, Mahmoud El Chamie, Behc¸et Ac¸ikmese, and Dana H Ballard. Decisionmaking policies for heterogeneous autonomous multi-agent systems with safety constraints. In _IJCAI, pp. 546–553, 2016. A.2_ ----- Weinan Zhang, Xihuai Wang, Jian Shen, and Ming Zhou. Model-based multi-agent policy optimization with adaptive opponent-wise rollouts. arXiv preprint arXiv:2105.03363, 2021. A.3 ----- A RELATED WORK A.1 TRAINING PARADIGM IN MARL In this section we will provide a more detailed description of related work for training paradigm used in MARL. From the perspective of training schemes, it can be devided into three categories: decentralized training decentralized execution (DTDE), centralized training centralized execution (CTCE) and centralized training decentralized execution (CTDE). Recent deep MARL works often use the CTDE or CTCE training pradigm. 
The CTCE paradigm allows the straightforward application of single-agent approaches such as actor-critic (Mnih et al., 2016) or policy gradient algorithms (Schulman et al., 2017) to multi-agent problems. The representative work on CTCE is (Gupta et al., 2017), which represents the centralized executor as a set of independent sub-policies so that the agents' individual action distributions are captured rather than the joint action distribution over all agents.

Value-based CTDE approaches, also known as value decomposition methods (Peng et al., 2020; Mahajan et al., 2019; Rashid et al., 2020; 2018; Son et al., 2019; Sunehag et al., 2017; Wang et al., 2020b; 2019), mainly focus on how centrally learned value functions can be reasonably decoupled into decentralized ones, and have shown promising results. Policy-gradient-based CTDE methods, on the other hand, rely heavily on centralized critics. One of the first works utilizing a centralized critic was COMA (Foerster et al., 2018), a framework adopting a centralized critic with a counterfactual baseline. Regarding convergence properties, COMA establishes that the overall effect of a decentralized policy gradient with a centralized critic reduces to a single-agent actor-critic approach, which ensures convergence under assumptions similar to those of A2C. Concurrently with COMA, MADDPG (Lowe et al., 2017) proposed to use a dedicated centralized critic for each agent in semi-competitive domains, demonstrating compelling empirical results in continuous-action environments. Recently, MAPPO (Yu et al., 2021), an on-policy policy-gradient multi-agent reinforcement learning algorithm, achieved results comparable to the state of the art on a variety of cooperative multi-agent challenges. Despite its on-policy nature, MAPPO is competitive with ubiquitous off-policy methods such as MADDPG, QMIX, and RODE in terms of final performance, and in the vast majority of cases it is comparable to off-policy methods in terms of sample efficiency. In addition, much follow-up work inspired by MADDPG, COMA, or MAPPO also adopts centralized-critic baselines, e.g., M3DDPG (Li et al., 2019) and SQDDPG (Wang et al., 2020a). Mean-field Q-learning (Yang et al., 2018; Carmona et al., 2019) takes a different approach from the CTDE-based methods: it employs a mean-field approximation over the joint action space in order to address the scalability issue of the prior methods.

Contrary to CTDE, in the DTDE paradigm each agent has an associated policy that maps local observations to a distribution over individual actions. No information is shared between agents, so each agent learns independently. DTDE has been applied to cooperative navigation tasks (Chen et al., 2017; Strouse et al., 2018), to partially observable domains (Dobbe et al., 2017; Srinivasan et al., 2018), and to social dilemmas (Leibo et al., 2017). For more comparisons of centralized and decentralized critics, please see (Lyu et al., 2021).

In this paper, we design a decentralized training paradigm that avoids the flaws of the traditional training paradigms proposed in the literature. The fundamental drawback of the DTDE paradigm is that the environment appears non-stationary from a single agent's viewpoint, because agents neither have access to the knowledge of others nor perceive the joint action.
Some studies report that DTDE scales poorly with the number of agents due to the extra sample complexity added to the learning problem (Gupta et al., 2017). An obvious flaw of CTDE/CTCE is that the state-action space grows exponentially with the number of agents. Even though there have been attempts to factor the joint model into individual policies for each agent to address this so-called curse of dimensionality, CTDE/CTCE methods still have to take at least the joint state over all agents as input to approximate a global value function that guides the centralized critics or decentralized policies. Consequently, these traditional training schemes still do not scale well to large numbers of agents when the system state is the combination of every agent's local state. In contrast, our approach trains agents independently with a learned dynamics model of the utilization trend of the shared resource, which gives each agent enough information to learn a good policy (we explain this in the following sections). At the same time, we also improve the efficiency of data sampling, since we do not always use the original joint simulator containing all agents for data collection. Instead, we mainly run light-weight local simulators that embed the learned dynamics model, which significantly reduces the cost of the data-collection process, especially when running the joint simulator (as in inventory management) is expensive.

A.2 MDP SCENARIOS IN MARL

To our knowledge, only a few related variants of the MDP setting have been studied under the MARL framework. Dec-POMDP (Oliehoek & Amato, 2016) is the most common setting in MARL research, especially for fully cooperative tasks in which agents share a global team reward rather than individual rewards. In the Constrained MDP setting (C-MDP; Bhatnagar & Lakshmanan, 2012; Wang et al., 2019; Diddigi et al., 2019), the system is subject to constraints such as penalties for illegal actions in autonomous driving (Zhang et al., 2016; Fowler et al., 2018), resource-allocation constraints in scheduling tasks (Agrawal et al., 2016; Dolgov & Durfee, 2006), or limited communication bandwidth among agents (Fowler et al., 2018). Similar to C-MDP, the Weakly-Coupled Constrained MDP (WC-C-MDP; Boutilier & Lu, 2016) considers budget (or other resource) allocation in sequential decision problems involving a large number of concurrently running sub-processes whose only interaction is their consumption of the budget. Different from these scenarios, we focus on the situation where agents can only obtain individual rewards from the environment, so the team reward is not observable. Moreover, compared with C-MDP and WC-C-MDP, the penalty for overstocking in our setting depends on the future stock levels of all agents, whereas in C-MDP and WC-C-MDP the corresponding cost function is based only on historical states of the system. As a result, we cannot simply optimize the combined objective of the long-term sum of return and penalty.

A.3 MODEL-BASED MARL

For model-based MARL, there is relatively little literature. P-MARL (Marinescu et al., 2015) proposed an approach that integrates prediction and pattern-change detection into MARL and thus minimises the effect of non-stationarity in the environment. The environment is modelled as a time series, with future estimates provided by prediction techniques.
Learning is based on the predicted environment behaviour, with agents employing this knowledge to improve their performance in real time. Park et al. (2019) proposed a centralized auxiliary prediction network that models the environment dynamics in order to alleviate the non-stationarity problem. Krupnik et al. (2020) built a centralized multi-step generative model with a disentangled variational auto-encoder to predict the environment dynamics and the opponent actions, and then performed trajectory planning. AORPO (Zhang et al., 2021) is a Dyna-style method in which each agent builds a multi-agent environment model consisting of a dynamics model and multiple opponent models, and trains its policy with data generated both from adaptive opponent-wise rollouts and from interaction with the real environment. To our knowledge, unlike previous work, our approach is the first to accelerate learning in MARL by modelling only part of the dynamics of the whole system. Our algorithm is also a Dyna-style method like AORPO, but modelling only the shared-resource dynamics significantly reduces the difficulty of model learning and makes the learning process more efficient.

A.4 MEAN-FIELD MARL

B DETAILS FOR INVENTORY MANAGEMENT PROBLEM

We summarize all notation for Section 2.2 in Table 2. We also give an example with 2 SKUs in Fig. 4 to further illustrate the whole procedure of the inventory dynamics.

Figure 4: A diagram illustrating the inventory dynamics over two time steps.

Table 2: Notations.

|Notation|Explanation|
|---|---|
|İ_t^i|Units in stock of SKU i at the t-th time step|
|Î_t^i|Units in stock of SKU i at the end of time step t if nothing is discarded|
|O_t^i|Order quantity of the i-th SKU at the t-th time step|
|D_t^i|Demand of the i-th SKU at the t-th time step|
|S_t^i|Sale quantity of the i-th SKU at the t-th time step|
|T_t^i|Units in transit of the i-th SKU at the t-th time step|
|P_t^i|Profit generated on the i-th SKU at the t-th time step|
|p^i|Unit sales price of the i-th SKU|
|q^i|Unit procurement cost of the i-th SKU|
|o|Unit order cost|
|h|Unit holding cost per time step|

C ALGORITHM DETAILS

Here we provide the pseudocode of our algorithm with context augmentation in Algorithm 4; the details of its sub-algorithms are introduced in the following subsections. Note that all pseudocode assumes RNN-based networks.

C.1 GET CONTEXT DYNAMICS

Our algorithm follows the algorithmic structure of Multi-Agent PPO (Yu et al., 2021), in which we maintain two separate networks for π_θ and V_φ(s). The first stage of our algorithm only runs the policies in the original joint environment to collect episodes of context dynamics, i.e., the trajectories of c_t^i and c_t^{-i}. This process is similar to that of other traditional MARL methods. Note that we can also save the transitions of each agent for training the policy and value-function networks. In the pseudocode for joint sampling (Algorithm 2), we only record the part concerning the context dynamics, which is used to train the context model for the next stage. After collecting the context dynamics, we train an LSTM model f_c as a surrogate predictor that provides extra augmentation of the collected dynamics in the next stage. In detail, we split the collected context trajectories into sub-sequences of length 8, and the training objective is to minimize the mean-squared error of the predicted capacity on the next day given the dynamics of the previous L days:

$$\min_{\omega} \left( f_c(c_{t-L}, \ldots, c_t; \omega) - c_{t+1} \right)^2 \qquad (14)$$
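As an illustration of Eq. 14, the following is a minimal PyTorch sketch of such a capacity-context predictor; the module and function names (`ContextLSTM`, `train_context_model`) and the training-loop details are illustrative assumptions rather than the implementation in our codebase.

```python
import torch
import torch.nn as nn


class ContextLSTM(nn.Module):
    """Predict the next-step shared-resource context from the previous L steps (Eq. 14)."""

    def __init__(self, context_dim: int = 1, hidden_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(context_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, context_dim)

    def forward(self, window: torch.Tensor) -> torch.Tensor:
        # window: (batch, L, context_dim); return the prediction for the next step
        out, _ = self.lstm(window)
        return self.head(out[:, -1])


def train_context_model(model, trajectories, L=8, epochs=10, lr=1e-3):
    """Minimise (f_c(c_{t-L..t}; w) - c_{t+1})^2 over sliding windows of the recorded context."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for traj in trajectories:                 # traj: tensor of shape (T, context_dim)
            for t in range(traj.shape[0] - L):
                window = traj[t:t + L].unsqueeze(0)   # the previous L context values
                target = traj[t + L].unsqueeze(0)     # the next-step context value
                loss = loss_fn(model(window), target)
                opt.zero_grad()
                loss.backward()
                opt.step()
    return model
```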
C.2 DECENTRALIZED PPO

With the collected context dynamics of the shared resource, the second stage is straightforward: sample from the local simulator of each agent and then train the policy and critic with the resulting data. The main difference between our training paradigm and traditional MARL methods under the CTDE structure is that we directly sample local observations in separate simulators in which only one agent exists, rather than in the joint simulator in which all agents interact with each other. In other words, in the new local simulator there is only one SKU in the entire store, and the trend of available capacity is simulated entirely according to the given context dynamics. In practice, we initialize new instances of the original inventory environment in parallel, each configured to contain only a specific SKU i and a fixed context trajectory. For the fixed context trajectory, we use the subtraction result {c_t^{-i}}_{t=0}^{T} with two augmentations: 1) add noise to some entries with a predefined probability; 2) replace some entries with values predicted by the trained context model, also with a predefined probability. We then run the policy in the local simulators to generate episodes under the embedded context dynamics and put them into a shared replay buffer, since all transitions are homogeneous and the parameters are shared across all policies. Decentralized training then proceeds on this shared replay buffer of transitions collected from the local simulators.

We consider a variant of the advantage function based on decentralized learning, where each agent learns an agent-specific, state-based critic V_φ(s_t^i), parameterised by φ, and uses Generalized Advantage Estimation (GAE; Schulman et al., 2016) with discount factor γ = 0.99 and λ = 0.95. We also add an entropy regularization term to the final policy loss (Mnih et al., 2016). For each agent i, the advantage estimate is

$$A^i_t = \sum_{l=0}^{h} (\gamma\lambda)^l \, \delta^i_{t+l} \qquad (15)$$

where δ^i_t = r^i_t(s^i_t, a^i_t) + γ V_φ(s^i_{t+1}) − V_φ(s^i_t) is the TD error at time step t, h is the number of estimation steps, and r^i_t(s^i_t, a^i_t) is the individual reward provided by the local simulator. The final policy loss for each agent i becomes

$$L^i(\theta) = \mathbb{E}_{s^i_t, a^i_t \sim \mathcal{T}_{\mathrm{local}}(c^{-i}_t)} \left[ \min\!\left( \frac{\pi_\theta(a^i_t \mid s^i_t)}{\pi_{\theta_{\mathrm{old}}}(a^i_t \mid s^i_t)} A^i_t,\; \mathrm{clip}\!\left( \frac{\pi_\theta(a^i_t \mid s^i_t)}{\pi_{\theta_{\mathrm{old}}}(a^i_t \mid s^i_t)}, 1-\epsilon, 1+\epsilon \right) A^i_t \right) \right] \qquad (16)$$

For training the value function, in addition to clipping the policy updates, our method also uses value clipping to restrict the update of the critic of each agent i to be smaller than ϵ, in the same fashion as PPO-style implementations:

$$L^i(\phi) = \mathbb{E}_{s^i_t \sim \mathcal{T}_{\mathrm{local}}(c^{-i}_t)} \left[ \max\!\left( \left( V_\phi(s^i_t) - \hat{V}^i_t \right)^2,\; \left( V_{\phi_{\mathrm{old}}}(s^i_t) + \mathrm{clip}\!\left( V_\phi(s^i_t) - V_{\phi_{\mathrm{old}}}(s^i_t), -\epsilon, +\epsilon \right) - \hat{V}^i_t \right)^2 \right) \right] \qquad (17)$$

where φ_old are the parameters before the update and V̂^i_t = A^i_t + V_φ(s^i_t). This restricts the update of the value function to within the trust region and therefore helps us to avoid overfitting to the most recent batch of data.
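To make Eqs. 15–17 concrete, here is a minimal PyTorch sketch of the advantage estimation and the clipped losses; the function names and tensor shapes are assumptions made for illustration, not the exact code used in our experiments.

```python
import torch


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (Eq. 15) for one agent's trajectory.

    `values` must contain T + 1 entries (including the bootstrap value of the final state).
    """
    T = rewards.shape[0]
    adv = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error delta_t^i
        gae = delta + gamma * lam * gae
        adv[t] = gae
    return adv


def clipped_policy_loss(logp_new, logp_old, adv, eps=0.2):
    """PPO clipped surrogate objective (Eq. 16), returned as a loss to minimise."""
    ratio = torch.exp(logp_new - logp_old)
    return -torch.min(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv).mean()


def clipped_value_loss(v_new, v_old, returns, eps=0.2):
    """Value clipping (Eq. 17): keep the critic update within a trust region around v_old."""
    v_clipped = v_old + torch.clamp(v_new - v_old, -eps, eps)
    return torch.max((v_new - returns) ** 2, (v_clipped - returns) ** 2).mean()
```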
For each agent, the overall learning loss becomes

$$L(\theta, \phi) = \sum_{i=1}^{N} \left( L^i(\theta) + \lambda_{\mathrm{critic}} L^i(\phi) + \lambda_{\mathrm{entropy}} H(\pi^i) \right) \qquad (18)$$

All networks are trained in a decentralized way, since their inputs are local variables that stem from the light-weight local simulators. As mentioned before, at this learning stage there are no interactions between any two agents. Although this resembles independent learning, we point out that we use the global context simulated from the joint environment, which is essentially different from independent learning methods: they do not consider this kind of global information, which in our case is generated by the joint simulator and then kept fixed in the local simulators. Our decentralized training has several advantages. First, the local simulator runs efficiently because of its simple single-agent transition function. Second, this paradigm avoids the non-stationarity issue of traditional MARL methods, since there is no interaction among agents and thus no need to account for the influence of other agents. Third, we can use more diverse context trajectories to expose agents to various levels of available storage, which improves the generalization of the trained networks. Fourth, this training paradigm is easy to extend to large-scale distributed training by running parallel simulations, whose communication cost is acceptable for modern distributed training frameworks.

If the critic and actor networks are RNNs, the loss functions additionally sum over time, and the networks are trained via Backpropagation Through Time (BPTT). Pseudocode for local sampling with the recurrent version of the policy networks is shown in Algorithm 3.

**Algorithm 2 GetContextDynamics**
INPUT: policies {π^i_{θ_i}}_{i=1}^{n} and the joint simulator Env_joint
(Optional) Initialize ω, the parameters of the context model f_c
Set data buffer D = {}
for bs = 1 to batch_size do
    τ = [] (empty list)
    Initialize h^1_{0,π}, ..., h^n_{0,π} (actor RNN states)
    for t = 1 to T do
        for all agents i do
            p^i_t, h^i_{t,π} = π^i(s^i_t, c_t, h^i_{t−1,π}; θ_i)
            a^i_t ∼ p^i_t
        end for
        Execute actions a_t, observe r_t, s_{t+1}
        τ += [s_t, c_t, h_{t,π}, a_t, r_t, s_{t+1}, c_{t+1}]
    end for
    // Split the recorded context trajectory into chunks of length L
    for l = 0, 1, ..., T//L do
        D = D ∪ (c[lL : (l+1)L])
    end for
end for
// (Optional) Train the context model for augmentation
if the context model needs to be trained then
    for mini-batch k = 1, ..., K_1 do
        b ← random mini-batch from D with all agent data
        Adam update of ω on b using the capacity-context objective in Eq. 14
    end for
end if
OUTPUT: context dynamics {c^{-i}_t}_{t=1}^{T} for i = 1, ..., n; (Optional) f_ω
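As an informal companion to Algorithm 2, the sketch below shows the joint-sampling loop and the chunking of the recorded context; the environment and policy interfaces (`env.reset`, `env.step`, `pi.act`) are hypothetical stand-ins for the joint simulator, not the interface of the released codebase.

```python
import numpy as np


def get_context_dynamics(env, policies, batch_size, T, chunk_len=8):
    """Roll out the joint simulator and record the shared-context trajectory (Algorithm 2).

    Returns overlapping context chunks of length chunk_len + 1, which can be used directly
    to fit the context model f_c from Eq. 14 (inputs plus next-step target).
    """
    chunks = []
    for _ in range(batch_size):
        obs, context = env.reset()                  # per-agent observations, shared context
        contexts = [context]
        for _ in range(T):
            actions = [pi.act(o, context) for pi, o in zip(policies, obs)]
            obs, context, rewards, done = env.step(actions)
            contexts.append(context)
            if done:
                break
        traj = np.asarray(contexts, dtype=float)
        # Split the recorded context trajectory into fixed-length chunks
        for start in range(len(traj) - chunk_len):
            chunks.append(traj[start:start + chunk_len + 1])
    return np.stack(chunks)
```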
D LOCAL SAMPLING VS. JOINT SAMPLING

Our algorithm follows a decentralized training paradigm, which is compatible with distributed training frameworks: we can run local simulations in parallel, and the communication cost of the context dynamics is acceptable. With distributed training, our method can be much more efficient than learning from the joint simulator directly. One may argue that we could also learn from the joint simulator efficiently by implementing a distributed/parallelized joint simulator. While this can indeed improve sampling efficiency in many scenarios, its usefulness is limited, particularly for systems with a large number of agents. In a typical multi-agent system, interactions occur frequently among all agents; indeed, one advantage of multi-agent systems is precisely their ability to model complex interactions among agents. However, such interactions markedly reduce the efficiency gains brought by parallelism, because they require the involved agents to synchronize, which usually consumes a lot of time on waiting and communication. For instance, in IM problems all agents must be synchronized at each time step in order to compute ρ in Eq. 6 before moving to the next time step. In contrast, our local sampling approach removes such costly interactions by exploiting the special structure of shared-resource stochastic games: there is no need for agents to synchronize before moving to the next time step in the local simulators. As a result, our method can be much more efficient in practice than learning only from the joint simulator.

**Algorithm 3 DecentralizedPPO**
// Generate data for agent i with the corresponding context dynamics
INPUT: local simulator Env^i_local, policy π^i_{θ_i}, and value function V^i_{φ_i}
Set data buffer D = {}
Initialize h^i_{0,π} (actor RNN state) and h^i_{0,V} (critic RNN state)
τ = [] (empty list)
for t = 1 to T do
    p^i_t, h^i_{t,π} = π^i(s^i_t, c^i_t, h^i_{t−1,π}; θ_i)
    a^i_t ∼ p^i_t
    v^i_t, h^i_{t,V} = V^i(s^i_t, c^i_t, h^i_{t−1,V}; φ_i)
    Execute action a^i_t in Env^i_local, then observe r^i_t, c^i_{t+1}, s^i_{t+1}
    τ += [s^i_t, c^i_t, a^i_t, h^i_{t,π}, h^i_{t,V}, s^i_{t+1}, c^i_{t+1}]
end for
Compute advantage estimates Â via GAE on τ (Eq. 15)
Compute rewards-to-go R̂ on τ
// Split trajectory τ into chunks of length L and store them in D
for l = 0, 1, ..., T//L do
    D = D ∪ (τ[lL : (l+1)L], Â[lL : (l+1)L], R̂[lL : (l+1)L])
end for
for mini-batch k = 1, ..., K_2 do
    b ← random mini-batch from D
    for each data chunk in the mini-batch b do
        Update the RNN hidden states for π^i and V^i from the first hidden state in the data chunk
    end for
    Compute the overall loss according to Eq. 16 to Eq. 18
    Adam update θ_i on L^i(θ_i) and H with data b
    Adam update φ_i on L^i(φ_i) with data b
end for
OUTPUT: policy π^i_{θ_i} and value function V^i_{φ_i}

**Algorithm 4 Context-aware Decentralized PPO with Context Augmentation**
Given the joint simulator Env_joint and local simulators {Env^i_local}_{i=1}^{n}
Initialize policies π^i and value functions V^i for i = 1, ..., n
Initialize the context model f_ω and the augmentation probability p_aug
for M epochs do
    // Collect context dynamics by running the joint simulation
    {c^{-1}_t}_{t=0}^{T}, ..., {c^{-n}_t}_{t=0}^{T}, f_ω ← GetContextDynamics(Env_joint, {π^i}_{i=1}^{n}, f_ω) (Algorithm 2)
    for k = 1, 2, ..., K do
        for all agents i do
            // Set the capacity trajectory using the augmented context dynamics
            Env^i_local.set_c_trajectory(aug({c^{-i}_t}_{t=0}^{T}, f_ω, p_aug))
            // Train the policy by running simulation in the corresponding local environment
            π^i, V^i ← DecentralizedPPO(Env^i_local, π^i, V^i) (Algorithm 3)
        end for
    end for
    Evaluate the policies {π^i}_{i=1}^{n} on the joint simulator Env_joint
end for
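The aug(·) step in Algorithm 4 can be sketched as follows, assuming a trained context model with a one-step `predict` interface; the perturbation scale and the even split between noise injection and model replacement are illustrative choices rather than the exact scheme used in our experiments.

```python
import numpy as np


def augment_context(context_traj, context_model, p_aug, noise_scale=0.05, window=8):
    """Perturb a recorded context trajectory {c_t^{-i}} before fixing it in a local simulator.

    Each entry is, with probability p_aug, either jittered with Gaussian noise or replaced by
    the context model's one-step prediction from the preceding `window` steps.
    """
    traj = np.array(context_traj, dtype=float)
    for t in range(window, len(traj)):
        if np.random.rand() >= p_aug:
            continue                                   # keep the original value
        if context_model is None or np.random.rand() < 0.5:
            traj[t] += np.random.normal(scale=noise_scale * (abs(traj[t]) + 1e-8))
        else:
            traj[t] = context_model.predict(traj[t - window:t])   # hypothetical interface
    return traj
```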
E TRAINING DETAILS

E.1 THE CODEBASE

As part of this work we extended the well-known EPyMARL codebase (Papoudakis et al., 2021), which already includes several commonly used algorithms, supports many environments, and allows flexible tuning of implementation details, to integrate our simulator and algorithm. This makes it convenient to compare our algorithm with the other baselines. All code for our new codebase is publicly available on Anonymous GitHub under the following link: [https://anonymous.4open.science/r/replenishment-marl-baselines-75F4](https://anonymous.4open.science/r/replenishment-marl-baselines-75F4)

E.2 HYPERPARAMETER DETAILS

Table 3 presents the hyperparameters used in our algorithm for the 5-SKU environments.

Table 3: Hyperparameters used in CD-PPO.

|Hyperparameter|Value|
|---|---|
|runner|ParallelRunner|
|batch size run|10|
|decoupled training|True|
|use individual envs|True|
|max individual envs|5|
|decoupled iterations|1|
|train with joint data|True|
|context perturbation prob|1.0|
|hidden dimension|64|
|learning rate|0.00025|
|reward standardisation|True|
|network type|FC|
|entropy coefficient|0.01|
|target update|200 (hard)|
|n-step|5|

E.3 DETAILS OF STATES AND REWARDS

Table 4 shows the features of the state for the MARL agent corresponding to the i-th SKU at the t-th time step. It is worth noting that we use the profit P_t^i generated on the i-th SKU at the t-th time step, divided by 10^6, as the individual reward of the i-th agent at the t-th time step. For team-reward methods, we simply sum up all the individual rewards, which corresponds to the daily profit of the whole store at the t-th time step, divided by 10^6.

Table 4: Features of the state.

| |Features|
|---|---|
|State: storage information|Storage capacity C|
|State: inventory information|Quantity of products in stock I_t^i; quantity of products in transit T_t^i|
|State: history information|Sale history in the last 21 days S_{t−21}^i, ..., S_t^i; replenishment history in the last 21 days O_{t−21}^i, ..., O_{t−1}^i|
|State: product information|Unit sales price p^i; unit procurement cost q^i|
|Context: global storage utilization|Current total storage level of the store, Σ_j I_t^j|
|Context: global unloading level|Current total unloading level of the store, Σ_j O_{t−L_j}^j|
|Context: global excess level|Current total excess level of the store, ρ × Σ_j O_{t−L_j}^j|
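As a rough illustration of how a per-SKU observation from Table 4 could be assembled, consider the sketch below; the `sku`/`store` objects and their attributes are hypothetical helpers introduced only for this example and are not part of the released codebase.

```python
import numpy as np


def build_observation(sku, store, t, hist_len=21):
    """Concatenate the Table 4 features for SKU i at time step t into a flat vector."""
    local_features = [
        store.capacity,                      # storage capacity C
        sku.in_stock[t],                     # units in stock I_t^i
        sku.in_transit[t],                   # units in transit T_t^i
        *sku.sales[t - hist_len:t],          # sale history over the last 21 steps
        *sku.orders[t - hist_len:t],         # replenishment history over the last 21 steps
        sku.price,                           # unit sales price p^i
        sku.cost,                            # unit procurement cost q^i
    ]
    context_features = [
        store.total_stock(t),                # current total storage level of the store
        store.total_unloading(t),            # current total unloading level of the store
        store.total_excess(t),               # current total excess level of the store
    ]
    return np.asarray(local_features + context_features, dtype=np.float32)
```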
F BASE-STOCK POLICY

**Algorithm 5 Base-stock Policy**
Input: {D_t^i}_{t=t_1}^{t_2}, v, τ, IM parameters
Output: Z_i, {O_t^i}_{t=t_3}^{t_4}
// Description: base-stock policy for a single SKU i
// Z_i: base-stock level for SKU i
// {O_t^i}_{t=t_3}^{t_4}: base-stock replenishment decisions for SKU i, from t_3 to t_4
// {D_t^i}_{t=t_1}^{t_2}: demand series of SKU i used to infer Z_i, from t_1 to t_2
// v: v ≥ 1, v ∈ R, hyper-parameter controlling the storage utilization level
// τ: τ ∈ N+, hyper-parameter controlling the replenishing interval
// IM parameters: including the leading time L_i, the storage capacity C, etc.
// Solve the following mixed-integer program with the dual simplex method:
Z_i ← argmax_{Z_i} Σ_{t=t_1}^{t_2} P_t^i
    s.t.  S_t^i = min(D_t^i, I_t^i)
          P_t^i = p^i S_t^i − q^i O_t^i − h I_t^i
          I_{t+1}^i = I_t^i − S_t^i + O_{t−L_i+1}^i
          T_{t+1}^i = T_t^i − O_{t−L_i+1}^i + O_t^i
          O_t^i = max(0, Z_i − I_t^i − T_t^i),   t_1 ≤ t ≤ t_2
// Replenishing policy deduction:
O_t^i = max(0, Z_i − I_t^i − T_t^i),   t_3 ≤ t ≤ t_4
O_t^i = min(O_t^i, vC − Σ_{j=1}^{n} (I_t^j + T_t^j)),   t_3 ≤ t ≤ t_4
O_t^i = O_t^i · 1[t mod τ = 0],   t_3 ≤ t ≤ t_4
return Z_i, {O_t^i}_{t=t_3}^{t_4}

In addition to the related MARL baselines, we also include the well-known "Base-stock" algorithm from the OR community as a non-RL baseline. The pseudocode of the base-stock policy is given in Algorithm 5, where Z_i, called the base-stock level of agent i, is computed by solving a mixed-integer programming problem. Z_i is then used to guide the replenishment of agent i periodically. We note that the base-stock policy cannot handle complex IM variants such as coordinated replenishment (multiple SKUs sharing storage capacity), order costs, and stochastic VLTs; these realistic constraints are exactly what we use to test the MARL algorithms. As a result, Base-stock policies may constantly overflow the warehouse capacity when storage is tight, in which case incoming products are discarded proportionally, as explained in Section 2.2. This explains Base-stock's poor performance in some of the environments. Base-stock uses linear programming to work out the base-stock levels and thus the replenishing policy. "Static" means the base-stock levels are computed from the demand data of the training set and then kept fixed while making decisions on the test set. "Dynamic" updates its base-stock levels on a regular time cycle. "Oracle" directly accesses the whole test set to calculate its base-stock levels; it is essentially a cheating version used to show the upper limit of the Base-stock policy. In practice, we conduct a grid search to find proper v and τ. We refer readers to Hubbs et al. (2020) for a detailed description; our implementation of the Base-stock baselines is also inspired by it.
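The replenishment rule deduced at the end of Algorithm 5 amounts to the following order-up-to logic, assuming the base-stock level Z_i has already been solved for; the function is a sketch whose argument names mirror the pseudocode rather than our actual implementation.

```python
def base_stock_order(Z_i, in_stock, in_transit, total_inventory, capacity, t, v=1.0, tau=1):
    """Order-up-to-Z_i replenishment for SKU i with capacity clipping and interval control."""
    order = max(0.0, Z_i - in_stock - in_transit)          # O_t = max(0, Z_i - I_t - T_t)
    order = min(order, v * capacity - total_inventory)     # respect the (scaled) storage capacity
    if t % tau != 0:                                       # replenish only every tau time steps
        order = 0.0
    return max(order, 0.0)
```

Here `total_inventory` stands for the sum of stock and in-transit units over all SKUs, matching the capacity constraint in the pseudocode.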
G ADDITIONAL RESULTS

G.1 THE FULL RESULTS ON ALL ENVIRONMENTS

Here we present all experimental results in Table 5 and display the training curves of the algorithms in the 50-SKU scenarios in Figure 5 and Figure 6. We also report the results of the OR method designed for the IM problem, namely Base-stock, in Table 5. Please note that the VLT (i.e., leading time) of each SKU in N50 and N100 is stochastic during simulation: the specified VLTs are modelled as exponential distributions in our simulator, and the real lead time of each procurement can be longer than specified due to a lack of upstream vehicles (distribution capacity is also modelled in the simulator, though it is not the focus of this paper). Hence the results on N50 and N100 all involve stochastic VLTs.

Table 5: Profit comparison on all environments (profit in 10k dollars).

|Env Scenario|CD-PPO (ours)|IPPO-IR (w/o context)|MAPPO-IR (w/o context)|IPPO-IR|MAPPO-IR|IPPO (w/o context)|MAPPO (w/o context)|IPPO|MAPPO|Base-stock (Static)|Base-stock (Dynamic)|Base-stock (Oracle)|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|N5-C50|40.58 ± 6.02|40.37 ± 4.89|39.32 ± 15.53|43.33 ± 3.30|54.87 ± 9.26|74.11 ± 1.55|49.24 ± 1.32|63.22 ± 13.75|48.49 ± 1.89|17.4834|33.9469|38.6207|
|N5-C100|99.21 ± 1.91|92.41 ± 2.78|94.70 ± 18.84|91.38 ± 3.57|97.69 ± 14.41|97.89 ± 6.65|74.71 ± 1.51|92.90 ± 13.36|71.57 ± 3.14|48.8944|80.8602|97.7010|
|N50-C500|310.81 ± 76.46|235.09 ± 60.61|N/A|250.03 ± 58.38|N/A|164.43 ± 143.01|N/A|366.74 ± 89.58|N/A|−430.0810|−408.1434|−397.831|
|N50-C2000|694.87 ± 174.184|689.27 ± 48.92|N/A|545.86 ± 459.71|N/A|−1373.29 ± 870.03|N/A|−1102.97 ± 1115.69|N/A|−15.5912|42.7092|1023.6574|
|N100-C1000|660.28 ± 149.94|−2106.98 ± 315.38|N/A|−1126.42 ± 409.83|N/A|−1768.19 ± 1063.61|N/A|−669.83 ± 1395.92|N/A|−173.39|−22.05|91.17|
|N100-C4000|1297.75 ± 124.52|−2223.11 ± 2536.00|N/A|148.00 ± 1017.47|N/A|−6501.42 ± 6234.06|N/A|−6019.28 ± 9056.49|N/A|410.59|493.32|755.47|

Table 6: Average number of samples (in 10k) needed by different algorithms to reach the median performance of the baselines on all environments.

|Env Scenario|CD-PPO (ours)|IPPO-IR (w/o context)|MAPPO-IR (w/o context)|IPPO-IR|MAPPO-IR|IPPO (w/o context)|MAPPO (w/o context)|IPPO|MAPPO|
|---|---|---|---|---|---|---|---|---|---|
|N5-C50|∞|∞|∞|∞|708.10|588.63|1671.10|708.10|1671.10|
|N5-C100|522.80|∞|711.97|∞|987.27|806.56|∞|1298.40|∞|
|N50-C500|5484.49|∞|N/A|∞|N/A|9195.23|N/A|12802.89|N/A|
|N50-C2000|2996.87|3138.49|N/A|8483.04|N/A|∞|N/A|∞|N/A|
|N100-C1000|47.14|∞|N/A|∞|N/A|539.26|N/A|191.88|N/A|
|N100-C4000|60.07|∞|N/A|127.57|N/A|∞|N/A|1151.57|N/A|

It is worth noting that, across all the environments we experimented on, CD-PPO enjoys higher sample efficiency, as shown in Table 6. In environments where the storage capacity is extremely tight, such as N5-C50, CD-PPO does not perform as well as IPPO. We reason that these tight situations favour IPPO (with team reward), which can quickly learn how to adjust each agent's behaviour to improve the team reward, whereas CD-PPO, owing to its decentralized training paradigm, struggles to learn a highly cooperative policy that copes with the strong resource limits while meeting stochastic customer demand. Nevertheless, CD-PPO outperforms its IR (w/o context) baselines and produces performance comparable to the IR (with context) algorithms. On the other hand, IPPO (with team reward) struggles to learn a good policy in the 50-agent environments, since the joint action space is large relative to the single team-reward signal, and it fails on N50-C2000, whose state space is even larger. In summary, CD-PPO has clear strengths in terms of sample efficiency, and we will continue to address the above challenge in future work.

Figure 5: Training curves on N50-C500.

Figure 6: Training curves on N50-C2000.

G.2 ABLATION STUDIES FOR CONTEXT AUGMENTATION

In this section, we seek a better way to augment the context dynamics in order to boost the performance of our algorithm.
We consider the following two questions:

**Q1: Which is the better way to augment the original data, adding noise or using a deep prediction model?**

To answer this question, we run experiments on environment N5-C100 with our algorithm CD-PPO, using either context trajectories generated by a deep LSTM prediction model or the original dynamics trajectories perturbed with random normal noise, each with 3 different seeds. As shown in Figure 7, the runs with model-generated dynamics enjoy a smaller standard deviation and better final performance. This may be because the diversity of the model-generated trajectories surpasses that of random perturbation.

Figure 7: Training curves of CD-PPO with different augmentation methods.

**Q2: Does dynamics augmentation improve the performance of the algorithm? If so, how much should we perturb the original data?**

We run similar experiments on environment N5-C100 with CD-PPO, in which the local simulator is fed a mixture of original dynamics data and LSTM-generated data, with the ratio of perturbed dynamics data varying from 0% to 100%. We find that the algorithm achieves the best performance when fully generated data are used, as shown in Figure 8.

Figure 8: Training curves of CD-PPO with varied ratios of augmented data.