|
# MULTI-AGENT REINFORCEMENT LEARNING WITH SHARED RESOURCE FOR INVENTORY MANAGEMENT |
|
|
|
**Anonymous authors** |
|
Paper under double-blind review |
|
|
|
ABSTRACT |
|
|
|
In this paper, we consider the inventory management (IM) problem for a single store with a large number of SKUs (stock keeping units), where we need to make replenishment decisions for each SKU to balance its supply and demand. The SKUs must cooperate with each other to maximize profits, while also competing for shared resources, e.g., warehouse space and budget. This co-existence of cooperative and competitive behaviors makes IM a complicated game, which can be naturally modelled as a multi-agent reinforcement learning (MARL) problem. In the IM problem, we find that agents interact with each other only indirectly, through shared resources such as warehouse space. To formally model MARL problems with this structure, we propose the shared-resource stochastic game, along with an efficient algorithm to learn policies, particularly for a large number of agents. By leveraging the shared-resource structure, our method greatly reduces model complexity and accelerates the learning procedure compared with standard MARL algorithms, as shown by extensive experiments.
|
|
|
1 INTRODUCTION |
|
|
|
The inventory management (IM) problem has long been one of the most important applications in the supply-chain industry (Nahmias & Smith, 1993). Its main purpose is to maintain a balance between the supply and demand of stock keeping units (SKUs) in a supply chain by optimizing the replenishment decisions for each SKU. Besides increasing profit and reducing operational cost, efficient IM can also lead to better service for customers. However, it is a challenging task in practice, especially when many SKUs are involved in the supply chain. In particular, while all SKUs should cooperate with each other to achieve high profits, they also need to compete for shared resources, e.g., warehouse space and budget. Such co-existence of cooperation and competition renders IM a complicated game that is hard to address.
|
|
|
Traditional methods usually reduce IM to dynamic programming problems. However, these approaches often rely on unrealistic assumptions such as i.i.d. demand, deterministic leading time, etc. Moreover, as the state space grows exponentially with key factors such as the leading time and the number of SKUs (Gijsbrechts et al., 2019), the corresponding dynamic programming problems become intractable due to the curse of dimensionality. Because of these limitations, many approaches based on approximate dynamic programming have been proposed to solve IM problems in different settings (Halman et al., 2009; Fang et al., 2013; Chen & Yang, 2019). While these approaches perform well in certain scenarios, they heavily rely on problem-specific expertise or assumptions, e.g., the zero- or one-period leading time assumption in Halman et al. (2009), and hence can hardly generalize to other settings. In contrast, reinforcement learning (RL) based methods, with fast inference speed, can generalize to various scenarios in a data-driven manner. However, it is usually too costly to train a single global policy that makes decisions for all SKUs, since training efficiency is notably curtailed by the large global state and action spaces (Jiang & Agarwal, 2018). To address this training efficiency issue, it is natural to adopt the multi-agent reinforcement learning (MARL) paradigm, where each SKU is regarded as an agent whose state and action spaces are localized and only contain information relevant to itself.
|
|
|
There are currently two popular paradigms for training MARL in the literature: independent learning (Tan, 1993) and joint action learning (Lowe et al., 2017). Despite their success in many scenarios, these two MARL paradigms also exhibit certain weaknesses that restrain their effectiveness in solving IM problems. On the one hand, under independent learning, the policy training of one agent simply treats all other agents as part of the stochastic environment, and hence is hard to converge due to the non-stationarity of the environment. On the other hand, joint action learning usually learns a centralized critic conditioned on the joint action and state spaces of all agents, which easily becomes intractable as the number of SKUs increases. Furthermore, it can be quite time-consuming to sample data from a joint simulator for a large number of SKUs, since doing so involves heavy computation on many internal variables caused by complex agent interactions.
|
|
|
To address these challenges, we take a closer look at the IM problem and find that there exists a special structure that can be leveraged to design a more effective MARL paradigm. In particular, each agent in IM only interacts with others through the shared resource, e.g., warehouse space. We introduce an auxiliary variable that represents the overall inventory level of all SKUs, and thus implies the available shared resource; we refer to this variable as the context. From the MARL perspective, one agent can be influenced by other agents only through this context. The context dynamics reflect the collective behavior of all agents, and, conditioned on the context dynamics, each agent's state transition and reward function are independent of the other agents. In this way, using the context as an additional input to each agent's policy/value functions enables us to both avoid the non-stationarity problem caused by independent learning and avoid the intractable centralized-critic learning caused by exploding state and action spaces. Moreover, the context dynamics allow us to build a local simulator for each agent, which facilitates more efficient policy learning.
|
|
|
Based on this structure with context dynamics, we propose a shared-resource stochastic game to model the IM problem. Specifically, we introduce a surrogate objective function to optimize the return of agent i conditioned on all other agents, denoted by −i. We make two approximations to obtain this surrogate objective: 1) rearranging the sampling process by first sampling the context dynamics and then sampling the local state/action/reward for each agent i; 2) using context dynamics sampled by previous policies. Based on the surrogate objective, our algorithm consists of two iterative learning procedures: 1) obtaining context dynamics from the joint simulator, and 2) updating the policy of each agent with data sampled from its respective local simulator conditioned on the collective context dynamics. By decoupling each agent from all others with a separate training procedure, our method greatly reduces model complexity and accelerates the learning procedure, as shown by extensive experiments. It is worth mentioning that our method is not limited to IM, but can also be applied to many other applications with a shared-resource structure, e.g., portfolio management (Ye et al., 2020), smart-grid scheduling (Remani et al., 2019), etc.
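As a rough illustration of these two iterative procedures, the sketch below alternates between collecting context dynamics from a joint rollout and training each agent against the recorded context in its own local simulator. The names `joint_sim`, `local_sim_factory`, and `update_policy` are hypothetical placeholders for illustration only, not the actual implementation described in this paper.

```python
def train(joint_sim, local_sim_factory, policies, update_policy, num_iters, horizon):
    """Sketch of the two-phase learning loop (all interfaces are hypothetical)."""
    for _ in range(num_iters):
        # Phase 1: roll out all agents in the joint simulator and record the
        # context dynamics, i.e., the trajectory of shared-resource occupancy.
        context_traj = joint_sim.rollout(policies, horizon)

        # Phase 2: update each agent separately in its local simulator, which
        # replays the recorded context instead of simulating the other agents.
        for i, policy in enumerate(policies):
            local_sim = local_sim_factory(agent_id=i, context=context_traj)
            trajectories = local_sim.rollout(policy, horizon)
            update_policy(policy, trajectories)  # e.g., a standard policy-gradient update
    return policies
```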
|
|
|
Our contributions are summarized as follows: |
|
|
|
- We propose the shared-resource stochastic game to capture the problem structure of IM, namely that agents only interact with each other through a shared resource.

- We propose a novel algorithm that leverages the shared-resource structure to solve the IM problem efficiently.

- Extensive experiments demonstrate that our method outperforms existing MARL algorithms in terms of both profit and computational efficiency.
|
|
|
2 BACKGROUND |
|
|
|
2.1 STOCHASTIC GAMES |
|
|
|
We build our work on stochastic games (SGs) (Shapley, 1953), since each SKU in the IM problem has its own profit (reward) to optimize. A stochastic game is defined by a tuple $\left\langle \mathcal{N}, \mathcal{S}, \{\mathcal{A}^i\}_{i\in\mathcal{N}}, \mathcal{T}, \{R^i\}_{i\in\mathcal{N}}, \gamma \right\rangle$, where $\mathcal{N} = \{1, \cdots, n\}$ denotes the set of $n > 1$ agents, $\mathcal{S}$ denotes the state space observed by all agents, and $\mathcal{A}^i$ denotes the action space of agent $i$. Let $\mathcal{A} := \mathcal{A}^1 \times \cdots \times \mathcal{A}^n$; then $\mathcal{T} : \mathcal{S} \times \mathcal{A} \rightarrow \Delta(\mathcal{S})$ denotes the transition probability from any state $s \in \mathcal{S}$ to any state $s' \in \mathcal{S}$ after taking a joint action $a \in \mathcal{A}$; $R^i : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}$ is the reward function that determines the immediate reward received by agent $i$ for a transition from $(s, a)$ to $s'$; and $\gamma \in [0, 1)$ is the discount factor. We write the joint policy of the other agents as $\pi^{-i} = \prod_{j \in -i} \pi^j$. Each agent $i$ optimizes its policy $\pi^i : \mathcal{S} \rightarrow \Delta(\mathcal{A}^i)$ to maximize its own long-term reward, which is conditioned on the other agents' behavior, defined as

$$\max_{\pi^i} \ \eta^i(\pi^i, \pi^{-i}) = \mathbb{E}_{(s_t, a^i_t, a^{-i}_t) \sim \mathcal{T}, \pi^i, \pi^{-i}} \left[ \sum_{t=0}^{\infty} \gamma^t r^i_t \right]. \quad (1)$$
|
We will illustrate the shared-resource structure of the IM problem in Section 2.2, which motivates us to propose the shared-resource stochastic game, a special case of the stochastic game that captures this structure, in Section 3.1.
|
|
|
2.2 INVENTORY MANAGEMENT WITH SHARED RESOURCE |
|
|
|
While a typical IM setting involves a multi-echelon supply network including stores, warehouses, and factories, we simplify the setting to ease our presentation. In the following, we focus on scenarios with one store and multiple SKUs. We further assume that there is an upstream warehouse that can perfectly fulfill the requirements of the store. Our objective is to learn a high-quality replenishing policy for each SKU in the store, particularly when there are a large number of SKUs. Since replenishing decisions for SKUs in stores must directly account for customer consumption behavior and for competition from other SKUs over limited resources such as space and budget, they are more challenging to optimize than those for SKUs in warehouses. It is worth mentioning that, due to the flexibility of RL algorithms, our method can also be applied to more complex settings with multiple echelons, fluctuating supply, non-deterministic leading time, etc.
|
|
|
Similar to previous work, we follow the multi-agent RL (MARL) framework with decentralized agents, each of which manages the inventory of one SKU in the store. We assume that the store sells $n$ SKUs, all of which share a common space that can store up to $I_{\max}$ units at the same time. Replenishing decisions for each SKU are made at discretized time steps, which are days in this paper.
|
|
|
For each time step $t$ and SKU $i$, let $\dot{I}^i_t \in \mathbb{Z}$ denote the units of SKU $i$ that are in stock. Hence, the following constraint shall hold for all time steps:

$$\sum_{i=1}^{n} \dot{I}^i_t \leq I_{\max}, \quad \forall t \geq 0. \quad (2)$$
|
|
|
At any time step $t$, the agent corresponding to SKU $i$ may place a replenishment order to request $O^i_t \in \mathbb{Z}$ units of products from its upstream warehouse. These replenishment orders cannot be fulfilled instantly, but take several time steps, referred to as the leading time, before the products are delivered to the store. Let $L_i$ denote the leading time of SKU $i$ and $T^i_t \in \mathbb{Z}$ its total units in transit at time step $t$. Meanwhile, demand from customers $D^i_t$ may consume the inventory of SKU $i$ and cause an actual sale of $S^i_t \in \mathbb{Z}$ units. Due to the possibility of being out of stock, $S^i_t$ may be less than $D^i_t$. Formally, the dynamics of these variables can be described as follows:

$$S^i_t = \min\left(D^i_t, \dot{I}^i_t\right) \quad (3)$$

$$T^i_{t+1} = T^i_t - O^i_{t-L_i+1} + O^i_{t+1} \quad (4)$$

$$\hat{I}^i_t = \dot{I}^i_t - S^i_t + O^i_{t-L_i+1} \quad (5)$$

$$\rho = \begin{cases} 0 & \text{if } \sum_{i=1}^{n} \hat{I}^i_t \leq I_{\max} \\ \dfrac{\sum_{i=1}^{n} \hat{I}^i_t - I_{\max}}{\sum_{i=1}^{n} O^i_{t-L_i+1}} & \text{otherwise} \end{cases} \quad (6)$$

$$\dot{I}^i_{t+1} = \dot{I}^i_t - S^i_t + \left\lfloor (1-\rho)\, O^i_{t-L_i+1} \right\rfloor \quad (7)$$
|
As mentioned before, because all SKUs share a common space, the storage may overflow when ordered products arrive. In this paper, we assume that the excess units are discarded proportionally according to $\rho$ defined in Eq. (6), which corresponds to the overflow ratio if we were to accept all arriving units without considering the space constraint. To calculate $\rho$, we also introduce an extra variable $\hat{I}^i_t$, which we call the afterstate of $\dot{I}^i_t$. Intuitively, it denotes the units of SKU $i$ in stock at the end of the $t$-th time step if we ignore the capacity constraint. We note that other ways of resolving space overflows, e.g., prioritizing the SKUs, are possible; the algorithm that we introduce in the following section also applies to these settings. Undoubtedly, such overflows incur extra operational cost, which should be avoided as much as possible by our replenishing decisions.
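To make these dynamics concrete, the following sketch implements one store transition following Eqs. (3), (5), (6), and (7); the in-transit bookkeeping of Eq. (4) is tracked implicitly through a per-SKU order pipeline. The function name `store_step` and its interface are our own illustration under these assumptions, not the paper's implementation.

```python
import math

def store_step(stock, pipelines, demand, I_max):
    """One store transition, as a sketch of Eqs. (3) and (5)-(7).

    stock[i]     : units of SKU i currently in stock (İ_t^i)
    pipelines[i] : outstanding orders of SKU i, oldest first; the oldest entry
                   is O^i_{t-L_i+1}, the order arriving at this step, and the
                   list's sum corresponds to the in-transit quantity T_t^i of Eq. (4)
    demand[i]    : customer demand D_t^i
    """
    n = len(stock)
    sales = [min(demand[i], stock[i]) for i in range(n)]                 # Eq. (3)
    arriving = [pipelines[i].pop(0) for i in range(n)]                   # O^i_{t-L_i+1}
    afterstate = [stock[i] - sales[i] + arriving[i] for i in range(n)]   # Eq. (5)

    # Eq. (6): fraction of arriving units discarded when the capacity overflows
    overflow = sum(afterstate) - I_max
    rho = 0.0 if overflow <= 0 or sum(arriving) == 0 else overflow / sum(arriving)

    # Eq. (7): accept only a (1 - rho) fraction of each arriving order (floored)
    next_stock = [stock[i] - sales[i] + math.floor((1 - rho) * arriving[i])
                  for i in range(n)]
    return next_stock, sales, rho
```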
|
|
|
The intermediate profit $P^i_t$ of the $i$-th SKU is calculated according to the following equation:

$$P^i_t = p_i S^i_t - q_i O^i_t - o\, \mathbb{I}\left[O^i_t > 0\right] - h \dot{I}^i_t \quad (8)$$

where $p_i$ and $q_i$ are the unit sale price and the unit procurement price of the $i$-th SKU, respectively, and $o$ and $h$ are the order cost and the unit holding cost, respectively. $\mathbb{I}[\cdot]$ is an indicator function that equals one when the condition is true and zero otherwise. The order cost reflects the fixed transportation cost or order-processing cost, and is incurred whenever the order quantity is non-zero.
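A direct translation of Eq. (8) is sketched below; the function name `step_profit` and its argument names are hypothetical and chosen only for illustration.

```python
def step_profit(p, q, o, h, sale, order_qty, stock):
    """Per-step profit P_t^i of one SKU following Eq. (8), as a sketch.

    p, q   : unit sale price p_i and unit procurement price q_i
    o, h   : fixed order cost and unit holding cost
    sale, order_qty, stock : S_t^i, O_t^i, and İ_t^i at time step t
    """
    order_fee = o if order_qty > 0 else 0   # o * I[O_t^i > 0]
    return p * sale - q * order_qty - order_fee - h * stock
```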
|
|
|
For convenience, we summarize all notations in Table 2 in Appendix B, where we also give an |
|
example with 2 SKUs in Fig. 4 to further illustrate the whole procedure. |
|
|
|
3 METHODOLOGY |
|
|
|
3.1 SHARED-RESOURCE STOCHASTIC GAME |
|
|
|
In this section, we show that the IM problem introduced in Section 2.2 can be formulated as a shared-resource stochastic game, where each agent is only influenced by other agents through a shared resource pool. We define a shared-resource stochastic game as a tuple $\left\langle \mathcal{N}, \{\mathcal{S}^i\}_{i\in\mathcal{N}}, \mathcal{C}, \{\mathcal{A}^i\}_{i\in\mathcal{N}}, \mathcal{T}, \{R^i\}_{i\in\mathcal{N}}, \gamma \right\rangle$, where $\mathcal{N} = \{1, \cdots, n\}$ denotes the set of $n > 1$ agents, $\mathcal{S}^i$ denotes the state space of agent $i$, $\mathcal{C}$ denotes the context space observed by all agents, and $\mathcal{A}^i$ denotes the action space of agent $i$. Here, the context represents the occupied capacity of the shared resource. Let $\mathcal{S} := \mathcal{S}^1 \times \cdots \times \mathcal{S}^n$ and $\mathcal{A} := \mathcal{A}^1 \times \cdots \times \mathcal{A}^n$; then $\mathcal{T} : \mathcal{S} \times \mathcal{C} \times \mathcal{A} \rightarrow \Delta(\mathcal{S} \times \mathcal{C})$ denotes the transition probability, which can be decomposed as follows. The context is affected by all agents, i.e., $c_{t+1} \sim P_c(\cdot \mid c_t, s_t, a_t)$. Since we are dealing with a resource, the transition function of the context usually has some additive structure with respect to all agents, which we will illustrate later in Section 3.2. Given the context $c_{t+1}$, the state transition function of each agent is independent of the other agents, i.e., $s^i_{t+1} \sim P^i_s(\cdot \mid s^i_t, a^i_t, c_{t+1})$. The reward function is also independent of the other agents given the context, $r^i_t \sim R^i(s^i_t, a^i_t, c_{t+1})$. We refer to $P^i_s$ and $R^i$ as the local transition and reward functions, and $\gamma \in [0, 1)$ is the discount factor. Each agent $i$ optimizes its own policy $\pi^i : \mathcal{S}^i \times \mathcal{C} \rightarrow \Delta(\mathcal{A}^i)$ to maximize the expected long-term reward conditioned on the other agents' policies $\pi^{-i}$, defined as