# RESEARCH ON A FUSION ALGORITHM OF MULTI-ATTRIBUTE DECISION MAKING AND REINFORCEMENT LEARNING BASED ON INTUITIONISTIC FUZZY NUMBERS IN A WARGAME ENVIRONMENT

**Anonymous authors**
Paper under double-blind review

ABSTRACT

Intelligent games have attracted increasing interest in the artificial intelligence research community. This article proposes an algorithm that combines multi-attribute decision making with reinforcement learning and applies it to wargaming, addressing two problems in intelligent wargame training: the agent's low winning rate against specific rule-based opponents and its slow convergence. We study the combined multi-attribute decision making and reinforcement learning algorithm in a wargame simulation environment and obtain confrontation data between the red and blue sides. The weight of each attribute is calculated from intuitionistic fuzzy number weight calculations, and the threat posed by each of the opponent's game agents is then determined. From this threat degree, the red side's reinforcement learning reward function is constructed, the actor-critic (AC) framework is trained on this reward function, and an algorithm combining multi-attribute decision making with reinforcement learning is obtained. A simulation experiment confirms that the proposed algorithm is significantly more intelligent than a pure reinforcement learning algorithm. By addressing the random initialization of the agent's neural network together with sparse rewards in large-map combat games, the algorithm effectively reduces the difficulty of convergence. To the best of our knowledge, this is also the first algorithm design for intelligent wargaming that combines multi-attribute decision making with reinforcement learning. Finally, another novelty of this research is its interdisciplinary nature, touching both the design of intelligent wargames and the improvement of reinforcement learning algorithms.

1 INTRODUCTION

Artificial intelligence (AI) and machine learning (ML) are becoming increasingly popular in real-world applications. For example, AlphaGo attracted huge attention in the research community and in society by showing that AI can defeat professional human players in the board game Go, and AlphaStar, another strong AI program, achieved great success in the competitive game StarCraft Pang et al. (2019); Silver et al. (2016). In RTS games, AI-driven methods are widely studied and integrated into game AI design to increase the intelligence of computer opponents and to generate a more realistic confrontation experience. In the game Honor of Kings, Ye et al. used an improved PPO algorithm to train the game AI, with positive results Ye et al. (2020). Using reinforcement learning techniques, Silver et al. developed a training framework that requires no human knowledge other than the rules of the game, allowing AlphaGo to train itself and achieve a high level of intelligence in the process Silver et al. (2017). Using deep reinforcement learning and supervised policy learning, Barriga et al. improved the AI performance of RTS games and defeated the built-in game AI Barriga et al. (2019). AI has become a hot research topic in recent years, with a wide variety of applications such as deduction and analysis Schrittwieser et al. (2020); Barriga et al. (2017); O'Hanlon (2021).
However, there is still limited research addressing the problem of slow convergence during the AI training process under a variety of conditions, especially in human-AI confrontation games.

An index measures the value of things or serves as a parameter of an evaluation system; it is the scale by which the effectiveness of things is judged by the subject. As an attribute value, it expresses subjective consciousness or objective facts in numbers or words. It is therefore important to select a scientifically valid target threat assessment (TA) index and to evaluate that index scientifically. Target threat assessment contributes to decision making in current intelligent wargames. Mainstream intelligent decision making for games is based on rules, decision trees, reinforcement learning, and other technologies, but rarely incorporates multi-attribute decision-making theory and methods. This paper uses actual confrontation data obtained from a wargame environment and transforms the multi-attribute threat assessment indicators into a unified expression. Using three expression forms (real numbers, interval numbers, and intuitionistic fuzzy numbers), multi-attribute decision-making theory and methods are applied to analyse the target threat degree. An enhanced reward function based on the resulting threat degree is then established to train a more effective intelligent decision-making model. To the best of our knowledge, this is the first work that combines multi-attribute decision making with reinforcement learning to produce a high-performance game AI in a wargame experiment.

2 WARGAMING MULTIPLE ATTRIBUTE INDEX THREAT QUANTIFICATION

Obtaining scientific evaluation results requires a reasonable quantification of indicators. Target threat assessment is an important aspect of decision-making assistance in wargames, and the evaluation result directly affects the effectiveness of the wargame AI. This section introduces threat quantification methods for different types of indicators. Based on the target type, the threat is divided into target distance threat, target attack threat, target speed threat, terrain visibility threat, environmental indicator threat, and target defence value. The acquired confrontation data are assigned to the different indicator types, and the corresponding comprehensive threat value is then calculated. Table 1 lists the attributes and meanings of the specific indicators.

Table 1: A list of indicator attributes and their meanings

| Indicator | Attribute | Meaning |
| --- | --- | --- |
| Target distance threat | Cost type | The distance between the two parties influences the kill probability. |
| Target attack threat | Benefit type | The threat degree is determined by the opponent's weapon type, range, and lethality. |
| Target speed threat | Benefit type | The threat arising from the opponent's speed. |
| Terrain visibility threat | Intervisibility > no intervisibility | Whether the terrain allows intervisibility directly affects the threat. |
| Environmental indicator threat | Benefit type | An environment that favours the opponent's concealment and mobility is more dangerous. |
| Target defence value | Cost type | The stronger the opponent's armour, the harder it is to destroy. |
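The paper does not spell out the mapping from raw observations to indicator scores; as one possible reading of the cost/benefit distinction in Table 1, a minimal sketch (assuming simple min-max scaling and hypothetical raw values) could look like this:

```python
import numpy as np

def normalize_indicator(values, kind):
    """Map raw indicator values to [0, 1] threat scores.

    kind='benefit': larger raw value -> larger threat (e.g. attack power, speed).
    kind='cost'   : larger raw value -> smaller threat (e.g. distance, armour).
    Linear min-max scaling is an assumption; the paper does not specify the mapping.
    """
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:                       # degenerate column: all targets identical
        return np.full_like(values, 0.5)
    scaled = (values - lo) / (hi - lo)
    return scaled if kind == "benefit" else 1.0 - scaled

# Hypothetical raw observations for three opposing tanks.
distance_km = [1.2, 3.5, 0.8]        # cost type: closer means more threatening
attack_power = [40.0, 25.0, 60.0]    # benefit type: stronger means more threatening

print(normalize_indicator(distance_km, "cost"))
print(normalize_indicator(attack_power, "benefit"))
```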
3 ESTABLISHMENT OF A MULTI-ATTRIBUTE QUANTITATIVE THREAT MODEL BASED ON INTUITIONISTIC FUZZY NUMBERS

The terrain visibility indicator is expressed with interval numbers, indicating whether intervisibility is possible and producing different threat values, while the quantified values of the other threat targets are real numbers. To unify the problem-solving method, our algorithm converts all interval numbers and real numbers into intuitionistic fuzzy numbers and calculates the threat degree by operating on these intuitionistic fuzzy numbers.

(1) Intuitionistic fuzzy entropy describes the degree of fuzziness of the judgment information provided by an intuitionistic fuzzy set. The larger the intuitionistic fuzzy entropy of an evaluation criterion, the smaller its weight should be; conversely, the smaller the entropy, the larger the weight. Following the formulas in Vlachos & Sergiadis (2007), we calculate the entropy weight of each intuitionistic fuzzy attribute. The ideal solution $S_i^+$ is a conceived optimal solution (scheme) whose attribute values attain the best values among the alternatives, and the negative ideal solution $S_i^-$ is the conceived worst solution (scheme) whose attribute values attain the worst values among the alternatives. The closeness $p_i$ is obtained by comparing each alternative with the ideal and negative ideal solutions: the alternative that is closest to the ideal solution and at the same time farthest from the negative ideal solution is the best one.

$$H_j = -\frac{1}{n \ln 2} \sum_{i=1}^{n} \left[ \mu_{ij} \ln \mu_{ij} + \nu_{ij} \ln \nu_{ij} - (\mu_{ij} + \nu_{ij}) \ln (\mu_{ij} + \nu_{ij}) - (1 - \mu_{ij} - \nu_{ij}) \ln 2 \right] \tag{1}$$

If $\mu_{ij} = 0$ and $\nu_{ij} = 0$, then $\mu_{ij} \ln \mu_{ij} = 0$, $\nu_{ij} \ln \nu_{ij} = 0$, and $(\mu_{ij} + \nu_{ij}) \ln (\mu_{ij} + \nu_{ij}) = 0$. The entropy weight of the $j$-th attribute is defined as

$$w_j = \frac{1 - H_j}{n - \sum_{j=1}^{n} H_j} \tag{2}$$

where $w_j \ge 0$, $j = 1, 2, \cdots, n$, and $\sum_{j=1}^{n} w_j = 1$.

(2) Determine the optimal solution $A^+$ and the worst solution $A^-$ using the following formulas:

$$A^+ = \left( \langle \mu_1^+, \nu_1^+ \rangle, \langle \mu_2^+, \nu_2^+ \rangle, \cdots, \langle \mu_n^+, \nu_n^+ \rangle \right), \qquad A^- = \left( \langle \mu_1^-, \nu_1^- \rangle, \langle \mu_2^-, \nu_2^- \rangle, \cdots, \langle \mu_n^-, \nu_n^- \rangle \right) \tag{3}$$

where

$$\mu_i^+ = \max_{j=1,2,\ldots,m} \{\mu_{ij}\}, \qquad \nu_i^+ = \min_{j=1,2,\ldots,m} \{\nu_{ij}\} \tag{4}$$

$$\mu_i^- = \min_{j=1,2,\ldots,m} \{\mu_{ij}\}, \qquad \nu_i^- = \max_{j=1,2,\ldots,m} \{\nu_{ij}\} \tag{5}$$

(3) Calculate the similarity between two intuitionistic fuzzy numbers $\langle \mu_1, \nu_1 \rangle$ and $\langle \mu_2, \nu_2 \rangle$ as follows:

$$s\left( \langle \mu_1, \nu_1 \rangle, \langle \mu_2, \nu_2 \rangle \right) = 1 - \frac{\pi_1 + \pi_2}{2} \cdot \frac{\left| 2(\mu_1 - \mu_2) - (\nu_1 - \nu_2) \right|}{3} - \left( 1 - \frac{\pi_1 + \pi_2}{2} \right) \cdot \frac{\left| 2(\nu_1 - \nu_2) - (\mu_1 - \mu_2) \right|}{3} \tag{6}$$

in which $\pi_1 = 1 - \mu_1 - \nu_1$ and $\pi_2 = 1 - \mu_2 - \nu_2$.

(4) Calculate the similarities $S_i^+$ and $S_i^-$ between each alternative and the optimal and worst solutions:

$$S_i^+ = \sum_{k=1}^{n} w_k \cdot s\left( \langle \mu_k^+, \nu_k^+ \rangle, \langle \mu_{ik}, \nu_{ik} \rangle \right), \qquad S_i^- = \sum_{k=1}^{n} w_k \cdot s\left( \langle \mu_k^-, \nu_k^- \rangle, \langle \mu_{ik}, \nu_{ik} \rangle \right) \tag{7}$$

(5) Then calculate the relative closeness

$$p_i = \frac{S_i^+}{S_i^+ + S_i^-} \tag{8}$$

The threat levels of the opponent's pieces are compared according to this relative closeness: the closer an alternative is to the ideal solution, the higher its assessed threat.
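The computation in Eqs. (1)-(8) can be summarized in a short NumPy sketch; the function names, the array layout (alternatives as rows, attributes as columns), and the handling of the 0·ln 0 convention are assumptions of this illustration, not the paper's implementation.

```python
import numpy as np

def entropy_weights(mu, nu):
    """Entropy weight of each attribute, Eqs. (1)-(2).

    mu, nu: (m alternatives x n attributes) membership / non-membership degrees.
    """
    m, n = mu.shape
    with np.errstate(divide="ignore", invalid="ignore"):
        t = (mu * np.log(mu) + nu * np.log(nu)
             - (mu + nu) * np.log(mu + nu)
             - (1.0 - mu - nu) * np.log(2.0))
    t = np.nan_to_num(t)                        # 0 * ln 0 is taken as 0
    H = -t.sum(axis=0) / (m * np.log(2.0))      # intuitionistic fuzzy entropy per attribute
    return (1.0 - H) / (n - H.sum())            # entropy weights, summing to 1

def similarity(mu1, nu1, mu2, nu2):
    """Similarity between two intuitionistic fuzzy numbers, Eq. (6)."""
    pi_bar = ((1.0 - mu1 - nu1) + (1.0 - mu2 - nu2)) / 2.0
    return (1.0
            - pi_bar * np.abs(2 * (mu1 - mu2) - (nu1 - nu2)) / 3.0
            - (1.0 - pi_bar) * np.abs(2 * (nu1 - nu2) - (mu1 - mu2)) / 3.0)

def threat_closeness(mu, nu):
    """Relative closeness of each alternative to the ideal solution, Eqs. (3)-(8)."""
    w = entropy_weights(mu, nu)
    mu_p, nu_p = mu.max(axis=0), nu.min(axis=0)     # ideal solution A+
    mu_n, nu_n = mu.min(axis=0), nu.max(axis=0)     # negative ideal solution A-
    s_plus = (w * similarity(mu_p, nu_p, mu, nu)).sum(axis=1)
    s_minus = (w * similarity(mu_n, nu_n, mu, nu)).sum(axis=1)
    return s_plus / (s_plus + s_minus)
```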
4 MULTI-ATTRIBUTE THREAT QUANTITATIVE SIMULATION

The threat assessment problem is transformed into a multi-attribute decision-making problem, and the combat intention of the target is incorporated into the evaluation system to make the evaluation more realistic and the results more reliable.

A simulation scene includes ten tanks on each side, red and blue, fighting each other; the ten opposing tanks are the game agents whose threat is assessed. A unified intuitionistic fuzzy number representation is created for all multi-attribute indicators. Table 2 gives the intuitionistic fuzzy number representation of the threat assessment indicators.

Table 2: Information decision table for threat target parameters (intuitionistic fuzzy numbers)

| | Target distance threat | Target speed threat | Target attack threat | Terrain visibility threat | Environmental indicator threat | Target defence |
| --- | --- | --- | --- | --- | --- | --- |
| Tank1 | [0.0, 0.0] | [0.2, 0.0] | [0.153863899, 0.046136101] | [0.2, 0.0] | [0.2, 0.0] | [0.187378998, 0.012621002] |
| Tank2 | [0.171440811, 0.028559189] | [0.0, 0.0] | [0.18749387, 0.01250613] | [0.2, 0.0] | [0.2, 0.0] | [0.2, 0.0] |
| Tank3 | [0.0, 0.0] | [0.2, 0.0] | [0.2, 0.0] | [0.000664452, 0.199335548] | [0.187608882, 0.012391118] | [0.2, 0.0] |
| Tank4 | [0.0, 0.0] | [0.2, 0.0] | [0.2, 0.0] | [0.2, 0.0] | [0.000399202, 0.199600798] | [0.176663586, 0.023336414] |
| Tank5 | [0.0, 0.0] | [0.2, 0.0] | [0.2, 0.0] | [0.2, 0.0] | [0.0001998, 0.1998002] | [0.187608882, 0.012391118] |
| Tank6 | [0.0, 0.0] | [0.2, 0.0] | [0.2, 0.0] | [0.2, 0.0] | [0.2, 0.0] | [0.176663586, 0.023336414] |
| Tank7 | [0.0, 0.0] | [0.2, 0.0] | [0.2, 0.0] | [0.2, 0.0] | [0.2, 0.0] | [0.176561598, 0.023438402] |
| Tank8 | [0.0, 0.0] | [0.2, 0.0] | [0.2, 0.0] | [0.171440811, 0.028559189] | [0.000664452, 0.199335548] | [0.199738767, 0.000261233] |
| Tank9 | [0.0, 0.0] | [6.6644e-05, 0.199933356] | [0.2, 0.0] | [0.186672886, 0.013327114] | [0.0001998, 0.1998002] | [0.2, 0.0] |
| Tank10 | [0.0, 0.0] | [0.2, 0.0] | [0.2, 0.0] | [0.171440811, 0.028559189] | [0.000285307, 0.199714693] | [0.2, 0.0] |

Table 3: Threat assessment for the targets

| $S_i^+$ | 0.9900131572106283, 0.9930194457658972, 0.9713249517102417, 0.9694274902547305, 0.9712630240082707, 0.9960298049584839, 0.9960124538670997, 0.9685356920167532, 0.9447732710194203, 0.9685296037271114 |
| --- | --- |
| $S_i^-$ | 0.9451975215527424, 0.9421912329974735, 0.963885727053129, 0.9657831885086402, 0.9639476547551001, 0.9391808738048868, 0.9391982248962711, 0.9666749867466174, 0.9904374077439504, 0.9666810750362593 |
| $P_i$ | 0.5115790069137391, 0.5131324752716081, 0.5019220710020746, 0.5009415775207532, 0.5018900705058751, 0.5146880470889931, 0.5146790810929003, 0.5004807500523212, 0.4882017660336315, 0.500477603991942 |
| Ranking | T6 > T7 > T2 > T1 > T3 > T5 > T4 > T8 > T10 > T9 |

Given the intuitionistic fuzzy representation of the threat assessment indicators in Table 2, formulas (7) and (8) yield the intuitionistic fuzzy target threat assessment based on the multi-attribute decision-making approach. Table 3 shows the assessment scores used to determine the target threat level.
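As a usage illustration, a few rows of Table 2 can be fed through the `threat_closeness` sketch from the previous section (a hypothetical helper, not the paper's code); μ is the first entry of each bracketed pair and ν the second. With only three of the ten tanks the scores will not reproduce Table 3 exactly, since the entropy weights depend on the full decision matrix.

```python
import numpy as np

# (mu, nu) pairs for Tank1-Tank3 over the six indicators of Table 2 (abridged).
pairs = np.array([
    [[0.0, 0.0], [0.2, 0.0], [0.153863899, 0.046136101],
     [0.2, 0.0], [0.2, 0.0], [0.187378998, 0.012621002]],
    [[0.171440811, 0.028559189], [0.0, 0.0], [0.18749387, 0.01250613],
     [0.2, 0.0], [0.2, 0.0], [0.2, 0.0]],
    [[0.0, 0.0], [0.2, 0.0], [0.2, 0.0],
     [0.000664452, 0.199335548], [0.187608882, 0.012391118], [0.2, 0.0]],
])
mu, nu = pairs[..., 0], pairs[..., 1]

p = threat_closeness(mu, nu)      # one relative-closeness (threat) score per tank
ranking = np.argsort(-p) + 1      # tank indices from highest to lowest threat (1-based)
print(p, ranking)
```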
Table 4 ranks the opposing targets at time T by their comprehensive threat scores.

Table 4: Ranking of the opposing targets at time T

| Type of piece | Comprehensive indicator | Ranking |
| --- | --- | --- |
| Tank 1 | 0.511579007 | 4 |
| Tank 2 | 0.513132475 | 3 |
| Tank 3 | 0.501922071 | 5 |
| Tank 4 | 0.500941578 | 7 |
| Tank 5 | 0.501890071 | 6 |
| Tank 6 | 0.514688047 | 1 |
| Tank 7 | 0.514679081 | 2 |
| Tank 8 | 0.50048075 | 8 |
| Tank 9 | 0.488201766 | 10 |
| Tank 10 | 0.500477604 | 9 |

Based on the evaluation results, the blue Tank 6 is the most threatening and Tank 7 is the second most threatening, as shown in Figure 1. This paper does not limit the evaluation to the subjective analysis of experts: it also introduces reinforcement learning, links the evaluation to the reinforcement learning algorithm through a reward function, and analyses the winning rate of the actual wargame AI.

5 A FUSION MODEL OF REINFORCEMENT LEARNING AND MULTI-ATTRIBUTE THREAT ANALYSIS

5.1 REINFORCEMENT LEARNING ALGORITHM AND MULTI-ATTRIBUTE MODEL FORMULATION

The previous sections described how the threat level is quantified by multi-attribute analysis based on the entropy weight method. This section integrates that method with reinforcement learning. The essence is to establish a multi-attribute decision-making mechanism inside the reinforcement learning loop: the entity with the highest threat level determines the return value, and the higher the threat level, the greater the return value, as shown in Figure 2.

A reinforcement learning algorithm built on the AC framework is used to achieve intelligent decision making. It consists of a reinforcement learning pre-training module that integrates multi-attribute decision making, a critic evaluation network update module, and a new-and-old policy network update module. In the pre-training module, multi-attribute decision making mainly uses state data obtained from the wargame environment, such as elevation, distance, and armour thickness. The data are normalized, the threat of each of the opponent's pieces is calculated with the entropy weight method, and the reward function is set accordingly and stored in the experience buffer; further actions in the environment then yield the next state and the corresponding action rewards.

Figure 1: The threat values of the opponent's ten tanks at time T; the ordinate is the threat value and the ten colours on the abscissa represent the ten tanks.

Figure 2: A fusion model of reinforcement learning and multi-attribute threat estimation based on the AC framework. The model mainly consists of a reinforcement learning pre-training module that integrates multi-attribute decision making, a critic evaluation network update module, and a new-and-old policy network module.

The critic network estimates a state value from the reward obtained in the last step of the action. The data in the experience buffer are combined with the value estimated by the critic network: the estimate is subtracted from the observed return to form an advantage, which is back-propagated to update the critic network parameters. The advantage value then guides the actor network: the network outputs action values and their overall distribution probabilities according to the old and new policy networks and selects the action. Finally, the advantage value is used to compute the actor loss, and the actor network is updated by back-propagation.
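To make the update flow just described concrete, here is a minimal one-step actor-critic sketch in PyTorch; the function signature, the one-step bootstrapped return, and the plain policy-gradient actor loss are illustrative assumptions, not the paper's exact PPO loss.

```python
import torch

def actor_critic_update(actor, critic, opt_actor, opt_critic,
                        state, action_logp, reward, next_state, gamma=0.99):
    """One-step advantage actor-critic update: value estimate, reward-minus-value
    advantage, critic regression, then policy-gradient actor update."""
    value = critic(state)                               # critic's value estimate for the state
    with torch.no_grad():
        target = reward + gamma * critic(next_state)    # bootstrapped return from the stored reward
    advantage = target - value                          # the "subtract value from return" step

    critic_loss = advantage.pow(2).mean()               # regress the critic toward the return
    opt_critic.zero_grad()
    critic_loss.backward()
    opt_critic.step()

    actor_loss = -(action_logp * advantage.detach()).mean()   # advantage-weighted policy loss
    opt_actor.zero_grad()
    actor_loss.backward()
    opt_actor.step()
    return critic_loss.item(), actor_loss.item()
```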
5.2 SETTING THE REWARD FUNCTION VALUES

A core challenge for deep reinforcement learning in practical tasks is the sparse reward problem: the training environment cannot effectively supervise the updating of the agent's parameters during reinforcement learning Kaelbling et al. (1996). In supervised learning the training process is supervised directly, whereas in reinforcement learning rewards supervise the training process and the agent optimizes its strategy based on those rewards. The specific additional rewards are shown in Table 5.

Table 5: Reward settings

| Situation | Reward |
| --- | --- |
| The state is closer to the control point than the previous state | Reward + 0.5 |
| The state is no closer to the control point than the previous state | Reward - 0.3 |
| The map boundary has been reached | Reward - 1 |
| Consumption per step (to avoid falling into a local optimum) | Reward - 0.005 |
| An opposing piece was hit | Reward + (5 × risk of being hit by the piece) |
| Hit by an opposing round | Reward - (5 × risk of being hit by the piece) |
| An opposing piece is annihilated | Reward + 10 |
| Taking out an opposing piece leads to victory | Reward + 20 |
| Losing a piece leads to defeat (the remaining opposing pieces reach the control point) | Reward - 10 |
| Reaching the control point | Reward + 10 |
| The opponent wins | Reward - 10 |

When these additional rewards are added to the training process, the convergence speed is significantly accelerated and the likelihood that the agent falls into a local optimum is significantly reduced.
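A minimal sketch of how the additional rewards in Table 5 might be encoded; the event names are hypothetical, and the "5 × risk" factor is interpreted here as the MADM relative-closeness (threat) score of the piece involved, an assumption about a detail the table leaves implicit.

```python
def shaped_reward(event, threat_of_piece=0.0):
    """Additional reward terms corresponding to Table 5.

    event: string key describing what happened in the transition (names are illustrative).
    threat_of_piece: MADM relative-closeness score of the piece involved in a hit,
    used here as the 'risk' factor in the 5 x risk terms (assumption).
    """
    if event == "hit_enemy":
        return 5.0 * threat_of_piece
    if event == "hit_by_enemy":
        return -5.0 * threat_of_piece
    table = {
        "closer_to_control_point":     +0.5,
        "not_closer_to_control_point": -0.3,
        "reached_map_boundary":        -1.0,
        "step":                        -0.005,   # per-step cost against local optima
        "enemy_destroyed":             +10.0,
        "winning_kill":                +20.0,    # destroying the piece that decides the game
        "piece_lost_game_lost":        -10.0,
        "reached_control_point":       +10.0,
        "opponent_wins":               -10.0,
    }
    return table.get(event, 0.0)

# Example: accumulate the shaping terms for one transition.
step_bonus = (shaped_reward("step")
              + shaped_reward("closer_to_control_point")
              + shaped_reward("hit_enemy", threat_of_piece=0.5147))  # 0.5147 ~ Tank 6's closeness in Table 3
```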
6 WARGAME AI SIMULATIONS AND EVALUATIONS

6.1 EXPERIMENT SETTING

Figure 3 shows the starting interface of our simulation, which generates the initial states of the red and blue tanks Sun et al. (2021); Sun et al. (2020). There are two tank pawns on each side, and the centre is the point of contention. In a confrontation, both sides compete for the control point, and the party that reaches the middle red flag first wins. At the same time, red and blue can shoot at each other, and both can hide in urban residential areas; by concealing itself, a piece becomes difficult for the opponent to find. Each hexagon has its own number and elevation, and the higher the elevation, the darker the hexagon. Tanks move faster on the highway than on secondary roads; the red straight line represents the secondary road and the black straight line represents the primary road. The cross symbol represents aiming and shooting, and a destroyed target disappears from the map.

Figure 3: Gaming environment display. The red and blue pawns fight separately, the red flag in the middle is the control point, and the first player to reach the control point wins. Alternatively, when all the wargame agents on one side are destroyed, the opponent wins.

6.2 RESULTS AND ANALYSIS OF THE EXPERIMENT

In this article, the PPO algorithm Schulman et al. (2017) and the PPO algorithm combined with multi-attribute decision making (MADM-PPO) are compared in terms of winning rate. MADM-PPO and PPO are each trained for 24 hours. In the first setting, MADM-PPO controls the red side and a rule-based algorithm controls the blue side; in the second setting, PPO controls the red side and the blue side again plays according to the rules. The winning percentage of both algorithms is then observed over 100 games.

The experiments show that agents using the PPO reinforcement learning algorithm combined with the multi-attribute decision-making method, whose rewards account for the threat posed by the opponent, performed better than agents using the plain PPO algorithm. As can be seen in Figure 4 and Figure 5, the proposed multi-attribute decision-making method combined with the PPO reinforcement learning algorithm effectively improves the quality of intelligent wargame decision making. The winning rates are shown in Figure 4 and Figure 5.

Figure 4: (a) Win rate: the red side is the MADM-PPO intelligent algorithm AI and the blue side is the rule-based AI; (b) Win times: the red side is the MADM-PPO intelligent algorithm AI and the blue side is the rule-based AI. The curves show the winning rate and the number of wins for the red and blue sides; one side wins the first round, so one curve starts from 1 and the other from 0.

Figure 5: (a) Win rate: the red side is the PPO intelligent algorithm AI and the blue side is the rule-based AI; (b) Win times: the red side is the PPO intelligent algorithm AI and the blue side is the rule-based AI. The curves show the winning rate and the number of wins for the red and blue sides; one side wins the first round, so one curve starts from 1 and the other from 0.

The experimental results show that the MADM-PPO model reduces the amount of exploration needed during training and alleviates the long training time of the PPO algorithm. The introduction of prior knowledge thus improves the performance of the PPO algorithm and has theoretical significance for improving the efficiency of the algorithm; the detailed scores are shown in Figure 6.

Figure 6: (a) The get-goal score of both sides (Red: PPO); (b) the kill score of both sides (Red: PPO); (c) the survive score of both sides (Red: PPO); (d) the get-goal score of both sides (Red: MADM-PPO); (e) the kill score of both sides (Red: MADM-PPO); (f) the survive score of both sides (Red: MADM-PPO). The x-axis is the training episode and the y-axis is the score; red and blue represent the two teams in the wargame environment.

7 CONCLUSION

We have designed an intelligent wargaming AI that combines multi-attribute decision making and reinforcement learning to improve both the convergence speed of the online training process and the winning rate of the wargame AI. As part of this study, we conducted experiments on the combined multi-attribute decision making and reinforcement learning algorithm in a wargame simulation environment and obtained red-blue confrontation data from the wargame environment. The weight of each attribute is calculated from intuitionistic fuzzy number weight calculations, and the threat posed by each of the opponent's game agents is then determined. On the basis of this threat degree, the red side's reinforcement learning reward function is constructed, the AC framework is trained with this reward function, and an algorithm combining multi-attribute decision making with reinforcement learning is obtained. The experiments demonstrate that the algorithm can steadily increase the agent's reward while exploring the environment over a short training period, and the final winning rate of the agent against specific rule-based strategies reaches 78%, significantly higher than the 62% achieved by a pure reinforcement learning algorithm. This resolves the convergence difficulties caused by sparse rewards in the wargame's large state space together with the random initialization of the agent's neural network. For the algorithm design of intelligent wargaming, this is the first work in this field to combine the multi-attribute decision-making method from management science with a reinforcement learning algorithm from cybernetics. Such an interdisciplinary approach to cross-innovation could lead to improvements in the design of intelligent wargames and even in reinforcement learning algorithms themselves.
Future research can build on this paper in several directions, including the introduction of new multi-attribute decision-making methods from management science and their fusion with a range of reinforcement learning algorithms such as SAC, MADDPG, and DDQN, which could yield more capable and more stable fusion algorithms.

REFERENCES

Nicolas A Barriga, Marius Stanescu, and Michael Buro. Combining strategic learning with tactical search in real-time strategy games. In Thirteenth Artificial Intelligence and Interactive Digital Entertainment Conference, 2017.

Nicolas A Barriga, Marius Stanescu, Felipe Besoain, and Michael Buro. Improving RTS game AI by supervised policy learning, tactical search, and deep reinforcement learning. IEEE Computational Intelligence Magazine, 14(3):8–18, 2019.

Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.

Michael E O'Hanlon. Gaming and modeling combat. In Defense 101, pp. 85–133. Cornell University Press, 2021.

Zhen-Jia Pang, Ruo-Ze Liu, Zhou-Yu Meng, Yi Zhang, Yang Yu, and Tong Lu. On reinforcement learning for full-length game of StarCraft. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 4691–4698, 2019.

Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.

Yuxiang Sun, Bo Yuan, Tao Zhang, Bojian Tang, Wanwen Zheng, and Xianzhong Zhou. Research and implementation of intelligent decision based on a priori knowledge and DQN algorithms in wargame environment. Electronics, 9(10):1668, 2020.

Yuxiang Sun, Bo Yuan, Yongliang Zhang, Wanwen Zheng, Qingfeng Xia, Bojian Tang, and Xianzhong Zhou. Research on action strategies and simulations of DRL and MCTS-based intelligent round game. International Journal of Control, Automation and Systems, pp. 1–15, 2021.

Ioannis K Vlachos and George D Sergiadis. Intuitionistic fuzzy information: applications to pattern recognition. Pattern Recognition Letters, 28(2):197–206, 2007.
Deheng Ye, Zhao Liu, Mingfei Sun, Bei Shi, Peilin Zhao, Hao Wu, Hongsheng Yu, Shaojie Yang, Xipeng Wu, Qingwei Guo, et al. Mastering complex control in MOBA games with deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 6672–6679, 2020.