# RESEARCH ON FUSION ALGORITHM OF MULTI-ATTRIBUTE DECISION MAKING AND REINFORCEMENT LEARNING BASED ON INTUITIONISTIC FUZZY NUMBER IN WARGAME ENVIRONMENT
**Anonymous authors**
Paper under double-blind review
ABSTRACT
Intelligent games have attracted increasing interest within the artificial intelligence research community. This article proposes an algorithm that combines multi-attribute decision making with reinforcement learning and applies their joint effect to wargaming; it addresses the agent's low winning rate against specific rule-based opponents and its inability to converge quickly during intelligent wargame training. We study the combined multi-attribute decision making and reinforcement learning algorithm in a wargame simulation environment and obtain confrontation data between the red and blue sides. The weight of each attribute is calculated with intuitionistic fuzzy number weight calculations, and the threat posed by each of the opponent's game agents is then determined. From these threat degrees, a reward function for the red side is constructed, the actor-critic (AC) framework is trained on this reward function, and an algorithm combining multi-attribute decision making with reinforcement learning is obtained. A simulation experiment confirms that the algorithm of multi-attribute decision making combined with reinforcement learning presented in this paper is significantly more intelligent than the pure reinforcement learning algorithm. By addressing the shortcomings of the agent's neural network, together with the sparse rewards of large-map combat games, this robust algorithm effectively reduces the difficulty of convergence. It is also the first time in this field that an algorithm design for intelligent wargaming combines multi-attribute decision making with reinforcement learning. Finally, another novelty of this research is its interdisciplinary nature, spanning the design of intelligent wargames and the improvement of reinforcement learning algorithms.
1 INTRODUCTION
Artificial intelligence (AI) and machine learning (ML) are becoming increasingly popular in real-world applications. For example, AlphaGo attracted huge attention in the research community and in society by showing that AI can defeat professional human players in the board game Go, and AlphaStar, another strong AI program, achieved great success in the human-machine confrontation game StarCraft Pang et al. (2019); Silver et al. (2016). In real-time strategy (RTS) games, AI-driven methods are widely studied and integrated into game AI design to increase the intelligence of computer opponents and generate a more realistic confrontation gaming experience. In the game King of Glory, Ye et al. used an improved PPO algorithm to train the game AI, with positive results Ye et al. (2020). Using reinforcement learning techniques, Silver et al. developed a training framework that requires no human knowledge other than the rules of the game, allowing AlphaGo to train itself and achieve a high level of intelligence in the process Silver et al. (2017). Using deep reinforcement learning and supervised policy learning, Barriga et al. improved the AI performance of RTS games and defeated the built-in game AI Barriga et al. (2019). AI has become a hot research topic in recent years, with a wide variety of applications such as deduction and analysis Schrittwieser et al. (2020); Barriga et al. (2017); O'Hanlon (2021). However, there is still limited research addressing the problem of slow convergence during the AI training process under a variety of conditions, especially in human-AI confrontation games.
An index measures the value of an object or a parameter of an evaluation system; it is the scale of the effectiveness of an object with respect to the evaluating subject. As an attribute value, it expresses subjective judgements or objective facts in numbers or words. It is therefore important to select scientifically valid target threat assessment (TA) indices and to evaluate them scientifically. Target threat assessment contributes to decision making in current intelligent wargames. Mainstream intelligent decision making for games is based mainly on rules, decision trees, reinforcement learning, and related technologies, and rarely incorporates multi-attribute decision-making theory and methods. In this paper, actual wargame data obtained from a wargame environment are presented, and the multi-attribute threat assessment indicators are transformed into a unified representation. Using three expression forms, namely real numbers, interval numbers, and intuitionistic fuzzy numbers, multi-attribute decision-making theory and methods are used to analyse the target threat degree. An enhanced reward function based on the resulting threat degrees is then established to train a more effective intelligent decision-making model. To the best of our knowledge, this is the first work that combines multi-attribute decision making with reinforcement learning to produce a high-performance game AI in a wargame experiment.
2 WARGAMING MULTIPLE ATTRIBUTE INDEX THREAT QUANTIFICATION
Obtaining scientific evaluation results requires a reasonable quantification of the indicators. Target threat assessment is an important aspect of decision-making assistance in wargames, and the evaluation result directly affects the effectiveness of the wargame AI. This section introduces threat quantification methods for the different types of indicators. Taking the target type into account, the overall threat is divided into target distance threat, target attack threat, target speed threat, terrain visibility threat, environmental indicator threat, and target defence value. The acquired confrontation data are mapped to the different indicator types, and the corresponding comprehensive threat value is then calculated. Table 1 lists the attributes and meanings of the specific indicators.
Table 1: A list of indicator attributes and their meanings

| Indicator | Attribute | Meaning |
|---|---|---|
| Target distance threat | Cost type | The distance between the two parties influences the kill probability. |
| Target attack threat | Benefit type | The threat degree is determined by the type, range, and lethality of the opponent's weapon. |
| Target speed threat | Benefit type | The threat of speed from our opponents. |
| Terrain visibility threat | Intervisibility > no intervisibility | Whether or not the terrain is intervisible directly affects the threat. |
| Environmental indicator threat | Benefit type | An environment that favours the opponent's concealment and mobility is more dangerous. |
| Target defence value | Cost type | The stronger the opponent's armour, the harder it is to destroy. |
3 ESTABLISHMENT OF A MULTI-ATTRIBUTE QUANTITATIVE THREAT MODEL
BASED ON INTUITIONISTIC FUZZY NUMBERS
Terrain intervisibility is represented with interval numbers in our framework, and different threats are generated accordingly, whereas the quantified values of the other threat indicators are real numbers. To unify the solution procedure, the algorithm converts all interval numbers and real numbers into intuitionistic fuzzy numbers and computes the threat degree directly on the intuitionistic fuzzy numbers.
(1) Intuitionistic fuzzy entropy describes the degree of fuzziness of the judgement information provided by an intuitionistic fuzzy set. The larger the intuitionistic fuzzy entropy of an evaluation criterion, the smaller its weight should be, and vice versa. Based on the formulas in Vlachos & Sergiadis (2007), we calculate the entropy weight of each intuitionistic fuzzy attribute. Here, the ideal solution $S_i^+$ is a conceived optimal solution (scheme) whose attribute values take the best value among the alternatives, and the negative ideal solution $S_i^-$ is a conceived worst solution (scheme) whose attribute values take the worst value among the alternatives. The relative closeness $p_i$ is obtained by comparing each alternative with the ideal and negative ideal solutions: the alternative that is closest to the ideal solution and at the same time farthest from the negative ideal solution is the best solution among the alternatives.
$$H_j = -\frac{1}{n\ln 2}\sum_{i=1}^{n}\left[\mu_{ij}\ln\mu_{ij} + \nu_{ij}\ln\nu_{ij} - (\mu_{ij}+\nu_{ij})\ln(\mu_{ij}+\nu_{ij}) - (1-\mu_{ij}-\nu_{ij})\ln 2\right] \quad (1)$$

If $\mu_{ij} = 0$ and $\nu_{ij} = 0$, then $\mu_{ij}\ln\mu_{ij} = 0$, $\nu_{ij}\ln\nu_{ij} = 0$, and $(\mu_{ij}+\nu_{ij})\ln(\mu_{ij}+\nu_{ij}) = 0$.

The entropy weight of the $j$-th attribute is defined as:

$$w_j = \frac{1 - H_j}{n - \sum_{j=1}^{n} H_j} \quad (2)$$

where $w_j \geq 0$, $j = 1, 2, \cdots, n$, and $\sum_{j=1}^{n} w_j = 1$.
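To make the entropy-weighting step concrete, the following Python sketch computes Eqs. (1) and (2) for a matrix of membership/non-membership degrees. The array layout (alternatives in rows, attributes in columns) is an assumption of this illustration, and the sketch reads Eq. (1) as averaging over the alternatives while Eq. (2) normalizes over the attributes.

```python
import numpy as np

def entropy_weights(mu, nu):
    """Intuitionistic fuzzy entropy (Eq. 1) and entropy weights (Eq. 2).

    mu, nu: arrays of shape (n_alternatives, n_attributes) with the membership
    and non-membership degrees of each alternative under each attribute.
    """
    mu, nu = np.asarray(mu, float), np.asarray(nu, float)
    n_alt, n_attr = mu.shape

    def xlogx(x):
        # Convention from the text: terms with a zero argument contribute 0.
        return np.where(x > 0, x * np.log(np.where(x > 0, x, 1.0)), 0.0)

    hesitancy = 1.0 - mu - nu
    inner = xlogx(mu) + xlogx(nu) - xlogx(mu + nu) - hesitancy * np.log(2)
    H = -inner.sum(axis=0) / (n_alt * np.log(2))   # entropy of each attribute, Eq. (1)

    w = (1.0 - H) / (n_attr - H.sum())             # normalized entropy weights, Eq. (2)
    return H, w
```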
(2) Determine the optimal solution $A^+$ and the worst solution $A^-$ using the following formulas:

$$A^+ = \left(\langle\mu_1^+, \nu_1^+\rangle, \langle\mu_2^+, \nu_2^+\rangle, \cdots, \langle\mu_n^+, \nu_n^+\rangle\right), \qquad A^- = \left(\langle\mu_1^-, \nu_1^-\rangle, \langle\mu_2^-, \nu_2^-\rangle, \cdots, \langle\mu_n^-, \nu_n^-\rangle\right) \quad (3)$$

where

$$\mu_i^+ = \max_{j=1,2,\ldots,m}\{\mu_{ij}\}, \qquad \nu_i^+ = \min_{j=1,2,\ldots,m}\{\nu_{ij}\} \quad (4)$$

$$\mu_i^- = \min_{j=1,2,\ldots,m}\{\mu_{ij}\}, \qquad \nu_i^- = \max_{j=1,2,\ldots,m}\{\nu_{ij}\} \quad (5)$$
(3) Calculate the similarity between two intuitionistic fuzzy numbers $\langle\mu_1,\nu_1\rangle$ and $\langle\mu_2,\nu_2\rangle$ as follows:

$$s\left(\langle\mu_1,\nu_1\rangle, \langle\mu_2,\nu_2\rangle\right) = 1 - \frac{\pi_1+\pi_2}{2}\cdot\frac{\left|2(\mu_1-\mu_2)-(\nu_1-\nu_2)\right|}{3} - \left(1-\frac{\pi_1+\pi_2}{2}\right)\cdot\frac{\left|2(\nu_1-\nu_2)-(\mu_1-\mu_2)\right|}{3} \quad (6)$$

where $\pi_1 = 1 - \mu_1 - \nu_1$ and $\pi_2 = 1 - \mu_2 - \nu_2$.
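A direct transcription of the similarity measure in Eq. (6) is sketched below. Since the equation is reconstructed from the layout of the original, the exact pairing of the two weighted terms should be treated as an assumption of this sketch.

```python
def ifn_similarity(a, b):
    """Similarity s(a, b) between two intuitionistic fuzzy numbers
    a = (mu1, nu1) and b = (mu2, nu2), following Eq. (6)."""
    (mu1, nu1), (mu2, nu2) = a, b
    pi1, pi2 = 1.0 - mu1 - nu1, 1.0 - mu2 - nu2
    t = (pi1 + pi2) / 2.0                                   # average hesitancy degree
    term1 = abs(2.0 * (mu1 - mu2) - (nu1 - nu2)) / 3.0
    term2 = abs(2.0 * (nu1 - nu2) - (mu1 - mu2)) / 3.0
    return 1.0 - t * term1 - (1.0 - t) * term2
```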
(4) Calculate the similarities $S_i^+$ and $S_i^-$ between each solution and the optimal solution and the worst solution, respectively, based on the following formula:

$$S_i^+ = \sum_{k=1}^{n} w_k \cdot s\left(\langle\mu_k^+, \nu_k^+\rangle, \langle\mu_{ik}, \nu_{ik}\rangle\right), \qquad S_i^- = \sum_{k=1}^{n} w_k \cdot s\left(\langle\mu_k^-, \nu_k^-\rangle, \langle\mu_{ik}, \nu_{ik}\rangle\right) \quad (7)$$
(5) Then calculate the relative closeness:

$$p_i = S_i^+ / \left(S_i^+ + S_i^-\right) \quad (8)$$

(The closeness is defined so that it is consistent with the scores reported in Table 3.)
The relative closeness is used to rank the threat levels of the opponent's targets: the larger $p_i$ is, the higher the assessed threat of the corresponding target.
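Putting steps (2) to (5) together, a minimal sketch of the full ranking procedure could look as follows, reusing `entropy_weights` and `ifn_similarity` from the earlier sketches. It assumes that the quantification step has already expressed every indicator as a threat degree of the same orientation, so that the ideal solution is simply the element-wise maximum membership and minimum non-membership.

```python
import numpy as np

def rank_threats(mu, nu, w):
    """Relative closeness p_i (Eqs. 3-8) for each alternative.

    mu, nu: (n_alternatives, n_attributes) membership / non-membership degrees.
    w: attribute weights from Eq. (2).
    """
    mu, nu, w = np.asarray(mu, float), np.asarray(nu, float), np.asarray(w, float)

    # Optimal and worst solutions, Eqs. (3)-(5).
    mu_pos, nu_pos = mu.max(axis=0), nu.min(axis=0)
    mu_neg, nu_neg = mu.min(axis=0), nu.max(axis=0)

    n_alt, n_attr = mu.shape
    s_pos, s_neg = np.empty(n_alt), np.empty(n_alt)
    for i in range(n_alt):
        # Weighted similarity to the ideal and negative-ideal solutions, Eq. (7).
        s_pos[i] = sum(w[k] * ifn_similarity((mu_pos[k], nu_pos[k]), (mu[i, k], nu[i, k]))
                       for k in range(n_attr))
        s_neg[i] = sum(w[k] * ifn_similarity((mu_neg[k], nu_neg[k]), (mu[i, k], nu[i, k]))
                       for k in range(n_attr))

    return s_pos / (s_pos + s_neg)          # relative closeness p_i, Eq. (8)

# Sorting the alternatives by descending p_i yields a ranking of the form
# T6 > T7 > ... as reported in Table 3.
```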
4 MULTI-ATTRIBUTE THREAT QUANTITATIVE SIMULATION
The threat assessment problem is transformed into a multi-attribute decision-making problem, and the combat intention of the target is incorporated into the evaluation system to make the evaluation more realistic and the results more reliable. The simulation scene includes ten tanks on each side, red and blue, fighting each other, and the ten opposing tanks are treated as game agents in the wargame.
A unified intuitionistic fuzzy number representation is created for all multi-attribute indicators. An example of the intuitionistic fuzzy number representation of the threat assessment indicators is given in Table 2.
Table 2: Information decision table for threat target parameters (intuitionistic fuzzy numbers); each cell is a [membership, non-membership] pair

| Piece | Distance threat | Speed threat | Attack threat | Terrain visibility threat | Environmental threat | Defence value |
|---|---|---|---|---|---|---|
| Tank 1 | [0.0, 0.0] | [0.2, 0.0] | [0.153863899, 0.046136101] | [0.2, 0.0] | [0.2, 0.0] | [0.187378998, 0.012621002] |
| Tank 2 | [0.171440811, 0.028559189] | [0.0, 0.0] | [0.18749387, 0.01250613] | [0.2, 0.0] | [0.2, 0.0] | [0.2, 0.0] |
| Tank 3 | [0.0, 0.0] | [0.2, 0.0] | [0.2, 0.0] | [0.000664452, 0.199335548] | [0.187608882, 0.012391118] | [0.2, 0.0] |
| Tank 4 | [0.0, 0.0] | [0.2, 0.0] | [0.2, 0.0] | [0.2, 0.0] | [0.000399202, 0.199600798] | [0.176663586, 0.023336414] |
| Tank 5 | [0.0, 0.0] | [0.2, 0.0] | [0.2, 0.0] | [0.2, 0.0] | [0.0001998, 0.1998002] | [0.187608882, 0.012391118] |
| Tank 6 | [0.0, 0.0] | [0.2, 0.0] | [0.2, 0.0] | [0.2, 0.0] | [0.2, 0.0] | [0.176663586, 0.023336414] |
| Tank 7 | [0.0, 0.0] | [0.2, 0.0] | [0.2, 0.0] | [0.2, 0.0] | [0.2, 0.0] | [0.176561598, 0.023438402] |
| Tank 8 | [0.0, 0.0] | [0.2, 0.0] | [0.2, 0.0] | [0.171440811, 0.028559189] | [0.000664452, 0.199335548] | [0.199738767, 0.000261233] |
| Tank 9 | [0.0, 0.0] | [6.6644e-05, 0.199933356] | [0.2, 0.0] | [0.186672886, 0.013327114] | [0.0001998, 0.1998002] | [0.2, 0.0] |
| Tank 10 | [0.0, 0.0] | [0.2, 0.0] | [0.2, 0.0] | [0.171440811, 0.028559189] | [0.000285307, 0.199714693] | [0.2, 0.0] |
Table 3: Threat assessment for the targets

| Quantity | Values (Tank 1 to Tank 10) |
|---|---|
| $S_i^+$ | 0.9900131572106283, 0.9930194457658972, 0.9713249517102417, 0.9694274902547305, 0.9712630240082707, 0.9960298049584839, 0.9960124538670997, 0.9685356920167532, 0.9447732710194203, 0.9685296037271114 |
| $S_i^-$ | 0.9451975215527424, 0.9421912329974735, 0.963885727053129, 0.9657831885086402, 0.9639476547551001, 0.9391808738048868, 0.9391982248962711, 0.9666749867466174, 0.9904374077439504, 0.9666810750362593 |
| $P_i$ | 0.5115790069137391, 0.5131324752716081, 0.5019220710020746, 0.5009415775207532, 0.5018900705058751, 0.5146880470889931, 0.5146790810929003, 0.5004807500523212, 0.4882017660336315, 0.500477603991942 |
| Ranking | T6 > T7 > T2 > T1 > T3 > T5 > T4 > T8 > T10 > T9 |
Given the intuitionistic fuzzy representation of the threat assessment indicators shown in Table 2, formulas (7) and (8) can be used to obtain the intuitionistic fuzzy target threat assessment based on the multi-attribute decision-making approach. Table 3 shows the assessment scores used to determine the target threat level, and Table 4 shows the resulting threat ranking of the opposing targets at time T1.
Table 4: Ranking of the opposing targets at time T1

| Type of piece | Comprehensive indicator value | Ranking |
|---|---|---|
| Tank 1 | 0.511579007 | 4 |
| Tank 2 | 0.513132475 | 3 |
| Tank 3 | 0.501922071 | 5 |
| Tank 4 | 0.500941578 | 7 |
| Tank 5 | 0.501890071 | 6 |
| Tank 6 | 0.514688047 | 1 |
| Tank 7 | 0.514679081 | 2 |
| Tank 8 | 0.50048075 | 8 |
| Tank 9 | 0.488201766 | 10 |
| Tank 10 | 0.500477604 | 9 |
Based on the evaluation results, it can be concluded that the blue Tank 6 is the most harmful target and Tank 7 the second most harmful, as shown in Figure 1. This paper does not limit the evaluation to the subjective analysis of experts; it also introduces reinforcement learning, couples the evaluation to the reinforcement learning algorithm through a reward function, and analyses the actual winning rate of the wargame AI.
5 A FUSION MODEL OF REINFORCEMENT LEARNING AND
MULTI-ATTRIBUTE THREAT ANALYSIS
5.1 REINFORCEMENT LEARNING ALGORITHM AND MULTI-ATTRIBUTE MODEL
FORMULATION
The previous sections described how threat levels are quantified by multi-attribute analysis based on the entropy weight method. This section integrates that method with reinforcement learning. Its essence is to establish a multi-attribute decision-making mechanism within the reinforcement learning loop and then select the entity with the highest threat level to establish the return value: the higher the threat level, the greater the return value, as shown in Figure 2.
A reinforcement learning algorithm is built on the actor-critic (AC) framework to achieve intelligent decision-making. It consists of a reinforcement learning pre-training module that integrates multi-attribute decision-making, a critic evaluation network update module, and an old/new strategy network update module. In the pre-training module, multi-attribute decision making mainly uses state data obtained from the wargame environment, such as elevation, distance, and armour thickness, to make multi-attribute decisions. The data are normalized, the threat of each opposing piece is calculated with the entropy method, the reward function is set accordingly and the transition is stored in the experience buffer, and further actions are then taken in the environment to obtain the next state and action rewards, as sketched below.
Figure 1: Threat values of the opponent's ten tanks at time T; the ordinate is the threat value and the abscissa distinguishes the ten tanks by colour.
Figure 2: A fusion model of reinforcement learning and multi-attribute threat estimation based on the AC framework. The model mainly consists of a reinforcement learning pre-training module that integrates multi-attribute decision-making, a critic evaluation network update module, and an old/new strategy network update module.
The critic network computes a value estimate, which is combined with the data in the experience buffer: the estimate is subtracted from the reward obtained in the last action step, and the result is propagated back to update the critic network parameters. The resulting advantage value then guides the actor network: the network outputs the action-probability distributions under the old and the new policy and samples an action from the new network. The advantage is corrected by the ratio of the new and old policy probabilities, the actor loss is calculated, and the actor network is updated by back-propagation.
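Concretely, the critic and actor updates described above correspond to a standard actor-critic update with a clipped old/new probability ratio, in the spirit of the PPO algorithm used in the experiments of Section 6.2. The sketch below is a minimal PyTorch version under our own assumptions about the network interfaces, the advantage estimator (a simple return-minus-value baseline), and the hyperparameters; it is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def ac_update(actor, critic, optimizer, batch, clip_eps=0.2):
    """Minimal clipped actor-critic (PPO-style) update.

    `batch` holds states, actions, log-probs under the old policy, and returns;
    `actor(states)` is assumed to return a torch.distributions object."""
    states, actions, old_log_probs, returns = batch

    values = critic(states).squeeze(-1)
    advantages = returns - values.detach()          # advantage guides the actor update

    dist = actor(states)                            # new policy action distribution
    log_probs = dist.log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)    # new / old policy probability ratio

    actor_loss = -torch.min(ratio * advantages,
                            torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages).mean()
    critic_loss = F.mse_loss(values, returns)       # critic regressed towards the returns

    optimizer.zero_grad()
    (actor_loss + 0.5 * critic_loss).backward()
    optimizer.step()
```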
5.2 SETTING REWARD FUNCTION VALUE
The sparse reward problem is a core challenge when deep reinforcement learning is applied to practical tasks: the training environment provides little supervision for updating the agent's parameters during reinforcement learning Kaelbling et al. (1996). In supervised learning, the training process is supervised by humans, whereas in reinforcement learning rewards supervise the training process and the agent optimizes its strategy based on those rewards. The specific additional rewards are shown in Table 5.
Table 5: Reward settings

| Situation | Reward |
|---|---|
| The current state is closer to the control point than the previous state | Reward + 0.5 |
| The current state is not closer to the control point than the previous state | Reward - 0.3 |
| The map boundary has been reached | Reward - 1 |
| Consumption per step (to avoid falling into a local optimum) | Reward - 0.005 |
| An opposing piece was hit | Reward + (5 * risk of being hit by a piece) |
| Hit by an opposing round | Reward - (5 * risk of being hit by a piece) |
| An opposing piece is annihilated | Reward + 10 |
| Taking out one of the opponent's pieces leads to victory | Reward + 20 |
| An opposing piece is defeated but the game is still lost (other opposing pieces reach the control point) | Reward - 10 |
| The control point is reached | Reward + 10 |
| The opponent wins | Reward - 10 |
When the above additional rewards are added to the training process, the convergence speed can be significantly accelerated, and the likelihood that the agent falls into a local optimum is significantly reduced.
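For reference, the additional reward terms of Table 5 can be collected into a single shaping function. The event names and the `hit_risk` argument (the "risk of being hit by a piece" factor) are illustrative choices for this sketch, not identifiers from the authors' code.

```python
def additional_reward(event, hit_risk=0.0):
    """Shaped reward terms corresponding to Table 5 (illustrative event names)."""
    if event == "hit_enemy_piece":
        return +5.0 * hit_risk
    if event == "hit_by_enemy_piece":
        return -5.0 * hit_risk
    table = {
        "closer_to_control_point":    +0.5,
        "not_closer_to_control_point": -0.3,
        "reached_map_boundary":       -1.0,
        "step_cost":                  -0.005,   # per-step cost to avoid local optima
        "enemy_piece_destroyed":      +10.0,
        "kill_that_decides_victory":  +20.0,
        "loss_that_decides_defeat":   -10.0,
        "reached_control_point":      +10.0,
        "opponent_wins":              -10.0,
    }
    return table.get(event, 0.0)
```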
6 WARGAMES AI SIMULATIONS AND EVALUATIONS
6.1 EXPERIMENT SETTING
Figure 3 shows the starting interface of our simulation, which generates the initial states of the red and blue tanks Sun et al. (2021); Sun et al. (2020). There are two tank pawns on each side, and the centre of the map is the point of contention. In a confrontation, both sides compete for the control point, and the party that reaches the middle red flag first wins. At the same time, the red and blue parties can shoot at each other, and they can hide in urban residential areas; a concealed piece is difficult for the opponent to find. Each hexagon has its own number and elevation, and the higher the elevation, the darker the hexagon. Tanks move faster on the highway than on the secondary roads: the red straight line represents the secondary road and the black straight line represents the primary road. The cross symbol represents aiming and shooting, and a destroyed target disappears from the map.
Figure 3: Gaming environment display. The red and blue pawns fight separately, the red flag in the
middle is the control point, and the first player to reach the control point wins. Alternatively, when
all the wargame agents on one side are destroyed, the opponent wins.
6.2 RESULTS AND ANALYSIS OF THE EXPERIMENT
In this article, the PPO algorithm Schulman et al. (2017) and the PPO algorithm combined with
multi-attribute decision-making are used to compare and analyse the winning rate. MADM-PPO
and PPO are trained for 24 hours, and this article uses the MADM-PPO algorithm as the red side
and the rule-based blue side algorithm to fight. At the same time, the second round uses the PPO
algorithm as the red side, and the blue side fights according to rules. Next, this article observes
the winning percentage of both algorithms in 100 games. Experiments have shown that the agents
using the PPO reinforcement learning algorithm combined with the multi-attribute decision-making
method performed better than the agents using the PPO algorithm based on the threat of the opponent. As can be seen in the Figure 4 and Figure 5, our proposed multi-attribute decision-making
method, combined with PPO algorithm of reinforcement learning, proves to effectively improve the
effectiveness of intelligent wargame decision-making. A winning rate chart is presented in the Table
6, and Table 7.
Figure 4: (a) Win rate: the red side is the MADM-PPO intelligent algorithm AI and the blue side is the rule-based AI; (b) win times: the red side is the MADM-PPO intelligent algorithm AI and the blue side is the rule-based AI. The plots show the winning rate and the number of wins for the red and blue sides; the first round is won by one side, so one curve starts from 1 and the other from 0.
Figure 5: (a) Win rate: the red side is the PPO intelligent algorithm AI and the blue side is the rule-based AI; (b) win times: the red side is the PPO intelligent algorithm AI and the blue side is the rule-based AI. The plots show the winning rate and the number of wins for the red and blue sides; the first round is won by one side, so one curve starts from 1 and the other from 0.
The experimental results show that the MADM-PPO model reduces the amount of exploration needed during training and alleviates the excessively long training time of the plain PPO algorithm. This indicates that the introduction of prior knowledge improves the performance of the PPO algorithm and has theoretical significance for improving the efficiency of the algorithm; the detailed scores are shown in Figure 6.
7 CONCLUSION
We have designed an intelligent wargaming AI that combines multi-attribute decision making and reinforcement learning to improve both the convergence speed of the online training process and the winning rate of the wargaming AI. As part of this study, we conduct experiments on the combined multi-attribute decision making and reinforcement learning algorithm in a wargame simulation environment and obtain red-versus-blue confrontation data from the wargame environment. The weight of each attribute is calculated with intuitionistic fuzzy number weight calculations, and the threat posed by each of the opponent's game agents is then determined. On the basis of the threat degree, the red side's reinforcement learning reward function is constructed, the AC framework is trained with this reward function, and an algorithm combining multi-attribute decision making with reinforcement learning is obtained. The study demonstrates that the algorithm can gradually increase the reward value of the agent while exploring the environment over a short training period, and the final victory rate of the agent against specific rules and strategies reaches 78%, which is significantly higher than the 62% achieved by the pure reinforcement learning algorithm. The approach resolves the convergence difficulties caused by the random initialization of the agent's neural network and the sparse rewards of the large state-space wargame. For the algorithm design of intelligent wargaming, this is the first research in this field to combine the multi-attribute decision-making method from management science with a reinforcement learning algorithm from cybernetics. Such interdisciplinary cross-innovation could lead to improvements in the design of intelligent wargames and even improvements in reinforcement learning algorithms. Future research can build on this paper in several directions, including introducing new methods from multi-attribute decision making in management science and fusing them with a series of reinforcement learning algorithms such as SAC, MADDPG, and DDQN, in order to develop more, better, and more stable fusion algorithms.
Figure 6: (a) The get goal score of both sides (Red: PPO); (b) the kill score of both sides (Red: PPO); (c) the survive score of both sides (Red: PPO); (d) the get goal score of both sides (Red: MADM-PPO); (e) the kill score of both sides (Red: MADM-PPO); (f) the survive score of both sides (Red: MADM-PPO). The x-axis is the training episode and the y-axis is the score; red and blue represent the two teams in the wargame environment.

REFERENCES

Nicolas A Barriga, Marius Stanescu, and Michael Buro. Combining strategic learning with tactical search in real-time strategy games. In Thirteenth Artificial Intelligence and Interactive Digital Entertainment Conference, 2017.
Nicolas A Barriga, Marius Stanescu, Felipe Besoain, and Michael Buro. Improving RTS game AI by supervised policy learning, tactical search, and deep reinforcement learning. IEEE Computational Intelligence Magazine, 14(3):8–18, 2019.

Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.

Michael E O'Hanlon. Gaming and modeling combat. In Defense 101, pp. 85–133. Cornell University Press, 2021.

Zhen-Jia Pang, Ruo-Ze Liu, Zhou-Yu Meng, Yi Zhang, Yang Yu, and Tong Lu. On reinforcement learning for full-length game of StarCraft. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 4691–4698, 2019.

Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.

Yuxiang Sun, Bo Yuan, Tao Zhang, Bojian Tang, Wanwen Zheng, and Xianzhong Zhou. Research and implementation of intelligent decision based on a priori knowledge and DQN algorithms in wargame environment. Electronics, 9(10):1668, 2020.

Yuxiang Sun, Bo Yuan, Yongliang Zhang, Wanwen Zheng, Qingfeng Xia, Bojian Tang, and Xianzhong Zhou. Research on action strategies and simulations of DRL and MCTS-based intelligent round game. International Journal of Control, Automation and Systems, pp. 1–15, 2021.

Ioannis K Vlachos and George D Sergiadis. Intuitionistic fuzzy information–applications to pattern recognition. Pattern Recognition Letters, 28(2):197–206, 2007.

Deheng Ye, Zhao Liu, Mingfei Sun, Bei Shi, Peilin Zhao, Hao Wu, Hongsheng Yu, Shaojie Yang, Xipeng Wu, Qingwei Guo, et al. Mastering complex control in MOBA games with deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 6672–6679, 2020.