CN112131786A - Target detection and distribution method and device based on multi-agent reinforcement learning - Google Patents
- Fri Dec 25 2020
Info
-
Publication number
- CN112131786A (application CN202010959038.7A) Authority
- CN
- China Prior art keywords
- model
- reinforcement learning
- combat
- behavior model
- agent Prior art date
- 2020-09-14 Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2119/00—Details relating to the type or aim of the analysis or the optimisation
- G06F2119/14—Force analysis or force optimisation, e.g. static or dynamic forces
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Computer Hardware Design (AREA)
- Geometry (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention relates to a target detection and distribution method and device based on multi-agent reinforcement learning, which comprises: constructing a combat behavior model and a reinforcement learning training environment; training the combat behavior model with the reinforcement learning training environment until the model converges to obtain an artificial intelligence behavior model; and training the artificial intelligence behavior model with a combat simulation engine and outputting an optimization model. The invention integrates the reinforcement learning algorithm MADDPG into the wargame deduction system and constructs simulation environments from simple to complex, thereby improving the convergence speed of reinforcement learning and effectively solving the problem of slow optimization convergence of agents in the wargame deduction system.
Description
Technical Field
The invention belongs to the technical field of modeling and simulation, and particularly relates to a target detection and distribution method and device based on multi-agent reinforcement learning.
Background
With the development of artificial intelligence, the era in which tactical research and military planning relied on manpower alone is gradually receding. In the past, when computers were applied to wargame deduction simulation, differential equations and combat theory were used to simulate the course of war effectively, which greatly improved the operational level of the military. Today, the application of artificial intelligence to wargame deduction will play an even more important role. Multi-agent modeling offers a capability for describing complex systems and for modeling behavior in dynamic environments that is superior to traditional modeling methods, and the emergence of multi-agent systems provides a new platform for further extending wargame deduction systems.
In the process of wargame simulation and deduction, an experienced commander can judge and predict the combat tasks being executed by the enemy from information such as the enemy's state, combat capability, and rules of engagement. With the continuous development and improvement of wargame systems, the simulated combat missions face many new changes. First, the number of combat units has increased sharply; a commander who analyzes and identifies the combat task of each target one by one faces a heavy workload and finds it difficult to grasp the battlefield situation comprehensively and accurately. Second, the continuous development of information technology keeps accelerating the evolution of the battlefield situation; relying solely on manual identification of enemy aerial tasks seriously lengthens the response time to the enemy and reduces combat efficiency. Finally, massive battlefield data is often incomplete, untimely, inaccurate, and even deceptive, making it difficult for the commander to extract the key situation hidden within it. This series of profound changes increases the difficulty of aerial task recognition, and traditional methods relying on manual identification can hardly adapt to a battlefield situation of high complexity and rapid change. Research on intelligent battlefield task recognition methods therefore frees the commander from multi-source, complex, heterogeneous battlefield data and lets more energy be devoted to command and decision-making, which is a major trend in the development of future intelligent wargame systems.
With the continuous development of multi-agent reinforcement learning, a reinforcement learning agent has the capabilities of autonomous learning, distributed coordination, and organization; by cooperating with other agents it updates its own state information and plans its own behavior, so that the goal is finally completed efficiently. A multi-agent system can not only fully replace a single agent in completing the goal, but can also achieve efficiency beyond that of a single agent, reflecting the principle that many hands make light work. Getting multiple agents to form a team and cooperate like a single person is a new topic. Deep reinforcement learning typically uses an asynchronous framework to train multiple agents, in which each agent is independent of the others; such a framework is unsuitable when the agents have different divisions of labor. In existing multi-agent algorithms the interaction between agents is fully connected, which increases the complexity of the algorithm and makes it harder to apply in practice, so the optimization convergence speed of the combat behavior model in the wargame deduction system is low.
Disclosure of Invention
In view of the above, the present invention provides a target detection and distribution method and device based on multi-agent reinforcement learning, to solve the prior-art problem of slow optimization convergence of the combat behavior model in the wargame deduction system.
In order to achieve the purpose, the invention adopts the following technical scheme: a target detection and distribution method based on multi-agent reinforcement learning comprises the following steps:
constructing a combat behavior model and a reinforcement learning training environment;
training the combat behavior model to model convergence by adopting a reinforcement learning training environment to obtain an artificial intelligence behavior model;
and training the artificial intelligence behavior model by adopting a combat simulation engine, and outputting an optimization model.
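As an illustration only, the three steps above can be sketched as the following control flow (a minimal Python sketch; the function and object names are hypothetical and are not part of the claimed method):

```python
from typing import Any, Callable, Tuple

def run_pipeline(build: Callable[[], Tuple[Any, Any]],
                 pretrain: Callable[[Any, Any], Any],
                 compensate: Callable[[Any, Any], Any],
                 engine: Any) -> Any:
    """Sketch of the claimed three-step flow."""
    behavior_model, training_env = build()              # step 1: build model and RL training environment
    ai_model = pretrain(behavior_model, training_env)   # step 2: train to convergence in the training environment
    optimization_model = compensate(ai_model, engine)   # step 3: compensation training in the combat simulation engine
    return optimization_model
```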
Further, constructing a reinforcement learning training environment, comprising:
and mapping the combat simulation engine and the reinforcement learning training environment by adopting the MADDPG algorithm.
Further, the mapping of the combat simulation engine and the reinforcement learning training environment by using the MADDPG algorithm includes:
mapping a combat behavior model in the combat simulation engine to a plurality of agents in the reinforcement learning training environment, the agents serving as training objects;
mapping a perception model in the combat simulation engine to a perception agent module in the reinforcement learning training environment, wherein the perception agent module is used for acquiring the current battlefield situation;
mapping a decision model in the combat simulation engine to a decision agent module in the reinforcement learning training environment, wherein the decision agent module is used for selecting an action to be executed according to the current battlefield situation;
mapping an action model in the combat simulation engine to an action agent module in the reinforcement learning training environment for executing the selected action;
and mapping a memory model in the combat simulation engine to a memory agent module in the reinforcement learning training environment, wherein the memory agent module is used for storing battlefield situations.
Further, the training of the combat behavior model to model convergence by adopting the reinforcement learning training environment to obtain the artificial intelligence behavior model includes:
initializing an agent;
the sensing agent module acquires environmental information, determines the current battlefield situation and stores the situation in the memory agent module;
the decision agent module selects an action to be executed according to the current battlefield situation;
the action agent module executes the selected action;
the reinforcement learning training environment feeds a battlefield environment back to the intelligent agent for optimization according to action results;
and judging whether the intelligent agent converges or not, and outputting an artificial intelligence behavior model after the intelligent agent converges.
Further, the training of the artificial intelligence behavior model by adopting a combat simulation engine and the output of an optimization model comprise:
initializing an artificial intelligence behavior model;
the sensing model acquires environmental information, determines the current battlefield situation and stores the situation in the memory model;
the decision model selects an action to be executed according to the current battlefield situation;
the action model performs the selected action;
the combat simulation engine feeds the battlefield environment back to the artificial intelligence behavior model for optimization according to the action result;
and judging whether the artificial intelligence behavior model converges or not, and outputting the optimization model after the model converges.
Further, before determining whether to converge, the method further includes:
judging whether a preset training end time is reached;
and if the training end time is reached, ending and exiting, otherwise, continuing the training.
Further, the reinforcement learning training environment utilizes the MADDPG algorithm to run the combat behavior model in a distributed manner and train it in a centralized manner.
Further, the number of the agents is 3.
Furthermore, the combat behavior model adopts a multi-agent artificial neural network.
The embodiment of the application provides a target detection and distribution device based on multi-agent reinforcement learning, including:
the construction module is used for constructing a combat behavior model and a reinforcement learning training environment;
the obtaining module is used for training the combat behavior model to model convergence by adopting a reinforcement learning training environment, and obtaining an artificial intelligence behavior model;
and the output module is used for training the artificial intelligence behavior model by adopting a combat simulation engine and outputting an optimization model.
By adopting the technical scheme, the invention can achieve the following beneficial effects:
the invention provides a target detection and distribution method and device based on multi-agent reinforcement learning, which comprises the steps of constructing a combat behavior model and a reinforcement learning training environment; training the combat behavior model to model convergence by adopting a reinforcement learning training environment to obtain an artificial intelligence behavior model; and training the artificial intelligent behavior model by adopting a combat simulation engine, and outputting an optimization model. The invention integrates the reinforcement learning algorithm MADDPG into the war game deduction system, constructs a simple to complex simulation environment, optimizes the reinforcement learning convergence speed, and effectively solves the problem of optimizing the convergence speed by an intelligent body in the war game deduction system.
The invention applies the MADDPG idea to the military simulation field, so that each combat unit becomes an independent and mutually cooperative agent. An agent leaves its own pheromone after acting, and over time the multiple agents learn how to reinforce good pheromones and attenuate poor ones. In this way, by increasing the interaction between agents, each agent optimizes its own strategy, and even if the environment changes, the agents can still achieve the goal well according to the learned strategies.
The invention applies the MADDPG idea to the military simulation field so that each combat unit becomes an independent and mutually cooperative agent. To address the convergence speed problem of multiple agents trained with the MADDPG algorithm, the invention builds on the MPE (multi-agent particle environment) developed by OpenAI, removing most of the mathematical computation of the combat models while retaining most of the wargame simulation and deduction functions of the engine. After each step, the experience learned by the agents is inherited by the wargame simulation and deduction system, which then trains again, so that the problem of slow optimization convergence of the agents is effectively solved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a schematic diagram illustrating the steps of a multi-agent reinforcement learning-based target detection and distribution method according to the present invention;
FIG. 2 is a schematic view of a combat simulation scenario according to the present invention;
FIG. 3 is a diagram of the MADDPG algorithm structure of the present invention;
FIG. 4 is a schematic structural diagram of a target detection and distribution device based on multi-agent reinforcement learning according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described in detail below. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the examples given herein without any inventive step, are within the scope of the present invention.
The invention is inspired by the MADDPG (Multi-Agent Deep Deterministic Policy Gradient) multi-agent algorithm, which makes a series of improvements to the deterministic policy gradient algorithm so that it can handle complex multi-agent scenarios that traditional algorithms cannot. The MADDPG algorithm has the following three characteristics:
1. The learned optimal policy can give the optimal action using only local information at execution time.
2. It requires neither a dynamics model of the environment nor any special communication structure.
3. The algorithm can be used not only in cooperative environments but also in competitive environments.
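Characteristic 1 corresponds to the "centralized training, decentralized execution" structure of MADDPG: each agent's actor uses only its local observation, while a centralized critic used during training sees all agents' observations and actions. A minimal sketch of this structure is given below (PyTorch is used purely for illustration; the network sizes and layer choices are assumptions, not taken from the patent):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Decentralized policy: at execution time it sees only its own local observation."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh())

    def forward(self, obs):
        return self.net(obs)

class CentralizedCritic(nn.Module):
    """Centralized critic used only during training: sees all agents' observations and actions."""
    def __init__(self, n_agents: int, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        joint_dim = n_agents * (obs_dim + act_dim)
        self.net = nn.Sequential(
            nn.Linear(joint_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, all_obs, all_acts):
        # all_obs: (batch, n_agents * obs_dim), all_acts: (batch, n_agents * act_dim)
        return self.net(torch.cat([all_obs, all_acts], dim=-1))

# Execution needs only the local observation:
actor = Actor(obs_dim=10, act_dim=2)
action = actor(torch.randn(1, 10))
```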
The invention applies the MADDPG idea to the military simulation field, so that each combat unit becomes an independent and mutually cooperative agent. An agent leaves its own pheromone after acting, and over time the multiple agents learn how to reinforce good pheromones and attenuate poor ones. In this way, by increasing the interaction between agents, each agent optimizes its own strategy, and even if the environment changes, the agents can still achieve the goal well according to the learned strategies.
The following describes the target detection and distribution method and device based on multi-agent reinforcement learning provided in the embodiments of the present application with reference to the accompanying drawings.
As shown in fig. 1, an object detection and distribution method based on multi-agent reinforcement learning provided in the embodiment of the present application includes:
s101, constructing a combat behavior model and a reinforcement learning training environment;
as shown in fig. 2, for the convenience of study, the following settings are made for the battle behavior model:
(1) the target and the target do not overlap;
(2) the target and the radar detection range are not overlapped;
(3) supposing that the unmanned aerial vehicle cluster is fixed at a stable flying height, the measurement accuracy and the ground resolution of the magnetic detector can be ensured.
A group of unmanned aerial vehicles detects dynamic and static targets in a large-scale unstructured environment with obstacles, and the quality of this process is measured by a suitable objective function; the perception radius of the detector changes in real time with the environment and the targets. Depending on the task objective, the overall quality is optimized here by minimizing the expected time required to find a given static target, or by maximizing the average number of dynamic targets found within a given search time.
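As an illustration only, the two optimization criteria just described could be written as follows (the symbols π, T_find, N_dyn, and T_search are introduced here purely for notation and do not appear in the patent):

```latex
\min_{\pi}\ \mathbb{E}\left[T_{\mathrm{find}}\right]
\quad\text{(static targets: expected time to find a given target)}
\qquad\text{or}\qquad
\max_{\pi}\ \mathbb{E}\left[N_{\mathrm{dyn}}(T_{\mathrm{search}})\right]
\quad\text{(dynamic targets: average number found within the search time } T_{\mathrm{search}}\text{)}
```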
The combat behavior model corresponds to a multi-agent artificial neural network; it is the core that generates intelligence and the object of reinforcement learning training.
S102, training the combat behavior model by adopting a reinforcement learning training environment until the model converges to obtain an artificial intelligence behavior model;
the reinforcement learning training environment is an environment set according to a combat simulation engine, such as: the combat simulation engine is a large environment, and the reinforcement learning training environment is a small environment made by extracting necessary factors in the large environment. And mapping the combat behavior model into multiple intelligent agents, and training the multiple intelligent agents to obtain an optimized artificial intelligence behavior model. In the present application, 3 agents are used, and it is understood that 4, 5, and 6 agents may also be used, which is not limited herein.
And S103, training the artificial intelligence behavior model by adopting a combat simulation engine, and outputting an optimization model.
The artificial intelligence behavior model obtained from the small-environment pre-training is then placed into the large environment for compensation training, yielding the optimization model. In this way, simulation environments from simple to complex are constructed and the convergence speed of reinforcement learning is improved.
The working principle of the target detection and distribution method based on multi-agent reinforcement learning is as follows: construct a combat behavior model and a reinforcement learning training environment; train the combat behavior model with the reinforcement learning training environment until the model converges to obtain an artificial intelligence behavior model; and train the artificial intelligence behavior model with the combat simulation engine and output the optimization model. Training is divided into two stages, small-environment pre-training and large-environment compensation training, which improves the adaptability of the artificial intelligence behavior model.
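A minimal sketch of this two-stage scheme is shown below, under the assumption that the "learned experience" inherited between the stages is the agents' network weights; the helper `train_one_step`, the argument layout, and the file path are all hypothetical:

```python
import torch

def pretrain_in_small_env(agents, small_env, train_one_step, steps, path="pretrained_agents.pt"):
    """Stage 1: pre-train the agents in the small environment, then export the learned weights
    so that the wargame deduction system can inherit them."""
    for _ in range(steps):
        train_one_step(agents, small_env)                  # perceive -> decide -> act -> feedback
    torch.save([agent.state_dict() for agent in agents], path)
    return path

def compensation_train_in_engine(agents, engine_env, train_one_step, steps, path="pretrained_agents.pt"):
    """Stage 2: load the inherited weights and continue training inside the combat simulation engine."""
    for agent, weights in zip(agents, torch.load(path)):
        agent.load_state_dict(weights)
    for _ in range(steps):
        train_one_step(agents, engine_env)
    return agents
```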
In some embodiments, a combat simulation engine is mapped to a reinforcement learning training environment using the MADDPG algorithm.
Preferably, the mapping of the combat simulation engine and the reinforcement learning training environment by using the MADDPG algorithm includes:
mapping a combat behavior model in the combat simulation engine to a plurality of agents in the reinforcement learning training environment, the agents serving as training objects;
mapping a perception model in the combat simulation engine to a perception agent module in the reinforcement learning training environment, wherein the perception agent module is used for acquiring the current battlefield situation;
mapping a decision model in the combat simulation engine to a decision agent module in the reinforcement learning training environment, wherein the decision agent module is used for selecting an action to be executed according to the current battlefield situation;
mapping an action model in the combat simulation engine to an action agent module in the reinforcement learning training environment for executing the selected action;
and mapping a memory model in the combat simulation engine to a memory agent module in the reinforcement learning training environment, wherein the memory agent module is used for storing battlefield situations.
In this application, because the reinforcement learning training environment differs from the combat simulation engine in runtime environment and programming language and is difficult to integrate directly, the combat simulation engine is mapped to the reinforcement learning training environment, and the state-of-the-art multi-agent reinforcement learning algorithm MADDPG is adopted; the algorithm framework is shown in figure 3. Because the data volume and computational load of a real simulation engine are huge, a simple battlefield environment, called the wargame-deduction small environment, is first built externally on the basis of OpenAI's MPE; this small environment can provide simple geographic information data and can also generate simple deduction process data.
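One way to picture this mapping is the following sketch, in which each engine model is mirrored by a corresponding agent module on the training-environment side; all class names and interfaces here are illustrative assumptions rather than the actual engine API:

```python
import numpy as np

class PerceptionAgent:
    """Mirrors the engine's perception model: acquires the current battlefield situation."""
    def observe(self, engine_state):
        return np.asarray(engine_state, dtype=np.float32)

class MemoryAgent:
    """Mirrors the engine's memory model: stores battlefield situations."""
    def __init__(self):
        self.situations = []
    def store(self, situation):
        self.situations.append(situation)

class DecisionAgent:
    """Mirrors the engine's decision model: selects an action from the current situation."""
    def __init__(self, policy):
        self.policy = policy                     # e.g. the agent's actor network
    def decide(self, situation):
        return self.policy(situation)

class ActionAgent:
    """Mirrors the engine's action model: executes the selected action in the environment."""
    def execute(self, env, action):
        return env.step(action)                  # returns (observation, reward, done, info)
```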
In some embodiments, the training the combat behavior model to model convergence by using a reinforcement learning training environment to obtain an artificial intelligence behavior model includes:
initializing an agent;
the sensing agent module acquires environmental information, determines the current battlefield situation and stores the situation in the memory agent module;
the decision agent module selects an action to be executed according to the current battlefield situation;
the action agent module executes the selected action;
the reinforcement learning training environment feeds a battlefield environment back to the intelligent agent for optimization according to action results;
and judging whether the intelligent agent converges or not, and outputting an artificial intelligence behavior model after the intelligent agent converges.
Specifically, the training is divided into two stages. Pre-training in the small environment is carried out first: the agent is placed in the reinforcement learning training environment, the sensing agent module calls a simulated sensor and a simulated communication interface to obtain environment information, and the current battlefield situation is determined and stored in the memory agent module, where the sensor is a radar that can obtain the positions of teammates, the positions of enemies, and so on. Movement is performed by selecting an action, such as moving left or right, based on the positional relationship. The reinforcement learning training environment gives feedback to the agent according to the action result, and a reward function is determined for optimization.
The reward function gives a reward of 100 and a penalty of 100, the penalty being applied when the agent collides.
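The exact expression of the reward function is not reproduced in this text; a minimal sketch consistent with the description is given below, in which the event that earns the positive reward (here taken to be detecting a target) is an assumption made only for illustration:

```python
def reward(detected_target: bool, collided: bool) -> float:
    """Sparse reward: 100 for success, a 100 penalty on collision.
    Which event earns the reward (here: detecting a target) is assumed for illustration."""
    r = 0.0
    if detected_target:
        r += 100.0
    if collided:
        r -= 100.0
    return r
```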
Optimization continues until the agent converges, and the artificial intelligence behavior model is output.
The above is the simulation training process of one engagement. After many rounds of sample training, the combat behavior model gradually converges and generates an artificial intelligence behavior model whose opponent is a reactive behavior model. Because the opponent in the first training stage is a traditional reactive behavior model whose behavior logic is relatively rigid, the second training stage is also needed in order to enlarge the training sample space and improve the adaptability of the artificial intelligence behavior model.
In some embodiments, the training the artificial intelligence behavior model with the combat simulation engine to output an optimization model includes:
initializing an artificial intelligence behavior model;
the sensing model acquires environmental information, determines the current battlefield situation and stores the situation in the memory model;
the decision model selects an action to be executed according to the current battlefield situation;
the action model performs the selected action;
the combat simulation engine feeds the battlefield environment back to the artificial intelligence behavior model for optimization according to the action result;
and judging whether the artificial intelligence behavior model converges or not, and outputting the optimization model after the model converges.
The artificial intelligence behavior model is then placed in the large environment for compensation training. Because the data acquired in the small-environment pre-training stage has a relatively high probability of error and the data processing in that stage is not complete enough, the real combat simulation engine is adopted as the environment for re-training after the small-environment pre-training. The perception model, decision model, memory model, and action model of the combat simulation engine are used for training; the training process is the same as in the reinforcement learning training environment and is not described in detail here.
Specifically, before determining whether to converge, the method further includes:
judging whether a preset training end time is reached;
and if the training end time is reached, ending and exiting, otherwise, continuing the training.
The end of the training process in the present application falls into two categories: one is reaching the maximum running time, for example running at most one thousand steps (which can also be understood as one thousand frame periods); the other is reaching the expected optimization goal.
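A sketch of this termination check is shown below (the step limit of 1000 is the example given above; the flag names are illustrative):

```python
def should_stop(step: int, goal_reached: bool, max_steps: int = 1000) -> bool:
    """Training stops when the maximum running time (e.g. 1000 steps / frame periods)
    is reached, or when the expected optimization goal has been achieved."""
    return step >= max_steps or goal_reached
```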
Preferably, the reinforcement learning training environment uses the MADDPG algorithm to run the combat behavior model in a distributed manner and train it in a centralized manner.
Preferably, the combat behavior model employs a multi-agent artificial neural network.
As shown in fig. 4, an embodiment of the present application provides an object detecting and assigning apparatus based on multi-agent reinforcement learning, including:
the
construction module401 is used for constructing a combat behavior model and a reinforcement learning training environment;
an obtaining
module402, configured to train the combat behavior model to model convergence by using a reinforcement learning training environment, and obtain an artificial intelligence behavior model;
and an
output module403, configured to train the artificial intelligence behavior model by using a combat simulation engine, and output an optimization model.
The working principle of the multi-agent reinforcement learning-based target detection and distribution device provided by the application is that the
construction module401 constructs a combat behavior model and a reinforcement learning training environment; the obtaining
module402 adopts a reinforcement learning training environment to train the combat behavior model until the model converges, and obtains an artificial intelligence behavior model; the
output module403 trains the artificial intelligence behavior model by using a combat simulation engine, and outputs an optimization model.
The embodiment of the application provides computer equipment, which comprises a processor and a memory connected with the processor;
the memory is used for storing a computer program, and the computer program is used for executing the target detection and distribution method based on multi-agent reinforcement learning provided by any one of the above embodiments;
the processor is used to call and execute the computer program in the memory.
In summary, the invention provides a target detection and distribution method and device based on multi-agent reinforcement learning, which integrates the reinforcement learning algorithm MADDPG into the wargame deduction system, constructs simulation environments from simple to complex, improves the convergence speed of reinforcement learning, and effectively solves the problem of slow optimization convergence of agents in the wargame deduction system. The invention applies the MADDPG idea to the military simulation field, so that each combat unit becomes an independent and mutually cooperative agent; an agent leaves its own pheromone after acting, and over time the multiple agents learn how to reinforce good pheromones and attenuate poor ones. In this way, by increasing the interaction between agents, each agent optimizes its own strategy, and even if the environment changes, the agents can still achieve the goal well according to the learned strategies. To address the convergence speed problem of multiple agents trained with the MADDPG algorithm, the invention builds on the MPE (multi-agent particle environment) developed by OpenAI, removing most of the mathematical computation of the combat models while retaining most of the wargame simulation and deduction functions of the engine. After each step, the experience learned by the agents is inherited by the wargame simulation and deduction system, which then trains again, so that the problem of slow optimization convergence of the agents is effectively solved.
It is to be understood that the embodiments of the method provided above correspond to the embodiments of the apparatus described above, and the corresponding specific contents may be referred to each other, which is not described herein again.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.
Claims (10)
1. A target detection and distribution method based on multi-agent reinforcement learning is characterized by comprising the following steps:
constructing a combat behavior model and a reinforcement learning training environment;
training the combat behavior model to model convergence by adopting a reinforcement learning training environment to obtain an artificial intelligence behavior model;
and training the artificial intelligent behavior model by adopting a combat simulation engine, and outputting an optimization model.
2. The method of claim 1, wherein constructing a reinforcement learning training environment comprises:
and mapping the combat simulation engine and the reinforcement learning training environment by adopting the MADDPG algorithm.
3. The method of claim 2, wherein said mapping of the combat simulation engine and the reinforcement learning training environment using the MADDPG algorithm comprises:
mapping a combat behavior model in the combat simulation engine to a plurality of agents in the reinforcement learning training environment, the agents serving as training objects;
mapping a perception model in the combat simulation engine to a perception agent module in the reinforcement learning training environment, wherein the perception agent module is used for acquiring the current battlefield situation;
mapping a decision model in the combat simulation engine to a decision agent module in the reinforcement learning training environment, wherein the decision agent module is used for selecting an action to be executed according to the current battlefield situation;
mapping an action model in the combat simulation engine to an action agent module in the reinforcement learning training environment for executing the selected action;
and mapping a memory model in the combat simulation engine to a memory agent module in the reinforcement learning training environment, wherein the memory agent module is used for storing battlefield situations.
4. The method of claim 3, wherein the training the combat behavior model to model convergence using a reinforcement learning training environment to obtain an artificial intelligence behavior model comprises:
initializing an agent;
the sensing agent module acquires environmental information, determines the current battlefield situation and stores the situation in the memory agent module;
the decision agent module selects an action to be executed according to the current battlefield situation;
the action agent module executes the selected action;
the reinforcement learning training environment feeds a battlefield environment back to the intelligent agent for optimization according to action results;
and judging whether the intelligent agent converges or not, and outputting an artificial intelligence behavior model after the intelligent agent converges.
5. The method of claim 4, wherein the training the artificial intelligence behavior model with the combat simulation engine to output an optimization model comprises:
initializing an artificial intelligence behavior model;
the sensing model acquires environmental information, determines the current battlefield situation and stores the situation in the memory model;
the decision model selects an action to be executed according to the current battlefield situation;
the action model performs the selected action;
the combat simulation engine feeds the battlefield environment back to the artificial intelligence behavior model for optimization according to the action result;
and judging whether the artificial intelligence behavior model converges or not, and outputting the optimization model after the model converges.
6. The method of claim 4 or 5, further comprising, before determining whether to converge:
judging whether a preset training end time is reached;
and if the training end time is reached, ending and exiting, otherwise, continuing the training.
7. The method of claim 1,
and the reinforcement learning training environment utilizes a MADDPG algorithm to perform distributed operation and centralized training on the combat behavior model.
8. The method of claim 3,
the number of the intelligent agents is 3.
9. The method of claim 1,
the combat behavior model adopts a multi-agent artificial neural network.
10. An object detection and distribution device based on multi-agent reinforcement learning, comprising:
the construction module is used for constructing a combat behavior model and a reinforcement learning training environment;
the obtaining module is used for training the combat behavior model to model convergence by adopting a reinforcement learning training environment, and obtaining an artificial intelligence behavior model;
and the output module is used for training the artificial intelligence behavior model by adopting a combat simulation engine and outputting an optimization model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010959038.7A CN112131786B (en) | 2020-09-14 | 2020-09-14 | Target detection and distribution method and device based on multi-agent reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010959038.7A CN112131786B (en) | 2020-09-14 | 2020-09-14 | Target detection and distribution method and device based on multi-agent reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112131786A true CN112131786A (en) | 2020-12-25 |
CN112131786B CN112131786B (en) | 2024-05-31 |
Family
ID=73846639
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010959038.7A Active CN112131786B (en) | 2020-09-14 | 2020-09-14 | Target detection and distribution method and device based on multi-agent reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112131786B (en) |
Cited By (14)
* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---|
CN112633519A (en) * | 2021-03-11 | 2021-04-09 | 中国科学院自动化研究所 | Man-machine antagonistic action prediction method, device, electronic equipment and storage medium |
CN112905166A (en) * | 2021-03-04 | 2021-06-04 | 青岛海科创新科技有限公司 | Artificial intelligence programming system, computer device, computer-readable storage medium |
CN113222106A (en) * | 2021-02-10 | 2021-08-06 | 西北工业大学 | Intelligent military chess deduction method based on distributed reinforcement learning |
CN113435598A (en) * | 2021-07-08 | 2021-09-24 | 中国人民解放军国防科技大学 | Knowledge-driven intelligent strategy deduction decision method |
CN113469853A (en) * | 2021-05-13 | 2021-10-01 | 航天科工空间工程发展有限公司 | Method for accelerating command control of fighting and artificial intelligence device |
CN113723013A (en) * | 2021-09-10 | 2021-11-30 | 中国人民解放军国防科技大学 | Multi-agent decision method for continuous space chess deduction |
CN114327916A (en) * | 2022-03-10 | 2022-04-12 | 中国科学院自动化研究所 | A training method, device and equipment for a resource allocation system |
CN114611669A (en) * | 2022-03-14 | 2022-06-10 | 三峡大学 | Intelligent strategy-chess-deducing decision-making method based on double experience pool DDPG network |
CN114706381A (en) * | 2022-03-04 | 2022-07-05 | 达闼机器人股份有限公司 | Training method, device, storage medium and electronic device for intelligent body |
WO2023038605A1 (en) * | 2021-09-08 | 2023-03-16 | Havelsan Hava Elektronik San. Ve Tic. A.S. | Autonomous virtual entities continuously learning from experience |
CN115906673A (en) * | 2023-01-10 | 2023-04-04 | 中国人民解放军陆军工程大学 | Integrated modeling method and system for combat entity behavior model |
CN116739323A (en) * | 2023-08-16 | 2023-09-12 | 北京航天晨信科技有限责任公司 | Intelligent evaluation method and system for emergency resource scheduling |
CN116861795A (en) * | 2023-07-24 | 2023-10-10 | 北京航天晨信科技有限责任公司 | Automatic evaluation system and method for reinforcement learning algorithm model |
CN117217100A (en) * | 2023-11-08 | 2023-12-12 | 中国人民解放军63963部队 | Intelligent modeling method and simulation system for certain team numbers based on reinforcement learning |
Families Citing this family (1)
* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---|
US20220107628A1 (en) * | 2020-10-07 | 2022-04-07 | The Boeing Company | Systems and methods for distributed hierarchical control in multi-agent adversarial environments |
Citations (15)
* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---|
CN101964019A (en) * | 2010-09-10 | 2011-02-02 | 北京航空航天大学 | Against behavior modeling simulation platform and method based on Agent technology |
KR101179074B1 (en) * | 2011-12-13 | 2012-09-05 | 국방과학연구소 | Airburst simulation apparatus and method of simulation for airbrust |
RU2562096C1 (en) * | 2014-06-25 | 2015-09-10 | Федеральное государственное казённое военное образовательное учреждение высшего профессионального образования "Военная академия воздушно-космической обороны им. Маршала Советского Союза Г.К. Жукова" Министерства обороны Российской Федерации | Training command post of main rocket attack warning centre |
CN105677443A (en) * | 2015-12-29 | 2016-06-15 | 中国人民解放军空军指挥学院 | Heterogeneous simulation system |
WO2018175551A1 (en) * | 2017-03-22 | 2018-09-27 | Circadence Corporation | Mission-based, game-implemented cyber training system and method |
CN108629422A (en) * | 2018-05-10 | 2018-10-09 | 浙江大学 | A kind of intelligent body learning method of knowledge based guidance-tactics perception |
CN108646589A (en) * | 2018-07-11 | 2018-10-12 | 北京晶品镜像科技有限公司 | A kind of battle simulation training system and method for the formation of attack unmanned plane |
CN109636699A (en) * | 2018-11-06 | 2019-04-16 | 中国电子科技集团公司第五十二研究所 | A kind of unsupervised intellectualized battle deduction system based on deeply study |
CN109740283A (en) * | 2019-01-17 | 2019-05-10 | 清华大学 | Autonomous multi-agent confrontation simulation method and system |
CN110147883A (en) * | 2019-05-28 | 2019-08-20 | 航天科工系统仿真科技(北京)有限公司 | Training method, device, equipment and the storage medium of model for emulation of fighting |
CN110428057A (en) * | 2019-05-06 | 2019-11-08 | 南京大学 | A kind of intelligent game playing system based on multiple agent deeply learning algorithm |
CN110766169A (en) * | 2019-10-31 | 2020-02-07 | 深圳前海微众银行股份有限公司 | Transfer training optimization method and device for reinforcement learning, terminal and storage medium |
CN110929871A (en) * | 2019-11-15 | 2020-03-27 | 南京星火技术有限公司 | Game decision method and system |
CN111027862A (en) * | 2019-12-11 | 2020-04-17 | 中国舰船研究设计中心 | Multidimensional-based hierarchical aggregation combat simulation training evaluation method |
CN111632387A (en) * | 2020-06-12 | 2020-09-08 | 南京大学 | A Command and Control System Based on StarCraft II |
- 2020-09-14: CN CN202010959038.7A, patent CN112131786B (en), status Active
Patent Citations (16)
* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---|
CN101964019A (en) * | 2010-09-10 | 2011-02-02 | 北京航空航天大学 | Against behavior modeling simulation platform and method based on Agent technology |
KR101179074B1 (en) * | 2011-12-13 | 2012-09-05 | 국방과학연구소 | Airburst simulation apparatus and method of simulation for airbrust |
US20140065578A1 (en) * | 2011-12-13 | 2014-03-06 | Joon-Ho Lee | Airburst simulation system and method of simulation for airburst |
RU2562096C1 (en) * | 2014-06-25 | 2015-09-10 | Федеральное государственное казённое военное образовательное учреждение высшего профессионального образования "Военная академия воздушно-космической обороны им. Маршала Советского Союза Г.К. Жукова" Министерства обороны Российской Федерации | Training command post of main rocket attack warning centre |
CN105677443A (en) * | 2015-12-29 | 2016-06-15 | 中国人民解放军空军指挥学院 | Heterogeneous simulation system |
WO2018175551A1 (en) * | 2017-03-22 | 2018-09-27 | Circadence Corporation | Mission-based, game-implemented cyber training system and method |
CN108629422A (en) * | 2018-05-10 | 2018-10-09 | 浙江大学 | A kind of intelligent body learning method of knowledge based guidance-tactics perception |
CN108646589A (en) * | 2018-07-11 | 2018-10-12 | 北京晶品镜像科技有限公司 | A kind of battle simulation training system and method for the formation of attack unmanned plane |
CN109636699A (en) * | 2018-11-06 | 2019-04-16 | 中国电子科技集团公司第五十二研究所 | A kind of unsupervised intellectualized battle deduction system based on deeply study |
CN109740283A (en) * | 2019-01-17 | 2019-05-10 | 清华大学 | Autonomous multi-agent confrontation simulation method and system |
CN110428057A (en) * | 2019-05-06 | 2019-11-08 | 南京大学 | A kind of intelligent game playing system based on multiple agent deeply learning algorithm |
CN110147883A (en) * | 2019-05-28 | 2019-08-20 | 航天科工系统仿真科技(北京)有限公司 | Training method, device, equipment and the storage medium of model for emulation of fighting |
CN110766169A (en) * | 2019-10-31 | 2020-02-07 | 深圳前海微众银行股份有限公司 | Transfer training optimization method and device for reinforcement learning, terminal and storage medium |
CN110929871A (en) * | 2019-11-15 | 2020-03-27 | 南京星火技术有限公司 | Game decision method and system |
CN111027862A (en) * | 2019-12-11 | 2020-04-17 | 中国舰船研究设计中心 | Multidimensional-based hierarchical aggregation combat simulation training evaluation method |
CN111632387A (en) * | 2020-06-12 | 2020-09-08 | 南京大学 | A Command and Control System Based on StarCraft II |
Non-Patent Citations (5)
* Cited by examiner, † Cited by third party
Title
---|
WEI XIAOLONG 等: "Recurrent MADDPG for Object Detection and Assignment in Combat Tasks", 《IEEE ACCESS》, vol. 8, 18 September 2020 (2020-09-18), pages 163334 - 163343, XP011809178, DOI: 10.1109/ACCESS.2020.3022638 * |
ZHANG GUANYU 等: "Efficient training techniques for multi-agent reinforcement learning in combat tasks", 《IEEE ACCESS》, vol. 7, 21 August 2019 (2019-08-21), pages 109301 - 109310, XP011740787, DOI: 10.1109/ACCESS.2019.2933454 * |
欧微 等: "基于深度学习的兵棋实体决策效果智能评估模型", 《军事运筹与评估》, vol. 32, no. 4, 31 December 2018 (2018-12-31), pages 29 - 34 * |
王述运 等: "舰载机对海作战训练仿真系统设计与关键技术", 《指挥控制与仿真》, vol. 43, no. 3, 30 June 2020 (2020-06-30), pages 81 - 86 * |
郜富晓: "空地协作组网的无人机位置部署及能量优化机制研究", 《中国优秀硕士学位论文全文数据库 工程科技Ⅱ辑》, vol. 2020, no. 01, 15 January 2020 (2020-01-15), pages 034 - 1016 * |
Cited By (21)
* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---|
CN113222106A (en) * | 2021-02-10 | 2021-08-06 | 西北工业大学 | Intelligent military chess deduction method based on distributed reinforcement learning |
CN113222106B (en) * | 2021-02-10 | 2024-04-30 | 西北工业大学 | Intelligent soldier chess deduction method based on distributed reinforcement learning |
CN112905166A (en) * | 2021-03-04 | 2021-06-04 | 青岛海科创新科技有限公司 | Artificial intelligence programming system, computer device, computer-readable storage medium |
CN112905166B (en) * | 2021-03-04 | 2024-04-05 | 青岛海科智汇信息科技有限公司 | Artificial intelligence programming system, computer device, and computer-readable storage medium |
CN112633519A (en) * | 2021-03-11 | 2021-04-09 | 中国科学院自动化研究所 | Man-machine antagonistic action prediction method, device, electronic equipment and storage medium |
CN113469853A (en) * | 2021-05-13 | 2021-10-01 | 航天科工空间工程发展有限公司 | Method for accelerating command control of fighting and artificial intelligence device |
CN113435598A (en) * | 2021-07-08 | 2021-09-24 | 中国人民解放军国防科技大学 | Knowledge-driven intelligent strategy deduction decision method |
CN113435598B (en) * | 2021-07-08 | 2022-06-21 | 中国人民解放军国防科技大学 | Knowledge-driven intelligent decision-making method for wargames |
WO2023038605A1 (en) * | 2021-09-08 | 2023-03-16 | Havelsan Hava Elektronik San. Ve Tic. A.S. | Autonomous virtual entities continuously learning from experience |
CN113723013A (en) * | 2021-09-10 | 2021-11-30 | 中国人民解放军国防科技大学 | Multi-agent decision method for continuous space chess deduction |
CN114706381A (en) * | 2022-03-04 | 2022-07-05 | 达闼机器人股份有限公司 | Training method, device, storage medium and electronic device for intelligent body |
CN114327916A (en) * | 2022-03-10 | 2022-04-12 | 中国科学院自动化研究所 | A training method, device and equipment for a resource allocation system |
CN114611669B (en) * | 2022-03-14 | 2023-10-13 | 三峡大学 | Intelligent decision-making method for chess deduction based on double experience pool DDPG network |
CN114611669A (en) * | 2022-03-14 | 2022-06-10 | 三峡大学 | Intelligent strategy-chess-deducing decision-making method based on double experience pool DDPG network |
CN115906673A (en) * | 2023-01-10 | 2023-04-04 | 中国人民解放军陆军工程大学 | Integrated modeling method and system for combat entity behavior model |
CN115906673B (en) * | 2023-01-10 | 2023-11-03 | 中国人民解放军陆军工程大学 | Combat entity behavior model integrated modeling method and system |
CN116861795A (en) * | 2023-07-24 | 2023-10-10 | 北京航天晨信科技有限责任公司 | Automatic evaluation system and method for reinforcement learning algorithm model |
CN116739323A (en) * | 2023-08-16 | 2023-09-12 | 北京航天晨信科技有限责任公司 | Intelligent evaluation method and system for emergency resource scheduling |
CN116739323B (en) * | 2023-08-16 | 2023-11-10 | 北京航天晨信科技有限责任公司 | Intelligent evaluation method and system for emergency resource scheduling |
CN117217100A (en) * | 2023-11-08 | 2023-12-12 | 中国人民解放军63963部队 | Intelligent modeling method and simulation system for certain team numbers based on reinforcement learning |
CN117217100B (en) * | 2023-11-08 | 2024-01-30 | 中国人民解放军63963部队 | Intelligent modeling method and simulation system for certain team numbers based on reinforcement learning |
Also Published As
Publication number | Publication date |
---|---|
CN112131786B (en) | 2024-05-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112131786A (en) | 2020-12-25 | Target detection and distribution method and device based on multi-agent reinforcement learning |
CN113791634B (en) | 2024-02-02 | Multi-agent reinforcement learning-based multi-machine air combat decision method |
Liu et al. | 2017 | A deep reinforcement learning based intelligent decision method for UCAV air combat |
CN112215350B (en) | 2023-11-03 | Method and device for controlling agent based on reinforcement learning |
CN113705102B (en) | 2024-05-10 | Deduction simulation system, deduction simulation method, deduction simulation equipment and deduction simulation storage medium for sea-air cluster countermeasure |
CN112784445A (en) | 2021-05-11 | Parallel distributed computing system and method for flight control agent |
CN113848974B (en) | 2023-08-15 | Aircraft trajectory planning method and system based on deep reinforcement learning |
CN114510012B (en) | 2024-11-29 | Unmanned cluster evolution system and method based on meta-action sequence reinforcement learning |
CN116661503B (en) | 2023-10-13 | Cluster track automatic planning method based on multi-agent safety reinforcement learning |
Cao et al. | 2023 | Autonomous maneuver decision of UCAV air combat based on double deep Q network algorithm and stochastic game theory |
CN113298260A (en) | 2021-08-24 | Confrontation simulation deduction method based on deep reinforcement learning |
Zhou et al. | 2020 | An air combat decision learning system based on a brain-like cognitive mechanism |
CN111625930A (en) | 2020-09-04 | Desktop formula is artillery simulation training system by oneself |
CN113625569B (en) | 2022-02-08 | Small unmanned aerial vehicle prevention and control decision method and system based on hybrid decision model |
CN116663416A (en) | 2023-08-29 | CGF decision behavior simulation method based on behavior tree |
CN115185294B (en) | 2024-07-05 | QMIX-based aviation soldier multi-formation collaborative autonomous behavior decision modeling method |
Li et al. | 2019 | Improved Q-learning based route planning method for UAVs in unknown environment |
Wang et al. | 2023 | A review of deep reinforcement learning methods and military application research |
CN117908565A (en) | 2024-04-19 | Unmanned aerial vehicle safety path planning method based on maximum entropy multi-agent reinforcement learning |
KR101345645B1 (en) | 2013-12-27 | Simulation System And Method for War Game |
Dimitriu et al. | 2024 | A Reinforcement Learning Approach to Military Simulations in Command: Modern Operations |
Wang et al. | 2022 | Research on naval air defense intelligent operations on deep reinforcement learning |
Gao et al. | 2022 | Hybrid path planning algorithm of the mobile agent based on Q-learning |
Liu et al. | 2022 | Intelligent air defense task assignment based on hierarchical reinforcement learning |
Cao et al. | 2020 | Anti-air Missiles Synergic Battle Effectiveness Technique Study in Support of Information of Sky and Firmament |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
2020-12-25 | PB01 | Publication |
2021-01-12 | SE01 | Entry into force of request for substantive examination |
2024-05-31 | GR01 | Patent grant |