
CN116362289A - An improved MATD3 multi-robot cooperative round-up method based on BiGRU structure

  • Fri Jun 30 2023

Info

Publication number
CN116362289A
Authority
CN
China
Prior art keywords
robot
robots
matd3
network
algorithm
Prior art date
2023-02-17
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310130915.3A
Other languages
Chinese (zh)
Inventor
刘中常
戴兵
刘田禾
岳伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Maritime University
Original Assignee
Dalian Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2023-02-17
Filing date
2023-02-17
Publication date
2023-06-30
2023-02-17 Application filed by Dalian Maritime University
2023-02-17 Priority to CN202310130915.3A
2023-06-30 Publication of CN116362289A
Status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02: Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention provides an improved MATD3 multi-robot cooperative round-up method based on a BiGRU structure, comprising: establishing a pursuit-evasion confrontation scenario; designing a MATD3 algorithm suitable for multiple robots, based on the established scenario and the reinforcement learning TD3 algorithm; improving the designed multi-robot MATD3 algorithm on the basis of the BiGRU structure; designing a reward function based on a multi-robot round-up strategy with individual and team rewards; and combining the designed reward function with the improved MATD3 algorithm to obtain a MATD3-BiGRU based multi-robot cooperative round-up strategy algorithm. The invention uses a MATD3-based multi-agent reinforcement learning algorithm to solve the multi-robot cooperative round-up problem in a two-dimensional scene, with a reward function that is a weighted combination of an individual capture reward and a team cooperation reward, and uses the improved BiGRU-based MATD3 multi-agent reinforcement learning algorithm to mitigate the drop in round-up performance when the number of pursuing robots changes.

Description

An improved MATD3 multi-robot cooperative round-up method based on a BiGRU structure

Technical Field

The present invention relates to the technical field of multi-robot cooperative round-up, and in particular to an improved MATD3 multi-robot cooperative round-up method based on a BiGRU structure.

Background

Multi-robot cooperative round-up strategies have important application value in scenarios such as military confrontation and autonomous search and rescue, and have long been a research hotspot. Research methods range from early model-based differential-game and optimal-control approaches to recent deep reinforcement learning methods that do not depend on a specific model; the latter can be trained offline on a given task scenario and the learned policy then applied in the real scene. In a multi-robot cooperative round-up task, if a robot fails and can no longer take part, or a new robot joins during the round-up, the number of robots changes and the previously trained policy can no longer adapt well to the new situation. Deep reinforcement learning algorithms therefore need to be robust to changes in the environment and able to adapt to unseen (but similar) environments. Most existing methods retrain the robots for the new environment, which consumes training time, wastes computing resources, and is not acceptable in practice.

Summary of the Invention

In view of the above, for the multi-robot cooperative round-up task, the present invention addresses the technical problem that the number of robots deployed in the real environment differs from the number used during training, and provides an improved MATD3 multi-robot cooperative round-up method based on a BiGRU structure. The invention achieves effective round-up of a single escaping robot by multiple pursuing robots and improves the adaptability of the algorithm to changes in the number of robots.

The technical means adopted by the present invention are as follows:

An improved MATD3 multi-robot cooperative round-up method based on a BiGRU structure, comprising:

establishing a pursuit-evasion confrontation scenario;

designing a MATD3 algorithm suitable for multiple robots, based on the established pursuit-evasion scenario and the reinforcement learning TD3 algorithm;

improving the designed multi-robot MATD3 algorithm on the basis of the BiGRU structure;

designing a reward function based on a multi-robot round-up strategy with individual and team rewards;

combining the designed reward function with the improved MATD3 algorithm to obtain a MATD3-BiGRU based multi-robot cooperative round-up strategy algorithm.

Further, establishing the pursuit-evasion confrontation scenario includes:

establishing a scenario containing N pursuing robots and a single escaping robot, with the equation of motion of each robot expressed as:

[Equation image not reproduced: kinematic model of each robot]

where v_i denotes the speed of the pursuing or escaping robot, ψ_i denotes its velocity direction angle, and a_i denotes its acceleration.
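The kinematic equation itself appears only as an image in this extraction. As a point of reference, a minimal sketch of a standard planar (unicycle-style) kinematic model consistent with the symbols v_i, ψ_i and a_i defined above is given below; this is an assumption, and the patent's exact model (for example, how the heading dynamics are specified) may differ.

```latex
% Sketch only; assumed standard planar kinematics, not the patent's exact equation.
\begin{aligned}
\dot{x}_i &= v_i \cos\psi_i, \\
\dot{y}_i &= v_i \sin\psi_i, \\
\dot{v}_i &= a_i,
\end{aligned}
```

where (x_i, y_i) denotes the planar position of robot i.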

Further, designing the MATD3 algorithm suitable for multiple robots, based on the established pursuit-evasion scenario and the reinforcement learning TD3 algorithm, includes:

extending the reinforcement learning TD3 algorithm to a multi-robot MATD3 algorithm using the centralized-training, decentralized-execution framework. Each agent in the MATD3 algorithm comprises six networks: a current Actor network, a target Actor network, two current Critic networks and two target Critic networks. The Actor network is executed in a decentralized manner: it takes the agent's own state as input and outputs the decision action. The two Critic networks are trained in a centralized manner: they take as input not only the agent's own state and action but also the states and actions of the other agents, and output two Q-value estimates close to the true value. When computing the target Q value, the smaller target Q value is selected to guide the parameter update of the Actor network, in order to alleviate the overestimation problem;

In the multi-robot cooperative round-up task, the learning goal of each robot is to maximize its expected reward, i.e. the Q value; for a pursuing robot, the learning objective function is:

[Equation image not reproduced: objective function of the Actor network]

where φ are the Actor network parameters, s is the robot's current state, μ is the deterministic policy corresponding to the policy-network output, θ are the Critic network parameters, N is the number of pursuing robots, and D is the experience data, i.e. the data in the experience replay buffer;
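Because the objective function is shown only as an image above, the conventional centralized objective used in MADDPG/MATD3-style algorithms is sketched below for reference, written with the symbols just defined; this is the standard form, assumed here, not necessarily the patent's exact notation.

```latex
% Standard MADDPG/MATD3-style centralized actor objective (conventional form, assumed).
J(\phi_i) = \mathbb{E}_{s \sim D}\!\left[ Q_{\theta_i}\bigl(s, a_1, \dots, a_N\bigr)
\Big|_{a_j = \mu_{\phi_j}(s_j)} \right]
```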

At each time step t, the loss function of the Critic network is:

[Equation images not reproduced: Critic loss function and target value]

where θ′ are the parameters of the target Critic network and φ′ are the parameters of the target Actor network; the target Critic and target Actor networks have exactly the same structure as the Critic and Actor networks but their parameters are updated at a different frequency;
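The loss itself is an equation image. In the standard TD3/MATD3 formulation that the surrounding text describes (a clipped double-Q target computed from the target networks), the critic loss and target value read as follows; this is the conventional form, assumed here rather than taken from the patent:

```latex
% Conventional clipped double-Q critic loss for agent i, with k = 1, 2 (assumed standard form).
L(\theta_{i,k}) = \mathbb{E}_{(s,a,r,s') \sim D}\!\left[
\bigl( y_i - Q_{\theta_{i,k}}(s, a_1, \dots, a_N) \bigr)^2 \right],
\qquad
y_i = r_i + \gamma \min_{k=1,2} Q_{\theta'_{i,k}}\bigl(s', a'_1, \dots, a'_N\bigr),
\quad a'_j = \mu_{\phi'_j}(s'_j) + \epsilon,
```

where γ is the discount factor and ε is the clipped noise used for target-policy smoothing.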

The Actor network is trained by maximizing the objective function, using stochastic gradient ascent to update the Actor network parameters. For the objective function defined above, its gradient is:

[Equation image not reproduced: gradient of the objective function]
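The gradient is likewise an equation image; the conventional deterministic policy gradient for this kind of centralized objective (assumed standard form, not the patent's exact expression) is:

```latex
% Conventional deterministic policy gradient for the centralized objective (assumed standard form).
\nabla_{\phi_i} J(\phi_i) = \mathbb{E}_{s \sim D}\!\left[
\nabla_{\phi_i} \mu_{\phi_i}(s_i)\,
\nabla_{a_i} Q_{\theta_{i,1}}(s, a_1, \dots, a_N)
\Big|_{a_i = \mu_{\phi_i}(s_i)} \right]
```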

The Critic network is trained by minimizing its loss function, using stochastic gradient descent to update the Critic network parameters. From the loss-function formula of the Critic network, the gradient of the loss is:

[Equation images not reproduced: gradient of the Critic loss function]

Finally, a moving-average (soft update) method is used to update the target network parameters, which keeps the algorithm stable. The update is as follows:

[Equation image not reproduced: soft update of the target network parameters]
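The update rule is an equation image; the usual exponential moving-average (soft) update that the text describes, with a small coefficient τ (conventional form, assumed), is:

```latex
% Conventional soft (moving-average) target-network update with coefficient tau << 1 (assumed).
\theta'_{i,k} \leftarrow \tau\,\theta_{i,k} + (1-\tau)\,\theta'_{i,k}, \qquad
\phi'_i \leftarrow \tau\,\phi_i + (1-\tau)\,\phi'_i
```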

Further, improving the designed multi-robot MATD3 algorithm on the basis of the BiGRU structure includes:

introducing a bidirectional gated recurrent unit (BiGRU) into the structure of each Critic network, and using the bidirectional connections of the BiGRU to link the robots of the same team to one another, so that the robots can exchange information and share real-time information;

exploiting the parameter-sharing property of the BiGRU so that robots in the same team share parameters; the number of robots is then independent of the number of parameters, and when the number of robots changes the network structure expands automatically, so the method can be applied when the number of robots changes.
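To make the parameter-sharing idea concrete, the following is a minimal illustrative sketch of a BiGRU-based centralized critic whose parameter count does not depend on the number of robots. It is written in PyTorch purely for illustration (the patent's experiments reportedly used PaddlePaddle), and all layer sizes, names and interfaces are assumptions rather than the patent's actual network.

```python
# Illustrative sketch (assumed layer sizes and naming; not the patent's exact network).
import torch
import torch.nn as nn

class BiGRUCritic(nn.Module):
    """Centralized critic whose parameters are independent of the number of robots.

    Each robot's (state, action) pair is embedded by a shared MLP, the embeddings are
    treated as a sequence and passed through a bidirectional GRU, and a shared head
    maps each GRU output to that robot's Q value.
    """

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU())
        self.bigru = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.q_head = nn.Linear(2 * hidden, 1)  # 2*hidden: forward + backward states

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # obs: (batch, n_robots, obs_dim), act: (batch, n_robots, act_dim)
        x = self.embed(torch.cat([obs, act], dim=-1))  # (batch, n_robots, hidden)
        h, _ = self.bigru(x)                           # (batch, n_robots, 2*hidden)
        return self.q_head(h).squeeze(-1)              # (batch, n_robots): per-robot Q

# The same module handles 3 or 6 pursuers without changing any trainable parameter:
critic = BiGRUCritic(obs_dim=8, act_dim=2)
q3 = critic(torch.randn(32, 3, 8), torch.randn(32, 3, 2))  # -> shape (32, 3)
q6 = critic(torch.randn(32, 6, 8), torch.randn(32, 6, 2))  # -> shape (32, 6)
```

Because the same shared embedding, GRU cell and output head are reused at every position of the robot sequence, adding or removing pursuing robots only changes the sequence length, not the set of trainable parameters.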

Further, designing the reward function based on the multi-robot round-up strategy with individual and team rewards includes:

The overall reward function of robot i at the t-th time step consists of three parts, expressed as:

[Equation image not reproduced: overall reward of robot i, composed of three terms]

where the three terms are the round-up reward, the robot-robot collision avoidance reward, and the scene-boundary collision avoidance reward.

The round-up reward function is designed as follows. It is intended to encourage each pursuing robot to capture the escaping robot and to encourage the pursuing robots to cooperate as a team to form a good round-up formation. It is specifically defined as:

[Equation image not reproduced: piecewise definition of the round-up reward]

where d_{i,e} and d_{j,e} denote the distances from pursuing robots i and j to the escaping robot, and d_cap denotes the capture distance. If the target is captured by robot i (i.e. d_{i,e} ≤ d_cap), robot i receives an individual capture reward and every other robot j receives a team capture reward; the two values are set so as to encourage the pursuing robots to cooperate with one another. While the target has not yet been captured, each pursuing robot receives a negative reward -ω_q q - ω_d d_{i,e} at every step, a weighted linear combination of an individual term (its distance d_{i,e} to the target) and a group term (the formation score q), where the formation score q is defined as:

[Equation image not reproduced: definition of the formation score q]

In this expression, n denotes the number of robots, the pursuing robot closest to the target is defined as robot 0, and the unit vectors point from the robots toward the target. The formation score encourages the robots to spread out around the target, i.e. to approach it from different directions and form an encirclement. The weights ω_q and ω_d should be chosen so that, while the robots approach the target, the reward is dominated by the formation score, which encourages the pursuing robots to cooperate as a team and form a good round-up formation.

The robot-robot collision avoidance reward function is defined as follows:

[Equation image not reproduced: robot-robot collision avoidance reward]

where δ_safety is the safety distance, ζ is a negative constant, d_{i,o} is the distance between robot i and obstacle o, and d_{i,j} is the distance between robots i and j;

The scene-boundary collision avoidance reward function, which guides a robot away from colliding with the boundary of the motion scene, is defined as follows:

[Equation image not reproduced: scene-boundary collision avoidance reward]

where δ_safety is the safety distance, ξ is a negative constant, and d_{i,b} is the distance between robot i and the boundary.
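Both collision avoidance terms above are equation images in this extraction. One natural reading of the accompanying symbol definitions (a constant negative penalty ζ or ξ applied inside the safety distance δ_safety, and zero otherwise) is sketched below; this is an assumption, since the exact piecewise forms are not reproduced here:

```latex
% Assumed piecewise forms consistent with the symbol definitions; not the patent's exact equations.
r^{\mathrm{col}}_{i,t} =
\begin{cases}
\zeta, & \min\bigl(\min_{j \ne i} d_{i,j},\; \min_{o} d_{i,o}\bigr) < \delta_{\mathrm{safety}} \\
0, & \text{otherwise}
\end{cases}
\qquad
r^{\mathrm{bound}}_{i,t} =
\begin{cases}
\xi, & d_{i,b} < \delta_{\mathrm{safety}} \\
0, & \text{otherwise}
\end{cases}
```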

Further, combining the designed reward function with the improved MATD3 algorithm to obtain the MATD3-BiGRU based multi-robot cooperative round-up strategy algorithm includes the following steps (the corresponding formulas appear as equation images in the original and are not reproduced here):

initializing, for each robot i, the two Critic networks and their parameters θ_{i,1} and θ_{i,2}, the Actor network and its parameters φ_i, the two target Critic network parameters θ′_{i,1} ← θ_{i,1} and θ′_{i,2} ← θ_{i,2}, the target Actor network parameters φ′_i ← φ_i, the experience replay buffer D, and the state of the robot;

for each robot i, selecting an action according to the exploration noise and obtaining the corresponding reward;

storing the sample in the experience replay buffer D and randomly sampling M samples from the buffer;

computing the target value;

updating the Critic network parameters;

updating the Actor network parameters through the policy gradient;

updating the target network parameters by the moving-average (soft update) method.
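For orientation, the listed steps can be summarized as the training-loop skeleton below. This is a minimal sketch, not the patent's implementation: `env`, the per-robot `agents` objects and the `buffer` are assumed placeholders that would provide the listed operations, and all hyperparameter values are illustrative.

```python
# Minimal sketch of the MATD3-BiGRU training procedure listed above.
# `env`, `agents` and `buffer` are assumed placeholder objects, not patent or library APIs.
def train(env, agents, buffer, episodes=20000, batch_size=1024,
          gamma=0.95, tau=0.01, policy_delay=2):
    total_steps = 0
    for episode in range(episodes):
        states, done = env.reset(), False
        while not done:
            # each robot i selects an action with exploration noise
            actions = [agent.act(s, explore=True) for agent, s in zip(agents, states)]
            next_states, rewards, done = env.step(actions)
            # store the joint transition in the experience replay buffer D
            buffer.add(states, actions, rewards, next_states, done)
            states = next_states
            total_steps += 1
            if len(buffer) < batch_size:
                continue
            batch = buffer.sample(batch_size)          # randomly sample M transitions
            for agent in agents:
                y = agent.target_value(batch, gamma)   # smaller of the two target-critic values
                agent.update_critics(batch, y)         # gradient descent on the critic loss
                if total_steps % policy_delay == 0:
                    agent.update_actor(batch)          # deterministic policy gradient ascent
                    agent.soft_update(tau)             # moving-average update of target networks
```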

Compared with the prior art, the present invention has the following advantages:

1. For the cooperative round-up problem of a robot swarm in a two-dimensional scene, the improved MATD3 multi-robot cooperative round-up method based on the BiGRU structure provided by the present invention extends the TD3 algorithm to multiple robots using the centralized-training, decentralized-execution framework, so that it is applicable to multi-robot round-up applications.

2. By introducing an improved MATD3 deep reinforcement learning algorithm based on the BiGRU structure, the method provided by the present invention alleviates the drop in algorithm performance that occurs in the pursuit-evasion problem when the number of robots changes.

For the above reasons, the present invention can be widely applied in fields such as multi-robot cooperative round-up.

Brief Description of the Drawings

In order to explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.

Fig. 1 is a flow chart of the method of the present invention.

Fig. 2 is a schematic diagram of the motion of the pursuing robot P and the escaping robot E provided by an embodiment of the present invention.

Fig. 3 is a framework diagram of the MATD3 algorithm provided by an embodiment of the present invention.

Fig. 4 is a framework diagram of the MATD3 algorithm based on the BiGRU structure provided by an embodiment of the present invention.

Fig. 5 is a schematic diagram of the many-to-one pursuit-evasion confrontation environment provided by an embodiment of the present invention.

Fig. 6 is a curve of the sum of reward values of the pursuing robots in Experiment 1 provided by an embodiment of the present invention.

Fig. 7 is a curve of the sum of reward values of the pursuing robots in Experiment 2 provided by an embodiment of the present invention.

Detailed Description of Embodiments

It should be noted that the embodiments of the present invention and the features in the embodiments can be combined with one another in the absence of conflict. The present invention is described in detail below with reference to the accompanying drawings and in combination with the embodiments.

In order to make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit the invention, its application or its uses. All other embodiments obtained by persons of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

It should be noted that the terminology used here is only for describing specific embodiments and is not intended to limit the exemplary embodiments of the present invention. As used here, unless the context clearly indicates otherwise, the singular forms are also intended to include the plural forms. It should also be understood that when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components and/or combinations thereof.

Unless otherwise specifically stated, the relative arrangement of the components and steps, the numerical expressions and the numerical values set forth in these embodiments do not limit the scope of the present invention. At the same time, it should be clear that, for ease of description, the dimensions of the various parts shown in the drawings are not drawn to actual scale. Techniques, methods and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but, where appropriate, such techniques, methods and devices should be regarded as part of the description. In all examples shown and discussed here, any specific value should be interpreted as merely exemplary and not as a limitation; other examples of the exemplary embodiments may therefore have different values. It should be noted that similar reference numerals and letters denote similar items in the following drawings, so once an item is defined in one drawing it does not need to be discussed further in subsequent drawings.

In the description of the present invention, it should be understood that orientation words such as "front, rear, upper, lower, left, right", "transverse, vertical, perpendicular, horizontal" and "top, bottom" indicate orientations or positional relationships that are generally based on those shown in the drawings; they are used only to facilitate and simplify the description of the present invention and, unless otherwise stated, do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation, and therefore cannot be understood as limiting the protection scope of the present invention. The orientation words "inside" and "outside" refer to the inside and outside relative to the outline of each component itself.

For ease of description, spatially relative terms such as "on", "above", "on the upper surface of" and "upper" may be used here to describe the spatial positional relationship between one device or feature and other devices or features as shown in the drawings. It should be understood that spatially relative terms are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the drawings. For example, if the device in the drawings is turned over, devices described as "above" or "over" other devices or structures would then be oriented "below" or "under" the other devices or structures. Thus, the exemplary term "above" can encompass both the orientations "above" and "below". The device may also be oriented in other ways (rotated 90 degrees or in other orientations), and the spatially relative descriptions used here are interpreted accordingly.

In addition, it should be noted that the use of words such as "first" and "second" to define components is only for the convenience of distinguishing the corresponding components; unless otherwise stated, these words have no special meaning and therefore cannot be understood as limiting the protection scope of the present invention.

As shown in Fig. 1, the present invention provides an improved MATD3 multi-robot cooperative round-up method based on a BiGRU structure, comprising:

S1, establishing a pursuit-evasion confrontation scenario;

S2, designing a MATD3 algorithm suitable for multiple robots, based on the established pursuit-evasion scenario and the reinforcement learning TD3 algorithm;

S3, improving the designed multi-robot MATD3 algorithm on the basis of the BiGRU structure;

S4, designing a reward function based on a multi-robot round-up strategy with individual and team rewards;

S5, combining the designed reward function with the improved MATD3 algorithm to obtain a MATD3-BiGRU based multi-robot cooperative round-up strategy algorithm.

In a specific implementation, as a preferred embodiment of the present invention, establishing the pursuit-evasion confrontation scenario in step S1 includes:

establishing a scenario containing N pursuing robots and a single escaping robot, as shown in Fig. 2, with the equation of motion of each robot expressed as:

[Equation image not reproduced: kinematic model of each robot]

where v_i denotes the speed of the pursuing or escaping robot, ψ_i denotes its velocity direction angle, and a_i denotes its acceleration.

In a specific implementation, as a preferred embodiment of the present invention, designing the MATD3 algorithm suitable for multiple robots in step S2, based on the established pursuit-evasion scenario and the reinforcement learning TD3 algorithm, includes:

extending the reinforcement learning TD3 algorithm to a multi-robot MATD3 algorithm using the centralized-training, decentralized-execution framework. The MATD3 algorithm has the following characteristics. Each agent has not only a current Actor network and a corresponding target Actor network, but also two current Critic networks and two corresponding target Critic networks; the smaller of the two Q values is selected as the Q value of the next state to alleviate the overestimation problem. The Actor network is updated in a delayed manner, i.e. it is updated only after the Critic networks have been updated several times, which resolves the coupling between the value function and the policy. Noise is added to the output of the target Actor network to smooth the target policy; adding noise to the target action acts as a form of regularization that alleviates overfitting, so that the Q value predicted by the target Critic network is as accurate as possible. Fig. 3 shows the framework of the MATD3 algorithm. Each agent in the figure comprises six networks: a current Actor network, a target Actor network, two current Critic networks and two target Critic networks. The Actor network is executed in a decentralized manner: it takes the agent's own state as input and outputs the decision action. The two Critic networks are trained in a centralized manner: they take as input not only the agent's own state and action but also the states and actions of the other agents, and output two Q-value estimates close to the true value. When computing the target Q value, the smaller target Q value is selected to guide the parameter update of the Actor network, in order to alleviate the overestimation problem.
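The three TD3 mechanisms described in this paragraph (clipped double-Q target, target-policy smoothing, and delayed Actor updates) can be illustrated with the short sketch below. It uses PyTorch for illustration only; the noise scales, action bounds and interfaces are assumptions, not values taken from the patent.

```python
# Illustrative sketch of the three TD3 mechanisms described above (assumed interfaces and values).
import torch

def td3_target(reward, next_q1, next_q2, gamma=0.95, done=None):
    """Clipped double-Q target: use the smaller of the two target-critic estimates."""
    next_q = torch.min(next_q1, next_q2)
    if done is None:
        done = torch.zeros_like(reward)
    return reward + gamma * (1.0 - done) * next_q

def smoothed_target_action(target_actor, next_obs, noise_std=0.2, noise_clip=0.5,
                           act_low=-1.0, act_high=1.0):
    """Target-policy smoothing: add clipped Gaussian noise to the target action."""
    action = target_actor(next_obs)
    noise = torch.clamp(noise_std * torch.randn_like(action), -noise_clip, noise_clip)
    return torch.clamp(action + noise, act_low, act_high)

# Delayed policy update: the Actor (and the target networks) are updated only once
# every `policy_delay` critic updates, e.g.
#   if critic_update_step % policy_delay == 0:
#       update_actor(); soft_update_targets()
```

Taking the elementwise minimum of the two target critics keeps the bootstrapped target conservative, which is exactly the overestimation fix described above.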

In the multi-robot cooperative round-up task, the learning goal of each robot is to maximize its expected reward, i.e. the Q value; for a pursuing robot, the learning objective function is:

[Equation image not reproduced: objective function of the Actor network]

where φ are the Actor network parameters, s is the robot's current state, μ is the deterministic policy corresponding to the policy-network output, θ are the Critic network parameters, N is the number of pursuing robots, and D is the experience data, i.e. the data in the experience replay buffer;

At each time step t, the loss function of the Critic network is:

[Equation images not reproduced: Critic loss function and target value]

where θ′ are the parameters of the target Critic network and φ′ are the parameters of the target Actor network; the target Critic and target Actor networks have exactly the same structure as the Critic and Actor networks but their parameters are updated at a different frequency;

The Actor network is trained by maximizing the objective function, using stochastic gradient ascent to update the Actor network parameters. For the objective function defined above, its gradient is:

[Equation image not reproduced: gradient of the objective function]

The Critic network is trained by minimizing its loss function, using stochastic gradient descent to update the Critic network parameters. From the loss-function formula of the Critic network, the gradient of the loss is:

[Equation images not reproduced: gradient of the Critic loss function]

Finally, a moving-average (soft update) method is used to update the target network parameters, which keeps the algorithm stable. The update is as follows:

[Equation image not reproduced: soft update of the target network parameters]

In a specific implementation, as a preferred embodiment of the present invention, improving the designed multi-robot MATD3 algorithm in step S3 on the basis of the BiGRU structure includes:

in order to improve the generalization ability of the algorithm, introducing a bidirectional gated recurrent unit (BiGRU) into the structure of each Critic network, and using the bidirectional connections of the BiGRU to link the robots of the same team to one another, so that the robots can exchange information and share real-time information. In addition, by exploiting the parameter-sharing property of the BiGRU, robots in the same team share parameters; the number of robots is independent of the number of parameters, and when the number of robots changes the network structure expands automatically, so the method can be applied when the number of robots changes. The framework of the MATD3 algorithm based on the BiGRU structure is shown in Fig. 4.

In a specific implementation, as a preferred embodiment of the present invention, designing the reward function in step S4, based on the multi-robot round-up strategy with individual and team rewards, includes:

The overall reward function of robot i at the t-th time step consists of three parts, expressed as:

[Equation image not reproduced: overall reward of robot i, composed of three terms]

where the three terms are the round-up reward, the robot-robot collision avoidance reward, and the scene-boundary collision avoidance reward.

The round-up reward function is designed as follows. It is intended to encourage each pursuing robot to capture the escaping robot and to encourage the pursuing robots to cooperate as a team to form a good round-up formation. It is specifically defined as:

[Equation image not reproduced: piecewise definition of the round-up reward]

where d_{i,e} and d_{j,e} denote the distances from pursuing robots i and j to the escaping robot, and d_cap denotes the capture distance. If the target is captured by robot i (i.e. d_{i,e} ≤ d_cap), robot i receives an individual capture reward and every other robot j receives a team capture reward; the two values are set so as to encourage the pursuing robots to cooperate with one another. While the target has not yet been captured, each pursuing robot receives a negative reward -ω_q q - ω_d d_{i,e} at every step, a weighted linear combination of an individual term (its distance d_{i,e} to the target) and a group term (the formation score q), where the formation score q is defined as:

[Equation image not reproduced: definition of the formation score q]

In this expression, n denotes the number of robots, the pursuing robot closest to the target is defined as robot 0, and the unit vectors point from the robots toward the target. The formation score encourages the robots to spread out around the target, i.e. to approach it from different directions and form an encirclement. The weights ω_q and ω_d should be chosen so that, while the robots approach the target, the reward is dominated by the formation score, which encourages the pursuing robots to cooperate as a team and form a good round-up formation.

The robot-robot collision avoidance reward function is defined as follows:

[Equation image not reproduced: robot-robot collision avoidance reward]

where δ_safety is the safety distance, ζ is a negative constant, d_{i,o} is the distance between robot i and obstacle o, and d_{i,j} is the distance between robots i and j;

The scene-boundary collision avoidance reward function, which guides a robot away from colliding with the boundary of the motion scene, is defined as follows:

[Equation image not reproduced: scene-boundary collision avoidance reward]

where δ_safety is the safety distance, ξ is a negative constant, and d_{i,b} is the distance between robot i and the boundary.

In a specific implementation, as a preferred embodiment of the present invention, combining the designed reward function with the improved MATD3 algorithm in step S5 to obtain the MATD3-BiGRU based multi-robot cooperative round-up strategy algorithm includes the following steps (the corresponding formulas appear as equation images in the original and are not reproduced here):

initializing, for each robot i, the two Critic networks and their parameters θ_{i,1} and θ_{i,2}, the Actor network and its parameters φ_i, the two target Critic network parameters θ′_{i,1} ← θ_{i,1} and θ′_{i,2} ← θ_{i,2}, the target Actor network parameters φ′_i ← φ_i, the experience replay buffer D, and the state of the robot;

for each robot i, selecting an action according to the exploration noise and obtaining the corresponding reward;

storing the sample in the experience replay buffer D and randomly sampling M samples from the buffer;

computing the target value;

updating the Critic network parameters;

updating the Actor network parameters through the policy gradient;

updating the target network parameters by the moving-average (soft update) method.

Example

According to the many-to-one pursuit confrontation scenario, a many-to-one pursuit-evasion environment is built as shown in Fig. 5; the black circles are the pursuing robots, the white circle is the escaping robot, and the black rectangles in the scene represent obstacles. The simulation environment is written in Python, using PyCharm Community 2020.3 and Anaconda3; the deep learning framework is Baidu PaddlePaddle 2.2, and the training optimizer used in the experiments is the Adam optimizer (AdamOptimizer). The computer runs the Windows 10 operating system with an Intel i5-9400F CPU @ 2.90 GHz and 8 GB of memory.

In particular, the following settings apply to the pursuit-evasion experiments with multiple pursuing robots and a single escaping robot. All pursuing robots are homogeneous, i.e. their parameters and performance are identical. Both sides move within a two-dimensional continuous bounded circular region with a radius of 10 m. The escape strategy of the escaping robot is also obtained by training; it can only obtain rewards by avoiding capture and avoiding collisions. The rewards for whether it is captured are given in Table 3, and its collision avoidance rewards are set the same as those of the pursuing robots. The training parameters used in the experiments are listed in Table 1, and the environment parameters are listed in Table 2.

Table 1. Training hyperparameters [table image not reproduced]

Table 2. Experimental parameters of the many-to-one pursuit-evasion confrontation environment [table image not reproduced]

Table 3. Reward design of the escaping robot [table image not reproduced]

Both the pursuing robots and the escaping robot are trained with the MATD3 algorithm for a total of 20,000 episodes. In Experiment 1, three pursuing robots round up one escaping robot; after the algorithm converges (at 10,000 episodes), three new pursuing robots are added, and a test is performed every 100 episodes. The sum of the reward values of all pursuing robots is shown in Fig. 6. It can be seen that the overall reward drops because more robots make collisions more likely, but with the BiGRU-based MATD3 algorithm the final reward value drops less than with the original MATD3 algorithm, and the reward curve returns to a stable level faster.

In Experiment 2, six pursuing robots initially round up one escaping robot; after the algorithm converges (at 10,000 episodes), three pursuing robots withdraw and the remaining three robots continue the round-up task, with a test performed every 100 episodes. The sum of the reward values of all pursuing robots is shown in Fig. 7. It can be seen that the BiGRU-based MATD3 algorithm converges faster and reaches a higher final reward value.

Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (6)

1. An improved MATD3 multi-robot cooperative round-up method based on a BiGRU structure, characterized by comprising the following steps:

establishing a pursuit-evasion confrontation scenario;

designing a MATD3 algorithm suitable for multiple robots based on the established pursuit-evasion scenario and the reinforcement learning TD3 algorithm;

improving the designed multi-robot MATD3 algorithm on the basis of the BiGRU structure;

designing a reward function based on a multi-robot round-up strategy with individual and team rewards;

and combining the designed reward function with the improved MATD3 algorithm to obtain a multi-robot cooperative round-up strategy algorithm based on MATD3-BiGRU.

2. The improved MATD3 multi-robot cooperative round-up method based on the BiGRU structure according to claim 1, wherein the establishing a pursuit-evasion confrontation scenario comprises:

establishing a pursuit-evasion confrontation scenario comprising N pursuing robots and a single escaping robot, and expressing the equation of motion of each robot as:

[Equation image not reproduced: kinematic model of each robot]

wherein v_i denotes the speed of the pursuing robot or the escaping robot, ψ_i denotes the velocity direction angle of the pursuing robot or the escaping robot, and a_i denotes the acceleration of the pursuing robot or the escaping robot.

3. The improved MATD3 multi-robot cooperative round-up method based on the BiGRU structure according to claim 1, wherein the designing a MATD3 algorithm suitable for multiple robots based on the established pursuit-evasion scenario and the reinforcement learning TD3 algorithm comprises:

extending the reinforcement learning TD3 algorithm to a MATD3 algorithm suitable for multiple robots by using a framework of centralized training and decentralized execution, wherein each agent in the MATD3 algorithm comprises six networks, namely a current Actor network, a target Actor network, two current Critic networks and two target Critic networks; the Actor network is executed in a decentralized manner, taking its own state information as input and outputting the decision action; the two Critic networks are trained in a centralized manner, taking as input not only their own state and action information but also the state and action information of the other agents, and outputting two Q values close to the true value; and, when computing the target Q value, the smaller target Q value is selected to guide the parameter update of the Actor network in order to alleviate the overestimation problem;

in the multi-robot cooperative round-up task, the learning goal of each robot is to maximize the expected reward, i.e. the Q value; for a pursuing robot, the learned objective function is:

[Equation image not reproduced: objective function of the Actor network]

wherein φ is an Actor network parameter, s is the current state of the robot, μ is the deterministic policy corresponding to the policy-network output, θ is a Critic network parameter, N is the number of pursuing robots, and D is the empirical data, namely the data in the experience replay buffer;

at each time t, the loss function of the Critic network is:

[Equation images not reproduced: Critic loss function and target value]

wherein θ′ is a parameter of the target Critic network, φ′ is a parameter of the target Actor network, the structures of the target Critic network and the target Actor network are identical to those of the Critic network and the Actor network, and their parameter update frequencies are different;

training the Actor network by maximizing the objective function, wherein the parameters of the Actor network are updated by stochastic gradient ascent, and, according to the objective function defined above, the gradient of the objective function is:

[Equation image not reproduced: gradient of the objective function]

training the Critic network by minimizing the loss function of the Critic network, and updating the Critic network parameters by stochastic gradient descent; according to the loss-function formula of the Critic network, the gradient of the Critic loss function is calculated as:

[Equation images not reproduced: gradient of the Critic loss function]

and finally updating the target network parameters by a moving-average method, which is used to ensure the stability of the algorithm, as follows:

[Equation image not reproduced: soft update of the target network parameters]

4. The improved MATD3 multi-robot cooperative round-up method based on the BiGRU structure according to claim 1, wherein the improving the designed MATD3 algorithm suitable for multiple robots on the basis of the BiGRU structure comprises:

introducing a bidirectional gated recurrent unit into the structure of each Critic network, and connecting the robots of the same team to one another by means of the bidirectional connection characteristic of the bidirectional gated recurrent unit, so that the robots can exchange information and share real-time information;

and, by means of the parameter-sharing characteristic of the bidirectional gated recurrent unit, enabling parameter sharing among the robots of the same team, so that the number of robots is independent of the number of parameters and the network structure expands automatically when the number of robots changes, whereby the method can be applied when the number of robots changes.

5. The improved MATD3 multi-robot cooperative round-up method based on the BiGRU structure according to claim 1, wherein the designing a reward function based on the multi-robot round-up strategy with individual and team rewards comprises:

the overall reward function of robot i in the t-th time step consists of three parts, denoted as:

[Equation image not reproduced: overall reward of robot i, composed of three terms]

wherein the three terms are the round-up reward, the robot-robot collision avoidance reward, and the scene-boundary collision avoidance reward;

the round-up reward function is designed as follows; it aims at encouraging each pursuing robot to capture the escaping robot and at encouraging the pursuing robots to cooperate as a team to form a good round-up formation, and is specifically defined as:

[Equation image not reproduced: piecewise definition of the round-up reward]

wherein d_{i,e} and d_{j,e} respectively represent the distances from pursuing robots i and j to the escaping robot, and d_cap represents the capture distance; if the target is captured by robot i, i.e. d_{i,e} ≤ d_cap, robot i receives an individual capture reward and every other robot j receives a team capture reward, the two values being set so as to encourage the pursuing robots to cooperate with one another; during the movement before the target is captured, each pursuing robot receives a negative reward -ω_q q - ω_d d_{i,e}, which is a weighted linear combination of an individual reward (its distance d_{i,e} to the target) and a group reward (the formation score q), wherein the formation score q is defined as:

[Equation image not reproduced: definition of the formation score q]

in the above formula, n represents the number of robots, the pursuing robot closest to the target is defined as robot 0, and the unit vectors point from the robots toward the target; this formation score encourages the robots to spread out around the target, i.e. to approach the target from different directions to form an encirclement; the weights ω_q and ω_d should be selected so that the reward of a robot approaching the target is primarily determined by the formation score, thereby encouraging the pursuing robots to cooperate as a team and form a good round-up formation;

the robot-robot collision avoidance reward function is defined as follows:

[Equation image not reproduced: robot-robot collision avoidance reward]

wherein δ_safety is the safety distance, ζ is a negative constant, d_{i,o} is the distance between robot i and obstacle o, and d_{i,j} is the distance between robots i and j;

the scene-boundary collision avoidance reward function, which guides a robot away from colliding with the boundary of the motion scene, is defined as follows:

[Equation image not reproduced: scene-boundary collision avoidance reward]

wherein δ_safety is the safety distance, ξ is a negative constant, and d_{i,b} is the distance between robot i and the boundary.

6. The improved MATD3 multi-robot cooperative round-up method based on the BiGRU structure according to claim 1, wherein the combining the designed reward function with the improved MATD3 algorithm to obtain the multi-robot cooperative round-up strategy algorithm based on MATD3-BiGRU comprises the following steps (the corresponding formulas appear as equation images in the original and are not reproduced here):

initializing, for each robot i, the two Critic networks and the corresponding Critic network parameters θ_{i,1} and θ_{i,2}, the Actor network and the corresponding Actor network parameters φ_i, the two target Critic network parameters θ′_{i,1} ← θ_{i,1} and θ′_{i,2} ← θ_{i,2}, the target Actor network parameters φ′_i ← φ_i, the experience replay buffer D, and the state of the robot;

selecting, for each robot i, an action based on the exploration noise and obtaining the corresponding reward;

storing the sample in the experience replay buffer D, and randomly sampling M samples from the buffer;

computing the target value;

updating the Critic network parameters;

updating the Actor network parameters by means of the policy gradient;

and updating the target network parameters by the moving-average method.

CN202310130915.3A 2023-02-17 2023-02-17 An improved MATD3 multi-robot cooperative round-up method based on BiGRU structure Pending CN116362289A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310130915.3A CN116362289A (en) 2023-02-17 2023-02-17 An improved MATD3 multi-robot cooperative round-up method based on BiGRU structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310130915.3A CN116362289A (en) 2023-02-17 2023-02-17 An improved MATD3 multi-robot cooperative round-up method based on BiGRU structure

Publications (1)

Publication Number Publication Date
CN116362289A true CN116362289A (en) 2023-06-30

Family

ID=86931399

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310130915.3A Pending CN116362289A (en) 2023-02-17 2023-02-17 An improved MATD3 multi-robot cooperative round-up method based on BiGRU structure

Country Status (1)

Country Link
CN (1) CN116362289A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117103282A (en) * 2023-10-20 2023-11-24 南京航空航天大学 A collaborative motion control method for dual-arm robots based on MATD3 algorithm
CN117103282B (en) * 2023-10-20 2024-02-13 南京航空航天大学 Double-arm robot cooperative motion control method based on MATD3 algorithm
CN117644520A (en) * 2024-01-29 2024-03-05 北京市农林科学院智能装备技术研究中心 Multi-arm robot picking task planning method, device, electronic equipment and medium
CN117644520B (en) * 2024-01-29 2024-05-10 北京市农林科学院智能装备技术研究中心 Multi-arm robot picking task planning method, device, electronic equipment and medium
CN118862315A (en) * 2024-09-24 2024-10-29 西北工业大学 A method of aircraft penetration based on deep reinforcement learning
CN118862315B (en) * 2024-09-24 2025-01-24 西北工业大学 A method of aircraft penetration based on deep reinforcement learning


Legal Events

Date Code Title Description
2023-06-30 PB01 Publication
2023-07-18 SE01 Entry into force of request for substantive examination