
CN116068900A - Reinforcement learning behavior control method for multiple nonholonomic constrained mobile robots - Google Patents

Info

Publication number: CN116068900A (granted as CN116068900B)
Application number: CN202310255701.9A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: behavior, nonholonomic, task, mobile robot, distributed
Inventors: 黄捷, 张祯毅
Assignee (original and current): Fuzhou University
Legal status: Granted; Active
Legal events: application filed by Fuzhou University; priority to CN202310255701.9A; publication of CN116068900A; application granted; publication of CN116068900B

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05B: CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02: Adaptive control systems as above, electric
    • G05B13/04: Adaptive control systems as above, electric, involving the use of models or simulators
    • G05B13/042: Adaptive control systems as above, in which a parameter or coefficient is automatically adjusted to optimise the performance
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02: Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention provides a reinforcement learning behavior control method for multiple nonholonomic constrained mobile robots. The method establishes a kinematic model of the robots based on the nonholonomic constraint matrix, establishes a dynamic model based on the Euler-Lagrange equations, constructs basic behaviors from the established kinematic model, and combines the designed basic behaviors into composite behaviors under different priority orders through null-space projection. By applying this technical scheme, the use of a centralized unit during task execution can be avoided, and the dynamism and intelligence of behavior priority switching are improved.

Description

Reinforcement learning behavior control method for mobile robots with multiple nonholonomic constraints

Technical Field

The invention relates to the technical field of intelligent robots, and in particular to a reinforcement learning behavior control method for multiple nonholonomically constrained mobile robots.

Background Art

In recent years, nonholonomic mobile robots have been widely used in various fields. Since a nonholonomic mobile robot cannot be stabilized by any time-invariant smooth state feedback control law, its tracking control problem has been studied as a priority. Through group cooperation, multiple nonholonomic mobile robots usually achieve better task performance than a single robot. However, nonholonomic constraints often degrade team performance, and implementing cooperative control under nonholonomic constraints poses a challenging control problem.

Existing cooperative control of multiple nonholonomic mobile robots is usually based on centralized or distributed frameworks. Centralized methods use a single centralized controller to activate team behaviors and avoid violating the nonholonomic constraints. Since this controller must obtain global information, the scalability of centralized methods is unsatisfactory. Distributed methods instead avoid a centralized controller by using a set of networked controllers over a communication topology. Most distributed methods only address cooperative control problems with a single task or control objective. However, multi-task conflicts are common in cooperative control problems and cannot be ignored. Behavior control methods are among the most effective solutions. The original behavior control method is a hierarchical framework in which low-level behaviors are executed only after all high-level behaviors are completed. To improve task execution efficiency, a motor-schema behavior control framework was proposed that sums behavior commands with adjustable weights, but under it no single behavior is ever fully executed. By combining the advantages of the above two methods, the null-space-based behavior control method was proposed, which not only completes the highest-priority behavior but also partially executes lower-priority behaviors through null-space projection. Although the null-space-based behavior control method has been extended to different multi-agent system scenarios, it has the inherent defect of implicit centralization: it relies on a centralized task supervisor to assign behavior priorities. To address this, a distributed behavior control framework was first proposed for aggregation control, but it lacked task and controller stability analysis. Subsequently, the task error of distributed behavior control was proven to be asymptotically stable, but the result was limited to triangular formations in obstacle-free environments. A set of nonlinear fast terminal sliding-mode controllers was then designed for distributed behavior control, achieving finite-time convergence of the tracking error. Finally, by designing a fixed-time estimator and a terminal sliding-mode control law, both the task and tracking errors were made fixed-time stable.

However, existing distributed behavior control methods still have the following drawbacks: 1. The behavior priorities are fixed and pre-set, which leads to poor dynamic task performance and heavy reliance on human intelligence. 2. They lack optimality and intelligence, which leads to excessive consumption of control resources to maintain good control performance, especially when behavior priorities are switched. 3. The control inputs are not subject to saturation constraints, so the actuators may violate their physical limits after behavior priorities are switched.

Summary of the Invention

In view of this, the purpose of the present invention is to provide a reinforcement learning behavior control method for multiple nonholonomic constrained mobile robots. A reinforcement learning controller is designed based on the identifier-actor-critic algorithm to learn the unknown dynamics and the optimal control policy of the system online, so that control performance and control cost remain balanced throughout task execution. Input saturation constraints are also considered to prevent the actuators from violating their actual physical limits.

To achieve the above object, the present invention adopts the following technical solution: a reinforcement learning behavior control method for multiple nonholonomic constrained mobile robots, comprising the following steps.

Step S1: establish a kinematic model of the multiple nonholonomic constrained mobile robots based on the nonholonomic constraint matrix, establish a dynamic model based on the Euler-Lagrange equations, construct basic behaviors from the established kinematic model, and combine the designed basic behaviors into composite behaviors under different priority orders through the null-space projection technique.

Step S2: model behavior priority switching as a decentralized partially observable Markov decision process; under the centralized-training, distributed-execution reinforcement learning framework, set the reference velocity commands of the composite behaviors as the action set, select the position and priority of each nonholonomic constrained robot and of its neighboring robots as the observation set, and design the reward function, thereby constructing the distributed reinforcement learning task supervisors (DRLMSs).

Step S3: with the goal of balancing control performance and control cost, introduce an identifier-actor-critic reinforcement learning algorithm to identify the unknown dynamic model online, implement the control policy, and evaluate the control performance, thereby designing the reinforcement learning controllers (RLCs).

Step S4: based on adaptive control theory, design adaptive compensators to maintain optimal control performance and cancel the saturation effect in real time.

In a preferred embodiment, step S1 specifically includes the following steps.

Step S11: Kinematic modeling of mobile robots with multiple nonholonomic constraints

Consider a group of N (N > 2) nonholonomic constrained mobile robots, each driven by differential wheels, i = 1, ..., N. The generalized velocity of the i-th nonholonomic constrained mobile robot is expressed as

vi = [νi, ωi]T, (1)

where νi = (vR,i + vL,i)/2 and ωi = (vR,i - vL,i)/li are the linear and angular velocities, vR,i and vL,i are the linear speeds of the right and left wheels, li is the distance between the left and right wheels, and ℝ denotes the set of real numbers.

Then, the kinematic equation of the i-th nonholonomic constrained mobile robot relates the derivative of the generalized state to the generalized velocity vi through the nonholonomic constraint matrix (equation (2)), where xi denotes the generalized state, composed of the position and the orientation of the robot.

In addition, the kinematic equation of the i-th nonholonomic constrained mobile robot in the inertial coordinate frame is

ẋi = Si(xi)θ̇i, (3)

where ri is the wheel radius, Si(xi) denotes the nonholonomic constraint matrix in the inertial frame, and θ̇i collects the rotational speeds of the right and left wheels.
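For illustration, the kinematics (1)-(2) can be simulated numerically. The following Python snippet is only a minimal sketch assuming the standard differential-drive (unicycle) model; the function names, the wheel_base value, and the constant wheel speeds are hypothetical and not taken from the patent.

```python
import numpy as np

def wheel_to_body(v_right, v_left, wheel_base):
    """Map left/right wheel linear speeds to body linear/angular velocity,
    in the spirit of equation (1)."""
    v_lin = 0.5 * (v_right + v_left)
    v_ang = (v_right - v_left) / wheel_base
    return v_lin, v_ang

def unicycle_step(x, v_lin, v_ang, dt):
    """One Euler-integration step of the standard unicycle kinematics
    x_dot = S(x) @ [v_lin, v_ang], cf. equation (2)."""
    theta = x[2]
    S = np.array([[np.cos(theta), 0.0],
                  [np.sin(theta), 0.0],
                  [0.0,           1.0]])
    return x + dt * S @ np.array([v_lin, v_ang])

# Example: drive robot i along a gentle arc for 2 s.
x = np.array([0.0, 0.0, 0.0])            # [px, py, orientation]
for _ in range(200):
    v_lin, v_ang = wheel_to_body(v_right=0.55, v_left=0.45, wheel_base=0.3)
    x = unicycle_step(x, v_lin, v_ang, dt=0.01)
print(x)
```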

Step S12: Dynamic modeling of mobile robots with multiple nonholonomic constraints

Using the Euler-Lagrange equations, the dynamic model of the i-th nonholonomic constrained mobile robot is derived as

Mi(xi)ẍi + Ci(xi, ẋi)ẋi + Gi(xi) + Fi(xi) = Ei(xi)τi - AiT(xi)λi, (4)

where Mi(xi) is the inertia matrix, Ci(xi, ẋi) is the Coriolis and centripetal force matrix, Gi(xi) is the gravity matrix, Fi(xi) denotes an unknown nonlinear term, Ei(xi) is the designable input gain matrix, τi is the control input, and λi is the nonholonomic constraint force.

First, the differential form of formula (3) is derived as

ẍi = Ṡi(xi)θ̇i + Si(xi)θ̈i, (5)

where Ṡi(xi) denotes the differential of Si(xi) and θ̈i is the angular acceleration of the wheels.

Then, substituting formulas (3) and (5) into (4) and left-multiplying by SiT(xi), the reduced equation (6) is obtained, in which the transformed inertia, Coriolis/centripetal, and input gain terms are defined accordingly and the nonholonomic constraint force is eliminated.

According to Assumption 2, formula (6) can be rewritten as equation (7), in which the dynamics are separated into an exact (known) term and an inexact (uncertain) term.

Assumption 1: The multiple nonholonomic constrained mobile robot system operates in a static scene, and all non-robot obstacles are static and fixed.

Assumption 2: The input gain matrix Ei(xi) always satisfies the stated design condition.

Step S13: Construction of basic behaviors of mobile robots with multiple nonholonomic constraints

Assume that each nonholonomic constrained mobile robot has M basic behaviors, where the k-th basic behavior of the i-th nonholonomic constrained mobile robot is modeled mathematically using a task variable σi,k as

σi,k = gi,k(xi), (8)

where gi,k(·) denotes the task function.

Then, the differential form of the task variable σi,k is expressed as

σ̇i,k = Ji,k(xi)ẋi, (9)

where Ji,k(xi) is the task Jacobian matrix.

Finally, the reference velocity command of the k-th basic behavior of the i-th nonholonomic constrained mobile robot can be expressed as

vi,k = Ji,k†(σ̇d,i,k + Λi,k σ̃i,k), (10)

where Ji,k† is the right pseudo-inverse of the task Jacobian Ji,k, σd,i,k is the desired task function, Λi,k is the task gain, and σ̃i,k is the task error.
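The closed-loop inverse-kinematics structure of equation (10) can be sketched directly in code. The snippet below is only an illustrative implementation of a Jacobian right pseudo-inverse velocity command; the example Jacobian and numerical values are hypothetical.

```python
import numpy as np

def behavior_velocity(J, sigma_des_dot, sigma_des, sigma, gain):
    """Reference velocity of one basic behavior via the right pseudo-inverse
    of the task Jacobian, in the spirit of equation (10):
    v = J^+ (sigma_des_dot + gain * (sigma_des - sigma))."""
    J_pinv = np.linalg.pinv(J)          # right pseudo-inverse of the Jacobian
    task_error = sigma_des - sigma
    return J_pinv @ (sigma_des_dot + gain * task_error)

# Example: a 1-D "keep distance to obstacle" task for a planar robot.
J = np.array([[0.8, 0.6, 0.0]])         # 1x3 task Jacobian (illustrative)
v = behavior_velocity(J,
                      sigma_des_dot=np.zeros(1),
                      sigma_des=np.array([1.0]),   # safety distance
                      sigma=np.array([0.7]),       # current distance
                      gain=2.0)
print(v)   # velocity command pushing the task value toward 1.0
```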

Without loss of generality, the obstacle avoidance behavior, the distributed formation behavior, and the distributed reconstruction behavior are designed as follows.

Obstacle avoidance behavior: obstacle avoidance is a local behavior that ensures that a nonholonomic constrained mobile robot avoids obstacles near its path. Its task function, desired task, and task Jacobian matrix are given in equations (11)-(13), where the task function is the minimum distance between the i-th nonholonomic constrained mobile robot and the obstacle, dOA is the safety distance, the relative position of the minimum-distance point and the desired obstacle-avoidance direction appear in the Jacobian, and + and - indicate that the obstacle lies to the left or to the right of the i-th robot, respectively.

Distributed formation behavior: distributed formation is a distributed cooperative behavior that ensures that the multiple nonholonomic constrained mobile robots form the desired formation using only the states of their neighbors. Its task function, desired task, and task Jacobian matrix are given in equations (14)-(16), in which the estimated state of the distributed formation behavior is obtained by an adaptive estimator with the update law in equation (17), where κDF is a positive constant, the relative formation position and the leader state enter the update law, and Ni denotes the set of neighbors of the i-th nonholonomic constrained mobile robot.
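The exact update law (17) is given only as an image in this extraction. The following Python snippet is therefore just a generic consensus-style stand-in showing how such a distributed estimator of the formation reference typically behaves; the function names, gains, and neighbor values are illustrative assumptions.

```python
import numpy as np

def estimator_update(x_hat, neighbor_x_hats, leader_state, sees_leader,
                     kappa_df, dt):
    """One step of a consensus-style adaptive estimator of the formation
    reference, standing in for the update law (17)."""
    err = sum(x_hat - xj for xj in neighbor_x_hats)
    if sees_leader:                     # only robots adjacent to the leader
        err += x_hat - leader_state
    return x_hat - dt * kappa_df * err

x_hat = np.zeros(2)
for _ in range(500):
    x_hat = estimator_update(x_hat,
                             neighbor_x_hats=[np.array([0.9, 0.4])],
                             leader_state=np.array([1.0, 0.5]),
                             sees_leader=True,
                             kappa_df=5.0, dt=0.01)
print(x_hat)   # settles between the neighbor estimate and the leader state
```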

Distributed reconstruction behavior: distributed reconstruction is a distributed cooperative behavior that ensures that the multiple nonholonomic constrained mobile robots reconstruct the desired formation using only the states of their neighbors. Its task function, desired task, and task Jacobian matrix are given in equations (18)-(20), in which the estimated state is obtained by an adaptive estimator with the update law in equation (21), where κDR is a positive constant and the formation reconstruction matrix specifies the new formation.

Step S14: Construction of composite behaviors of mobile robots with multiple nonholonomic constraints

A composite task is a combination of multiple basic behaviors in a given priority order. Consider the task functions of the i-th nonholonomic constrained mobile robot, where km ∈ NM, NM = {1, ..., M}, mk denotes the dimension of the task space, and M denotes the number of tasks. Define a time-dependent priority function gi(km, t): NM × [0, ∞] → NM. At the same time, define a task hierarchy with the following rules:

1) A task kα with priority gi(kα) must not interfere with a task kβ with priority gi(kβ) if gi(kα) ≥ gi(kβ), kα ≠ kβ.

2) The mapping from velocity to task velocity is represented by the task Jacobian matrix of the task.

3) The dimension of the lowest-priority task mM may be larger than the remaining available dimension; therefore the dimension mn must be ensured to be greater than the total dimension of all tasks.

4) The value of gi(km) is assigned by the task supervisor according to the task requirements and sensor information.

By assigning given priorities to the basic tasks, the velocity of the composite task at time t is expressed by equations (22)-(24), in which the behavior priorities determine the ordering and the augmented Jacobian of the null-space projection propagates lower-priority velocity components through the null space of the higher-priority tasks.
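A generic null-space-based composition can be sketched as follows. This is a minimal sketch of the standard augmented-Jacobian null-space projection; it does not reproduce the patent's equations (22)-(24) verbatim, and the example Jacobians and velocities are illustrative.

```python
import numpy as np

def nsb_compose(jacobians, velocities):
    """Null-space-based composition of behavior velocities, ordered from
    highest to lowest priority: each lower-priority command is projected
    onto the null space of the augmented Jacobian of all higher-priority
    tasks."""
    n = velocities[0].shape[0]
    v = np.zeros(n)
    J_aug = np.zeros((0, n))                 # augmented Jacobian so far
    for J, vk in zip(jacobians, velocities):
        N = np.eye(n) - np.linalg.pinv(J_aug) @ J_aug if J_aug.size else np.eye(n)
        v = v + N @ vk                       # filter vk through the null space
        J_aug = np.vstack([J_aug, J])
    return v

# Example: obstacle avoidance (priority 1) plus formation keeping (priority 2).
J_oa, v_oa = np.array([[1.0, 0.0]]), np.array([0.4, 0.0])
J_df, v_df = np.array([[0.0, 1.0], [1.0, 0.0]]), np.array([0.2, 0.3])
print(nsb_compose([J_oa, J_df], [v_oa, v_df]))   # -> [0.4, 0.3]
```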

In a preferred embodiment, step S2 is specified as follows. Define the centralized training environment such that the global state st ∈ S collects the joint positions, the joint priorities, and the formation flag, where S denotes the global state set. Define bi,t = {vr,i,t} ∈ B as the local behavior, where B denotes the behavior set. Define oi,t ∈ O as an independent local observation, where si,t = {xi, Pri} is the local state, Ni denotes the neighbors of the i-th nonholonomic constrained mobile robot, and O denotes the local observation set. Because the observations are local, define the local behavior-observation history as zi,t ∈ Z, where Z denotes the set of behavior-observation histories. All distributed reinforcement learning task supervisors contribute to one reward signal, and the reward function is designed as

rt = r1 + r2, (25)

where r1 and r2 are given in equations (26) and (27): r1 rewards achieving the task objective, distinguishing the no-formation, reconstructed-formation, and desired-formation states, and r2 is set to reduce behavior switching.
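The exact values in equations (26)-(27) are not reproduced in this extraction, so the snippet below only illustrates the stated structure of the team reward, rt = r1 + r2, with r1 tied to the formation state and r2 penalizing behavior switching; all numeric values and the behavior labels are placeholders.

```python
def team_reward(formation_flag, prev_behaviors, behaviors):
    """Team reward r_t = r1 + r2 in the spirit of equations (25)-(27);
    the numeric values are placeholders, not the patent's values."""
    # r1: reward progress toward the desired formation.
    r1 = {"no_formation": -1.0,
          "reconfigured_formation": 0.5,
          "desired_formation": 1.0}[formation_flag]
    # r2: penalize unnecessary behavior-priority switching.
    switches = sum(b != pb for b, pb in zip(behaviors, prev_behaviors))
    r2 = -0.1 * switches
    return r1 + r2

print(team_reward("reconfigured_formation",
                  prev_behaviors=["OA", "DF", "DF"],
                  behaviors=["DF", "DF", "DF"]))
```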

The multiple nonholonomic constrained mobile robots interact with the environment at time step t: the i-th robot receives a local observation oi,t, takes its previous behavior bi,t-1 into account, selects a behavior bi,t according to an ε-greedy policy with decay factor γε, and then receives a team reward rt and transitions to the next local observation oi,t+1. Specifically, the centralized training of the distributed reinforcement learning task supervisors is carried out with a hierarchical, progressive structure consisting of independent Q-value modules and a mixing module. First, every nonholonomic constrained mobile robot has an independent Q-value module: a recurrent Q-network whose gated recurrent unit takes the hidden state hi,t-1, the local observation oi,t, and the previous behavior bi,t-1 as inputs and outputs the local Q-value. Then, the mixing module generates the joint Q-value by summing all local Q-values, as given in equation (28), where the parameters of the independent Q-value networks appear.

After sampling at time step t, the experience tuple (zt, bt, rt, zt+1) is stored in an experience replay buffer. In particular, a minibatch of experiences is sampled from the buffer to reduce data correlation and improve sample efficiency. Training then proceeds to time step t+1 and stops once all Ttotal episodes are completed. The distributed reinforcement learning task supervisors are trained by minimizing the temporal-difference loss in equation (29), where the parameters of the target network appear.
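The centralized-training structure described above (per-robot recurrent Q-networks whose local Q-values are summed into a joint Q-value and trained with a TD loss against a target network) can be sketched as follows. The use of PyTorch, the layer sizes, and all names are assumptions for illustration, not details from the patent.

```python
import torch
import torch.nn as nn

class RecurrentQNet(nn.Module):
    """Per-robot recurrent Q-network: (observation, previous behavior,
    hidden state) -> (local Q-values, new hidden state)."""
    def __init__(self, obs_dim, n_behaviors, hidden=64):
        super().__init__()
        self.fc_in = nn.Linear(obs_dim + n_behaviors, hidden)
        self.gru = nn.GRUCell(hidden, hidden)
        self.fc_out = nn.Linear(hidden, n_behaviors)

    def forward(self, obs, prev_behavior_onehot, h):
        x = torch.relu(self.fc_in(torch.cat([obs, prev_behavior_onehot], -1)))
        h_new = self.gru(x, h)
        return self.fc_out(h_new), h_new

def joint_q(local_qs_taken):
    """Additive mixing: the joint Q-value is the sum of the local Q-values
    of the behaviors actually taken, as in equation (28)."""
    return torch.stack(local_qs_taken, dim=0).sum(dim=0)

def td_loss(q_joint, reward, q_joint_target_next, gamma=0.99):
    """Squared TD error against a target network, in the spirit of (29)."""
    target = reward + gamma * q_joint_target_next.detach()
    return ((q_joint - target) ** 2).mean()
```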

Finally, after centralized training the multiple nonholonomic constrained mobile robots have learned a set of optimal distributed behavior-priority policies. In the actual scenario, the robots switch behavior priorities according to the learned policies. Once the behavior priority of each robot is determined at every sampling instant, the reference velocity vi,r and the reference trajectory xi,r are obtained through equations (22)-(24) and (2). According to equation (3), the reference velocity in the inertial coordinate frame and the reference trajectory θi,r are then further computed.

In a preferred embodiment, step S3 is specified as follows. Define the position and velocity tracking errors in the inertial coordinate frame as

ep,i = θi - θi,r, (30)

ev,i = θ̇i - θ̇i,r, (31)

where θi and θ̇i are the angles and angular velocities of the left and right wheels, and θi,r and θ̇i,r are the reference position and reference velocity, respectively.

The differential forms of equations (30) and (31) are derived in equations (32) and (33), where the derivative of the reference velocity appears. The integrated tracking error ei, collecting the position and velocity tracking errors, is then defined in equations (34) and (35).

The value function is defined in equation (36) as the integral of a cost function of the tracking error and the control input, where αV, βV ∈ (0, 2) are adjustable cost parameters satisfying αV + βV = 2.
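The cost function itself appears only as an image in this extraction; the snippet below therefore shows one plausible quadratic instance of a running cost that trades off tracking performance against control effort using the stated parameters αV, βV. The quadratic form is an assumption, not the patent's definition.

```python
import numpy as np

def running_cost(e, u, alpha_v, beta_v):
    """Running cost balancing tracking performance and control effort,
    with adjustable parameters alpha_v, beta_v in (0, 2), alpha_v + beta_v = 2
    (cf. the value function in equation (36))."""
    assert 0 < alpha_v < 2 and 0 < beta_v < 2
    assert abs(alpha_v + beta_v - 2) < 1e-9
    return alpha_v * float(e @ e) + beta_v * float(u @ u)

print(running_cost(np.array([0.2, -0.1]), np.array([0.5, 0.5]),
                   alpha_v=1.2, beta_v=0.8))
```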

Let the optimal tracking control policy be denoted ui*. The optimal value function Vi* can then be expressed as the infimum of the value function over all admissible control policies, as given in equation (37).

By combining equations (32)-(35) and (37), the Hamilton-Jacobi-Bellman (HJB) equation is derived as equation (38), where the gradient of Vi* with respect to ei and the partial gradients of Vi* with respect to ep,i and ev,i appear.

By solving the stationarity condition of the HJB equation with respect to the control input, the optimal control policy ui* is derived as equation (39). Furthermore, substituting equation (39) into (38) yields equation (40).

To implement ui*, equation (40) must be solved for the gradient of Vi*. However, because the dynamic model of the multiple nonholonomic constrained mobile robots is nonlinear and inexact, an analytical solution is difficult to obtain.

Therefore, the gradient of the optimal value function is decomposed as in equation (41), where the decomposition involves positive constants and an adaptive compensation term. Substituting equation (41) into (39) yields equation (42).

It is well known that neural networks have powerful approximation capabilities. Therefore, on given compact sets, the unknown terms fi(xi) and Vio are approximated by neural networks as in equations (43) and (44), where the ideal weight matrices have wf and wV neurons, the corresponding basis function vectors are used, and the approximation errors δf,i and δV,i are bounded as ||δf,i|| ≤ εf and ||δV,i|| ≤ εV with positive constants εf and εV.

Then, substituting equations (43) and (44) into (41) and (42) yields equations (45) and (46).

However, since the ideal weights are unknown, the control policy in equation (46) cannot be implemented directly. Therefore, an identifier-actor-critic reinforcement learning algorithm is used to learn the optimal control policy.

Specifically, the identifier neural network is designed to estimate the unknown nonlinear term as in equation (47), where the estimate of fi(xi) is formed from the identifier network weights. The update law of the identifier network is designed as in equation (48), which involves a positive-definite gain matrix and designed identifier parameters.

Then, the critic neural network is designed to evaluate the control performance as in equation (49), where the estimate of the value function is formed from the critic network weights. The update law of the critic network is designed as in equation (50), where γc,i is the learning rate of the critic.

Finally, the actor neural network is designed to implement the control input as in equation (51), where the actor network weights appear. The update law of the actor network is designed as in equation (52), where γa,i is the learning rate of the actor.
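The update laws (48), (50), and (52) are given only as images in this extraction. The snippet below therefore sketches only the shared structure of the three approximators: linear-in-features networks over radial-basis features, with a prediction-error-driven update shown for the identifier. The feature choice, sizes, and the simple gradient update are placeholders, not the patent's laws.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_features(x, centers, width=1.0):
    """Radial-basis feature vector phi(x) used by the approximators."""
    return np.exp(-np.sum((x[None, :] - centers) ** 2, axis=1) / width)

# Identifier network f_hat(x) = W_f^T phi(x), cf. equation (47).
centers = rng.uniform(-1.0, 1.0, size=(30, 2))
W_f = np.zeros((30, 2))

def identifier_step(W_f, x, x_dot_measured, known_input_term, lr=0.05):
    """Gradient-style identifier update driven by the prediction error;
    a placeholder for the patent's update law (48)."""
    phi = rbf_features(x, centers)
    pred = W_f.T @ phi + known_input_term   # predicted state derivative
    err = x_dot_measured - pred
    return W_f + lr * np.outer(phi, err)

# The critic V_hat = W_c^T phi(e) (eq. (49)) and the actor u = W_a^T phi(e)
# (eq. (51)) share this linear-in-features structure; their update laws
# (50) and (52) are design choices specified in the patent.
```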

In a preferred embodiment, step S4 is specified as follows. First, the control input τi is subject to a symmetric actuator saturation constraint with a known threshold τlim,i > 0, as expressed in equation (53).

Next, the control input is split into two terms as

τi = τ0,i + τΔ,i, (54)

where τ0,i is the nominal term and τΔ,i is the compensation term, which satisfies the condition in equation (55).

Finally, the adaptive compensator is designed with the update law in equation (56), which uses the designed adaptive compensator parameters.
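As a minimal illustration of the saturation handling in equations (53)-(54), the snippet below clips the commanded torque to the known limit and lets a first-order adaptive state track the undeliverable part of the command. The filter-style update is only a generic stand-in for the compensator law (56); gains and values are illustrative.

```python
import numpy as np

def apply_saturation(tau, tau_lim):
    """Symmetric actuator saturation, cf. equation (53)."""
    return np.clip(tau, -tau_lim, tau_lim)

def compensator_step(tau_delta_hat, tau_nominal, tau_lim, k_comp, dt):
    """Adaptive estimate of the saturation effect: tracks the part of the
    nominal command the actuator cannot deliver (stand-in for (56))."""
    excess = tau_nominal - apply_saturation(tau_nominal, tau_lim)
    return tau_delta_hat + dt * k_comp * (excess - tau_delta_hat)

tau_delta_hat = np.zeros(2)
tau_nominal = np.array([3.0, -0.5])      # commanded torque (nominal term)
tau_lim = np.array([2.0, 2.0])           # known thresholds tau_lim,i
for _ in range(100):
    tau_delta_hat = compensator_step(tau_delta_hat, tau_nominal, tau_lim,
                                     k_comp=10.0, dt=0.01)
tau_applied = apply_saturation(tau_nominal, tau_lim)   # respects the limit
print(tau_applied, tau_delta_hat)        # tau_delta_hat estimates the excess
```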

Compared with the prior art, the present invention has the following beneficial effects. The present invention takes the multiple nonholonomic constrained mobile robot system as the research object and proposes a distributed reinforcement learning behavior control method for multiple nonholonomic constrained mobile robots. First, by modeling behavior priority switching as a decentralized partially observable Markov decision process, the method proposes a set of novel distributed reinforcement learning task supervisors that learn optimal distributed behavior-priority policies, so that during task execution the null-space-based behavior control method can switch behavior priorities without relying on any centralized unit, which fundamentally removes the implicit-centralization defect of the null-space-based behavior control method. At the same time, switching behavior priorities with the learned optimal distributed policy not only remedies the fixed behavior priorities of the distributed behavior control framework and improves the dynamic performance of null-space-based behavior control, but also moves a large amount of online computation to the offline learning stage, reducing the dependence of null-space-based behavior control on high-performance hardware. Second, the method proposes reinforcement learning controllers that use an identifier-actor-critic reinforcement learning algorithm to learn the unknown dynamic model and the optimal control policy. Throughout task execution, the balance between control performance and control cost is maintained, especially when behavior priorities are switched; compared with existing null-space-based behavior control methods, the control cost is reduced, avoiding the situation where the multiple nonholonomic constrained mobile robots consume excessive control resources to maintain high control performance. Finally, in contrast to existing null-space-based behavior control methods that do not consider input saturation, the present invention enforces input saturation constraints to prevent the actuators of the multiple nonholonomic constrained mobile robots from exceeding their physical limits and designs a set of adaptive compensators to maintain optimal performance and cancel the saturation effect in real time.

Brief Description of the Drawings

Figure 1 is a block diagram of the principle of the distributed reinforcement learning behavior control method for multiple nonholonomic constrained mobile robots according to an embodiment of the present invention;

Figure 2 is a schematic diagram of the i-th nonholonomic constrained mobile robot according to an embodiment of the present invention;

Figure 3 is a pseudocode diagram of the distributed reinforcement learning task supervisor according to an embodiment of the present invention;

Figure 4 is a schematic diagram of the network topology of the multiple nonholonomic constrained mobile robots according to an embodiment of the present invention;

Figure 5 shows the selected simulation parameter values according to an embodiment of the present invention;

Figure 6 shows the task performance before learning of the distributed reinforcement learning task supervisors according to an embodiment of the present invention: (a) trajectories, (b) orientations, (c) distances between the nonholonomic constrained mobile robots and the obstacles, (d) behavior priorities;

Figure 7 shows the task performance after learning of the distributed reinforcement learning task supervisors according to an embodiment of the present invention: (a) trajectories, (b) orientations, (c) distances between the nonholonomic constrained mobile robots and the obstacles, (d) behavior priorities;

Figure 8 shows the task performance of the multiple nonholonomic constrained mobile robots with different task supervisors according to an embodiment of the present invention: (a) distributed reinforcement learning task supervisor, (b) distributed finite-state-machine task supervisor, (c) distributed model predictive control task supervisor, (d) traditional reinforcement learning task supervisor;

Figure 9 shows the task performance of the second nonholonomic constrained mobile robot with different task supervisors according to an embodiment of the present invention: (a) distributed reinforcement learning task supervisor, (b) distributed finite-state-machine task supervisor, (c) distributed model predictive control task supervisor, (d) traditional reinforcement learning task supervisor;

Figure 10 shows the performance of distributed reinforcement learning control with input saturation constraints according to an embodiment of the present invention: (a) trajectories, (b) control inputs, (c) tracking errors, (d) control cost;

Figure 11 shows the learning performance of distributed reinforcement learning according to an embodiment of the present invention: (a) identifier weights, (b) actor weights, (c) critic weights, (d) adaptive compensators;

Figure 12 shows the control performance of the fifth nonholonomic constrained mobile robot with and without input saturation constraints according to an embodiment of the present invention: (a) trajectories, (b) control inputs, (c) tracking errors, (d) control cost;

Figure 13 shows the trajectories of the multiple nonholonomic constrained mobile robots with different distributed behavior control methods according to an embodiment of the present invention: (a) distributed reinforcement learning behavior control, (b) finite-time distributed behavior control, (c) fixed-time distributed behavior control, (d) traditional reinforcement learning behavior control;

Figure 14 shows the control performance with different distributed behavior control methods according to an embodiment of the present invention: (a) trajectories, (b) control inputs, (c) tracking errors, (d) control cost.

Detailed Description

The present invention will be further described below in conjunction with the accompanying drawings and embodiments.

It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the present application. Unless otherwise specified, all technical and scientific terms used herein have the same meaning as commonly understood by a person of ordinary skill in the art to which the present application belongs.

It should also be noted that the terminology used herein is only for describing specific embodiments and is not intended to limit the exemplary embodiments according to the present application. As used herein, unless the context clearly indicates otherwise, the singular forms are intended to include the plural forms as well. In addition, it should be understood that when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of the stated features, steps, operations, devices, components and/or combinations thereof.

Step 1: Kinematic and dynamic models, basic behaviors, and composite behaviors

a. Kinematic modeling of mobile robots with multiple nonholonomic constraints

Consider a group of N (N > 2) nonholonomic constrained mobile robots, each driven by differential wheels; a schematic diagram of the i-th nonholonomic constrained mobile robot is shown in Figure 2, i = 1, ..., N. The generalized velocity of the i-th nonholonomic constrained mobile robot can be expressed as in equation (1), where the two components are the linear and angular velocities, obtained from the linear speeds of the left and right wheels and the distance between the two wheels; ℝ denotes the set of real numbers.

Then, the kinematic equation of the i-th nonholonomic constrained mobile robot can be expressed as in equation (2), where xi denotes the generalized state, composed of the position and the orientation, and the nonholonomic constraint matrix maps the generalized velocity to the state derivative.

In addition, the kinematic equation of the i-th nonholonomic constrained mobile robot in the inertial coordinate frame is given by equation (3), where ri is the wheel radius, Si(xi) denotes the nonholonomic constraint matrix in the inertial frame, and the rotational speeds of the right and left wheels enter as inputs.

b. Dynamic modeling of mobile robots with multiple nonholonomic constraints

Using the Euler-Lagrange equations, the dynamic model of the i-th nonholonomic constrained mobile robot can be derived as in equation (4), where Mi(xi) is the inertia matrix, Ci(xi, ẋi) is the Coriolis and centripetal force matrix, Gi(xi) is the gravity matrix, and the unknown nonlinear term, the designable input gain matrix Ei(xi), the control input τi, and the nonholonomic constraint force appear as defined in step S12.

First, the differential form of equation (3) can be derived as in equation (5), in which the differential of Si(xi) and the angular acceleration of the wheels appear. Then, substituting equations (3) and (5) into (4) and left-multiplying by SiT(xi) yields equation (6), with the transformed terms defined accordingly.

According to Assumption 2, equation (6) can be rewritten as equation (7), in which the dynamics are separated into an exact term and an inexact term.

Assumption 1: The multiple nonholonomic constrained mobile robot system operates in a static scene, and all non-robot obstacles are static and fixed.

Assumption 2: The input gain matrix Ei(xi) always satisfies the stated design condition.

c. Construction of basic behaviors of mobile robots with multiple nonholonomic constraints

Assume that each nonholonomic constrained mobile robot has M basic behaviors, where the k-th basic behavior of the i-th nonholonomic constrained mobile robot can be modeled mathematically using a task variable σi,k as

σi,k = gi,k(xi), (8)

where gi,k(·) denotes the task function.

Then, the differential form of the task variable σi,k can be expressed as in equation (9), where Ji,k(xi) is the task Jacobian matrix.

Finally, the reference velocity command of the k-th basic behavior of the i-th nonholonomic constrained mobile robot can be expressed as in equation (10), where Ji,k† is the right pseudo-inverse of the task Jacobian Ji,k, together with the desired task function, the task gain, and the task error.

Without loss of generality, the obstacle avoidance behavior, the distributed formation behavior, and the distributed reconstruction behavior are designed as follows.

Obstacle avoidance behavior (OA): obstacle avoidance is a local behavior that ensures that a nonholonomic constrained mobile robot avoids obstacles near its path. Its task function, desired task, and task Jacobian matrix are given in equations (11)-(13), where the task function is the minimum distance between the i-th nonholonomic constrained mobile robot and the obstacle, dOA is the safety distance, the relative position of the minimum-distance point and the desired obstacle-avoidance direction appear in the Jacobian, and + and - indicate that the obstacle lies to the left or to the right of the i-th robot, respectively.

Distributed formation behavior (DF): distributed formation is a distributed cooperative behavior that ensures that the multiple nonholonomic constrained mobile robots form the desired formation using only the states of their neighbors. Its task function, desired task, and task Jacobian matrix are given in equations (14)-(16), in which the estimated state of the distributed formation behavior is obtained by an adaptive estimator with the update law in equation (17), where κDF is a positive constant, the relative formation position and the leader state enter the update law, and Ni denotes the set of neighbors of the i-th nonholonomic constrained mobile robot.

Distributed Reconstruction (DR): the distributed reconstruction behavior is a distributed cooperative behavior intended to ensure that the multiple nonholonomic constrained mobile robots reconstruct the desired formation using only the states of their neighbors. Its task function, desired task, and task Jacobian matrix are given as equation images in the source; the estimated state of the reconstruction behavior appearing in them is computed by an adaptive estimator whose update law (also an equation image) involves a positive constant κDR and the formation reconstruction matrix Γi.

d. Construction of composite behaviors of multiple nonholonomic constrained mobile robots

A composite task is a combination of multiple basic behaviors in a given priority order. Let σi,km denote the task functions of the i-th nonholonomic constrained mobile robot, where km ∈ NM, NM = {1, ..., M}, mk denotes the dimension of the task space, and M denotes the number of tasks. Define a time-dependent priority function gi(km, t): NM × [0, ∞] → NM. Meanwhile, define a task hierarchy with the following rules:

1) A task kα with priority gi(kα) cannot interfere with a task kβ with priority gi(kβ) if gi(kα) ≥ gi(kβ), for all kα, kβ ∈ NM with kα ≠ kβ.

2) The mapping from the robot velocity to the task velocity is described by the task Jacobian matrix Ji,km.

3) The dimension mM of the lowest-priority task may be larger than the quantity shown in the source (equation image); therefore, ensure that the dimension mn is larger than the total dimension of all tasks.

4) The value of gi(km) is assigned by the task supervisor according to the task requirements and the sensor information.

By assigning the given priorities to the basic tasks, the velocity of the composite task at time t can be expressed by equations (22)-(24) (equation images in the source), in which the behavior priorities and the augmented Jacobian matrix of the null-space projection appear.
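Equations (22)-(24) are given only as images, so the following sketch shows a generic null-space-based composition consistent with rule 1 above (lower-priority velocities are projected into the null space of the augmented Jacobian of the higher-priority tasks); all names are illustrative.

import numpy as np

def compose_behaviors(jacobians, velocities):
    """Combine basic behaviors in priority order via null-space projection.

    jacobians  : list of task Jacobians J_1, ..., J_M, highest priority first
    velocities : list of the corresponding reference velocities from eq. (10)

    Lower-priority velocities are projected into the null space of all
    higher-priority tasks, so they cannot interfere with them (rule 1).
    """
    n = jacobians[0].shape[1]
    v = np.zeros(n)
    J_stack = None                       # augmented Jacobian of the tasks handled so far
    for J, v_k in zip(jacobians, velocities):
        if J_stack is None:
            N = np.eye(n)                # nothing to protect yet
        else:
            N = np.eye(n) - np.linalg.pinv(J_stack) @ J_stack   # null-space projector
        v = v + N @ v_k
        J_stack = J if J_stack is None else np.vstack((J_stack, J))
    return v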

Step 2: Design of Distributed Multi-agent Reinforcement Learning Task Supervisors

For distributed multiple nonholonomic constrained mobile robots, each agent must learn a local behavior-priority policy to achieve the cooperative task goal. Since the multiple nonholonomic constrained mobile robots can usually be trained in a centralized offline environment, the distributed behavior-priority switching problem can be modeled as a decentralized partially observable Markov decision process. Using the value-decomposition networks (VDN) reinforcement learning algorithm and the centralized training with distributed execution (CTDE) paradigm, a set of distributed reinforcement learning task supervisors is proposed.

Define the centralized training environment as ε. The global state comprises the joint positions, the joint priorities, and the formation flag, and S denotes the global state set. Define bi,t = {vr,i,t} ∈ B as the local behavior, where B denotes the behavior set. Since the distributed multi-agent reinforcement learning task supervisors cannot access the global state, partially observable states must be used. Define oi,t as the independent local observation, where si,t = {xi, Pri} is the local state, the neighbor set of the i-th nonholonomic constrained mobile robot is also observed, and O denotes the local observation set. Because of the partial observability, the local behavior-observation history is defined as zi,t ∈ Z, where Z denotes the behavior-observation history set. All distributed reinforcement learning task supervisors contribute to a single reward signal, and the reward function is designed as

rt = r1 + r2,  (25)

where r1 and r2 are given by equations (26) and (27) (equation images in the source) in terms of the flags of the no-formation, reconstructed-formation, and desired-formation states. r1 and r2 are reward signals designed to achieve the task goal and to reduce behavior switching, respectively.
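Equations (26)-(27) are images, so only the shape of the reward described above can be illustrated: a goal term keyed to the formation flag plus a penalty for behavior switching. All numeric values below are assumptions, not the patent's settings.

def team_reward(formation_flag, behaviors, prev_behaviors,
                r_goal=(0.0, 0.5, 1.0), switch_penalty=0.1):
    """Illustrative team reward of the form r_t = r_1 + r_2 (eq. 25).

    formation_flag : 0 = no formation, 1 = reconstructed formation, 2 = desired formation
    behaviors      : list of current behavior indices, one per robot
    prev_behaviors : list of behavior indices at the previous step
    """
    r1 = r_goal[formation_flag]                                   # task-goal term (cf. eq. 26)
    r2 = -switch_penalty * sum(b != pb for b, pb in zip(behaviors, prev_behaviors))  # switching term (cf. eq. 27)
    return r1 + r2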

The pseudocode of the distributed reinforcement learning task supervisors is shown in Figure 3. The multiple nonholonomic constrained mobile robots interact with the environment at time step t: the i-th nonholonomic constrained mobile robot observes a local observation oi,t, retrieves its previous behavior bi,t-1, selects a behavior bi,t according to an ε-greedy policy with decay factor γε, then receives a team reward rt and transitions to the next local observation oi,t+1. Specifically, the centralized training of the distributed reinforcement learning task supervisors is carried out with hierarchically arranged modules, namely the individual Q-value modules and the mixing module. First, each nonholonomic constrained mobile robot has an individual Q-value module: a recurrent Q-network that takes the hidden state hi,t-1 of a gated recurrent unit, the local observation oi,t, and the previous behavior bi,t-1 as inputs and outputs a local Q-value. The mixing module then generates the joint Q-value by summing all the local Q-values, as given by equation (28) (equation image in the source), which is parameterized by the weights of the individual Q-value networks.
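A minimal sketch, in PyTorch, of the individual recurrent Q-value module and the VDN mixing of equation (28) (joint Q-value as the sum of the local Q-values); layer sizes and names are illustrative.

import torch
import torch.nn as nn

class IndividualQ(nn.Module):
    """Recurrent Q-module of one robot: (o_{i,t}, b_{i,t-1}, h_{i,t-1}) -> local Q-values."""
    def __init__(self, obs_dim, n_behaviors, hidden=64):
        super().__init__()
        self.enc = nn.Linear(obs_dim + n_behaviors, hidden)
        self.gru = nn.GRUCell(hidden, hidden)
        self.q_head = nn.Linear(hidden, n_behaviors)

    def forward(self, obs, prev_behavior_onehot, h_prev):
        x = torch.relu(self.enc(torch.cat([obs, prev_behavior_onehot], dim=-1)))
        h = self.gru(x, h_prev)
        return self.q_head(h), h          # local Q-values over behaviors, new hidden state

def joint_q(local_qs, behaviors):
    """VDN mixing (eq. 28): the joint Q-value is the sum of the chosen local Q-values."""
    chosen = [q.gather(-1, b.unsqueeze(-1)).squeeze(-1) for q, b in zip(local_qs, behaviors)]
    return torch.stack(chosen, dim=0).sum(dim=0)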

After sampling at time step t, the experience tuple (zt, bt, rt, zt+1) is stored in the replay buffer. In particular, mini-batches of experiences are sampled from the replay buffer to reduce data correlation and improve sample utilization. Training then proceeds to time step t+1 and stops once all Ttotal episodes are completed. The distributed reinforcement learning task supervisors are trained by minimizing the loss in equation (29) (equation image in the source), which is expressed in terms of the parameters of a target network.
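Equation (29) is given only as an image; the sketch assumes the usual squared temporal-difference loss on the joint Q-value computed with a target network, which matches the quantities the text mentions.

def vdn_td_loss(q_tot, q_tot_target_next, reward, gamma=0.99):
    """Assumed form of the training loss (eq. 29): a squared TD error on the joint Q-value.

    q_tot             : joint Q-value of the chosen behaviors at step t
    q_tot_target_next : joint Q-value of the greedy behaviors at step t+1,
                        computed with the target-network parameters
    """
    target = reward + gamma * q_tot_target_next.detach()   # bootstrap with the frozen target net
    return ((target - q_tot) ** 2).mean()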

Finally, after the centralized training, the multiple nonholonomic constrained mobile robots have learned a set of optimal distributed behavior-priority policies. In real scenarios, the robots switch behavior priorities according to the learned policies. Once the behavior priorities are determined at each sampling instant, the reference velocity vi,r and the reference trajectory xi,r are obtained through equations (22)-(24) and (2). According to equation (3), the reference velocity and the reference trajectory θi,r in the inertial coordinate frame can then be computed.

Step 3: Design of the reinforcement learning controllers

The position and velocity tracking errors in the inertial coordinate frame are defined as

ep,i = θi − θi,r,  (30)
ev,i = θ̇i − θ̇i,r,  (31)

where θi collects the angles of the left and right wheels, and θi,r and θ̇i,r are the reference position and the reference velocity, respectively.

The differential forms of equations (30) and (31) can be derived as equations (32) and (33) (equation images in the source), in which the derivative of the reference velocity also appears.

The integrated tracking error is defined by equations (34) and (35) (equation images in the source). The value function is then defined by equation (36) (equation image in the source), where the cost function appears and αV, βV ∈ (0, 2) are adjustable cost parameters satisfying αV + βV = 2.

With the optimal tracking control policy so defined, the optimal value function can be expressed as equation (37) (equation image in the source), in which the set of admissible control policies appears.

By combining equations (32)-(35) and (37), the Hamilton-Jacobi-Bellman (HJB) equation can be derived as equation (38) (equation image in the source), which involves the gradient of Vi* with respect to ei as well as the gradients of Vi* with respect to ep,i and ev,i.

By solving the corresponding stationarity condition, the optimal control policy can be derived as equation (39) (equation image in the source). Substituting equation (39) into (38) then yields equation (40) (equation image in the source).

To implement the optimal control policy, equation (40) has to be solved. However, owing to the nonlinearity and the inaccuracy of the dynamic model of the multiple nonholonomic constrained mobile robots, an analytical solution is difficult to obtain.

Therefore, the gradient of the optimal value function needs to be decomposed as in equation (41) (equation image in the source), whose terms involve positive constants and an adaptive compensation term.

Substituting equation (41) into (39) gives equation (42) (equation image in the source).

Neural networks are well known for their strong approximation capability. Therefore, on given compact sets, the unknown terms fi(xi) and Vio can be approximated by neural networks as in equations (43) and (44) (equation images in the source), where the ideal weight matrices, the numbers of neurons wf and wV, and the basis function vectors appear, and the approximation errors δf,i and δV,i are bounded as ||δf,i|| ≤ εf and ||δV,i|| ≤ εV with positive constants εf and εV.
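The basis functions are not spelled out in the text; given the simulation settings reported later (24 neurons, centers uniformly spread over [-6, 6], width μi = 2), Gaussian radial basis functions are a natural reading, and the sketch below assumes that choice.

import numpy as np

def rbf_features(e, centers, width=2.0):
    """Gaussian radial-basis features phi(e) used to approximate f_i and V_i.

    The simulation section states 24 neurons with centers uniformly spread in
    [-6, 6] and width mu_i = 2; the Gaussian form itself is an assumption.
    """
    d = e[None, :] - centers                      # (n_neurons, dim) center offsets
    return np.exp(-np.sum(d**2, axis=1) / width**2)

centers = np.linspace(-6.0, 6.0, 24)[:, None]     # 24 centers, 1-D example
approx_f = lambda e, W: W.T @ rbf_features(e, centers)   # f_hat = W^T phi(e), with W of shape (24, 2)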

Then, substituting equations (43) and (44) into (41) and (42) yields equations (45) and (46) (equation images in the source).

However, since the ideal weight matrices are unknown, the resulting control policy cannot be implemented directly. Therefore, an identifier-actor-critic reinforcement learning algorithm is used to learn the optimal control policy.

Specifically, the identifier neural network is designed to estimate the unknown nonlinear term as in equation (47) (equation image in the source), where the estimate of fi(xi) is parameterized by the identifier network weights. The update law of the identifier neural network, equation (48) (equation image in the source), involves a positive-definite matrix and the designed identifier parameters.

Then, the critic neural network is designed to evaluate the control performance as in equation (49) (equation image in the source), providing an estimate parameterized by the critic network weights. The update law of the critic network, equation (50) (equation image in the source), uses the critic learning rate γc,i.

Finally, the actor neural network implements the control input as in equation (51) (equation image in the source), parameterized by the actor network weights. The update law of the actor network, equation (52) (equation image in the source), uses the actor learning rate γa,i.
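The update laws (48), (50), and (52) appear only as equation images, so the following sketch only illustrates the structure of the identifier-actor-critic scheme (which network estimates what, and which error signal drives it); the update rules shown are simple placeholders, not the patent's laws.

import numpy as np

class IdentifierActorCritic:
    """Structural sketch of the identifier-actor-critic scheme (eqs. 47-52)."""
    def __init__(self, phi_f, phi_v, n_out, n_f=24, n_v=24,
                 gamma_f=0.5, gamma_c=0.5, gamma_a=0.5):
        self.phi_f, self.phi_v = phi_f, phi_v        # basis-function vectors of eqs. (43)-(44)
        self.Wf = np.zeros((n_f, n_out))             # identifier weights, eq. (47)
        self.Wc = np.zeros(n_v)                      # critic weights,     eq. (49)
        self.Wa = np.zeros((n_v, n_out))             # actor weights,      eq. (51)
        self.gf, self.gc, self.ga = gamma_f, gamma_c, gamma_a

    def identify(self, x):
        return self.Wf.T @ self.phi_f(x)             # estimate of f_i(x_i)

    def control(self, e):
        return -self.Wa.T @ self.phi_v(e)            # actor control term

    def update(self, x, e, model_error, bellman_error):
        # placeholder corrections: each weight moves against its own error signal
        self.Wf += self.gf * np.outer(self.phi_f(x), model_error)
        self.Wc -= self.gc * bellman_error * self.phi_v(e)
        self.Wa -= self.ga * bellman_error * np.outer(self.phi_v(e), self.control(e))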

Step 4: Adaptive compensator design

First, consider that the control input τi is subject to a symmetric actuator saturation constraint, expressed as equation (53) (equation image in the source), where τlim,i > 0 is a known threshold.

Second, the control input can be split into two terms,

τi = τ0,i + τΔ,i,  (54)

where τ0,i is the nominal term and τΔ,i is the compensation term, which satisfy the condition in equation (55) (equation image in the source).

Finally, the adaptive compensator is designed with the update law in equation (56) (equation image in the source), in which the designed adaptive compensator parameters appear.
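A small sketch of the actuator-saturation handling described by equations (53)-(54): the input is the sum of a nominal term and a compensation term and is clipped to the known threshold τlim,i; the compensator update law (56) itself is not reproduced here, and all names are illustrative.

import numpy as np

def saturate(tau, tau_lim):
    """Symmetric actuator saturation of eq. (53): each input is clipped to [-tau_lim, tau_lim]."""
    return np.clip(tau, -tau_lim, tau_lim)

def apply_input(tau_nominal, tau_compensation, tau_lim):
    """Control input split of eq. (54): tau_i = tau_0,i + tau_Delta,i, then saturated.

    tau_compensation would be supplied by the adaptive compensator of eq. (56).
    """
    return saturate(tau_nominal + tau_compensation, tau_lim)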

To describe the present invention in detail, a numerical simulation example is given below to demonstrate the effectiveness and superiority of the proposed distributed reinforcement learning behavior control method for multiple nonholonomic constrained mobile robots.

Simulation comparison and analysis

The numerical simulation considers five networked nonholonomic constrained mobile robots that form a desired formation while avoiding obstacles by executing the obstacle avoidance, formation, and reconstruction behaviors. The first nonholonomic constrained mobile robot is set as the leader, and its desired formation task function is x1,d = x0 = [t; 0]. The initial positions of the robots are x1,0 = [0; 0; 0], x2,0 = [-7; 7; 0], x3,0 = [-7; -7; 0], x4,0 = [-12; 12; 0], and x5,0 = [-12; -12; 0]. The target positions are x1,g = [80; 0; 0], x2,g = [75; 5; 0], x3,g = [75; -5; 0], x4,g = [70; 10; 0], and x5,g = [70; -10; 0]. The relative formation positions are given as equation images in the source. The reconstruction matrices are Γ2 = [2/5, 0, 0; 0, 0, 0; 0, 0, 0], Γ3 = [4/5, 0, 0; 0, 0, 0; 0, 0, 0], Γ4 = [3/5, 0, 0; 0, 0, 0; 0, 0, 0], and Γ5 = [4/5, 0, 0; 0, 0, 0; 0, 0, 0]. The initial identifier weight matrices are ωf,1 = [0.46]24×2, ωf,2 = [0.47]24×2, ωf,3 = [0.48]24×2, ωf,4 = [0.49]24×2, and ωf,5 = [0.5]24×2. The initial actor weight matrices are ωa,1 = [0.93]24×2, ωa,2 = [0.95]24×2, ωa,3 = [0.96]24×2, ωa,4 = [0.97]24×2, and ωa,5 = [0.99]24×2. The initial critic weight matrices are ωc,1 = [0.92]24×2, ωc,2 = [0.94]24×2, ωc,3 = [0.96]24×2, ωc,4 = [0.98]24×2, and ωc,5 = [1]24×2. The neural networks have wf = 24 and wV = 24 neurons, with centers uniformly distributed over [-6, 6] and width μi = 2. The unknown nonlinear terms of the robots are set as given in the source (equation image). The network topology of the multiple nonholonomic constrained mobile robots is shown in Figure 4, and the other simulation parameters are listed in Figure 5.

Figures 6-7 compare the task performance before and after learning of the distributed reinforcement learning task supervisors. Before learning, behavior priorities are selected at random, which causes frequent priority switching and non-smooth trajectories and fails to complete the task goal; after learning, priority switching is significantly reduced, the trajectories become smooth, and the prescribed task goal is completed. Figures 8-9 compare the task performance of the distributed reinforcement learning task supervisors (DRLMSs), the distributed finite state automata mission supervisors (DFSAMSs), the distributed model prediction control mission supervisors (DMPCMSs), and a conventional reinforcement learning mission supervisor (RLMS): the DFSAMSs trajectories oscillate strongly and at times violate the safety distance, the DMPCMSs have the largest algorithm iteration time so real-time operation cannot be guaranteed, and the RLMS ignores swarm intelligence, whereas the DRLMSs maintain good task performance with a low iteration time. Figures 10-12 compare the control performance of the distributed behavior control method with and without the input saturation constraint: without the constraint, the control input and the control cost reach unacceptably high values when behavior priorities switch, whereas with the constraint the control input always remains within an acceptable range. Figures 13-14 compare the control performance of the distributed reinforcement learning behavior control (DRLBC), the finite-time distributed behavioral control (finite-time DBC), the fixed-time distributed behavioral control (fixed-time DBC), and the conventional reinforcement learning behavioral control (RLBC): the finite-time DBC and fixed-time DBC trajectories exhibit some oscillation due to the chattering of sliding-mode control, and RLBC ignores swarm intelligence and yields very unsatisfactory control results, whereas DRLBC produces smooth trajectories and is the only distributed behavior control method that satisfies the control saturation constraint. All comparisons demonstrate the effectiveness and superiority of the proposed distributed behavior control method.
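The stated simulation settings can be collected into a configuration as below; the formation offsets and the unknown nonlinear terms are omitted because they appear only as equation images in the source.

import numpy as np

# Simulation setup collected from the description (leader is robot 1, x_{1,d} = [t; 0]).
initial_states = np.array([[0, 0, 0], [-7, 7, 0], [-7, -7, 0], [-12, 12, 0], [-12, -12, 0]], dtype=float)
goal_states    = np.array([[80, 0, 0], [75, 5, 0], [75, -5, 0], [70, 10, 0], [70, -10, 0]], dtype=float)

# Formation reconstruction matrices Gamma_2 ... Gamma_5 (only the (1,1) entry is nonzero).
gammas = {i: np.diag([g, 0.0, 0.0]) for i, g in zip(range(2, 6), [2/5, 4/5, 3/5, 4/5])}

# Initial neural-network weights: 24 x 2 constant matrices per robot.
w_identifier = [np.full((24, 2), c) for c in (0.46, 0.47, 0.48, 0.49, 0.50)]
w_actor      = [np.full((24, 2), c) for c in (0.93, 0.95, 0.96, 0.97, 0.99)]
w_critic     = [np.full((24, 2), c) for c in (0.92, 0.94, 0.96, 0.98, 1.00)]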

Claims (5)

1. A reinforcement learning behavior control method for multiple nonholonomic constrained mobile robots, characterized by comprising the following steps:

Step S1: establish the kinematic model of the multiple nonholonomic constrained mobile robots based on the nonholonomic constraint matrix, establish their dynamic model based on the Euler-Lagrange equations, construct the basic behaviors from the established kinematic model, and combine the designed basic behaviors into composite behaviors with different priority orders through the null-space projection technique;

Step S2: model the behavior priority switching as a decentralized partially observable Markov decision process; under the centralized-training distributed-execution reinforcement learning framework, take the reference velocity commands of the composite behaviors as the action set of the reinforcement learning algorithm, take the position and priority of each nonholonomic constrained robot together with the positions and priorities of its neighboring robots as the observation set, and design the reward function, thereby constructing the distributed reinforcement learning task supervisors (DRLMSs);

Step S3: with the goal of balancing control performance and control loss, introduce an identifier-actor-critic reinforcement learning algorithm to identify the unknown dynamic model online, implement the control policy, and evaluate the control performance, thereby designing the reinforcement learning controllers (RLCs);

Step S4: based on adaptive control theory, design an adaptive compensator to maintain optimal control performance and cancel the saturation effect in real time.

2. The reinforcement learning behavior control method for multiple nonholonomic constrained mobile robots according to claim 1, characterized in that step S1 specifically comprises the following steps:

Step S11: kinematic modeling of the multiple nonholonomic constrained mobile robots.

Consider a group of N (N > 2) nonholonomic constrained mobile robots, each driven by differential wheels, i = 1, ..., N; the generalized velocity of the i-th nonholonomic constrained mobile robot is expressed as
Figure FDA0004129513250000021
Figure FDA0004129513250000021
其中,
Figure FDA0004129513250000022
Figure FDA0004129513250000023
Figure FDA0004129513250000024
分别是线速度和角速度,
Figure FDA0004129513250000025
Figure FDA0004129513250000026
分别是左右轮的线速度,
Figure FDA0004129513250000027
是左右轮间的距离,
Figure FDA0004129513250000028
表示实数集合;
in,
Figure FDA0004129513250000022
Figure FDA0004129513250000023
and
Figure FDA0004129513250000024
are the linear velocity and angular velocity, respectively.
Figure FDA0004129513250000025
and
Figure FDA0004129513250000026
are the linear speeds of the left and right wheels,
Figure FDA0004129513250000027
is the distance between the left and right wheels,
Figure FDA0004129513250000028
represents the set of real numbers;
然后,第i个非完整约束移动机器人的运动学方程表示为Then, the kinematic equation of the i-th nonholonomic constrained mobile robot is expressed as
Figure FDA0004129513250000029
Figure FDA0004129513250000029
其中,
Figure FDA00041295132500000210
表示广义状态,
Figure FDA00041295132500000211
Figure FDA00041295132500000212
分别是位置和方向,
Figure FDA00041295132500000213
表示非完整约束矩阵;
in,
Figure FDA00041295132500000210
represents a generalized state,
Figure FDA00041295132500000211
and
Figure FDA00041295132500000212
are position and direction,
Figure FDA00041295132500000213
represents a non-holonomic constraint matrix;
此外,第i个非完整约束移动机器人在惯性坐标系下的运动学方程为In addition, the kinematic equation of the i-th nonholonomic constrained mobile robot in the inertial coordinate system is:
Figure FDA00041295132500000214
Figure FDA00041295132500000214
其中,
Figure FDA00041295132500000215
是轮半径,
Figure FDA00041295132500000216
表示惯性坐标性下的非完整约束矩阵,
Figure FDA00041295132500000217
Figure FDA00041295132500000218
分别是左右轮的旋转速度;
in,
Figure FDA00041295132500000215
is the wheel radius,
Figure FDA00041295132500000216
represents the nonholonomic constraint matrix in inertial coordinates,
Figure FDA00041295132500000217
and
Figure FDA00041295132500000218
are the rotation speeds of the left and right wheels respectively;
步骤S12:多非完整约束移动机器人动力学建模Step S12: Dynamics modeling of mobile robots with multiple nonholonomic constraints 通过使用欧拉拉格朗日方程,第i个非完整约束移动机器人的动力学模型推导为By using the Euler-Lagrange equations, the dynamic model of the i-th nonholonomic constrained mobile robot is derived as
Figure FDA00041295132500000219
Figure FDA00041295132500000219
其中,
Figure FDA0004129513250000031
是惯性矩阵,
Figure FDA0004129513250000032
是科氏力和向心力矩阵,Gi(xi)是重力矩阵,
Figure FDA0004129513250000033
表示未知非线性项,
Figure FDA0004129513250000034
是可设计的输入增益矩阵,
Figure FDA0004129513250000035
是控制输入,
Figure FDA0004129513250000036
是非完整约束力;
in,
Figure FDA0004129513250000031
is the inertia matrix,
Figure FDA0004129513250000032
is the Coriolis force and centripetal force matrix, G i ( xi ) is the gravity matrix,
Figure FDA0004129513250000033
represents the unknown nonlinear term,
Figure FDA0004129513250000034
is a designable input gain matrix,
Figure FDA0004129513250000035
is the control input,
Figure FDA0004129513250000036
It is not completely binding;
首先,公式(3)的微分形式推导如下First, the differential form of formula (3) is derived as follows
Figure FDA0004129513250000037
Figure FDA0004129513250000037
其中,
Figure FDA0004129513250000038
表示Si(xi)的微分,
Figure FDA0004129513250000039
是轮的角加速度;
in,
Figure FDA0004129513250000038
represents the differential of S i ( xi ),
Figure FDA0004129513250000039
is the angular acceleration of the wheel;
然后,将公式(3)和(5)代入(4),并左乘
Figure FDA00041295132500000310
得到以下方程
Then, substitute formulas (3) and (5) into (4) and multiply on the left by
Figure FDA00041295132500000310
The following equation is obtained
Figure FDA00041295132500000311
Figure FDA00041295132500000311
其中,
Figure FDA00041295132500000312
Figure FDA00041295132500000313
Figure FDA00041295132500000314
in,
Figure FDA00041295132500000312
Figure FDA00041295132500000313
Figure FDA00041295132500000314
根据假设2,公式(6)改写为According to Assumption 2, formula (6) can be rewritten as
Figure FDA00041295132500000315
Figure FDA00041295132500000315
其中,
Figure FDA00041295132500000316
是精确项,
Figure FDA00041295132500000317
是非精确项;
in,
Figure FDA00041295132500000316
is the exact term,
Figure FDA00041295132500000317
is an inexact term;
假设1:多非完整约束移动机器人系统工作在一个静态的场景中,所有非机器人的障碍物均为静态且固定的;Assumption 1: The multi-nonholonomic constrained mobile robot system works in a static scene, and all non-robot obstacles are static and fixed; 假设2:输入增益矩阵Ei(xi)始终满足设计为
Figure FDA00041295132500000318
Assumption 2: The input gain matrix E i ( xi ) always satisfies the design
Figure FDA00041295132500000318
步骤S13:多非完整约束移动机器人基本行为构建Step S13: Construction of basic behaviors of mobile robots with multiple nonholonomic constraints 假设每一个非完整约束移动机器人均有M个基本行为,其中第i个非完整约束移动机器人的第k个基本行为可以使用一个任务变量
Figure FDA00041295132500000319
Figure FDA00041295132500000320
进行数学建模如下
Assume that each nonholonomic constrained mobile robot has M basic behaviors, where the kth basic behavior of the ith nonholonomic constrained mobile robot can use a task variable
Figure FDA00041295132500000319
Figure FDA00041295132500000320
The mathematical modeling is as follows
σi,k=gi,k(xi), (8)σ i,k =gi ,k ( xi ), (8) 其中,
Figure FDA00041295132500000321
表示任务函数;
in,
Figure FDA00041295132500000321
Represents the task function;
然后,任务变量σi,k的微分形式表示为Then, the differential form of the task variable σ i,k is expressed as
Figure FDA00041295132500000322
Figure FDA00041295132500000322
其中,
Figure FDA00041295132500000323
是任务的雅克比矩阵;
in,
Figure FDA00041295132500000323
is the Jacobian matrix of the task;
最后,第i个非完整约束移动机器人的第k个基本行为的参考速度指令可以表示为Finally, the reference velocity command of the kth basic behavior of the i-th nonholonomic constrained mobile robot can be expressed as
Figure FDA0004129513250000041
Figure FDA0004129513250000041
其中,
Figure FDA0004129513250000042
是任务的雅克比矩阵Ji,k的右伪逆,
Figure FDA0004129513250000043
是期望的任务函数,
Figure FDA0004129513250000044
是任务增益,
Figure FDA0004129513250000045
是任务误差;
in,
Figure FDA0004129513250000042
is the right pseudo-inverse of the Jacobian matrix Ji ,k of the task,
Figure FDA0004129513250000043
is the desired task function,
Figure FDA0004129513250000044
is the task gain,
Figure FDA0004129513250000045
is the task error;
在不失一般性的前提下,避障行为、分布式编队行为和分布式重构行为设计如下:Without loss of generality, the obstacle avoidance behavior, distributed formation behavior, and distributed reconstruction behavior are designed as follows: 避障行为:避障行为是一种局部行为,旨在确保非完整约束移动机器人避开路径附近的障碍物,其相应的任务函数、期望任务和任务雅克比矩阵分别表示为:Obstacle avoidance behavior: Obstacle avoidance behavior is a local behavior that aims to ensure that the non-holonomic constrained mobile robot avoids obstacles near the path. Its corresponding task function, expected task, and task Jacobian matrix are expressed as:
Figure FDA0004129513250000046
Figure FDA0004129513250000046
Figure FDA0004129513250000047
Figure FDA0004129513250000047
Figure FDA0004129513250000048
Figure FDA0004129513250000048
其中,
Figure FDA0004129513250000049
表示第i个非完整约束移动机器人与障碍物的最小距离,dOA为安全距离,
Figure FDA00041295132500000410
Figure FDA00041295132500000411
是最小距离的相对位置,
Figure FDA00041295132500000412
是避障行为期望的方向,+和-分别表示障碍物在第i个非完整约束移动机器人的左边和右边;
in,
Figure FDA0004129513250000049
represents the minimum distance between the i-th nonholonomic constrained mobile robot and the obstacle, d OA is the safety distance,
Figure FDA00041295132500000410
Figure FDA00041295132500000411
is the relative position of the minimum distance,
Figure FDA00041295132500000412
is the desired direction of the obstacle avoidance behavior, + and − respectively indicate that the obstacle is to the left and right of the i-th nonholonomically constrained mobile robot;
分布式编队行为:分布式编队行为是一种分布式协作行为,旨在确保多非完整约束移动机器人仅通过使用邻居的状态形成所需的队形,其相应的任务函数、期望任务和任务雅克比矩阵分别表示为:Distributed formation behavior: Distributed formation behavior is a distributed cooperative behavior that aims to ensure that multiple nonholonomic constrained mobile robots form the desired formation by only using the states of their neighbors. The corresponding task function, expected task, and task Jacobian matrix are expressed as:
Figure FDA00041295132500000413
Figure FDA00041295132500000413
Figure FDA00041295132500000414
Figure FDA00041295132500000414
Figure FDA00041295132500000415
Figure FDA00041295132500000415
其中,
Figure FDA00041295132500000416
是分布式编队行为的估计状态,其通过设计具有如下更新率的自适应估计器来估计:
in,
Figure FDA00041295132500000416
is the estimated state of the distributed formation behavior, which is estimated by designing an adaptive estimator with the following update rate:
Figure FDA0004129513250000051
Figure FDA0004129513250000051
其中,κDF是一个正常数,
Figure FDA0004129513250000052
是编队的相对位置,
Figure FDA0004129513250000053
表示领航者的状态,
Figure FDA0004129513250000054
表示第i个非完整约束移动机器人的邻居;
Among them, κ DF is a positive constant,
Figure FDA0004129513250000052
is the relative position of the formation,
Figure FDA0004129513250000053
Indicates the status of the navigator.
Figure FDA0004129513250000054
represents the neighbors of the i-th nonholonomic constrained mobile robot;
分布式重构行为:分布式重构行为是一种分布式协作行为,旨在确保多非完整约束移动机器人仅通过使用邻居的状态重构所需的队形,其相应的任务函数、期望任务和任务雅克比矩阵分别表示为:Distributed Reconfiguration Behavior: Distributed reconfiguration behavior is a distributed cooperative behavior designed to ensure that multiple nonholonomic constrained mobile robots reconstruct the desired formation only by using the states of their neighbors. The corresponding task function, expected task, and task Jacobian matrix are expressed as:
Figure FDA0004129513250000055
Figure FDA0004129513250000055
Figure FDA0004129513250000056
Figure FDA0004129513250000056
Figure FDA0004129513250000057
Figure FDA0004129513250000057
其中,
Figure FDA0004129513250000058
是分布式编队行为的估计状态,其通过设计具有如下更新率的自适应估计器来估计:
in,
Figure FDA0004129513250000058
is the estimated state of the distributed formation behavior, which is estimated by designing an adaptive estimator with the following update rate:
Figure FDA0004129513250000059
Figure FDA0004129513250000059
其中,κDR是一个正常数,
Figure FDA00041295132500000510
是编队重构矩阵;
Among them, κ DR is a positive constant,
Figure FDA00041295132500000510
is the formation reconstruction matrix;
步骤S14:多非完整约束移动机器人复合行为构建Step S14: Construction of composite behaviors of mobile robots with multiple nonholonomic constraints 一个复合任务是多个基本行为以一定的优先级顺序的组合;设定
Figure FDA00041295132500000511
为第i个非完整约束移动机器人的任务函数,其中km∈NM,NM={1,...,M},mk表示任务空间的维度,M表示任务的数量;定义与时间相关的优先级函数gi(km,t):NM×[0,∞]→NM;同时,定义一个具有如下规则的任务层次结构:
A composite task is a combination of multiple basic behaviors in a certain priority order;
Figure FDA00041295132500000511
is the task function of the i-th nonholonomic constrained mobile robot, where km∈NM , NM ={1,...,M}, mk represents the dimension of the task space, and M represents the number of tasks; define a time-related priority function gi ( km ,t): NM ×[0,∞]→ NM ; and define a task hierarchy with the following rules:
1)一个具有gi(kα)优先级的任务kα不能干扰具有gi(kβ)优先级的任务kβ,如果gi(kα)≥gi(kβ),
Figure FDA00041295132500000512
kα≠kβ
1) A task k α with priority gi (k α ) cannot interfere with task k β with priority gi (k β ) if gi (k α ) ≥gi (k β ),
Figure FDA00041295132500000512
k α ≠ k β ;
2)从速度到任务速度的映射关系由任务的雅可比矩阵
Figure FDA00041295132500000513
表示;
2) The mapping from speed to task speed is given by the Jacobian matrix of the task
Figure FDA00041295132500000513
express;
3)具有最低优先级任务mM的维度可能大于
Figure FDA00041295132500000514
因此要确保维度mn大于所有任务的总维度;
3) The dimension of the task m with the lowest priority may be greater than
Figure FDA00041295132500000514
Therefore, make sure that the dimension m n is greater than the total dimension of all tasks;
4)gi(km)的值由任务监管器根据任务的需求和传感器信息进行分配;4) The value of g i (k m ) is assigned by the task supervisor according to the task requirements and sensor information; 通过给基本任务分配给定的优先级,t时刻复合任务的速度表示为By assigning a given priority to the basic tasks, the speed of the composite task at time t is expressed as
Figure FDA0004129513250000061
Figure FDA0004129513250000061
Figure FDA0004129513250000062
Figure FDA0004129513250000062
Figure FDA0004129513250000063
Figure FDA0004129513250000063
其中,
Figure FDA0004129513250000064
是行为优先级,
Figure FDA0004129513250000065
是零空间投影的增广雅克比矩阵。
in,
Figure FDA0004129513250000064
is the behavioral priority,
Figure FDA0004129513250000065
is the augmented Jacobian matrix of the null space projection.
3.根据权利要求1所述的面向多非完整约束移动机器人的强化学习行为控制方法,其特征在于:所述步骤S2具体为:定义集中式训练环境为ε,全局的状态为
Figure FDA0004129513250000066
其中
Figure FDA0004129513250000067
是联合的位置,
Figure FDA0004129513250000068
是联合的优先级,
Figure FDA0004129513250000069
是编队标志位,S表示全局状态集合;定义bi,t={vr,i,t}∈B为局部/本地行为,其中B表示行为集合;定义
Figure FDA00041295132500000610
为独立的局部观测,其中si,t={xi,Pri}是局部/本地状态,
Figure FDA00041295132500000611
表示第i个非完整约束移动机器人的邻居,O表示局部观测集合;由于局部观测,定义局部/本地的行为观测历史为zi,t∈Z,其中Z表示行为观测历史集合;所有的分布式强化学习任务监管器贡献一个奖励信号,且奖励函数设计如下
3. The reinforcement learning behavior control method for multiple nonholonomic constrained mobile robots according to claim 1 is characterized in that: the step S2 specifically includes: defining the centralized training environment as ε, and the global state as
Figure FDA0004129513250000066
in
Figure FDA0004129513250000067
is the joint location,
Figure FDA0004129513250000068
is the priority of the union,
Figure FDA0004129513250000069
is the formation flag, S represents the global state set; define b i,t ={v r,i,t }∈B as a local behavior, where B represents the behavior set; define
Figure FDA00041295132500000610
is an independent local observation, where s i,t = {x i ,Pr i } is a local state,
Figure FDA00041295132500000611
represents the neighbors of the i-th nonholonomic constrained mobile robot, O represents the local observation set; due to local observation, the local behavior observation history is defined as z i,t ∈Z, where Z represents the behavior observation history set; all distributed reinforcement learning task supervisors contribute a reward signal, and the reward function is designed as follows
rt=r1+r2, (25)r t = r 1 + r 2 , (25)
Figure FDA0004129513250000071
Figure FDA0004129513250000071
Figure FDA0004129513250000072
Figure FDA0004129513250000072
其中,
Figure FDA0004129513250000073
分别表示无编队、重构编队和期望编队状态的标识;r1和r2是分别设置实现任务目标和减少行为切换的奖励信号;
in,
Figure FDA0004129513250000073
They represent the identities of no formation, reconstructed formation, and desired formation states, respectively; r1 and r2 are reward signals for achieving task goals and reducing behavior switching, respectively;
多非完整约束移动机器人与环境ε在t时间步进行交互,其中第i个非完整约束移动机器人观测到一个局部观测oi,t,获取到上一个行为bi,t-1,根据具有衰减因子γε的ε贪心策略选取一个行为bi,t,然后得到一个团队奖励rt和转移至下一个局部观测oi,t+1;具体而言,分布式强化学习任务监管器的集中式训练是通过分层渐进模块进行的,包括独立Q值模块和混合模块;首先,每一个非完整约束移动机器人都有一个独立Q值模块,其使用循环Q网络输入门循环神经网络的隐藏层状态hi,t-1,局部观测oi,t,上一个行为bi,t-1,输出局部的Q值
Figure FDA0004129513250000074
然后,混合模块通过求和所有的局部的Q值
Figure FDA0004129513250000075
生成联合Q值
Figure FDA0004129513250000076
如下
Multiple nonholonomic constrained mobile robots interact with the environment ε at t time steps, where the i-th nonholonomic constrained mobile robot observes a local observation o i,t , obtains the previous behavior b i,t-1 , selects a behavior b i,t according to the ε-greedy strategy with a decay factor γ ε , and then obtains a team reward r t and transfers to the next local observation o i,t+1 ; Specifically, the centralized training of the distributed reinforcement learning task supervisor is carried out through hierarchical progressive modules, including independent Q-value modules and hybrid modules; First, each nonholonomic constrained mobile robot has an independent Q-value module, which uses the recurrent Q network input gate recurrent neural network hidden layer state h i,t-1 , local observation o i,t , previous behavior b i,t-1 , and outputs a local Q-value
Figure FDA0004129513250000074
The mixing module then sums up all the local Q values
Figure FDA0004129513250000075
Generate joint Q value
Figure FDA0004129513250000076
as follows
Figure FDA0004129513250000077
Figure FDA0004129513250000077
其中,
Figure FDA0004129513250000078
表示独立Q值网络的参数;
in,
Figure FDA0004129513250000078
represents the parameters of the independent Q value network;
在t时间步采样后,将经历四元组(zt,bt,rt,zt+1)存储到经验池
Figure FDA0004129513250000081
中;特别地,将从经验池中采样最小回放次
Figure FDA0004129513250000082
经历以减少数据相关性和提高样本利用率;然后,训练进行到t+1时间步,且训练直至所有的回合Ttotal完成后停止;分布式强化学习任务监管器通过最小化以下损失进行训练:
After sampling at time step t, the experience quadruple (z t , b t , r t , z t+1 ) is stored in the experience pool
Figure FDA0004129513250000081
In particular, the minimum number of replays will be sampled from the experience pool
Figure FDA0004129513250000082
Experience to reduce data correlation and improve sample utilization; then, training proceeds to t+1 time step, and training stops after all rounds T total are completed; the distributed reinforcement learning task supervisor is trained by minimizing the following loss:
Figure FDA0004129513250000083
Figure FDA0004129513250000083
其中,
Figure FDA0004129513250000084
Figure FDA0004129513250000085
表示目标网络的参数;
in,
Figure FDA0004129513250000084
Figure FDA0004129513250000085
Represents the parameters of the target network;
最后,多非完整约束移动机器人在集中式训练后学习到一组最优分布式行为优先级策略;在实际场景中,多非完整约束移动机器人根据学习到的策略切换行为优先级;一旦在每个采样时刻确定了多非完整约束移动机器人的行为优先级,就通过公式(22)-(24)和(2)获取参考速度vi,r和参考轨迹xi,r;根据公式(3),进一步计算得到惯性坐标系下的参考速度
Figure FDA0004129513250000086
和参考轨迹θi,r
Finally, after centralized training, the multi-nonholonomic constrained mobile robot learns a set of optimal distributed behavior priority strategies; in actual scenarios, the multi-nonholonomic constrained mobile robot switches behavior priorities according to the learned strategies; once the behavior priorities of the multi-nonholonomic constrained mobile robot are determined at each sampling moment, the reference speed v i,r and reference trajectory x i,r are obtained through formulas (22)-(24) and (2); according to formula (3), the reference speed in the inertial coordinate system is further calculated
Figure FDA0004129513250000086
and the reference trajectory θ i,r .
4.根据权利要求1所述的面向多非完整约束移动机器人的强化学习行为控制方法,其特征在于:步骤S3具体为:定义惯性坐标系下的位置和速度跟踪误差分别为4. The reinforcement learning behavior control method for multiple nonholonomic constrained mobile robots according to claim 1 is characterized in that: step S3 specifically comprises: defining the position and velocity tracking errors in the inertial coordinate system as: ep,i=θii,r, (30)e p,iii,r , (30)
Figure FDA0004129513250000087
Figure FDA0004129513250000087
其中,
Figure FDA0004129513250000088
Figure FDA0004129513250000089
Figure FDA00041295132500000810
是左右轮的角度,
Figure FDA00041295132500000811
Figure FDA00041295132500000812
分别是参考位置和参考速度;
in,
Figure FDA0004129513250000088
Figure FDA0004129513250000089
and
Figure FDA00041295132500000810
is the angle of the left and right wheels,
Figure FDA00041295132500000811
and
Figure FDA00041295132500000812
are the reference position and reference velocity respectively;
公式(30)和(31)的微分形式推导为The differential forms of formulas (30) and (31) are derived as follows:
Figure FDA00041295132500000813
Figure FDA00041295132500000813
Figure FDA0004129513250000091
Figure FDA0004129513250000091
其中,
Figure FDA0004129513250000092
Figure FDA0004129513250000093
的微分;
in,
Figure FDA0004129513250000092
yes
Figure FDA0004129513250000093
The differential of
定义集成的跟踪误差如下The integrated tracking error is defined as follows
Figure FDA0004129513250000094
Figure FDA0004129513250000094
Figure FDA0004129513250000095
Figure FDA0004129513250000095
定义值函数如下Define the value function as follows
Figure FDA0004129513250000096
Figure FDA0004129513250000096
其中,
Figure FDA0004129513250000097
表示代价函数,αVV∈(0,2)是可调整的代价参数,且满足αVV=2;
in,
Figure FDA0004129513250000097
represents the cost function, α V , β V ∈(0,2) are adjustable cost parameters, and satisfy α VV =2;
定义
Figure FDA0004129513250000098
为最优的跟踪控制策略;因此,最优的值函数可以表示为
definition
Figure FDA0004129513250000098
is the optimal tracking control strategy; therefore, the optimal value function can be expressed as
Figure FDA0004129513250000099
Figure FDA0004129513250000099
其中,
Figure FDA00041295132500000910
表示可容许的控制策略;
in,
Figure FDA00041295132500000910
represents the permissible control strategies;
通过结合公式(32)-(35)和(37),哈密顿-雅克比-贝尔曼(Hamilton-Jacobi-Bellman,HJB)方程可推导为By combining equations (32)-(35) and (37), the Hamilton-Jacobi-Bellman (HJB) equation can be derived as
Figure FDA00041295132500000911
Figure FDA00041295132500000911
其中,
Figure FDA0004129513250000101
表示Vi *相对于ei的梯度,
Figure FDA0004129513250000102
Figure FDA0004129513250000103
表示分别为Vi *相对于ep,i和ev,i的梯度;
in,
Figure FDA0004129513250000101
represents the gradient of Vi * relative to ei ,
Figure FDA0004129513250000102
and
Figure FDA0004129513250000103
denote the gradients of Vi * relative to ep,i and ev ,i respectively;
通过求解
Figure FDA0004129513250000104
最优的控制策略
Figure FDA0004129513250000105
可推导为
By solving
Figure FDA0004129513250000104
Optimal control strategy
Figure FDA0004129513250000105
It can be deduced as
Figure FDA0004129513250000106
Figure FDA0004129513250000106
此外,将公式(39)代入(38)可获得以下等式:Furthermore, substituting formula (39) into (38) yields the following equation:
Figure FDA0004129513250000107
Figure FDA0004129513250000107
为了实施
Figure FDA0004129513250000108
需要求解公式(40)获取
Figure FDA0004129513250000109
然而,由于多非完整约束移动机器人动力学模型的非线性和不精确,
Figure FDA00041295132500001010
的解析解难以求取;
To implement
Figure FDA0004129513250000108
We need to solve formula (40) to obtain
Figure FDA0004129513250000109
However, due to the nonlinearity and inaccuracy of the dynamics model of mobile robots with multiple nonholonomic constraints,
Figure FDA00041295132500001010
The analytical solution of is difficult to obtain;
因此,需要将最优值函数梯度分解如下:Therefore, the optimal value function gradient needs to be decomposed as follows:
Figure FDA00041295132500001011
Figure FDA00041295132500001011
其中,
Figure FDA00041295132500001012
Figure FDA00041295132500001013
Figure FDA00041295132500001014
是正常数,
Figure FDA00041295132500001015
是自适应补偿项;
in,
Figure FDA00041295132500001012
Figure FDA00041295132500001013
and
Figure FDA00041295132500001014
is a normal number,
Figure FDA00041295132500001015
is the adaptive compensation term;
将公式(41)代入(39),获取等式如下:Substituting formula (41) into (39), we obtain the following equation:
Figure FDA00041295132500001016
Figure FDA00041295132500001016
众所周知,神经网络具有强大的逼近能力;因此,给定紧集
Figure FDA0004129513250000111
Figure FDA0004129513250000112
对于
Figure FDA0004129513250000113
Figure FDA0004129513250000114
未知项fi(xi)和Vi o通过神经网络近似如下:
As we all know, neural networks have powerful approximation capabilities; therefore, given a compact set
Figure FDA0004129513250000111
and
Figure FDA0004129513250000112
for
Figure FDA0004129513250000113
and
Figure FDA0004129513250000114
The unknown terms fi ( xi ) and Vio are approximated by the neural network as follows:
Figure FDA0004129513250000115
Figure FDA0004129513250000115
Figure FDA0004129513250000116
Figure FDA0004129513250000116
其中,
Figure FDA0004129513250000117
Figure FDA0004129513250000118
是理想的权重矩阵,wf和wV是神经元数量,
Figure FDA0004129513250000119
Figure FDA00041295132500001110
是基函数向量,
Figure FDA00041295132500001111
Figure FDA00041295132500001112
是逼近误差,且有界如||δf,i||≤εf和||δV,i||≤εV,εf和εV是正常数;
in,
Figure FDA0004129513250000117
and
Figure FDA0004129513250000118
is the ideal weight matrix, wf and wV are the number of neurons,
Figure FDA0004129513250000119
and
Figure FDA00041295132500001110
is the basis function vector,
Figure FDA00041295132500001111
and
Figure FDA00041295132500001112
is the approximation error and is bounded such that ||δ f,i ||≤ε f and ||δ V,i ||≤ε V , ε f and ε V are positive constants;
然后,将公式(43)和(44)代入(41)和(42),获取到以下方程:Then, substitute equations (43) and (44) into (41) and (42) to obtain the following equations:
Figure FDA00041295132500001113
Figure FDA00041295132500001113
Figure FDA00041295132500001114
Figure FDA00041295132500001114
然而,由于
Figure FDA00041295132500001115
Figure FDA00041295132500001116
是未知的,
Figure FDA00041295132500001117
无法实施;因此,使用一种辨识者-执行者-评论家强化学习算法以学习最优控制策略;
However, due to
Figure FDA00041295132500001115
and
Figure FDA00041295132500001116
is unknown,
Figure FDA00041295132500001117
It is not feasible to implement; therefore, an Identifier-Actor-Critic reinforcement learning algorithm is used to learn the optimal control policy;
具体而言,设计辨识者神经网络以估计未知非线性项如下:Specifically, the discriminator neural network is designed to estimate the unknown nonlinear terms as follows:
Figure FDA00041295132500001118
Figure FDA00041295132500001118
其中,
Figure FDA00041295132500001119
是fi(xi)的估计,
Figure FDA00041295132500001120
是辨识者神经网络的权重;辨识者神经网络的更新率可设计为:
in,
Figure FDA00041295132500001119
is an estimate of fi ( xi ),
Figure FDA00041295132500001120
is the weight of the identifier neural network; the update rate of the identifier neural network can be designed as:
Figure FDA0004129513250000121
Figure FDA0004129513250000121
其中,
Figure FDA0004129513250000122
是正定矩阵,
Figure FDA0004129513250000123
是设计的辨识者参数;
in,
Figure FDA0004129513250000122
is a positive definite matrix,
Figure FDA0004129513250000123
is the identifier parameter of the design;
然后,设计评论家神经网络以评估控制性能如下:Then, a critic neural network is designed to evaluate the control performance as follows:
Figure FDA0004129513250000124
Figure FDA0004129513250000124
其中,
Figure FDA0004129513250000125
Figure FDA0004129513250000126
的估计值,
Figure FDA0004129513250000127
是评论家神经网络的权重;评论家网络的更新率可设计为
in,
Figure FDA0004129513250000125
yes
Figure FDA0004129513250000126
The estimated value of
Figure FDA0004129513250000127
is the weight of the critic neural network; the update rate of the critic network can be designed as
Figure FDA0004129513250000128
Figure FDA0004129513250000128
其中,γc,i是评论家的学习率;where γ c,i is the critic’s learning rate; 最后,设计执行者神经网络设施控制输入如下:Finally, the design of the actuator neural network facility control input is as follows:
Figure FDA0004129513250000129
Figure FDA0004129513250000129
其中,
Figure FDA00041295132500001210
是执行者神经网络的权重;执行者网络的更新率可设计为
in,
Figure FDA00041295132500001210
is the weight of the actor neural network; the update rate of the actor network can be designed as
Figure FDA00041295132500001211
Figure FDA00041295132500001211
其中,γa,i是执行者的学习率。where γ a,i is the learning rate of the executor.
5.根据权利要求1所述的面向多非完整约束移动机器人的强化学习行为控制方法,其特征在于:所述步骤S4具体为:首先,考虑控制输入
Figure FDA00041295132500001212
受到对称执行机构饱和约束的限制如下:
5. The reinforcement learning behavior control method for multiple nonholonomic constrained mobile robots according to claim 1 is characterized in that: the step S4 specifically comprises: first, considering the control input
Figure FDA00041295132500001212
The restrictions subject to the saturation constraints of the symmetric actuators are as follows:
Figure FDA00041295132500001213
Figure FDA00041295132500001213
其中,τlim,i>0是已知的阈值;Among them, τ lim,i >0 is a known threshold; 其次,可将控制输入分为两项如下:Secondly, the control input can be divided into two items as follows: τi=τ0,iΔ,i, (54)τ i0,iΔ,i , (54) 其中,
Figure FDA0004129513250000131
是标称项,
Figure FDA0004129513250000132
是补偿项,且满足如下条件:
in,
Figure FDA0004129513250000131
is a nominal term,
Figure FDA0004129513250000132
is a compensation item and meets the following conditions:
Figure FDA0004129513250000133
Figure FDA0004129513250000133
最后,设计自适应补偿器为
Figure FDA0004129513250000134
且具有更新率如下
Finally, the adaptive compensator is designed as
Figure FDA0004129513250000134
And has an update rate of
Figure FDA0004129513250000135
Figure FDA0004129513250000135
其中,
Figure FDA0004129513250000136
是设计的自适应补偿器参数。
in,
Figure FDA0004129513250000136
are the designed adaptive compensator parameters.
CN202310255701.9A 2023-03-16 2023-03-16 Reinforcement learning behavior control method for mobile robots with multiple nonholonomic constraints Active CN116068900B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310255701.9A CN116068900B (en) 2023-03-16 2023-03-16 Reinforcement learning behavior control method for mobile robots with multiple nonholonomic constraints

Publications (2)

Publication Number Publication Date
CN116068900A true CN116068900A (en) 2023-05-05
CN116068900B CN116068900B (en) 2025-07-04

Family

ID=86175202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310255701.9A Active CN116068900B (en) 2023-03-16 2023-03-16 Reinforcement learning behavior control method for mobile robots with multiple nonholonomic constraints

Country Status (1)

Country Link
CN (1) CN116068900B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200218888A1 (en) * 2017-07-18 2020-07-09 Vision Semantics Limited Target Re-Identification
CN110989576A (en) * 2019-11-14 2020-04-10 北京理工大学 Target following and dynamic obstacle avoidance control method for differential slip steering vehicle
US20210171024A1 (en) * 2019-12-06 2021-06-10 Elektrobit Automotive Gmbh Deep learning based motion control of a group of autonomous vehicles

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Shen Yanjun; Wu Chaoyan: "Partial-variable asymptotically stable and finite-time stable observer design for a class of chained systems", Journal of Shandong University (Engineering Science), no. 06, 21 November 2013 (2013-11-21), pages 46-50 *
Wang Tao; Wang Liqiang; Li Yufei: "Research on an autonomous navigation control algorithm based on reinforcement learning", Computer Simulation, no. 11, 15 November 2018 (2018-11-15), pages 306-310 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117031937A (en) * 2023-07-11 2023-11-10 淮阴工学院 Reinforced learning control method of self-balancing robot based on preset performance error
CN116931581A (en) * 2023-08-29 2023-10-24 逻腾(杭州)科技有限公司 Spherical robot track planning and tracking control method suitable for all terrain
CN119088044A (en) * 2024-11-07 2024-12-06 北京星网船电科技有限公司 Autonomous navigation control method and system for unmanned ships based on artificial intelligence
CN119088044B (en) * 2024-11-07 2025-01-21 北京星网船电科技有限公司 Unmanned ship autonomous navigation control method and system based on artificial intelligence
CN119847210A (en) * 2025-01-07 2025-04-18 电子科技大学长三角研究院(衢州) Multi-agent formation cooperative control method based on task decomposition reinforcement learning
CN119458386A (en) * 2025-01-15 2025-02-18 福州大学 Dynamic obstacle avoidance method for robotic arms based on intelligent task supervision

Also Published As

Publication number Publication date
CN116068900B (en) 2025-07-04

Similar Documents

Publication Publication Date Title
CN116068900A (en) Reinforced learning behavior control method for multiple incomplete constraint mobile robots
Guo et al. Command-filter-based fixed-time bipartite containment control for a class of stochastic multiagent systems
CN110597061B (en) A Multi-Agent Fully Distributed Active Disturbance Rejection Time-Varying Formation Control Method
Shou et al. Finite‐time formation control and obstacle avoidance of multi‐agent system with application
CN111522341A (en) Multi-time-varying formation tracking control method and system for network heterogeneous robot system
Li et al. Active disturbance rejection formation tracking control for uncertain nonlinear multi-agent systems with switching topology via dynamic event-triggered extended state observer
Zhang et al. Reinforcement learning behavioral control for nonlinear autonomous system
Wang et al. Event-triggered integral formation controller for networked nonholonomic mobile robots: Theory and experiment
Sun et al. Iterative learning control based robust distributed algorithm for non-holonomic mobile robots formation
CN112947086A (en) Self-adaptive compensation method for actuator faults in formation control of heterogeneous multi-agent system consisting of unmanned aerial vehicle and unmanned vehicle
CN118131621A (en) A distributed fixed-time optimization method based on multi-agent system
CN115993780A (en) Time-varying formation optimization tracking control method and system for EL type nonlinear cluster system
Ma et al. Adaptive neural cooperative control of multirobot systems with input quantization
Wang et al. Dynamic event-driven finite-horizon optimal consensus control for constrained multiagent systems
Ge et al. State-constrained bipartite tracking of interconnected robotic systems via hierarchical prescribed-performance control
Huang et al. Distributed nonlinear placement for a class of multicluster Euler–Lagrange systems
Tsai et al. Adaptive reinforcement learning formation control using ORFBLS for omnidirectional mobile multi-robots
Liu et al. Who to blame? learning and control strategies with information asymmetry
Yang et al. Adaptive asymptotic tracking control for underactuated autonomous underwater vehicles with state constraints
Guan et al. Adaptive output feedback control for uncertain nonlinear systems subject to deferred state constraints
CN118259588A (en) Hybrid fault-tolerant coordinated tracking control method for discrete nonlinear systems based on reinforcement learning and event triggering
Zhao et al. Event-triggered cooperative adaptive optimal output regulation for multiagent systems under switching network: an adaptive dynamic programming approach
Li et al. Practical Prescribed-Time Consensus Tracking Control for Nonlinear Heterogeneous MASs With Bounded Time-Varying Gain Under Mismatching and Non-Vanishing Uncertainties
Li et al. Distributed optimal formation control for constrained quadrotor unmanned aerial vehicles via Stackelberg-Game
CN120010274B (en) Reinforcement learning tracking control method for mobile robots based on event triggering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant