
CN116068900A - Reinforcement learning behavior control method for multiple nonholonomic constrained mobile robots - Google Patents

Info

Publication number: CN116068900A (granted as CN116068900B)
Application number: CN202310255701.9A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: behavior, nonholonomic, task, mobile robot, distributed
Inventors: 黄捷, 张祯毅
Assignee (original and current): Fuzhou University
Legal status: Granted; Active
Legal events: application filed by Fuzhou University; priority to CN202310255701.9A; publication of CN116068900A; application granted; publication of CN116068900B

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05B: CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02: Adaptive control systems as above, electric
    • G05B13/04: Adaptive control systems as above, electric, involving the use of models or simulators
    • G05B13/042: Adaptive control systems as above, in which a parameter or coefficient is automatically adjusted to optimise the performance
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02: Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention provides a reinforcement learning behavior control method for multiple nonholonomic constrained mobile robots. The method establishes a kinematic model of the robots based on the nonholonomic constraint matrix, establishes a dynamic model based on the Euler-Lagrange equations, constructs basic behaviors from the established kinematic model, and combines the designed basic behaviors into composite behaviors under different priority orders through null-space projection. By applying this technical scheme, the use of a centralized unit during task execution can be avoided, and the dynamism and intelligence of behavior priority switching are improved.

Description

Reinforcement learning behavior control method for mobile robots with multiple nonholonomic constraints

Technical Field

The invention relates to the technical field of intelligent robots, and in particular to a reinforcement learning behavior control method for multiple nonholonomically constrained mobile robots.

Background Art

In recent years, nonholonomic mobile robots have been widely used in various fields. Since a nonholonomic mobile robot cannot be stabilized by any time-invariant smooth state feedback control law, its tracking control problem has been studied as a priority. Through group cooperation, multiple nonholonomic mobile robots usually achieve better task performance than a single robot. However, nonholonomic constraints often degrade team performance, and implementing cooperative control under nonholonomic constraints poses a challenging control problem.

Existing cooperative control of multiple nonholonomic mobile robots is usually based on centralized or distributed frameworks. Centralized methods use a single centralized controller to activate team behaviors and avoid violating the nonholonomic constraints. Since this controller must obtain global information, the scalability of centralized methods is unsatisfactory. Distributed methods instead avoid a centralized controller by using a set of networked controllers over a communication topology. Most distributed methods only address cooperative control problems with a single task or control objective. However, multi-task conflicts are common in cooperative control problems and cannot be ignored. Behavior control methods are among the most effective solutions. The original behavior control method is a hierarchical framework in which low-level behaviors are executed only after all high-level behaviors are completed. To improve task execution efficiency, a motor-schema behavior control framework was proposed that sums behavior commands with adjustable weights, but under it no single behavior is ever fully executed. By combining the advantages of the above two methods, the null-space-based behavior control method was proposed, which not only completes the highest-priority behavior but also partially executes lower-priority behaviors through null-space projection. Although the null-space-based behavior control method has been extended to different multi-agent system scenarios, it has the inherent defect of implicit centralization: it relies on a centralized task supervisor to assign behavior priorities. To address this, a distributed behavior control framework was first proposed for aggregation control, but it lacked task and controller stability analysis. Subsequently, the task error of distributed behavior control was proven to be asymptotically stable, but the result was limited to triangular formations in obstacle-free environments. A set of nonlinear fast terminal sliding-mode controllers was then designed for distributed behavior control, achieving finite-time convergence of the tracking error. Finally, by designing a fixed-time estimator and a terminal sliding-mode control law, both the task and tracking errors were made fixed-time stable.

However, existing distributed behavior control methods still have the following drawbacks: 1. The behavior priorities are fixed and pre-set, which leads to poor dynamic task performance and heavy reliance on human intelligence. 2. They lack optimality and intelligence, which leads to excessive consumption of control resources to maintain good control performance, especially when behavior priorities are switched. 3. The control inputs are not subject to saturation constraints, so the actuators may violate their physical limits after behavior priorities are switched.

Summary of the Invention

In view of this, the purpose of the present invention is to provide a reinforcement learning behavior control method for multiple nonholonomic constrained mobile robots. A reinforcement learning controller is designed based on the identifier-actor-critic algorithm to learn the unknown dynamics and the optimal control policy of the system online, so that control performance and control cost remain balanced throughout task execution. Input saturation constraints are also considered to prevent the actuators from violating their actual physical limits.

To achieve the above object, the present invention adopts the following technical solution: a reinforcement learning behavior control method for multiple nonholonomic constrained mobile robots, comprising the following steps.

Step S1: establish a kinematic model of the multiple nonholonomic constrained mobile robots based on the nonholonomic constraint matrix, establish a dynamic model based on the Euler-Lagrange equations, construct basic behaviors from the established kinematic model, and combine the designed basic behaviors into composite behaviors under different priority orders through the null-space projection technique.

Step S2: model behavior priority switching as a decentralized partially observable Markov decision process; under the centralized-training, distributed-execution reinforcement learning framework, set the reference velocity commands of the composite behaviors as the action set, select the position and priority of each nonholonomic constrained robot and of its neighboring robots as the observation set, and design the reward function, thereby constructing the distributed reinforcement learning task supervisors (DRLMSs).

Step S3: with the goal of balancing control performance and control cost, introduce an identifier-actor-critic reinforcement learning algorithm to identify the unknown dynamic model online, implement the control policy, and evaluate the control performance, thereby designing the reinforcement learning controllers (RLCs).

Step S4: based on adaptive control theory, design adaptive compensators to maintain optimal control performance and cancel the saturation effect in real time.

In a preferred embodiment, step S1 specifically includes the following steps.

Step S11: Kinematic modeling of mobile robots with multiple nonholonomic constraints

Consider a group of N (N > 2) nonholonomic constrained mobile robots, each driven by differential wheels, i = 1, ..., N. The generalized velocity of the i-th nonholonomic constrained mobile robot is expressed as

vi = [νi, ωi]T, (1)

where νi = (vR,i + vL,i)/2 and ωi = (vR,i - vL,i)/li are the linear and angular velocities, vR,i and vL,i are the linear speeds of the right and left wheels, li is the distance between the left and right wheels, and ℝ denotes the set of real numbers.

Then, the kinematic equation of the i-th nonholonomic constrained mobile robot relates the derivative of the generalized state to the generalized velocity vi through the nonholonomic constraint matrix (equation (2)), where xi denotes the generalized state, composed of the position and the orientation of the robot.

In addition, the kinematic equation of the i-th nonholonomic constrained mobile robot in the inertial coordinate frame is

ẋi = Si(xi)θ̇i, (3)

where ri is the wheel radius, Si(xi) denotes the nonholonomic constraint matrix in the inertial frame, and θ̇i collects the rotational speeds of the right and left wheels.
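For illustration, the kinematics (1)-(2) can be simulated numerically. The following Python snippet is only a minimal sketch assuming the standard differential-drive (unicycle) model; the function names, the wheel_base value, and the constant wheel speeds are hypothetical and not taken from the patent.

```python
import numpy as np

def wheel_to_body(v_right, v_left, wheel_base):
    """Map left/right wheel linear speeds to body linear/angular velocity,
    in the spirit of equation (1)."""
    v_lin = 0.5 * (v_right + v_left)
    v_ang = (v_right - v_left) / wheel_base
    return v_lin, v_ang

def unicycle_step(x, v_lin, v_ang, dt):
    """One Euler-integration step of the standard unicycle kinematics
    x_dot = S(x) @ [v_lin, v_ang], cf. equation (2)."""
    theta = x[2]
    S = np.array([[np.cos(theta), 0.0],
                  [np.sin(theta), 0.0],
                  [0.0,           1.0]])
    return x + dt * S @ np.array([v_lin, v_ang])

# Example: drive robot i along a gentle arc for 2 s.
x = np.array([0.0, 0.0, 0.0])            # [px, py, orientation]
for _ in range(200):
    v_lin, v_ang = wheel_to_body(v_right=0.55, v_left=0.45, wheel_base=0.3)
    x = unicycle_step(x, v_lin, v_ang, dt=0.01)
print(x)
```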

Step S12: Dynamic modeling of mobile robots with multiple nonholonomic constraints

Using the Euler-Lagrange equations, the dynamic model of the i-th nonholonomic constrained mobile robot is derived as

Mi(xi)ẍi + Ci(xi, ẋi)ẋi + Gi(xi) + Fi(xi) = Ei(xi)τi - AiT(xi)λi, (4)

where Mi(xi) is the inertia matrix, Ci(xi, ẋi) is the Coriolis and centripetal force matrix, Gi(xi) is the gravity matrix, Fi(xi) denotes an unknown nonlinear term, Ei(xi) is the designable input gain matrix, τi is the control input, and λi is the nonholonomic constraint force.

First, the differential form of formula (3) is derived as

ẍi = Ṡi(xi)θ̇i + Si(xi)θ̈i, (5)

where Ṡi(xi) denotes the differential of Si(xi) and θ̈i is the angular acceleration of the wheels.

Then, substituting formulas (3) and (5) into (4) and left-multiplying by SiT(xi), the reduced equation (6) is obtained, in which the transformed inertia, Coriolis/centripetal, and input gain terms are defined accordingly and the nonholonomic constraint force is eliminated.

According to Assumption 2, formula (6) can be rewritten as equation (7), in which the dynamics are separated into an exact (known) term and an inexact (uncertain) term.

Assumption 1: The multiple nonholonomic constrained mobile robot system operates in a static scene, and all non-robot obstacles are static and fixed.

Assumption 2: The input gain matrix Ei(xi) always satisfies the stated design condition.

Step S13: Construction of basic behaviors of mobile robots with multiple nonholonomic constraints

Assume that each nonholonomic constrained mobile robot has M basic behaviors, where the k-th basic behavior of the i-th nonholonomic constrained mobile robot is modeled mathematically using a task variable σi,k as

σi,k = gi,k(xi), (8)

where gi,k(·) denotes the task function.

Then, the differential form of the task variable σi,k is expressed as

σ̇i,k = Ji,k(xi)ẋi, (9)

where Ji,k(xi) is the task Jacobian matrix.

Finally, the reference velocity command of the k-th basic behavior of the i-th nonholonomic constrained mobile robot can be expressed as

vi,k = Ji,k†(σ̇d,i,k + Λi,k σ̃i,k), (10)

where Ji,k† is the right pseudo-inverse of the task Jacobian Ji,k, σd,i,k is the desired task function, Λi,k is the task gain, and σ̃i,k is the task error.
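The closed-loop inverse-kinematics structure of equation (10) can be sketched directly in code. The snippet below is only an illustrative implementation of a Jacobian right pseudo-inverse velocity command; the example Jacobian and numerical values are hypothetical.

```python
import numpy as np

def behavior_velocity(J, sigma_des_dot, sigma_des, sigma, gain):
    """Reference velocity of one basic behavior via the right pseudo-inverse
    of the task Jacobian, in the spirit of equation (10):
    v = J^+ (sigma_des_dot + gain * (sigma_des - sigma))."""
    J_pinv = np.linalg.pinv(J)          # right pseudo-inverse of the Jacobian
    task_error = sigma_des - sigma
    return J_pinv @ (sigma_des_dot + gain * task_error)

# Example: a 1-D "keep distance to obstacle" task for a planar robot.
J = np.array([[0.8, 0.6, 0.0]])         # 1x3 task Jacobian (illustrative)
v = behavior_velocity(J,
                      sigma_des_dot=np.zeros(1),
                      sigma_des=np.array([1.0]),   # safety distance
                      sigma=np.array([0.7]),       # current distance
                      gain=2.0)
print(v)   # velocity command pushing the task value toward 1.0
```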

Without loss of generality, the obstacle avoidance behavior, the distributed formation behavior, and the distributed reconstruction behavior are designed as follows.

Obstacle avoidance behavior: obstacle avoidance is a local behavior that ensures that a nonholonomic constrained mobile robot avoids obstacles near its path. Its task function, desired task, and task Jacobian matrix are given in equations (11)-(13), where the task function is the minimum distance between the i-th nonholonomic constrained mobile robot and the obstacle, dOA is the safety distance, the relative position of the minimum-distance point and the desired obstacle-avoidance direction appear in the Jacobian, and + and - indicate that the obstacle lies to the left or to the right of the i-th robot, respectively.

Distributed formation behavior: distributed formation is a distributed cooperative behavior that ensures that the multiple nonholonomic constrained mobile robots form the desired formation using only the states of their neighbors. Its task function, desired task, and task Jacobian matrix are given in equations (14)-(16), in which the estimated state of the distributed formation behavior is obtained by an adaptive estimator with the update law in equation (17), where κDF is a positive constant, the relative formation position and the leader state enter the update law, and Ni denotes the set of neighbors of the i-th nonholonomic constrained mobile robot.
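The exact update law (17) is given only as an image in this extraction. The following Python snippet is therefore just a generic consensus-style stand-in showing how such a distributed estimator of the formation reference typically behaves; the function names, gains, and neighbor values are illustrative assumptions.

```python
import numpy as np

def estimator_update(x_hat, neighbor_x_hats, leader_state, sees_leader,
                     kappa_df, dt):
    """One step of a consensus-style adaptive estimator of the formation
    reference, standing in for the update law (17)."""
    err = sum(x_hat - xj for xj in neighbor_x_hats)
    if sees_leader:                     # only robots adjacent to the leader
        err += x_hat - leader_state
    return x_hat - dt * kappa_df * err

x_hat = np.zeros(2)
for _ in range(500):
    x_hat = estimator_update(x_hat,
                             neighbor_x_hats=[np.array([0.9, 0.4])],
                             leader_state=np.array([1.0, 0.5]),
                             sees_leader=True,
                             kappa_df=5.0, dt=0.01)
print(x_hat)   # settles between the neighbor estimate and the leader state
```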

Distributed reconstruction behavior: distributed reconstruction is a distributed cooperative behavior that ensures that the multiple nonholonomic constrained mobile robots reconstruct the desired formation using only the states of their neighbors. Its task function, desired task, and task Jacobian matrix are given in equations (18)-(20), in which the estimated state is obtained by an adaptive estimator with the update law in equation (21), where κDR is a positive constant and the formation reconstruction matrix specifies the new formation.

Step S14: Construction of composite behaviors of mobile robots with multiple nonholonomic constraints

A composite task is a combination of multiple basic behaviors in a given priority order. Consider the task functions of the i-th nonholonomic constrained mobile robot, where km ∈ NM, NM = {1, ..., M}, mk denotes the dimension of the task space, and M denotes the number of tasks. Define a time-dependent priority function gi(km, t): NM × [0, ∞] → NM. At the same time, define a task hierarchy with the following rules:

1) A task kα with priority gi(kα) must not interfere with a task kβ with priority gi(kβ) if gi(kα) ≥ gi(kβ), kα ≠ kβ.

2) The mapping from velocity to task velocity is represented by the task Jacobian matrix of the task.

3) The dimension of the lowest-priority task mM may be larger than the remaining available dimension; therefore the dimension mn must be ensured to be greater than the total dimension of all tasks.

4) The value of gi(km) is assigned by the task supervisor according to the task requirements and sensor information.

By assigning given priorities to the basic tasks, the velocity of the composite task at time t is expressed by equations (22)-(24), in which the behavior priorities determine the ordering and the augmented Jacobian of the null-space projection propagates lower-priority velocity components through the null space of the higher-priority tasks.
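A generic null-space-based composition can be sketched as follows. This is a minimal sketch of the standard augmented-Jacobian null-space projection; it does not reproduce the patent's equations (22)-(24) verbatim, and the example Jacobians and velocities are illustrative.

```python
import numpy as np

def nsb_compose(jacobians, velocities):
    """Null-space-based composition of behavior velocities, ordered from
    highest to lowest priority: each lower-priority command is projected
    onto the null space of the augmented Jacobian of all higher-priority
    tasks."""
    n = velocities[0].shape[0]
    v = np.zeros(n)
    J_aug = np.zeros((0, n))                 # augmented Jacobian so far
    for J, vk in zip(jacobians, velocities):
        N = np.eye(n) - np.linalg.pinv(J_aug) @ J_aug if J_aug.size else np.eye(n)
        v = v + N @ vk                       # filter vk through the null space
        J_aug = np.vstack([J_aug, J])
    return v

# Example: obstacle avoidance (priority 1) plus formation keeping (priority 2).
J_oa, v_oa = np.array([[1.0, 0.0]]), np.array([0.4, 0.0])
J_df, v_df = np.array([[0.0, 1.0], [1.0, 0.0]]), np.array([0.2, 0.3])
print(nsb_compose([J_oa, J_df], [v_oa, v_df]))   # -> [0.4, 0.3]
```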

In a preferred embodiment, step S2 is specified as follows. Define the centralized training environment such that the global state st ∈ S collects the joint positions, the joint priorities, and the formation flag, where S denotes the global state set. Define bi,t = {vr,i,t} ∈ B as the local behavior, where B denotes the behavior set. Define oi,t ∈ O as an independent local observation, where si,t = {xi, Pri} is the local state, Ni denotes the neighbors of the i-th nonholonomic constrained mobile robot, and O denotes the local observation set. Because the observations are local, define the local behavior-observation history as zi,t ∈ Z, where Z denotes the set of behavior-observation histories. All distributed reinforcement learning task supervisors contribute to one reward signal, and the reward function is designed as

rt = r1 + r2, (25)

where r1 and r2 are given in equations (26) and (27): r1 rewards achieving the task objective, distinguishing the no-formation, reconstructed-formation, and desired-formation states, and r2 is set to reduce behavior switching.
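The exact values in equations (26)-(27) are not reproduced in this extraction, so the snippet below only illustrates the stated structure of the team reward, rt = r1 + r2, with r1 tied to the formation state and r2 penalizing behavior switching; all numeric values and the behavior labels are placeholders.

```python
def team_reward(formation_flag, prev_behaviors, behaviors):
    """Team reward r_t = r1 + r2 in the spirit of equations (25)-(27);
    the numeric values are placeholders, not the patent's values."""
    # r1: reward progress toward the desired formation.
    r1 = {"no_formation": -1.0,
          "reconfigured_formation": 0.5,
          "desired_formation": 1.0}[formation_flag]
    # r2: penalize unnecessary behavior-priority switching.
    switches = sum(b != pb for b, pb in zip(behaviors, prev_behaviors))
    r2 = -0.1 * switches
    return r1 + r2

print(team_reward("reconfigured_formation",
                  prev_behaviors=["OA", "DF", "DF"],
                  behaviors=["DF", "DF", "DF"]))
```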

The multiple nonholonomic constrained mobile robots interact with the environment at time step t: the i-th robot receives a local observation oi,t, takes its previous behavior bi,t-1 into account, selects a behavior bi,t according to an ε-greedy policy with decay factor γε, and then receives a team reward rt and transitions to the next local observation oi,t+1. Specifically, the centralized training of the distributed reinforcement learning task supervisors is carried out with a hierarchical, progressive structure consisting of independent Q-value modules and a mixing module. First, every nonholonomic constrained mobile robot has an independent Q-value module: a recurrent Q-network whose gated recurrent unit takes the hidden state hi,t-1, the local observation oi,t, and the previous behavior bi,t-1 as inputs and outputs the local Q-value. Then, the mixing module generates the joint Q-value by summing all local Q-values, as given in equation (28), where the parameters of the independent Q-value networks appear.

After sampling at time step t, the experience tuple (zt, bt, rt, zt+1) is stored in an experience replay buffer. In particular, a minibatch of experiences is sampled from the buffer to reduce data correlation and improve sample efficiency. Training then proceeds to time step t+1 and stops once all Ttotal episodes are completed. The distributed reinforcement learning task supervisors are trained by minimizing the temporal-difference loss in equation (29), where the parameters of the target network appear.
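The centralized-training structure described above (per-robot recurrent Q-networks whose local Q-values are summed into a joint Q-value and trained with a TD loss against a target network) can be sketched as follows. The use of PyTorch, the layer sizes, and all names are assumptions for illustration, not details from the patent.

```python
import torch
import torch.nn as nn

class RecurrentQNet(nn.Module):
    """Per-robot recurrent Q-network: (observation, previous behavior,
    hidden state) -> (local Q-values, new hidden state)."""
    def __init__(self, obs_dim, n_behaviors, hidden=64):
        super().__init__()
        self.fc_in = nn.Linear(obs_dim + n_behaviors, hidden)
        self.gru = nn.GRUCell(hidden, hidden)
        self.fc_out = nn.Linear(hidden, n_behaviors)

    def forward(self, obs, prev_behavior_onehot, h):
        x = torch.relu(self.fc_in(torch.cat([obs, prev_behavior_onehot], -1)))
        h_new = self.gru(x, h)
        return self.fc_out(h_new), h_new

def joint_q(local_qs_taken):
    """Additive mixing: the joint Q-value is the sum of the local Q-values
    of the behaviors actually taken, as in equation (28)."""
    return torch.stack(local_qs_taken, dim=0).sum(dim=0)

def td_loss(q_joint, reward, q_joint_target_next, gamma=0.99):
    """Squared TD error against a target network, in the spirit of (29)."""
    target = reward + gamma * q_joint_target_next.detach()
    return ((q_joint - target) ** 2).mean()
```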

Finally, after centralized training the multiple nonholonomic constrained mobile robots have learned a set of optimal distributed behavior-priority policies. In the actual scenario, the robots switch behavior priorities according to the learned policies. Once the behavior priority of each robot is determined at every sampling instant, the reference velocity vi,r and the reference trajectory xi,r are obtained through equations (22)-(24) and (2). According to equation (3), the reference velocity in the inertial coordinate frame and the reference trajectory θi,r are then further computed.

In a preferred embodiment, step S3 is specified as follows. Define the position and velocity tracking errors in the inertial coordinate frame as

ep,i = θi - θi,r, (30)

ev,i = θ̇i - θ̇i,r, (31)

where θi and θ̇i are the angles and angular velocities of the left and right wheels, and θi,r and θ̇i,r are the reference position and reference velocity, respectively.

The differential forms of equations (30) and (31) are derived in equations (32) and (33), where the derivative of the reference velocity appears. The integrated tracking error ei, collecting the position and velocity tracking errors, is then defined in equations (34) and (35).

The value function is defined in equation (36) as the integral of a cost function of the tracking error and the control input, where αV, βV ∈ (0, 2) are adjustable cost parameters satisfying αV + βV = 2.
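The cost function itself appears only as an image in this extraction; the snippet below therefore shows one plausible quadratic instance of a running cost that trades off tracking performance against control effort using the stated parameters αV, βV. The quadratic form is an assumption, not the patent's definition.

```python
import numpy as np

def running_cost(e, u, alpha_v, beta_v):
    """Running cost balancing tracking performance and control effort,
    with adjustable parameters alpha_v, beta_v in (0, 2), alpha_v + beta_v = 2
    (cf. the value function in equation (36))."""
    assert 0 < alpha_v < 2 and 0 < beta_v < 2
    assert abs(alpha_v + beta_v - 2) < 1e-9
    return alpha_v * float(e @ e) + beta_v * float(u @ u)

print(running_cost(np.array([0.2, -0.1]), np.array([0.5, 0.5]),
                   alpha_v=1.2, beta_v=0.8))
```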

Let the optimal tracking control policy be denoted ui*. The optimal value function Vi* can then be expressed as the infimum of the value function over all admissible control policies, as given in equation (37).

By combining equations (32)-(35) and (37), the Hamilton-Jacobi-Bellman (HJB) equation is derived as equation (38), where the gradient of Vi* with respect to ei and the partial gradients of Vi* with respect to ep,i and ev,i appear.

By solving the stationarity condition of the HJB equation with respect to the control input, the optimal control policy ui* is derived as equation (39). Furthermore, substituting equation (39) into (38) yields equation (40).

To implement ui*, equation (40) must be solved for the gradient of Vi*. However, because the dynamic model of the multiple nonholonomic constrained mobile robots is nonlinear and inexact, an analytical solution is difficult to obtain.

Therefore, the gradient of the optimal value function is decomposed as in equation (41), where the decomposition involves positive constants and an adaptive compensation term. Substituting equation (41) into (39) yields equation (42).

It is well known that neural networks have powerful approximation capabilities. Therefore, on given compact sets, the unknown terms fi(xi) and Vio are approximated by neural networks as in equations (43) and (44), where the ideal weight matrices have wf and wV neurons, the corresponding basis function vectors are used, and the approximation errors δf,i and δV,i are bounded as ||δf,i|| ≤ εf and ||δV,i|| ≤ εV with positive constants εf and εV.

Then, substituting equations (43) and (44) into (41) and (42) yields equations (45) and (46).

However, since the ideal weights are unknown, the control policy in equation (46) cannot be implemented directly. Therefore, an identifier-actor-critic reinforcement learning algorithm is used to learn the optimal control policy.

Specifically, the identifier neural network is designed to estimate the unknown nonlinear term as in equation (47), where the estimate of fi(xi) is formed from the identifier network weights. The update law of the identifier network is designed as in equation (48), which involves a positive-definite gain matrix and designed identifier parameters.

Then, the critic neural network is designed to evaluate the control performance as in equation (49), where the estimate of the value function is formed from the critic network weights. The update law of the critic network is designed as in equation (50), where γc,i is the learning rate of the critic.

Finally, the actor neural network is designed to implement the control input as in equation (51), where the actor network weights appear. The update law of the actor network is designed as in equation (52), where γa,i is the learning rate of the actor.
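The update laws (48), (50), and (52) are given only as images in this extraction. The snippet below therefore sketches only the shared structure of the three approximators: linear-in-features networks over radial-basis features, with a prediction-error-driven update shown for the identifier. The feature choice, sizes, and the simple gradient update are placeholders, not the patent's laws.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_features(x, centers, width=1.0):
    """Radial-basis feature vector phi(x) used by the approximators."""
    return np.exp(-np.sum((x[None, :] - centers) ** 2, axis=1) / width)

# Identifier network f_hat(x) = W_f^T phi(x), cf. equation (47).
centers = rng.uniform(-1.0, 1.0, size=(30, 2))
W_f = np.zeros((30, 2))

def identifier_step(W_f, x, x_dot_measured, known_input_term, lr=0.05):
    """Gradient-style identifier update driven by the prediction error;
    a placeholder for the patent's update law (48)."""
    phi = rbf_features(x, centers)
    pred = W_f.T @ phi + known_input_term   # predicted state derivative
    err = x_dot_measured - pred
    return W_f + lr * np.outer(phi, err)

# The critic V_hat = W_c^T phi(e) (eq. (49)) and the actor u = W_a^T phi(e)
# (eq. (51)) share this linear-in-features structure; their update laws
# (50) and (52) are design choices specified in the patent.
```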

In a preferred embodiment, step S4 is specified as follows. First, the control input τi is subject to a symmetric actuator saturation constraint with a known threshold τlim,i > 0, as expressed in equation (53).

Next, the control input is split into two terms as

τi = τ0,i + τΔ,i, (54)

where τ0,i is the nominal term and τΔ,i is the compensation term, which satisfies the condition in equation (55).

Finally, the adaptive compensator is designed with the update law in equation (56), which uses the designed adaptive compensator parameters.
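As a minimal illustration of the saturation handling in equations (53)-(54), the snippet below clips the commanded torque to the known limit and lets a first-order adaptive state track the undeliverable part of the command. The filter-style update is only a generic stand-in for the compensator law (56); gains and values are illustrative.

```python
import numpy as np

def apply_saturation(tau, tau_lim):
    """Symmetric actuator saturation, cf. equation (53)."""
    return np.clip(tau, -tau_lim, tau_lim)

def compensator_step(tau_delta_hat, tau_nominal, tau_lim, k_comp, dt):
    """Adaptive estimate of the saturation effect: tracks the part of the
    nominal command the actuator cannot deliver (stand-in for (56))."""
    excess = tau_nominal - apply_saturation(tau_nominal, tau_lim)
    return tau_delta_hat + dt * k_comp * (excess - tau_delta_hat)

tau_delta_hat = np.zeros(2)
tau_nominal = np.array([3.0, -0.5])      # commanded torque (nominal term)
tau_lim = np.array([2.0, 2.0])           # known thresholds tau_lim,i
for _ in range(100):
    tau_delta_hat = compensator_step(tau_delta_hat, tau_nominal, tau_lim,
                                     k_comp=10.0, dt=0.01)
tau_applied = apply_saturation(tau_nominal, tau_lim)   # respects the limit
print(tau_applied, tau_delta_hat)        # tau_delta_hat estimates the excess
```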

Compared with the prior art, the present invention has the following beneficial effects. The present invention takes the multiple nonholonomic constrained mobile robot system as the research object and proposes a distributed reinforcement learning behavior control method for multiple nonholonomic constrained mobile robots. First, by modeling behavior priority switching as a decentralized partially observable Markov decision process, the method proposes a set of novel distributed reinforcement learning task supervisors that learn optimal distributed behavior-priority policies, so that during task execution the null-space-based behavior control method can switch behavior priorities without relying on any centralized unit, which fundamentally removes the implicit-centralization defect of the null-space-based behavior control method. At the same time, switching behavior priorities with the learned optimal distributed policy not only remedies the fixed behavior priorities of the distributed behavior control framework and improves the dynamic performance of null-space-based behavior control, but also moves a large amount of online computation to the offline learning stage, reducing the dependence of null-space-based behavior control on high-performance hardware. Second, the method proposes reinforcement learning controllers that use an identifier-actor-critic reinforcement learning algorithm to learn the unknown dynamic model and the optimal control policy. Throughout task execution, the balance between control performance and control cost is maintained, especially when behavior priorities are switched; compared with existing null-space-based behavior control methods, the control cost is reduced, avoiding the situation where the multiple nonholonomic constrained mobile robots consume excessive control resources to maintain high control performance. Finally, in contrast to existing null-space-based behavior control methods that do not consider input saturation, the present invention enforces input saturation constraints to prevent the actuators of the multiple nonholonomic constrained mobile robots from exceeding their physical limits and designs a set of adaptive compensators to maintain optimal performance and cancel the saturation effect in real time.

Brief Description of the Drawings

Figure 1 is a block diagram of the principle of the distributed reinforcement learning behavior control method for multiple nonholonomic constrained mobile robots according to an embodiment of the present invention;

Figure 2 is a schematic diagram of the i-th nonholonomic constrained mobile robot according to an embodiment of the present invention;

Figure 3 is a pseudocode diagram of the distributed reinforcement learning task supervisor according to an embodiment of the present invention;

Figure 4 is a schematic diagram of the network topology of the multiple nonholonomic constrained mobile robots according to an embodiment of the present invention;

Figure 5 shows the selected simulation parameter values according to an embodiment of the present invention;

Figure 6 shows the task performance before learning of the distributed reinforcement learning task supervisors according to an embodiment of the present invention: (a) trajectories, (b) orientations, (c) distances between the nonholonomic constrained mobile robots and the obstacles, (d) behavior priorities;

Figure 7 shows the task performance after learning of the distributed reinforcement learning task supervisors according to an embodiment of the present invention: (a) trajectories, (b) orientations, (c) distances between the nonholonomic constrained mobile robots and the obstacles, (d) behavior priorities;

Figure 8 shows the task performance of the multiple nonholonomic constrained mobile robots with different task supervisors according to an embodiment of the present invention: (a) distributed reinforcement learning task supervisor, (b) distributed finite-state-machine task supervisor, (c) distributed model predictive control task supervisor, (d) traditional reinforcement learning task supervisor;

Figure 9 shows the task performance of the second nonholonomic constrained mobile robot with different task supervisors according to an embodiment of the present invention: (a) distributed reinforcement learning task supervisor, (b) distributed finite-state-machine task supervisor, (c) distributed model predictive control task supervisor, (d) traditional reinforcement learning task supervisor;

Figure 10 shows the performance of distributed reinforcement learning control with input saturation constraints according to an embodiment of the present invention: (a) trajectories, (b) control inputs, (c) tracking errors, (d) control cost;

Figure 11 shows the learning performance of distributed reinforcement learning according to an embodiment of the present invention: (a) identifier weights, (b) actor weights, (c) critic weights, (d) adaptive compensators;

Figure 12 shows the control performance of the fifth nonholonomic constrained mobile robot with and without input saturation constraints according to an embodiment of the present invention: (a) trajectories, (b) control inputs, (c) tracking errors, (d) control cost;

Figure 13 shows the trajectories of the multiple nonholonomic constrained mobile robots with different distributed behavior control methods according to an embodiment of the present invention: (a) distributed reinforcement learning behavior control, (b) finite-time distributed behavior control, (c) fixed-time distributed behavior control, (d) traditional reinforcement learning behavior control;

Figure 14 shows the control performance with different distributed behavior control methods according to an embodiment of the present invention: (a) trajectories, (b) control inputs, (c) tracking errors, (d) control cost.

Detailed Description

The present invention will be further described below in conjunction with the accompanying drawings and embodiments.

It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the present application. Unless otherwise specified, all technical and scientific terms used herein have the same meaning as commonly understood by a person of ordinary skill in the art to which the present application belongs.

It should also be noted that the terminology used herein is only for describing specific embodiments and is not intended to limit the exemplary embodiments according to the present application. As used herein, unless the context clearly indicates otherwise, the singular forms are intended to include the plural forms as well. In addition, it should be understood that when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of the stated features, steps, operations, devices, components and/or combinations thereof.

Step 1: Kinematic and dynamic models, basic behaviors, and composite behaviors

a. Kinematic modeling of mobile robots with multiple nonholonomic constraints

Consider a group of N (N > 2) nonholonomic constrained mobile robots, each driven by differential wheels; a schematic diagram of the i-th nonholonomic constrained mobile robot is shown in Figure 2, i = 1, ..., N. The generalized velocity of the i-th nonholonomic constrained mobile robot can be expressed as in equation (1), where the two components are the linear and angular velocities, obtained from the linear speeds of the left and right wheels and the distance between the two wheels; ℝ denotes the set of real numbers.

Then, the kinematic equation of the i-th nonholonomic constrained mobile robot can be expressed as in equation (2), where xi denotes the generalized state, composed of the position and the orientation, and the nonholonomic constraint matrix maps the generalized velocity to the state derivative.

In addition, the kinematic equation of the i-th nonholonomic constrained mobile robot in the inertial coordinate frame is given by equation (3), where ri is the wheel radius, Si(xi) denotes the nonholonomic constraint matrix in the inertial frame, and the rotational speeds of the right and left wheels enter as inputs.

b. Dynamic modeling of mobile robots with multiple nonholonomic constraints

Using the Euler-Lagrange equations, the dynamic model of the i-th nonholonomic constrained mobile robot can be derived as in equation (4), where Mi(xi) is the inertia matrix, Ci(xi, ẋi) is the Coriolis and centripetal force matrix, Gi(xi) is the gravity matrix, and the unknown nonlinear term, the designable input gain matrix Ei(xi), the control input τi, and the nonholonomic constraint force appear as defined in step S12.

First, the differential form of equation (3) can be derived as in equation (5), in which the differential of Si(xi) and the angular acceleration of the wheels appear. Then, substituting equations (3) and (5) into (4) and left-multiplying by SiT(xi) yields equation (6), with the transformed terms defined accordingly.

According to Assumption 2, equation (6) can be rewritten as equation (7), in which the dynamics are separated into an exact term and an inexact term.

Assumption 1: The multiple nonholonomic constrained mobile robot system operates in a static scene, and all non-robot obstacles are static and fixed.

Assumption 2: The input gain matrix Ei(xi) always satisfies the stated design condition.

c. Construction of basic behaviors of mobile robots with multiple nonholonomic constraints

Assume that each nonholonomic constrained mobile robot has M basic behaviors, where the k-th basic behavior of the i-th nonholonomic constrained mobile robot can be modeled mathematically using a task variable σi,k as

σi,k = gi,k(xi), (8)

where gi,k(·) denotes the task function.

Then, the differential form of the task variable σi,k can be expressed as in equation (9), where Ji,k(xi) is the task Jacobian matrix.

Finally, the reference velocity command of the k-th basic behavior of the i-th nonholonomic constrained mobile robot can be expressed as in equation (10), where Ji,k† is the right pseudo-inverse of the task Jacobian Ji,k, together with the desired task function, the task gain, and the task error.

Without loss of generality, the obstacle avoidance behavior, the distributed formation behavior, and the distributed reconstruction behavior are designed as follows.

Obstacle avoidance behavior (OA): obstacle avoidance is a local behavior that ensures that a nonholonomic constrained mobile robot avoids obstacles near its path. Its task function, desired task, and task Jacobian matrix are given in equations (11)-(13), where the task function is the minimum distance between the i-th nonholonomic constrained mobile robot and the obstacle, dOA is the safety distance, the relative position of the minimum-distance point and the desired obstacle-avoidance direction appear in the Jacobian, and + and - indicate that the obstacle lies to the left or to the right of the i-th robot, respectively.

Distributed formation behavior (DF): distributed formation is a distributed cooperative behavior that ensures that the multiple nonholonomic constrained mobile robots form the desired formation using only the states of their neighbors. Its task function, desired task, and task Jacobian matrix are given in equations (14)-(16), in which the estimated state of the distributed formation behavior is obtained by an adaptive estimator with the update law in equation (17), where κDF is a positive constant, the relative formation position and the leader state enter the update law, and Ni denotes the set of neighbors of the i-th nonholonomic constrained mobile robot.

Distributed Reconstruction (DR): the distributed reconstruction behavior is a distributed cooperative behavior intended to ensure that the multiple nonholonomic constrained mobile robots reconstruct the desired formation using only the states of their neighbors. Its task function, desired task, and task Jacobian matrix are given as equation images in the source; the estimated state of the reconstruction behavior appearing in them is computed by an adaptive estimator whose update law (also an equation image) involves a positive constant κDR and the formation reconstruction matrix Γi.

d. Construction of composite behaviors of multiple nonholonomic constrained mobile robots

A composite task is a combination of multiple basic behaviors in a given priority order. Let σi,km denote the task functions of the i-th nonholonomic constrained mobile robot, where km ∈ NM, NM = {1, ..., M}, mk denotes the dimension of the task space, and M denotes the number of tasks. Define a time-dependent priority function gi(km, t): NM × [0, ∞] → NM. Meanwhile, define a task hierarchy with the following rules:

1) A task kα with priority gi(kα) cannot interfere with a task kβ with priority gi(kβ) if gi(kα) ≥ gi(kβ), for all kα, kβ ∈ NM with kα ≠ kβ.

2) The mapping from the robot velocity to the task velocity is described by the task Jacobian matrix Ji,km.

3) The dimension mM of the lowest-priority task may be larger than the quantity shown in the source (equation image); therefore, ensure that the dimension mn is larger than the total dimension of all tasks.

4) The value of gi(km) is assigned by the task supervisor according to the task requirements and the sensor information.

By assigning the given priorities to the basic tasks, the velocity of the composite task at time t can be expressed by equations (22)-(24) (equation images in the source), in which the behavior priorities and the augmented Jacobian matrix of the null-space projection appear.
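Equations (22)-(24) are given only as images, so the following sketch shows a generic null-space-based composition consistent with rule 1 above (lower-priority velocities are projected into the null space of the augmented Jacobian of the higher-priority tasks); all names are illustrative.

import numpy as np

def compose_behaviors(jacobians, velocities):
    """Combine basic behaviors in priority order via null-space projection.

    jacobians  : list of task Jacobians J_1, ..., J_M, highest priority first
    velocities : list of the corresponding reference velocities from eq. (10)

    Lower-priority velocities are projected into the null space of all
    higher-priority tasks, so they cannot interfere with them (rule 1).
    """
    n = jacobians[0].shape[1]
    v = np.zeros(n)
    J_stack = None                       # augmented Jacobian of the tasks handled so far
    for J, v_k in zip(jacobians, velocities):
        if J_stack is None:
            N = np.eye(n)                # nothing to protect yet
        else:
            N = np.eye(n) - np.linalg.pinv(J_stack) @ J_stack   # null-space projector
        v = v + N @ v_k
        J_stack = J if J_stack is None else np.vstack((J_stack, J))
    return v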

Step 2: Design of Distributed Multi-agent Reinforcement Learning Task Supervisors

For distributed multiple nonholonomic constrained mobile robots, each agent must learn a local behavior-priority policy to achieve the cooperative task goal. Since the multiple nonholonomic constrained mobile robots can usually be trained in a centralized offline environment, the distributed behavior-priority switching problem can be modeled as a decentralized partially observable Markov decision process. Using the value-decomposition networks (VDN) reinforcement learning algorithm and the centralized training with distributed execution (CTDE) paradigm, a set of distributed reinforcement learning task supervisors is proposed.

Define the centralized training environment as ε. The global state comprises the joint positions, the joint priorities, and the formation flag, and S denotes the global state set. Define bi,t = {vr,i,t} ∈ B as the local behavior, where B denotes the behavior set. Since the distributed multi-agent reinforcement learning task supervisors cannot access the global state, partially observable states must be used. Define oi,t as the independent local observation, where si,t = {xi, Pri} is the local state, the neighbor set of the i-th nonholonomic constrained mobile robot is also observed, and O denotes the local observation set. Because of the partial observability, the local behavior-observation history is defined as zi,t ∈ Z, where Z denotes the behavior-observation history set. All distributed reinforcement learning task supervisors contribute to a single reward signal, and the reward function is designed as

rt = r1 + r2,  (25)

where r1 and r2 are given by equations (26) and (27) (equation images in the source) in terms of the flags of the no-formation, reconstructed-formation, and desired-formation states. r1 and r2 are reward signals designed to achieve the task goal and to reduce behavior switching, respectively.
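Equations (26)-(27) are images, so only the shape of the reward described above can be illustrated: a goal term keyed to the formation flag plus a penalty for behavior switching. All numeric values below are assumptions, not the patent's settings.

def team_reward(formation_flag, behaviors, prev_behaviors,
                r_goal=(0.0, 0.5, 1.0), switch_penalty=0.1):
    """Illustrative team reward of the form r_t = r_1 + r_2 (eq. 25).

    formation_flag : 0 = no formation, 1 = reconstructed formation, 2 = desired formation
    behaviors      : list of current behavior indices, one per robot
    prev_behaviors : list of behavior indices at the previous step
    """
    r1 = r_goal[formation_flag]                                   # task-goal term (cf. eq. 26)
    r2 = -switch_penalty * sum(b != pb for b, pb in zip(behaviors, prev_behaviors))  # switching term (cf. eq. 27)
    return r1 + r2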

The pseudocode of the distributed reinforcement learning task supervisors is shown in Figure 3. The multiple nonholonomic constrained mobile robots interact with the environment at time step t: the i-th nonholonomic constrained mobile robot observes a local observation oi,t, retrieves its previous behavior bi,t-1, selects a behavior bi,t according to an ε-greedy policy with decay factor γε, then receives a team reward rt and transitions to the next local observation oi,t+1. Specifically, the centralized training of the distributed reinforcement learning task supervisors is carried out with hierarchically arranged modules, namely the individual Q-value modules and the mixing module. First, each nonholonomic constrained mobile robot has an individual Q-value module: a recurrent Q-network that takes the hidden state hi,t-1 of a gated recurrent unit, the local observation oi,t, and the previous behavior bi,t-1 as inputs and outputs a local Q-value. The mixing module then generates the joint Q-value by summing all the local Q-values, as given by equation (28) (equation image in the source), which is parameterized by the weights of the individual Q-value networks.
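A minimal sketch, in PyTorch, of the individual recurrent Q-value module and the VDN mixing of equation (28) (joint Q-value as the sum of the local Q-values); layer sizes and names are illustrative.

import torch
import torch.nn as nn

class IndividualQ(nn.Module):
    """Recurrent Q-module of one robot: (o_{i,t}, b_{i,t-1}, h_{i,t-1}) -> local Q-values."""
    def __init__(self, obs_dim, n_behaviors, hidden=64):
        super().__init__()
        self.enc = nn.Linear(obs_dim + n_behaviors, hidden)
        self.gru = nn.GRUCell(hidden, hidden)
        self.q_head = nn.Linear(hidden, n_behaviors)

    def forward(self, obs, prev_behavior_onehot, h_prev):
        x = torch.relu(self.enc(torch.cat([obs, prev_behavior_onehot], dim=-1)))
        h = self.gru(x, h_prev)
        return self.q_head(h), h          # local Q-values over behaviors, new hidden state

def joint_q(local_qs, behaviors):
    """VDN mixing (eq. 28): the joint Q-value is the sum of the chosen local Q-values."""
    chosen = [q.gather(-1, b.unsqueeze(-1)).squeeze(-1) for q, b in zip(local_qs, behaviors)]
    return torch.stack(chosen, dim=0).sum(dim=0)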

After sampling at time step t, the experience tuple (zt, bt, rt, zt+1) is stored in the replay buffer. In particular, mini-batches of experiences are sampled from the replay buffer to reduce data correlation and improve sample utilization. Training then proceeds to time step t+1 and stops once all Ttotal episodes are completed. The distributed reinforcement learning task supervisors are trained by minimizing the loss in equation (29) (equation image in the source), which is expressed in terms of the parameters of a target network.
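Equation (29) is given only as an image; the sketch assumes the usual squared temporal-difference loss on the joint Q-value computed with a target network, which matches the quantities the text mentions.

def vdn_td_loss(q_tot, q_tot_target_next, reward, gamma=0.99):
    """Assumed form of the training loss (eq. 29): a squared TD error on the joint Q-value.

    q_tot             : joint Q-value of the chosen behaviors at step t
    q_tot_target_next : joint Q-value of the greedy behaviors at step t+1,
                        computed with the target-network parameters
    """
    target = reward + gamma * q_tot_target_next.detach()   # bootstrap with the frozen target net
    return ((target - q_tot) ** 2).mean()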

Finally, after the centralized training, the multiple nonholonomic constrained mobile robots have learned a set of optimal distributed behavior-priority policies. In real scenarios, the robots switch behavior priorities according to the learned policies. Once the behavior priorities are determined at each sampling instant, the reference velocity vi,r and the reference trajectory xi,r are obtained through equations (22)-(24) and (2). According to equation (3), the reference velocity and the reference trajectory θi,r in the inertial coordinate frame can then be computed.

Step 3: Design of the reinforcement learning controllers

The position and velocity tracking errors in the inertial coordinate frame are defined as

ep,i = θi − θi,r,  (30)
ev,i = θ̇i − θ̇i,r,  (31)

where θi collects the angles of the left and right wheels, and θi,r and θ̇i,r are the reference position and the reference velocity, respectively.

The differential forms of equations (30) and (31) can be derived as equations (32) and (33) (equation images in the source), in which the derivative of the reference velocity also appears.

The integrated tracking error is defined by equations (34) and (35) (equation images in the source). The value function is then defined by equation (36) (equation image in the source), where the cost function appears and αV, βV ∈ (0, 2) are adjustable cost parameters satisfying αV + βV = 2.

With the optimal tracking control policy so defined, the optimal value function can be expressed as equation (37) (equation image in the source), in which the set of admissible control policies appears.

By combining equations (32)-(35) and (37), the Hamilton-Jacobi-Bellman (HJB) equation can be derived as equation (38) (equation image in the source), which involves the gradient of Vi* with respect to ei as well as the gradients of Vi* with respect to ep,i and ev,i.

By solving the corresponding stationarity condition, the optimal control policy can be derived as equation (39) (equation image in the source). Substituting equation (39) into (38) then yields equation (40) (equation image in the source).

To implement the optimal control policy, equation (40) has to be solved. However, owing to the nonlinearity and the inaccuracy of the dynamic model of the multiple nonholonomic constrained mobile robots, an analytical solution is difficult to obtain.

Therefore, the gradient of the optimal value function needs to be decomposed as in equation (41) (equation image in the source), whose terms involve positive constants and an adaptive compensation term.

Substituting equation (41) into (39) gives equation (42) (equation image in the source).

Neural networks are well known for their strong approximation capability. Therefore, on given compact sets, the unknown terms fi(xi) and Vio can be approximated by neural networks as in equations (43) and (44) (equation images in the source), where the ideal weight matrices, the numbers of neurons wf and wV, and the basis function vectors appear, and the approximation errors δf,i and δV,i are bounded as ||δf,i|| ≤ εf and ||δV,i|| ≤ εV with positive constants εf and εV.
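The basis functions are not spelled out in the text; given the simulation settings reported later (24 neurons, centers uniformly spread over [-6, 6], width μi = 2), Gaussian radial basis functions are a natural reading, and the sketch below assumes that choice.

import numpy as np

def rbf_features(e, centers, width=2.0):
    """Gaussian radial-basis features phi(e) used to approximate f_i and V_i.

    The simulation section states 24 neurons with centers uniformly spread in
    [-6, 6] and width mu_i = 2; the Gaussian form itself is an assumption.
    """
    d = e[None, :] - centers                      # (n_neurons, dim) center offsets
    return np.exp(-np.sum(d**2, axis=1) / width**2)

centers = np.linspace(-6.0, 6.0, 24)[:, None]     # 24 centers, 1-D example
approx_f = lambda e, W: W.T @ rbf_features(e, centers)   # f_hat = W^T phi(e), with W of shape (24, 2)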

Then, substituting equations (43) and (44) into (41) and (42) yields equations (45) and (46) (equation images in the source).

However, since the ideal weight matrices are unknown, the resulting control policy cannot be implemented directly. Therefore, an identifier-actor-critic reinforcement learning algorithm is used to learn the optimal control policy.

Specifically, the identifier neural network is designed to estimate the unknown nonlinear term as in equation (47) (equation image in the source), where the estimate of fi(xi) is parameterized by the identifier network weights. The update law of the identifier neural network, equation (48) (equation image in the source), involves a positive-definite matrix and the designed identifier parameters.

Then, the critic neural network is designed to evaluate the control performance as in equation (49) (equation image in the source), providing an estimate parameterized by the critic network weights. The update law of the critic network, equation (50) (equation image in the source), uses the critic learning rate γc,i.

Finally, the actor neural network implements the control input as in equation (51) (equation image in the source), parameterized by the actor network weights. The update law of the actor network, equation (52) (equation image in the source), uses the actor learning rate γa,i.
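The update laws (48), (50), and (52) appear only as equation images, so the following sketch only illustrates the structure of the identifier-actor-critic scheme (which network estimates what, and which error signal drives it); the update rules shown are simple placeholders, not the patent's laws.

import numpy as np

class IdentifierActorCritic:
    """Structural sketch of the identifier-actor-critic scheme (eqs. 47-52)."""
    def __init__(self, phi_f, phi_v, n_out, n_f=24, n_v=24,
                 gamma_f=0.5, gamma_c=0.5, gamma_a=0.5):
        self.phi_f, self.phi_v = phi_f, phi_v        # basis-function vectors of eqs. (43)-(44)
        self.Wf = np.zeros((n_f, n_out))             # identifier weights, eq. (47)
        self.Wc = np.zeros(n_v)                      # critic weights,     eq. (49)
        self.Wa = np.zeros((n_v, n_out))             # actor weights,      eq. (51)
        self.gf, self.gc, self.ga = gamma_f, gamma_c, gamma_a

    def identify(self, x):
        return self.Wf.T @ self.phi_f(x)             # estimate of f_i(x_i)

    def control(self, e):
        return -self.Wa.T @ self.phi_v(e)            # actor control term

    def update(self, x, e, model_error, bellman_error):
        # placeholder corrections: each weight moves against its own error signal
        self.Wf += self.gf * np.outer(self.phi_f(x), model_error)
        self.Wc -= self.gc * bellman_error * self.phi_v(e)
        self.Wa -= self.ga * bellman_error * np.outer(self.phi_v(e), self.control(e))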

Step 4: Adaptive compensator design

First, consider that the control input τi is subject to a symmetric actuator saturation constraint, expressed as equation (53) (equation image in the source), where τlim,i > 0 is a known threshold.

Second, the control input can be split into two terms,

τi = τ0,i + τΔ,i,  (54)

where τ0,i is the nominal term and τΔ,i is the compensation term, which satisfy the condition in equation (55) (equation image in the source).

Finally, the adaptive compensator is designed with the update law in equation (56) (equation image in the source), in which the designed adaptive compensator parameters appear.
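A small sketch of the actuator-saturation handling described by equations (53)-(54): the input is the sum of a nominal term and a compensation term and is clipped to the known threshold τlim,i; the compensator update law (56) itself is not reproduced here, and all names are illustrative.

import numpy as np

def saturate(tau, tau_lim):
    """Symmetric actuator saturation of eq. (53): each input is clipped to [-tau_lim, tau_lim]."""
    return np.clip(tau, -tau_lim, tau_lim)

def apply_input(tau_nominal, tau_compensation, tau_lim):
    """Control input split of eq. (54): tau_i = tau_0,i + tau_Delta,i, then saturated.

    tau_compensation would be supplied by the adaptive compensator of eq. (56).
    """
    return saturate(tau_nominal + tau_compensation, tau_lim)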

To describe the present invention in detail, a numerical simulation example is given below to demonstrate the effectiveness and superiority of the proposed distributed reinforcement learning behavior control method for multiple nonholonomic constrained mobile robots.

Simulation comparison and analysis

The numerical simulation considers five networked nonholonomic constrained mobile robots that form a desired formation while avoiding obstacles by executing the obstacle avoidance, formation, and reconstruction behaviors. The first nonholonomic constrained mobile robot is set as the leader, and its desired formation task function is x1,d = x0 = [t; 0]. The initial positions of the robots are x1,0 = [0; 0; 0], x2,0 = [-7; 7; 0], x3,0 = [-7; -7; 0], x4,0 = [-12; 12; 0], and x5,0 = [-12; -12; 0]. The target positions are x1,g = [80; 0; 0], x2,g = [75; 5; 0], x3,g = [75; -5; 0], x4,g = [70; 10; 0], and x5,g = [70; -10; 0]. The relative formation positions are given as equation images in the source. The reconstruction matrices are Γ2 = [2/5, 0, 0; 0, 0, 0; 0, 0, 0], Γ3 = [4/5, 0, 0; 0, 0, 0; 0, 0, 0], Γ4 = [3/5, 0, 0; 0, 0, 0; 0, 0, 0], and Γ5 = [4/5, 0, 0; 0, 0, 0; 0, 0, 0]. The initial identifier weight matrices are ωf,1 = [0.46]24×2, ωf,2 = [0.47]24×2, ωf,3 = [0.48]24×2, ωf,4 = [0.49]24×2, and ωf,5 = [0.5]24×2. The initial actor weight matrices are ωa,1 = [0.93]24×2, ωa,2 = [0.95]24×2, ωa,3 = [0.96]24×2, ωa,4 = [0.97]24×2, and ωa,5 = [0.99]24×2. The initial critic weight matrices are ωc,1 = [0.92]24×2, ωc,2 = [0.94]24×2, ωc,3 = [0.96]24×2, ωc,4 = [0.98]24×2, and ωc,5 = [1]24×2. The neural networks have wf = 24 and wV = 24 neurons, with centers uniformly distributed over [-6, 6] and width μi = 2. The unknown nonlinear terms of the robots are set as given in the source (equation image). The network topology of the multiple nonholonomic constrained mobile robots is shown in Figure 4, and the other simulation parameters are listed in Figure 5.

Figures 6-7 compare the task performance before and after learning of the distributed reinforcement learning task supervisors. Before learning, behavior priorities are selected at random, which causes frequent priority switching and non-smooth trajectories and fails to complete the task goal; after learning, priority switching is significantly reduced, the trajectories become smooth, and the prescribed task goal is completed. Figures 8-9 compare the task performance of the distributed reinforcement learning task supervisors (DRLMSs), the distributed finite state automata mission supervisors (DFSAMSs), the distributed model prediction control mission supervisors (DMPCMSs), and a conventional reinforcement learning mission supervisor (RLMS): the DFSAMSs trajectories oscillate strongly and at times violate the safety distance, the DMPCMSs have the largest algorithm iteration time so real-time operation cannot be guaranteed, and the RLMS ignores swarm intelligence, whereas the DRLMSs maintain good task performance with a low iteration time. Figures 10-12 compare the control performance of the distributed behavior control method with and without the input saturation constraint: without the constraint, the control input and the control cost reach unacceptably high values when behavior priorities switch, whereas with the constraint the control input always remains within an acceptable range. Figures 13-14 compare the control performance of the distributed reinforcement learning behavior control (DRLBC), the finite-time distributed behavioral control (finite-time DBC), the fixed-time distributed behavioral control (fixed-time DBC), and the conventional reinforcement learning behavioral control (RLBC): the finite-time DBC and fixed-time DBC trajectories exhibit some oscillation due to the chattering of sliding-mode control, and RLBC ignores swarm intelligence and yields very unsatisfactory control results, whereas DRLBC produces smooth trajectories and is the only distributed behavior control method that satisfies the control saturation constraint. All comparisons demonstrate the effectiveness and superiority of the proposed distributed behavior control method.
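The stated simulation settings can be collected into a configuration as below; the formation offsets and the unknown nonlinear terms are omitted because they appear only as equation images in the source.

import numpy as np

# Simulation setup collected from the description (leader is robot 1, x_{1,d} = [t; 0]).
initial_states = np.array([[0, 0, 0], [-7, 7, 0], [-7, -7, 0], [-12, 12, 0], [-12, -12, 0]], dtype=float)
goal_states    = np.array([[80, 0, 0], [75, 5, 0], [75, -5, 0], [70, 10, 0], [70, -10, 0]], dtype=float)

# Formation reconstruction matrices Gamma_2 ... Gamma_5 (only the (1,1) entry is nonzero).
gammas = {i: np.diag([g, 0.0, 0.0]) for i, g in zip(range(2, 6), [2/5, 4/5, 3/5, 4/5])}

# Initial neural-network weights: 24 x 2 constant matrices per robot.
w_identifier = [np.full((24, 2), c) for c in (0.46, 0.47, 0.48, 0.49, 0.50)]
w_actor      = [np.full((24, 2), c) for c in (0.93, 0.95, 0.96, 0.97, 0.99)]
w_critic     = [np.full((24, 2), c) for c in (0.92, 0.94, 0.96, 0.98, 1.00)]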

Claims (5)

1. A reinforcement learning behavior control method for multiple nonholonomic constrained mobile robots, characterized by comprising the following steps:

Step S1: establish the kinematic model of the multiple nonholonomic constrained mobile robots based on the nonholonomic constraint matrix, establish their dynamic model based on the Euler-Lagrange equations, construct the basic behaviors from the established kinematic model, and combine the designed basic behaviors into composite behaviors with different priority orders through the null-space projection technique;

Step S2: model the behavior priority switching as a decentralized partially observable Markov decision process; under the centralized-training distributed-execution reinforcement learning framework, take the reference velocity commands of the composite behaviors as the action set of the reinforcement learning algorithm, take the position and priority of each nonholonomic constrained robot together with the positions and priorities of its neighboring robots as the observation set, and design the reward function, thereby constructing the distributed reinforcement learning task supervisors (DRLMSs);

Step S3: with the goal of balancing control performance and control loss, introduce an identifier-actor-critic reinforcement learning algorithm to identify the unknown dynamic model online, implement the control policy, and evaluate the control performance, thereby designing the reinforcement learning controllers (RLCs);

Step S4: based on adaptive control theory, design an adaptive compensator to maintain optimal control performance and cancel the saturation effect in real time.

2. The reinforcement learning behavior control method for multiple nonholonomic constrained mobile robots according to claim 1, characterized in that step S1 specifically comprises the following steps:

Step S11: kinematic modeling of the multiple nonholonomic constrained mobile robots.

Consider a group of N (N > 2) nonholonomic constrained mobile robots, each driven by differential wheels, i = 1, ..., N; the generalized velocity of the i-th nonholonomic constrained mobile robot is expressed as
Figure FDA0004129513250000021
Figure FDA0004129513250000021
其中,
Figure FDA0004129513250000022
Figure FDA0004129513250000023
Figure FDA0004129513250000024
分别是线速度和角速度,
Figure FDA0004129513250000025
Figure FDA0004129513250000026
分别是左右轮的线速度,
Figure FDA0004129513250000027
是左右轮间的距离,
Figure FDA0004129513250000028
表示实数集合;
in,
Figure FDA0004129513250000022
Figure FDA0004129513250000023
and
Figure FDA0004129513250000024
are the linear velocity and angular velocity, respectively.
Figure FDA0004129513250000025
and
Figure FDA0004129513250000026
are the linear speeds of the left and right wheels,
Figure FDA0004129513250000027
is the distance between the left and right wheels,
Figure FDA0004129513250000028
represents the set of real numbers;
然后,第i个非完整约束移动机器人的运动学方程表示为Then, the kinematic equation of the i-th nonholonomic constrained mobile robot is expressed as
Figure FDA0004129513250000029
Figure FDA0004129513250000029
其中,
Figure FDA00041295132500000210
表示广义状态,
Figure FDA00041295132500000211
Figure FDA00041295132500000212
分别是位置和方向,
Figure FDA00041295132500000213
表示非完整约束矩阵;
in,
Figure FDA00041295132500000210
represents a generalized state,
Figure FDA00041295132500000211
and
Figure FDA00041295132500000212
are position and direction,
Figure FDA00041295132500000213
represents a non-holonomic constraint matrix;
此外,第i个非完整约束移动机器人在惯性坐标系下的运动学方程为In addition, the kinematic equation of the i-th nonholonomic constrained mobile robot in the inertial coordinate system is:
Figure FDA00041295132500000214
Figure FDA00041295132500000214
其中,
Figure FDA00041295132500000215
是轮半径,
Figure FDA00041295132500000216
表示惯性坐标性下的非完整约束矩阵,
Figure FDA00041295132500000217
Figure FDA00041295132500000218
分别是左右轮的旋转速度;
in,
Figure FDA00041295132500000215
is the wheel radius,
Figure FDA00041295132500000216
represents the nonholonomic constraint matrix in inertial coordinates,
Figure FDA00041295132500000217
and
Figure FDA00041295132500000218
are the rotation speeds of the left and right wheels respectively;
步骤S12:多非完整约束移动机器人动力学建模Step S12: Dynamics modeling of mobile robots with multiple nonholonomic constraints 通过使用欧拉拉格朗日方程,第i个非完整约束移动机器人的动力学模型推导为By using the Euler-Lagrange equations, the dynamic model of the i-th nonholonomic constrained mobile robot is derived as
Figure FDA00041295132500000219
Figure FDA00041295132500000219
其中,
Figure FDA0004129513250000031
是惯性矩阵,
Figure FDA0004129513250000032
是科氏力和向心力矩阵,Gi(xi)是重力矩阵,
Figure FDA0004129513250000033
表示未知非线性项,
Figure FDA0004129513250000034
是可设计的输入增益矩阵,
Figure FDA0004129513250000035
是控制输入,
Figure FDA0004129513250000036
是非完整约束力;
in,
Figure FDA0004129513250000031
is the inertia matrix,
Figure FDA0004129513250000032
is the Coriolis force and centripetal force matrix, G i ( xi ) is the gravity matrix,
Figure FDA0004129513250000033
represents the unknown nonlinear term,
Figure FDA0004129513250000034
is a designable input gain matrix,
Figure FDA0004129513250000035
is the control input,
Figure FDA0004129513250000036
It is not completely binding;
首先,公式(3)的微分形式推导如下First, the differential form of formula (3) is derived as follows
Figure FDA0004129513250000037
Figure FDA0004129513250000037
其中,
Figure FDA0004129513250000038
表示Si(xi)的微分,
Figure FDA0004129513250000039
是轮的角加速度;
in,
Figure FDA0004129513250000038
represents the differential of S i ( xi ),
Figure FDA0004129513250000039
is the angular acceleration of the wheel;
然后,将公式(3)和(5)代入(4),并左乘
Figure FDA00041295132500000310
得到以下方程
Then, substitute formulas (3) and (5) into (4) and multiply on the left by
Figure FDA00041295132500000310
The following equation is obtained
Figure FDA00041295132500000311
Figure FDA00041295132500000311
其中,
Figure FDA00041295132500000312
Figure FDA00041295132500000313
Figure FDA00041295132500000314
in,
Figure FDA00041295132500000312
Figure FDA00041295132500000313
Figure FDA00041295132500000314
根据假设2,公式(6)改写为According to Assumption 2, formula (6) can be rewritten as
Figure FDA00041295132500000315
Figure FDA00041295132500000315
其中,
Figure FDA00041295132500000316
是精确项,
Figure FDA00041295132500000317
是非精确项;
in,
Figure FDA00041295132500000316
is the exact term,
Figure FDA00041295132500000317
is an inexact term;
假设1:多非完整约束移动机器人系统工作在一个静态的场景中,所有非机器人的障碍物均为静态且固定的;Assumption 1: The multi-nonholonomic constrained mobile robot system works in a static scene, and all non-robot obstacles are static and fixed; 假设2:输入增益矩阵Ei(xi)始终满足设计为
Figure FDA00041295132500000318
Assumption 2: The input gain matrix E i ( xi ) always satisfies the design
Figure FDA00041295132500000318
步骤S13:多非完整约束移动机器人基本行为构建Step S13: Construction of basic behaviors of mobile robots with multiple nonholonomic constraints 假设每一个非完整约束移动机器人均有M个基本行为,其中第i个非完整约束移动机器人的第k个基本行为可以使用一个任务变量
Figure FDA00041295132500000319
Figure FDA00041295132500000320
进行数学建模如下
Assume that each nonholonomic constrained mobile robot has M basic behaviors, where the kth basic behavior of the ith nonholonomic constrained mobile robot can use a task variable
Figure FDA00041295132500000319
Figure FDA00041295132500000320
The mathematical modeling is as follows
σi,k=gi,k(xi), (8)σ i,k =gi ,k ( xi ), (8) 其中,
Figure FDA00041295132500000321
表示任务函数;
in,
Figure FDA00041295132500000321
Represents the task function;
然后,任务变量σi,k的微分形式表示为Then, the differential form of the task variable σ i,k is expressed as
Figure FDA00041295132500000322
Figure FDA00041295132500000322
其中,
Figure FDA00041295132500000323
是任务的雅克比矩阵;
in,
Figure FDA00041295132500000323
is the Jacobian matrix of the task;
最后,第i个非完整约束移动机器人的第k个基本行为的参考速度指令可以表示为Finally, the reference velocity command of the kth basic behavior of the i-th nonholonomic constrained mobile robot can be expressed as
Figure FDA0004129513250000041
Figure FDA0004129513250000041
其中,
Figure FDA0004129513250000042
是任务的雅克比矩阵Ji,k的右伪逆,
Figure FDA0004129513250000043
是期望的任务函数,
Figure FDA0004129513250000044
是任务增益,
Figure FDA0004129513250000045
是任务误差;
in,
Figure FDA0004129513250000042
is the right pseudo-inverse of the Jacobian matrix Ji ,k of the task,
Figure FDA0004129513250000043
is the desired task function,
Figure FDA0004129513250000044
is the task gain,
Figure FDA0004129513250000045
is the task error;
在不失一般性的前提下,避障行为、分布式编队行为和分布式重构行为设计如下:Without loss of generality, the obstacle avoidance behavior, distributed formation behavior, and distributed reconstruction behavior are designed as follows: 避障行为:避障行为是一种局部行为,旨在确保非完整约束移动机器人避开路径附近的障碍物,其相应的任务函数、期望任务和任务雅克比矩阵分别表示为:Obstacle avoidance behavior: Obstacle avoidance behavior is a local behavior that aims to ensure that the non-holonomic constrained mobile robot avoids obstacles near the path. Its corresponding task function, expected task, and task Jacobian matrix are expressed as:
Figure FDA0004129513250000046
Figure FDA0004129513250000046
Figure FDA0004129513250000047
Figure FDA0004129513250000047
Figure FDA0004129513250000048
Figure FDA0004129513250000048
其中,
Figure FDA0004129513250000049
表示第i个非完整约束移动机器人与障碍物的最小距离,dOA为安全距离,
Figure FDA00041295132500000410
Figure FDA00041295132500000411
是最小距离的相对位置,
Figure FDA00041295132500000412
是避障行为期望的方向,+和-分别表示障碍物在第i个非完整约束移动机器人的左边和右边;
in,
Figure FDA0004129513250000049
represents the minimum distance between the i-th nonholonomic constrained mobile robot and the obstacle, d OA is the safety distance,
Figure FDA00041295132500000410
Figure FDA00041295132500000411
is the relative position of the minimum distance,
Figure FDA00041295132500000412
is the desired direction of the obstacle avoidance behavior, + and − respectively indicate that the obstacle is to the left and right of the i-th nonholonomically constrained mobile robot;
分布式编队行为:分布式编队行为是一种分布式协作行为,旨在确保多非完整约束移动机器人仅通过使用邻居的状态形成所需的队形,其相应的任务函数、期望任务和任务雅克比矩阵分别表示为:Distributed formation behavior: Distributed formation behavior is a distributed cooperative behavior that aims to ensure that multiple nonholonomic constrained mobile robots form the desired formation by only using the states of their neighbors. The corresponding task function, expected task, and task Jacobian matrix are expressed as:
Figure FDA00041295132500000413
Figure FDA00041295132500000413
Figure FDA00041295132500000414
Figure FDA00041295132500000414
Figure FDA00041295132500000415
Figure FDA00041295132500000415
其中,
Figure FDA00041295132500000416
是分布式编队行为的估计状态,其通过设计具有如下更新率的自适应估计器来估计:
in,
Figure FDA00041295132500000416
is the estimated state of the distributed formation behavior, which is estimated by designing an adaptive estimator with the following update rate:
Figure FDA0004129513250000051
Figure FDA0004129513250000051
其中,κDF是一个正常数,
Figure FDA0004129513250000052
是编队的相对位置,
Figure FDA0004129513250000053
表示领航者的状态,
Figure FDA0004129513250000054
表示第i个非完整约束移动机器人的邻居;
Among them, κ DF is a positive constant,
Figure FDA0004129513250000052
is the relative position of the formation,
Figure FDA0004129513250000053
Indicates the status of the navigator.
Figure FDA0004129513250000054
represents the neighbors of the i-th nonholonomic constrained mobile robot;
分布式重构行为:分布式重构行为是一种分布式协作行为,旨在确保多非完整约束移动机器人仅通过使用邻居的状态重构所需的队形,其相应的任务函数、期望任务和任务雅克比矩阵分别表示为:Distributed Reconfiguration Behavior: Distributed reconfiguration behavior is a distributed cooperative behavior designed to ensure that multiple nonholonomic constrained mobile robots reconstruct the desired formation only by using the states of their neighbors. The corresponding task function, expected task, and task Jacobian matrix are expressed as:
Figure FDA0004129513250000055
Figure FDA0004129513250000055
Figure FDA0004129513250000056
Figure FDA0004129513250000056
Figure FDA0004129513250000057
Figure FDA0004129513250000057
其中,
Figure FDA0004129513250000058
是分布式编队行为的估计状态,其通过设计具有如下更新率的自适应估计器来估计:
in,
Figure FDA0004129513250000058
is the estimated state of the distributed formation behavior, which is estimated by designing an adaptive estimator with the following update rate:
Figure FDA0004129513250000059
Figure FDA0004129513250000059
其中,κDR是一个正常数,
Figure FDA00041295132500000510
是编队重构矩阵;
Among them, κ DR is a positive constant,
Figure FDA00041295132500000510
is the formation reconstruction matrix;
步骤S14:多非完整约束移动机器人复合行为构建Step S14: Construction of composite behaviors of mobile robots with multiple nonholonomic constraints 一个复合任务是多个基本行为以一定的优先级顺序的组合;设定
Figure FDA00041295132500000511
为第i个非完整约束移动机器人的任务函数,其中km∈NM,NM={1,...,M},mk表示任务空间的维度,M表示任务的数量;定义与时间相关的优先级函数gi(km,t):NM×[0,∞]→NM;同时,定义一个具有如下规则的任务层次结构:
A composite task is a combination of multiple basic behaviors in a certain priority order;
Figure FDA00041295132500000511
is the task function of the i-th nonholonomic constrained mobile robot, where km∈NM , NM ={1,...,M}, mk represents the dimension of the task space, and M represents the number of tasks; define a time-related priority function gi ( km ,t): NM ×[0,∞]→ NM ; and define a task hierarchy with the following rules:
1)一个具有gi(kα)优先级的任务kα不能干扰具有gi(kβ)优先级的任务kβ,如果gi(kα)≥gi(kβ),
Figure FDA00041295132500000512
kα≠kβ
1) A task k α with priority gi (k α ) cannot interfere with task k β with priority gi (k β ) if gi (k α ) ≥gi (k β ),
Figure FDA00041295132500000512
k α ≠ k β ;
2)从速度到任务速度的映射关系由任务的雅可比矩阵
Figure FDA00041295132500000513
表示;
2) The mapping from speed to task speed is given by the Jacobian matrix of the task
Figure FDA00041295132500000513
express;
3)具有最低优先级任务mM的维度可能大于
Figure FDA00041295132500000514
因此要确保维度mn大于所有任务的总维度;
3) The dimension of the task m with the lowest priority may be greater than
Figure FDA00041295132500000514
Therefore, make sure that the dimension m n is greater than the total dimension of all tasks;
4)gi(km)的值由任务监管器根据任务的需求和传感器信息进行分配;4) The value of g i (k m ) is assigned by the task supervisor according to the task requirements and sensor information; 通过给基本任务分配给定的优先级,t时刻复合任务的速度表示为By assigning a given priority to the basic tasks, the speed of the composite task at time t is expressed as
Figure FDA0004129513250000061
Figure FDA0004129513250000061
Figure FDA0004129513250000062
Figure FDA0004129513250000062
Figure FDA0004129513250000063
Figure FDA0004129513250000063
其中,
Figure FDA0004129513250000064
是行为优先级,
Figure FDA0004129513250000065
是零空间投影的增广雅克比矩阵。
in,
Figure FDA0004129513250000064
is the behavioral priority,
Figure FDA0004129513250000065
is the augmented Jacobian matrix of the null space projection.
3.根据权利要求1所述的面向多非完整约束移动机器人的强化学习行为控制方法,其特征在于:所述步骤S2具体为:定义集中式训练环境为ε,全局的状态为
Figure FDA0004129513250000066
其中
Figure FDA0004129513250000067
是联合的位置,
Figure FDA0004129513250000068
是联合的优先级,
Figure FDA0004129513250000069
是编队标志位,S表示全局状态集合;定义bi,t={vr,i,t}∈B为局部/本地行为,其中B表示行为集合;定义
Figure FDA00041295132500000610
为独立的局部观测,其中si,t={xi,Pri}是局部/本地状态,
Figure FDA00041295132500000611
表示第i个非完整约束移动机器人的邻居,O表示局部观测集合;由于局部观测,定义局部/本地的行为观测历史为zi,t∈Z,其中Z表示行为观测历史集合;所有的分布式强化学习任务监管器贡献一个奖励信号,且奖励函数设计如下
3. The reinforcement learning behavior control method for multiple nonholonomic constrained mobile robots according to claim 1 is characterized in that: the step S2 specifically includes: defining the centralized training environment as ε, and the global state as
Figure FDA0004129513250000066
in
Figure FDA0004129513250000067
is the joint location,
Figure FDA0004129513250000068
is the priority of the union,
Figure FDA0004129513250000069
is the formation flag, S represents the global state set; define b i,t ={v r,i,t }∈B as a local behavior, where B represents the behavior set; define
Figure FDA00041295132500000610
is an independent local observation, where s i,t = {x i ,Pr i } is a local state,
Figure FDA00041295132500000611
represents the neighbors of the i-th nonholonomic constrained mobile robot, O represents the local observation set; due to local observation, the local behavior observation history is defined as z i,t ∈Z, where Z represents the behavior observation history set; all distributed reinforcement learning task supervisors contribute a reward signal, and the reward function is designed as follows
rt=r1+r2, (25)r t = r 1 + r 2 , (25)
Figure FDA0004129513250000071
Figure FDA0004129513250000071
Figure FDA0004129513250000072
Figure FDA0004129513250000072
其中,
Figure FDA0004129513250000073
分别表示无编队、重构编队和期望编队状态的标识;r1和r2是分别设置实现任务目标和减少行为切换的奖励信号;
in,
Figure FDA0004129513250000073
They represent the identities of no formation, reconstructed formation, and desired formation states, respectively; r1 and r2 are reward signals for achieving task goals and reducing behavior switching, respectively;
多非完整约束移动机器人与环境ε在t时间步进行交互,其中第i个非完整约束移动机器人观测到一个局部观测oi,t,获取到上一个行为bi,t-1,根据具有衰减因子γε的ε贪心策略选取一个行为bi,t,然后得到一个团队奖励rt和转移至下一个局部观测oi,t+1;具体而言,分布式强化学习任务监管器的集中式训练是通过分层渐进模块进行的,包括独立Q值模块和混合模块;首先,每一个非完整约束移动机器人都有一个独立Q值模块,其使用循环Q网络输入门循环神经网络的隐藏层状态hi,t-1,局部观测oi,t,上一个行为bi,t-1,输出局部的Q值
Figure FDA0004129513250000074
然后,混合模块通过求和所有的局部的Q值
Figure FDA0004129513250000075
生成联合Q值
Figure FDA0004129513250000076
如下
Multiple nonholonomic constrained mobile robots interact with the environment ε at t time steps, where the i-th nonholonomic constrained mobile robot observes a local observation o i,t , obtains the previous behavior b i,t-1 , selects a behavior b i,t according to the ε-greedy strategy with a decay factor γ ε , and then obtains a team reward r t and transfers to the next local observation o i,t+1 ; Specifically, the centralized training of the distributed reinforcement learning task supervisor is carried out through hierarchical progressive modules, including independent Q-value modules and hybrid modules; First, each nonholonomic constrained mobile robot has an independent Q-value module, which uses the recurrent Q network input gate recurrent neural network hidden layer state h i,t-1 , local observation o i,t , previous behavior b i,t-1 , and outputs a local Q-value
Figure FDA0004129513250000074
The mixing module then sums up all the local Q values
Figure FDA0004129513250000075
Generate joint Q value
Figure FDA0004129513250000076
as follows
Figure FDA0004129513250000077
Figure FDA0004129513250000077
其中,
Figure FDA0004129513250000078
表示独立Q值网络的参数;
in,
Figure FDA0004129513250000078
represents the parameters of the independent Q value network;
在t时间步采样后,将经历四元组(zt,bt,rt,zt+1)存储到经验池
Figure FDA0004129513250000081
中;特别地,将从经验池中采样最小回放次
Figure FDA0004129513250000082
经历以减少数据相关性和提高样本利用率;然后,训练进行到t+1时间步,且训练直至所有的回合Ttotal完成后停止;分布式强化学习任务监管器通过最小化以下损失进行训练:
After sampling at time step t, the experience quadruple (z t , b t , r t , z t+1 ) is stored in the experience pool
Figure FDA0004129513250000081
In particular, the minimum number of replays will be sampled from the experience pool
Figure FDA0004129513250000082
Experience to reduce data correlation and improve sample utilization; then, training proceeds to t+1 time step, and training stops after all rounds T total are completed; the distributed reinforcement learning task supervisor is trained by minimizing the following loss:
Figure FDA0004129513250000083
Figure FDA0004129513250000083
其中,
Figure FDA0004129513250000084
Figure FDA0004129513250000085
表示目标网络的参数;
in,
Figure FDA0004129513250000084
Figure FDA0004129513250000085
Represents the parameters of the target network;
最后,多非完整约束移动机器人在集中式训练后学习到一组最优分布式行为优先级策略;在实际场景中,多非完整约束移动机器人根据学习到的策略切换行为优先级;一旦在每个采样时刻确定了多非完整约束移动机器人的行为优先级,就通过公式(22)-(24)和(2)获取参考速度vi,r和参考轨迹xi,r;根据公式(3),进一步计算得到惯性坐标系下的参考速度
Figure FDA0004129513250000086
和参考轨迹θi,r
Finally, after centralized training, the multi-nonholonomic constrained mobile robot learns a set of optimal distributed behavior priority strategies; in actual scenarios, the multi-nonholonomic constrained mobile robot switches behavior priorities according to the learned strategies; once the behavior priorities of the multi-nonholonomic constrained mobile robot are determined at each sampling moment, the reference speed v i,r and reference trajectory x i,r are obtained through formulas (22)-(24) and (2); according to formula (3), the reference speed in the inertial coordinate system is further calculated
Figure FDA0004129513250000086
and the reference trajectory θ i,r .
4.根据权利要求1所述的面向多非完整约束移动机器人的强化学习行为控制方法,其特征在于:步骤S3具体为:定义惯性坐标系下的位置和速度跟踪误差分别为4. The reinforcement learning behavior control method for multiple nonholonomic constrained mobile robots according to claim 1 is characterized in that: step S3 specifically comprises: defining the position and velocity tracking errors in the inertial coordinate system as: ep,i=θii,r, (30)e p,iii,r , (30)
Figure FDA0004129513250000087
Figure FDA0004129513250000087
其中,
Figure FDA0004129513250000088
Figure FDA0004129513250000089
Figure FDA00041295132500000810
是左右轮的角度,
Figure FDA00041295132500000811
Figure FDA00041295132500000812
分别是参考位置和参考速度;
in,
Figure FDA0004129513250000088
Figure FDA0004129513250000089
and
Figure FDA00041295132500000810
is the angle of the left and right wheels,
Figure FDA00041295132500000811
and
Figure FDA00041295132500000812
are the reference position and reference velocity respectively;
公式(30)和(31)的微分形式推导为The differential forms of formulas (30) and (31) are derived as follows:
Figure FDA00041295132500000813
Figure FDA00041295132500000813
Figure FDA0004129513250000091
Figure FDA0004129513250000091
其中,
Figure FDA0004129513250000092
Figure FDA0004129513250000093
的微分;
in,
Figure FDA0004129513250000092
yes
Figure FDA0004129513250000093
The differential of
定义集成的跟踪误差如下The integrated tracking error is defined as follows
Figure FDA0004129513250000094
Figure FDA0004129513250000094
Figure FDA0004129513250000095
Figure FDA0004129513250000095
定义值函数如下Define the value function as follows
Figure FDA0004129513250000096
Figure FDA0004129513250000096
其中,
Figure FDA0004129513250000097
表示代价函数,αVV∈(0,2)是可调整的代价参数,且满足αVV=2;
in,
Figure FDA0004129513250000097
represents the cost function, α V , β V ∈(0,2) are adjustable cost parameters, and satisfy α VV =2;
定义
Figure FDA0004129513250000098
为最优的跟踪控制策略;因此,最优的值函数可以表示为
definition
Figure FDA0004129513250000098
is the optimal tracking control strategy; therefore, the optimal value function can be expressed as
Figure FDA0004129513250000099
Figure FDA0004129513250000099
其中,
Figure FDA00041295132500000910
表示可容许的控制策略;
in,
Figure FDA00041295132500000910
represents the permissible control strategies;
通过结合公式(32)-(35)和(37),哈密顿-雅克比-贝尔曼(Hamilton-Jacobi-Bellman,HJB)方程可推导为By combining equations (32)-(35) and (37), the Hamilton-Jacobi-Bellman (HJB) equation can be derived as
Figure FDA00041295132500000911
Figure FDA00041295132500000911
其中,
Figure FDA0004129513250000101
表示Vi *相对于ei的梯度,
Figure FDA0004129513250000102
Figure FDA0004129513250000103
表示分别为Vi *相对于ep,i和ev,i的梯度;
in,
Figure FDA0004129513250000101
represents the gradient of Vi * relative to ei ,
Figure FDA0004129513250000102
and
Figure FDA0004129513250000103
denote the gradients of Vi * relative to ep,i and ev ,i respectively;
通过求解
Figure FDA0004129513250000104
最优的控制策略
Figure FDA0004129513250000105
可推导为
By solving
Figure FDA0004129513250000104
Optimal control strategy
Figure FDA0004129513250000105
It can be deduced as
Figure FDA0004129513250000106
Figure FDA0004129513250000106
此外,将公式(39)代入(38)可获得以下等式:Furthermore, substituting formula (39) into (38) yields the following equation:
Figure FDA0004129513250000107
Figure FDA0004129513250000107
为了实施
Figure FDA0004129513250000108
需要求解公式(40)获取
Figure FDA0004129513250000109
然而,由于多非完整约束移动机器人动力学模型的非线性和不精确,
Figure FDA00041295132500001010
的解析解难以求取;
To implement
Figure FDA0004129513250000108
We need to solve formula (40) to obtain
Figure FDA0004129513250000109
However, due to the nonlinearity and inaccuracy of the dynamics model of mobile robots with multiple nonholonomic constraints,
Figure FDA00041295132500001010
The analytical solution of is difficult to obtain;
因此,需要将最优值函数梯度分解如下:Therefore, the optimal value function gradient needs to be decomposed as follows:
Figure FDA00041295132500001011
Figure FDA00041295132500001011
其中,
Figure FDA00041295132500001012
Figure FDA00041295132500001013
Figure FDA00041295132500001014
是正常数,
Figure FDA00041295132500001015
是自适应补偿项;
in,
Figure FDA00041295132500001012
Figure FDA00041295132500001013
and
Figure FDA00041295132500001014
is a normal number,
Figure FDA00041295132500001015
is the adaptive compensation term;
将公式(41)代入(39),获取等式如下:Substituting formula (41) into (39), we obtain the following equation:
Figure FDA00041295132500001016
Figure FDA00041295132500001016
众所周知,神经网络具有强大的逼近能力;因此,给定紧集
Figure FDA0004129513250000111
Figure FDA0004129513250000112
对于
Figure FDA0004129513250000113
Figure FDA0004129513250000114
未知项fi(xi)和Vi o通过神经网络近似如下:
As we all know, neural networks have powerful approximation capabilities; therefore, given a compact set
Figure FDA0004129513250000111
and
Figure FDA0004129513250000112
for
Figure FDA0004129513250000113
and
Figure FDA0004129513250000114
The unknown terms fi ( xi ) and Vio are approximated by the neural network as follows:
Figure FDA0004129513250000115
Figure FDA0004129513250000115
Figure FDA0004129513250000116
Figure FDA0004129513250000116
其中,
Figure FDA0004129513250000117
Figure FDA0004129513250000118
是理想的权重矩阵,wf和wV是神经元数量,
Figure FDA0004129513250000119
Figure FDA00041295132500001110
是基函数向量,
Figure FDA00041295132500001111
Figure FDA00041295132500001112
是逼近误差,且有界如||δf,i||≤εf和||δV,i||≤εV,εf和εV是正常数;
in,
Figure FDA0004129513250000117
and
Figure FDA0004129513250000118
is the ideal weight matrix, wf and wV are the number of neurons,
Figure FDA0004129513250000119
and
Figure FDA00041295132500001110
is the basis function vector,
Figure FDA00041295132500001111
and
Figure FDA00041295132500001112
is the approximation error and is bounded such that ||δ f,i ||≤ε f and ||δ V,i ||≤ε V , ε f and ε V are positive constants;
然后,将公式(43)和(44)代入(41)和(42),获取到以下方程:Then, substitute equations (43) and (44) into (41) and (42) to obtain the following equations:
Figure FDA00041295132500001113
Figure FDA00041295132500001113
Figure FDA00041295132500001114
Figure FDA00041295132500001114
然而,由于
Figure FDA00041295132500001115
Figure FDA00041295132500001116
是未知的,
Figure FDA00041295132500001117
无法实施;因此,使用一种辨识者-执行者-评论家强化学习算法以学习最优控制策略;
However, due to
Figure FDA00041295132500001115
and
Figure FDA00041295132500001116
is unknown,
Figure FDA00041295132500001117
It is not feasible to implement; therefore, an Identifier-Actor-Critic reinforcement learning algorithm is used to learn the optimal control policy;
具体而言,设计辨识者神经网络以估计未知非线性项如下:Specifically, the discriminator neural network is designed to estimate the unknown nonlinear terms as follows:
Figure FDA00041295132500001118
Figure FDA00041295132500001118
其中,
Figure FDA00041295132500001119
是fi(xi)的估计,
Figure FDA00041295132500001120
是辨识者神经网络的权重;辨识者神经网络的更新率可设计为:
in,
Figure FDA00041295132500001119
is an estimate of fi ( xi ),
Figure FDA00041295132500001120
is the weight of the identifier neural network; the update rate of the identifier neural network can be designed as:
Figure FDA0004129513250000121
Figure FDA0004129513250000121
其中,
Figure FDA0004129513250000122
是正定矩阵,
Figure FDA0004129513250000123
是设计的辨识者参数;
in,
Figure FDA0004129513250000122
is a positive definite matrix,
Figure FDA0004129513250000123
is the identifier parameter of the design;
然后,设计评论家神经网络以评估控制性能如下:Then, a critic neural network is designed to evaluate the control performance as follows:
Figure FDA0004129513250000124
Figure FDA0004129513250000124
其中,
Figure FDA0004129513250000125
Figure FDA0004129513250000126
的估计值,
Figure FDA0004129513250000127
是评论家神经网络的权重;评论家网络的更新率可设计为
in,
Figure FDA0004129513250000125
yes
Figure FDA0004129513250000126
The estimated value of
Figure FDA0004129513250000127
is the weight of the critic neural network; the update rate of the critic network can be designed as
Figure FDA0004129513250000128
Figure FDA0004129513250000128
其中,γc,i是评论家的学习率;where γ c,i is the critic’s learning rate; 最后,设计执行者神经网络设施控制输入如下:Finally, the design of the actuator neural network facility control input is as follows:
Figure FDA0004129513250000129
Figure FDA0004129513250000129
其中,
Figure FDA00041295132500001210
是执行者神经网络的权重;执行者网络的更新率可设计为
in,
Figure FDA00041295132500001210
is the weight of the actor neural network; the update rate of the actor network can be designed as
Figure FDA00041295132500001211
Figure FDA00041295132500001211
其中,γa,i是执行者的学习率。where γ a,i is the learning rate of the executor.
5.根据权利要求1所述的面向多非完整约束移动机器人的强化学习行为控制方法,其特征在于:所述步骤S4具体为:首先,考虑控制输入
Figure FDA00041295132500001212
受到对称执行机构饱和约束的限制如下:
5. The reinforcement learning behavior control method for multiple nonholonomic constrained mobile robots according to claim 1 is characterized in that: the step S4 specifically comprises: first, considering the control input
Figure FDA00041295132500001212
The restrictions subject to the saturation constraints of the symmetric actuators are as follows:
Figure FDA00041295132500001213
Figure FDA00041295132500001213
其中,τlim,i>0是已知的阈值;Among them, τ lim,i >0 is a known threshold; 其次,可将控制输入分为两项如下:Secondly, the control input can be divided into two items as follows: τi=τ0,iΔ,i, (54)τ i0,iΔ,i , (54) 其中,
Figure FDA0004129513250000131
是标称项,
Figure FDA0004129513250000132
是补偿项,且满足如下条件:
in,
Figure FDA0004129513250000131
is a nominal term,
Figure FDA0004129513250000132
is a compensation item and meets the following conditions:
Figure FDA0004129513250000133
Figure FDA0004129513250000133
最后,设计自适应补偿器为
Figure FDA0004129513250000134
且具有更新率如下
Finally, the adaptive compensator is designed as
Figure FDA0004129513250000134
And has an update rate of
Figure FDA0004129513250000135
Figure FDA0004129513250000135
其中,
Figure FDA0004129513250000136
是设计的自适应补偿器参数。
in,
Figure FDA0004129513250000136
are the designed adaptive compensator parameters.
CN202310255701.9A 2023-03-16 2023-03-16 Reinforcement learning behavior control method for mobile robots with multiple nonholonomic constraints Active CN116068900B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310255701.9A CN116068900B (en) 2023-03-16 2023-03-16 Reinforcement learning behavior control method for mobile robots with multiple nonholonomic constraints

Publications (2)

Publication Number Publication Date
CN116068900A true CN116068900A (en) 2023-05-05
CN116068900B CN116068900B (en) 2025-07-04

Family

ID=86175202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310255701.9A Active CN116068900B (en) 2023-03-16 2023-03-16 Reinforcement learning behavior control method for mobile robots with multiple nonholonomic constraints

Country Status (1)

Country Link
CN (1) CN116068900B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200218888A1 (en) * 2017-07-18 2020-07-09 Vision Semantics Limited Target Re-Identification
CN110989576A (en) * 2019-11-14 2020-04-10 北京理工大学 Target following and dynamic obstacle avoidance control method for differential slip steering vehicle
US20210171024A1 (en) * 2019-12-06 2021-06-10 Elektrobit Automotive Gmbh Deep learning based motion control of a group of autonomous vehicles

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Shen Yanjun; Wu Chaoyan: "Partial-variable asymptotically stable and finite-time stable observer design for a class of chained systems", Journal of Shandong University (Engineering Science), no. 06, 21 November 2013 (2013-11-21), pages 46-50 *
Wang Tao; Wang Liqiang; Li Yufei: "Research on an autonomous navigation control algorithm based on reinforcement learning", Computer Simulation, no. 11, 15 November 2018 (2018-11-15), pages 306-310 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117031937A (en) * 2023-07-11 2023-11-10 淮阴工学院 Reinforced learning control method of self-balancing robot based on preset performance error
CN116931581A (en) * 2023-08-29 2023-10-24 逻腾(杭州)科技有限公司 Spherical robot track planning and tracking control method suitable for all terrain
CN119088044A (en) * 2024-11-07 2024-12-06 北京星网船电科技有限公司 Autonomous navigation control method and system for unmanned ships based on artificial intelligence
CN119088044B (en) * 2024-11-07 2025-01-21 北京星网船电科技有限公司 Unmanned ship autonomous navigation control method and system based on artificial intelligence
CN119847210A (en) * 2025-01-07 2025-04-18 电子科技大学长三角研究院(衢州) Multi-agent formation cooperative control method based on task decomposition reinforcement learning
CN119458386A (en) * 2025-01-15 2025-02-18 福州大学 Dynamic obstacle avoidance method for robotic arms based on intelligent task supervision

Also Published As

Publication number Publication date
CN116068900B (en) 2025-07-04

Similar Documents

Publication Publication Date Title
CN116068900A (en) Reinforced learning behavior control method for multiple incomplete constraint mobile robots
Guo et al. Command-filter-based fixed-time bipartite containment control for a class of stochastic multiagent systems
CN110597061B (en) A Multi-Agent Fully Distributed Active Disturbance Rejection Time-Varying Formation Control Method
Shou et al. Finite‐time formation control and obstacle avoidance of multi‐agent system with application
CN111522341A (en) Multi-time-varying formation tracking control method and system for network heterogeneous robot system
Li et al. Active disturbance rejection formation tracking control for uncertain nonlinear multi-agent systems with switching topology via dynamic event-triggered extended state observer
Zhang et al. Reinforcement learning behavioral control for nonlinear autonomous system
Wang et al. Event-triggered integral formation controller for networked nonholonomic mobile robots: Theory and experiment
Sun et al. Iterative learning control based robust distributed algorithm for non-holonomic mobile robots formation
CN112947086A (en) Self-adaptive compensation method for actuator faults in formation control of heterogeneous multi-agent system consisting of unmanned aerial vehicle and unmanned vehicle
CN118131621A (en) A distributed fixed-time optimization method based on multi-agent system
CN115993780A (en) Time-varying formation optimization tracking control method and system for EL type nonlinear cluster system
Ma et al. Adaptive neural cooperative control of multirobot systems with input quantization
Wang et al. Dynamic event-driven finite-horizon optimal consensus control for constrained multiagent systems
Ge et al. State-constrained bipartite tracking of interconnected robotic systems via hierarchical prescribed-performance control
Huang et al. Distributed nonlinear placement for a class of multicluster Euler–Lagrange systems
Tsai et al. Adaptive reinforcement learning formation control using ORFBLS for omnidirectional mobile multi-robots
Liu et al. Who to blame? learning and control strategies with information asymmetry
Yang et al. Adaptive asymptotic tracking control for underactuated autonomous underwater vehicles with state constraints
Guan et al. Adaptive output feedback control for uncertain nonlinear systems subject to deferred state constraints
CN118259588A (en) Hybrid fault-tolerant coordinated tracking control method for discrete nonlinear systems based on reinforcement learning and event triggering
Zhao et al. Event-triggered cooperative adaptive optimal output regulation for multiagent systems under switching network: an adaptive dynamic programming approach
Li et al. Practical Prescribed-Time Consensus Tracking Control for Nonlinear Heterogeneous MASs With Bounded Time-Varying Gain Under Mismatching and Non-Vanishing Uncertainties
Li et al. Distributed optimal formation control for constrained quadrotor unmanned aerial vehicles via Stackelberg-Game
CN120010274B (en) Reinforcement learning tracking control method for mobile robots based on event triggering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant