CN117601120A - Adaptive variable impedance control method and device, electronic equipment and storage medium - Google Patents
Adaptive variable impedance control method and device, electronic equipment and storage medium
- Publication number
- CN117601120A (application CN202311543853.5A)
- Authority
- CN
- China
- Prior art keywords
- network
- force
- target object
- value
- actor
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1602—Programme controls characterised by the control system, structure, architecture
- B25J9/1605—Simulation of manipulator lay-out, design, modelling of manipulator
- B25J9/1656—Programme controls characterised by programming, planning systems for manipulators
- B25J9/1664—Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning
- B25J9/1679—Programme controls characterised by the tasks executed
Landscapes
- Engineering & Computer Science (AREA)
- Robotics (AREA)
- Mechanical Engineering (AREA)
- Automation & Control Theory (AREA)
- Feedback Control In General (AREA)
Abstract
Description
Technical field
Embodiments of the present invention relate to the field of robot control technology, and in particular to an adaptive variable impedance control method and device, an electronic device, and a storage medium.
Background art
In industrial production, traditional industrial robots accomplish specific tasks through position control. In applications that require interaction with the environment, however, purely position-based methods are no longer adequate. In fields such as welding, polishing, and peg-in-hole assembly there is a large amount of complex contact with the environment; if an industrial robot merely moves along a specified path, any positional deviation between the robot and that path produces very large environmental contact forces, which may damage the workpiece or even the industrial robot. As tasks become more complex and production processes more flexible, existing robots working at independent stations can no longer meet constantly changing manufacturing needs. To meet the requirements of complex tasks, intelligent operation, and system compliance in unstructured environments, two robots cooperating with each other show clear advantages in performing such tasks. In a dual-arm robot, the two arms maintain a constraint relationship during coordination in order to complete the coordinated task. The basic idea of pure position control is to first plan the trajectory of the manipulated target object and then obtain the trajectories of the two arm ends from the constraint relationship between the object and the arms. This control method, however, considers neither the forces exerted by the two arms on the target object nor external disturbances acting on the object. How to achieve efficient grasping of a target object by dual manipulators in uncertain, complex scenarios has therefore become an urgent technical problem.
Summary of the invention
Embodiments of the present invention provide an adaptive variable impedance control method and device, an electronic device, and a storage medium, which enable the robot's dual manipulators to make optimal action selections based on the current state data set so as to adapt to an uncertain environment and achieve efficient grasping of a target object, and which allow the dual manipulators to continuously optimize their behavior through reinforcement learning and thus better adapt to complex work scenarios.
In a first aspect, an embodiment of the present invention provides an adaptive variable impedance control method, including:
constructing an impedance model of the robot's dual manipulators when grasping a target object, decoupling the internal force and the external force acting on the target object, and performing adaptive impedance control on the internal force and the external force separately;
initializing the network parameters and the experience pool of the impedance model, the experience pool being used to store experience tuples of the robot in the environment, wherein the impedance model includes an Actor network and a Critic network, the Actor network is used to generate continuous actions, and the Critic network is used to evaluate the quality of an action and output the corresponding action-value function;
selecting an action from the state space of the robot; after the selected action is executed, storing the experience tuple fed back by the environment into the experience pool, randomly sampling a batch of data from the experience pool, computing the loss of the Critic network and back-propagating it, computing the target Q value through the Critic network, updating the parameters of the Actor network to maximize the Q value, and repeating the training loop until a preset number of iterations is reached to obtain a trained Actor-Critic network;
using the trained Actor-Critic network to execute the actions of the dual manipulators in the actual environment to grasp the target object.
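As an illustration of the experience pool referred to above, the following is a minimal Python sketch of an experience replay buffer; the class name ReplayBuffer and the tuple fields (state, action, reward, next_state, done) are assumed names chosen for readability rather than terms from the filing.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience pool: stores (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest tuples are discarded once the pool is full

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling of a training batch, as described in the method
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```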
In some embodiments, the method further includes:
establishing the coordinate systems of the dual-manipulator cooperative system, where the position and attitude of the target object with respect to the reference coordinate system are solved with the following formula:

$${}^{O}T = \begin{bmatrix} {}^{O}R & {}^{O}P \\ 0_{1\times 3} & 1 \end{bmatrix}$$

where ${}^{O}T$ is the transformation matrix of the target object relative to the center-of-mass coordinate system, ${}^{O}R$ is the 3x3 rotation matrix of the object relative to the coordinate system at the center of mass, and ${}^{O}P$ is the 3x1 position vector of the target object relative to the coordinate system at the center of mass;
the transformation between the coordinate system at the center of mass and the world coordinate system turns the target object's pose into a constraint between the target object and the manipulators, expressed by the following formula:

$${}^{W}T_{O} = {}^{W}T_{B_i}\,{}^{B_i}T_{E_i}\,{}^{E_i}T_{O}, \qquad i = l, r$$

where ${}^{W}T_{O}$ is the homogeneous coordinate transformation of the center-of-mass coordinate system O relative to the world coordinate system W; ${}^{W}T_{B_i}$ is the homogeneous coordinate transformation of the base coordinate system of each manipulator relative to the world coordinate system; ${}^{B_i}T_{E_i}$ is the homogeneous transformation of the end-effector coordinate system of each manipulator relative to the base coordinate system of that manipulator; and ${}^{E_i}T_{O}$ is the homogeneous transformation of the center-of-mass coordinate system of the target object relative to the manipulator end-effector;
the velocity constraint relationship is analyzed with the following formula, so that the two arms maintain consistency of position and velocity during motion:
$$\begin{bmatrix} v_i^{W} \\ \omega_i^{W} \end{bmatrix} = \begin{bmatrix} I_3 & -\big[{}^{W}R_{O}\,P_i^{O}\big]_{\times} \\ 0_{3\times 3} & I_3 \end{bmatrix} \begin{bmatrix} v_O^{W} \\ \omega_O^{W} \end{bmatrix}$$

where $v_i^{W}$, $\omega_i^{W}$ denote the velocity of the manipulator end-effector relative to the world coordinate system; $v_O^{W}$, $\omega_O^{W}$ denote the velocity and angular velocity of the object relative to the world coordinate system; $[\,\cdot\,]_{\times}$ denotes the skew-symmetric (cross-product) matrix; $P_i^{O}$ is the position of the manipulator end-effector relative to the center of mass of the target object; and ${}^{W}R_{O}$ is the rotation matrix of the target object's center of mass relative to the world frame.
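As a small numerical illustration of the pose constraint, the sketch below composes homogeneous transforms to obtain the end-effector pose of each arm in its own base frame from a planned object pose; the function and variable names (homogeneous, end_effector_target, T_world_object, and so on) are assumptions introduced for this sketch.

```python
import numpy as np

def homogeneous(R, p):
    """Build a 4x4 homogeneous transform from a 3x3 rotation R and a 3-vector position p."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = p
    return T

def end_effector_target(T_world_object, T_end_object, T_world_base):
    """Pose of the arm end-effector in its base frame that satisfies the grasp constraint.

    T_world_object : object centre-of-mass pose in the world frame (W_T_O)
    T_end_object   : object pose relative to the end-effector (E_T_O), fixed by the grasp
    T_world_base   : arm base pose in the world frame (W_T_B)
    """
    T_world_end = T_world_object @ np.linalg.inv(T_end_object)   # W_T_E = W_T_O * (E_T_O)^-1
    return np.linalg.inv(T_world_base) @ T_world_end             # B_T_E = (W_T_B)^-1 * W_T_E
```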
In some embodiments, decoupling the internal force and the external force acting on the target object includes:
establishing the state of the dual manipulators grasping the target object according to Newton's second law and the Euler equation, and establishing the following dynamic equation of the target object:

$$M_O\,\ddot{x}_O + C_O = F_O + F_{ext}$$

where $I_O$ denotes the inertia matrix at the center of mass of the target object, contained in $M_O$, the mass-inertia matrix of the target object; $F_O \in R^6$ denotes the resultant wrench exerted by the dual manipulators on the target object; $\ddot{x}_O$ denotes the linear and angular acceleration of the target object during motion; $C_O \in R^6$ denotes the resultant vector of the Coriolis, gravity, and centrifugal terms of the target object; and $F_{ext} \in R^6$ denotes the external disturbance wrench acting on the target object. The above equation is converted into the following form:

$$F_O = \sum_{k=l,r} S_k^{T} F_k$$

where $k = l, r$ denotes the left arm and the right arm of the dual manipulators, $S_k^{T}$ denotes the grasp matrix, and $F_k$ denotes the wrench applied by manipulator k on the target object; decomposing with the grasp matrix yields the external force $F_E$ and the internal force $F_I$:
$$F_E = S^{T+} F_O, \qquad F_I = \big(E - S^{T+} S^{T}\big) F$$

where $S^{T} = \big[\,S_l^{T}\ \ S_r^{T}\,\big]$, $F = \big[\,F_l^{T}\ \ F_r^{T}\,\big]^{T}$, $E$ is the identity matrix, and $S^{T+}$ is the generalized inverse of the matrix $S^{T}$.
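For illustration, the internal/external decomposition can be computed with the Moore-Penrose pseudo-inverse as the generalized inverse, as in the following numpy sketch; the stacking convention assumed here (a 6x12 grasp matrix acting on the stacked left/right wrenches) is an assumption for the sketch only.

```python
import numpy as np

def decompose_wrenches(S_T, F):
    """Split the stacked arm wrenches F (shape (12,)) into external and internal parts.

    S_T : (6, 12) stacked grasp matrix [S_l^T  S_r^T] mapping arm wrenches to the
          resultant object wrench F_O = S_T @ F.
    Returns (F_E, F_I): the motion-inducing (external) component and the squeezing
    (internal, null-space) component of F.
    """
    F_O = S_T @ F                       # resultant wrench on the object
    S_T_pinv = np.linalg.pinv(S_T)      # generalized inverse S^{T+}
    F_E = S_T_pinv @ F_O                # external component
    F_I = (np.eye(S_T.shape[1]) - S_T_pinv @ S_T) @ F   # internal component
    return F_E, F_I

# Hypothetical usage with per-arm (6, 6) grasp maps:
# S_T = np.hstack([grasp_map_left, grasp_map_right])
# F_E, F_I = decompose_wrenches(S_T, np.concatenate([F_left, F_right]))
```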
In some embodiments, the equation of the impedance model is as follows:
$$m\,\ddot{x}(t) + b\,\dot{x}(t) = \Delta f(t) + \varepsilon(t)$$

where m is the inertia coefficient, b is the damping coefficient, ε is the adaptive parameter, Δf denotes the force error, and $\dot{x}$ and $\ddot{x}$ are the motion velocity and motion acceleration of the manipulator, respectively.
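Such an adaptive impedance (admittance) relation can be integrated in discrete time to turn a force error into a motion correction, as in the minimal sketch below; the explicit Euler scheme, the default coefficients, and the variable names are assumptions rather than the implementation claimed here.

```python
def admittance_step(x, x_dot, delta_f, eps, m=1.0, b=50.0, dt=0.001):
    """One Euler step of the adaptive impedance relation m*x_ddot + b*x_dot = delta_f + eps.

    x, x_dot : current position correction and its rate (single axis)
    delta_f  : force tracking error F_e - F_a
    eps      : adaptive compensation term (in the described method it comes from the DDPG actor)
    Returns the updated (x, x_dot).
    """
    x_ddot = (delta_f + eps - b * x_dot) / m   # solve the impedance relation for acceleration
    x_dot = x_dot + x_ddot * dt                # integrate acceleration
    x = x + x_dot * dt                         # integrate velocity
    return x, x_dot
```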
In some embodiments, randomly sampling a batch of data from the experience pool, computing the loss of the Critic network and back-propagating it, computing the target Q value through the Critic network, and updating the parameters of the Actor network to maximize the Q value includes:
updating the Actor network parameters according to the action-value function based on the deterministic policy gradient, where the deterministic behavior policy is as follows:

$$\mu \rightarrow a_t = \mu(s_t)$$

where μ is the policy function and s is the current state; under a deterministic policy the action in state s is uniquely determined, and its formula is as follows:

$$a_t = \mu\big(s_t \mid \theta^{\mu}\big) + N_t$$

where θ is the policy parameter and $N_t$ is the exploration noise added to the deterministic action; during network training, N samples are randomly drawn as training data for the deterministic policy μ, and the performance of the deterministic policy μ is measured by the following formula:

$$J_{\beta}(\mu) = E_{s_t \sim \rho^{\beta}}\Big[Q^{\mu}\big(s, \mu(s)\big)\Big]$$

where $s_t \sim \rho^{\beta}$ denotes sampling a state $s_t$ from the experience distribution $\rho^{\beta}$, $Q^{\mu}(s, \mu(s))$ denotes the value assigned by the Critic network to the given state s and action μ(s), and the overall expectation is taken over the sampled states and actions; during training, the expectation is replaced by the sample mean;
the Actor network learns the optimal policy by maximizing the output of the Critic network, and the parameters of the Actor network are updated by gradient ascent according to the following formula:

$$\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_{i}\nabla_{a} Q\big(s, a \mid \theta^{Q}\big)\Big|_{s=s_i,\,a=\mu(s_i)}\;\nabla_{\theta^{\mu}} \mu\big(s \mid \theta^{\mu}\big)\Big|_{s=s_i}$$

where $a_t \sim \pi$ denotes sampling an action $a_t$ from the Actor policy π, $Q(s, a \mid \theta^{Q})$ denotes the value assigned by the Critic network to the given state s and action a, and the gradient term $\nabla_{a} Q(s, a \mid \theta^{Q})$ denotes the gradient of the Critic network with respect to the action output by the Actor;
the parameters $\theta^{Q}$ of the current value network Q are updated by minimizing the loss function $L(\theta^{Q})$ given below:

$$L\big(\theta^{Q}\big) = \frac{1}{N}\sum_{i}\Big(y_i - Q\big(s_i, a_i \mid \theta^{Q}\big)\Big)^{2}$$

where N is the number of randomly sampled transitions and γ is the discount (decay) factor used in the target value $y_i$.
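The Critic loss and the Actor gradient-ascent update above can be written in a PyTorch-style sketch as follows; the network interfaces (a critic taking a state-action pair, an actor mapping states to actions) and the optimizer setup are assumptions made for the illustration.

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99):
    """One DDPG update on a sampled batch of tensors (s, a, r, s_next, done)."""
    s, a, r, s_next, done = batch

    # Critic: minimize the squared error between Q(s,a) and y = r + gamma * Q'(s', mu'(s'))
    with torch.no_grad():
        a_next = target_actor(s_next)
        y = r + gamma * (1.0 - done) * target_critic(s_next, a_next)
    q = critic(s, a)
    critic_loss = F.mse_loss(q, y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: gradient ascent on Q(s, mu(s)), implemented as descent on its negative
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return critic_loss.item(), actor_loss.item()
```

Minimizing the negative of the Critic's estimate with a standard optimizer is equivalent to the gradient ascent on the Q value described above.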
In some embodiments, randomly sampling a batch of data from the experience pool, computing the loss of the Critic network and back-propagating it, computing the target Q value through the Critic network, and updating the parameters of the Actor network to maximize the Q value further includes:
computing the gradient of the behavior value network Q according to the following formula based on the deterministic gradient policy, and updating the Critic network parameters:

$$\nabla_{\theta^{Q}} L = E_{s_t \sim \rho^{\beta},\, s_{t+1} \sim \varepsilon}\Big[\big(Q(s_t, a_t \mid \theta^{Q}) - y_t\big)\,\nabla_{\theta^{Q}} Q\big(s_t, a_t \mid \theta^{Q}\big)\Big]$$

where $s_t$ and $a_t$ are the state and action sampled from the experience distribution $\rho^{\beta}$, $s_{t+1}$ is the next state sampled from the experience distribution ε, and $Q(s_i, a_i \mid \theta^{Q})$ is the output of the Critic network, representing the estimated value of taking action a in state s; y is the target Q value, and the goal of the gradient estimate is to minimize the mean squared error between the output of the Critic network and the target Q value; this mean squared error is estimated in expectation using the sampled state, action, and next state, and the gradient with respect to the Critic network parameters $\theta^{Q}$ is then computed to update the parameters;

$$y_i = r_i + \gamma\, Q'\big(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big)$$

where $r_i$ is the current reward, γ is the discount factor, and Q′ and μ′ are the target networks;
μ′ and Q′ are updated with a moving-average (soft update) method, as shown below:

$$\theta^{\mu'} \leftarrow \tau\,\theta^{\mu} + (1 - \tau)\,\theta^{\mu'}$$

$$\theta^{Q'} \leftarrow \tau\,\theta^{Q} + (1 - \tau)\,\theta^{Q'}$$

where $\theta^{\mu'}$, $\theta^{Q'}$, and τ are parameters, τ being the soft-update rate with a value of 0.001; in the network architecture, the policy network (Actor) is used to update $\theta^{\mu}$ so as to output the action a, and the value network (Critic) updates the parameter $\theta^{Q}$ to approximate the state-action value function $Q^{\pi}(s, a)$.
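The moving-average (soft) update of the target networks can be written, for example, as the following short PyTorch-style helper; the parameter containers are assumptions for the illustration.

```python
import torch

@torch.no_grad()
def soft_update(target_net, online_net, tau=0.001):
    """theta' <- tau * theta + (1 - tau) * theta' for every parameter pair."""
    for p_target, p_online in zip(target_net.parameters(), online_net.parameters()):
        p_target.mul_(1.0 - tau).add_(tau * p_online)
```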
In some embodiments, the state space s of the cooperative motion process of the dual manipulators is defined as follows:

$$s = \{e_f,\ e_x,\ F_a,\ x_a\}$$

where $e_f$ denotes the force tracking error, $e_x$ denotes the trajectory tracking error, $F_a$ denotes the actual force during control of the dual manipulators, and $x_a$ denotes the actual trajectory during control of the dual manipulators;
the goal of the DDPG algorithm is to output a suitable adaptive parameter ε according to the force tracking error $e_f$ and the trajectory tracking error $e_x$; as the output of the DRL algorithm, the designed action space contains only one element which, at time t, represents the action at that moment, namely the adaptive parameter ε, and the action function is as follows:

$$a_t = \{\varepsilon\}$$

where ε is the adaptive parameter. During position and force tracking of the manipulators, each time step must be evaluated immediately: the sum, at different proportions, of the difference between the actual force $F_a$ and the mean force error and the difference between the actual trajectory $x_a$ and the mean trajectory error is taken as one part of the reward function, called the basic reward part, whose aim is to keep the force tracking error $e_f$ and the trajectory tracking error $e_x$ at 0 at all times. The force tracking error $e_f$ and the trajectory tracking error $e_x$ are also taken as another part of the reward function, called the extra incentive part $r_{extre}$, in which different extra rewards or penalties are given according to the interval in which the force tracking error lies. In the reward function formed by combining the basic reward part and the extra incentive part, $k_a$ and $k_b$ are scale factors, H is the duration of the training process, $\bar{e}_x^{j}$ is the mean trajectory error at time j, $\bar{e}_f^{j}$ is the mean force error at time j, $F_a^{j}$ is the actual force of the dual manipulators at time j, and $x_a^{j}$ is the actual position of the dual manipulators at time j; in the extra incentive part, different extra rewards or penalties are given according to the intervals in which $e_f$ and $e_x$ lie.
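The precise reward expression is given as a formula in the original filing; the Python sketch below only illustrates the structure described above, namely a basic tracking term plus interval-based extra incentives, and the weights, thresholds, and interval boundaries used here are assumptions rather than values from the filing.

```python
def reward(F_a, x_a, e_f_mean, e_x_mean, e_f, e_x,
           k_a=1.0, k_b=1.0, H=1000, f_tol=0.5, x_tol=0.005):
    """Illustrative reward: basic tracking term plus interval-based extra incentive."""
    # Basic part: penalize deviation of actual force/trajectory from the running error means
    r_basic = -(k_a * abs(F_a - e_f_mean) + k_b * abs(x_a - e_x_mean)) / H

    # Extra incentive part: reward small tracking errors, penalize large ones
    if abs(e_f) < f_tol and abs(e_x) < x_tol:
        r_extra = 1.0          # both errors inside the tight band
    elif abs(e_f) < 2 * f_tol and abs(e_x) < 2 * x_tol:
        r_extra = 0.1          # acceptable band
    else:
        r_extra = -1.0         # large error: penalize
    return r_basic + r_extra
```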
In a second aspect, an embodiment of the present invention further provides an adaptive variable impedance control device, the device including:
a construction module, configured to construct an impedance model of the robot's dual manipulators when grasping a target object, decouple the internal force and the external force acting on the target object, and perform adaptive impedance control on the internal force and the external force separately;
an initialization module, configured to initialize the network parameters and the experience pool of the impedance model, the experience pool being used to store experience tuples of the robot in the environment, wherein the impedance model includes an Actor network and a Critic network, the Actor network is used to generate continuous actions, and the Critic network is used to evaluate the quality of an action and output the corresponding action-value function;
a training module, configured to select an action from the state space of the robot, store the experience tuple fed back by the environment into the experience pool after the selected action is executed, randomly sample a batch of data from the experience pool, compute the loss of the Critic network and back-propagate it, compute the target Q value through the Critic network, update the parameters of the Actor network to maximize the Q value, and repeat the training loop until a preset number of iterations is reached, obtaining a trained Actor-Critic network;
an execution module, configured to use the trained Actor-Critic network to execute the actions of the dual manipulators in the actual environment to grasp the target object.
In a third aspect, an embodiment of the present invention further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the adaptive variable impedance control method according to the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to execute the adaptive variable impedance control method according to the first aspect.
According to the adaptive variable impedance control method and device, electronic device, and storage medium provided by the embodiments of the present invention, the adaptive variable impedance control method includes: constructing an impedance model of the robot's dual manipulators when grasping a target object, decoupling the internal force and the external force acting on the target object, and performing adaptive impedance control on the internal force and the external force separately; initializing the network parameters and the experience pool of the impedance model, the experience pool being used to store experience tuples of the robot in the environment, wherein the impedance model includes an Actor network and a Critic network, the Actor network is used to generate continuous actions, and the Critic network is used to evaluate the quality of an action and output the corresponding action-value function; selecting an action from the state space of the robot, storing the experience tuple fed back by the environment into the experience pool after the selected action is executed, randomly sampling a batch of data from the experience pool, computing the loss of the Critic network and back-propagating it, computing the target Q value through the Critic network, updating the parameters of the Actor network to maximize the Q value, and repeating the training loop until a preset number of iterations is reached to obtain a trained Actor-Critic network; and using the trained Actor-Critic network to execute the robot's actions in the actual environment to grasp the target object. On this basis, when the impedance model of the dual manipulators grasping the target object is constructed, the response of the manipulators in contact with the environment is taken into account; the forces acting on the target object are therefore decomposed and decoupled, and adaptive impedance control is applied to the internal force and the external force separately, which optimizes the overall control strategy and improves control accuracy. Within the reinforcement learning framework, the embodiments of the present invention adopt the deep deterministic policy gradient (DDPG) algorithm. The DDPG algorithm includes an Actor network and a Critic network: the Actor neural network outputs continuous actions, and the Critic neural network evaluates the quality of those actions. By learning continuously during actual operation, the DDPG algorithm enables the dual manipulators to adjust their action strategy so as to maximize the cumulative reward. In the training phase, samples are randomly selected from the experience pool to train the Actor-Critic network and obtain the optimal network structure. Using the trained Actor-Critic network, the dual manipulators can make optimal action selections based on the current state data set, adapt to an uncertain environment, and achieve efficient grasping of the target object. The embodiments of the present invention allow the dual manipulators to continuously optimize their behavior through reinforcement learning and thus better adapt to complex work scenarios.
Brief description of the drawings
Figure 1A is a flow chart of an adaptive variable impedance control method provided by an embodiment of the present invention;
Figure 1B is a schematic diagram of the coordinate systems of the dual-arm cooperative system provided by an embodiment of the present invention;
Figure 2 is a force analysis diagram of the target object provided by an embodiment of the present invention;
Figure 3 is a structural diagram of the DDPG network model provided by an embodiment of the present invention;
Figure 4 is a diagram of the change of the cumulative reward R during model training provided by an embodiment of the present invention;
Figure 5 is a structural diagram of the deep reinforcement learning adaptive variable impedance control strategy provided by an embodiment of the present invention;
Figure 6 is a control block diagram of the dual manipulators provided by an embodiment of the present invention;
Figure 7 is a flow chart of dual-arm cooperative adaptive impedance control based on deep reinforcement learning provided by an embodiment of the present invention;
Figure 8 is a hardware connection diagram of the experimental platform provided by an embodiment of the present invention;
Figure 9A is a constant force tracking diagram under deep reinforcement learning adaptive impedance control provided by an embodiment of the present invention;
Figure 9B is a constant trajectory tracking diagram under deep reinforcement learning adaptive impedance control provided by an embodiment of the present invention;
Figure 10A is a constant force tracking diagram under deep reinforcement learning adaptive impedance control provided by an embodiment of the present invention;
Figure 10B is a variable trajectory tracking diagram under deep reinforcement learning adaptive impedance control provided by an embodiment of the present invention;
Figure 11A is a variable force tracking diagram under deep reinforcement learning impedance control provided by an embodiment of the present invention;
Figure 11B is a variable trajectory tracking diagram under deep reinforcement learning impedance control provided by an embodiment of the present invention;
Figure 12 is a schematic diagram of an adaptive variable impedance control device provided by an embodiment of the present invention;
Figure 13 is a schematic diagram of an electronic device provided by an embodiment of the present invention.
Detailed description of the embodiments
In order to make the purpose, technical solutions, and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention and are not intended to limit it.
It should be noted that although functional modules are divided in the device schematic diagrams and a logical order is shown in the flow charts, in some cases the steps shown or described may be executed with a module division different from that in the device or in an order different from that in the flow charts. The terms "first", "second", and the like in the description, the claims, and the following drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
In the embodiments of the present invention, words such as "further", "exemplarily", or "optionally" are used to indicate examples, illustrations, or explanations and should not be construed as more preferable or more advantageous than other embodiments or designs; the use of these words is intended to present the relevant concepts in a concrete manner.
First, several terms involved in the present invention are explained:
Dual-manipulator cooperative control: two manipulators or robots, usually working cooperatively in one system environment, to accomplish a common task. This involves synchronized motion, force and torque control, collision detection and avoidance, trajectory planning, and so on. The goal of dual-manipulator cooperative control is to achieve efficient task execution, improve production efficiency, reduce human intervention, and improve task accuracy and safety.
Adaptive variable impedance control based on deep reinforcement learning: a machine learning approach intended to train an agent (usually a robot or a control system) to learn an optimal behavior policy while interacting with the environment. This learning is guided by reward and penalty signals, and the agent's goal is to maximize the long-term reward. The parameters of the control system can be adjusted automatically to accommodate changes or uncertainties in the dynamic properties of the system; in adaptive control, the control system can adjust its control strategy in real time to improve performance.
To make it easier to describe the working principle of the embodiments of the present invention, an introduction to the relevant technical scenario is given first.
In industrial production, traditional industrial robots accomplish specific tasks through position control. In applications that require interaction with the environment, however, purely position-based methods are no longer adequate. In fields such as welding, polishing, and peg-in-hole assembly there is a large amount of complex contact with the environment; if an industrial robot merely moves along a specified path, any positional deviation between the robot and that path produces very large environmental contact forces, which may damage the workpiece or even the industrial robot. As tasks become more complex and production processes more flexible, existing robots working at independent stations can no longer meet constantly changing manufacturing needs. To meet the requirements of complex tasks, intelligent operation, and system compliance in unstructured environments, two robots cooperating with each other show clear advantages in performing such tasks. In a dual-arm robot, the two arms maintain a constraint relationship during coordination in order to complete the coordinated task. The basic idea of pure position control is to first plan the trajectory of the manipulated target object and then obtain the trajectories of the two arm ends from the constraint relationship between the object and the arms. This control method, however, considers neither the forces exerted by the two arms on the target object nor external disturbances acting on the object. How to achieve efficient grasping of a target object by dual manipulators in uncertain, complex scenarios has therefore become an urgent technical problem.
Based on this, the present invention provides an adaptive variable impedance control method and device, an electronic device, and a storage medium. The adaptive variable impedance control method includes: constructing an impedance model of the robot's dual manipulators when grasping a target object, decoupling the internal force and the external force acting on the target object, and performing adaptive impedance control on the internal force and the external force separately; initializing the network parameters and the experience pool of the impedance model, the experience pool being used to store experience tuples of the robot in the environment, wherein the impedance model includes an Actor network and a Critic network, the Actor network is used to generate continuous actions, and the Critic network is used to evaluate the quality of an action and output the corresponding action-value function; selecting an action from the state space of the robot, storing the experience tuple fed back by the environment into the experience pool after the selected action is executed, randomly sampling a batch of data from the experience pool, computing the loss of the Critic network and back-propagating it, computing the target Q value through the Critic network, updating the parameters of the Actor network to maximize the Q value, and repeating the training loop until a preset number of iterations is reached to obtain a trained Actor-Critic network; and using the trained Actor-Critic network to execute the robot's actions in the actual environment to grasp the target object. On this basis, when the impedance model of the dual manipulators grasping the target object is constructed, the response of the manipulators in contact with the environment is taken into account; the forces acting on the target object are therefore decomposed and decoupled, and adaptive impedance control is applied to the internal force and the external force separately, which optimizes the overall control strategy and improves control accuracy. Within the reinforcement learning framework, the embodiments of the present invention adopt the deep deterministic policy gradient (DDPG) algorithm. The DDPG algorithm includes an Actor network and a Critic network: the Actor neural network outputs continuous actions, and the Critic neural network evaluates the quality of those actions. By learning continuously during actual operation, the DDPG algorithm enables the dual manipulators to adjust their action strategy so as to maximize the cumulative reward. In the training phase, samples are randomly selected from the experience pool to train the Actor-Critic network and obtain the optimal network structure. Using the trained Actor-Critic network, the dual manipulators can make optimal action selections based on the current state data set, adapt to an uncertain environment, and achieve efficient grasping of the target object. The embodiments of the present invention allow the dual manipulators to continuously optimize their behavior through reinforcement learning and thus better adapt to complex work scenarios.
The embodiments of the present invention are further described below with reference to the accompanying drawings.
As shown in Figure 1A, Figure 1A is a flow chart of an adaptive variable impedance control method provided by an embodiment of the present invention; the adaptive variable impedance control method may include, but is not limited to, steps S101 to S104.
Step S101: construct an impedance model of the robot's dual manipulators when grasping a target object, decouple the internal force and the external force acting on the target object, and perform adaptive impedance control on the internal force and the external force separately.
Step S102: initialize the network parameters and the experience pool of the impedance model, the experience pool being used to store experience tuples of the robot in the environment, wherein the impedance model includes an Actor network and a Critic network, the Actor network is used to generate continuous actions, and the Critic network is used to evaluate the quality of an action and output the corresponding action-value function.
Step S103: select an action from the state space of the robot; after the selected action is executed, store the experience tuple fed back by the environment into the experience pool, randomly sample a batch of data from the experience pool, compute the loss of the Critic network and back-propagate it, compute the target Q value through the Critic network, update the parameters of the Actor network to maximize the Q value, and repeat the training loop until a preset number of iterations is reached to obtain a trained Actor-Critic network.
Step S104: use the trained Actor-Critic network to execute the actions of the dual manipulators in the actual environment to grasp the target object.
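Steps S101 to S104 can be pictured as the following training-loop skeleton. This is an illustrative sketch only: the environment interface (env.reset, env.step), the exploration-noise model, and the helpers buffer, ddpg_update, and soft_update (see the earlier sketches) are assumptions rather than components named in the filing.

```python
import numpy as np

def train(env, actor, critic, target_actor, target_critic,
          actor_opt, critic_opt, buffer, episodes=500, steps=200,
          batch_size=64, noise_std=0.1):
    """Skeleton of the DDPG training described in steps S102-S103."""
    for ep in range(episodes):
        state = env.reset()                       # state = {e_f, e_x, F_a, x_a}
        for t in range(steps):
            # Actor proposes the adaptive parameter epsilon, plus exploration noise
            action = actor.act(state) + np.random.normal(0.0, noise_std)
            next_state, reward, done = env.step(action)
            buffer.push(state, action, reward, next_state, done)
            state = next_state

            if len(buffer) >= batch_size:
                batch = buffer.sample(batch_size)
                # (conversion of the sampled batch to torch tensors is omitted here)
                ddpg_update(actor, critic, target_actor, target_critic,
                            actor_opt, critic_opt, batch)
                # Soft update of the target networks
                soft_update(target_actor, actor)
                soft_update(target_critic, critic)
            if done:
                break
```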
In one embodiment, the motion trajectory planning of a single manipulator is obtained from an estimate of the manipulated target through kinematic transformations between coordinate systems. The coordinate systems of the dual-arm cooperative system are therefore established as shown in Figure 1B. In Figure 1B, {W_X, W_Y, W_Z} and {O_X, O_Y, O_Z} denote the world coordinate system and the target object coordinate system, respectively; {OR_X, OR_Y, OR_Z} and {OL_X, OL_Y, OL_Z} denote the base coordinate systems of the right and left manipulators, respectively; and {R_X, R_Y, R_Z} and {L_X, L_Y, L_Z} denote the end-effector coordinate systems of the right and left manipulators, respectively. Based on these coordinate systems, the object-oriented transformation formula (1) for dual-manipulator motion planning is proposed.
For the manipulated target object, the position and attitude of the target object with respect to the reference coordinate system are solved using Equation (1):

$${}^{O}T = \begin{bmatrix} {}^{O}R & {}^{O}P \\ 0_{1\times 3} & 1 \end{bmatrix} \quad (1)$$

where ${}^{O}T$ is the transformation matrix of the target object relative to the center-of-mass coordinate system, ${}^{O}R$ is the 3x3 rotation matrix of the target object relative to the coordinate system at the center of mass, and ${}^{O}P$ is the 3x1 position vector of the target object relative to the coordinate system at the center of mass.
The transformation between the coordinate system at the center of mass and the world coordinate system turns the target object's pose into a constraint between the target object and the manipulators, expressed by Equation (2):

$${}^{W}T_{O} = {}^{W}T_{B_i}\,{}^{B_i}T_{E_i}\,{}^{E_i}T_{O}, \qquad i = l, r \quad (2)$$

where ${}^{W}T_{O}$ is the homogeneous coordinate transformation of the coordinate system O at the center of mass relative to the world coordinate system W; ${}^{W}T_{B_i}$ is the homogeneous coordinate transformation of the base coordinate system of each manipulator relative to the world coordinate system; ${}^{B_i}T_{E_i}$ is the homogeneous transformation of the end-effector coordinate system of each manipulator relative to its base coordinate system; and ${}^{E_i}T_{O}$ is the homogeneous transformation of the center-of-mass coordinate system of the target object relative to the manipulator end-effector.
After the above position constraint between the two arms, in order to achieve truly real-time cooperative operation, the velocity consistency of the two arms during motion must also be guaranteed in addition to the position constraint. The velocity constraint relationship is therefore analyzed with Equation (3), so that the two arms maintain consistency of position and velocity during motion and achieve cooperative operation:
$$\begin{bmatrix} v_i^{W} \\ \omega_i^{W} \end{bmatrix} = \begin{bmatrix} I_3 & -\big[{}^{W}R_{O}\,P_i^{O}\big]_{\times} \\ 0_{3\times 3} & I_3 \end{bmatrix} \begin{bmatrix} v_O^{W} \\ \omega_O^{W} \end{bmatrix} \quad (3)$$

where $v_i^{W}$, $\omega_i^{W}$ denote the velocity of the manipulator end-effector relative to the world coordinate system; $v_O^{W}$, $\omega_O^{W}$ denote the velocity and angular velocity of the target object relative to the world coordinate system; $[\,\cdot\,]_{\times}$ denotes the skew-symmetric (cross-product) matrix; $P_i^{O}$ is the position of the manipulator end-effector relative to the center of mass of the target object; and ${}^{W}R_{O}$ is the rotation matrix of the target object's center of mass relative to the world frame.
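A small numerical sketch of this rigid-grasp velocity constraint is given below; reading Equation (3) as the standard rigid-body twist mapping is itself an assumption based on the symbols defined above.

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix [v]_x such that skew(v) @ w == np.cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def end_effector_twist(v_O, w_O, R_WO, P_iO):
    """Map the object twist (v_O, w_O) in the world frame to the end-effector twist.

    R_WO : rotation of the object frame expressed in the world frame
    P_iO : position of end-effector i expressed in the object frame
    """
    r = R_WO @ P_iO                 # grasp-point offset expressed in the world frame
    v_i = v_O - skew(r) @ w_O       # equals v_O + w_O x r, matching the -[R_WO P_iO]_x block
    w_i = w_O                       # angular velocity is shared by the rigid body
    return v_i, w_i
```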
The force analysis of the target object is shown in Figure 2. In Figure 2, $F_L$, $F_R$, $F_{ext}$, $\tau_L$, $\tau_R$, $\tau_{ext}$ denote the forces and torques produced on the surface of the target object by the two manipulators and by the external disturbance, respectively. According to Newton's second law and the Euler equation, the state of the dual manipulators grasping the object is established, and the dynamic equations of the target object are established,
which are simplified to Equation (5):

$$M_O\,\ddot{x}_O + C_O = F_O + F_{ext} \quad (5)$$

where $I_O$ denotes the inertia matrix at the center of mass of the target object; $F_O \in R^6$ denotes the resultant wrench exerted by the dual manipulators on the object; $M_O$ denotes the mass-inertia matrix of the target object; $\ddot{x}_O$ denotes the linear and angular acceleration of the target object during motion; $C_O \in R^6$ denotes the resultant vector of the Coriolis, gravity, and centrifugal terms of the target object; and $F_{ext} \in R^6$ denotes the external disturbance wrench acting on the target object. Equation (5) is converted into Equation (6):

$$F_O = \sum_{k=l,r} S_k^{T} F_k \quad (6)$$

where $k = l, r$ denotes the left arm and the right arm of the dual manipulators, $S_k^{T}$ denotes the grasp matrix, and $F_k$ denotes the wrench applied by manipulator k on the target object, whose data can be collected directly with a six-axis force/torque sensor. Decomposing with the grasp matrix yields the external force, Equation (7), and the internal force, Equation (8):
$$F_E = S^{T+} F_O \quad (7)$$

$$F_I = \big(E - S^{T+} S^{T}\big) F \quad (8)$$

where $S^{T} = \big[\,S_l^{T}\ \ S_r^{T}\,\big]$, $F = \big[\,F_l^{T}\ \ F_r^{T}\,\big]^{T}$, $E$ is the identity matrix, and $S^{T+}$ is the generalized inverse of the matrix $S^{T}$.
In one embodiment, traditional impedance control implements a control strategy that adjusts force based on position error, while admittance control implements a control strategy that adjusts position based on force error; position-based impedance control is therefore also called admittance control. The impedance/admittance control strategy relies on a second-order "mass-damper-spring" system, and the dynamics of the target object are modeled with such a second-order system; the impedance control strategy is modeled by Equation (9). The present invention decomposes the dual-manipulator cooperative manipulation task into an inner loop and an outer loop: the inner loop mainly controls the internal force $F_I$ to prevent the manipulators from damaging the target object, and the outer loop controls the external force $F_E$ to ensure that the target object completes the cooperative task. From Equations (7) and (8), the force exerted by the dual manipulators on the target object can be divided into an internal force and an external force, so Equation (9) can be transformed into Equation (13).
In Equation (9), $x_e$ and $x_a$ denote the desired trajectory and the actual trajectory, respectively; $F_e$ and $F_a$ denote the desired force and the actual force, where $F_k = F_a$; $m_d$ denotes the inertia matrix; $b_d$ denotes the damping matrix; $k_d$ denotes the stiffness matrix; and Δf denotes the force error.
$$F_a = k_e\,(x_e - x_a) \quad (10)$$

where $k_e$ is the environment stiffness, the single-dimensional (scalar) form of $K_e$.
Substituting Equation (10) into Equation (9) gives Equation (11):

$$F_e - k_e\,(x_e - x_a) = \Delta f \quad (11)$$

From Equation (11), $x_e$ can be obtained as:

$$x_e = x_a + \frac{F_e - \Delta f}{k_e} \quad (12)$$

Substituting Equation (12) into Equation (9) gives Equation (13).
Applying the Laplace transform to Equation (13) gives:

$$\big[m s^{2} + b s + k_d + k_e\big]\,\Delta F(s) = \big[m s^{2} + b s + k_d\big]\big[F_d(s) + k_e\big(x_e(s) - x_r(s)\big)\big] \quad (14)$$
Therefore, the force-tracking steady-state error is given by Equation (15):

$$\Delta f_{ss} = \frac{k_d}{k_d + k_e}\Big[f_d + k_e\big(x_e - x_r\big)\Big] \quad (15)$$
It can be seen from Equation (15) that there are two ways to make the force-tracking steady-state error $\Delta f_{ss} = 0$. The first method is shown in Equation (16); since the stiffness $k_e$ and the position $x_e$ are both unknown in an uncertain environment, it is difficult to achieve $\Delta f_{ss} = 0$ in this way, and Equation (16) is not suitable. The second method is shown in Equations (15) and (17): when the stiffness term $k_d = 0$, $\Delta f_{ss} = 0$. Again because the environment is unknown, it is difficult to obtain the desired trajectory $x_r$ accurately, so the initial operating environment $x_e$ is used in place of $x_r$; letting $e = x - x_e$, the improved equation is obtained, as shown in Equation (18):

$$m\,\ddot{e}(t) + b\,\dot{e}(t) = \Delta f(t) \quad (18)$$
An appropriate inertia coefficient m and damping coefficient b should be selected according to the actual environment to achieve stable force tracking.
In an uncertain environment, considering that the operating-environment trajectory $x_e$ and the environment stiffness $k_e$ are unknown, all terms containing unknown parameters in Equation (18) are merged, giving Equation (19).
A new adaptive parameter ε is introduced to replace the unknown-parameter term, as shown in Equation (20).
The adaptive impedance equation is then obtained, as shown in Equation (21):

$$m\,\ddot{x}(t) + b\,\dot{x}(t) = \Delta f(t) + \varepsilon(t) \quad (21)$$
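A single-axis simulation of Equation (21) against a stiff spring contact can show the force error being driven to zero; in the sketch below, the spring-contact environment model and the simple integral-style update of ε are assumptions made only to keep the sketch self-contained, since in the described method ε is produced by the trained DDPG actor.

```python
import numpy as np

def simulate_force_tracking(F_e=10.0, k_env=1000.0, x_env=0.0,
                            m=1.0, b=80.0, dt=0.001, steps=10000):
    """1-DOF sketch of force tracking with the adaptive impedance relation (21)."""
    x, x_dot, eps = x_env, 0.0, 0.0
    forces = []
    for _ in range(steps):
        F_a = k_env * max(x - x_env, 0.0)          # spring-like contact force (no pulling)
        delta_f = F_e - F_a                        # force tracking error
        eps += 2.0 * delta_f * dt                  # placeholder adaptation of epsilon
        x_ddot = (delta_f + eps - b * x_dot) / m   # impedance relation (21) solved for x_ddot
        x_dot += x_ddot * dt
        x += x_dot * dt
        forces.append(F_a)
    return np.array(forces)

# forces = simulate_force_tracking(); forces[-1] approaches the desired 10 N contact force.
```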
为优化阻抗参数和阻抗参数集,将机械臂的力跟踪控制过程计为适用于不同环境的基于马尔科夫决策过程(Markov decision process,MDP)的强化学习框架,为IIC(Intelligent lmpedance Control,智能阻抗控制)方法设计合适的状态和动作值和并构造奖励函数,同时为其添加安全学习机制。马尔科夫决策过程作RL(一种以交互目标为导向的学习方法)的基本框架,其中包括(1)状态集S;(2)动作集合A;(3)状态转移概率(4)奖励概率/>(5)衰减因子γ;(6)actor/critic的online神经网络参数θQ和θμ。In order to optimize the impedance parameters and impedance parameter sets, the force tracking control process of the manipulator is calculated as a reinforcement learning framework based on Markov decision process (MDP) suitable for different environments, which is IIC (Intelligent lmpedance Control, intelligent Impedance control) method designs appropriate state and action values and constructs a reward function, while adding a safe learning mechanism to it. The Markov decision process is the basic framework of RL (an interactive goal-oriented learning method), which includes (1) state set S; (2) action set A; (3) state transition probability (4)Reward probability/> (5) Attenuation factor γ; (6) online neural network parameters θ Q and θ μ of actor/critic.
RL的基本过程如下,Agent在时刻t与环境发生交互,观察到环境状态的某种特征St∈S,并根据P输出一个动作At∈A。在下一时刻t+1,根据动作的结果,Agent获取到一个即时奖惩Rt+1∈R,并进入新的状态St+1,然后不断循环往复。上述过程被称为RL方法的策略π,可由式(22)表示,指在状态s时,动作集上的分布。The basic process of RL is as follows. The agent interacts with the environment at time t, observes a certain characteristic of the environment state S t ∈ S, and outputs an action A t ∈ A based on P. At the next time t+1, based on the result of the action, the Agent obtains an immediate reward and punishment R t+1 ∈R, and enters a new state S t+1 , and then the cycle continues. The above process is called the policy π of the RL method, which can be expressed by Equation (22), which refers to the distribution on the action set in state s.
π(a|s): S → A   (22)
Here a is a specific value of an action A_t in the action set A, and s is a specific value of a state S_t in the state set S.
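As a minimal illustration of this interaction cycle, the sketch below runs one generic episode; `env` and `policy` are hypothetical objects with a Gym-style reset/step interface and are not components defined in the text.

```python
# Minimal sketch of the agent-environment loop described above.
# `env` and `policy` are hypothetical objects with a Gym-style interface.
def run_episode(env, policy, max_steps=200):
    s = env.reset()                      # observe the initial state
    total_reward = 0.0
    for t in range(max_steps):
        a = policy(s)                    # A_t drawn from pi(a | s)
        s_next, r, done = env.step(a)    # receive R_{t+1} and S_{t+1}
        total_reward += r
        s = s_next
        if done:
            break
    return total_reward
```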
Multiple learning sequences can be obtained through repeated interaction between the environment and the manipulator. The average of the returns observed from a given state (the discounted weighted sum of all rewards from that moment onward) is the expected return of that state, i.e. its value; the average return of taking the same action in that state is the expected return of the state-action pair, i.e. the state-action value. In principle, with continued learning the values of all states and state-action pairs can be obtained, giving the state value function and the action value function.
By definition, the cumulative reward R_t is written as equation (23).
Here r is the reward. Because the policy π is stochastic, the cumulative reward R_t can take many random values. To evaluate a state s, a deterministic quantity describing s is needed; since R_t is random it cannot serve this purpose directly, so its expected value is used, which defines the state value function v_π(s) as in equation (24).
v_π(s) = E_π[R_t | s_t = s]   (24)
Correspondingly, the value of performing action a in state s under policy π is called the state-action value function, defined in equation (25).
Q_π(s, a) = E_π[R_t | s_t = s, a_t = a]   (25)
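To make the definitions in equations (23)-(25) concrete, the sketch below computes the discounted return of a reward sequence and a Monte Carlo estimate of a value as the average of sampled returns; the sample sequences are illustrative only.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """R_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ...  (equation (23))."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def mc_value_estimate(returns):
    """v_pi(s) or Q_pi(s, a) estimated as the average of observed returns (equations (24), (25))."""
    return float(np.mean(returns))

# Example: three sampled reward sequences starting from the same state
returns = [discounted_return(seq) for seq in ([1, 0, 2], [0, 1, 1], [2, 2, 0])]
print(mc_value_estimate(returns))
```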
From equation (24) the Bellman equation of the state value function is obtained; it relates the value of the current state s to the value of the subsequent state s′, as shown in equation (26).
A policy is a mapping from the state space to the action space, as in equation (27).
a = π(a|s)   (27)
Combining equations (25) and (27) gives an alternative expression for the state-action value function Q_π(s, a), as shown in equation (28).
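Equations (26) and (28) appear only as images in the source; the standard textbook forms, written out here for readability and assumed to match the patent's intent, are:

```latex
% Standard forms assumed for equations (26) and (28), which are images in the source.
v_{\pi}(s) = \mathbb{E}_{\pi}\!\left[\, r_{t+1} + \gamma\, v_{\pi}(s') \mid s_t = s \,\right] \qquad (26)

Q_{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\, r_{t+1} + \gamma\, Q_{\pi}\big(s', \pi(s')\big) \mid s_t = s,\; a_t = a \,\right] \qquad (28)
```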
Based on the adaptive impedance control system and the Markov decision process, DRL (deep reinforcement learning) is used to handle the heavy computation of high-dimensional dynamic environments, and the DDPG (Deep Deterministic Policy Gradient) algorithm is used to solve for the impedance control strategy of the IIC method. DDPG is a DRL algorithm built on the Actor-Critic architecture; it combines value-function updates with policy updates, has clear advantages for problems with continuous action spaces, and is therefore well suited to the force tracking control of a manipulator.
Based on the deterministic policy gradient, the Actor network parameters are updated according to the predicted action value; the deterministic policy is given by equation (29).
μ → a_t = μ(s_t)   (29)
Here μ is the policy function and s is the current state; under a deterministic policy the action in state s is uniquely determined, as follows.
a_t = μ(s_t | θ^μ) + N_t   (30)
Here θ is the policy parameter and N_t is exploration noise. During network training, a number of samples are drawn at random as training data for the deterministic policy μ, whose performance is measured by equation (31).
Here s_t ~ ρ^β denotes sampling a state s_t from the behavior distribution ρ^β, and Q^μ(s, μ(s)) is the Critic network's value for the given state s and action μ(s). The expectation is taken over the sampled states and actions; during training it is estimated by averaging over a batch of samples, i.e. the sample mean replaces the expectation.
The Actor network learns the optimal policy by maximizing the output of the Critic network; its parameters are updated by gradient ascent according to equation (32):
Here a_t ~ π denotes sampling an action a_t from the Actor policy π, Q(s, a | θ^Q) is the Critic network's value for the given state s and action a, and the gradient term is the gradient of the Critic output with respect to the Actor's action. This gradient indicates how a small change in the Actor network parameters, through the corresponding change in action, affects the Critic's estimate of the value function.
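A minimal PyTorch-style sketch of this Actor update follows; the `actor`, `critic` and `actor_optimizer` objects are assumed to exist (network sketches appear later), and the gradient ascent of equation (32) is realized as gradient descent on the negative Critic value.

```python
import torch

# Hedged sketch of the Actor update in equation (32): gradient ascent on
# Q(s, mu(s)) is implemented as gradient descent on its negative.
def update_actor(actor, critic, actor_optimizer, states):
    actions = actor(states)                        # a = mu(s | theta^mu)
    actor_loss = -critic(states, actions).mean()   # -E[Q(s, mu(s))]
    actor_optimizer.zero_grad()
    actor_loss.backward()                          # backprop through the critic into the actor
    actor_optimizer.step()
    return actor_loss.item()
```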
The parameters θ^Q of the current value network Q are updated by minimizing the loss function L(θ^Q) given in equation (33):
Here N is the number of random samples and γ is the discount factor.
Based on the deterministic policy gradient, the gradient of the behavior value network Q is computed according to equation (34) and the Critic network parameters are updated.
Here s_t and a_t are the state and action sampled from the behavior distribution ρ^β, s_{t+1} is the next state drawn from the distribution ε (the environment dynamics), and Q(s_i, a_i | θ^Q) is the Critic output, i.e. the estimated value of taking action a in state s; y is the target Q value, normally computed by the target network. The goal of this gradient estimate is to minimize the mean squared error between the Critic output and the target Q value. The expectation of this error is estimated from the sampled states, actions and next states, and the gradient with respect to the Critic parameters θ^Q is then used to update them.
y_i = r_i + γQ′(s_{i+1}, μ′(s_{i+1} | θ^{μ′}) | θ^{Q′})   (35)
Here r_i is the current reward, γ is the discount factor, and Q′ and μ′ are the target networks.
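The target of equation (35) and the mean-squared loss of equation (33) can be sketched as follows; the batch tensors (assumed to be column tensors of shape (N, 1)), the termination mask and the target-network objects are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of the Critic update: y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))
# (equation (35)) and the MSE loss of equation (33). The (1 - dones) termination
# mask is a common addition not written in (35).
def update_critic(critic, target_critic, target_actor, critic_optimizer,
                  states, actions, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():                            # targets are not differentiated
        next_actions = target_actor(next_states)     # mu'(s_{i+1} | theta^{mu'})
        y = rewards + gamma * (1.0 - dones) * target_critic(next_states, next_actions)
    q = critic(states, actions)                      # Q(s_i, a_i | theta^Q)
    critic_loss = F.mse_loss(q, y)                   # (1/N) * sum_i (y_i - Q)^2
    critic_optimizer.zero_grad()
    critic_loss.backward()
    critic_optimizer.step()
    return critic_loss.item()
```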
The target networks are copies of the online (behavior) networks; μ′ and Q′ are updated by a moving-average (soft) update, as in equation (36):
θ^{μ′} ← τθ^μ + (1 - τ)θ^{μ′}
θ^{Q′} ← τθ^Q + (1 - τ)θ^{Q′}   (36)
Here θ^{μ′}, θ^{Q′} and τ are parameters, with τ the update coefficient, generally taken as 0.001. In this architecture the policy network (Actor) updates its parameters to output the action a, and the value network (Critic) updates its parameters to approximate the state-action value function Q_π(s, a).
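A short sketch of the soft update in equation (36), assuming the online and target networks expose their parameters in the same order:

```python
import torch

# Hedged sketch of the soft (moving-average) target update of equation (36).
def soft_update(online_net, target_net, tau=0.001):
    with torch.no_grad():
        for p, p_targ in zip(online_net.parameters(), target_net.parameters()):
            p_targ.data.mul_(1.0 - tau)      # (1 - tau) * theta'
            p_targ.data.add_(tau * p.data)   # + tau * theta
```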
The network structure of the DDPG algorithm is shown in Figure 3.
The state space s of the cooperative motion of the dual manipulators is defined in equation (37).
s = {e_f, e_x, F_a, x_a}   (37)
Here e_f is the force tracking error, e_x the trajectory tracking error, F_a the actual force during dual-arm control, and x_a the actual trajectory during dual-arm control.
The goal of the DDPG algorithm is to output a suitable adaptive parameter ε from the force tracking error e_f and the trajectory tracking error e_x. As the output of the DRL algorithm, the designed action space therefore contains a single element; at time t the action is the adaptive parameter ε, and the action is given by equation (38).
a_t = {ε}   (38)
where ε is the adaptive parameter.
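A small sketch assembling the state of equation (37) and the single-element action of equation (38); scalar force and position signals are assumed purely for simplicity.

```python
import numpy as np

# Hedged sketch of the state and action of equations (37) and (38).
def build_state(e_f, e_x, F_a, x_a):
    """s = {e_f, e_x, F_a, x_a}  (equation (37))."""
    return np.array([e_f, e_x, F_a, x_a], dtype=np.float32)

def build_action(epsilon):
    """a_t = {eps}  (equation (38)): the single adaptive parameter."""
    return np.array([epsilon], dtype=np.float32)
```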
During position/force tracking of the manipulator, each time step must be evaluated immediately. A weighted sum of the difference between the actual force and the mean force error and the difference between the actual trajectory and the mean trajectory error forms one part of the reward function, called the basic reward part. The goal of the adaptive impedance control system is good tracking, i.e. keeping the force tracking error e_f and the trajectory tracking error e_x as close to zero as possible at all times. The errors e_f and e_x therefore form a further part of the reward function, called the additional excitation part (r_extre), as shown in equation (40); in addition, to speed up DRL training, different extra rewards or penalties are given depending on the interval in which the force tracking error lies. The full reward function, combining the basic part and the additional excitation part, is given in equation (39).
Here k_a and k_b are scale factors, H is the time span of the training process, and the remaining terms are the average trajectory error at time j, the average force error at time j, the actual force of the dual arms at time j, and their actual position at time j. In the additional excitation part, different extra rewards or penalties are given according to the intervals in which e_f and e_x lie, to speed up DRL training. The present invention uses the Adam gradient-descent algorithm as the optimizer of the Actor network. Adam is widely used in deep learning; its adaptive learning rate handles the different gradient scales of different parameters effectively, which improves training efficiency and makes the network easier to converge. It is a robust and efficient optimizer that usually performs well across many deep-learning tasks.
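Equations (39)-(40) are only shown as images in the source, so the sketch below illustrates the described structure (a basic error-penalty part plus an interval-dependent extra part) with placeholder weights and thresholds rather than the patent's actual values.

```python
# Hedged sketch of the reward described above. The weighting of the basic part
# and the interval thresholds of the extra part are placeholders for illustration.
def reward(e_f, e_x, k_a=1.0, k_b=1.0):
    r_base = -(k_a * abs(e_f) + k_b * abs(e_x))   # basic part: penalize both errors

    # additional excitation part: interval-dependent bonus/penalty on the force error
    if abs(e_f) < 0.1:
        r_extra = 1.0        # small error -> bonus
    elif abs(e_f) < 1.0:
        r_extra = 0.0
    else:
        r_extra = -1.0       # large error -> extra penalty
    return r_base + r_extra
```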
The model is trained, with the results shown in Figure 4. For roughly the first 100 episodes the algorithm is in the exploration stage and the cumulative reward rises quickly. After 100 episodes the cumulative reward curve begins to converge, indicating that a good policy has been found.
Based on the Markov decision process above, the basic adaptive impedance control is combined with the DDPG algorithm to form the reference-model deep-reinforcement-learning adaptive variable impedance control, as shown in Figure 5.
In one embodiment, when the two arms cooperatively grasp a common target object that interacts with the environment, the object is subject to both internal and external forces. Based on the deep-reinforcement-learning adaptive impedance control strategy above, the present invention designs a dual-arm deep-reinforcement-learning adaptive impedance control strategy; the main control block diagram is shown in Figure 6, which shows the symmetric control strategy of the dual arms together with the external-impedance and internal-impedance control strategies of the dual-arm control system.
In Figure 6, F_Ea and F_Ee are the actual and desired external forces, and F_Ia and F_Ie are the actual and desired internal forces. σF_E and σF_I are the errors between the actual and desired external forces and between the actual and desired internal forces, respectively. σX_E and σX_I are the position compensations generated by the external and internal deep-reinforcement-learning adaptive impedance controllers. X_c and X_e are the actual and desired trajectories of the target object. X_a is the actual end-effector trajectory of the dual-arm system, which is decomposed via the dual-arm closed-chain constraint into the end-effector trajectories X_al and X_ar of the left and right arms. θ_l and θ_r are the actual joint angles of the left and right arms obtained from these end-effector trajectories by inverse kinematics.
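The following sketch only traces the signal flow named in Figure 6; the two controller objects, the closed-chain split function and the way the compensations are combined with the desired trajectory are assumptions made for illustration.

```python
# Hedged sketch of the signal flow of Figure 6. Only the signal composition is
# taken from the text; the controller calls and the additive combination with
# the desired trajectory X_e are assumptions.
def dual_arm_step(F_Ea, F_Ee, F_Ia, F_Ie, X_e, external_ctrl, internal_ctrl, split_closed_chain):
    dF_E = F_Ea - F_Ee                       # external-force error (sigma F_E)
    dF_I = F_Ia - F_Ie                       # internal-force error (sigma F_I)
    dX_E = external_ctrl(dF_E)               # position compensation sigma X_E
    dX_I = internal_ctrl(dF_I)               # position compensation sigma X_I
    X_a = X_e + dX_E + dX_I                  # commanded end-effector trajectory (assumed composition)
    X_al, X_ar = split_closed_chain(X_a)     # closed-chain constraint -> left/right trajectories
    return X_al, X_ar                        # joint angles then follow from inverse kinematics
```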
In one embodiment, an impedance model of the dual manipulator arms grasping the target object is constructed.
When constructing this impedance model, the present invention first considers the manipulator's response on contact with the environment; the forces on the target object are therefore decomposed, the internal and external forces acting on the object are decoupled, and adaptive impedance control is applied to each separately, optimizing the overall control strategy and improving control accuracy.
While the dual arms are in contact with the target object or the target environment, the manipulator and control-strategy parameters change; these include the stiffness and damping of the manipulator, which describe its motion and force characteristics during grasping.
To train the manipulator's behavior in an uncertain environment, the present invention first collects a data set of the manipulator's states during actual operation, including its current position, velocity and acceleration, as well as sensor data such as the forces and torques arising when the manipulator contacts the environment.
Within the reinforcement-learning framework, the present invention adopts the deep deterministic policy gradient (DDPG) algorithm. It has two key parts: an Actor network that outputs continuous actions and a Critic network that evaluates their quality. By learning continuously during actual operation, DDPG lets the manipulator adjust its action policy to maximize the cumulative reward.
During the training phase, the present invention maintains an experience pool containing the manipulator's current state, the action taken, the reward obtained, and the next state. Samples are drawn at random from the pool to train the Actor-Critic networks and obtain the optimal network structure.
Finally, using the trained Actor-Critic networks, the manipulator can make the best action choice for the current state, adapting to an uncertain environment and grasping the target object efficiently. The invention allows the manipulator to keep optimizing its behavior through reinforcement learning and thus adapt better to complex working scenarios.
The present invention uses the DDPG (Deep Deterministic Policy Gradient) algorithm as the deep-reinforcement-learning method by which the dual-arm robot grasps the target object in a continuous action space. The overall algorithm flow is shown in Figure 7:
First, in the initialization phase, the network parameters are initialized. This covers the Actor network, which outputs continuous actions, and the Critic network, which evaluates their quality. An experience pool is also set up to store the robot's experience in the environment as tuples of state, action, reward and next state.
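A minimal experience-pool sketch matching this description; the capacity and batch size are placeholders.

```python
import random
from collections import deque

# Hedged sketch of the experience pool: store (s, a, r, s_next, done) tuples
# and sample random mini-batches for training.
class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```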
Next, the robot's state and action spaces are defined explicitly so that they cover continuous operation in the real environment. The Actor network is designed to generate continuous actions, while the Critic network evaluates their quality and outputs the corresponding action value function.
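A compact sketch of the two networks; the layer widths, the tanh output scaling and the action bound `max_eps` are illustrative assumptions, not values from the text.

```python
import torch
import torch.nn as nn

# Hedged sketch of the two networks used by DDPG in this setting.
class Actor(nn.Module):
    def __init__(self, state_dim=4, action_dim=1, max_eps=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, action_dim), nn.Tanh(),
        )
        self.max_eps = max_eps

    def forward(self, state):
        return self.max_eps * self.net(state)      # continuous action (the adaptive parameter)

class Critic(nn.Module):
    def __init__(self, state_dim=4, action_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))   # scalar Q(s, a)
```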
To improve training stability, target networks are introduced as copies of the Actor and Critic networks; their parameters are updated periodically by a fixed proportion, which mitigates instability during training.
Then the optimizers and loss function are defined: suitable optimizers, usually Adam, are chosen for the Actor and Critic networks, and the mean-squared-error loss of the Critic network is defined to measure the gap between the estimated and target values.
Setting the hyperparameters is also an important step, including the learning rate, the discount factor and the soft-update coefficient; these choices directly affect the convergence and performance of the algorithm.
The training phase is the core of the algorithm. At each time step the robot selects an action: the Actor network generates an action to which exploratory noise is added. After the selected action is executed, the environment's feedback, namely the reward and the next state, is observed and the experience tuple is stored in the pool. A batch of data is then sampled from the pool, the Critic loss is computed and backpropagated to optimize the network parameters, the target Q value is computed with the target Critic network, the Actor parameters are updated to maximize the Q value, and finally a soft update adjusts the target-network parameters. This loop iterates continuously, gradually improving the Actor and Critic networks.
After training is complete, the robot executes actions with the trained Actor network in the real environment. To ensure the algorithm remains effective, its performance in the real environment can be evaluated periodically so that the hyperparameters can be adjusted and optimized, preserving robustness and generalization. Through this procedure the DDPG algorithm lets the robot handle complex, continuous action spaces and achieve efficient reinforcement learning.
In one embodiment, to verify the practicality of the adaptive control algorithm, experiments are carried out on the platform shown in Figure 8. The platform comprises two UR5 manipulators, a UR5 control cabinet, a UR5 teach pendant, a PC, a six-axis force sensor and a network switch. The two UR5 arms communicate through their own control cabinets; in the present invention the UR5 arms communicate over the Ethernet protocol and the six-axis force sensor over the TCP protocol. Both arms and the sensor are connected to the switch, which exchanges data with the PC over TCP/IP, so that the PC can directly control the motion of the arms and read the force-sensor data in real time.
In the simulation experiment, a standard force signal is set as the force tracking target at the manipulator end-effector; the tracking force is applied along the z-axis of the end-effector operational space and no target force is set along the other axes. The target force in the z direction is Fe = 8, the simulation time is 3 s, and the force and displacement tracking are observed, as shown in Figures 9A and 9B. A deep-reinforcement-learning module is added to the conventional admittance control, and three algorithms are compared in the simulation to highlight their relative strengths and weaknesses. Under constant force and a constant trajectory, deep reinforcement learning does not track the desired force and trajectory with smaller error than the other two algorithms, but all errors stay within 0.0005 and deep reinforcement learning reaches the desired force and trajectory fastest. Although the fixed-impedance controller is smaller than deep reinforcement learning in terms of speed and error, it exhibits overshoot and oscillation in the early stage of the simulation. Overall, deep reinforcement learning outperforms the other two algorithms, verifying the feasibility of the deep-reinforcement-learning adaptive variable impedance control algorithm.
In the simulation experiment, a standard force signal is again set as the force tracking target at the manipulator end-effector; the tracking force is applied along the z-axis of the operational space and no target force is set along the other axes. With the target force Fe = 8 in the z direction and a simulation time of 3 s, the force and displacement tracking are observed. Figures 10A and 10B show the simulation results of desired-force and desired-trajectory tracking of the deep-reinforcement-learning adaptive impedance control under constant force with a varying trajectory. In this case deep reinforcement learning tracks the desired force and trajectory with smaller error than the other two algorithms and converges to them relatively quickly. The fixed-impedance controller converges fastest but shows overshoot and oscillation early in the simulation and oscillates during subsequent motion, while the adaptive variable impedance controller has larger error and converges more slowly. Overall, deep reinforcement learning again performs best, and its adaptive variable impedance control achieves better tracking of the desired force and trajectory.
In the simulation experiment, a standard force signal is set as the force tracking target at the end-effector; the tracking force is applied along the z-axis of the operational space and no target force is set along the other axes. The target force in the z direction is Fe = 8, the trajectory is Xe = 0.1 + 0.1sin(3π), the simulation time is 3 s, and the force and displacement tracking are observed. Figures 11A and 11B show the simulation results of desired-force and desired-trajectory tracking of the reference-model adaptive impedance control under varying force and a varying trajectory. In this case deep reinforcement learning tracks the desired force and trajectory with smaller error than the other two algorithms and converges relatively quickly; the fixed-impedance controller converges fastest but overshoots and oscillates early in the simulation, while the adaptive variable impedance controller has larger error, overshoots early in convergence, and converges slowly. Overall, deep reinforcement learning performs best, and the reference-model deep-reinforcement-learning adaptive variable impedance control stabilizes the force along the dynamic trajectory with a smaller force error.
Compared with the current state of the art, the deep reinforcement learning used in the present invention has learning ability: it optimizes the control strategy through continual trial and error and feedback, improving the control performance of the system. Compared with traditional control algorithms, it adapts better to unknown environments and tasks, and it can adjust the control strategy in response to real-time changes in the environment and task, achieving adaptive control. This adaptability lets the system cope with complex and dynamic working environments. Training data also improves the robustness of the system, so that it tolerates noise, disturbance and uncertainty better and achieves more precise force coordination control.
In addition, as shown in Figure 12, an embodiment of the present invention also discloses an adaptive variable impedance control device, which includes:
a construction module 110, configured to construct the impedance model of the robot's dual manipulators when grasping a target object, decouple the internal and external forces acting on the target object, and apply adaptive impedance control to the internal and external forces separately;
an initialization module 120, configured to initialize the network parameters of the impedance model and an experience pool used to store the robot's experience tuples in the environment, where the impedance model includes an Actor network that generates continuous actions and a Critic network that evaluates action quality and outputs the corresponding action value function;
a training module 130, configured to select an action from the robot's state space, execute it, store the experience tuple fed back by the environment in the experience pool, randomly sample a batch of data from the pool, compute the Critic network's loss and backpropagate it, compute the target Q value with the Critic target network, update the Actor network's parameters to maximize the Q value, and repeat training until a preset number of iterations is reached, yielding the trained Actor-Critic networks;
an execution module 140, configured to use the trained Actor-Critic networks to execute the dual-arm actions in the real environment and grasp the target object.
The adaptive variable impedance control device of this embodiment of the present invention is used to execute the adaptive variable impedance control method of the above embodiments; its processing is the same as that of the method described above and is not repeated here.
In addition, as shown in Figure 13, an embodiment of the present invention also discloses an electronic device comprising at least one processor 210 and at least one memory 220 storing at least one program; when the at least one program is executed by the at least one processor 210, the adaptive variable impedance control method of any of the preceding embodiments is implemented.
In addition, an embodiment of the present invention also discloses a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to execute the adaptive variable impedance control method of any of the preceding embodiments.
The system architecture and application scenarios described in the embodiments of the present invention are intended to explain the technical solutions of these embodiments more clearly and do not limit them; those skilled in the art will appreciate that, as system architectures evolve and new application scenarios emerge, the technical solutions provided by the embodiments of the present invention remain applicable to similar technical problems.
Those of ordinary skill in the art will understand that all or some of the steps of the methods and the functional modules/units of the systems and devices disclosed above may be implemented as software, firmware, hardware, or suitable combinations thereof.
In a hardware implementation, the division into functional modules/units mentioned above does not necessarily correspond to the division of physical components; for example, one physical component may have several functions, or one function or step may be performed by several physical components in cooperation. Some or all physical components may be implemented as software executed by a processor such as a central processing unit, a digital signal processor or a microprocessor, as hardware, or as an integrated circuit such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media covers volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information such as computer-readable instructions, data structures, program modules or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and can be accessed by a computer. In addition, it is well known to those of ordinary skill in the art that communication media typically embody computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
The terms "component", "module" and "system" used in this specification denote computer-related entities, hardware, firmware, combinations of hardware and software, software, or software in execution. For example, a component may be, but is not limited to, a process running on a processor, a processor, an object, an executable file, a thread of execution, a program or a computer. By way of illustration, both an application running on a computing device and the computing device itself may be components. One or more components may reside within a process or thread of execution, and a component may be localized on one computer or distributed between two or more computers. These components may also execute from various computer-readable media having various data structures stored thereon, and may communicate by local or remote processes, for example according to signals having one or more data packets (such as data from two components interacting with another component in a local system, in a distributed system, or across a network such as the Internet that interacts with other systems by means of signals).