
CN119758719B - Inverted Pendulum Stabilization Method for Quadruped Robot Based on Hybrid State Estimation and Reinforcement Learning - Google Patents

Inverted Pendulum Stabilization Method for Quadruped Robot Based on Hybrid State Estimation and Reinforcement Learning

Info

Publication number
CN119758719B
CN119758719B (Application CN202411897937.3A)
Authority
CN
China
Prior art keywords
inverted pendulum
quadruped robot
network
parameters
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411897937.3A
Other languages
Chinese (zh)
Other versions
CN119758719A (en)
Inventor
秦家虎
江一鸣
刘轻尘
闫成真
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202411897937.3A priority Critical patent/CN119758719B/en
Publication of CN119758719A publication Critical patent/CN119758719A/en
Application granted granted Critical
Publication of CN119758719B publication Critical patent/CN119758719B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Feedback Control In General (AREA)

Abstract

The invention relates to the technical field of robots and automation, and discloses a quadruped robot inverted pendulum stabilization control method based on deep reinforcement learning, comprising the following steps: fixing a first-order inverted pendulum on the body of the quadruped robot; modeling the motion of the quadruped robot as a partially observable Markov decision process, and training the policy output by the partially observable Markov decision process with a proximal policy optimization algorithm based on an actor-critic model, wherein the actor-critic model comprises a policy network and a value network; a domain randomization technique is adopted during policy training to randomize the parameters of the environment, and a reward function that jointly considers a velocity tracking reward, a stability penalty, and an inverted pendulum penalty term is designed to train the value network and provide a supervision signal. The invention designs an end-to-end quadruped robot inverted pendulum stabilization method based on hybrid state estimation, which improves the balancing ability and stability of the robot.

Description

Reinforcement learning quadruped robot inverted pendulum stabilization method based on hybrid state estimation
Technical Field
The invention relates to the technical field of robots and automation, in particular to an inverted pendulum stabilization method of a reinforcement learning quadruped robot based on hybrid state estimation.
Background
Existing quadruped robot research focuses mainly on locomotion control and pays little attention to the joint optimization of body stability and manipulation performance, which limits the robot's adaptability and application potential in higher-level tasks. The inverted pendulum, as a classical nonlinear dynamical system, has long been regarded as a standard test platform for verifying the effectiveness and robustness of control methods. In the invention, a first-order inverted pendulum is fixed on the body of the quadruped robot, which intuitively demonstrates the balancing ability and stability of the robot in a dynamic environment and provides an innovative research perspective and technical means for improving the overall performance of the robot.
Locomotion control methods for quadruped robots are generally classified into conventional model-based control methods and learning-based control methods. Conventional model-based methods rely on accurate system modeling and usually involve multiple complex modules such as state estimation, terrain reconstruction, and a whole-body controller. These methods are typically built on strict assumptions, such as collision-free and slip-free contact, which are often difficult to satisfy in practical applications, limiting the applicability of conventional control methods. In addition, for the disturbances an inverted pendulum may face when deployed on a real quadruped robot, such as changes in pendulum mass, shifts of the centroid position, and fluctuations in the damping coefficient and friction, conventional algorithms lack sufficient adaptive capability and therefore struggle to cope with these complex dynamic changes.
Reinforcement-learning-based locomotion control for quadruped robots has advanced significantly in recent years, showing excellent performance in complex scenarios such as running in the wild, along with a certain degree of manipulator control capability. Methods based on deep reinforcement learning convert the complex optimization problem into an offline training stage by learning a decision policy, which markedly reduces the dependence on an accurate model and exhibits stronger robustness and adaptability. However, most existing reinforcement learning studies focus on locomotion control and pay little attention to the joint optimization of body stability and manipulation performance, which limits their adaptability and application potential in higher-level tasks.
For the problem of balancing an inverted pendulum on the body of a quadruped robot, existing methods improve an actor-critic network based on the deep deterministic policy gradient (DDPG), design a hierarchically enhanced reward function, and obtain a control policy that improves the balancing ability and stability of the robot through interactive training with a quadruped robot inverted pendulum balancing model. However, these methods have drawbacks. For example, the DDPG policy is tightly coupled to the value function and is susceptible to overestimation bias, which makes the training process unstable. In addition, DDPG relies on Gaussian noise or noise processes to drive policy exploration, which is inefficient in high-dimensional action spaces and easily traps the policy in local optima, limiting its performance.
Disclosure of Invention
To solve the above technical problems, the invention provides a reinforcement learning quadruped robot inverted pendulum stabilization method based on hybrid state estimation, which enables the quadruped robot to achieve robust locomotion control from proprioception alone and to stably hold the inverted pendulum system mounted on its body. By designing a parameter estimator based on hybrid state information, the key parameters of the inverted pendulum are accurately estimated in real time and the posture of the robot is dynamically adjusted, thereby achieving a deep fusion of motion control and system stability.
In order to solve the technical problems, the invention adopts the following technical scheme:
A quadruped robot inverted pendulum stabilization control method based on deep reinforcement learning, which estimates the key parameters of the inverted pendulum in real time and dynamically adjusts the posture of the quadruped robot so as to achieve a deep fusion of motion control and system stability, specifically comprising:
fixing a first-order inverted pendulum on the body of the quadruped robot;
modeling the motion of the quadruped robot as a partially observable Markov decision process, and training the policy output by the partially observable Markov decision process with a proximal policy optimization algorithm based on an actor-critic model;
the actor-critic model comprises a policy network and a value network, wherein the policy network outputs a policy according to the input state, and a specific action is sampled or directly selected from the probability distribution corresponding to the policy; the input state comprises the quadruped robot's own observation and a state estimate;
a domain randomization technique is adopted during policy training to randomize the parameters of the environment;
a reward function that jointly considers a velocity tracking reward, a stability penalty, and an inverted pendulum penalty term is designed to train the value network and provide a supervision signal.
Further, the policy network outputs a policy according to the input state and samples or directly selects a specific action from the probability distribution corresponding to the policy, the input state comprising the quadruped robot's own observation o_t and a state estimate, which specifically comprises:
the state estimate comprises an implicit state z_t, the body linear velocity v_t, and estimated inverted pendulum parameters p_t;
the policy π_φ(a_t|o_t, v_t, z_t, p_t) infers the action a_t from the quadruped robot's own observation o_t, the implicit state z_t, the body linear velocity v_t, and the estimated inverted pendulum parameters p_t;
wherein t is the index of the current time step; the inverted pendulum parameters p_t form a vector whose components are the friction coefficient of the inverted pendulum, the mass of the inverted pendulum, and the z-axis offset of the centroid of the inverted pendulum; the quadruped robot's own observation o_t is a vector of proprioceptive information whose components include c_t, ω_t, g_t, f_t, θ_t and a_{t-1}, denoting respectively the body linear velocity command of the quadruped robot, the angular velocity of the body, the gravity unit vector of the body coordinate frame, the foot contact booleans, the joint angles, and the action of the previous time step, together with the inverted pendulum angle, the joint angular velocities, and the inverted pendulum angular velocity;
the joint position offsets output by the policy are taken as the action a_t; a_t is the offset from the quadruped robot's initial standing posture θ_def, so the desired joint angles θ_des of the robot are defined as:
θ_des = θ_def + a_t;
the desired angle of each joint is tracked by a proportional-derivative controller.
Further, the domain randomization technique adopted during policy training to randomize the parameters of the environment specifically comprises:
the parameters of the environment include the body load weight, the PD controller parameters, the centroid offset, and the system delay;
random noise of varying magnitude is added to the quadruped robot's own observation that is input to the policy network, while domain randomization is applied to the body load weight, the PD controller parameters, the centroid offset, and the system delay.
Further, the value network is used to evaluate the performance of the current policy network, which specifically comprises:
the input s_t of the value network comprises the quadruped robot's own observation o_t, the body linear velocity v_t, and the inverted pendulum parameters p_t:
s_t = [o_t, v_t, p_t]^T.
Further, designing the reward function that jointly considers the velocity tracking reward, the stability penalty, and the inverted pendulum penalty term to train the value network and provide a supervision signal specifically comprises:
the velocity tracking reward includes tracking of the linear velocity and the angular velocity;
the stability penalty includes limiting the velocity of the quadruped robot's body in the z-axis direction, the angular velocities about the x-axis and y-axis, the orientation, the joint accelerations, the joint forces, the body height, the action frequency, and the smoothness;
the inverted pendulum penalty term includes penalties on the inverted pendulum angle and velocity.
Further, the state estimate is obtained by a state estimator;
the state estimator consists of a memory encoder and a source encoder, wherein the memory encoder adopts a long short-term memory (LSTM) network;
after the quadruped robot's own observation o_t is fed into the long short-term memory network, the resulting output is fed to the source encoder to obtain the state estimate, which comprises the explicit body linear velocity v_t, the inverted pendulum parameters p_t, and the implicit state z_t;
based on the explicit body linear velocity v_t and the inverted pendulum parameters p_t, together with the ground-truth body linear velocity v̄_t and the ground-truth inverted pendulum parameters p̄_t, the estimation error L_est is computed using the root mean square error:
L_est = MSE(v_t, v̄_t) + MSE(p_t, p̄_t),
where MSE(·) denotes the root mean square error;
the observation o_{t+1} of the next time step is predicted from the observations o_{t-H:t} of the past period, where H denotes the time step; o_{t+1} is taken as the target vector and o_{t-H:t} as the source vector, which are fed into the target encoder and the source encoder respectively to obtain the target vector q_t and the source vector q̂_t; the target vector q_t and the source vector q̂_t are L2-normalized to obtain a normalized matrix E, and the dot products of q_t and q̂_t with the normalized matrix are passed through a normalized exponential (softmax) function to obtain the target prediction probability p_t^tgt and the source prediction probability p_t^src:
p_t^tgt(k) = exp(q_t · E_k / τ) / Σ_j exp(q_t · E_j / τ),  p_t^src(k) = exp(q̂_t · E_k / τ) / Σ_j exp(q̂_t · E_j / τ),
where τ is a temperature parameter and E_k denotes the k-th element of the normalized matrix E; based on the cluster-assignment predictions and targets, the sole objective of representation learning is defined as maximizing the prediction accuracy by computing the cross information entropy J;
H denotes the time-domain length of the observations, and the expected assignments of the source vector and the target vector are computed with the Sinkhorn-Knopp algorithm; the cross information entropy J is used as the gradient of the target encoder, and the sum of the cross information entropy J and the root mean square error L_est is used as the gradient of the long short-term memory network and the source encoder for training.
Compared with the prior art, the invention has the beneficial technical effects that:
The invention designs an end-to-end quadruped robot inverted pendulum stabilization method based on hybrid state estimation, which improves the balancing ability and stability of the robot. The state estimator, which mixes implicit and explicit information, not only accurately estimates the linear velocity of the quadruped robot but also estimates in real time the key parameters of the inverted pendulum system mounted on the body, including the pendulum's mass, centroid position, and friction. With this design, the quadruped robot maintains efficient and stable balance control even when carrying different types of inverted pendulum systems or when facing the gap between simulation and the real system, significantly enhancing the adaptability and robustness of the system.
The conventional DDPG algorithm is unstable during training and tends to fall into local optima. The invention therefore adopts the PPO algorithm instead, which effectively limits the magnitude of each policy update and avoids large fluctuations of the policy during optimization, so that the action space can be covered more smoothly. PPO exhibits more efficient exploration and robustness, especially in high-dimensional environments. Furthermore, an LSTM is introduced for hybrid state estimation, which better processes sequential data and captures temporal dependencies. By extracting hybrid state information, the policy can capture more key features from the environment, markedly improving its stability and adaptability and yielding superior performance in complex dynamic tasks.
Drawings
Fig. 1 is an overall system block diagram of the present invention.
Detailed Description
A preferred embodiment of the present invention will be described in detail with reference to the accompanying drawings.
The invention provides a quadruped robot inverted pendulum stabilization control method based on deep reinforcement learning, which intuitively demonstrates the balancing ability and stability of the robot in a dynamic environment by fixing a first-order inverted pendulum on the body of the quadruped robot. On this experimental platform, the invention combines hybrid state estimation and designs an end-to-end motion learning framework built on a proximal-policy-optimized actor-critic model, realizing synchronous training of the motion policy and the state estimator and effectively improving the control performance and robustness of the system.
1. Reinforcement learning problem description:
Because the robot lacks exteroceptive sensors, the terrain information cannot be fully observed, so the invention models the locomotion problem as a partially observable Markov decision process (POMDP). The state at time step t is defined as x_t; the agent's policy executes an action a_t, the environment transitions to the state x_{t+1} of the next time step with transition probability P(x_{t+1}|x_t, a_t), and returns the reward value r_t and a partial observation o_t of the state. The goal of reinforcement learning is to find a policy π that maximizes the expected return over future trajectories (also referred to as the expected cumulative reward):
J(π) = E[ Σ_t γ^t r_t ],
where γ ∈ [0, 1) is the discount factor.
To learn robust blind locomotion in a single training stage, the invention adopts the proximal policy optimization (PPO) algorithm to train the policy, combined with an asymmetric actor-critic framework to improve the learning efficiency and expressiveness of the model. By introducing a clipped probability ratio, the PPO algorithm limits the deviation between the new policy and the old policy at each update, effectively controlling fluctuations during optimization. Compared with other policy gradient methods, PPO achieves a good balance between computational complexity and performance and provides a reliable guarantee for efficient and stable policy learning.
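To make the clipped update concrete, the following is a minimal sketch of the PPO clipped surrogate loss; it is an illustration under common conventions, not the invention's actual training code, and the tensor names and the default clipping range are assumptions.

```python
import torch

def ppo_clipped_loss(log_prob_new, log_prob_old, advantage, clip_eps=0.2):
    """Clipped surrogate objective of PPO, returned as a loss to minimize.

    log_prob_new : log pi_phi(a_t | s_t) under the current policy
    log_prob_old : log probability under the policy that collected the data
    advantage    : estimated advantage A_t (e.g. from GAE)
    clip_eps     : clipping range; 0.2 is a common default, assumed here
    """
    ratio = torch.exp(log_prob_new - log_prob_old)            # probability ratio
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # PPO maximizes the minimum of the two terms; the loss is its negative mean
    return -torch.min(unclipped, clipped).mean()
```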
2. Policy network:
The policy network outputs a policy, typically a probability distribution, according to the input state, and a specific action is sampled or directly selected from that distribution. The policy π_φ(a_t|o_t, v_t, z_t, p_t), given the parameters φ, infers the action a_t. Its inputs are the robot's own observation o_t, the body linear velocity v_t, the implicit state z_t, and the inverted pendulum parameters p_t, where p_t is a column vector whose components are the friction coefficient of the inverted pendulum, the mass of the inverted pendulum, and the z-axis offset of the centroid of the inverted pendulum. The robot's own observation o_t is a column vector of proprioceptive information whose components include c_t, ω_t, g_t, f_t, θ_t and a_{t-1}, denoting respectively the body linear velocity command of the quadruped robot, the angular velocity of the body, the gravity unit vector of the body coordinate frame, the foot contact booleans, the joint angles, and the action of the previous time step, together with the inverted pendulum angle, the joint angular velocities, and the inverted pendulum angular velocity.
The policy output is the position offsets of the 12 joints of the quadruped robot, taken as the action a_t ∈ R^12; a_t is the offset from the quadruped robot's initial standing posture θ_def, so the desired joint angles θ_des of the robot are defined as:
θ_des = θ_def + a_t;
the desired angle of each joint is tracked by a proportional-derivative (PD) controller, with the physical parameters set to P = 40.0 and D = 1.0.
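As a concrete illustration of how the desired joint angles could be tracked with the stated gains, here is a minimal sketch; the array names, the use of NumPy, and the torque-level interface are assumptions for the example, not the invention's controller implementation.

```python
import numpy as np

P_GAIN = 40.0   # proportional gain, as stated above
D_GAIN = 1.0    # derivative gain, as stated above

def pd_joint_torques(action, theta_def, theta, theta_dot):
    """Track theta_des = theta_def + a_t with a proportional-derivative law.

    action    : policy output a_t, offsets of the 12 joint positions
    theta_def : joint angles of the initial standing posture
    theta     : measured joint angles
    theta_dot : measured joint angular velocities
    """
    theta_des = theta_def + action                      # desired joint angles
    return P_GAIN * (theta_des - theta) - D_GAIN * theta_dot

# Example: a zero action commands the robot to hold its standing posture.
tau = pd_joint_torques(np.zeros(12), np.full(12, 0.8), np.full(12, 0.8), np.zeros(12))
```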
3. Domain randomization:
Domain randomization is a technique widely used in reinforcement learning and robot control to improve the generalization ability of a model in the real environment. Its core idea is to randomize the environment parameters during training so that the model learns under a variety of conditions and therefore exhibits stronger adaptability and robustness in unknown scenarios. In the method, random noise of varying magnitude is added to the observation fed into the policy network, while domain randomization is applied to key factors such as the body load weight, the PD controller parameters, the centroid offset, and the system delay.
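A minimal sketch of this kind of episode-level randomization is shown below; the parameter ranges, dictionary keys, and noise scale are illustrative assumptions, not the values used by the invention.

```python
import numpy as np

def sample_randomized_env(rng: np.random.Generator) -> dict:
    """Draw one randomized set of environment parameters for a training episode.

    The ranges below are placeholders; a real setup would tune them per robot.
    """
    return {
        "payload_mass_kg":    rng.uniform(0.0, 3.0),             # body load weight
        "p_gain":             40.0 * rng.uniform(0.9, 1.1),      # PD controller parameters
        "d_gain":             1.0 * rng.uniform(0.9, 1.1),
        "com_offset_m":       rng.uniform(-0.03, 0.03, size=3),  # centroid offset
        "action_delay_steps": int(rng.integers(0, 3)),           # system delay
    }

def add_observation_noise(obs: np.ndarray, rng: np.random.Generator,
                          scale: float = 0.01) -> np.ndarray:
    """Add random noise of a chosen magnitude to the observation fed to the policy."""
    return obs + rng.normal(0.0, scale, size=obs.shape)

rng = np.random.default_rng(0)
env_params = sample_randomized_env(rng)   # resampled at every episode
```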
4. Value network:
The value network is used to evaluate the performance of the current policy and to help the policy network learn a better policy. In order to provide an accurate body linear velocity and accurate inverted pendulum parameters, the input of the value network includes not only the robot's own observation o_t but also privileged observations comprising the body linear velocity v_t and the inverted pendulum parameters p_t:
s_t = [o_t, v_t, p_t]^T.
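To make the asymmetry between actor and critic concrete, the sketch below shows how the actor input (observation plus estimator outputs) and the privileged critic input s_t = [o_t, v_t, p_t] might be assembled; the tensor names are illustrative assumptions.

```python
import torch

def build_actor_input(o_t, v_est, z_est, p_est):
    """The actor only sees the observation and the state estimator outputs."""
    return torch.cat([o_t, v_est, z_est, p_est], dim=-1)

def build_critic_input(o_t, v_true, p_true):
    """The critic is privileged: it also sees the simulator's true body linear
    velocity and true inverted pendulum parameters, s_t = [o_t, v_t, p_t]."""
    return torch.cat([o_t, v_true, p_true], dim=-1)
```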
5. Reward function design:
The reward function provides the supervision signal for training the value network; its design jointly considers the velocity tracking reward and the stability penalty so as to achieve stable and natural locomotion behavior. The velocity tracking reward includes accurate tracking of the linear and angular velocities, while the stability penalty covers multiple aspects, including limiting the body velocity in the z-axis direction, the angular velocities in the roll and pitch directions, the offset of the gravity component, and the joint accelerations, among others. In addition, for the inverted pendulum stabilization task, the reward function introduces penalties on the inverted pendulum angle and velocity to further improve the execution stability of the task, as shown in Table 1. The total reward for the action taken by the policy in each state is as follows:
r_t(s_t, a_t) = Σ_i r_i w_i;
where i is the index over the individual reward terms.
Table 1. Reward function terms
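The total reward can be computed as a weighted sum over the individual terms, as in the short sketch below; the term names and weights are placeholders for illustration, not the actual entries of Table 1.

```python
def total_reward(terms: dict, weights: dict) -> float:
    """Weighted sum of reward and penalty terms, r_t = sum_i r_i * w_i."""
    return sum(terms[name] * weights[name] for name in terms)

# Illustrative terms only; the actual terms and weights are those of Table 1.
terms = {"lin_vel_tracking": 0.8, "ang_vel_tracking": 0.6,
         "z_vel_penalty": -0.05, "pendulum_angle_penalty": -0.2}
weights = {"lin_vel_tracking": 1.0, "ang_vel_tracking": 0.5,
           "z_vel_penalty": 2.0, "pendulum_angle_penalty": 1.5}
r_t = total_reward(terms, weights)
```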
6. State estimator:
The state estimate is an important input to the policy network, and the state estimator consists of a memory encoder (an LSTM) and a source encoder. Memory encoders are typically implemented in one of two ways: by stacking a series of historical observations as the input of an MLP, or by using a model architecture that can capture past information, such as a recurrent neural network (RNN) or a temporal convolutional network (TCN). However, architectures such as the MLP and TCN must reserve memory for storing the historical observations, which puts considerable strain on on-board resources. In contrast, an RNN can embed the history information in its hidden state, reducing the reliance on directly storing all historical observations. Based on this, the invention selects the long short-term memory (LSTM) network as the RNN architecture.
After the robot's own observation o_t is passed into the LSTM, its output is fed to the source encoder, a multilayer perceptron (MLP), to obtain a state estimate that blends implicit and explicit information. The state estimate includes the explicit body linear velocity v_t, the inverted pendulum parameters p_t, and the implicit state z_t. For the explicit quantities, the mean squared error (MSE) against the ground-truth body linear velocity v̄_t and the ground-truth inverted pendulum parameters p̄_t is used to compute the estimation error:
L_est = MSE(v_t, v̄_t) + MSE(p_t, p̄_t).
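A minimal PyTorch sketch of such an LSTM memory encoder followed by an MLP source encoder is given below; the layer sizes and the dimensions assumed for o_t, v_t, p_t and z_t are illustrative, not the invention's actual network configuration.

```python
import torch
import torch.nn as nn

class HybridStateEstimator(nn.Module):
    """LSTM memory encoder + MLP source encoder producing [v_t, p_t, z_t]."""

    def __init__(self, obs_dim=45, hidden_dim=256, p_dim=3, z_dim=16):
        super().__init__()
        self.memory_encoder = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.source_encoder = nn.Sequential(              # MLP head
            nn.Linear(hidden_dim, 128), nn.ELU(),
            nn.Linear(128, 3 + p_dim + z_dim),             # v_t (3) | p_t | z_t
        )
        self.p_dim, self.z_dim = p_dim, z_dim

    def forward(self, obs_history):
        # obs_history: (batch, time, obs_dim); the hidden state carries the history
        out, _ = self.memory_encoder(obs_history)
        latent = self.source_encoder(out[:, -1])            # last time step only
        v_t = latent[:, :3]                                  # explicit body linear velocity
        p_t = latent[:, 3:3 + self.p_dim]                    # explicit pendulum parameters
        z_t = latent[:, 3 + self.p_dim:]                     # implicit state
        return v_t, p_t, z_t
```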
The observation o_{t+1} of the next time step is predicted from the observations o_{t-H:t} of the past period, with the time horizon set to H = 5. o_{t+1} is taken as the target vector and o_{t-H:t} as the source vector; they are fed into the target encoder and the source encoder respectively to obtain the target vector q_t and the source vector q̂_t. The target vector q_t and the source vector q̂_t are L2-normalized to obtain the normalized matrix E, and the dot products of q_t and q̂_t with the normalized matrix are passed through a normalized exponential (softmax) function to obtain the prediction probabilities p_t^tgt and p_t^src:
p_t^tgt(k) = exp(q_t · E_k / τ) / Σ_j exp(q_t · E_j / τ),  p_t^src(k) = exp(q̂_t · E_k / τ) / Σ_j exp(q̂_t · E_j / τ),
where τ is a temperature parameter.
With the cluster-assignment predictions and targets now available, the sole objective of representation learning is defined as maximizing the prediction accuracy by computing the cross information entropy J. The cross information entropy J is used as the gradient of the target encoder, and the sum of the cross information entropy J and the estimation error L_est is used as the gradient of the LSTM and the source encoder for training.
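A simplified sketch of this swapped-prediction objective, in the style of SwAV-like clustering losses, is given below; the prototype matrix, the number of Sinkhorn-Knopp iterations, and all tensor names are assumptions, and the sketch deliberately omits details of the invention's exact formulation.

```python
import torch
import torch.nn.functional as F

def sinkhorn_targets(scores, n_iters=3):
    """Balanced cluster-assignment targets via a few Sinkhorn-Knopp iterations."""
    q = torch.exp(scores).T                 # (prototypes, batch)
    q /= q.sum()
    for _ in range(n_iters):
        q /= q.sum(dim=1, keepdim=True); q /= q.shape[0]   # normalize rows
        q /= q.sum(dim=0, keepdim=True); q /= q.shape[1]   # normalize columns
    return (q * q.shape[1]).T               # (batch, prototypes), rows sum to 1

def swapped_prediction_loss(q_target, q_source, prototypes, tau=0.1):
    """Cross information entropy J between each branch's softmax prediction and
    the Sinkhorn code of the other branch (swapped prediction)."""
    protos = F.normalize(prototypes, dim=-1)
    s_tgt = F.normalize(q_target, dim=-1) @ protos.T        # dot products with prototypes
    s_src = F.normalize(q_source, dim=-1) @ protos.T
    p_tgt = F.softmax(s_tgt / tau, dim=-1)                  # target prediction probability
    p_src = F.softmax(s_src / tau, dim=-1)                  # source prediction probability
    with torch.no_grad():                                    # codes act as fixed targets
        c_tgt = sinkhorn_targets(s_tgt)
        c_src = sinkhorn_targets(s_src)
    # predict each branch's code from the other branch and average the two terms
    j = -0.5 * ((c_src * torch.log(p_tgt + 1e-8)).sum(-1)
                + (c_tgt * torch.log(p_src + 1e-8)).sum(-1)).mean()
    return j
```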
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
Furthermore, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only a single independent technical solution. This manner of description is adopted for clarity only; the specification should be taken as a whole, and the technical solutions in the embodiments may be combined appropriately to form other implementations that can be understood by those skilled in the art.

Claims (5)

1. A reinforcement learning quadruped robot inverted pendulum stabilization method based on hybrid state estimation, characterized in that key parameters of the inverted pendulum are estimated in real time and the posture of the quadruped robot is dynamically adjusted, thereby achieving a deep fusion of motion control and system stability; the method specifically comprises: fixing a first-order inverted pendulum on the body of the quadruped robot; modeling the motion of the quadruped robot as a partially observable Markov decision process, and training the policy output by the partially observable Markov decision process with a proximal policy optimization algorithm based on an actor-critic model; the actor-critic model comprises a policy network and a value network, wherein the policy network outputs a policy according to the input state and samples or directly selects a specific action from the probability distribution corresponding to the policy, the input state comprising the quadruped robot's own observation and a state estimate, and the value network is used to evaluate the performance of the current policy network; adopting a domain randomization technique during policy training to randomize the parameters of the environment; designing a reward function that jointly considers a velocity tracking reward, a stability penalty, and an inverted pendulum penalty term, so as to train the value network and provide a supervision signal; the state estimate is obtained by a state estimator; the state estimator consists of a memory encoder and a source encoder, wherein the memory encoder adopts a long short-term memory network and the source encoder adopts a multilayer perceptron; after the quadruped robot's own observation o_t is fed into the long short-term memory network, the resulting output is fed to the source encoder to obtain the state estimate, which comprises the explicit body linear velocity v_t, the inverted pendulum parameters p_t, and the implicit state z_t; based on the explicit body linear velocity v_t and the inverted pendulum parameters p_t, together with the ground-truth body linear velocity v̄_t and the ground-truth inverted pendulum parameters p̄_t, the error L_est is computed with the root mean square error: L_est = MSE(v_t, v̄_t) + MSE(p_t, p̄_t), where MSE(·) denotes the root mean square error; the observation o_{t+1} of the next time step is predicted from the observations o_{t-H:t} of the past period, where H denotes the time step; o_{t+1} is taken as the target vector and o_{t-H:t} as the source vector, which are fed into the target encoder and the source encoder respectively to obtain the target vector q_t and the source vector q̂_t; the target vector q_t and the source vector q̂_t are L2-normalized to obtain a normalized matrix E, and the dot products of q_t and q̂_t with the normalized matrix are passed through a normalized exponential function to obtain the target prediction probability and the source prediction probability, where τ is a temperature parameter and E_k denotes the k-th element of the normalized matrix E; based on the cluster-assignment predictions and targets, the sole objective of representation learning is defined as maximizing the prediction accuracy by computing the cross information entropy J, where H denotes the time-domain length of the observation and the expected values of the source vector and the target vector are computed with the Sinkhorn-Knopp algorithm; the cross information entropy J is used as the gradient of the target encoder, and the sum of the cross information entropy J and the root mean square error L_est is used as the gradient of the long short-term memory network and the source encoder for training.

2. The reinforcement learning quadruped robot inverted pendulum stabilization method based on hybrid state estimation according to claim 1, characterized in that the policy network outputs a policy according to the input state and samples or directly selects a specific action from the probability distribution corresponding to the policy, the input state comprising the quadruped robot's own observation o_t and a state estimate, specifically comprising: the state estimate comprises an implicit state z_t, the body linear velocity v_t, and estimated inverted pendulum parameters p_t; the policy π_φ(a_t|o_t, v_t, z_t, p_t) infers the action a_t from the quadruped robot's own observation o_t, the implicit state z_t, the body linear velocity v_t, and the estimated inverted pendulum parameters p_t; where t is the index of the current time step, and the inverted pendulum parameters p_t form a vector whose components are the friction coefficient of the inverted pendulum, the mass of the inverted pendulum, and the z-axis offset of the centroid of the inverted pendulum; the quadruped robot's own observation o_t is a vector of proprioceptive information whose components c_t, ω_t, g_t, f_t, θ_t and a_{t-1} are, respectively, the body linear velocity command of the quadruped robot, the angular velocity of the body, the gravity unit vector of the body coordinate frame, the foot contact booleans, the joint angles, and the action of the previous time step, together with the inverted pendulum angle, the joint angular velocities, and the inverted pendulum angular velocity; the joint position offsets output by the policy are taken as the action a_t, a_t being the offset from the quadruped robot's initial standing posture θ_def, so that the desired joint angles θ_des of the robot are defined as θ_des = θ_def + a_t; the desired angle of each joint is tracked by a proportional-derivative controller.

3. The reinforcement learning quadruped robot inverted pendulum stabilization method based on hybrid state estimation according to claim 1, characterized in that adopting a domain randomization technique during policy training to randomize the parameters of the environment specifically comprises: the parameters of the environment include the body load weight, the PD controller parameters, the centroid offset, and the system delay; random noise of varying magnitude is added to the quadruped robot's own observation that is input to the policy network, while domain randomization is applied to the body load weight, the PD controller parameters, the centroid offset, and the system delay.

4. The reinforcement learning quadruped robot inverted pendulum stabilization method based on hybrid state estimation according to claim 1, characterized in that the value network is used to evaluate the performance of the current policy network, specifically comprising: the input s_t of the value network includes the quadruped robot's own observation o_t, the body linear velocity v_t, and the inverted pendulum parameters p_t: s_t = [o_t, v_t, p_t]^T.

5. The reinforcement learning quadruped robot inverted pendulum stabilization method based on hybrid state estimation according to claim 1, characterized in that designing the reward function that jointly considers the velocity tracking reward, the stability penalty, and the inverted pendulum penalty term to train the value network and provide a supervision signal specifically comprises: the velocity tracking reward includes tracking of the linear velocity and the angular velocity; the stability penalty includes limiting the velocity of the quadruped robot's body in the z-axis direction, the angular velocities in the x-axis and y-axis directions, the orientation, the joint accelerations, the joint forces, the body height, the action frequency, and the smoothness; the inverted pendulum penalty term includes penalties on the inverted pendulum angle and velocity.
CN202411897937.3A 2024-12-23 2024-12-23 Inverted Pendulum Stabilization Method for Quadruped Robot Based on Hybrid State Estimation and Reinforcement Learning Active CN119758719B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411897937.3A CN119758719B (en) 2024-12-23 2024-12-23 Inverted Pendulum Stabilization Method for Quadruped Robot Based on Hybrid State Estimation and Reinforcement Learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411897937.3A CN119758719B (en) 2024-12-23 2024-12-23 Inverted Pendulum Stabilization Method for Quadruped Robot Based on Hybrid State Estimation and Reinforcement Learning

Publications (2)

Publication Number Publication Date
CN119758719A CN119758719A (en) 2025-04-04
CN119758719B true CN119758719B (en) 2025-10-28

Family

ID=95190321

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411897937.3A Active CN119758719B (en) 2024-12-23 2024-12-23 Inverted Pendulum Stabilization Method for Quadruped Robot Based on Hybrid State Estimation and Reinforcement Learning

Country Status (1)

Country Link
CN (1) CN119758719B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120722767B (en) * 2025-09-01 2025-11-14 湖南大学 Gait network training method for biped robot

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106292288A (en) * 2016-09-22 2017-01-04 同济大学 Model parameter correction method based on Policy-Gradient learning method and application thereof
CN117313826A (en) * 2023-11-30 2023-12-29 安徽大学 Arbitrary-angle inverted pendulum model training method based on reinforcement learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11389957B2 (en) * 2019-09-30 2022-07-19 Mitsubishi Electric Research Laboratories, Inc. System and design of derivative-free model learning for robotic systems

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106292288A (en) * 2016-09-22 2017-01-04 同济大学 Model parameter correction method based on Policy-Gradient learning method and application thereof
CN117313826A (en) * 2023-11-30 2023-12-29 安徽大学 Arbitrary-angle inverted pendulum model training method based on reinforcement learning

Also Published As

Publication number Publication date
CN119758719A (en) 2025-04-04


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant