CN119758719B - Inverted Pendulum Stabilization Method for Quadruped Robot Based on Hybrid State Estimation and Reinforcement Learning - Google Patents
Info
- Publication number
- CN119758719B, CN202411897937.3A
- Authority
- CN
- China
- Prior art keywords
- inverted pendulum
- quadruped robot
- network
- parameters
- state
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Feedback Control In General (AREA)
Abstract
The invention relates to the technical field of robotics and automation, and discloses a deep-reinforcement-learning-based inverted pendulum stabilization control method for a quadruped robot, comprising the following steps: fixing a first-order inverted pendulum on the body of the quadruped robot; modeling the motion of the quadruped robot as a partially observable Markov decision process, and training the policy of that process with a proximal policy optimization algorithm based on an actor-critic model, wherein the actor-critic model comprises a policy network and a value network; a domain randomization technique is adopted during policy training to randomize the parameters of the environment; and a reward function that jointly considers a velocity-tracking reward, stability penalties and inverted-pendulum penalty terms is designed to train the value network and provide supervision signals. The invention designs an end-to-end inverted pendulum stabilization method for a quadruped robot based on hybrid state estimation, which improves the balance capability and stability of the robot.
Description
Technical Field
The invention relates to the technical field of robotics and automation, and in particular to an inverted pendulum stabilization method for a quadruped robot based on hybrid state estimation and reinforcement learning.
Background
Existing research on quadruped robots focuses mainly on locomotion control, with little attention to jointly optimizing the stability and manipulation performance of the robot body, which limits the robot's adaptability and application potential in higher-level tasks. As a classical nonlinear dynamical system, the inverted pendulum has long served as a standard test platform for verifying the effectiveness and robustness of control methods. In the invention, a first-order inverted pendulum is fixed on the body of the quadruped robot so that the balance capability and stability of the robot in a dynamic environment can be observed directly, providing an innovative research perspective and technical means for improving the overall performance of the robot.
Locomotion control methods for quadruped robots generally fall into traditional model-based control methods and learning-based control methods. Traditional model-based methods rely on accurate system modeling and usually involve multiple complex modules such as state estimation, terrain reconstruction and whole-body control. These methods are typically built on strict assumptions, such as collision-free and slip-free contact. In practice, however, these assumptions are often violated, which limits the applicability of conventional control methods. Moreover, for the disturbances an inverted pendulum may face when deployed on a real quadruped robot, such as changes in the pendulum mass, shifts in its center-of-mass position, and fluctuations in damping coefficient and friction, conventional algorithms lack sufficient adaptive capability and therefore struggle to cope with these complex dynamic variations.
Reinforcement-learning-based locomotion control for quadruped robots has advanced significantly in recent years, showing excellent performance in complex scenarios such as running in the field and even acquiring a certain capability for manipulator control. Deep-reinforcement-learning-based methods turn the complex online optimization problem into an optimization carried out during an offline training stage by learning a decision policy, which markedly reduces the dependence on an accurate model and yields stronger robustness and adaptability. However, most existing reinforcement learning studies concentrate on locomotion control and pay little attention to jointly optimizing the stability and handling performance of the robot body, which limits their adaptability and application potential in higher-level tasks.
For the problem of balancing an inverted pendulum on the body of a quadruped robot, existing methods improve an actor-critic network based on the deep deterministic policy gradient (DDPG) algorithm, design a hierarchical reward function, and obtain a control policy that improves the balance capability and stability of the robot through interactive training with a model of the quadruped robot balancing the inverted pendulum. These methods nevertheless have drawbacks. For example, the policy in DDPG is tightly coupled to the value function and is susceptible to overestimation bias, which makes the training process unstable. In addition, DDPG relies on Gaussian noise or noise processes to drive exploration, which is inefficient in a high-dimensional action space and easily traps the policy in local optima, limiting its performance.
Disclosure of Invention
To solve the above technical problems, the invention provides an inverted pendulum stabilization method for a quadruped robot based on hybrid state estimation and reinforcement learning, which enables the quadruped robot to achieve robust locomotion control relying only on proprioception and to stably maintain the inverted pendulum system carried on its body. A parameter estimator based on hybrid state information accurately estimates the key parameters of the inverted pendulum in real time and dynamically adjusts the posture of the robot, thereby achieving a deep fusion of locomotion control and system stability.
In order to solve the technical problems, the invention adopts the following technical scheme:
A deep-reinforcement-learning-based inverted pendulum stabilization control method for a quadruped robot estimates the key parameters of the inverted pendulum in real time and dynamically adjusts the posture of the quadruped robot so as to achieve a deep fusion of locomotion control and system stability, specifically comprising:
Fixing a first-order inverted pendulum on the body of the quadruped robot;
modeling the motion of the quadruped robot as a partially observable Markov decision process, and training the policy of that process with a proximal policy optimization algorithm based on an actor-critic model;
the actor-critic model comprises a policy network and a value network, wherein the policy network outputs a policy according to an input state, and specific actions are sampled or directly selected from the probability distribution corresponding to the policy, the input state comprising the observation and the state estimate of the quadruped robot;
a domain randomization technique is adopted during policy training to randomize the parameters of the environment;
a reward function that jointly considers a velocity-tracking reward, stability penalties and inverted-pendulum penalty terms is designed to train the value network and provide supervision signals.
Further, the policy network outputs a policy according to an input state, and specific actions are sampled or directly selected from the probability distribution corresponding to the policy, where the input state comprises the observation o_t of the quadruped robot and a state estimate; specifically:
the state estimate comprises an implicit state z_t, the linear velocity v_t of the body and the estimated inverted-pendulum parameters p_t;
the policy π_φ(a_t | o_t, v_t, z_t, p_t) infers the action a_t from the observation o_t of the quadruped robot, the implicit state z_t, the body linear velocity v_t and the estimated inverted-pendulum parameters p_t;
where t is the index of the current time step, and the inverted-pendulum parameters p_t form a vector that stacks the friction coefficient of the inverted pendulum, its mass, and the offset of its center of mass along the z-axis;
the observation o_t of the quadruped robot is a vector containing proprioceptive information: the body linear-velocity command c_t, the body angular velocity ω_t, the gravity unit vector g_t expressed in the body frame, the foot-contact Boolean f_t, the joint angles θ_t, the inverted-pendulum angle, the joint angular velocities, the inverted-pendulum angular velocity, and the action a_{t-1} of the previous time step;
the joint position offsets of the quadruped robot output by the policy constitute the action a_t; a_t is applied as an offset to the initial standing posture θ_def of the quadruped robot, so the desired joint angles θ_des of the robot are defined as:
θ_des = θ_def + a_t;
the desired angle of each joint is tracked by a proportional-derivative controller.
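For illustration, a minimal PyTorch sketch of how such a policy network could be wired is given below, assuming an MLP actor with a diagonal-Gaussian action head; the layer sizes, input dimensions and class names are assumptions and are not specified by the invention.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Actor mapping [o_t, v_t, z_t, p_t] to a diagonal-Gaussian action distribution."""
    def __init__(self, obs_dim, vel_dim=3, latent_dim=16, pend_dim=3, act_dim=12):
        super().__init__()
        in_dim = obs_dim + vel_dim + latent_dim + pend_dim
        self.backbone = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ELU(),
            nn.Linear(256, 128), nn.ELU(),
        )
        self.mu = nn.Linear(128, act_dim)                   # mean joint-position offsets
        self.log_std = nn.Parameter(torch.zeros(act_dim))   # state-independent log std

    def forward(self, o_t, v_t, z_t, p_t):
        x = self.backbone(torch.cat([o_t, v_t, z_t, p_t], dim=-1))
        dist = torch.distributions.Normal(self.mu(x), self.log_std.exp())
        a_t = dist.sample()                                  # sampled joint-position offsets
        return a_t, dist.log_prob(a_t).sum(-1)
```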
Further, the domain randomization technique adopted during policy training randomizes parameters of the environment, specifically including:
the environment parameters include the body payload weight, the PD controller parameters, the centroid offset and the system delay;
random noise of varying magnitude is added to the observation of the quadruped robot fed to the policy network, and domain randomization is applied to the body payload weight, the PD controller parameters, the centroid offset and the system delay.
Further, the value network is used to evaluate the performance of the current policy network, specifically:
the input s_t of the value network comprises the observation o_t of the quadruped robot, the body linear velocity v_t and the inverted-pendulum parameters p_t:
s_t = [o_t, v_t, p_t]^T.
Further, the reward function that jointly considers the velocity-tracking reward, the stability penalties and the inverted-pendulum penalty terms is designed to train the value network and provide supervision signals, specifically:
the velocity-tracking reward includes tracking of the linear and angular velocities;
the stability penalties include limits on the body velocity of the quadruped robot along the z-axis, the angular velocities about the x-axis and y-axis, the orientation, the joint accelerations, the joint torques, the body height, the stepping frequency, and the smoothness of the motion;
the inverted-pendulum penalty terms include penalties on the inverted-pendulum angle and angular velocity.
Further, a state estimator is used to produce the state estimate;
the state estimator consists of a memory encoder and a source encoder, wherein the memory encoder adopts a long short-term memory (LSTM) network;
after the observation o_t of the quadruped robot is fed into the LSTM, its output is passed to the source encoder to obtain the state estimate, which comprises the explicit body linear velocity v_t, the inverted-pendulum parameters p_t and the implicit state z_t;
the explicit body linear velocity v_t and inverted-pendulum parameters p_t are compared with the ground-truth body linear velocity v̄_t and the ground-truth inverted-pendulum parameters p̄_t to compute an estimation error L_est = MSE(v_t, v̄_t) + MSE(p_t, p̄_t),
where MSE(·) denotes the mean squared error;
the observation of the next time step is predicted from the observations of the past H time steps, where H denotes the history length; the next-step observation is taken as the target vector and the historical observations as the source vector, which are fed into a target encoder and the source encoder, respectively, to obtain a target vector and a source vector; the target vector, the source vector and a prototype matrix E are L2-normalized, and the dot products of the target vector and the source vector with the normalized prototype matrix are passed through a normalized exponential (softmax) function to obtain a target prediction probability and a source prediction probability,
where τ is a temperature parameter and E_k denotes the k-th prototype in the normalized matrix E; based on the prediction results and the cluster-assignment targets, the sole objective of representation learning is defined as maximizing prediction accuracy by computing the cross-entropy J;
the cluster-assignment targets for the source vector and the target vector are computed with the Sinkhorn-Knopp algorithm; the cross-entropy J is used as the gradient signal for the target encoder, and the sum of the cross-entropy J and the mean-squared estimation error L_est is used as the gradient signal for training the LSTM and the source encoder.
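As an illustration of the Sinkhorn-Knopp assignment step mentioned above, the sketch below follows the commonly used iterative row/column normalization over encoder-prototype similarity scores; the iteration count and epsilon are assumptions, and the exact formulation used by the invention may differ.

```python
import torch

def sinkhorn_knopp(scores, eps=0.05, n_iters=3):
    """Turn similarity scores (batch x prototypes) into balanced soft assignments q."""
    q = torch.exp(scores / eps).t()               # K x B
    q /= q.sum()                                   # normalize the whole matrix
    K, B = q.shape
    for _ in range(n_iters):
        q /= q.sum(dim=1, keepdim=True); q /= K    # normalize rows (prototypes)
        q /= q.sum(dim=0, keepdim=True); q /= B    # normalize columns (samples)
    return (q * B).t()                             # B x K, each row sums to 1
```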
Compared with the prior art, the invention has the following beneficial technical effects:
The invention designs an end-to-end inverted pendulum stabilization method for a quadruped robot based on hybrid state estimation, which improves the balance capability and stability of the robot. The state estimator, which mixes implicit and explicit information, not only accurately estimates the linear velocity of the quadruped robot but also estimates in real time the key parameters of the inverted pendulum system carried on the body, including the pendulum mass, center-of-mass position and friction. With this design, the quadruped robot retains efficient and stable balance control even when carrying different types of inverted pendulum systems or facing discrepancies between simulation and the real system, significantly enhancing the adaptability and robustness of the system.
The conventional DDPG algorithm is unstable during training and tends to fall into local optima. The invention therefore adopts the PPO algorithm instead, which effectively limits the magnitude of each policy update and avoids large fluctuations of the policy during optimization, allowing the action space to be covered more smoothly. PPO exhibits more efficient exploration and greater robustness, especially in high-dimensional environments. Furthermore, an LSTM is introduced for hybrid state estimation so that sequential data can be processed and temporal dependencies captured. By extracting hybrid state information, the policy can capture more key features from the environment, and its stability and adaptability are markedly improved, so that it performs better in complex dynamic tasks.
Drawings
Fig. 1 is an overall system block diagram of the present invention.
Detailed Description
A preferred embodiment of the present invention will be described in detail with reference to the accompanying drawings.
The invention provides a deep-reinforcement-learning-based inverted pendulum stabilization control method for a quadruped robot, which demonstrates the balance capability and stability of the robot in a dynamic environment by fixing a first-order inverted pendulum on the body of the quadruped robot. On this experimental platform, the invention combines hybrid state estimation and designs an end-to-end locomotion learning framework that optimizes an actor-critic model with proximal policy optimization, realizing synchronous training of the locomotion policy and the state estimator and effectively improving the control performance and robustness of the system.
1. Reinforcement learning problem description:
Because the robot lacks exteroceptive sensors, the terrain information cannot be fully observed, so the invention models the locomotion problem as a partially observable Markov decision process (POMDP). The state at time step t is defined as x_t; the agent's policy executes an action a_t, the environment transitions to the state x_{t+1} of the next time step with transition probability P(x_{t+1} | x_t, a_t), and returns the reward value r_t and a partial observation o_t of the state. The goal of reinforcement learning is to find a policy π that maximizes the expected return over future trajectories (also referred to as the expected cumulative reward):
J(π) = E_π[ Σ_{t=0}^{∞} γ^t r_t ],
where γ ∈ [0, 1) is the discount factor.
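As a small numerical sketch of this discounted-return objective, the helper below evaluates Σ_t γ^t r_t for one trajectory; the reward sequence in the example is arbitrary and only for illustration.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for a single trajectory."""
    g, discount = 0.0, 1.0
    for r in rewards:
        g += discount * r
        discount *= gamma
    return g

# e.g. a short trajectory of per-step rewards
print(discounted_return([1.0, 0.5, 0.5, 2.0]))
```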
To learn a robust blind locomotion capability in a single training stage, the invention trains the policy with the proximal policy optimization (PPO) algorithm and combines it with an asymmetric actor-critic framework to improve the learning efficiency and expressiveness of the model. By introducing a clipped probability-ratio function, the PPO algorithm limits how far the new policy can deviate from the old one at each update, effectively controlling fluctuations during optimization. Compared with other policy-gradient methods, PPO achieves a good balance between computational complexity and performance, providing a reliable basis for efficient and stable policy learning.
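The clipped surrogate objective mentioned above can be sketched as follows; this is the standard PPO-clip loss, with the clipping range as an assumed hyper-parameter rather than a value taken from the invention.

```python
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantage, clip_eps=0.2):
    """Clipped surrogate objective: limits how far the new policy moves from the old one."""
    ratio = torch.exp(log_prob_new - log_prob_old)                      # pi_new / pi_old
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()                        # maximize surrogate -> minimize loss
```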
2. Policy network:
The policy network outputs a policy, typically a probability distribution, according to the input state, and specific actions are sampled or directly selected from this distribution. The policy π_φ(a_t | o_t, v_t, z_t, p_t), parameterized by φ, infers the action a_t. Its inputs are the robot's own observation o_t, the body linear velocity v_t, the implicit state z_t and the inverted-pendulum parameters p_t, where p_t is a vector that stacks the friction coefficient of the inverted pendulum, its mass, and the offset of its center of mass along the z-axis. The observation o_t is a vector containing proprioceptive information: the body linear-velocity command c_t, the body angular velocity ω_t, the gravity unit vector g_t expressed in the body frame, the foot-contact Boolean f_t, the joint angles θ_t, the inverted-pendulum angle, the joint angular velocities, the inverted-pendulum angular velocity, and the action a_{t-1} of the previous time step;
The policy output is the position offsets of the 12 joints of the quadruped robot, forming the action a_t ∈ R^12; a_t is applied as an offset to the initial standing posture θ_def of the quadruped robot, so the desired joint angles θ_des of the robot are defined as:
θ_des = θ_def + a_t;
the desired angle of each joint is tracked by a proportional-derivative (PD) controller, with gains set to P = 40.0 and D = 1.0.
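A minimal sketch of this action-to-torque pipeline is given below: the policy offset is added to the default stance and the resulting angles are tracked by a joint-space PD law with the gains stated above; the function and variable names are illustrative assumptions.

```python
import numpy as np

KP, KD = 40.0, 1.0   # PD gains stated in the text

def pd_torques(theta_def, a_t, theta, theta_dot):
    """theta_des = theta_def + a_t, tracked by a joint-space PD controller."""
    theta_des = theta_def + a_t                       # 12-dim desired joint angles
    return KP * (theta_des - theta) - KD * theta_dot  # joint torques

# illustrative call for a 12-joint robot
tau = pd_torques(np.zeros(12), 0.1 * np.ones(12), np.zeros(12), np.zeros(12))
```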
3. Domain randomization:
Domain randomization is a technique widely used in reinforcement learning and robot control to improve a model's generalization in the real environment. Its core idea is to randomize the environment parameters during training so that the model learns under a variety of conditions and therefore shows stronger adaptability and robustness in unseen scenarios. In the method, random noise of varying magnitude is added to the observation fed to the policy network, while domain randomization is also applied to key factors such as the body payload weight, the PD controller parameters, the centroid offset and the system delay.
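A minimal sketch of such a randomization step is shown below; the invention names the randomized quantities but not their ranges, so every numeric bound here is an assumption.

```python
import numpy as np

def sample_episode_params(rng):
    """Per-episode environment parameters; ranges are illustrative assumptions."""
    return {
        "payload_mass_kg":    rng.uniform(0.0, 3.0),             # fuselage load weight
        "kp_scale":           rng.uniform(0.9, 1.1),             # PD controller parameters
        "kd_scale":           rng.uniform(0.9, 1.1),
        "com_offset_m":       rng.uniform(-0.03, 0.03, size=3),  # centroid offset
        "action_delay_steps": rng.integers(0, 3),                # system delay
    }

def noisy_observation(o_t, rng, noise_std=0.01):
    """Additive observation noise fed to the policy network."""
    return o_t + rng.normal(0.0, noise_std, size=o_t.shape)

rng = np.random.default_rng(0)
params = sample_episode_params(rng)
```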
4. Value network:
The value network is used to evaluate the performance of the current policy and helps the policy network learn a better policy. To obtain a more accurate body linear velocity and inverted-pendulum parameters, the input of the value network contains not only the robot's own observation o_t but also the privileged observations comprising the body linear velocity v_t and the inverted-pendulum parameters p_t:
s_t = [o_t, v_t, p_t]^T.
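A possible critic realization over this privileged state is sketched below; the MLP sizes and dimensions are assumptions, not values specified by the invention.

```python
import torch
import torch.nn as nn

class ValueNetwork(nn.Module):
    """Critic over the privileged state s_t = [o_t, v_t, p_t]."""
    def __init__(self, obs_dim, vel_dim=3, pend_dim=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + vel_dim + pend_dim, 256), nn.ELU(),
            nn.Linear(256, 128), nn.ELU(),
            nn.Linear(128, 1),
        )

    def forward(self, o_t, v_t, p_t):
        s_t = torch.cat([o_t, v_t, p_t], dim=-1)   # privileged observation
        return self.net(s_t).squeeze(-1)
```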
5. Reward function design:
The reward function provides the supervision signal for training the value network. Its design jointly considers the velocity-tracking reward and the stability penalties to achieve stable and natural locomotion. The velocity-tracking reward includes accurate tracking of the linear and angular velocities, while the stability penalties cover multiple aspects, including limits on the body velocity along the z-axis, the angular velocities in roll and pitch, the deviation of the gravity component, and the joint accelerations, among others. In addition, for the inverted-pendulum stabilization task, the reward function introduces penalties on the inverted-pendulum angle and angular velocity to further improve the stability of task execution, as shown in Table 1. The total reward for the action taken by the policy in each state is:
r_t(s_t, a_t) = Σ_i r_i w_i,
where i indexes the individual reward terms and w_i is the weight of term r_i.
Table 1: Reward function terms
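A minimal sketch of the weighted sum above is given below; the term names follow the text, but the values and weights are placeholders, since Table 1 lists the actual entries.

```python
def total_reward(terms, weights):
    """r_t = sum_i w_i * r_i over the individual reward/penalty terms."""
    return sum(weights[name] * value for name, value in terms.items())

terms = {
    "lin_vel_tracking":   0.8,    # velocity-tracking reward
    "ang_vel_tracking":   0.6,
    "z_vel_penalty":     -0.02,   # stability penalties
    "joint_acc_penalty": -0.5,
    "pendulum_angle":    -0.1,    # inverted-pendulum penalty terms
    "pendulum_vel":      -0.05,
}
weights = {k: 1.0 for k in terms}   # placeholder weights
r_t = total_reward(terms, weights)
```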
6. State estimator:
The state estimate is an important input to the policy network and is produced by a memory encoder (LSTM) and a source encoder. Memory encoders are typically implemented in one of two ways: by stacking a series of historical observations as the input of an MLP, or by using a model architecture that can capture past information, such as a recurrent neural network (RNN) or a temporal convolutional network (TCN). However, architectures such as the MLP and TCN must reserve memory for storing the historical observations, which puts considerable strain on onboard resources. In contrast, an RNN can embed the history in its hidden state, reducing the reliance on storing the full observation history directly. For this reason, the invention selects the long short-term memory (LSTM) network as the RNN architecture.
After the robot's own observation o_t is fed into the LSTM, its output is passed to the source encoder, a multilayer perceptron (MLP), to obtain a state estimate that mixes implicit and explicit parts. The state estimate includes the explicit body linear velocity v_t, the inverted-pendulum parameters p_t, and the implicit state z_t. For the explicit quantities, the mean squared error (MSE) between the estimates and the ground-truth body linear velocity v̄_t and inverted-pendulum parameters p̄_t is used to compute the estimation error, L_est = MSE(v_t, v̄_t) + MSE(p_t, p̄_t).
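The sketch below illustrates one way the LSTM memory encoder, the MLP source encoder and the explicit-state supervision could be assembled; the hidden sizes, output split and class names are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridStateEstimator(nn.Module):
    """LSTM memory encoder + MLP source encoder producing [v_t, p_t, z_t]."""
    def __init__(self, obs_dim, hidden=128, vel_dim=3, pend_dim=3, latent_dim=16):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.source_encoder = nn.Sequential(
            nn.Linear(hidden, 128), nn.ELU(),
            nn.Linear(128, vel_dim + pend_dim + latent_dim),
        )
        self.vel_dim, self.pend_dim = vel_dim, pend_dim

    def forward(self, obs_seq, hc=None):
        out, hc = self.lstm(obs_seq, hc)            # obs_seq: (batch, time, obs_dim)
        est = self.source_encoder(out[:, -1])       # use the latest hidden step
        v_t = est[:, :self.vel_dim]                 # explicit body linear velocity
        p_t = est[:, self.vel_dim:self.vel_dim + self.pend_dim]   # pendulum parameters
        z_t = est[:, self.vel_dim + self.pend_dim:]               # implicit state
        return v_t, p_t, z_t, hc

def estimator_loss(v_t, p_t, v_true, p_true):
    """Explicit-state supervision against simulator ground truth."""
    return F.mse_loss(v_t, v_true) + F.mse_loss(p_t, p_true)
```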
The observation of the next time step is predicted from the observations of the past H time steps, with the history length set to H = 5. The next-step observation is taken as the target vector and the historical observations as the source vector; they are fed into the target encoder and the source encoder, respectively, to obtain a target vector and a source vector. The target vector, the source vector and a prototype matrix are L2-normalized, and the dot products of the target vector and the source vector with the normalized prototype matrix are passed through a normalized exponential (softmax) function to obtain the target and source prediction probabilities,
where τ is a temperature parameter.
Having obtained the prediction probabilities and the cluster-assignment targets, the sole objective of representation learning is defined as maximizing prediction accuracy by computing the cross-entropy J between them, where the assignment targets are computed with the Sinkhorn-Knopp algorithm. The cross-entropy J is used as the gradient signal for the target encoder, while the sum of the cross-entropy J and the mean-squared estimation error L_est is used as the gradient signal for training the LSTM and the source encoder.
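A hedged sketch of one way this prototype-based prediction loss could look is given below, treating it as a SwAV-style swapped prediction between the source and target vectors, with the Sinkhorn-Knopp assignments q as targets; the exact loss form and temperature used by the invention may differ.

```python
import torch
import torch.nn.functional as F

def prediction_probs(z, prototypes, tau=0.1):
    """Softmax over dot products of an L2-normalized vector with L2-normalized prototypes."""
    z = F.normalize(z, dim=-1)
    E = F.normalize(prototypes, dim=-1)
    return F.softmax(z @ E.t() / tau, dim=-1)

def swapped_cross_entropy(z_src, z_tgt, prototypes, q_src, q_tgt, tau=0.1):
    """J: each view's Sinkhorn assignment q supervises the other view's prediction p."""
    p_src = prediction_probs(z_src, prototypes, tau)
    p_tgt = prediction_probs(z_tgt, prototypes, tau)
    return -(q_tgt * p_src.log()).sum(-1).mean() - (q_src * p_tgt.log()).sum(-1).mean()
```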
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
Furthermore, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only a single independent technical solution; this manner of description is adopted merely for clarity, and the specification should be taken as a whole, with the technical solutions of the respective embodiments combinable as appropriate to form other embodiments understandable to those skilled in the art.
Claims (5)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411897937.3A CN119758719B (en) | 2024-12-23 | 2024-12-23 | Inverted Pendulum Stabilization Method for Quadruped Robot Based on Hybrid State Estimation and Reinforcement Learning |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411897937.3A CN119758719B (en) | 2024-12-23 | 2024-12-23 | Inverted Pendulum Stabilization Method for Quadruped Robot Based on Hybrid State Estimation and Reinforcement Learning |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN119758719A CN119758719A (en) | 2025-04-04 |
| CN119758719B true CN119758719B (en) | 2025-10-28 |
Family
ID=95190321
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202411897937.3A Active CN119758719B (en) | 2024-12-23 | 2024-12-23 | Inverted Pendulum Stabilization Method for Quadruped Robot Based on Hybrid State Estimation and Reinforcement Learning |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN119758719B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN120722767B (en) * | 2025-09-01 | 2025-11-14 | 湖南大学 | Gait network training method for biped robot |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN106292288A (en) * | 2016-09-22 | 2017-01-04 | 同济大学 | Model parameter correction method based on Policy-Gradient learning method and application thereof |
| CN117313826A (en) * | 2023-11-30 | 2023-12-29 | 安徽大学 | Arbitrary-angle inverted pendulum model training method based on reinforcement learning |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11389957B2 (en) * | 2019-09-30 | 2022-07-19 | Mitsubishi Electric Research Laboratories, Inc. | System and design of derivative-free model learning for robotic systems |
- 2024
- 2024-12-23 CN CN202411897937.3A patent/CN119758719B/en active Active
Patent Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN106292288A (en) * | 2016-09-22 | 2017-01-04 | 同济大学 | Model parameter correction method based on Policy-Gradient learning method and application thereof |
| CN117313826A (en) * | 2023-11-30 | 2023-12-29 | 安徽大学 | Arbitrary-angle inverted pendulum model training method based on reinforcement learning |
Also Published As
| Publication number | Publication date |
|---|---|
| CN119758719A (en) | 2025-04-04 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Precup et al. | A survey on fuzzy control for mechatronics applications | |
| CN112119404B (en) | Sample-Efficient Reinforcement Learning | |
| CN112247992B (en) | A kind of robot feedforward torque compensation method | |
| KR20220137956A (en) | Versatile Reinforcement Learning with Objective Action-Value Functions | |
| Duan et al. | Sim-to-real learning of footstep-constrained bipedal dynamic walking | |
| CN116834014B (en) | Intelligent cooperative control method and system for capturing non-cooperative targets by space dobby robot | |
| Qi et al. | Stable indirect adaptive control based on discrete-time T–S fuzzy model | |
| Shamrooz et al. | Modeling of Asynchronous Mode–dependent Delays in Stochastic Markovian Jumping Modes Based on Static Neural Networks for Robotic Manipulators. | |
| CN113419424B (en) | Modeling reinforcement learning robot control method and system for reducing overestimation | |
| CN119758719B (en) | Inverted Pendulum Stabilization Method for Quadruped Robot Based on Hybrid State Estimation and Reinforcement Learning | |
| CN119644704B (en) | Biped robot complex terrain self-adaptive gait planning method and biped robot | |
| CN119795175B (en) | A dexterous two-handed collaborative control method based on multi-agent reinforcement learning | |
| CN119188729A (en) | Robotic arm control method and stability evaluation method based on double evaluation network | |
| CN117601120A (en) | Adaptive variable impedance control method and device, electronic equipment and storage medium | |
| Zhang et al. | Trajectory-tracking control of robotic systems via deep reinforcement learning | |
| CN112571420A (en) | Dual-function model prediction control method under unknown parameters | |
| CN119159582B (en) | Multi-axis mechanical arm prediction control method based on information physical neural network | |
| CN120395841A (en) | A compliant control method for human-robot collaboration based on improved deep reinforcement learning combined with collaborator intention | |
| CN120326600A (en) | Robot hierarchical reinforcement learning variable impedance control method based on vision and touch | |
| CN114118371A (en) | A kind of agent deep reinforcement learning method and computer readable medium | |
| CN115047761B (en) | Mechanical arm model optimization method based on self-adaptive sliding mode observer | |
| Zhang et al. | Tracking control for mobile robot based on deep reinforcement learning | |
| Liu et al. | Forward-looking imaginative planning framework combined with prioritized-replay double DQN | |
| CN110531620B (en) | Adaptive control method of mountain climbing system of trolley based on Gaussian process approximate model | |
| Liang et al. | Trajectory Progress-Based Prioritizing and Intrinsic Reward Mechanism for Robust Training of Robotic Manipulations |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |