CN119758719B - Inverted Pendulum Stabilization Method for Quadruped Robot Based on Hybrid State Estimation and Reinforcement Learning - Google Patents
Info
- Publication number
- CN119758719B, CN202411897937.3A
- Authority
- CN
- China
- Prior art keywords
- inverted pendulum
- quadruped robot
- network
- parameters
- state
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Feedback Control In General (AREA)
Abstract
The invention relates to the technical field of robotics and automation, and discloses a deep-reinforcement-learning-based inverted pendulum stabilization control method for a quadruped robot, comprising the following steps: fixing a first-order inverted pendulum on the body of the quadruped robot; modeling the motion of the quadruped robot as a partially observable Markov decision process, and training the policy of that process with a proximal policy optimization algorithm based on an actor-critic model, wherein the actor-critic model comprises a policy network and a value network; a domain randomization technique is adopted during policy training to randomize the parameters of the environment; and a reward function that jointly considers a velocity-tracking reward, stability penalties and inverted-pendulum penalty terms is designed to train the value network and provide supervision signals. The invention designs an end-to-end inverted pendulum stabilization method for a quadruped robot based on hybrid state estimation, which improves the balance capability and stability of the robot.
Description
Technical Field
The invention relates to the technical field of robotics and automation, and in particular to an inverted pendulum stabilization method for a quadruped robot based on hybrid state estimation and reinforcement learning.
Background
Existing research on quadruped robots focuses mainly on locomotion control, with little attention to jointly optimizing the stability and manipulation performance of the robot body, which limits the robot's adaptability and application potential in higher-level tasks. As a classical nonlinear dynamical system, the inverted pendulum has long served as a standard test platform for verifying the effectiveness and robustness of control methods. In the invention, a first-order inverted pendulum is fixed on the body of the quadruped robot so that the balance capability and stability of the robot in a dynamic environment can be observed directly, providing an innovative research perspective and technical means for improving the overall performance of the robot.
Locomotion control methods for quadruped robots generally fall into traditional model-based control methods and learning-based control methods. Traditional model-based methods rely on accurate system modeling and usually involve multiple complex modules such as state estimation, terrain reconstruction and whole-body control. These methods are typically built on strict assumptions, such as collision-free and slip-free contact. In practice, however, these assumptions are often violated, which limits the applicability of conventional control methods. Moreover, for the disturbances an inverted pendulum may face when deployed on a real quadruped robot, such as changes in the pendulum mass, shifts in its center-of-mass position, and fluctuations in damping coefficient and friction, conventional algorithms lack sufficient adaptive capability and therefore struggle to cope with these complex dynamic variations.
Reinforcement-learning-based locomotion control for quadruped robots has advanced significantly in recent years, showing excellent performance in complex scenarios such as running in the field and even acquiring a certain capability for manipulator control. Deep-reinforcement-learning-based methods turn the complex online optimization problem into an optimization carried out during an offline training stage by learning a decision policy, which markedly reduces the dependence on an accurate model and yields stronger robustness and adaptability. However, most existing reinforcement learning studies concentrate on locomotion control and pay little attention to jointly optimizing the stability and handling performance of the robot body, which limits their adaptability and application potential in higher-level tasks.
For the problem of balancing an inverted pendulum on the body of a quadruped robot, existing methods improve an actor-critic network based on the deep deterministic policy gradient (DDPG) algorithm, design a hierarchical reward function, and obtain a control policy that improves the balance capability and stability of the robot through interactive training with a model of the quadruped robot balancing the inverted pendulum. These methods nevertheless have drawbacks. For example, the policy in DDPG is tightly coupled to the value function and is susceptible to overestimation bias, which makes the training process unstable. In addition, DDPG relies on Gaussian noise or noise processes to drive exploration, which is inefficient in a high-dimensional action space and easily traps the policy in local optima, limiting its performance.
Disclosure of Invention
To solve the above technical problems, the invention provides an inverted pendulum stabilization method for a quadruped robot based on hybrid state estimation and reinforcement learning, which enables the quadruped robot to achieve robust locomotion control relying only on proprioception and to stably maintain the inverted pendulum system carried on its body. A parameter estimator based on hybrid state information accurately estimates the key parameters of the inverted pendulum in real time and dynamically adjusts the posture of the robot, thereby achieving a deep fusion of locomotion control and system stability.
In order to solve the technical problems, the invention adopts the following technical scheme:
A deep-reinforcement-learning-based inverted pendulum stabilization control method for a quadruped robot estimates the key parameters of the inverted pendulum in real time and dynamically adjusts the posture of the quadruped robot so as to achieve a deep fusion of locomotion control and system stability, specifically comprising:
Fixing a first-order inverted pendulum on the body of the quadruped robot;
modeling the motion of the quadruped robot as a partially observable Markov decision process, and training the policy of that process with a proximal policy optimization algorithm based on an actor-critic model;
the actor-critic model comprises a policy network and a value network, wherein the policy network outputs a policy according to an input state, and specific actions are sampled or directly selected from the probability distribution corresponding to the policy, the input state comprising the observation and the state estimate of the quadruped robot;
a domain randomization technique is adopted during policy training to randomize the parameters of the environment;
a reward function that jointly considers a velocity-tracking reward, stability penalties and inverted-pendulum penalty terms is designed to train the value network and provide supervision signals.
Further, the policy network outputs a policy according to an input state, and specific actions are sampled or directly selected from the probability distribution corresponding to the policy, where the input state comprises the observation o_t of the quadruped robot and a state estimate; specifically:
the state estimate comprises an implicit state z_t, the linear velocity v_t of the body and the estimated inverted-pendulum parameters p_t;
the policy π_φ(a_t | o_t, v_t, z_t, p_t) infers the action a_t from the observation o_t of the quadruped robot, the implicit state z_t, the body linear velocity v_t and the estimated inverted-pendulum parameters p_t;
where t is the index of the current time step, and the inverted-pendulum parameters p_t form a vector that stacks the friction coefficient of the inverted pendulum, its mass, and the offset of its center of mass along the z-axis;
the observation o_t of the quadruped robot is a vector containing proprioceptive information: the body linear-velocity command c_t, the body angular velocity ω_t, the gravity unit vector g_t expressed in the body frame, the foot-contact Boolean f_t, the joint angles θ_t, the inverted-pendulum angle, the joint angular velocities, the inverted-pendulum angular velocity, and the action a_{t-1} of the previous time step;
the joint position offsets of the quadruped robot output by the policy constitute the action a_t; a_t is applied as an offset to the initial standing posture θ_def of the quadruped robot, so the desired joint angles θ_des of the robot are defined as:
θ_des = θ_def + a_t;
the desired angle of each joint is tracked by a proportional-derivative controller.
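For illustration, a minimal PyTorch sketch of how such a policy network could be wired is given below, assuming an MLP actor with a diagonal-Gaussian action head; the layer sizes, input dimensions and class names are assumptions and are not specified by the invention.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Actor mapping [o_t, v_t, z_t, p_t] to a diagonal-Gaussian action distribution."""
    def __init__(self, obs_dim, vel_dim=3, latent_dim=16, pend_dim=3, act_dim=12):
        super().__init__()
        in_dim = obs_dim + vel_dim + latent_dim + pend_dim
        self.backbone = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ELU(),
            nn.Linear(256, 128), nn.ELU(),
        )
        self.mu = nn.Linear(128, act_dim)                   # mean joint-position offsets
        self.log_std = nn.Parameter(torch.zeros(act_dim))   # state-independent log std

    def forward(self, o_t, v_t, z_t, p_t):
        x = self.backbone(torch.cat([o_t, v_t, z_t, p_t], dim=-1))
        dist = torch.distributions.Normal(self.mu(x), self.log_std.exp())
        a_t = dist.sample()                                  # sampled joint-position offsets
        return a_t, dist.log_prob(a_t).sum(-1)
```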
Further, the domain randomization technique adopted during policy training randomizes parameters of the environment, specifically including:
the environment parameters include the body payload weight, the PD controller parameters, the centroid offset and the system delay;
random noise of varying magnitude is added to the observation of the quadruped robot fed to the policy network, and domain randomization is applied to the body payload weight, the PD controller parameters, the centroid offset and the system delay.
Further, the value network is used to evaluate the performance of the current policy network, specifically:
the input s_t of the value network comprises the observation o_t of the quadruped robot, the body linear velocity v_t and the inverted-pendulum parameters p_t:
s_t = [o_t, v_t, p_t]^T.
Further, the reward function that jointly considers the velocity-tracking reward, the stability penalties and the inverted-pendulum penalty terms is designed to train the value network and provide supervision signals, specifically:
the velocity-tracking reward includes tracking of the linear and angular velocities;
the stability penalties include limits on the body velocity of the quadruped robot along the z-axis, the angular velocities about the x-axis and y-axis, the orientation, the joint accelerations, the joint torques, the body height, the stepping frequency, and the smoothness of the motion;
the inverted-pendulum penalty terms include penalties on the inverted-pendulum angle and angular velocity.
Further, a state estimator is used to produce the state estimate;
the state estimator consists of a memory encoder and a source encoder, wherein the memory encoder adopts a long short-term memory (LSTM) network;
after the observation o_t of the quadruped robot is fed into the LSTM, its output is passed to the source encoder to obtain the state estimate, which comprises the explicit body linear velocity v_t, the inverted-pendulum parameters p_t and the implicit state z_t;
the explicit body linear velocity v_t and inverted-pendulum parameters p_t are compared with the ground-truth body linear velocity v̄_t and the ground-truth inverted-pendulum parameters p̄_t to compute an estimation error L_est = MSE(v_t, v̄_t) + MSE(p_t, p̄_t),
where MSE(·) denotes the mean squared error;
the observation of the next time step is predicted from the observations of the past H time steps, where H denotes the history length; the next-step observation is taken as the target vector and the historical observations as the source vector, which are fed into a target encoder and the source encoder, respectively, to obtain a target vector and a source vector; the target vector, the source vector and a prototype matrix E are L2-normalized, and the dot products of the target vector and the source vector with the normalized prototype matrix are passed through a normalized exponential (softmax) function to obtain a target prediction probability and a source prediction probability,
where τ is a temperature parameter and E_k denotes the k-th prototype in the normalized matrix E; based on the prediction results and the cluster-assignment targets, the sole objective of representation learning is defined as maximizing prediction accuracy by computing the cross-entropy J;
the cluster-assignment targets for the source vector and the target vector are computed with the Sinkhorn-Knopp algorithm; the cross-entropy J is used as the gradient signal for the target encoder, and the sum of the cross-entropy J and the mean-squared estimation error L_est is used as the gradient signal for training the LSTM and the source encoder.
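As an illustration of the Sinkhorn-Knopp assignment step mentioned above, the sketch below follows the commonly used iterative row/column normalization over encoder-prototype similarity scores; the iteration count and epsilon are assumptions, and the exact formulation used by the invention may differ.

```python
import torch

def sinkhorn_knopp(scores, eps=0.05, n_iters=3):
    """Turn similarity scores (batch x prototypes) into balanced soft assignments q."""
    q = torch.exp(scores / eps).t()               # K x B
    q /= q.sum()                                   # normalize the whole matrix
    K, B = q.shape
    for _ in range(n_iters):
        q /= q.sum(dim=1, keepdim=True); q /= K    # normalize rows (prototypes)
        q /= q.sum(dim=0, keepdim=True); q /= B    # normalize columns (samples)
    return (q * B).t()                             # B x K, each row sums to 1
```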
Compared with the prior art, the invention has the following beneficial technical effects:
The invention designs an end-to-end inverted pendulum stabilization method for a quadruped robot based on hybrid state estimation, which improves the balance capability and stability of the robot. The state estimator, which mixes implicit and explicit information, not only accurately estimates the linear velocity of the quadruped robot but also estimates in real time the key parameters of the inverted pendulum system carried on the body, including the pendulum mass, center-of-mass position and friction. With this design, the quadruped robot retains efficient and stable balance control even when carrying different types of inverted pendulum systems or facing discrepancies between simulation and the real system, significantly enhancing the adaptability and robustness of the system.
The conventional DDPG algorithm is unstable during training and tends to fall into local optima. The invention therefore adopts the PPO algorithm instead, which effectively limits the magnitude of each policy update and avoids large fluctuations of the policy during optimization, allowing the action space to be covered more smoothly. PPO exhibits more efficient exploration and greater robustness, especially in high-dimensional environments. Furthermore, an LSTM is introduced for hybrid state estimation so that sequential data can be processed and temporal dependencies captured. By extracting hybrid state information, the policy can capture more key features from the environment, and its stability and adaptability are markedly improved, so that it performs better in complex dynamic tasks.
Drawings
Fig. 1 is an overall system block diagram of the present invention.
Detailed Description
A preferred embodiment of the present invention will be described in detail with reference to the accompanying drawings.
The invention provides a deep-reinforcement-learning-based inverted pendulum stabilization control method for a quadruped robot, which demonstrates the balance capability and stability of the robot in a dynamic environment by fixing a first-order inverted pendulum on the body of the quadruped robot. On this experimental platform, the invention combines hybrid state estimation and designs an end-to-end locomotion learning framework that optimizes an actor-critic model with proximal policy optimization, realizing synchronous training of the locomotion policy and the state estimator and effectively improving the control performance and robustness of the system.
1. Reinforcement learning problem description:
Because the robot lacks exteroceptive sensors, the terrain information cannot be fully observed, so the invention models the locomotion problem as a partially observable Markov decision process (POMDP). The state at time step t is defined as x_t; the agent's policy executes an action a_t, the environment transitions to the state x_{t+1} of the next time step with transition probability P(x_{t+1} | x_t, a_t), and returns the reward value r_t and a partial observation o_t of the state. The goal of reinforcement learning is to find a policy π that maximizes the expected return over future trajectories (also referred to as the expected cumulative reward):
J(π) = E_π[ Σ_{t=0}^{∞} γ^t r_t ],
where γ ∈ [0, 1) is the discount factor.
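As a small numerical sketch of this discounted-return objective, the helper below evaluates Σ_t γ^t r_t for one trajectory; the reward sequence in the example is arbitrary and only for illustration.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for a single trajectory."""
    g, discount = 0.0, 1.0
    for r in rewards:
        g += discount * r
        discount *= gamma
    return g

# e.g. a short trajectory of per-step rewards
print(discounted_return([1.0, 0.5, 0.5, 2.0]))
```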
To learn a robust blind locomotion capability in a single training stage, the invention trains the policy with the proximal policy optimization (PPO) algorithm and combines it with an asymmetric actor-critic framework to improve the learning efficiency and expressiveness of the model. By introducing a clipped probability-ratio function, the PPO algorithm limits how far the new policy can deviate from the old one at each update, effectively controlling fluctuations during optimization. Compared with other policy-gradient methods, PPO achieves a good balance between computational complexity and performance, providing a reliable basis for efficient and stable policy learning.
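The clipped surrogate objective mentioned above can be sketched as follows; this is the standard PPO-clip loss, with the clipping range as an assumed hyper-parameter rather than a value taken from the invention.

```python
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantage, clip_eps=0.2):
    """Clipped surrogate objective: limits how far the new policy moves from the old one."""
    ratio = torch.exp(log_prob_new - log_prob_old)                      # pi_new / pi_old
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()                        # maximize surrogate -> minimize loss
```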
2. Policy network:
The policy network outputs a policy, typically a probability distribution, according to the input state, and specific actions are sampled or directly selected from this distribution. The policy π_φ(a_t | o_t, v_t, z_t, p_t), parameterized by φ, infers the action a_t. Its inputs are the robot's own observation o_t, the body linear velocity v_t, the implicit state z_t and the inverted-pendulum parameters p_t, where p_t is a vector that stacks the friction coefficient of the inverted pendulum, its mass, and the offset of its center of mass along the z-axis. The observation o_t is a vector containing proprioceptive information: the body linear-velocity command c_t, the body angular velocity ω_t, the gravity unit vector g_t expressed in the body frame, the foot-contact Boolean f_t, the joint angles θ_t, the inverted-pendulum angle, the joint angular velocities, the inverted-pendulum angular velocity, and the action a_{t-1} of the previous time step;
The policy output is the position offsets of the 12 joints of the quadruped robot, forming the action a_t ∈ R^12; a_t is applied as an offset to the initial standing posture θ_def of the quadruped robot, so the desired joint angles θ_des of the robot are defined as:
θ_des = θ_def + a_t;
the desired angle of each joint is tracked by a proportional-derivative (PD) controller, with gains set to P = 40.0 and D = 1.0.
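A minimal sketch of this action-to-torque pipeline is given below: the policy offset is added to the default stance and the resulting angles are tracked by a joint-space PD law with the gains stated above; the function and variable names are illustrative assumptions.

```python
import numpy as np

KP, KD = 40.0, 1.0   # PD gains stated in the text

def pd_torques(theta_def, a_t, theta, theta_dot):
    """theta_des = theta_def + a_t, tracked by a joint-space PD controller."""
    theta_des = theta_def + a_t                       # 12-dim desired joint angles
    return KP * (theta_des - theta) - KD * theta_dot  # joint torques

# illustrative call for a 12-joint robot
tau = pd_torques(np.zeros(12), 0.1 * np.ones(12), np.zeros(12), np.zeros(12))
```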
3. Domain randomization:
Domain randomization is a technique widely used in reinforcement learning and robot control to improve a model's generalization in the real environment. Its core idea is to randomize the environment parameters during training so that the model learns under a variety of conditions and therefore shows stronger adaptability and robustness in unseen scenarios. In the method, random noise of varying magnitude is added to the observation fed to the policy network, while domain randomization is also applied to key factors such as the body payload weight, the PD controller parameters, the centroid offset and the system delay.
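A minimal sketch of such a randomization step is shown below; the invention names the randomized quantities but not their ranges, so every numeric bound here is an assumption.

```python
import numpy as np

def sample_episode_params(rng):
    """Per-episode environment parameters; ranges are illustrative assumptions."""
    return {
        "payload_mass_kg":    rng.uniform(0.0, 3.0),             # fuselage load weight
        "kp_scale":           rng.uniform(0.9, 1.1),             # PD controller parameters
        "kd_scale":           rng.uniform(0.9, 1.1),
        "com_offset_m":       rng.uniform(-0.03, 0.03, size=3),  # centroid offset
        "action_delay_steps": rng.integers(0, 3),                # system delay
    }

def noisy_observation(o_t, rng, noise_std=0.01):
    """Additive observation noise fed to the policy network."""
    return o_t + rng.normal(0.0, noise_std, size=o_t.shape)

rng = np.random.default_rng(0)
params = sample_episode_params(rng)
```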
4. Value network:
The value network is used to evaluate the performance of the current policy and helps the policy network learn a better policy. To obtain a more accurate body linear velocity and inverted-pendulum parameters, the input of the value network contains not only the robot's own observation o_t but also the privileged observations comprising the body linear velocity v_t and the inverted-pendulum parameters p_t:
s_t = [o_t, v_t, p_t]^T.
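A possible critic realization over this privileged state is sketched below; the MLP sizes and dimensions are assumptions, not values specified by the invention.

```python
import torch
import torch.nn as nn

class ValueNetwork(nn.Module):
    """Critic over the privileged state s_t = [o_t, v_t, p_t]."""
    def __init__(self, obs_dim, vel_dim=3, pend_dim=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + vel_dim + pend_dim, 256), nn.ELU(),
            nn.Linear(256, 128), nn.ELU(),
            nn.Linear(128, 1),
        )

    def forward(self, o_t, v_t, p_t):
        s_t = torch.cat([o_t, v_t, p_t], dim=-1)   # privileged observation
        return self.net(s_t).squeeze(-1)
```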
5. Reward function design:
The reward function provides the supervision signal for training the value network. Its design jointly considers the velocity-tracking reward and the stability penalties to achieve stable and natural locomotion. The velocity-tracking reward includes accurate tracking of the linear and angular velocities, while the stability penalties cover multiple aspects, including limits on the body velocity along the z-axis, the angular velocities in roll and pitch, the deviation of the gravity component, and the joint accelerations, among others. In addition, for the inverted-pendulum stabilization task, the reward function introduces penalties on the inverted-pendulum angle and angular velocity to further improve the stability of task execution, as shown in Table 1. The total reward for the action taken by the policy in each state is:
r_t(s_t, a_t) = Σ_i r_i w_i,
where i indexes the individual reward terms and w_i is the weight of term r_i.
Table 1: Reward function terms
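A minimal sketch of the weighted sum above is given below; the term names follow the text, but the values and weights are placeholders, since Table 1 lists the actual entries.

```python
def total_reward(terms, weights):
    """r_t = sum_i w_i * r_i over the individual reward/penalty terms."""
    return sum(weights[name] * value for name, value in terms.items())

terms = {
    "lin_vel_tracking":   0.8,    # velocity-tracking reward
    "ang_vel_tracking":   0.6,
    "z_vel_penalty":     -0.02,   # stability penalties
    "joint_acc_penalty": -0.5,
    "pendulum_angle":    -0.1,    # inverted-pendulum penalty terms
    "pendulum_vel":      -0.05,
}
weights = {k: 1.0 for k in terms}   # placeholder weights
r_t = total_reward(terms, weights)
```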
6. State estimator:
The state estimate is an important input to the policy network and is produced by a memory encoder (LSTM) and a source encoder. Memory encoders are typically implemented in one of two ways: by stacking a series of historical observations as the input of an MLP, or by using a model architecture that can capture past information, such as a recurrent neural network (RNN) or a temporal convolutional network (TCN). However, architectures such as the MLP and TCN must reserve memory for storing the historical observations, which puts considerable strain on onboard resources. In contrast, an RNN can embed the history in its hidden state, reducing the reliance on storing the full observation history directly. For this reason, the invention selects the long short-term memory (LSTM) network as the RNN architecture.
After the robot's own observation o_t is fed into the LSTM, its output is passed to the source encoder, a multilayer perceptron (MLP), to obtain a state estimate that mixes implicit and explicit parts. The state estimate includes the explicit body linear velocity v_t, the inverted-pendulum parameters p_t, and the implicit state z_t. For the explicit quantities, the mean squared error (MSE) between the estimates and the ground-truth body linear velocity v̄_t and inverted-pendulum parameters p̄_t is used to compute the estimation error, L_est = MSE(v_t, v̄_t) + MSE(p_t, p̄_t).
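The sketch below illustrates one way the LSTM memory encoder, the MLP source encoder and the explicit-state supervision could be assembled; the hidden sizes, output split and class names are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridStateEstimator(nn.Module):
    """LSTM memory encoder + MLP source encoder producing [v_t, p_t, z_t]."""
    def __init__(self, obs_dim, hidden=128, vel_dim=3, pend_dim=3, latent_dim=16):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.source_encoder = nn.Sequential(
            nn.Linear(hidden, 128), nn.ELU(),
            nn.Linear(128, vel_dim + pend_dim + latent_dim),
        )
        self.vel_dim, self.pend_dim = vel_dim, pend_dim

    def forward(self, obs_seq, hc=None):
        out, hc = self.lstm(obs_seq, hc)            # obs_seq: (batch, time, obs_dim)
        est = self.source_encoder(out[:, -1])       # use the latest hidden step
        v_t = est[:, :self.vel_dim]                 # explicit body linear velocity
        p_t = est[:, self.vel_dim:self.vel_dim + self.pend_dim]   # pendulum parameters
        z_t = est[:, self.vel_dim + self.pend_dim:]               # implicit state
        return v_t, p_t, z_t, hc

def estimator_loss(v_t, p_t, v_true, p_true):
    """Explicit-state supervision against simulator ground truth."""
    return F.mse_loss(v_t, v_true) + F.mse_loss(p_t, p_true)
```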
The observation of the next time step is predicted from the observations of the past H time steps, with the history length set to H = 5. The next-step observation is taken as the target vector and the historical observations as the source vector; they are fed into the target encoder and the source encoder, respectively, to obtain a target vector and a source vector. The target vector, the source vector and a prototype matrix are L2-normalized, and the dot products of the target vector and the source vector with the normalized prototype matrix are passed through a normalized exponential (softmax) function to obtain the target and source prediction probabilities,
where τ is a temperature parameter.
Having obtained the prediction probabilities and the cluster-assignment targets, the sole objective of representation learning is defined as maximizing prediction accuracy by computing the cross-entropy J between them, where the assignment targets are computed with the Sinkhorn-Knopp algorithm. The cross-entropy J is used as the gradient signal for the target encoder, while the sum of the cross-entropy J and the mean-squared estimation error L_est is used as the gradient signal for training the LSTM and the source encoder.
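A hedged sketch of one way this prototype-based prediction loss could look is given below, treating it as a SwAV-style swapped prediction between the source and target vectors, with the Sinkhorn-Knopp assignments q as targets; the exact loss form and temperature used by the invention may differ.

```python
import torch
import torch.nn.functional as F

def prediction_probs(z, prototypes, tau=0.1):
    """Softmax over dot products of an L2-normalized vector with L2-normalized prototypes."""
    z = F.normalize(z, dim=-1)
    E = F.normalize(prototypes, dim=-1)
    return F.softmax(z @ E.t() / tau, dim=-1)

def swapped_cross_entropy(z_src, z_tgt, prototypes, q_src, q_tgt, tau=0.1):
    """J: each view's Sinkhorn assignment q supervises the other view's prediction p."""
    p_src = prediction_probs(z_src, prototypes, tau)
    p_tgt = prediction_probs(z_tgt, prototypes, tau)
    return -(q_tgt * p_src.log()).sum(-1).mean() - (q_src * p_tgt.log()).sum(-1).mean()
```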
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
Furthermore, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only a single independent technical solution; this manner of description is adopted merely for clarity, and the specification should be taken as a whole, with the technical solutions of the respective embodiments combinable as appropriate to form other embodiments understandable to those skilled in the art.
Claims (5)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411897937.3A CN119758719B (en) | 2024-12-23 | 2024-12-23 | Inverted Pendulum Stabilization Method for Quadruped Robot Based on Hybrid State Estimation and Reinforcement Learning |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411897937.3A CN119758719B (en) | 2024-12-23 | 2024-12-23 | Inverted Pendulum Stabilization Method for Quadruped Robot Based on Hybrid State Estimation and Reinforcement Learning |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN119758719A CN119758719A (en) | 2025-04-04 |
| CN119758719B true CN119758719B (en) | 2025-10-28 |
Family
ID=95190321
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202411897937.3A Active CN119758719B (en) | 2024-12-23 | 2024-12-23 | Inverted Pendulum Stabilization Method for Quadruped Robot Based on Hybrid State Estimation and Reinforcement Learning |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN119758719B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN120722767B (en) * | 2025-09-01 | 2025-11-14 | 湖南大学 | Gait network training method for biped robot |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN106292288A (en) * | 2016-09-22 | 2017-01-04 | 同济大学 | Model parameter correction method based on Policy-Gradient learning method and application thereof |
| CN117313826A (en) * | 2023-11-30 | 2023-12-29 | 安徽大学 | Arbitrary-angle inverted pendulum model training method based on reinforcement learning |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11389957B2 (en) * | 2019-09-30 | 2022-07-19 | Mitsubishi Electric Research Laboratories, Inc. | System and design of derivative-free model learning for robotic systems |
- 2024
- 2024-12-23 CN CN202411897937.3A patent/CN119758719B/en active Active
Patent Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN106292288A (en) * | 2016-09-22 | 2017-01-04 | 同济大学 | Model parameter correction method based on Policy-Gradient learning method and application thereof |
| CN117313826A (en) * | 2023-11-30 | 2023-12-29 | 安徽大学 | Arbitrary-angle inverted pendulum model training method based on reinforcement learning |
Also Published As
| Publication number | Publication date |
|---|---|
| CN119758719A (en) | 2025-04-04 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Precup et al. | A survey on fuzzy control for mechatronics applications | |
| CN112119404B (en) | Sample-Efficient Reinforcement Learning | |
| CN112247992B (en) | A kind of robot feedforward torque compensation method | |
| KR20220137956A (en) | Versatile Reinforcement Learning with Objective Action-Value Functions | |
| Duan et al. | Sim-to-real learning of footstep-constrained bipedal dynamic walking | |
| CN116834014B (en) | Intelligent cooperative control method and system for capturing non-cooperative targets by space dobby robot | |
| Qi et al. | Stable indirect adaptive control based on discrete-time T–S fuzzy model | |
| Shamrooz et al. | Modeling of Asynchronous Mode–dependent Delays in Stochastic Markovian Jumping Modes Based on Static Neural Networks for Robotic Manipulators. | |
| CN113419424B (en) | Modeling reinforcement learning robot control method and system for reducing overestimation | |
| CN119758719B (en) | Inverted Pendulum Stabilization Method for Quadruped Robot Based on Hybrid State Estimation and Reinforcement Learning | |
| CN119644704B (en) | Biped robot complex terrain self-adaptive gait planning method and biped robot | |
| CN119795175B (en) | A dexterous two-handed collaborative control method based on multi-agent reinforcement learning | |
| CN119188729A (en) | Robotic arm control method and stability evaluation method based on double evaluation network | |
| CN117601120A (en) | Adaptive variable impedance control method and device, electronic equipment and storage medium | |
| Zhang et al. | Trajectory-tracking control of robotic systems via deep reinforcement learning | |
| CN112571420A (en) | Dual-function model prediction control method under unknown parameters | |
| CN119159582B (en) | Multi-axis mechanical arm prediction control method based on information physical neural network | |
| CN120395841A (en) | A compliant control method for human-robot collaboration based on improved deep reinforcement learning combined with collaborator intention | |
| CN120326600A (en) | Robot hierarchical reinforcement learning variable impedance control method based on vision and touch | |
| CN114118371A (en) | A kind of agent deep reinforcement learning method and computer readable medium | |
| CN115047761B (en) | Mechanical arm model optimization method based on self-adaptive sliding mode observer | |
| Zhang et al. | Tracking control for mobile robot based on deep reinforcement learning | |
| Liu et al. | Forward-looking imaginative planning framework combined with prioritized-replay double DQN | |
| CN110531620B (en) | Adaptive control method of mountain climbing system of trolley based on Gaussian process approximate model | |
| Liang et al. | Trajectory Progress-Based Prioritizing and Intrinsic Reward Mechanism for Robust Training of Robotic Manipulations |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |