

Method, system and device for planning speed of robot based on deep reinforcement learning

Info

Publication number: CN120941407A
Application number: CN202511358553.9A
Authority: CN (China)
Prior art keywords: robot, reinforcement learning, speed, speed planning, deep reinforcement
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 李建刚, 刘承炜, 李海君
Current Assignee: Harbin Institute Of Technology shenzhen Shenzhen Institute Of Science And Technology Innovation Harbin Institute Of Technology
Original Assignee: Harbin Institute Of Technology shenzhen Shenzhen Institute Of Science And Technology Innovation Harbin Institute Of Technology

Landscapes

  • Manipulator (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention relates to a speed planning method, system and device for a robot based on deep reinforcement learning. The method comprises: constructing a deep reinforcement learning environment and defining the state space and action space of the deep reinforcement learning; providing a comprehensive reward function, and performing speed planning training and optimization on the agent in the deep reinforcement learning environment based on the comprehensive reward function to obtain an optimal speed planning strategy; generating a speed decision model according to the optimal speed planning strategy and deploying the speed decision model to an intelligent decision module of a robot control system; the robot control system acquires the current real state space and action space of the robot body, determines the optimal speed planning strategy of the robot body, and generates a command sequence of joint speeds according to the optimal speed planning strategy to drive the robot body to move. The robot speed planning method can intelligently balance and optimize a plurality of performance indexes.

Description

Method, system and device for planning speed of robot based on deep reinforcement learning
Technical Field
The invention relates to the technical field of robot control, in particular to a speed planning method, system and device for a robot based on deep reinforcement learning.
Background
With the continuous development of industrial automation and robot technology, robots are increasingly used in fields such as production, logistics and services. Speed planning is a key link in a robot control system and directly affects the efficiency, precision and stability of robot task execution. Traditional robot speed planning methods are generally based on analytical models or empirical formulas; when facing complex task scenarios, dynamically changing environments and uncertainty in robot parameters, they often show various limitations and can hardly meet the requirements of modern robot systems for efficient, accurate and adaptive speed planning.
On the one hand, traditional methods rely heavily on the dynamic model of the robot and require accurate model parameters to realize speed planning. In practical applications, however, the robot model parameters may deviate due to load variation, mechanical wear and the like, reducing the accuracy and reliability of the speed planning. On the other hand, traditional speed planning methods can hardly balance and optimize multiple performance indexes effectively and cannot meet the diversified requirements of complex task scenarios. In addition, when facing dynamically changing environments and tasks, traditional methods often require tedious manual parameter tuning and repeated offline programming, lack adaptive capability, and reduce the intelligence level and deployment flexibility of the robot.
Disclosure of Invention
The invention provides a speed planning method, a speed planning system and a speed planning device for a robot based on deep reinforcement learning, which aim to solve at least one of the technical problems in the prior art.
The technical scheme of the invention is a speed planning method of a robot based on deep reinforcement learning, which comprises the following steps:
constructing a deep reinforcement learning environment, establishing a kinematic and approximate dynamic model of a robot in the deep reinforcement learning environment, and defining a state space and an action space of the deep reinforcement learning;
providing a comprehensive reward function for speed planning of the agent;
In the deep reinforcement learning environment, performing speed planning training and optimization processing on the agent based on the comprehensive reward function to obtain an optimal speed planning strategy;
Generating a speed decision model according to the optimal speed planning strategy, solidifying and storing weight parameters of the speed decision model, and deploying the speed decision model to an intelligent decision module of a robot control system, wherein the robot control system acquires a current real state space and an action space of a robot body, and the intelligent decision module determines the optimal speed planning strategy of the robot body based on the current real state space and the action space;
And generating an instruction sequence of each joint speed according to the optimal speed planning strategy of the robot body so as to drive the robot body to move.
According to some embodiments of the invention, the constructing a deep reinforcement learning environment, in which a kinematic and approximate dynamics model of a robot is built, and a state space and an action space of the deep reinforcement learning are defined, includes:
selecting the open-source MuJoCo physics engine as the simulation platform, and constructing a deep reinforcement learning environment in combination with the OpenAI Gym framework;
Establishing a kinematic and approximate dynamics model of the robot in the deep reinforcement learning environment, wherein the kinematic and approximate dynamics model of the robot comprises D-H parameters, mass, inertia and joint limit information of the robot;
defining a state space of deep reinforcement learning according to the current angle, the current angular speed, the error between a target path point and the current end effector position, the error between the target path point and the current end effector posture, and a plurality of key point information or curvature information of a future path segment of the robot;
And defining an action space of deep reinforcement learning according to the target angular acceleration or the target angular velocity of each joint of the robot in the next control period.
According to some embodiments of the invention, the providing a comprehensive reward function for speed planning of an agent includes:
Setting a task completion time reward, an end effector trajectory tracking accuracy reward, a motion stability reward, a task completion time weight, an end effector trajectory tracking accuracy weight and a motion stability weight for the agent;
Obtaining a comprehensive reward function according to the product of the task completion time reward and the task completion time weight, the product of the end effector trajectory tracking accuracy reward and the end effector trajectory tracking accuracy weight, and the product of the motion stability reward and the motion stability weight, wherein the expression of the comprehensive reward function is:
$R_t = w_{time} r_{time,t} + w_{path} r_{path,t} + w_{smooth} r_{smooth,t}$
wherein $R_t$ is the comprehensive reward function, $w_{time}$ is the task completion time weight, $r_{time,t}$ is the task completion time reward, $w_{path}$ is the end effector trajectory tracking accuracy weight, $r_{path,t}$ is the end effector trajectory tracking accuracy reward, $w_{smooth}$ is the motion stability weight, and $r_{smooth,t}$ is the motion stability reward.
According to some embodiments of the present invention, in the deep reinforcement learning environment, performing speed planning training and optimization processing on the agent based on the comprehensive reward function to obtain an optimal speed planning strategy includes:
selecting an actor-critic algorithm as the deep reinforcement learning algorithm for handling a continuous action space, and constructing a corresponding neural network structure;
In the deep reinforcement learning environment, the intelligent agent selects actions according to the current state, and interacts with the environment to generate a new state and instant rewards;
evaluating the action effect selected by the agent, the new state and the instant reward through the comprehensive reward function;
And adjusting the parameters of the neural network according to the update rule of the actor-critic algorithm until a speed planning strategy that maximizes the accumulated expected reward while satisfying the constraints is learned, or until the performance of the speed planning strategy converges or a preset number of training rounds and performance index are reached, so as to obtain the optimal speed planning strategy.
According to some embodiments of the invention, the actor-critic algorithm includes an actor network and a critic network, wherein, in the deep reinforcement learning environment, the agent selects actions according to the current state and interacts with the environment to generate a new state and an instant reward, including:
the actor network outputs an action according to the current state of the agent, wherein the action comprises a first action or a second action, the first action being generated by a stochastic policy and the second action being generated by a deterministic policy;
The critic network evaluates the value of the current state or the value of the current state-action pair;
starting to execute tasks in the deep reinforcement learning environment; at each time step the agent observes the current state, obtains a target action through the actor network, executes the target action, obtains a new state, and receives an instant reward;
An experience tuple is generated from the current state, the target action, the instant reward and the new state, and is stored in an experience replay buffer or used directly for online updating.
According to some embodiments of the present invention, the adjusting the parameters of the neural network according to the update rule of the actor-critic algorithm, until a speed planning strategy that maximizes the accumulated expected reward while satisfying the constraints is learned, or until the performance of the speed planning strategy converges or a preset number of training rounds and performance index are reached, to obtain the optimal speed planning strategy, includes:
processing the loss function of the actor network weights by a gradient descent method to obtain a minimized loss function, and updating the parameters of the actor network and the critic network using the minimized loss function;
The speed planning strategy of the agent is continuously trained and optimized through interactive trial-and-error learning so as to maximize the accumulated expected reward within a complete task round, namely the objective function, whose expression is:
$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$
where $J(\theta)$ represents the objective function, $\theta$ represents the weight parameters of the actor network, $\tau$ represents a complete trajectory $(s_0, a_0, r_0, s_1, \dots)$, $\gamma \in [0, 1]$ is the discount factor, $T$ represents the maximum step length of the round, and $r_t$ represents the instant reward obtained from the environment after the agent performs its action at time step $t$.
And continuously optimizing and training the speed planning strategy of the agent, using the comprehensive reward function to guide the agent to learn the correct behavior pattern from historical experience data, until the performance of the speed planning strategy converges or reaches the preset number of training rounds and performance index.
According to some embodiments of the invention, the generating a speed decision model according to the optimal speed planning strategy, solidifying and saving the weight parameters of the speed decision model, includes:
Generating a speed decision model according to the optimal speed planning strategy, wherein the speed decision model is composed of neural networks based on an actor-critic deep reinforcement learning framework;
the actor network generates the action with the maximum future expected reward according to the current state of the agent, wherein the current state comprises a vector of the joint angles, angular velocities and task information of the robot, and the action with the maximum future expected reward comprises the target angular velocity or target angular acceleration of each joint of the robot;
Evaluating, by the critic network, the long-term value of performing the action with the maximum future expected reward in the current state, and guiding the update of the actor network;
and solidifying and storing the updated weight parameters of the actor network.
According to some embodiments of the invention, the robot control system includes a state sensing module, the robot control system obtains a current real state space and an action space of the robot body, and the intelligent decision module determines an optimal speed planning strategy of the robot body based on the current real state space and the action space, including:
The robot control system acquires, in real time through the state sensing module, the current actual joint angle and joint angular velocity of each joint of the robot body, the pose of the end effector, and the current task information, generates the current real state space from these data, and transmits the current real state space as input to the intelligent decision module;
After the intelligent decision module receives the current real state space, it performs a fast forward-inference computation and outputs a normalized action vector;
performing inverse normalization processing on the action vector to obtain the action space;
And determining an optimal speed planning strategy of the robot body based on the current real state space and the action space.
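For illustration only, the following Python sketch shows how such a runtime inference step could look; the file name, network loader, state layout and joint-velocity limits are assumptions for this example and are not taken from the patent.

```python
import numpy as np
import torch

# Hypothetical frozen speed decision model loaded from saved weight parameters.
actor = torch.jit.load("speed_decision_model.pt")  # assumed serialized policy network
actor.eval()

# Assumed per-joint physical velocity limits (rad/s) used for inverse normalization.
max_joint_vel = np.array([3.15, 3.15, 3.15, 3.2, 3.2, 3.2])

def plan_joint_velocities(state_vector: np.ndarray) -> np.ndarray:
    """Map the current real state space to joint velocity commands.

    state_vector: concatenation of joint angles, joint velocities,
    end-effector pose errors and task/path information, as described above.
    """
    with torch.no_grad():
        s = torch.as_tensor(state_vector, dtype=torch.float32).unsqueeze(0)
        a_norm = actor(s).squeeze(0).numpy()            # normalized action in [-1, 1]
    return np.clip(a_norm, -1.0, 1.0) * max_joint_vel   # inverse normalization
```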
The technical scheme of the invention also relates to a speed planning system of the robot based on deep reinforcement learning, which comprises the following components:
A robot body;
The robot control system comprises a speed planning module, a state sensing module and an intelligent decision module. The intelligent decision module is internally provided with an intelligent decision model obtained through deep reinforcement learning training and is used for outputting the optimal speed planning strategy produced by the intelligent decision model; the speed planning module is used for generating optimal joint speed parameters according to the optimal speed planning strategy output by the intelligent decision module; and the state sensing module is used for acquiring state data including the actual joint angle, joint angular velocity, end effector pose and current task information of each joint of the robot body, and for receiving and feeding back the execution status of the joint speed commands.
The technical scheme of the invention also relates to a computer device, which comprises a memory and a processor, wherein the processor executes the computer program stored in the memory to implement the method.
The invention also relates to a computer-readable storage medium, on which computer program instructions are stored, which, when being executed by a processor, carry out the above-mentioned method.
The speed planning method, system and device for the robot based on the deep reinforcement learning provided by the embodiment of the invention have at least one of the following advantages or beneficial effects:
The method comprises constructing a deep reinforcement learning environment, building a kinematic and approximate dynamics model of the robot in the deep reinforcement learning environment, and defining the state space and action space of the deep reinforcement learning, thereby completing the construction of the environment required for deep reinforcement learning training, including accurate loading of the robot model, setting of the operation task path, definition of physical constraints, and concretization of the state and action spaces, which lays a solid foundation for the subsequent learning of the agent. A comprehensive reward function for speed planning of the agent is provided, which guides the deep reinforcement learning agent to learn, during the speed planning training and optimization process, a strategy that can effectively plan the speed under various tasks and environments, extracts the optimal behavior pattern from the historical interaction experience, and motivates the agent to reach the target quickly and accurately. In the deep reinforcement learning environment, speed planning training and optimization are performed on the agent based on the comprehensive reward function, and the network structure and hyperparameters of the model are adjusted according to the feedback of the comprehensive reward function to optimize the training effect, so as to obtain the optimal speed planning strategy.
And then, generating a speed decision model according to the optimal speed planning strategy, solidifying and storing weight parameters of the speed decision model, and integrating the speed decision model into an intelligent decision module of the robot control system. In an actual operation scene, a robot control system acquires a real state space and an action space of a current state of a robot body, and an intelligent decision module determines an optimal speed planning strategy of the robot body based on the current real state space and the action space. And calculating the target speed or the target angular speed of each joint according to the optimal speed planning strategy, converting the target speed or the target angular speed into an instruction sequence of each joint speed, and driving the robot body to move through the instruction sequence of each joint speed, so that the speed decision module can rapidly respond to meet the real-time requirement of the robot body movement.
Further, additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
Fig. 1 is a general flowchart of the speed planning method for a robot based on deep reinforcement learning provided by an embodiment of the invention;
Fig. 2 is a detailed flowchart of step S100 in the speed planning method for a robot based on deep reinforcement learning according to an embodiment of the present invention;
Fig. 3 is a detailed flowchart of step S300 in the speed planning method for a robot based on deep reinforcement learning according to an embodiment of the present invention;
Fig. 4 is a detailed flowchart of step S320 in the speed planning method for a robot based on deep reinforcement learning according to an embodiment of the present invention;
Fig. 5 is a detailed flowchart of step S340 in the speed planning method for a robot based on deep reinforcement learning according to an embodiment of the present invention;
Fig. 6 is a flowchart of an embodiment of the speed planning method for a robot based on deep reinforcement learning according to an embodiment of the present invention;
Fig. 7 is a first detailed flowchart of step S400 in the speed planning method for a robot based on deep reinforcement learning according to an embodiment of the present invention;
Fig. 8 is a second detailed flowchart of step S400 in the speed planning method for a robot based on deep reinforcement learning according to an embodiment of the present invention;
Fig. 9 is a schematic diagram of an implementation of the speed planning system for a robot based on deep reinforcement learning according to an embodiment of the present invention.
Detailed Description
The conception, specific structure, and technical effects produced by the present invention will be clearly and completely described below with reference to the embodiments and the drawings to fully understand the objects, aspects, and effects of the present invention.
It should be noted that, unless otherwise specified, when a feature is referred to as being "fixed" or "connected" to another feature, it may be directly or indirectly fixed or connected to the other feature. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art. The terminology used in the description presented herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and/or" as used herein includes any combination of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element of the same type from another. For example, a first element could also be termed a second element, and, similarly, a second element could also be termed a first element, without departing from the scope of the present invention. The use of any and all examples, or exemplary language (e.g., "such as") provided herein, is intended merely to better illuminate embodiments of the invention and does not pose a limitation on the scope of the invention unless otherwise claimed.
With the continuous development of industrial automation and robot technology, robots are increasingly used in fields such as production, logistics and services. Speed planning is a key link in a robot control system and directly affects the efficiency, precision and stability of robot task execution. Traditional robot speed planning methods are generally based on analytical models or empirical formulas; when facing complex task scenarios, dynamically changing environments and uncertainty in robot parameters, they often show various limitations and can hardly meet the requirements of modern robot systems for efficient, accurate and adaptive speed planning.
On the one hand, traditional methods rely heavily on the dynamic model of the robot and require accurate model parameters to realize speed planning. In practical applications, however, the robot model parameters may deviate due to load variation, mechanical wear and the like, reducing the accuracy and reliability of the speed planning. On the other hand, traditional speed planning methods can hardly balance and optimize multiple performance indexes effectively and cannot meet the diversified requirements of complex task scenarios. In addition, when facing dynamically changing environments and tasks, traditional methods often require tedious manual parameter tuning and repeated offline programming, lack adaptive capability, and reduce the intelligence level and deployment flexibility of the robot.
Based on the above, the embodiment of the invention provides a speed planning method, a system and a device for a robot based on deep reinforcement learning, which can effectively cope with complex task scenes, dynamic environment changes and uncertainty of the robot, and can perform intelligent balance and optimization among a plurality of performance indexes.
The following describes a speed planning method, system and device for a robot based on deep reinforcement learning according to an embodiment of the present invention with reference to fig. 1 to 7.
Referring to fig. 1, fig. 1 is a general flowchart of a speed planning method of a robot based on deep reinforcement learning according to an embodiment of the present invention, including but not limited to steps S100 to S500, specifically,
S100, constructing a deep reinforcement learning environment, establishing a kinematic and approximate dynamic model of a robot in the deep reinforcement learning environment, and defining a state space and an action space of the deep reinforcement learning;
S200, providing a comprehensive reward function for speed planning of the agent;
S300, in the deep reinforcement learning environment, performing speed planning training and optimization processing on the agent based on the comprehensive reward function to obtain an optimal speed planning strategy;
S400, generating a speed decision model according to an optimal speed planning strategy, solidifying and storing weight parameters of the speed decision model, deploying the speed decision model to an intelligent decision module of a robot control system, acquiring a current real state space and an action space of a robot body by the robot control system, and determining the optimal speed planning strategy of the robot body by the intelligent decision module based on the current real state space and the action space;
S500, generating an instruction sequence of each joint speed according to the optimal speed planning strategy of the robot body so as to drive the robot body to move.
In some embodiments of the invention, the speed planning method for a robot based on deep reinforcement learning comprises constructing a deep reinforcement learning environment, establishing a kinematic and approximate dynamics model of the robot in the deep reinforcement learning environment, and defining the state space and action space of the deep reinforcement learning, thereby completing the construction of the environment required for deep reinforcement learning training, including accurate loading of the robot model, setting of the operation task path, definition of physical constraints, and concretization of the state and action spaces, and laying a solid foundation for the subsequent learning of the agent.
A comprehensive reward function for speed planning of the agent is provided. The comprehensive reward function is the core of reinforcement learning and determines the behavioral objective of the agent: in the subsequent speed planning training and optimization process it guides the deep reinforcement learning agent to learn a strategy that can effectively plan the speed under various tasks and environments, extracts the optimal behavior pattern from the historical interaction experience, motivates the agent to reach the target quickly and accurately, guarantees the stability and efficiency of the motion process, and strictly obeys the physical constraints.
In the deep reinforcement learning environment, speed planning training and optimization are performed on the agent based on the comprehensive reward function; for example, the agent is trained using a deep reinforcement learning algorithm. The agent learns how to select actions to maximize the cumulative reward by interacting with the environment; during training, the agent continually tries different action strategies and adjusts them according to the feedback of the comprehensive reward function. Experience replay is then used to improve learning efficiency, and the network structure and hyperparameters of the model (such as the learning rate and the discount factor) are adjusted to optimize the training effect, so that the optimal speed planning strategy is obtained through this training and optimization procedure.
After training is completed, a speed decision model is generated according to an optimal speed planning strategy, weight parameters of the speed decision model are solidified and stored, and the speed decision model is integrated into an intelligent decision module of the robot control system. In an actual operation scene, a robot control system acquires a real state space and an action space of the current state of a robot body, wherein the real state space comprises the actual angle and angular speed of each joint of the robot body, current task information and the like, the information is input into an intelligent decision module, and the intelligent decision module determines an optimal speed planning strategy of the robot body based on the current real state space and the action space.
And calculating the target speed or the target angular speed of each joint according to the optimal speed planning strategy, converting the target speed or the target angular speed into an instruction sequence of each joint speed, and driving the robot body to move through the instruction sequence of each joint speed, so that the speed decision module can rapidly respond to meet the real-time requirement of the robot body movement.
In summary, by learning the speed planning strategy directly from interaction with the environment through deep reinforcement learning, the method significantly reduces the dependence on the robot dynamics model and shows strong robustness and adaptability under practical working conditions such as uncertain model parameters, dynamic load changes and slight mechanical wear of the robot. Through the comprehensive reward function, intelligent balancing and optimization can be carried out among multiple, even mutually conflicting, performance indexes (such as motion time, trajectory accuracy, motion stability, energy consumption and strict compliance with physical constraints), significantly improving the overall operating performance of the robot. By giving the robot the ability to autonomously learn and optimize the speed planning strategy, the robot can continuously adapt to complex operation tasks and dynamically changing working environments through interaction with simulation or real environments, reducing the dependence on tedious manual parameter tuning and repeated offline programming and thus significantly improving the intelligence level and deployment flexibility of the robot. The speed decision model is deployed to the intelligent decision module of the robot control system, and the robot control system acquires the current real state space and action space of the robot body; with the learned optimal speed planning strategy, the robot body can complete the specified task in a smoother and more efficient manner, effectively shortening the operation cycle time. Each joint speed is planned efficiently, accurately, stably and adaptively in complex task environments, while unnecessary impact and vibration are reduced, which effectively improves the production takt and the quality of processed products, reduces mechanical wear, and prolongs the service life of the robot body.
It will be appreciated that, in some embodiments of the invention, the kinematic model is used to describe the relationship between the positions and velocities of the robot joints and can generally be expressed in terms of forward kinematics (calculating the end-effector position from the joint angles) and inverse kinematics (calculating the joint angles from the end-effector position). The approximate dynamics model takes into account the mass and inertia of the robot, but can be simplified to a linear or nonlinear model for predicting the effect of an action on the state of the robot.
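For illustration, a minimal Python sketch of a forward-kinematics computation from standard D-H parameters is given below; the function names and parameter layout are illustrative assumptions, not part of the patent.

```python
import numpy as np

def dh_transform(theta, d, a, alpha):
    """Homogeneous transform of one link from standard D-H parameters."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])

def forward_kinematics(joint_angles, dh_table):
    """Compose link transforms to obtain the end-effector pose in the base frame."""
    T = np.eye(4)
    for q, (d, a, alpha) in zip(joint_angles, dh_table):
        T = T @ dh_transform(q, d, a, alpha)
    return T  # 4x4 homogeneous pose of the end effector
```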
Referring to fig. 2, fig. 2 is a detailed flowchart of step S100 in the speed planning method of the robot based on deep reinforcement learning according to the embodiment of the present invention, and step S100 includes, but is not limited to, steps S110 to S140, specifically,
S110, selecting the open-source MuJoCo physics engine as the simulation platform, and constructing a deep reinforcement learning environment in combination with the OpenAI Gym framework;
S120, establishing a kinematic and approximate dynamics model of the robot in the deep reinforcement learning environment, wherein the kinematic and approximate dynamics model of the robot comprises the D-H parameters, mass, inertia and joint limit information of the robot;
s130, defining a state space of deep reinforcement learning according to the current angle, the current angular speed, the error between a target path point and the current end effector position, the error between the target path point and the current end effector posture and a plurality of key point information or curvature information of a future path segment of the robot;
and S140, defining an action space of deep reinforcement learning according to the target angular acceleration or the target angular velocity of each joint of the robot in the next control period.
In some embodiments of the present invention, in step S100, a deep reinforcement learning environment is constructed, a kinematic and approximate dynamics model of the robot is built in the deep reinforcement learning environment, and a state space and an action space of the deep reinforcement learning are defined.
Firstly, a platform supporting high-fidelity physical simulation is selected. In the embodiment of the invention, the open-source MuJoCo physics engine is selected as the simulation platform, and the reinforcement learning environment is built in combination with the OpenAI Gym framework. The kinematic and approximate dynamics model of the robot is loaded into the physical simulation platform, and the model contains information such as the D-H parameters, mass, inertia and joint limits of the robot. Experience data are obtained through this simulation environment, and the quality of the experience data directly influences the learning effect.
According to the current angle, the current angular velocity, the error of the target path point and the current end effector position, the error of the target path point and the current end effector posture, and a plurality of key point information or curvature information of a future path section, a state space of deep reinforcement learning is defined, and it is understood that the end effector position error is the error of the target path point and the current end effector position, the end effector posture error is the error of the target path point and the current end effector posture, and the key point information or the curvature information of the future path section is used for providing the forward looking information of the path.
The action space of the deep reinforcement learning is defined according to the target angular acceleration or the target angular velocity of each joint of the robot in the next control period, and it is understood that the target angular acceleration is the target angular acceleration of each joint in the next control period, and the target angular velocity is the target angular velocity of each joint in the next control period.
The definition of physical constraint and the materialization of the state and the action space can be realized by defining the state space and the action space of deep reinforcement learning, so that a solid foundation is laid for the subsequent learning of the robot body.
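A minimal sketch of how such an environment might be assembled, assuming the open-source mujoco Python bindings and the Gymnasium API, is shown below; the model file name, observation dimension and reward/termination details are placeholders rather than the patent's implementation.

```python
import numpy as np
import gymnasium as gym
import mujoco

class RobotSpeedPlanningEnv(gym.Env):
    """Sketch of a speed-planning environment built on MuJoCo + Gym."""

    def __init__(self, model_path="robot_arm.xml", state_dim=24, n_joints=6):
        # The XML model carries the geometry, mass, inertia and joint limits.
        self.model = mujoco.MjModel.from_xml_path(model_path)
        self.data = mujoco.MjData(self.model)
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, (state_dim,), np.float32)
        # Actions are normalized target joint velocities for the next control period.
        self.action_space = gym.spaces.Box(-1.0, 1.0, (n_joints,), np.float32)

    def _get_state(self):
        # Joint angles, joint velocities and (placeholder) path/task information.
        pad = self.observation_space.shape[0] - self.model.nq - self.model.nv
        path_info = np.zeros(max(pad, 0))
        return np.concatenate([self.data.qpos, self.data.qvel, path_info]).astype(np.float32)

    def step(self, action):
        self.data.ctrl[:] = action               # assumes velocity-type actuators
        mujoco.mj_step(self.model, self.data)
        obs = self._get_state()
        reward = 0.0                             # composite reward computed as in step S200
        terminated, truncated = False, False
        return obs, reward, terminated, truncated, {}

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        mujoco.mj_resetData(self.model, self.data)
        return self._get_state(), {}
```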
In one embodiment of the invention, the robot is defined to perform one or more typical job task paths. For example, a typical task path may be that the robot end effector moves from a starting point $P_{start} = (x_s, y_s, z_s)$ along one straight line segment to an intermediate point $P_{mid} = (x_m, y_m, z_m)$ and then moves along another straight line segment to a target point $P_{end} = (x_e, y_e, z_e)$.
Here $P_{start}$ represents the starting position of the robot end effector in the three-dimensional Cartesian coordinate system, with coordinates $(x_s, y_s, z_s)$ giving the specific position of the starting point on the X, Y and Z axes; $P_{mid}$ represents an intermediate position point or via point on the path, whose coordinates $(x_m, y_m, z_m)$ define the position of the intermediate point; and $P_{end}$ represents the final target position point that the robot end effector needs to reach, whose coordinates $(x_e, y_e, z_e)$ define the final position of the target point.
In some tasks, the path may also include a circular arc or a more complex curve, such as a B-spline curve. These paths may be represented by a series of discrete cartesian or joint space target points, or defined by parameterized curves. The path information will be input to the agent as part of the state.
Then, the state space and the action space of the deep reinforcement learning agent are specifically defined. The state may include the current angle of each joint of the robot, the current angular velocity, the error between the target path point and the current end effector position, the error between the target path point and the current end effector attitude, and several key point or curvature information items of the future path segment. The action may be defined as the target angular acceleration or the target angular velocity of each joint of the robot in the next control period. If the target angular velocity is selected as the action, the state space and action space can be written as:
$s_t = (q_t, \dot{q}_t, e_{p,t}, e_{o,t}, p_{info,t}) \in S \subseteq \mathbb{R}^{D_S}$
For the state space formula, $s_t$ is the state vector at time $t$, which contains all the information required for the agent's decision; $q_t$ is the vector of the current angles of the robot joints at time $t$; $\dot{q}_t$ is the vector of the current angular velocities of all robot joints at time $t$; $e_{p,t}$ is the position error vector between the current position of the robot end effector and the target path point at time $t$; $e_{o,t}$ is the attitude error vector between the current attitude of the robot end effector and the target path point at time $t$; $p_{info,t}$ is the information related to the future path segment at time $t$, such as the coordinates of several future key points or the curvature of the path, used for path look-ahead; $S$ is the state space, the set of all possible state vectors $s_t$; and $\mathbb{R}^{D_S}$ indicates that the state vector lies in a $D_S$-dimensional real vector space.
$A = \{ a_t = \dot{q}^{tgt}_t \in \mathbb{R}^{n} \mid \dot{q}^{min}_j \le a_{j,t} \le \dot{q}^{max}_j,\ j = 1, \dots, n \}$
$A$ is the action space, the set of all permitted actions. The action output by the agent, $a_t = \dot{q}^{tgt}_t$, is the target angular velocity vector of all robot joints in the next control period; $\mathbb{R}^{n}$ indicates that the target angular velocity vector is an $n$-dimensional real vector, where $n$ is the number of joints (degrees of freedom) of the robot; $a_{j,t}$ is the $j$-th component of the target angular velocity vector, i.e. the target angular velocity of the $j$-th joint; $\dot{q}^{min}_j$ is the minimum angular velocity allowed for the $j$-th joint, i.e. its lower physical constraint limit; $\dot{q}^{max}_j$ is the maximum angular velocity allowed for the $j$-th joint, i.e. its upper physical constraint limit; and $j = 1, \dots, n$ indicates that the angular velocity limitation applies to all joints from the 1st to the $n$-th.
For a typical six-degree-of-freedom serial industrial robot, the action vector $a_t$ at time $t$ is generally defined as a 6-dimensional real vector, and its expression is:
$a_t = (a_{1,t}, a_{2,t}, a_{3,t}, a_{4,t}, a_{5,t}, a_{6,t})^{T}$
where $a_t$ represents the complete action vector output by the actor network at time $t$ and the superscript $T$ denotes the transpose. $a_{j,t}$ (where $j = 1, \dots, n$) denotes the $j$-th component of the action vector $a_t$, corresponding to the action command value of the $j$-th robot joint at time $t$; each component is a scalar.
If the action is defined as the target angular velocity, the $j$-th component $a_{j,t}$ corresponds to the target angular velocity $\dot{q}^{tgt}_{j,t}$ of the $j$-th joint. The action component output by the neural network of the kinematic and approximate dynamics model, $a^{raw}_{j,t}$, is typically normalized to the $[-1, 1]$ range. It therefore has to be de-normalized using the physical maximum angular velocity (absolute value) $\dot{q}^{max}_j$ of the $j$-th joint, and the expression of the inverse normalization is:
$\dot{q}^{tgt}_{j,t} = a^{raw}_{j,t} \cdot \dot{q}^{max}_j$
where $\dot{q}^{tgt}_{j,t}$ is the target angular velocity issued to the $j$-th joint of the robot at time $t$, $a^{raw}_{j,t}$ is the component of the raw output value of the policy network corresponding to the $j$-th joint, and $\dot{q}^{max}_j$ is the physical maximum angular velocity (absolute value) of the $j$-th joint.
If the action is defined as the target angular acceleration $\ddot{q}^{tgt}_{j,t}$, normalization and inverse normalization are performed analogously with the maximum angular acceleration $\ddot{q}^{max}_j$. Finally, the neural network parameters of the deep reinforcement learning agent and the hyperparameters of the learning algorithm, such as the learning rate, the discount factor $\gamma$ and the exploration rate $\epsilon$, are initialized.
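To make the state and action definitions above concrete, a short, hedged Python sketch follows; the state layout and the per-joint velocity limits are illustrative assumptions only.

```python
import numpy as np

def build_state(q, q_dot, p_target, p_ee, o_target, o_ee, path_info):
    """Assemble s_t = (q_t, q_dot_t, e_p,t, e_o,t, p_info,t) as a flat vector."""
    e_p = p_target - p_ee          # end-effector position error
    e_o = o_target - o_ee          # end-effector attitude error (e.g. RPY difference)
    return np.concatenate([q, q_dot, e_p, e_o, path_info]).astype(np.float32)

# Assumed per-joint maximum angular velocities (rad/s); placeholder values.
Q_DOT_MAX = np.array([3.15, 3.15, 3.15, 3.2, 3.2, 3.2])

def denormalize(a_raw):
    """Map the policy's normalized output in [-1, 1] to physical joint velocities."""
    return np.clip(a_raw, -1.0, 1.0) * Q_DOT_MAX
```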
In some embodiments of the present invention, step S200 in the speed planning method of the deep reinforcement learning-based robot includes, but is not limited to, steps S210 to S220, and in particular,
S210, setting a task completion time reward, an end effector trajectory tracking accuracy reward, a motion stability reward, a task completion time weight, an end effector trajectory tracking accuracy weight and a motion stability weight for the agent;
S220, obtaining a comprehensive reward function according to the product of the task completion time reward and the task completion time weight, the product of the end effector trajectory tracking accuracy reward and the end effector trajectory tracking accuracy weight, and the product of the motion stability reward and the motion stability weight, wherein the expression of the comprehensive reward function is:
$R_t = w_{time} r_{time,t} + w_{path} r_{path,t} + w_{smooth} r_{smooth,t}$
wherein $R_t$ is the comprehensive reward function, $w_{time}$ is the task completion time weight, $r_{time,t}$ is the task completion time reward, $w_{path}$ is the end effector trajectory tracking accuracy weight, $r_{path,t}$ is the end effector trajectory tracking accuracy reward, $w_{smooth}$ is the motion stability weight, and $r_{smooth,t}$ is the motion stability reward.
In the embodiment of the invention, in order to guide the deep reinforcement learning agent, during the training process of step S300, to learn strategies that effectively plan the speed under various tasks and environments and to extract the optimal behavior pattern from the historical interaction experience, a comprehensive reward function $R_t$ capable of comprehensively reflecting the quality of the speed planning is provided. This function not only motivates the agent to reach the target quickly and accurately, but also guarantees the stability and efficiency of the motion process and strict compliance with the physical constraints.
The provision of a comprehensive reward function for speed planning of the agent specifically includes:
Setting the task completion time reward, the end effector trajectory tracking accuracy reward, the motion stability reward, the task completion time weight, the end effector trajectory tracking accuracy weight and the motion stability weight of the agent, with the reward terms defined as follows:
$r_{time,t}$: the task completion time reward, encouraging the agent to move toward the target point;
$r_{path,t}$: the end effector trajectory tracking accuracy reward, reducing the error between the target point and the trajectory during the motion and improving the trajectory accuracy;
$r_{smooth,t}$: the motion stability reward, encouraging the agent to reduce the trajectory fluctuation of the end effector so that the motion is smoother.
The comprehensive reward function is then obtained from the product of the task completion time reward $r_{time,t}$ and the task completion time weight $w_{time}$, the product of the end effector trajectory tracking accuracy reward $r_{path,t}$ and the end effector trajectory tracking accuracy weight $w_{path}$, and the product of the motion stability reward $r_{smooth,t}$ and the motion stability weight $w_{smooth}$.
The weights of the comprehensive reward function can be tuned by first assigning a relatively large weight to the most important target, such as $w_{path} = 1.0$ for end effector trajectory tracking accuracy, and initially setting the weights of the other optimization terms to small values, such as $w_{time} = 0.1$ and $w_{smooth} = 0.05$. After the agent has been trained for a period of time, the weights can be adjusted by observing the agent's response to different weights; for example, if the end effector trajectory tracking accuracy is very high but the motion speed is slow, $w_{path}$ can be decreased and $w_{time}$ increased to raise the motion speed. If excessive jitter or uneven acceleration appears in the agent's motion, increasing $w_{smooth}$ steers the iteration toward smoother motion. In this process the weights are adjusted step by step, based on the different optimization requirements and the analysis of the cumulative reward, until satisfactory overall performance is achieved. The comprehensive reward function serves as the core evaluation criterion of the agent in the learning process of step S300 and guides the optimization direction of the speed planning strategy.
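A hedged Python sketch of such a composite reward is given below; the concrete forms of the three reward terms and the weight values are assumptions chosen only to illustrate the weighted sum.

```python
import numpy as np

# Assumed weights, following the illustrative starting values discussed above.
W_TIME, W_PATH, W_SMOOTH = 0.1, 1.0, 0.05

def composite_reward(step_cost, path_error, joint_accel):
    """R_t = w_time*r_time + w_path*r_path + w_smooth*r_smooth (term forms assumed)."""
    r_time = -step_cost                      # e.g. -1 per control step until the goal
    r_path = -np.linalg.norm(path_error)     # penalize end-effector tracking error
    r_smooth = -np.linalg.norm(joint_accel)  # penalize jerky joint motion
    return W_TIME * r_time + W_PATH * r_path + W_SMOOTH * r_smooth
```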
Referring to fig. 3, fig. 3 is a detailed flowchart of step S300 in the speed planning method of the robot based on deep reinforcement learning according to the embodiment of the present invention, and step S300 includes, but is not limited to, steps S310 to S340, specifically,
S310, selecting an actor-critic algorithm as the deep reinforcement learning algorithm for handling a continuous action space, and constructing a corresponding neural network structure;
s320, in the deep reinforcement learning environment, the intelligent agent selects actions according to the current state, and interacts with the environment to generate a new state and instant rewards;
S330, evaluating the action effect selected by the agent, the new state and the instant reward through the comprehensive reward function;
And S340, adjusting the parameters of the neural network according to the update rule of the actor-critic algorithm until a speed planning strategy that maximizes the accumulated expected reward while satisfying the constraints is learned, or until the performance of the speed planning strategy converges or a preset number of training rounds and performance index are reached, so as to obtain the optimal speed planning strategy.
In some embodiments of the present invention, performing speed planning training and optimization processing on the agent in the deep reinforcement learning environment based on the comprehensive reward function to obtain an optimal speed planning strategy includes selecting a deep reinforcement learning algorithm suitable for handling a continuous action space; because the actor-critic algorithm is suitable for robot motion in continuous spaces, the actor-critic algorithm is selected as the training algorithm and a corresponding neural network structure is constructed. The actor-critic algorithm includes an actor network and a critic network. It should be understood that the actor-critic algorithm is a reinforcement learning algorithm that combines the policy gradient method (actor) with value function estimation (critic). It is realized through two neural networks: the actor network is responsible for selecting actions according to the current state of the agent and outputs a probability distribution over actions or directly outputs action values, while the critic network is responsible for evaluating the quality of the current speed planning strategy and outputs a state value function or a state-action value function.
The parameters of the actor network and the critic network are initialized, and an optimizer (e.g. Adam) and a loss function (e.g. mean squared error loss) are defined. At each time step $t$, the agent uses the actor network to select an action $a_t$ according to the current state $s_t$, performs the action $a_t$ to interact with the environment and obtains a new state $s_{t+1}$ and an instant reward $r_t$, and uses the critic network, based on the comprehensive reward function, to evaluate the execution effect of the selected action $a_t$, the new state $s_{t+1}$ and the instant reward $r_t$, so as to assess the quality of the current speed planning strategy.
Training continues; in each training round, starting from the initial state, the action $a_t$ is performed and data are collected, the actor and critic networks are updated with the collected data, and the above process is repeated until the strategy converges or the preset number of training rounds is reached. Throughout the learning process the reward function $r_t$ guides the agent to learn the correct behavior pattern from the historical experience data, so that the agent learns to raise the operating speed as much as possible while satisfying the requirements of stable operation, high accuracy and compliance with the physical constraints, and can generate high-performance speed plans under various working conditions.
The agent is thus trained to perform speed planning in the deep reinforcement learning environment using the actor-critic algorithm, finally yielding the optimal speed planning strategy.
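As one possible concrete realization, a minimal PyTorch sketch of the actor and critic networks and their Adam optimizers follows; the layer sizes, activation choices and input/output dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps the state to a normalized joint-velocity action in [-1, 1]."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )
    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Estimates the state value V(s) used to evaluate the current policy."""
    def __init__(self, state_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
    def forward(self, s):
        return self.net(s)

actor, critic = Actor(24, 6), Critic(24)                 # assumed dimensions
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
```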
Referring to fig. 4, fig. 4 is a detailed flowchart of step S320 in the speed planning method of the robot based on deep reinforcement learning according to the embodiment of the present invention, and step S320 includes, but is not limited to, steps S321 to S324, specifically,
S321, outputting an action by the actor network according to the current state of the agent, wherein the action comprises a first action or a second action, the first action being generated by a stochastic policy and the second action being generated by a deterministic policy;
S322, the critic network evaluates the value of the current state or the value of the current state-action pair;
S323, starting to execute tasks in the deep reinforcement learning environment; at each time step, the agent observes the current state, obtains a target action through the actor network, executes the target action, obtains a new state, and receives an instant reward;
S324, generating an experience tuple from the current state, the target action, the instant reward and the new state, and storing it in an experience replay buffer or using it directly for online updating.
Referring to fig. 4 and 6, in some embodiments of the invention, in a deep reinforcement learning environment, an agent selects actions according to a current state, interacts with the environment to generate new states and instant rewards, including:
The actor network outputs an action $a_t$ according to the agent's current state $s_t$. The action $a_t$ includes a first action or a second action: the first action is generated by a stochastic policy, used to explore the environment and increase the diversity of the strategy, and is expressed as $a_t \sim \pi_\theta(a_t \mid s_t)$; the second action is generated by a deterministic policy, used to execute the current optimal strategy, and is expressed as $a_t = \mu_\theta(s_t)$. The critic network is responsible for evaluating the value of the agent's current state $s_t$, i.e. $V(s_t)$, or the value of the state-action pair $(s_t, a_t)$, i.e. $Q(s_t, a_t)$.
Thereafter, the agent begins performing tasks in the deep reinforcement learning environment. At each time step $t$, the agent observes the current state $s_t$ and obtains a target action $a_t$ through the actor network. After the target action $a_t$ is performed, the environment transitions to a new state $s_{t+1}$ and returns an instant reward $r_t$, where the instant reward $r_t$ is calculated from the comprehensive reward function. An experience tuple is generated from the current state $s_t$, the target action $a_t$, the instant reward $r_t$ and the new state $s_{t+1}$, expressed as $(s_t, a_t, r_t, s_{t+1})$, and is stored in an experience replay buffer or used directly for online updating (for on-policy algorithms); a batch of experience tuples is randomly sampled from the buffer for the update.
Through the above steps, the agent is trained to perform speed planning in the deep reinforcement learning environment using the actor-critic algorithm, finally yielding the optimal speed planning strategy.
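A hedged sketch of an experience replay buffer storing such tuples, together with a commented interaction step, is given below; the capacity, batch size and the surrounding environment/actor objects are assumptions carried over from the earlier sketches.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores experience tuples (s_t, a_t, r_t, s_{t+1}, done) for off-policy updates."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=256):
        batch = random.sample(self.buffer, batch_size)
        return map(list, zip(*batch))  # states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)

# One interaction step of the training loop (env/actor/buffer as in the sketches above):
#   obs, _ = env.reset()
#   action = actor(torch.as_tensor(obs)).detach().numpy()
#   next_obs, reward, terminated, truncated, _ = env.step(action)
#   buffer.push(obs, action, reward, next_obs, terminated)
```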
Referring to fig. 5, fig. 5 is a detailed flowchart of step S340 in the speed planning method of the robot based on deep reinforcement learning according to the embodiment of the present invention, and step S340 includes, but is not limited to, steps S341 to S343, specifically,
S341, processing the loss function of the actor network weights by a gradient descent method to obtain a minimized loss function, and updating the parameters of the actor network and the critic network using the minimized loss function;
S342, continuously training and optimizing the speed planning strategy of the agent through interactive trial-and-error learning so as to maximize the accumulated expected reward, namely the objective function, within a complete task round;
And S343, continuously optimizing the speed planning strategy of the agent, using the comprehensive reward function to guide the agent to learn the correct behavior pattern from the historical experience data, until the performance of the speed planning strategy converges or reaches the preset number of training rounds and performance index.
Referring to FIGS. 5 and 6, in some embodiments of the invention, parameters of a neural network are adjusted according to the update rules of an actor-critique algorithm until a speed planning strategy is learned that maximizes the accumulation of desired rewards while satisfying various constraints or until the speed planning strategy performance converges or reaches a preset number of training rounds and performance metrics, resulting in an optimal speed planning strategy, including the agent processing the loss function of actor network weights by a gradient descent method according to collected empirical data, resulting in a minimized loss function, updating parameters of actor and critique networks by minimizing the weight loss function L (θ), which typically involves a gradient descent method, such asWherein alpha is learning rate, theta is a weight parameter set of the neural network to be updated, L (theta) is a loss function taking the parameter theta as a variable, and the loss function is used for quantifying the error between a prediction result and a real target; Representing the gradient of the loss function L (θ) to the parameter θ. The formula defines how the gradient of the loss function L (θ) is dependent And updating the weight parameter theta of the actor network with the learning rate alpha as a step length, so that the speed planning strategy of the intelligent body is continuously optimized towards the goal of maximizing the cumulative rewards.
Through a large amount of interactive trial-and-error learning, the speed planning strategy of the agent is continuously improved so as to maximize the accumulated expected reward in one complete task round, namely the objective function, whose expression is:

J(θ) = E_{τ∼π_θ}[ Σ_{t=0}^{T} γ^t · r_t ]

Where J(θ) represents the objective function, namely the expected cumulative reward obtainable in a complete task round under the policy π_θ. The parameter θ represents the weight parameters of the actor network. π_θ is the strategy defined by the parameters θ, a function that outputs a probability distribution over actions a, or a determined action, based on the current state s. τ represents a complete trajectory (s_0, a_0, r_0, s_1, ...). γ ∈ [0, 1] is the discount factor used to balance the importance of short-term and long-term rewards; when γ approaches 1, the agent attaches more importance to future long-term rewards. T represents the maximum step size of the round, and r_t is the immediate reward obtained from the environment after the agent performs the action at time step t. By maximizing the objective function, the agent can be driven to learn the optimal speed planning strategy π_θ, which maximizes the long-term accumulated reward on the premise of meeting various constraints.
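As a concrete illustration of this objective, the following sketch (a minimal Python example, not part of the claimed method) computes the discounted return Σ γ^t · r_t of individual trajectories and averages them as a Monte-Carlo estimate of J(θ); the reward sequences shown are placeholder values.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum_{t=0}^{T} gamma^t * r_t for one complete trajectory tau."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


def estimate_objective(reward_sequences, gamma=0.99):
    """Monte-Carlo estimate of J(theta): average discounted return over
    trajectories sampled under the current policy pi_theta."""
    returns = [discounted_return(seq, gamma) for seq in reward_sequences]
    return sum(returns) / len(returns)


# Placeholder reward sequences from two sampled task rounds.
print(estimate_objective([[1.0, 0.5, 0.2], [0.8, 0.9]]))
```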
And continuously optimizing and training the speed planning strategy of the agent until the performance of the speed planning strategy converges or reaches the preset number of training rounds/performance indexes. Throughout the learning process, the reward function r_t guides the agent to learn a correct behavior mode from historical experience data, so that the agent learns to increase the running speed on the premise of meeting the requirements of stable operation, high precision and compliance with physical constraints, and can generate a high-performance speed plan under various working conditions.
Referring to fig. 7, fig. 7 is a flowchart showing a first detail of step S400 in the speed planning method of the robot based on deep reinforcement learning according to the embodiment of the present invention, and step S400 includes, but is not limited to, steps S410 to S440, specifically,
S410, generating a speed decision model according to the optimal speed planning strategy, wherein the speed decision model is composed of a neural network based on an actor-critic deep reinforcement learning framework;
s420, generating an action with the largest future expected reward according to the current state of the intelligent agent, wherein the current state comprises a vector of joint angles, angular velocities and task information of the robot, and the action with the largest future expected reward comprises a target angular velocity or a target angular acceleration of each joint of the robot;
S430, evaluating, through the critic network, the long-term value of executing the action with the maximum future expected reward in the current state, and guiding the update of the actor network;
And S440, solidifying and storing the updated weight parameters of the actor network.
In some embodiments of the invention, generating a speed decision model according to the optimal speed planning strategy, and solidifying and saving the weight parameters of the speed decision model, comprises generating, according to the optimal speed planning strategy, a speed decision model for producing a high-quality speed plan for a given robot and task type, wherein the speed decision model is composed of a neural network based on an actor-critic deep reinforcement learning framework. The actor network generates the action with the maximum future expected reward based on the current state s_t, which can be expressed as a_t* = π_θ(s_t). The current state s_t comprises a vector of the joint angles, angular velocities and task information of the robot, and the action a_t* with the maximum future expected reward comprises the target angular velocity or target angular acceleration of each joint of the robot.
The neural network structure is divided into an actor network and a critic network. The actor network outputs the mean and variance of a Gaussian distribution over actions; in order to ensure deterministic motion when deployed in applications, the mean is taken as the final action command. The actor-critic deep reinforcement learning framework also uses a critic network during training, denoted Q_φ(s_t, a_t), for evaluating the long-term value of executing the action with the maximum future expected reward in the current state s_t; it receives the state-action pair as input and guides the update of the actor network. After training, the weight parameters of the actor network of the speed decision model are solidified and stored, and deployed into the control system of an actual industrial robot, usually embedded into the high-level control layer or a dedicated intelligent decision module of the robot control system.
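A minimal sketch of how such an actor network (outputting the mean and log-standard-deviation of a Gaussian action distribution) and a critic network Q_φ(s_t, a_t) could be defined in PyTorch is given below; the hidden-layer sizes and activations are assumptions, not part of the described model.

```python
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Outputs the mean and log-std of a Gaussian over joint-velocity actions."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.mean_head = nn.Linear(hidden, action_dim)
        self.log_std_head = nn.Linear(hidden, action_dim)

    def forward(self, state):
        h = self.body(state)
        return self.mean_head(h), self.log_std_head(h)

    def act_deterministic(self, state):
        """At deployment, take the Gaussian mean as the final action command."""
        mean, _ = self.forward(state)
        return torch.tanh(mean)        # normalized action in [-1, 1]


class Critic(nn.Module):
    """Q_phi(s_t, a_t): long-term value of taking action a_t in state s_t."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```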
Referring to fig. 8, fig. 8 is a second detailed flowchart of step S400 in the speed planning method of the robot based on deep reinforcement learning according to the embodiment of the present invention, and step S400 includes, but is not limited to, steps S460 to S490, specifically,
S460, the robot control system acquires the actual joint angles, joint angular velocities, end effector poses and current task information of the current joints of the robot body in real time through the state sensing module, generates a current real state space according to the actual joint angles, joint angular velocities, end effector poses and the current task information of the current joints, and transmits the current real state space as input to the intelligent decision module;
S470, after the intelligent decision model receives the current real state space, performing a single fast forward-inference calculation and outputting a normalized action vector;
S480, performing inverse normalization processing on the action vector to obtain an action space;
And S490, determining an optimal speed planning strategy of the robot body based on the current real state space and the action space.
In some embodiments of the invention, the robot control system acquires the current real state space and action space of the robot body, and the intelligent decision module determines the optimal speed planning strategy of the robot body based on the current real state space and the action space. Specifically, the robot control system acquires, in real time through the state sensing module, the actual joint angle and joint angular velocity of each current joint of the robot body, the pose of the end effector and the current task information; this information forms the real state s_t at the real-time instant t and is transmitted as input to the deployed intelligent decision model. After the intelligent decision model receives the real state s_t at time t, it performs a single fast forward-inference calculation and immediately outputs a normalized action vector a_t. The output action vector a_t is subjected to inverse normalization processing to obtain the action space, and a control instruction with actual physical meaning is generated according to the action space. Finally, the optimal speed planning strategy of the robot body is determined based on the current real state space and the action space.
In one embodiment of the present invention, if the action is defined as the target angular velocity, the target angular velocity of the j-th joint is expressed as:

ω_j^target = a_j · ω_j^max

Wherein, ω_j^target is the target angular velocity, a_j is the j-th component of the action vector, and ω_j^max is the physical maximum angular velocity of the j-th joint;
And generating a command sequence of each joint speed according to the target angular speed, and sending the command sequence of each joint speed to a robot control system, wherein the robot control system converts the command sequence of each joint speed into a motor driving signal so as to drive the robot body to move according to the planned speed.
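As an illustration of the inverse normalization step, the sketch below maps a normalized action component a_j in [-1, 1] to a joint velocity command ω_j^target = a_j · ω_j^max; the per-joint velocity limits and the six-axis layout are assumed values.

```python
import numpy as np

# Assumed per-joint physical maximum angular velocities (rad/s) for a 6-axis arm.
MAX_JOINT_VELOCITY = np.array([3.14, 3.14, 3.93, 6.28, 6.28, 6.28])

def denormalize_action(normalized_action, max_velocity=MAX_JOINT_VELOCITY):
    """Map the network output a_j in [-1, 1] to a target angular velocity
    for each joint: omega_j_target = a_j * omega_j_max."""
    a = np.clip(normalized_action, -1.0, 1.0)
    return a * max_velocity

# Example: one forward-inference result turned into a velocity command.
normalized = np.array([0.2, -0.5, 0.8, 0.0, 0.1, -0.3])
velocity_command = denormalize_action(normalized)
print(velocity_command)   # rad/s commands passed on to the joint controllers
```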
In some embodiments of the present invention, the deployed speed decision model directly outputs instructions regarding the target speed or target acceleration, so real-time online speed planning can be achieved. By periodically detecting the performance of the robot, when the performance of the robot is detected to be reduced, the speed decision model is subjected to off-line fine tuning or retraining by utilizing new data collected in actual operation so as to adapt to the abrasion of a robot body, the load change or the gradual change of a working environment, thereby maintaining or further improving the performance of the robot and realizing the continuous optimized closed loop. The combination of the online reasoning and the offline optimization enables the robot body to be capable of adapting to dynamic changes rapidly, and the robust intelligent speed planning of the robot body in the actual operation process is achieved.
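A minimal sketch of the kind of performance check that could trigger the offline fine-tuning described above is shown below; the score range, threshold and window length are assumptions.

```python
def maybe_retrain(performance_history, threshold=0.9, window=20):
    """Return True when the recent average task-performance score (assumed to
    lie in [0, 1]) falls below the threshold, signalling that the speed
    decision model should be fine-tuned offline on newly collected data."""
    recent = performance_history[-window:]
    if len(recent) < window:
        return False
    return sum(recent) / len(recent) < threshold
```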
The embodiment of the invention also provides a speed planning system of a robot based on deep reinforcement learning, which is used for implementing the above speed planning method of a robot based on deep reinforcement learning. The speed planning system comprises a robot body and a robot control system, wherein the robot control system comprises a speed planning module, a state sensing module and an intelligent decision module. The intelligent decision module has a built-in intelligent decision model obtained through deep reinforcement learning training and is used for outputting the optimal speed planning strategy produced by the intelligent decision model; the speed planning module is used for generating optimal joint speed parameters according to the optimal speed planning strategy output by the intelligent decision module; and the state sensing module is used for acquiring state data including the actual joint angle, joint angular velocity, end effector pose and current task information of each joint of the robot body, and for receiving and feeding back the execution of the joint speed commands.
It can be understood that the robot body is the main body of the deep-reinforcement-learning-based speed planning system and performs the specific tasks, while the robot control system plans and manages the motion of the robot body. The robot control system is the core part of the speed planning system and is responsible for planning and managing the speed of the robot body; it comprises the following three key modules:
The speed planning module comprises a kinematic model and is used for generating specific joint speed parameters according to the output of the intelligent decision module and guiding the movement of the robot body.
The state sensing module comprises a data acquisition unit and an execution feedback unit; it acquires the state data of the robot body (such as joint angles, angular velocities and end effector pose) and the task information through the data acquisition unit, and feeds back the execution of the speed commands through the execution feedback unit, providing a basis for intelligent decision-making.
The intelligent decision module comprises an intelligent agent, a model obtained through deep reinforcement learning training is built in the intelligent decision module, and an optimal speed planning strategy is output based on data provided by the state sensing module.
In one embodiment, the workflow of the deep-reinforcement-learning-based speed planning system of the robot is as follows (referring to fig. 9): the state sensing module acquires, in real time through the data acquisition unit, state data such as the actual joint angle and joint angular velocity of each joint of the robot body, the end effector pose and the current task information, and transmits the data to the agent of the intelligent decision module through the execution feedback unit. The agent of the intelligent decision module performs calculation using the built-in deep reinforcement learning model according to the data provided by the state sensing module, outputs the optimal speed planning strategy and passes it to the low-level controller of the robot control system. The speed planning module receives the optimal speed planning strategy output by the intelligent decision module and generates the specific speed parameters of each joint based on the kinematic model. The robot body executes actions according to the speed parameters generated by the speed planning module, and the state sensing module monitors the execution in real time and transmits feedback information to the intelligent decision module for dynamic adjustment.
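For illustration, the following sketch wires the workflow just described into a single control loop; every module interface used here (read_state, infer, to_joint_speeds, execute, report_execution, task_done) is a hypothetical name introduced only for this example.

```python
import time

def control_loop(sensing, decision_model, planner, robot, period_s=0.01):
    """One possible wiring of the perception-decision-planning-execution loop."""
    while not robot.task_done():
        state = sensing.read_state()                       # joint angles, velocities, pose, task info
        normalized_action = decision_model.infer(state)    # forward inference of the deployed policy
        joint_speeds = planner.to_joint_speeds(normalized_action)  # inverse normalization + kinematics
        robot.execute(joint_speeds)                        # converted to motor drive signals
        sensing.report_execution(joint_speeds)             # feedback for dynamic adjustment
        time.sleep(period_s)
```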
Through deep reinforcement learning, the speed planning system of the robot can dynamically adjust the speed planning strategy according to different tasks and environments and has strong self-adaptive capability. The goal of the deep reinforcement learning model is to maximize the cumulative reward, which enables the speed planning system to output the optimal speed planning strategy and improves the efficiency and accuracy with which the robot body performs tasks. The state sensing module collects data in real time through the data acquisition unit and feeds back the execution status through the execution feedback unit, so that the speed planning system can respond quickly to environmental changes and task demands, adjust the speed plan in time, and realize a continuously optimized closed loop, enabling the robot body to adapt rapidly to dynamic changes and achieving robust intelligent speed planning of the robot body in the actual operation process.
It should be appreciated that the method steps in embodiments of the present invention may be implemented or carried out by computer hardware, a combination of hardware and software, or by computer instructions stored in non-transitory computer-readable memory. The method may use standard programming techniques. Each program may be implemented in a high level procedural or object oriented programming language to communicate with a computer system. However, the program(s) can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language. Furthermore, the program can be run on a programmed application specific integrated circuit for this purpose.
Furthermore, the operations of the processes described herein may be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The processes (or variations and/or combinations thereof) described herein may be performed under control of one or more computer systems configured with executable instructions, and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications), by hardware, or combinations thereof, collectively executing on one or more processors. The computer program includes a plurality of instructions executable by one or more processors.
Further, the method may be implemented in any type of computing platform operatively connected to a suitable computing platform, including, but not limited to, a personal computer, mini-computer, mainframe, workstation, network or distributed computing environment, separate or integrated computer platform, or in communication with a charged particle tool or other imaging device, and so forth. Aspects of the invention may be implemented in machine-readable code stored on a non-transitory storage medium or device, whether removable or integrated into a computing platform, such as a hard disk, optical read and/or write storage medium, RAM, ROM, etc., such that it is readable by a programmable computer, which when read by a computer, is operable to configure and operate the computer to perform the processes described herein. Further, the machine readable code, or portions thereof, may be transmitted over a wired or wireless network. When such media includes instructions or programs that, in conjunction with a microprocessor or other data processor, implement the steps described above, the invention described herein includes these and other different types of non-transitory computer-readable storage media. The invention may also include the computer itself when programmed according to the methods and techniques of the present invention.
The computer program can be applied to the input data to perform the functions described herein, thereby converting the input data to generate output data that is stored to the non-volatile memory. The output information may also be applied to one or more output devices such as a display. In a preferred embodiment of the invention, the transformed data represents physical and tangible objects, including specific visual depictions of physical and tangible objects produced on a display.
The present invention is not limited to the above embodiments, but can be modified, equivalent, improved, etc. by the same means to achieve the technical effects of the present invention, which are included in the spirit and principle of the present invention. Various modifications and variations are possible in the technical solution and/or in the embodiments within the scope of the invention.

Claims (10)

1. The speed planning method of the robot based on the deep reinforcement learning is characterized by comprising the following steps of:
constructing a deep reinforcement learning environment, establishing a kinematic and approximate dynamic model of a robot in the deep reinforcement learning environment, and defining a state space and an action space of the deep reinforcement learning;
providing a comprehensive rewarding function for speed planning of the agent;
In the deep reinforcement learning environment, performing speed planning training and optimizing processing on the intelligent agent based on the comprehensive rewarding function to obtain an optimal speed planning strategy;
Generating a speed decision model according to the optimal speed planning strategy, solidifying and storing weight parameters of the speed decision model, and deploying the speed decision model to an intelligent decision module of a robot control system, wherein the robot control system acquires a current real state space and an action space of a robot body, and the intelligent decision module determines the optimal speed planning strategy of the robot body based on the current real state space and the action space;
And generating an instruction sequence of each joint speed according to the optimal speed planning strategy of the robot body so as to drive the robot body to move.
2. The method for planning the speed of a robot based on deep reinforcement learning according to claim 1, wherein the constructing a deep reinforcement learning environment in which a kinematic and approximate dynamics model of the robot is built and a state space and an action space of the deep reinforcement learning are defined, comprises:
selecting the open-source MuJoCo physics engine as the simulation platform, and constructing the deep reinforcement learning environment in combination with the OpenAI Gym framework;
Establishing a kinematic and approximate dynamics model of the robot in the deep reinforcement learning environment, wherein the kinematic and approximate dynamics model of the robot comprises D-H parameters, mass, inertia and joint limit information of the robot;
defining a state space of deep reinforcement learning according to the current angle, the current angular speed, the error between a target path point and the current end effector position, the error between the target path point and the current end effector posture, and a plurality of key point information or curvature information of a future path segment of the robot;
And defining an action space of deep reinforcement learning according to the target angular acceleration or the target angular velocity of each joint of the robot in the next control period.
3. The method for speed planning for a deep reinforcement learning based robot of claim 1, wherein the providing a comprehensive rewards function for speed planning of an agent comprises:
Setting task completion time rewards, end effector track tracking precision rewards, motion stability rewards, task completion time weights, end effector track tracking precision weights and motion stability weights of the intelligent agent;
Obtaining a comprehensive rewarding function according to the product of the task completion time rewarding and the task completion time weight, the product of the end effector track tracking precision rewarding and the end effector track tracking precision weight and the product of the motion stability rewarding and the motion stability weight, wherein the expression of the comprehensive rewarding function is as follows:
R_t = w_time · r_time,t + w_path · r_path,t + w_smooth · r_smooth,t

Wherein, R_t is the comprehensive reward function, w_time is the task completion time weight, r_time,t is the task completion time reward, w_path is the end effector trajectory tracking precision weight, r_path,t is the end effector trajectory tracking precision reward, w_smooth is the motion stability weight, and r_smooth,t is the motion stability reward.
4. The method for planning the speed of the robot based on the deep reinforcement learning according to claim 1, wherein in the deep reinforcement learning environment, speed planning training and optimization processing are performed on an agent based on the comprehensive rewarding function to obtain an optimal speed planning strategy, comprising:
selecting an actor-critic algorithm as the deep reinforcement learning algorithm for processing a continuous action space, so as to construct a corresponding neural network structure;
In the deep reinforcement learning environment, the intelligent agent selects actions according to the current state, and interacts with the environment to generate a new state and instant rewards;
evaluating the selected action effect, the new state and the instant rewards of the agent through the comprehensive rewards function;
And adjusting parameters of the neural network according to the updating rule of the actor-critique algorithm until a speed planning strategy for maximizing accumulated expected rewards on the premise of meeting various constraints is learned or until the performance of the speed planning strategy is learned to be converged or the preset training round number and performance index are reached, so as to obtain the optimal speed planning strategy.
5. The method of claim 4, wherein the actor-critic algorithm comprises an actor network and a critic network, wherein in the deep reinforcement learning environment, an agent selects actions according to a current state, interacts with the environment to generate new states and instant rewards, comprising:
the actor network outputs actions according to the current state of the agent, wherein the actions comprise a first action or a second action, the first action is generated by a random strategy, and the second action is generated by a deterministic strategy;
The critic network evaluates the value of the current state or the value of the current state-action pair;
starting to execute tasks in the deep reinforcement learning environment, observing the current state by an agent in each time step, obtaining a target action through the actor network, executing the target action and obtaining a new state, and returning to instant rewards;
An experience tuple is generated from the current state, the target action, the instant prize, and the new state and stored to an experience playback buffer or directly for online updating.
6. The method for planning the speed of the robot based on deep reinforcement learning according to claim 5, wherein the adjusting the parameters of the neural network according to the update rule of the actor-critique algorithm until the speed planning strategy for maximizing the accumulated expected rewards on the premise of meeting various constraints is learned or until the performance of the speed planning strategy is learned to converge or reach a preset training round number and performance index, and obtaining the optimal speed planning strategy comprises:
processing the loss function of the actor network weights through a gradient descent method to obtain a minimized loss function, and updating the parameters of the actor network and the critic network by using the minimized loss function;
The speed planning strategy of the agent is continuously trained and optimized through interactive trial-and-error learning so as to maximize the accumulated expected reward in a complete task round, namely an objective function, wherein the expression of the objective function is as follows:

J(θ) = E_{τ∼π_θ}[ Σ_{t=0}^{T} γ^t · r_t ]

Where J(θ) represents the objective function, θ represents the actor network weight parameters, τ represents a complete trajectory (s_0, a_0, r_0, s_1, ...), γ ∈ [0, 1] is the discount factor, T represents the maximum step size of the round, and r_t represents the immediate reward obtained from the environment after the agent performs the action at time step t.
And continuously optimizing and training the speed planning strategy of the intelligent agent, and guiding the intelligent agent to learn a correct behavior mode from historical experience data by utilizing the comprehensive rewarding function until the performance of the speed planning strategy converges or reaches the preset training round number and performance index.
7. The method for planning the speed of the robot based on the deep reinforcement learning according to claim 5, wherein the generating a speed decision model according to the optimal speed planning strategy, solidifying and saving the weight parameters of the speed decision model, comprises:
Generating a speed decision model according to the optimal speed planning strategy, wherein the speed decision model is composed of a neural network based on an actor-critic deep reinforcement learning framework;
the actor network generates an action with the maximum future expected reward according to the current state of the intelligent agent, wherein the current state comprises a vector of joint angles, angular velocities and task information of the robot, and the action with the maximum future expected reward comprises a target angular velocity or a target angular acceleration of each joint of the robot;
Evaluating, by the critic network, the long-term value of performing the action with the maximum future expected reward in the current state, and guiding the update of the actor network;
and solidifying and storing the updated weight parameters of the actor network.
8. The method for planning the speed of the robot based on the deep reinforcement learning according to claim 7, wherein the robot control system comprises a state sensing module, the robot control system obtains a current real state space and an action space of the robot body, and the intelligent decision module determines an optimal speed planning strategy of the robot body based on the current real state space and the action space, and the method comprises the following steps:
The robot control system acquires the actual joint angle, the joint angular velocity, the pose of the end effector and the current task information of each current joint of the robot body in real time through a state sensing module, generates a current real state space according to the actual joint angle, the joint angular velocity, the pose of the end effector and the current task information of each current joint, and transmits the current real state space as input to an intelligent decision module;
After the intelligent decision model receives the current real state space, carrying out a fast forward reasoning calculation and outputting a normalized action vector;
performing inverse normalization processing on the action vector to obtain an action space;
And determining an optimal speed planning strategy of the robot body based on the current real state space and the action space.
9. A speed planning system for a robot based on deep reinforcement learning, for implementing the speed planning method for a robot based on deep reinforcement learning according to claim 8, comprising:
A robot body;
The robot control system comprises a speed planning module, a state sensing module and an intelligent decision module, wherein the intelligent decision module has a built-in intelligent decision model obtained through deep reinforcement learning training and is used for outputting the optimal speed planning strategy produced by the intelligent decision model, the speed planning module is used for generating optimal joint speed parameters according to the optimal speed planning strategy output by the intelligent decision module, and the state sensing module is used for acquiring state data including the actual joint angle, joint angular velocity, end effector pose and current task information of each joint of the robot body and for receiving and feeding back the execution of the joint speed commands.
10. A computer device comprising a memory and a processor, wherein the processor implements the method of any of claims 1 to 8 when executing a computer program stored in the memory.