
CN119115953A - A trajectory planning method for teleoperated space manipulator based on deep reinforcement learning - Google Patents

A trajectory planning method for teleoperated space manipulator based on deep reinforcement learning

Info

Publication number
CN119115953A
Authority
CN
China
Prior art keywords
state
delay
mechanical arm
environment
agent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411464413.5A
Other languages
Chinese (zh)
Inventor
王学谦
夏博
田宪儒
袁博
李志恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen International Graduate School of Tsinghua University
Original Assignee
Shenzhen International Graduate School of Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen International Graduate School of Tsinghua University filed Critical Shenzhen International Graduate School of Tsinghua University
Priority to CN202411464413.5A
Publication of CN119115953A
Legal status: Pending (Current)

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1628Programme controls characterised by the control loop
    • B25J9/163Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1602Programme controls characterised by the control system, structure, architecture
    • B25J9/161Hardware, e.g. neural networks, fuzzy logic, interfaces, processor
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1656Programme controls characterised by programming, planning systems for manipulators
    • B25J9/1664Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Automation & Control Theory (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Manipulator (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention provides a trajectory planning method for a teleoperated space manipulator based on deep reinforcement learning (DRL), which effectively addresses the trajectory planning problem caused by communication delay. The method comprises: establishing a kinematic model of the rigid manipulator and a reinforcement learning framework; constructing a teleoperation framework comprising a master end, a data link and a slave end; processing the communication delay at the master end with a delay information processing (DIP) module to keep the states and rewards up to date; updating the agent with a DRL decision module that learns from an experience replay buffer and an action replay buffer; and guiding the manipulator to complete the task through interaction between the agent and the environment. By integrating DRL into the teleoperation framework, the invention strengthens the decision-making capability of the agent, performs particularly well in environments with inherent delay, and remains robust under different noise and dynamic parameter conditions without additional parameter tuning.

Description

Trajectory planning method for a teleoperated space manipulator based on deep reinforcement learning
Technical Field
The invention relates to teleoperation technology for space robots, and in particular to a trajectory planning method for a teleoperated space manipulator based on deep reinforcement learning.
Background
With the aid of teleoperation technology, space robots greatly extend the operating capability of astronauts and play an increasingly important role in on-orbit servicing tasks, including capturing, refueling and repairing satellites, removing orbital debris, and assembling and maintaining large space infrastructure. Commonly used teleoperation control methods include remote programming control, bilateral control and virtual predictive control. Remote programming control operates in a supervisory mode: the space robot receives operation instructions sent by the master end and interacts with the environment at the slave end to form a closed-loop system. However, this approach depends on the level of intelligence of the space robot. Bilateral control and virtual predictive control both fall into the category of direct control. Bilateral control receives force feedback directly from the remote environment and is suitable for small delays; with suitable control algorithms, such as passive control, robust control and impedance control, the force and position information of the master and slave robots are kept consistent. In contrast, virtual predictive control establishes a virtual model of the slave-end environment at the master end, and the master end refers to feedback from both the virtual model and the slave end when making decisions, thereby reducing the influence of large delays on system stability and operability. However, model-based methods demand high model accuracy, so deep domain expertise is required to design complex controllers. Moreover, a space robot is a complex dynamic system, and the nonlinearities caused by dynamic coupling between the base and the manipulator, friction, joint flexibility and other characteristics pose great challenges to modeling. Furthermore, even if the original system is modeled from a physical model, many key dynamic parameters are difficult to obtain or to determine exactly. These factors significantly affect the model and thus the performance of the controller.
Data-driven, model-free deep reinforcement learning (Deep Reinforcement Learning, DRL) has shown significant promise in a number of fields, such as gaming, industrial control and large language models. It has also been widely applied by researchers to the field of space robots, mainly focusing on trajectory planning for manipulators. While previous studies have demonstrated the potential of DRL in space robotics, they generally assume that the remote robot is highly intelligent and capable of autonomously achieving and completing a specified task. Such an assumption is unrealistic given the current state of the art. In practice, most of the intelligence resides at the master end, and the objective problems of communication delay and limited bandwidth between the master and slave ends pose great challenges to DRL-based control.
Existing methods for handling delay in reinforcement learning fall mainly into three categories: state augmentation methods, model prediction methods, and other methods. State augmentation methods convert the original delayed Markov decision process into a new delay-free Markov decision process using an information state consisting of the most recently observed delayed state and the subsequent action sequence. From a theoretical perspective, the state space of the information state grows exponentially as the delay increases, so the agent needs an exponentially increasing number of samples to update the network parameters and optimize the policy network. This not only places higher demands on computing resources, but also slows policy convergence, and in extreme cases the policy may even diverge. From a practical perspective, when a policy network is designed based on state augmentation, the characteristics of deep neural networks mean that the network input must have a fixed size. When random delay occurs in the environment, the information state corresponding to that delay cannot be fed into a network whose input size has already been fixed. A policy network sized for the information state of the maximum delay is usually designed to handle random delay, but such a design introduces redundant information for information states with less than the maximum delay, thereby interfering with decision making.
Model prediction methods generally involve two steps: predicting the unknown states caused by the delay, and then making the final decision based on the predicted state and a standard reinforcement learning algorithm. Accurately simulating the environment dynamics is crucial; early work used methods such as deterministic mapping and random forests, while later work learned the transition dynamics with recurrent neural networks, feedforward models and particle-ensemble methods, respectively. Predicting the unknown states caused by the delay first requires constructing a forward dynamics model. A data-driven model can faithfully describe the environment from a large amount of data when the data distribution is stationary; when the distribution is non-stationary, a forward dynamics model obtained from limited data can only reflect part of the environment. In addition, even with a relatively accurate forward dynamics model, the inference time for the delayed state predicted from the most recently observed delayed state and action sequence is positively correlated with the magnitude of the delay, and the error between the predicted and true states increases significantly as the delay grows. These factors have a large impact on the final decision.
It should be noted that the information disclosed in the above background section is only for understanding the background of the application and thus may include information that does not form the prior art that is already known to those of ordinary skill in the art.
Disclosure of Invention
The main object of the invention is to solve the problems described in the background and to provide a trajectory planning method for a teleoperated space manipulator based on deep reinforcement learning.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A trajectory planning method for a teleoperated space manipulator based on deep reinforcement learning (DRL), the method comprising the following steps:
S1, establishing a rigid-body kinematics model of a floating-base space robot carrying a rigid manipulator, the model comprising the joint angle vector and velocity vector of the manipulator and the Jacobian matrices of the base and the manipulator, and simultaneously defining the reinforcement learning six-tuple comprising state, action, state transition function, reward function, initial state distribution and discount factor, thereby providing the kinematic parameters and the reinforcement learning framework for trajectory planning of the manipulator, wherein the reinforcement learning framework governs the decision-making process of the agent interacting with the environment;
S2, constructing an overall teleoperation framework comprising a master end, a data link and a slave end, wherein the data link transmits and processes data, the slave end executes the commands through which the space manipulator interacts with the environment and returns state information and rewards, and the master end performs real-time control of and decision making for the manipulator operation;
S3, according to the state information transmitted by the data link, processing the current state of the master end with the delay information processing (DIP) module of the master end to obtain the current state and the preceding rewards with the influence of the master-slave communication delay removed;
S4, based on the state with the delay influence removed, continuously updating the agent with the DRL decision module of the master end according to the existing experience replay buffer, the action replay buffer and the current delay amount, acquiring new states through interaction between the agent and the environment, and guiding the manipulator to perform trajectory planning;
S5, removing the delay influence from the slave-end data transmitted over the data link by means of the delay information processing module, updating the agent with the delay-corrected data and interacting with the environment, thereby gradually guiding the slave-end manipulator to complete the planning task, wherein the agent generates corresponding actions from the delay-corrected state via the policy network, the slave-end space manipulator interacts with the environment according to these actions to generate the next state and reward, which are fed back to the master end, and the master end updates the agent according to the new state and reward.
In some embodiments, the current state of the master end is processed by one or more of a mapping method, a prediction method and a state augmentation method, wherein the mapping method adopts a memoryless strategy that ignores the delay and makes decisions by taking the most recently observed state as the true state of the environment, the prediction method trains a forward model with historical trajectory data, and the state augmentation method converts the delayed Markov decision process into a delay-free Markov decision process by constructing an information state consisting of the delayed state information and the historical action sequence.
A computer program product comprising a computer program which, when executed by a processor, implements the above trajectory planning method for a teleoperated space manipulator based on deep reinforcement learning.
The invention has the following beneficial effects:
The method integrates deep reinforcement learning into the traditional teleoperation framework and solves the complex problem of trajectory planning for a teleoperated space manipulator. In the invention, the delay information processing module enhances the decision-making capability of the agent by exploiting the delayed state information and the historical actions, thereby ensuring the resilience of the agent in an environment with inherent delay. The delay information processing module generates a state useful for the current decision by considering both the current delayed state and the results of historical actions. After the delay information has been processed, the decision module generates the corresponding action according to the resulting new state. Furthermore, the embodiments of the invention design three methods, mapping, prediction and state augmentation, to construct the delay processing module.
The invention effectively solves the trajectory planning problem of the space manipulator by applying DRL in a teleoperation scenario. Compared with the prior art, the main advantages of the invention are:
1) The invention integrates deep reinforcement learning into the traditional teleoperation framework for the first time and completes the corresponding trajectory planning tasks;
2) The invention enhances decision-making capability by exploiting delayed state information and a historical action buffer;
3) The invention enhances the decision-making capability of the agent and ensures the adaptability of the algorithm in environments characterized by inherent delay;
4) The proposed method is effective for the various scenarios composed of whether the base floats and whether the target rotates;
5) The proposed method is robust under different noise or dynamic parameter conditions without requiring parameter tuning.
Other advantages of embodiments of the present invention are further described below.
Drawings
Fig. 1 is a general flow chart of a teleoperation space manipulator trajectory planning method based on deep reinforcement learning in an embodiment of the present invention.
Fig. 2 is a schematic diagram of a space robot in an embodiment of the invention.
Fig. 3 is an overall flow chart of DRL-based spatial teleoperation in an embodiment of the invention.
Fig. 4 is a schematic diagram of a mapping method of a delay processing module according to an embodiment of the present invention.
FIG. 5 is a schematic diagram of the prediction method of the delay processing module according to an embodiment of the present invention.
Fig. 6 is a block diagram of a spatial seven-degree-of-freedom redundant manipulator in an embodiment of the present invention.
FIG. 7 is a graph showing the comparison of rewards of different algorithms in various scenarios in accordance with an embodiment of the present invention.
Fig. 8 is a graph comparing success rates of different algorithms in various scenarios in accordance with an embodiment of the present invention.
FIG. 9 is a graph showing the comparison of time memory consumption of different algorithms according to an embodiment of the present invention.
FIG. 10 is a return evaluation of the trained models, without fine-tuning, under various action noises in accordance with an embodiment of the present invention.
FIG. 11 is a return evaluation of the trained models, without fine-tuning, under various observation noises in accordance with an embodiment of the present invention.
Detailed Description
The following describes embodiments of the present invention in detail. It should be emphasized that the following description is merely exemplary in nature and is in no way intended to limit the scope of the invention or its applications.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the embodiments of the present invention, the meaning of "plurality" is two or more, unless explicitly defined otherwise.
The invention provides a trajectory planning method for a teleoperated space manipulator based on deep reinforcement learning (DRL), which effectively addresses the trajectory planning problem caused by communication delay. The method comprises: establishing a kinematic model of the rigid manipulator and a reinforcement learning framework; constructing a teleoperation framework comprising a master end, a data link and a slave end; processing the communication delay at the master end with a delay information processing (DIP) module to keep the states and rewards up to date; updating the agent with a DRL decision module that learns from an experience replay buffer and an action replay buffer; and guiding the manipulator to complete the task through interaction between the agent and the environment. By integrating DRL into the teleoperation framework, the invention strengthens the decision-making capability of the agent, performs particularly well in environments with inherent delay, and remains robust under different noise and dynamic parameter conditions without additional parameter tuning.
Referring to Fig. 1, an embodiment of the invention provides a trajectory planning method for a teleoperated space manipulator based on deep reinforcement learning, which comprises the following steps.
Step S1: a rigid-body kinematics model of the floating-base space robot carrying the rigid manipulator is established, comprising the joint angle vector and velocity vector of the manipulator and the Jacobian matrices of the base and the manipulator; at the same time the reinforcement learning six-tuple comprising state, action, state transition function, reward function, initial state distribution and discount factor is defined, providing the kinematic parameters and the reinforcement learning framework for trajectory planning of the manipulator, wherein the reinforcement learning framework governs the decision-making process of the agent interacting with the environment.
In some embodiments, establishing the rigid-body kinematics model of the floating-base space robot carrying the rigid manipulator specifically comprises: defining the joint angle vector of the manipulator in the space robot system to represent the positional state of the manipulator at any moment; defining the velocity vector corresponding to the joint angle vector to represent the dynamic motion state of the manipulator; constructing the Jacobian matrices of the base and the manipulator to describe the mathematical relationship between the motion of the manipulator and the motion of the base; determining the conservation relationship of the linear and angular momentum of the space robot in the absence of external forces, expressed through the inertia matrix and the coupling inertia matrix; and determining the generalized Jacobian matrix based on this conservation law and the Jacobian matrices, this matrix being related to both the kinematic and the dynamic parameters of the manipulator and used for the accurate computation in subsequent trajectory planning.
Defining the reinforcement learning six-tuple specifically comprises: designing a state space comprising the joint angle vector of the manipulator, the joint velocity vector, the position of the end effector, the velocity of the end effector, the target position and the distance between the end effector and the target, so as to capture information about the dynamics of the manipulator and the task; designing an action space expressed as the torque vector applied to the joints of the manipulator, each component of the torque vector being limited to a predetermined safety range to ensure the safety and validity of the manipulator actions; designing a state transition function that predicts the next state of the manipulator in the environment from the current state and the executed action, so that the agent can learn the effect of its behaviour on the environment; designing a reward function that evaluates the value of the actions taken by the agent based on the distance between the end effector and the target, the velocities of the manipulator and the base, and the task completion status, so as to encourage the agent to take actions that maximize the cumulative reward; introducing a discount factor for adjusting and balancing the influence of immediate and future rewards on the current decision, so as to balance short-term and long-term returns; and implementing the interaction of the agent with the environment, in which the replay buffers store the states, actions, rewards and new states generated by the interaction and are used to update the policy of the agent.
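By way of illustration only, the following Python sketch shows one way the state vector and the torque-limited action described above could be assembled; the function names, array layout and use of NumPy are assumptions of this sketch and are not prescribed by the embodiment.

```python
import numpy as np

def build_state(q, dq, p_ee, v_ee, p_base, v_base, p_target, floating_base=True):
    """Assemble the observation described above (illustrative layout)."""
    d = np.linalg.norm(np.asarray(p_ee) - np.asarray(p_target))   # end-effector-to-target distance
    parts = [q, dq, p_ee, v_ee]
    if floating_base:                                              # free-floating mode adds base states
        parts += [p_base, v_base]
    parts += [p_target, [d]]
    return np.concatenate([np.asarray(p, dtype=np.float64).ravel() for p in parts])

def clip_action(raw_action, max_torque):
    """Limit each joint torque to its predefined safe range."""
    max_torque = np.asarray(max_torque, dtype=np.float64)
    return np.clip(raw_action, -max_torque, max_torque)
```

In the fixed-base mode the base terms are simply omitted, matching the two state layouts described in the embodiment.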
The reward function comprises: a mechanism rewarding reduction of the distance between the end effector and the target point, so that the end effector executes actions that move it toward the target point; a mechanism that, when the end effector approaches the target point, prevents it from merely wandering around the target point, so as to ensure continuous progress toward the target; a mechanism rewarding the agent for smooth manipulator motion, reducing velocity fluctuations of the base and the end effector so as to optimize the smoothness of the trajectory; and a final reward mechanism activated when the distance between the end effector and the target point is smaller than or equal to a preset threshold, this final reward being proportional to the number of steps remaining when the task is completed, so as to motivate the agent to complete the task efficiently. Through the comprehensive evaluation provided by the reward function, the agent learns to take appropriate actions at the different stages of operation, including approaching the target, fine adjustment and task completion, thereby optimizing the whole teleoperation process.
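For illustration, a possible shaping of such a reward is sketched below in Python; the weights, the progress term and the default threshold are assumptions of this sketch, since the embodiment specifies the mechanisms only qualitatively.

```python
import numpy as np

def shaped_reward(d, d_prev, v_joint, v_base, step, max_steps,
                  d_threshold=0.025, w_dist=1.0, w_prog=1.0, w_smooth=0.01, w_final=1.0):
    """Illustrative reward: approach term, progress term, smoothness penalty, terminal bonus."""
    r = -w_dist * d                                   # pull the end effector toward the target
    r += w_prog * (d_prev - d)                        # reward actual progress, discouraging hovering
    r -= w_smooth * (np.sum(np.square(v_joint)) + np.sum(np.square(v_base)))  # smooth motion
    if d <= d_threshold:                              # terminal bonus proportional to steps remaining
        r += w_final * (max_steps - step)
    return r
```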
Step S2: based on the kinematics model and the reinforcement learning framework established in step S1, an overall teleoperation framework is constructed, comprising a master end, a data link and a slave end, wherein the data link transmits and processes data, the slave end executes the commands through which the space manipulator interacts with the environment and returns state information and rewards, and the master end performs real-time control and decision making for the manipulator operation.
In some embodiments, constructing the overall teleoperation framework specifically comprises: identifying the time delay problems in the teleoperation environment, including action delay, observation delay and reward delay, and their influence on the decision process of the agent; taking the round-trip delay (RTD) as the reference for measuring communication delay, ensuring that the time difference between the state observed at the master end and the actual state at the slave end is quantified and compensated; determining the ideal RTD value from the transmission distance so as to compensate the delay accurately at the master end; handling the random delay caused by environmental factors such as network congestion, and providing adaptive support for the decision making of the agent by modelling the statistical distribution of the random delay; and comprehensively considering constant and random delays and formulating a strategy to ensure that the agent can make timely and accurate decisions at each time step according to the observations of the master end.
Step S3: using the state information transmitted by the data link in the teleoperation framework of step S2, the delay information processing (DIP) module of the master end processes the current state of the master end to obtain the current state and the preceding rewards with the influence of the master-slave communication delay removed.
In some embodiments, the current state of the master end is processed by one or more of a mapping method, a prediction method and a state augmentation method. The mapping method adopts a memoryless strategy that ignores the delay and makes decisions by taking the most recently observed state as the true state of the environment. The prediction method trains a forward model with historical trajectory data. The state augmentation method converts the delayed Markov decision process into a delay-free Markov decision process by constructing an information state composed of the delayed state information and the historical action sequence.
The mapping method specifically comprises: adopting a memoryless strategy to process the state observed at the master end, ignoring the influence of the communication delay; taking the most recently observed state as the true state of the environment at the current moment for the decision process of the agent; identifying that an unobserved state exists when the observation at the master end is inconsistent with the expected state; and, when an unobserved state is identified, replacing the current state with the state at the previous moment and adjusting the immediate reward to reflect the replacement.
The prediction method specifically comprises: using a large amount of historical trajectory data obtained through interaction with the environment, consisting of tuples of state, action, reward and next state, and training a forward model by supervised learning to simulate the system dynamics; using the forward model as a nonlinear function that takes the current state and action as input and outputs the predicted reward and next state, so as to model the future behaviour of the system; using the last observed state and the historical action sequence to iteratively compute, via the forward model, the predicted current state and the rewards of the preceding states; and carrying out the decision process of the agent according to the predicted current state, so as to generate the corresponding action and guide the manipulator in the next operation.
The state augmentation method specifically comprises: constructing an information state that combines the delayed state information and the historical action sequence so as to emulate a delay-free decision environment; when defining the new state, combining the observed delayed state with a series of historical actions to form an information state representing the current environment state; adjusting the reward function by setting the immediate reward to zero when no delayed state is observed, so as to reflect the information loss caused by the delay; and modifying the initial state distribution to include the random action sequence adopted before any remote state update is received, so that the agent can make decisions with incomplete information.
Step S4: on the basis of the state with the delay influence removed in step S3, the DRL decision module of the master end continuously updates the agent according to the existing experience replay buffer, the action replay buffer and the current delay amount, acquires new states through interaction between the agent and the environment, and guides the manipulator to perform trajectory planning.
In some embodiments, the DRL decision module specifically comprises: designing an objective function for the master end that maximizes the cumulative reward, the function being based on the delayed state after DIP processing and on the distribution of previously sampled states and actions, i.e. the data in the replay buffer; introducing policy entropy into the objective function to encourage exploration by the agent and to control the randomness of the optimal policy, thereby avoiding local optima; using an optimization method based on the SAC algorithm, in which neural networks respectively approximate the state-action value function, the state value function and the policy function, so as to continuously update and optimize the policy; updating the parameters of the value function by minimizing the soft Bellman residual, ensuring the validity of the learning process and the accuracy of the policy; updating the parameters of the soft value function by minimizing a squared residual, improving the accuracy of the state value estimate; updating the parameters of the policy network by minimizing a KL divergence, optimizing the policy representation and enhancing its robustness; constructing the policy network using the reparameterization trick, allowing it to generate actions in the presence of noise and improving adaptability; and managing the replay buffer so that data that were not genuinely observed are not stored in the experience replay buffer.
Step S5: the delay influence of the slave-end data transmitted over the data link is removed by the delay information processing module, the agent is updated with the delay-corrected data and interacts with the environment, and the slave-end manipulator is gradually guided to complete the planning task. The agent generates the corresponding actions from the delay-corrected state via the policy network; the slave-end space manipulator interacts with the environment according to these actions to generate the next state and reward, which are then fed back to the master end, and the master end updates the agent according to the new state and reward.
In a preferred embodiment, the remote environment interaction module executes action instructions generated by the agent in the environment, collects environment feedback caused by the action, including new state information and rewards, and then transmits the feedback to the DRL decision module of the master.
The embodiment of the invention solves the trajectory planning problem of the space manipulator by applying DRL in a teleoperation scenario. The control process comprises a delay information processing module, a DRL decision module and a remote environment interaction module. The purpose of the delay information processing module is to generate a state useful for the current decision by considering both the current delayed state and the results of historical actions; to achieve this, three methods are designed: mapping, prediction and state augmentation. After the delay information has been processed, the decision module generates the corresponding action according to the resulting new state. The Soft Actor-Critic (SAC) algorithm is selected as the decision algorithm, generating a robust policy through cumulative-reward and maximum-entropy optimization. The remote environment interaction module runs in the MuJoCo simulation environment and includes four single-arm scenarios: fixed base and fixed target, fixed base and rotating target, free-floating base and fixed target, and free-floating base and rotating target. After receiving the torque command from the master end, the space manipulator at the slave end interacts with the environment to generate the next state and reward, which are then fed back to the master end.
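The following Python outline illustrates how one master-side control step could tie the three modules together; `dip`, `agent`, `remote_env` and `replay_buffer` are hypothetical interfaces standing in for the delay information processing module, the DRL decision module, the remote environment interaction module and the experience replay buffer, and their method names are assumptions of this sketch.

```python
def teleoperation_episode(remote_env, dip, agent, replay_buffer, max_steps=250):
    """One episode of the master-side loop: DIP -> SAC decision -> remote interaction."""
    obs, reward = remote_env.reset(), 0.0              # first delayed observation (may still be empty)
    for step in range(max_steps):
        state, prev_reward = dip.process(obs, reward)  # remove the delay influence at the master end
        action = agent.act(state)                      # policy network generates the torque command
        obs, reward, done = remote_env.step(action)    # slave arm acts; feedback returns via the data link
        if dip.is_authentic(obs):                      # store only genuinely observed transitions
            replay_buffer.add(state, action, prev_reward, obs)
        agent.update(replay_buffer)
        if done:
            break
```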
Algorithm implementation examples and experimental verification of specific embodiments of the present invention are further described below.
The trajectory planning method for a teleoperated space manipulator based on deep reinforcement learning comprises the following steps:
S1, establishing a rigid-body kinematics model of a floating-base space robot equipped with a rigid manipulator, and defining the specific form of the reinforcement learning six-tuple (state, action, state transition function, reward function, initial state distribution and discount factor);
S2, establishing an overall teleoperation framework comprising a master end, a data link and a slave end, wherein the data link handles data transmission and processing, the slave end mainly executes the commands through which the space manipulator interacts with the environment and then returns the state information and rewards generated by the environment, and the master end comprises a delay information processing (Delay Information Processing, DIP) module and a DRL decision module;
S3, defining the delay processing module, for which three methods, namely mapping, prediction and state augmentation, may be used; the module processes the current state of the master end to obtain the current state free of delay influence and the preceding rewards;
S4, defining the DRL decision module, which continuously updates the agent according to the existing experience replay buffer, the action replay buffer and the current delay amount, and obtains new states through interaction between the agent and the environment;
S5, removing the delay influence of the slave-end data transmitted over the data link by means of the delay processing module, updating the agent with the delay-corrected data, and interacting with the environment so as to gradually guide the slave end to complete the planning task.
The processing methods of the delay processing module in S3 fall into the following three categories:
S3.1, the mapping method adopts a memoryless strategy that ignores the delay and takes the most recently observed state as the true state of the environment when making decisions;
S3.2, the prediction method trains a forward model with historical trajectory data;
S3.3, the state augmentation method converts the delayed Markov decision process into a delay-free Markov decision process by constructing an information state consisting of the delayed state information and the historical action sequence.
Space robot kinematics modeling
The space robot system generally consists of a spacecraft (base) and a multi-degree-of-freedom (DOF) manipulator, as shown in Fig. 2. The joint angle vector of the manipulator can be expressed as $q = [\theta_1, \theta_2, \ldots, \theta_n]^T$ and the corresponding joint velocity vector as $\dot{q}$. In addition, $(v_i, \omega_i)^T$ ($i = b, e$) denotes the linear and angular velocity of the base and of the manipulator end, respectively:

$$\begin{bmatrix} v_e \\ \omega_e \end{bmatrix} = J_b \begin{bmatrix} v_b \\ \omega_b \end{bmatrix} + J_m \dot{q}, \qquad (1)$$

where $J_b$ and $J_m$ denote the Jacobian matrices of the base and of the manipulator, respectively. In the absence of external forces and moments, the linear momentum $P$ and the angular momentum $L$ of the space robot are conserved:

$$\begin{bmatrix} P \\ L \end{bmatrix} = H_b \begin{bmatrix} v_b \\ \omega_b \end{bmatrix} + H_{bm} \dot{q}, \qquad (2)$$

where $H_b$ and $H_{bm}$ are the inertia matrix and the coupling inertia matrix, respectively. Further, assuming that the initial linear and angular velocities are zero, the base velocity can be expressed as

$$\begin{bmatrix} v_b \\ \omega_b \end{bmatrix} = J_{bm} \dot{q}, \qquad J_{bm} = -H_b^{-1} H_{bm}, \qquad (3)$$

where $J_{bm}$ denotes the base Jacobian matrix. Substituting equation (3) into equation (1) yields

$$\begin{bmatrix} v_e \\ \omega_e \end{bmatrix} = J_g \dot{q}, \qquad J_g = J_m + J_b J_{bm}, \qquad (4)$$

where $J_g$ denotes the generalized Jacobian matrix, which is related to both the kinematic and the dynamic parameters.
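As a numerical illustration of the relations above, the short Python sketch below assembles the generalized Jacobian from the base and manipulator Jacobians and the (coupling) inertia matrices under the zero-initial-momentum assumption; the function names and the use of NumPy are assumptions of the sketch.

```python
import numpy as np

def generalized_jacobian(J_b, J_m, H_b, H_bm):
    """Assemble J_g = J_m + J_b * J_bm with J_bm = -H_b^{-1} H_bm (zero initial momentum)."""
    J_bm = -np.linalg.solve(H_b, H_bm)    # base Jacobian induced by momentum conservation
    return J_m + J_b @ J_bm               # generalized Jacobian matrix

def end_effector_twist(J_g, dq):
    """Linear and angular velocity of the end effector for a given joint velocity vector."""
    return J_g @ dq
```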
Reinforcement learning
Reinforcement learning is a sequential decision process based on the Markov decision process (Markov Decision Process, MDP) framework. Typically, an MDP is defined by the six-tuple $(\mathcal{S}, \mathcal{A}, P, R, \rho_0, \gamma)$, whose elements denote the state space, the action space, the state transition function, the reward function, the initial state distribution and the discount factor for future rewards, respectively. After the agent obtains the current state $s_t$, it selects an action $a_t \sim \pi(\cdot \mid s_t)$ according to the current policy $\pi$ and interacts with the environment. The environment then returns a new state $s_{t+1}$ and a reward $r_t = R(s_t, a_t)$. The goal of reinforcement learning is to maximize the cumulative reward so as to obtain the optimal policy.
Since the information contained in the state is of vital importance to the agent's decisions, as many relevant features affecting the decision as possible must be considered when designing the state. For the trajectory planning task of the space robot, especially in the floating-base mode, the nonholonomic constraint between the base and the manipulator affects the planning of the manipulator. Thus, for the fixed-base mode the state can be set as

$$s_t = \left[\, q,\; \dot{q},\; p_e,\; v_e,\; p_g,\; d_t \,\right], \qquad (5)$$

where the items denote, respectively, the joint angle vector, the joint velocity vector, the manipulator end position, the manipulator end velocity, the target position and the distance from the end to the target. For the free-floating mode the state can be set as

$$s_t = \left[\, q,\; \dot{q},\; p_e,\; v_e,\; p_b,\; v_b,\; p_g,\; d_t \,\right], \qquad (6)$$

where the items denote, respectively, the joint angle vector, the joint velocity vector, the manipulator end position, the manipulator end velocity, the base centroid position, the base centroid velocity, the target position and the distance from the end to the target.
The action is designed as the set of torques applied to the manipulator joints, limited to a bounded range: $a_t = [a_{t1}, a_{t2}, \ldots, a_{tn}]$ with $a_{ti} \in [-\max(\text{torque}_i), \max(\text{torque}_i)]$, $i = 1, \ldots, n$, where $\text{torque}_i$ denotes the torque acting on the $i$-th joint.
The reward function, given in equation (7), comprises five terms; the end-of-episode condition is that the distance $d_t$ is smaller than the threshold $d_{threshold}$. Each term in equation (7) has a different purpose. The first term drives the end effector of the manipulator as close as possible to the target point. The second term addresses the situation in which, as the distance approaches the threshold, the end effector wanders around the target point without further progress, and ensures that it continues to advance toward the target. The third and fourth terms mitigate excessive velocity fluctuations of the manipulator base, end effector and joints during trajectory planning, facilitating smooth transitions of these variables from the reward perspective. Finally, the fifth term acts as a final bonus that is activated only when the distance is smaller than or equal to the threshold; the size of this bonus is proportional to the number of steps remaining at the end of the episode, so the faster the agent completes the task, the larger this term becomes.
Teleoperation time delay problem
As shown in Fig. 3, in a space teleoperation environment the Markov property of the original decision process is destroyed by the time delay introduced by communication, which seriously degrades the performance of reinforcement learning algorithms. Specifically, a delay arises before the command issued by the master end reaches the slave end, i.e. the action delay. Meanwhile, when the slave-end space robot interacts with the environment, the feedback from the environment (including the new state and the reward) is not delivered to the master end in time, resulting in observation delay and reward delay. Since the observation delay and the action delay have an equivalent effect on the decision process of the agent, this embodiment focuses only on the observation delay and sets its value equal to the round-trip delay (Round Time Delay, RTD). Here, the embodiment of the invention denotes the slave-end state at time $t$ by $s_t$ and the master-end observation by $x_t$.
Ideally, the RTD remains constant, since it is entirely determined by the transmission distance. At the slave end the robot state changes over time, whereas at the master end, because of the RTD, the operator's observation lags the slave end by RTD time steps. Taking an RTD of 1 as an example: after the whole system starts, the slave end sends out the first state, but the master observation is still empty, and at subsequent times the master observation lags the slave end by one time step, which can be expressed as $x_t = s_{t-1}\,\delta(t \ge 1) + \phi\,\delta(t < 1)$, where $\phi$ denotes the empty observation and $\delta(\cdot)$ is the Dirac (indicator) function. In addition, during transmission, environmental factors such as network congestion often make the time delay random. On top of the RTD, we assume that the random delay follows a uniform distribution with parameter $\zeta$ to model these random factors, denoted $d_{random} \sim \mathrm{Uniform}(\zeta)$. The total communication delay is therefore $d_{total}(t) = d_{RTD} + d_{random}(t)$. The focus of the invention is to make a decision at each time step based on the master-end observation, whether the delay is constant or random.
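A simplified simulation of this delayed observation channel is sketched below in Python; the empty observation is represented by `None`, the random component is drawn from a discrete uniform distribution on {0, ..., ζ}, and all names are assumptions of the sketch.

```python
import random

class DelayedChannel:
    """Master-side observation x_t = s_{t - d_total(t)}, or the empty observation if nothing has arrived."""
    PHI = None                                        # empty observation before the first state arrives

    def __init__(self, d_rtd=1, zeta=0):
        self.d_rtd = d_rtd                            # constant round-trip delay (time steps)
        self.zeta = zeta                              # range of the random extra delay
        self.history = []                             # states emitted by the slave end so far

    def receive(self, slave_state):
        """Record the newest slave state and return what the master observes at this step."""
        self.history.append(slave_state)
        d_total = self.d_rtd + (random.randint(0, self.zeta) if self.zeta > 0 else 0)
        idx = len(self.history) - 1 - d_total
        return self.history[idx] if idx >= 0 else self.PHI
```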
DIP module processing method
The embodiment of the invention constructs the DIP module using three methods, namely mapping, prediction and state augmentation, each defined with respect to the delay value observed at the master end.
1) Mapping
The mapping method employs a memoryless strategy that ignores the delay and treats the most recently observed state as the true state of the environment when making decisions. When $x_t$ differs from $\phi$, the master end has observed the current state; but if $x_t$ equals $\phi$, indicating that the master end has an unobserved state, the current state is replaced with the state of the previous moment and the immediate reward is set to 0. Fig. 4 gives a simple example.
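A minimal Python sketch of this mapping strategy is given below; the class and method names are assumptions of the sketch, and the empty observation is again represented by `None`.

```python
class MappingDIP:
    """Memoryless mapping: reuse the previous state when the current observation is missing."""
    PHI = None

    def __init__(self):
        self.last_state = None

    def process(self, x_t, r_t):
        if x_t is self.PHI:                 # unobserved state: fall back to the previous state
            return self.last_state, 0.0     # and set the immediate reward to zero
        self.last_state = x_t               # an observed state is taken as the true state
        return x_t, r_t
```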
2) Prediction
The prediction method involves training a forward model $f$ using historical trajectory data. The method trains the nonlinear function $f$ by supervised learning on a large number of tuples $(s_t, a_t, r_t, s_{t+1})$ obtained through interaction with the environment. As shown in Fig. 5, the inputs are $s_t$ and $a_t$ and the outputs are $r_t$ and $s_{t+1}$, i.e. $s_{t+1}, r_t = f(s_t, a_t)$, which effectively models the system dynamics. After the forward model has been trained, the method uses the last observed state and the historical action sequence to iteratively compute the predicted current state and the intermediate rewards, and makes the final decision according to the predicted current state.
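The prediction step can be sketched as follows in Python, assuming a trained forward model with the signature `forward_model(state, action) -> (next_state, reward)`; the function and argument names are assumptions of the sketch.

```python
import numpy as np

def predict_current_state(forward_model, last_obs_state, pending_actions):
    """Roll the learned forward model over the actions issued since the last observed state."""
    state, rewards = np.asarray(last_obs_state), []
    for a in pending_actions:               # one model step per not-yet-observed action
        state, r = forward_model(state, a)
        rewards.append(r)
    return state, rewards                   # predicted current state and intermediate rewards
```

As noted above, the accumulated error of this iterative rollout grows with the delay, which is why the final decision quality degrades for large delays.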
3) State augmentation
The state augmentation method constructs an information state consisting of the delayed state information and the historical action sequence, converting the delayed MDP into a delay-free MDP, defined as follows. For $t \ge d_{RTD}$, the new state is the information state formed by the most recently observed delayed state together with the actions issued since that state was generated,

$$I_t = \left( x_t,\; a_{t-d_{RTD}},\; \ldots,\; a_{t-1} \right),$$

while the action space is not adjusted. For the reward function, when $x_t \ne \phi$ the new reward is $r_{t-1} = r(x_{t-1}, a_{t-1})$, where $a_{t-1}$ results from the action taken at the past time step; when $x_t = \phi$, the corresponding reward is constructed from the actions taken at past time steps. For the initial state distribution $\rho$, the modified form supplements the original initial state distribution $\rho(s_0)$ with the random actions taken before any remote state has been received.
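An illustrative Python construction of such an information state is given below; the zero placeholder used before the first remote state arrives, the fixed delay length and all names are assumptions of the sketch.

```python
import numpy as np
from collections import deque

class AugmentedStateDIP:
    """Information state: last delayed observation concatenated with the recent action sequence."""

    def __init__(self, delay, state_dim, action_dim):
        self.actions = deque([np.zeros(action_dim)] * delay, maxlen=delay)
        self.last_state = np.zeros(state_dim)   # placeholder until the first remote state arrives

    def process(self, x_t, r_t):
        if x_t is not None:
            self.last_state, reward = np.asarray(x_t), r_t    # delayed state observed
        else:
            reward = 0.0                                      # nothing observed yet in this sketch
        info_state = np.concatenate([self.last_state.ravel()] +
                                    [np.ravel(a) for a in self.actions])
        return info_state, reward

    def record(self, action):
        self.actions.append(np.asarray(action))               # action just sent to the slave end
```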
Decision module processing method
The objective function of the master end is designed as

$$J(\pi) = \sum_{t} \mathbb{E}_{(x_t, a_t) \sim \mathcal{D}} \left[ r(x_t, a_t) \right], \qquad (12)$$

where $x_t$ denotes the delayed state after DIP processing and $\mathcal{D}$ denotes the distribution of previously sampled states and actions, i.e. the replay buffer. Equation (12) yields an optimal policy by maximizing the objective function. However, end-to-end trajectory planning for a space robot poses significant challenges owing to its complexity and the high dimensionality of the state space. Furthermore, optimizing equation (12) yields a deterministic policy that tends to fall into local optima, so it is difficult to find policy parameters that achieve a high cumulative reward. To solve this problem, equation (12) is modified as

$$J(\pi) = \sum_{t} \mathbb{E}_{(x_t, a_t) \sim \mathcal{D}} \left[ r(x_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid x_t)\big) \right], \qquad (13)$$

where $\mathcal{H}(\pi(\cdot \mid x_t))$ denotes the entropy of the policy and $\alpha$ determines the relative importance of the entropy term with respect to the reward term. Equation (13) not only encourages the agent to explore more efficiently but also controls the randomness of the optimal policy.
In the present embodiment, equation (13) is optimized based on the SAC algorithm. As in the conventional SAC algorithm, three neural networks $Q_\theta(x_t, a_t)$, $V_\psi(x_t)$ and $\pi_\phi(a_t \mid x_t)$ approximate the state-action value function, the state value function and the policy function, respectively. The parameter $\theta$ of the value function is updated by minimizing the soft Bellman residual

$$J_Q(\theta) = \mathbb{E}_{(x_t, a_t) \sim \mathcal{D}} \left[ \tfrac{1}{2} \Big( Q_\theta(x_t, a_t) - \big( r(x_t, a_t) + \gamma\, \mathbb{E}_{x_{t+1}} \left[ V_{\bar{\psi}}(x_{t+1}) \right] \big) \Big)^{2} \right], \qquad (14)$$

where $V_{\bar{\psi}}$ denotes the target value network corresponding to $V_\psi$, updated by an exponential moving average of the value network weights. The parameter $\psi$ of the soft value function is updated by minimizing the squared residual

$$J_V(\psi) = \mathbb{E}_{x_t \sim \mathcal{D}} \left[ \tfrac{1}{2} \Big( V_\psi(x_t) - \mathbb{E}_{a_t \sim \pi_\phi} \left[ Q_\theta(x_t, a_t) - \log \pi_\phi(a_t \mid x_t) \right] \Big)^{2} \right]. \qquad (15)$$

The parameters $\phi$ of the policy network are updated by minimizing the KL divergence

$$J_\pi(\phi) = \mathbb{E}_{x_t \sim \mathcal{D}} \left[ D_{KL} \left( \pi_\phi(\cdot \mid x_t) \,\Big\|\, \frac{\exp\!\big(Q_\theta(x_t, \cdot)\big)}{Z_\theta(x_t)} \right) \right], \qquad (16)$$

where $Z_\theta(x_t)$ is a partition function, which can be neglected. The method minimizes equation (16) with the reparameterization trick and constructs the policy network as

$$a_t = f_\phi(\epsilon_t; x_t), \qquad (17)$$

where $\epsilon_t$ is a noise vector. Thus, equation (16) can be rewritten as

$$J_\pi(\phi) = \mathbb{E}_{x_t \sim \mathcal{D},\, \epsilon_t \sim \mathcal{N}} \left[ \log \pi_\phi\big( f_\phi(\epsilon_t; x_t) \mid x_t \big) - Q_\theta\big( x_t, f_\phi(\epsilon_t; x_t) \big) \right]. \qquad (18)$$
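For illustration, the reparameterized policy update of equation (18) might be computed as in the following PyTorch sketch; the interfaces `policy_net(states) -> (mean, log_std)` and `q_net(states, actions)` returning a batch of Q-values, as well as the temperature argument `alpha` (corresponding to the entropy weight in equation (13)), are assumptions of the sketch.

```python
import torch

def policy_loss(policy_net, q_net, states, alpha=1.0):
    """Reparameterized SAC policy loss: E[alpha * log pi(a|x) - Q(x, a)] over a batch of states."""
    mean, log_std = policy_net(states)                         # assumed Gaussian policy parameters
    std = log_std.exp()
    eps = torch.randn_like(mean)                               # noise vector epsilon_t
    pre_tanh = mean + std * eps                                # a_t = f_phi(eps_t; x_t) before squashing
    action = torch.tanh(pre_tanh)                              # bounded action
    log_prob = torch.distributions.Normal(mean, std).log_prob(pre_tanh).sum(-1)
    log_prob -= torch.log(1.0 - action.pow(2) + 1e-6).sum(-1)  # tanh squashing correction
    q_value = q_net(states, action)                            # assumed to return a (batch,) tensor
    return (alpha * log_prob - q_value).mean()
```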
The structure of the replay buffer differs from that of a conventional SAC. In the conventional method, the quadruple generated at each step is stored in the experience replay buffer. However, in a teleoperation environment the master end does not necessarily obtain the real state. To ensure the reliability of the experience replay data, non-authentic data should not be stored in the experience replay buffer.
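A minimal sketch of such a buffer that keeps only authentic transitions is given below; the capacity, the method names and the authenticity flag are assumptions of the sketch.

```python
import random

class AuthenticReplayBuffer:
    """Stores only transitions whose states were genuinely observed at the master end."""

    def __init__(self, capacity=100000):
        self.data, self.capacity = [], capacity

    def add(self, state, action, reward, next_state, authentic=True):
        if not authentic:                    # skip transitions built from unobserved (substituted) states
            return
        if len(self.data) >= self.capacity:
            self.data.pop(0)
        self.data.append((state, action, reward, next_state))

    def sample(self, batch_size):
        return random.sample(self.data, min(batch_size, len(self.data)))
```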
Validity verification
Based on the seven-degree-of-freedom redundant manipulator model shown in Fig. 6, the invention builds a single-arm space robot model in the MuJoCo environment and applies it to four environments, fixed base with fixed target, fixed base with rotating target, floating base with fixed target and floating base with rotating target (denoted SAFBFT, SAFBRT, SAUBFT and SAUBRT, respectively). In the simulation environment the simulation time step is $T_m = 0.01$ s, an action is executed every 4 time steps, and the maximum number of simulation steps per episode is $T = 250$, so the maximum simulation duration is 10 s. The embodiment of the invention performs random grasping tasks in a teleoperation mode, aiming to position the end effector of the space manipulator within 5 mm of the target point; considering that the diameter of the end effector is 0.02 m, $d_{threshold} = 0.025$ m.
To enhance the generalization capability of the model, during training the embodiment of the invention samples the target position uniformly within a cubic region of the manipulator workspace, and at the same time noise is introduced into the initial joint angles and angular velocities of the manipulator.
In the embodiment of the invention, experiments are performed in the four environments without delay ($d_{RTD} = 0$), and with fixed delays $d_{RTD} = 1, 2, 4, 6, 8$ as well as the corresponding random delays ($\zeta = 2$), in order to evaluate the effect of the algorithm.
In terms of algorithm performance, as shown in Figs. 7, 8 and 9, the analysis considers return, success rate and time/memory consumption, and the experiments lead to the following conclusions:
1) In all four environments, whether the delay is constant or random, the performance of the three delay processing methods deteriorates as the delay increases; owing to the unpredictable compound effect of information delay and information loss, for the same $d_{RTD}$ a constant delay gives better results than a random delay;
2) Even when the task is completed successfully, the time required increases with increasing delay, resulting in a lower return, as shown in Fig. 7;
3) Comparing the three DIP algorithms: state augmentation always shows the most stable performance, demonstrating the feasibility and effectiveness of converting the delayed Markov decision process into a delay-free one; mapping works well in small-delay scenarios, such as the fixed-base cases and the floating-base cases with small delay, but its effectiveness decreases when the state changes drastically or the delay is high; prediction requires accurate forward dynamics, and the accumulated error of iterative prediction has a large influence, so its effect is the worst;
4) As shown in Fig. 9, the state augmentation approach balances efficiency and performance most effectively.
In terms of robustness, the experimental data shown in Figs. 10 and 11 support the following conclusions:
1) Noise interference significantly affects the decisions of an agent trained in the ideal environment, and the greater the probability of noise occurrence, the greater the impact on the decisions;
2) Although action noise may cause the executed action to deviate from the accurate decision made on the basis of the accurate state, leading to a partial deviation from the ideal-environment policy, the impact is relatively limited;
3) State noise can cause the observed state to deviate significantly from the exact state, so the policy derived from the observed states in the new environment is no longer optimal compared with that in the ideal environment, and continuing to use the original policy further degrades performance;
4) Only in low-probability noise scenarios or fixed-base scenarios, where the new environment is very similar to the ideal environment, does the original policy remain valid.
Compared with the prior art, the main advantages of the embodiments of the invention are as follows:
1) The invention integrates deep reinforcement learning into the traditional teleoperation framework to complete the corresponding trajectory planning tasks;
2) The invention enhances decision-making capability by exploiting delayed state information and a historical action buffer, and further provides three methods for constructing the delay processing module, namely mapping, prediction and state augmentation;
3) The invention enhances the decision-making capability of the agent and ensures the adaptability of the algorithm in environments characterized by inherent delay;
4) The proposed method is effective for the various scenarios composed of whether the base floats and whether the target rotates;
5) The proposed method is robust under different noise or dynamic parameter conditions without requiring parameter tuning.
The embodiments of the present invention also provide a storage medium storing a computer program which, when executed, performs at least the method as described above.
The embodiment of the invention also provides a control device which comprises a processor and a storage medium for storing a computer program, wherein the processor is used for executing at least the method when executing the computer program.
The embodiments of the present invention also provide a processor executing a computer program, at least performing the method as described above.
The storage medium may be implemented by any type of non-volatile storage device or a combination thereof. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferroelectric Random Access Memory (FRAM), a flash memory, a magnetic surface memory, an optical disk, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be a magnetic disk memory or a magnetic tape memory. The storage media described in the embodiments of the present invention are intended to comprise, without being limited to, these and any other suitable types of memory.
In the several embodiments provided by the present invention, it should be understood that the disclosed systems and methods may be implemented in other ways. The above-described apparatus embodiments are merely illustrative; for example, the division of the units is merely a logical functional division, and there may be other divisions in actual implementation, e.g. multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units, may be located in one place, may be distributed on a plurality of network units, and may select some or all of the units according to actual needs to achieve the purpose of the embodiment.
In addition, each functional unit in each embodiment of the present invention may be integrated in one processing unit, or each unit may be separately used as a unit, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of hardware plus a form of software functional unit.
It will be appreciated by those of ordinary skill in the art that all or part of the steps of the above method embodiments may be implemented by hardware associated with program instructions; the above program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments; the storage medium includes various media that can store program code, such as a removable storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
Or the above-described integrated units of the invention may be stored in a computer-readable storage medium if implemented in the form of software functional modules and sold or used as separate products. Based on such understanding, the technical solutions of the embodiments of the present invention may be embodied in essence or a part contributing to the prior art in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the methods described in the embodiments of the present invention. The storage medium includes various media capable of storing program codes such as a removable storage device, a ROM, a RAM, a magnetic disk or an optical disk.
The methods disclosed in the method embodiments provided by the invention may be combined arbitrarily, provided there is no conflict, to obtain new method embodiments.
The features disclosed in the several product embodiments provided by the invention may be combined arbitrarily, provided there is no conflict, to obtain new product embodiments.
The features disclosed in the method or apparatus embodiments provided by the invention may be combined arbitrarily, provided there is no conflict, to obtain new method or apparatus embodiments.
The foregoing is a further detailed description of the invention in connection with the preferred embodiments, and the invention is not to be considered limited to the specific embodiments described. It will be apparent to those skilled in the art that several equivalent substitutions and obvious modifications can be made without departing from the spirit of the invention, and these should all be considered to fall within the scope of the invention.

Claims (10)

1. A teleoperated space manipulator trajectory planning method based on deep reinforcement learning, characterized by comprising the following steps:
S1, establishing a rigid-body kinematics model of a floating-base space robot carrying a rigid manipulator, the model comprising the joint angle vector and joint velocity vector of the manipulator and the Jacobian matrices of the base and the manipulator, and defining a reinforcement learning six-element tuple comprising the state, the action, the state transition function, the reward function, the initial state distribution and the discount factor, thereby providing the kinematic parameters and the reinforcement learning framework for manipulator trajectory planning, wherein the reinforcement learning framework is used for the decision-making process of the agent interacting with the environment;
S2, constructing an overall teleoperation framework comprising a master end, a data link and a slave end, wherein the data link is used for transmitting and processing data, the slave end is used for executing interaction commands between the space manipulator and the environment and returning state information and rewards, and the master end is used for real-time control and decision-making of the manipulator operation;
S3, according to the state information transmitted over the data link, processing the current state at the master end with a delay processing (DIP) module to obtain the current state and the preceding rewards with the influence of the communication delay between the master end and the slave end removed;
S4, based on the state with the delay influence removed, continuously updating the agent with the deep reinforcement learning (DRL) decision module of the master end according to the existing experience replay pool, the action replay pool and the current delay amount, acquiring new states through the interaction of the agent with the environment, and guiding the manipulator in trajectory planning;
S5, removing the delay influence from the slave-end data transmitted over the data link with the delay processing module, updating the agent with the delay-compensated data and interacting with the environment, and progressively guiding the slave-end manipulator to complete the planning task, wherein the agent generates the corresponding action from the delay-compensated state and the policy network, the slave-end space manipulator interacts with the environment according to the action to produce the next state and reward and feeds them back to the master end, and the master end updates the agent according to the new state and reward.
2. The teleoperated space manipulator trajectory planning method based on deep reinforcement learning of claim 1, wherein in step S1, establishing the rigid-body kinematics model of the floating-base space robot carrying a rigid manipulator specifically comprises:
defining the joint angle vector of the manipulator in the space robot system to represent the positional state of the manipulator at any moment;
defining the joint velocity vector corresponding to the joint angle vector to represent the dynamic motion state of the manipulator;
constructing the Jacobian matrices of the base and the manipulator to describe the mathematical relationship between the motion of the manipulator and the motion of the base;
determining the conservation relations of the linear momentum and the angular momentum of the space robot under zero external force, expressed by the inertia matrix and the coupling inertia matrix;
determining, based on the conservation laws and the Jacobian matrices, a generalized Jacobian matrix that relates the kinematic and dynamic parameters of the manipulator and is used for the accurate computation of subsequent trajectory planning.
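As a worked illustration of the quantities named in this claim, the momentum-conservation relation and the resulting generalized Jacobian can be written as follows. This is the standard free-floating formulation; the symbols H_b (base inertia matrix), H_bm (coupling inertia matrix), J_b and J_m (base and manipulator Jacobians) are notational assumptions, not the specification's own symbols:

```latex
% Zero external force/torque: linear and angular momentum are conserved (assumed initially zero)
H_b \dot{x}_b + H_{bm} \dot{\theta} = 0
% End-effector velocity as a function of base motion and joint motion
\dot{x}_e = J_b \dot{x}_b + J_m \dot{\theta}
% Eliminating the base velocity \dot{x}_b yields the generalized Jacobian J^{*}
\dot{x}_e = \left( J_m - J_b H_b^{-1} H_{bm} \right) \dot{\theta} \triangleq J^{*} \dot{\theta}
```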
3. The teleoperated space manipulator trajectory planning method based on deep reinforcement learning of claim 1 or 2, wherein in step S1, defining the reinforcement learning six-element tuple specifically comprises:
designing the state space to comprise the joint angle vector and joint velocity vector of the manipulator, the position and velocity of the end effector, the target position, and the distance between the end effector and the target, so as to capture the information relevant to the dynamics and the task of the manipulator;
designing the action space as the torque vector applied to the joints of the manipulator, each component of the torque vector being limited to a preset safety range, so as to ensure the safety and effectiveness of the manipulator's actions;
designing the state transition function, which predicts the next state of the manipulator in the environment based on the current state and the executed action and is used by the agent to learn the influence of its behavior on the environment;
designing the reward function, which evaluates the value of the action taken based on the distance between the end effector and the target, the velocities of the manipulator and the base, and the completion of the task, so as to motivate the agent to take actions that maximize the cumulative reward;
introducing the discount factor for adjusting and balancing the influence of immediate and future rewards on the current decision, so as to balance short-term and long-term benefits;
implementing an experience replay mechanism that stores the sequences of states, actions, rewards and new states generated during the interaction of the agent with the environment, so that the agent can learn and refine its policy from historical data.
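A minimal sketch of the experience replay mechanism described in the last feature of this claim; the class name `ReplayBuffer`, the default `capacity` and the tuple layout are illustrative assumptions, not taken from the specification:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) transitions for off-policy learning."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are discarded first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```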
4. The teleoperated space manipulator trajectory planning method based on deep reinforcement learning of claim 3, wherein the reward function comprises: a mechanism that rewards a decreasing distance between the end effector and the target point, so that the agent performs actions moving the end effector toward the target point; a reward for hovering near the target once the end effector approaches the target point, to ensure continuous progress toward the target; a reward that encourages the agent to smooth the motion of the manipulator and to reduce velocity fluctuations of the base and the end effector, so as to optimize trajectory smoothness; and a final reward mechanism activated when the distance between the end effector and the target point is less than or equal to a preset threshold, whose value is proportional to the number of steps remaining when the task is completed, so as to encourage the agent to complete the task efficiently; through the comprehensive evaluation of the reward function, the agent learns to take appropriate actions at the different stages of operation, including approaching the target, fine tuning and task completion, thereby optimizing the whole teleoperation process.
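A minimal sketch of a reward with the shape described above; the weights (`w_dist`, `w_hover`, `w_smooth`, `w_final`), the hover radius and the completion threshold are illustrative assumptions, since the claim fixes the structure of the reward but not its numerical values:

```python
import numpy as np

def reward(dist, prev_dist, base_vel, ee_vel, steps_left, done_thresh=0.05,
           w_dist=1.0, w_hover=0.5, w_smooth=0.1, w_final=10.0):
    """Shaped reward: approach the target, hover near it, stay smooth, terminal bonus."""
    r = w_dist * (prev_dist - dist)              # reward reducing the distance to the target
    if dist < 5 * done_thresh:
        r += w_hover                             # bonus for staying close to the target
    r -= w_smooth * (np.linalg.norm(base_vel) + np.linalg.norm(ee_vel))  # penalize velocity fluctuations
    if dist <= done_thresh:
        r += w_final * steps_left                # terminal reward proportional to the remaining steps
    return r
```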
5. The teleoperated space manipulator trajectory planning method based on deep reinforcement learning of any one of claims 1 to 4, wherein constructing the overall teleoperation framework in step S2 specifically comprises:
identifying the time-delay problems in the teleoperation environment, including action delay, observation delay and reward delay, and their impact on the decision-making process of the agent;
taking the round-trip delay (RTD) as the reference for measuring the communication delay, ensuring that the time difference between the state observed at the master end and the actual state at the slave end is quantified and compensated;
determining the value of the RTD under ideal conditions from the transmission distance, so as to perform delay compensation at the master end;
handling the random delay caused by environmental factors, and providing adaptive support for the agent's decision-making by simulating the statistical distribution of the random delay;
taking both the constant delay and the random delay into account, and developing a strategy that enables the agent to make a decision at each time step based on the observation at the master end.
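A minimal sketch of the delay model implied here; the light-speed propagation term and the lognormal jitter distribution are assumptions for illustration, since the claim only states that the ideal RTD follows from the transmission distance and that the random delay is modeled statistically:

```python
import numpy as np

C = 299_792_458.0  # speed of light, m/s

def round_trip_delay(distance_m, rng, jitter_mean=0.05, jitter_sigma=0.5):
    """Ideal RTD from the transmission distance plus a random component from environmental factors."""
    constant_rtd = 2.0 * distance_m / C                            # up-link + down-link propagation time
    random_rtd = rng.lognormal(np.log(jitter_mean), jitter_sigma)  # assumed jitter distribution
    return constant_rtd + random_rtd

rng = np.random.default_rng(0)
print(round_trip_delay(3.6e7, rng))  # e.g. a GEO-like distance of 36,000 km
```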
6. The teleoperated space manipulator trajectory planning method based on deep reinforcement learning of any one of claims 1 to 5, wherein in step S3 the current state of the master end is processed by one or more of a mapping method, a prediction method and a state enhancement method;
the mapping method adopts a memoryless strategy, ignores the delay, and takes the latest observed state as the real state of the environment when making decisions;
the prediction method trains a forward model on historical trajectory data;
the state enhancement method converts the delayed Markov decision process into a delay-free Markov decision process by constructing an information state composed of the delayed state information and the historical action sequence.
7. The teleoperated space manipulator trajectory planning method based on deep reinforcement learning of claim 6, wherein in step S3, the mapping method specifically comprises: processing the state observed at the master end with a memoryless strategy and ignoring the delay effect caused by the communication delay; regarding the latest observed state as the real state of the environment at the current time for the decision-making process of the agent; identifying and confirming that a non-observed state exists when the state observed at the master end is inconsistent with the expected state; and, when a non-observed state is identified, replacing the current state with the state at the previous time and adjusting the immediate reward to reflect the replacement;
the prediction method specifically comprises: using a large amount of historical trajectory data obtained through interaction with the environment, including tuples of state, action, reward and next state; training a forward model by supervised learning to simulate the system dynamics, the forward model being a nonlinear function whose inputs are the current state and action and whose outputs are the predicted reward and next state, used for modeling the future behavior of the system; iteratively computing, with the forward model, the predicted current state and the reward of the previous state from the last observed state and the historical action sequence; and carrying out the decision-making process of the agent on the predicted current state, so as to generate the corresponding action and guide the manipulator in the next operation;
the state enhancement method specifically comprises: constructing an information state that combines the delayed state information with the historical action sequence to simulate a delay-free decision-making environment; when defining the new state, combining the observed delayed state with a series of historical actions to form an information state representing the current state of the environment; adjusting the reward function by setting the immediate reward to zero when a delayed state is observed, so as to reflect the information loss caused by the delay; and correcting the initial state distribution to include the random action sequence adopted while no remote state update has been received, so that the agent can make decisions under incomplete information.
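A minimal sketch of the information-state construction used by the state enhancement method above; the state and action dimensions, the fixed delay horizon `d` and the zero-filled pre-update action history are illustrative assumptions:

```python
import numpy as np
from collections import deque

def make_information_state(delayed_state, action_history):
    """Concatenate the last delayed observation with the buffered action sequence."""
    return np.concatenate([delayed_state, np.concatenate(action_history)])

# Usage: with a delay of d control steps, keep the d most recent actions.
d = 3
action_history = deque([np.zeros(7) for _ in range(d)], maxlen=d)  # placeholder actions before the first remote update
delayed_state = np.zeros(20)           # placeholder for the last state received from the slave end
info_state = make_information_state(delayed_state, list(action_history))
print(info_state.shape)                # (20 + 3*7,) = (41,)
```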
8. The teleoperated space manipulator trajectory planning method based on deep reinforcement learning of any one of claims 1 to 7, wherein in step S4 the DRL decision module specifically comprises:
designing an objective function at the master end to maximize the cumulative reward, the function being based on the delayed state after DIP processing and on the distribution of previously sampled states and actions, i.e. the data in the replay buffer;
introducing policy entropy through the objective function to encourage the agent to explore, controlling the randomness of the optimal policy and avoiding local optima;
using an optimization method based on the SAC algorithm to approximate the state-action value function, the state value function and the policy function with separate neural networks;
updating the parameters of the value function by minimizing the soft Bellman residual;
updating the parameters of the soft value function by minimizing the squared residual;
updating the parameters of the policy network by minimizing the KL divergence;
constructing the policy network with a reparameterization method, allowing the policy network to generate actions in the presence of noise;
managing the replay buffer so that only real state data are stored, and non-real data are not stored in the experience replay pool.
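A minimal sketch of the SAC-style updates listed in this claim, written against the original SAC formulation with separate state-action value, state value and policy networks; the objects `q_net`, `v_net`, `v_target`, `policy` (with a reparameterized `sample` method), the temperature `alpha` and the discount `gamma` are all assumptions for illustration, not the patented implementation:

```python
import torch
import torch.nn.functional as F

def sac_losses(batch, q_net, v_net, v_target, policy, alpha=0.2, gamma=0.99):
    """Soft Bellman, soft value and policy (KL) losses of SAC for one minibatch."""
    s, a, r, s_next, done = batch  # tensors sampled from the experience replay pool

    # Soft Bellman residual for the state-action value function Q(s, a).
    with torch.no_grad():
        q_target = r + gamma * (1.0 - done) * v_target(s_next)  # v_target: slowly updated copy of v_net
    q_loss = F.mse_loss(q_net(s, a), q_target)

    # Squared residual for the soft state value function V(s).
    a_new, log_prob = policy.sample(s)      # reparameterized action and its log-probability (assumed method)
    with torch.no_grad():
        v_backup = q_net(s, a_new) - alpha * log_prob
    v_loss = F.mse_loss(v_net(s), v_backup)

    # Policy update: minimizing the KL divergence to the soft Q-induced distribution reduces to this objective.
    policy_loss = (alpha * log_prob - q_net(s, a_new)).mean()

    return q_loss, v_loss, policy_loss
```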
9. The teleoperated space manipulator trajectory planning method based on deep reinforcement learning of any one of claims 1 to 8, wherein the action instructions generated by the agent are executed in the environment by a remote environment interaction module, the environmental feedback caused by the actions, including the new state information and rewards, is collected, and the feedback is then transferred to the DRL decision module of the master end.
10. A computer program product comprising a computer program which, when executed by a processor, implements the teleoperated space manipulator trajectory planning method based on deep reinforcement learning of any one of claims 1 to 9.
CN202411464413.5A 2024-10-21 2024-10-21 A trajectory planning method for teleoperated space manipulator based on deep reinforcement learning Pending CN119115953A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411464413.5A CN119115953A (en) 2024-10-21 2024-10-21 A trajectory planning method for teleoperated space manipulator based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411464413.5A CN119115953A (en) 2024-10-21 2024-10-21 A trajectory planning method for teleoperated space manipulator based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN119115953A true CN119115953A (en) 2024-12-13

Family

ID=93762282

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411464413.5A Pending CN119115953A (en) 2024-10-21 2024-10-21 A trajectory planning method for teleoperated space manipulator based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN119115953A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022083029A1 (en) * 2020-10-19 2022-04-28 深圳大学 Decision-making method based on deep reinforcement learning
CN114800488A (en) * 2022-03-18 2022-07-29 清华大学深圳国际研究生院 Redundant mechanical arm operability optimization method and device based on deep reinforcement learning
CN117140527A (en) * 2023-09-27 2023-12-01 中山大学·深圳 A robotic arm control method and system based on deep reinforcement learning algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
BO XIA et al.: "Trajectory Planning for Teleoperated Space Manipulators Using Deep Reinforcement Learning", ELSEVIER, 29 August 2024 (2024-08-29), pages 131 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119328773A (en) * 2024-12-20 2025-01-21 北京小雨智造科技有限公司 Robot position planning method and device based on reinforcement learning
CN119610122A (en) * 2024-12-30 2025-03-14 哈尔滨工业大学 A space floating manipulator motion planning method and system based on zero reaction space and reinforcement learning
CN120006793A (en) * 2025-01-23 2025-05-16 网易灵动(杭州)科技有限公司 Control method, device, electronic device and computer storage medium of robotic arm
CN119974008A (en) * 2025-03-24 2025-05-13 淄博鑫旭电源科技有限公司 A master-slave robot control system and method for nuclear decommissioning operations
CN119974008B (en) * 2025-03-24 2025-11-14 淄博鑫旭电源科技有限公司 A master-slave robot control system and method for nuclear decommissioning operations
CN119867948A (en) * 2025-03-28 2025-04-25 上海微创医疗机器人(集团)股份有限公司 Remote multi-master cooperative control system and surgical robot system
CN120095834A (en) * 2025-05-07 2025-06-06 成都航天凯特机电科技有限公司 Adaptive robot trajectory planning method and system based on deep reinforcement learning
CN120791797A (en) * 2025-09-09 2025-10-17 清华大学深圳国际研究生院 A control method for a spatial continuous rope-driven arm

Similar Documents

Publication Publication Date Title
CN119115953A (en) A trajectory planning method for teleoperated space manipulator based on deep reinforcement learning
Carron et al. Data-driven model predictive control for trajectory tracking with a robotic arm
Franceschetti et al. Robotic arm control and task training through deep reinforcement learning
US20210178600A1 (en) System and Method for Robust Optimization for Trajectory-Centric ModelBased Reinforcement Learning
CN108153153A (en) A kind of study impedance control system and control method
EP3739418A1 (en) Method of controlling a vehicle and apparatus for controlling a vehicle
CN117140527A (en) A robotic arm control method and system based on deep reinforcement learning algorithm
Schmitt et al. Planning reactive manipulation in dynamic environments
US20220410380A1 (en) Learning robotic skills with imitation and reinforcement at scale
Balakrishna et al. On-policy robot imitation learning from a converging supervisor
CN115081612A (en) Apparatus and method to improve robot strategy learning
US20230095351A1 (en) Offline meta reinforcement learning for online adaptation for robotic control tasks
CN114536319A (en) Device and method for training a control strategy by means of reinforcement learning
Depraetere et al. Comparison of model-free and model-based methods for time optimal hit control of a badminton robot
Ma et al. Reinforcement learning with model-based feedforward inputs for robotic table tennis
Kiemel et al. Learning robot trajectories subject to kinematic joint constraints
Wang et al. Energy-efficient trajectory planning for a class of industrial robots using parallel deep reinforcement learning
Koh et al. Cooperative control of mobile robots with stackelberg learning
JP7749145B2 (en) Systems and methods for polytopic policy optimization of robust feedback control during learning
Carlucho et al. A reinforcement learning control approach for underwater manipulation under position and torque constraints
Chow et al. Parallelized control-aware motion planning with learned controller proxies
Bonsignorio et al. An imitation learning approach for the control of a low-cost low-accuracy robotic arm for unstructured environments
Al Ali et al. Reinforcement learning for path planning of free-floating space robotic manipulator with collision avoidance and observation noise
CN118493381A (en) Offline-to-online generalizable reinforcement learning method and device based on continuous strategy re-vibration
Xia et al. Agent-Based Space Teleoperation: Mitigating Time Delays With Deep Reinforcement Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination