CN119669952B

CN119669952B - A Sim2Real model construction method and device based on reinforcement learning

Info

Publication number: CN119669952B
Application number: CN202411610785.4A
Authority: CN
Inventors: 张梦娇; 朱博林; 段世红; 徐诚
Original assignee: University of Science and Technology Beijing USTB
Current assignee: University of Science and Technology Beijing USTB
Priority date: 2024-11-12
Filing date: 2024-11-12
Publication date: 2025-10-24
Anticipated expiration: 2044-11-12
Also published as: CN119669952A

Abstract

The invention provides a Sim2Real model construction method and device based on reinforcement learning, which relate to the technical field of data processing, and the method comprises the steps of obtaining evaluation indexes of a simulation environment and a Real environment; the method comprises the steps of utilizing a linear weighting method to quantify the weighted difference between a simulation environment and a Real environment index according to an evaluation index, establishing a Sim2Real model for mutually converting data between the simulation environment and the Real environment, and utilizing a reinforcement learning algorithm to conduct field self-adaptive training on the Sim2Real model by taking the smallest weighted difference between the simulation environment and the Real environment index as a target, so as to obtain a final Sim2Real model. According to the invention, the Sim2Real model for data conversion between the simulation environment and the Real environment is built, so that the adjustment of the data layer is performed, and the migration error from the simulation environment to the Real environment is reduced.

Description

Sim2Real model construction method and device based on reinforcement learning

Technical Field

The invention relates to the technical field of data processing, in particular to a Sim2Real model construction method and device based on reinforcement learning.

Background

In the fields of modern manufacturing, service, medical care and the like, intelligent robots are playing an increasingly important role, and the development of automation and intelligence is promoted. In these fields, mobile robots are widely used for tasks such as logistics transportation, warehouse management, medical assistance, and the like. To accomplish these tasks, mobile robots must have autonomous navigational capabilities, meaning that they are able to autonomously move, plan paths, avoid obstacles, and eventually reach a target site in an unknown or partially unknown environment. The active navigation technology is characterized in that the robot perceives the environment, plans the path and controls the decision.

In addition, in recent years, with the development of artificial intelligence technology, particularly the introduction of machine learning and reinforcement learning methods, active navigation technology has been further improved. Reinforcement learning (Reinforcement Learning, RL) guides the robot through a rewarding mechanism to learn how to behave in the environment, developing an optimal behavior strategy. The reinforcement learning method does not need a definite environment model, and can adapt to complex and changeable environments through a large number of trial and error processes, so that the reinforcement learning method has very wide application prospects in dynamic and complex environments.

However, due to the difference between simulation and reality, the strategy trained in the simulation environment often has difficulty in completely reflecting the complexity and dynamic change of the real environment in the real environment, and meanwhile, the existing reinforcement learning method is excellent in specific environment, but often lacks sufficient generalization capability, so that the strategy may show insufficient adaptability when the robot faces new or sudden environmental change.

Disclosure of Invention

In order to solve the technical problems that due to the fact that differences exist between simulation and reality, a strategy trained in a simulation environment is difficult to fully reflect the complexity and dynamic change of the Real environment in the Real environment, meanwhile, the existing reinforcement learning method is excellent in specific environment, but often lacks sufficient generalization capability, and the strategy possibly shows insufficient adaptability when a robot faces new or sudden environment change, the invention provides a Sim2Real model construction method and device based on reinforcement learning.

The technical scheme provided by the embodiment of the invention is as follows:

First aspect:

the Sim2Real model construction method based on reinforcement learning provided by the embodiment of the invention comprises the following steps:

s1, acquiring evaluation indexes of a simulation environment and a real environment;

S2, quantifying the weighted difference between the simulation environment and the real environment index according to the evaluation index by using a linear weighting method;

S3, constructing a Sim2Real model for mutually converting data between the simulation environment and the Real environment;

And S4, performing field self-adaptive training on the Sim2Real model by using a reinforcement learning algorithm and taking the minimum weighted difference between the simulation environment and the Real environment index as a target to obtain a final Sim2Real model.

Second aspect:

The Sim2Real model building device based on reinforcement learning provided by the embodiment of the invention comprises:

The acquisition module is used for acquiring the evaluation indexes of the simulation environment and the real environment;

the quantization module is used for quantizing the weighted difference between the simulation environment and the real environment index according to the evaluation index by using a linear weighting method;

The building module is used for building a Sim2Real model for mutually converting data between the simulation environment and the Real environment;

And the training module is used for performing field self-adaptive training on the Sim2Real model by using a reinforcement learning algorithm and taking the minimum weighted difference between the simulation environment and the Real environment index as a target to obtain a final Sim2Real model.

Third aspect:

A computer-readable storage medium, on which a computer program is stored, is provided according to an embodiment of the present invention, which when executed by a processor implements the reinforcement learning-based Sim2Real model construction method according to the first aspect.

The technical scheme provided by the embodiment of the invention has the beneficial effects that at least:

(1) According to the invention, the difference between the simulation environment and the Real environment is determined by acquiring the evaluation indexes of the two, the weighted difference between the evaluation indexes is quantified by using a linear weighting method, and the data layer is adjusted according to the difference between the simulation environment and the Real environment by constructing a Sim2Real model capable of carrying out data conversion between the simulation environment and the Real environment, so that the migration error from the simulation environment to the Real environment is reduced, and the complexity and dynamic change of the Real environment are effectively reflected.

(2) In the invention, a reinforcement learning algorithm is utilized, the field self-adaptive training is carried out on the Sim2Real model by taking the minimum weighted difference between the simulation environment and the Real environment index as a target, and the final Sim2Real model is obtained. The model is continuously optimized to be excellent in a wider environment, so that the adaptability of the model in the face of various changes in reality is enhanced, and the generalization capability of strategies is improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic flow chart of a Sim2Real model construction method based on reinforcement learning according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of a Sim2Real model building system based on reinforcement learning according to an embodiment of the present invention.

Detailed Description

The technical scheme of the invention is described below with reference to the accompanying drawings.

In embodiments of the invention, words such as "exemplary," "such as" and the like are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, the term use of an example is intended to present concepts in a concrete fashion. Furthermore, in embodiments of the present invention, the meaning of "and/or" may be that of both, or may be that of either, optionally one of both.

In the embodiments of the present invention, "image" and "picture" may be sometimes used in combination, and it should be noted that the meaning of the expression is consistent when the distinction is not emphasized. "of", "corresponding (corresponding, relevant)" and "corresponding (corresponding)" are sometimes used in combination, and it should be noted that the meaning of the expression is consistent when the distinction is not emphasized.

In embodiments of the present invention, sometimes a subscript such as W ₁ may be wrongly written in a non-subscript form such as W1, and the meaning of the expression is consistent when the distinction is not emphasized.

In order to make the technical problems, technical solutions and advantages to be solved more apparent, the following detailed description will be given with reference to the accompanying drawings and specific embodiments.

Referring to fig. 1 of the specification, a schematic flow chart of a Sim2Real model construction method based on reinforcement learning according to an embodiment of the present invention is shown.

The embodiment of the invention provides a Sim2Real model building method based on reinforcement learning, which can be realized by Sim2Real model building equipment based on reinforcement learning, wherein the Sim2Real model building equipment based on reinforcement learning can be a terminal or a server.

The Sim2Real (Simulation to Reality ) model is a technical method for solving the migration problem from a simulation environment to a reality environment in the fields of robotics, autopilot, reinforcement learning, and the like.

The processing flow of the Sim2Real model construction method based on reinforcement learning can comprise the following steps:

s1, acquiring evaluation indexes of a simulation environment and a real environment.

In one possible implementation, the evaluation index includes a successfully weighted path length and a navigation success rate.

The successful weighted path length is specifically:

Where SPL represents the successfully weighted path length, N represents the number of nodes trained, S _b represents the success or failure of navigation at the current number of nodes, L _b represents the optimal shortest path length to the target point at the current number of nodes, and P _b represents the path length traversed by the robot in the actual test.

The navigation success rate is specifically as follows:

Wherein Success represents Success rate, S represents navigation Success times, and T represents the ratio of the total number of navigation tests.

In the invention, the efficiency of the robot navigation path can be evaluated by calculating the comparison of the path travelled by the robot in successful navigation with the optimal path. A higher SPL value means that the robot has chosen a more nearly optimal path when the navigation task is completed. Meanwhile, PL emphasizes the efficiency of the path, while Success emphasizes the completion rate of the task. The combination of the two enables the evaluation to balance efficiency and success rate, avoiding the one-sidedness of single index evaluation.

S2, quantifying the weighted difference between the simulation environment and the real environment index according to the evaluation index by using a linear weighting method.

The linear weighting method is a commonly used multi-index decision method, and is used for integrating multiple indexes or factors to make decisions or evaluations. A comprehensive score or evaluation value is obtained by assigning a weight to each index and then summing the weighted values of the indices.

In one possible implementation, the weighted difference between the quantized simulation environment and the real environment index in S2 is specifically:

Sim2Real_Gap＝ω₁|SPL_sim-SPL_real|+ω₂|Success_sim-Success_real|

Wherein Sim2Real _Gap represents a weighted difference between the simulation environment and the Real environment index, ω ₁ represents a weight coefficient of a successful weighted path length, ω ₂ represents a weight coefficient of a Success rate, SPL _sim represents a successful weighted path length in the simulation environment, SPL _real represents a successful weighted path length in the Real environment, success represents a Success rate, success _sim represents a navigation Success rate in the simulation environment, and Success _real represents a navigation Success rate in the Real environment.

Optionally, the weight coefficient of the successfully weighted path length is 0.5, and the weight coefficient of the success rate is 0.5.

In the invention, the two key indexes are weighted and summed to generate a comprehensive scoring value which can effectively reflect the overall difference between the simulation environment and the real environment, thereby helping a developer to know more clearly how the performance of the model in the real environment deviates from the simulation result. Meanwhile, the two key indexes are given the same weight, so that the evaluation result distortion caused by the weight bias of a certain index can be avoided.

And S3, constructing a Sim2Real model for mutually converting data between the simulation environment and the Real environment.

It should be noted that, the aim of setting up Sim2Real platform is to realize the data conversion between the simulation environment and the Real environment, reduce the difference of data formats, and thus achieve the goal of mutually fusing environment data. A large amount of state, motion, rewards, etc. information is generated during interaction with the high fidelity simulation environment and the real environment, stored in a buffer pool for analysis and evaluation, and valid data is multiplexed.

In one possible embodiment, S3 specifically includes:

S301, installing Gazebo and Rviz in a Sim2Real model, and starting up a simulation environment of the TurtleBot robot.

Gazebo is a powerful open source robotic simulation tool, and is widely used for robotics research and development. It provides a highly configurable simulation environment that can simulate complex physical characteristics and sensor data. Helping users test and optimize robotic systems in a virtual environment.

Wherein Rviz is a visualization tool of ROS (Robot Operating System) for displaying and analyzing sensor data, robot status and environmental information in a robot operating system.

In the present invention Gazebo provides a simulation platform to simulate the operation of a robot, while Rviz provides a tool to visualize and debug the data and behavior in the simulation. The two are matched for use, so that the efficient development and verification of the robot system can be realized.

S302, developing a bridging node from the simulation environment to the Gazebo-class real environment, determining the data conversion relation between the simulation environment and the real environment, wherein the bridging node is used for creating a Ros node and subscribing and publishing topics.

In ROS (Robot Operating System), the nodes are basic communication units, each of which is an execution unit responsible for handling a particular function or task. The nodes may be independent processes or may run on the same machine, executing different parts of the robotic system.

In the invention, the compatibility and consistency of the data format between the simulation environment and the reality environment are ensured through the bridging node and the data conversion mechanism. This consistency is the basis for ensuring that the model behaves consistently in both environments, enabling the model to seamlessly apply knowledge gained in the simulation environment in the real-world environment.

S303, the simulation-to-reality node is responsible for converting discrete actions output in the simulation environment into continuous actions executable in the real environment.

And S304, converting the topic data related to the sensor in the real environment into topic data identifiable by the robot in the simulation environment by using the real-to-simulation node.

S305, creating a simulation environment Ros node and receiving topic data.

S306, integrating topic data and topic data of sensors in the simulation environment, updating the strategy, and outputting an action instruction.

S307, the action instructions are respectively input into the simulation environment and the real environment for training.

According to the invention, through mutually converting the data in the simulation environment and the data in the Real environment, the Sim2Real model can be exposed to more diversified data in training, and the overfitting of the model to a single environment is reduced, so that the adaptability and the robustness of the model in different environments are enhanced. Meanwhile, in actual operation, an error of the robot or the autopilot system may cause danger or damage. Through advanced testing and optimization in the simulation environment, errors possibly occurring in the real environment can be reduced to the greatest extent, and the safety of the whole system is improved.

Further, by storing data in the simulation and reality environment into a buffer pool, the system is able to efficiently multiplex the data for subsequent analysis and evaluation. The method not only improves the utilization rate of the data, but also improves the model or strategy through further analysis, thereby improving the overall performance of the system.

S4, performing field self-adaptive training on the Sim2Real model by using a reinforcement learning algorithm and taking the minimum weighted difference between the simulation environment and the Real environment index as a target to obtain a final Sim2Real model.

Reinforcement learning (Reinforcement Learning, RL) is a method of machine learning, among other things, aimed at learning how to choose actions by interacting with the environment to maximize the cumulative rewards. The core of reinforcement learning is to constantly optimize strategies through exploration and utilization so that the intelligent agent can make optimal decisions in different environments.

It should be noted that, the field-adaptive-based method collects relevant data of the real environment, such as sensor readings, environment layout, and similar dynamic elements. The data and the data in the simulation environment form a more comprehensive training set, and the more comprehensive training set and the data are added into the training process of the model together, so that the model can be contacted with more diversified environmental characteristics. And adjusting the ROS topics sent by the real environment and the action data calculated by the strategy network to enable the ROS topics to be fused with the simulation environment data, and adjusting training logic to enable the intelligent agent to receive the simulation and the real environment data at the same time and update the strategy model.

In one possible embodiment, S4 specifically includes:

S401, acquiring an original ROS image message sent by a real environment.

Among them, ROS Image (ROS Image) refers to a message format for representing and transmitting Image data in a robot operating system (ROS, robot Operating System). ROS images are commonly used in robotic vision systems, where image data captured by cameras or other sensors may be published, subscribed to, and processed in the ROS network.

S402, converting the original ROS image message into an OpenCV image format.

OpenCV (Open Source Computer Vision Library) is an open source computer vision library, which is widely used for real-time image processing and computer vision tasks. The system provides rich functions and tools, and supports applications such as image and video processing, analysis, recognition and the like.

S403, designating the data type of the converted image as 32-bit floating point number, and adjusting the image size to 256 pixels in height and width.

S404, storing the original ROS image message into an observation data queue through a callback function.

Wherein the callback function (Callback Function) is a programming technique for executing a predefined function upon occurrence of an event. Callback functions are a mechanism for handling asynchronous operations, event-driven programming, and responsive programming. It is widely used in many programming languages and frameworks

S405, judging whether the observation data queue is empty. If yes, continuing the training process of the simulation environment data. Otherwise, the real environment data is taken out from the observation data queue and is used as the data set together with the simulation data to complete the updating of the strategy.

In the invention, the integration of real environment data into the training process helps to narrow the gap between simulation and reality. By including the sensor readings and the environment layout information of the real environment in the training set, the model can learn the actual situation in the real environment, thereby better coping with the real challenges that the simulation environment cannot be completely simulated.

S406, obtaining integer action data through calculation of an Actor-Critic network.

According to the invention, the action data is migrated from the simulation environment to the real environment through the calculation of the Actor-Critic network, so that the action execution of the robot in the real environment is consistent with the simulation. Thus, the strategy learned by the model can be accurately realized in a real environment, and errors caused by action differences are reduced.

And S407, transmitting the action data to the real environment so that the robot in the real environment executes the action data and updates the environment state.

In the Sim2Real model construction method based on reinforcement learning, a joint training strategy is adopted. In particular, the simulation environment can quickly provide a large amount of data for efficient training of reinforcement learning, while the real environment provides critical correction data. By training the model on a large scale in a simulation environment, in combination with small scale fine tuning in a real environment, the model can be optimized simultaneously in both fields, thus performing better in the real world. The method of field self-adaption is introduced to realize alignment of simulation and real state distribution, and the simulation state gradually approaches to the real state, so that the difference in distribution is reduced.

The state space represents the current state of the system and is used for quantifying the difference between the simulation environment and the real environment. In this process, the state space includes a plurality of evaluation metrics reflecting the degree of discrepancy between the simulation and reality. The set state space is S, which can be expressed as:

S={E_S,E_r}

Wherein S represents a state space, E _S is an evaluation index of a simulation environment, and E _r is an evaluation index of a real environment.

The state space consists of the speed v _s of the robot in the simulation environment, the angle θ _s, the sensor data d _s and the corresponding data (v _r,θ_r,d_r) in the real environment. The state space not only contains the attribute of the robot in the simulation environment, but also contains the comparison result with the real world, and the accuracy of the simulation model on different indexes is quantized.

The design of the action space is directed to how to adjust parameters in the simulation environment to reduce the gap between simulation and reality. The action space is set to a, expressed as:

A={a₁,a₂,…,a_m}

Where a denotes an action space, a _j denotes a j-th action in the action space, j=1, 2..m, and m denotes the total number of actions.

Each a _j represents a specific parameter that can be adjusted, and by changing these parameters, the performance of the simulation environment is affected, thereby affecting the adaptability of the Sim2Real model to the Real scene.

In the training of Sim2Real, the design of the bonus function is related to the goal of minimizing the weighted differences between the simulation environment and the Real environment. The difference between the simulation and the real environment at different moments is measured by introducing a time dimension so as to help the model adapt to the environment in long-time dynamic change, and a reward function is set as follows:

Wherein R represents a reward function, T represents the time span of the whole task, omega _i represents the weight of the ith index, n represents the total number of indexes, E _s (i, T) represents the value of the ith index at the time T in the simulation environment, and E _r (i, T) represents the value of the ith index at the time T in the real environment.

By introducing a dimension of time, the reward function can be measured based on the state difference at each moment, and the passing of the reward value reduces the difference accumulated value in time, so that the performance of the simulation environment changing along with the time can gradually approach to the real environment.

According to the invention, the Sim2Real model is subjected to field self-adaptive training through the reinforcement learning algorithm, and the Real environment data and the simulation environment data are combined, so that the performance of the model in the Real environment is improved, the gap between simulation and reality is reduced, the generalization capability of a strategy is enhanced, and the training efficiency and accuracy are optimized. Meanwhile, the robustness of the model can be improved in the training process, the effectiveness of the strategy model is improved, and a more reliable and efficient solution is provided for the intelligent agent in practical application.

Referring to fig. 2 of the specification, a schematic structural diagram of a Sim2Real model building device based on reinforcement learning is shown.

The invention also provides a Sim2Real model construction device 20 based on reinforcement learning, comprising:

an acquisition module 201, configured to acquire evaluation indexes of a simulation environment and a real environment;

In one possible implementation, the evaluation index includes a successfully weighted path length and a navigation success rate;

the successfully weighted path length is specifically:

Wherein SPL represents a successfully weighted path length, N represents a number of nodes trained, S _b represents success or failure of navigation at the current number of nodes, L _b represents an optimal shortest path length to reach a target point at the current number of nodes, and P _b represents a path length travelled by a robot in an actual test;

the navigation success rate is specifically as follows:

A quantization module 202, configured to quantize a weighted difference between the simulation environment and the real environment index according to the evaluation index by using a linear weighting method;

In one possible implementation manner, the weighted difference between the quantized simulation environment and the real environment index is specifically:

Sim2Real_Gap＝ω₁|SPL_sim-SPL_real|+ω₂|Success_sim-Success_real|

Wherein Sim2Real _Gap represents a weighted difference between the simulation environment and the Real environment index, ω ₁ represents a weight coefficient of a successful weighted path length, ω ₂ represents a weight coefficient of a Success rate, SPL _sim represents a successful weighted path length in the simulation environment, SPL _real represents a successful weighted path length in the Real environment, success _sim represents a navigation Success rate in the simulation environment, and Success _real represents a navigation Success rate in the Real environment.

The building module 203 is configured to build a Sim2Real model that performs inter-conversion on data between a simulation environment and a Real environment;

in one possible implementation, the building module 203 is configured to:

Installing Gazebo and Rviz in the Sim2Real model, and starting up a simulation environment of the TurtleBot robot;

developing a bridging node from the simulation environment to Gazebo types of real environments, and determining a data conversion relationship between the simulation environment and the real environments, wherein the bridging node is used for creating a Ros node and subscribing and publishing topics;

The node from simulation to reality is responsible for converting discrete actions output in the simulation environment into continuous actions executable in the real environment;

the realization-to-simulation node converts topic data related to a sensor in a real environment into topic data identifiable by a robot in a simulation environment;

creating a simulation environment Ros node, and receiving the topic data;

integrating the topic data with topic data of sensors in a simulation environment, updating a strategy, and outputting an action instruction;

and respectively inputting the action instructions into a simulation environment and a real environment for training.

And the training module 204 is configured to perform field adaptive training on the Sim2Real model by using a reinforcement learning algorithm with a minimum weighted difference between the simulation environment and the Real environment index as a target, so as to obtain a final Sim2Real model.

In one possible implementation, the training module 204 is configured to:

acquiring an original ROS image message sent by a real environment;

converting the original ROS image message into an OpenCV image format;

designating the data type of the converted image as 32-bit floating point number, and adjusting the image size to 256 pixel points in height and width;

Storing the original ROS image information into an observation data queue through a callback function;

judging whether the observed data queue is empty or not, if so, continuing the training process of the simulation environment data, otherwise, taking out the real environment data from the observed data queue and using the real environment data and the simulation data together as a data set to finish the updating of the strategy;

Obtaining integer action data through calculation of an Actor-Critic network;

and sending the action data to the real environment so that the robot in the real environment executes the action data and updates the environment state.

The Sim2Real model construction device 20 based on reinforcement learning provided by the present invention can execute the Sim2Real model construction method based on reinforcement learning, and achieve the same or similar technical effects, and in order to avoid repetition, the present invention is not repeated.

It should be appreciated that the processor in embodiments of the invention may be a central processing unit (central processing unit, CPU), which may also be other general purpose processors, digital signal processors (digital signalprocessor, DSP), application Specific Integrated Circuits (ASIC), off-the-shelf programmable gate arrays (field programmable GATE ARRAY, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.

It should also be appreciated that the memory in embodiments of the present invention may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a Programmable ROM (PROM), an erasable programmable ROM (erasable PROM), an electrically erasable programmable EPROM (EEPROM), or a flash memory. The volatile memory may be random access memory (random access memory, RAM) which acts as external cache memory. By way of example, and not limitation, many forms of random access memory (random access memory, RAM) are available, such as static random access memory (STATIC RAM, SRAM), dynamic Random Access Memory (DRAM), synchronous Dynamic Random Access Memory (SDRAM), double data rate synchronous dynamic random access memory (doubledata RATE SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (ENHANCED SDRAM, ESDRAM), synchronous link dynamic random access memory (SYNCHLINK DRAM, SLDRAM), and direct memory bus random access memory (direct rambus RAM, DR RAM).

The above embodiments may be implemented in whole or in part by software, hardware (e.g., circuitry), firmware, or any other combination. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer instructions or computer programs. When the computer instructions or computer program are loaded or executed on a computer, the processes or functions described in accordance with embodiments of the present invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website site, computer, server, or data center to another website site, computer, server, or data center by wired (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains one or more sets of available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. The semiconductor medium may be a solid state disk.

It should be understood that the term "and/or" is merely an association relationship describing the associated object, and means that three relationships may exist, for example, a and/or B, and may mean that a exists alone, while a and B exist alone, and B exists alone, wherein a and B may be singular or plural. In addition, the character "/" herein generally indicates that the associated object is an "or" relationship, but may also indicate an "and/or" relationship, and may be understood by referring to the context.

In the present invention, "at least one" means one or more, and "a plurality" means two or more. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (a, b, or c) of a, b, c, a-b, a-c, b-c, or a-b-c may be represented, wherein a, b, c may be single or plural.

It should be understood that, in various embodiments of the present invention, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present invention.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the apparatus, device and unit described above may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.

In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another device, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The storage medium includes a U disk, a removable hard disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, an optical disk, or other various media capable of storing program codes.

Embodiments of the present invention provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a reinforcement learning based Sim2Real model construction method as described in the method embodiments.

The computer readable storage medium provided by the invention can realize the steps and effects of the Sim2Real model construction method based on reinforcement learning in the method embodiment, and in order to avoid repetition, the invention is not repeated.

The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

The following points need to be described:

(1) The drawings of the embodiments of the present invention relate only to the structures related to the embodiments of the present invention, and other structures may refer to the general designs.

(2) In the drawings for describing embodiments of the present invention, the thickness of layers or regions is exaggerated or reduced for clarity, i.e., the drawings are not drawn to actual scale. It will be understood that when an element such as a layer, film, region or substrate is referred to as being "on" or "under" another element, it can be "directly on" or "under" the other element or intervening elements may be present.

(3) The embodiments of the invention and the features of the embodiments can be combined with each other to give new embodiments without conflict.

The present invention is not limited to the above embodiments, but the scope of the invention is defined by the claims.

Claims

1. A Sim2Real model construction method based on reinforcement learning, characterized by comprising:

S1: Obtain evaluation indicators of simulation environment and real environment;

S2: using a linear weighted method, according to the evaluation index, quantifying the weighted difference between the simulation environment and the real environment index;

S3: Build a Sim2Real model to convert data between the simulation environment and the real environment;

S4: Using a reinforcement learning algorithm, with the goal of minimizing the weighted difference between the simulated environment and the real environment indicators, the Sim2Real model is trained for domain adaptation to obtain a final Sim2Real model;

The weighted difference between the quantitative simulation environment and the real environment indicators in S2 is specifically:

;

in, represents the weighted difference between the indicators of the simulation environment and the real environment, ω1 represents the weight coefficient of the successful weighted path length, _ω2 represents the weight coefficient _of the success rate, represents the length of the successful weighted path in the simulation environment, represents the length of the successful weighted path in the real environment, Indicates the success rate, represents the navigation success rate in the simulation environment, Indicates the navigation success rate in the real environment;

In the training of the Sim2Real model, the reward function is set as:

;

Where R represents the reward function, T represents the time span of the entire task, ωi represents the weight of the i -th _indicator , and n represents the total number of indicators. represents the value of the i- th indicator at time t in the simulation environment, Represents the value of the i- th indicator at time t in the real environment.

2. The Sim2Real model construction method based on reinforcement learning according to claim 1, wherein the evaluation indicators include: success weighted path length and navigation success rate;

The success weighted path length is specifically:

;

Where SPL represents the successful weighted path length, N represents the number of training segments, Sb represents the success or failure of navigation under the current number of segments, _Lb represents the optimal shortest path length to the target point under the current number of _segments , and Pb represents the path length traversed _by the robot in the actual test;

The navigation success rate is specifically:

;

in, represents the success rate, S represents the number of successful navigations, and T represents the ratio of the total number of navigation tests.

3. The Sim2Real model construction method based on reinforcement learning according to claim 1, wherein S3 specifically includes:

S301: Install Gazebo and Rviz in the Sim2Real model and start the simulation environment of the TurtleBot3 robot;

S302: Develop a bridge node from the simulation environment to the Gazebo-like real environment, determine the data conversion relationship between the simulation environment and the real environment, and the bridge node is used to create a Ros node and subscribe to and publish topics;

S303: The simulation-to-reality node is responsible for converting discrete actions output in the simulation environment into continuous actions that can be executed in the real environment;

S304: The reality-to-simulation node converts the sensor-related topic data in the real environment into topic data that can be recognized by the robot in the simulation environment;

S305: Create a simulation environment Ros node to receive the topic data;

S306: Fusing the topic data and the topic data of sensors in the simulation environment, updating the strategy, and outputting action instructions;

S307: Inputting the action instructions into the simulation environment and the real environment respectively for training.

4. The Sim2Real model construction method based on reinforcement learning according to claim 1, wherein S4 specifically includes:

S401: Obtain the original ROS image message sent by the real environment;

S402: Convert the original ROS image message into an OpenCV image format;

S403: specifying the data type of the converted image as a 32-bit floating point number, and adjusting the image size to 256 pixels in height and width;

S404: storing the original ROS image message into the observation data queue through the callback function;

S405: Determine whether the observed data queue is empty; if so, continue the training process of the simulated environment data; otherwise, take the real environment data from the observed data queue and use it together with the simulated data as a data set to complete the strategy update;

S406: Obtaining integer action data through calculation of the Actor-Critic network;

S407: Sending the motion data to the real environment, so that the robot in the real environment executes the motion data and updates the environment state.

5. A Sim2Real model construction device based on reinforcement learning, characterized by comprising:

The acquisition module is used to obtain the evaluation indicators of the simulation environment and the real environment;

A quantification module, configured to quantify the weighted difference between the simulation environment and the real environment indicators according to the evaluation indicators using a linear weighting method;

Building a module for building a Sim2Real model that converts data between the simulation environment and the real environment;

A training module is used to perform domain adaptive training on the Sim2Real model using a reinforcement learning algorithm with the goal of minimizing the weighted difference between the simulated environment and the real environment indicators to obtain a final Sim2Real model;

The weighted difference between the quantitative simulation environment and the real environment indicators in the quantification module is specifically:

;

In the training of the Sim2Real model, the reward function is set as:

;

6. The Sim2Real model construction device based on reinforcement learning according to claim 5, wherein the evaluation indicators include: successful weighted path length and navigation success rate;

The success weighted path length is specifically:

;

The navigation success rate is specifically:

;

7. The Sim2Real model construction device based on reinforcement learning according to claim 5, wherein the construction module is used to:

Install Gazebo and Rviz in the Sim2Real model and start the simulation environment of the TurtleBot3 robot;

Develop a bridge node from the simulation environment to the Gazebo-like real environment, determine the data conversion relationship between the simulation environment and the real environment, and use the bridge node to create Ros nodes and subscribe to and publish topics;

The simulation-to-reality node is responsible for converting discrete actions output in the simulation environment into continuous actions that can be executed in the real environment;

The reality-to-simulation node converts the sensor-related topic data in the real environment into topic data that can be recognized by the robot in the simulation environment;

Create a simulation environment Ros node to receive the topic data;

fusing the topic data and the topic data of sensors in the simulation environment, updating the strategy, and outputting action instructions;

The action instructions are input into the simulation environment and the real environment respectively for training.

8. The Sim2Real model construction device based on reinforcement learning according to claim 5, wherein the training module is used to:

Get the original ROS image message sent by the real environment;

Convert the original ROS image message to OpenCV image format;

The data type of the converted image is specified as 32-bit floating point number, and the image size is adjusted to 256 pixels in height and width;

The original ROS image message is stored in the observation data queue through the callback function;

Determine whether the observed data queue is empty; if so, continue the training process of the simulated environment data; otherwise, take the real environment data from the observed data queue and use it together with the simulated data as a data set to complete the strategy update;

The integer action data is obtained through the calculation of the Actor-Critic network;

The action data is sent to the real environment so that the robot in the real environment executes the action data and updates the environment state.