
CN119669952B - A Sim2Real model construction method and device based on reinforcement learning - Google Patents

A Sim2Real model construction method and device based on reinforcement learning

Info

Publication number
CN119669952B
Authority
CN
China
Prior art keywords
environment
data
real environment
simulation
sim2real
Prior art date
Legal status
Active
Application number
CN202411610785.4A
Other languages
Chinese (zh)
Other versions
CN119669952A (en)
Inventor
张梦娇
朱博林
段世红
徐诚
Current Assignee
University of Science and Technology Beijing USTB
Original Assignee
University of Science and Technology Beijing USTB
Priority date
Filing date
Publication date
Application filed by University of Science and Technology Beijing (USTB)
Priority to CN202411610785.4A
Publication of CN119669952A
Application granted
Publication of CN119669952B
Legal status: Active


Abstract

The invention provides a Sim2Real model construction method and device based on reinforcement learning, relating to the technical field of data processing. The method comprises: obtaining evaluation indexes of a simulation environment and a real environment; quantifying, with a linear weighting method and according to the evaluation indexes, the weighted difference between the simulation-environment and real-environment indexes; building a Sim2Real model that converts data between the simulation environment and the real environment; and performing domain-adaptive training on the Sim2Real model with a reinforcement learning algorithm, taking the minimum weighted difference between the simulation-environment and real-environment indexes as the objective, to obtain the final Sim2Real model. By building a Sim2Real model for data conversion between the simulation environment and the real environment, the invention adjusts the data layer and thereby reduces the migration error from the simulation environment to the real environment.

Description

Sim2Real model construction method and device based on reinforcement learning
Technical Field
The invention relates to the technical field of data processing, in particular to a Sim2Real model construction method and device based on reinforcement learning.
Background
In fields such as modern manufacturing, services and medical care, intelligent robots play an increasingly important role and drive the development of automation and intelligence. In these fields, mobile robots are widely used for tasks such as logistics transportation, warehouse management and medical assistance. To accomplish these tasks, mobile robots must have autonomous navigation capability, meaning that they are able to move autonomously, plan paths, avoid obstacles and eventually reach a target site in an unknown or partially unknown environment. The core of autonomous navigation technology lies in the robot's environment perception, path planning and control decision-making.
In recent years, with the development of artificial intelligence technology, particularly the introduction of machine learning and reinforcement learning methods, autonomous navigation technology has been further improved. Reinforcement learning (RL) guides the robot, through a reward mechanism, to learn how to act in the environment and to develop an optimal behavior strategy. The reinforcement learning method does not require an explicit environment model and can adapt to complex, changeable environments through a large amount of trial and error, so it has very broad application prospects in dynamic and complex environments.
However, due to the difference between simulation and reality, the strategy trained in the simulation environment often has difficulty in completely reflecting the complexity and dynamic change of the real environment in the real environment, and meanwhile, the existing reinforcement learning method is excellent in specific environment, but often lacks sufficient generalization capability, so that the strategy may show insufficient adaptability when the robot faces new or sudden environmental change.
Disclosure of Invention
In order to solve the technical problems that due to the fact that differences exist between simulation and reality, a strategy trained in a simulation environment is difficult to fully reflect the complexity and dynamic change of the Real environment in the Real environment, meanwhile, the existing reinforcement learning method is excellent in specific environment, but often lacks sufficient generalization capability, and the strategy possibly shows insufficient adaptability when a robot faces new or sudden environment change, the invention provides a Sim2Real model construction method and device based on reinforcement learning.
The technical scheme provided by the embodiment of the invention is as follows:
First aspect:
the Sim2Real model construction method based on reinforcement learning provided by the embodiment of the invention comprises the following steps:
s1, acquiring evaluation indexes of a simulation environment and a real environment;
S2, quantifying the weighted difference between the simulation environment and the real environment index according to the evaluation index by using a linear weighting method;
S3, constructing a Sim2Real model for mutually converting data between the simulation environment and the Real environment;
And S4, performing field self-adaptive training on the Sim2Real model by using a reinforcement learning algorithm and taking the minimum weighted difference between the simulation environment and the Real environment index as a target to obtain a final Sim2Real model.
Second aspect:
The Sim2Real model building device based on reinforcement learning provided by the embodiment of the invention comprises:
The acquisition module is used for acquiring the evaluation indexes of the simulation environment and the real environment;
the quantization module is used for quantizing the weighted difference between the simulation environment and the real environment index according to the evaluation index by using a linear weighting method;
The building module is used for building a Sim2Real model for mutually converting data between the simulation environment and the Real environment;
And the training module is used for performing field self-adaptive training on the Sim2Real model by using a reinforcement learning algorithm and taking the minimum weighted difference between the simulation environment and the Real environment index as a target to obtain a final Sim2Real model.
Third aspect:
A computer-readable storage medium, on which a computer program is stored, is provided according to an embodiment of the present invention, which when executed by a processor implements the reinforcement learning-based Sim2Real model construction method according to the first aspect.
The technical scheme provided by the embodiment of the invention has the beneficial effects that at least:
(1) According to the invention, the difference between the simulation environment and the Real environment is determined by acquiring the evaluation indexes of the two, the weighted difference between the evaluation indexes is quantified by using a linear weighting method, and the data layer is adjusted according to the difference between the simulation environment and the Real environment by constructing a Sim2Real model capable of carrying out data conversion between the simulation environment and the Real environment, so that the migration error from the simulation environment to the Real environment is reduced, and the complexity and dynamic change of the Real environment are effectively reflected.
(2) In the invention, a reinforcement learning algorithm is utilized, the field self-adaptive training is carried out on the Sim2Real model by taking the minimum weighted difference between the simulation environment and the Real environment index as a target, and the final Sim2Real model is obtained. The model is continuously optimized to be excellent in a wider environment, so that the adaptability of the model in the face of various changes in reality is enhanced, and the generalization capability of strategies is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a Sim2Real model construction method based on reinforcement learning according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a Sim2Real model building system based on reinforcement learning according to an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is described below with reference to the accompanying drawings.
In embodiments of the invention, words such as "exemplary," "such as" and the like are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, the term use of an example is intended to present concepts in a concrete fashion. Furthermore, in embodiments of the present invention, the meaning of "and/or" may be that of both, or may be that of either, optionally one of both.
In the embodiments of the present invention, "image" and "picture" may be sometimes used in combination, and it should be noted that the meaning of the expression is consistent when the distinction is not emphasized. "of", "corresponding (corresponding, relevant)" and "corresponding (corresponding)" are sometimes used in combination, and it should be noted that the meaning of the expression is consistent when the distinction is not emphasized.
In embodiments of the present invention, sometimes a subscript such as W 1 may be wrongly written in a non-subscript form such as W1, and the meaning of the expression is consistent when the distinction is not emphasized.
In order to make the technical problems, technical solutions and advantages to be solved more apparent, the following detailed description will be given with reference to the accompanying drawings and specific embodiments.
Referring to fig. 1 of the specification, a schematic flow chart of a Sim2Real model construction method based on reinforcement learning according to an embodiment of the present invention is shown.
The embodiment of the invention provides a Sim2Real model building method based on reinforcement learning, which can be realized by Sim2Real model building equipment based on reinforcement learning, wherein the Sim2Real model building equipment based on reinforcement learning can be a terminal or a server.
The Sim2Real (Simulation to Reality ) model is a technical method for solving the migration problem from a simulation environment to a reality environment in the fields of robotics, autopilot, reinforcement learning, and the like.
The processing flow of the Sim2Real model construction method based on reinforcement learning can comprise the following steps:
s1, acquiring evaluation indexes of a simulation environment and a real environment.
In one possible implementation, the evaluation index includes a successfully weighted path length and a navigation success rate.
The successfully weighted path length is specifically:
SPL = (1/N) · Σ_{b=1}^{N} S_b · L_b / max(P_b, L_b)
where SPL represents the successfully weighted path length, N represents the number of training episodes, S_b represents whether navigation succeeded or failed in the current episode, L_b represents the optimal shortest path length to the target point in the current episode, and P_b represents the path length actually traversed by the robot in the test.
The navigation success rate is specifically:
Success = S / T
where Success represents the navigation success rate, S represents the number of successful navigations, and T represents the total number of navigation tests.
In the invention, the efficiency of the robot's navigation path can be evaluated by comparing the path travelled by the robot in successful navigation with the optimal path. A higher SPL value means that the robot has chosen a path closer to the optimal one when completing the navigation task. SPL emphasizes the efficiency of the path, while Success emphasizes the completion rate of the task; combining the two allows the evaluation to balance efficiency and success rate and avoids the one-sidedness of a single-index evaluation.
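As an illustrative sketch of the two evaluation indexes (the function names and the toy episode values below are assumptions made only for illustration), each episode contributes S_b · L_b / max(P_b, L_b) to the SPL average:

```python
import numpy as np

def spl(successes, optimal_lengths, actual_lengths):
    """Success-weighted path length (SPL) over N evaluation episodes.

    successes       -- 0/1 flags S_b (navigation failed/succeeded)
    optimal_lengths -- optimal shortest-path lengths L_b
    actual_lengths  -- path lengths P_b actually travelled
    """
    s = np.asarray(successes, dtype=float)
    l = np.asarray(optimal_lengths, dtype=float)
    p = np.asarray(actual_lengths, dtype=float)
    # Each episode contributes S_b * L_b / max(P_b, L_b); average over N episodes.
    return float(np.mean(s * l / np.maximum(p, l)))

def success_rate(num_successes, num_trials):
    """Navigation success rate: successful runs divided by total test runs."""
    return num_successes / num_trials

# Example: three test episodes in either the simulation or the real environment.
print(spl([1, 0, 1], [4.0, 5.0, 3.0], [4.5, 7.0, 3.0]))  # ~0.63
print(success_rate(2, 3))                                 # ~0.67
```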
S2, quantifying the weighted difference between the simulation environment and the real environment index according to the evaluation index by using a linear weighting method.
The linear weighting method is a commonly used multi-index decision method, and is used for integrating multiple indexes or factors to make decisions or evaluations. A comprehensive score or evaluation value is obtained by assigning a weight to each index and then summing the weighted values of the indices.
In one possible implementation, the weighted difference between the simulation-environment and real-environment indexes quantified in S2 is specifically:
Sim2RealGap = ω1 · |SPL_sim − SPL_real| + ω2 · |Success_sim − Success_real|
where Sim2RealGap represents the weighted difference between the simulation-environment and real-environment indexes, ω1 represents the weight coefficient of the successfully weighted path length, ω2 represents the weight coefficient of the success rate, SPL_sim represents the successfully weighted path length in the simulation environment, SPL_real represents the successfully weighted path length in the real environment, Success_sim represents the navigation success rate in the simulation environment, and Success_real represents the navigation success rate in the real environment.
Optionally, the weight coefficient of the successfully weighted path length is 0.5, and the weight coefficient of the success rate is 0.5.
In the invention, the two key indexes are weighted and summed to generate a comprehensive scoring value which can effectively reflect the overall difference between the simulation environment and the real environment, thereby helping a developer to know more clearly how the performance of the model in the real environment deviates from the simulation result. Meanwhile, the two key indexes are given the same weight, so that the evaluation result distortion caused by the weight bias of a certain index can be avoided.
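A minimal sketch of this weighted difference, assuming the equal 0.5/0.5 weights suggested above (the function name and the example index values are illustrative):

```python
def sim2real_gap(spl_sim, spl_real, success_sim, success_real,
                 w_spl=0.5, w_success=0.5):
    """Linearly weighted difference between simulation and real-environment indexes.

    With w_spl = w_success = 0.5 the two indexes contribute equally."""
    return w_spl * abs(spl_sim - spl_real) + w_success * abs(success_sim - success_real)

# Example: the policy performs better in simulation than on the real robot.
gap = sim2real_gap(spl_sim=0.82, spl_real=0.63, success_sim=0.95, success_real=0.70)
print(gap)  # 0.5*0.19 + 0.5*0.25 = 0.22
```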
And S3, constructing a Sim2Real model for mutually converting data between the simulation environment and the Real environment.
It should be noted that, the aim of setting up Sim2Real platform is to realize the data conversion between the simulation environment and the Real environment, reduce the difference of data formats, and thus achieve the goal of mutually fusing environment data. A large amount of state, motion, rewards, etc. information is generated during interaction with the high fidelity simulation environment and the real environment, stored in a buffer pool for analysis and evaluation, and valid data is multiplexed.
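The buffer pool can be realized, for example, as a simple tagged replay buffer; the following sketch is only an assumption about one possible structure (the class name, fields and capacity are illustrative, not prescribed by the method):

```python
import random
from collections import deque

class BufferPool:
    """Shared pool of (state, action, reward, next_state, done) tuples collected
    from both the simulation and the real environment, so that valid data can be
    multiplexed for later analysis, evaluation and policy updates."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done, source):
        # `source` tags where the transition came from: "sim" or "real".
        self.buffer.append((state, action, reward, next_state, done, source))

    def sample(self, batch_size):
        # Uniformly sample a mixed batch for analysis or training.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)
```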
In one possible embodiment, S3 specifically includes:
S301, installing Gazebo and Rviz in a Sim2Real model, and starting up a simulation environment of the TurtleBot robot.
Gazebo is a powerful open-source robot simulation tool widely used in robotics research and development. It provides a highly configurable simulation environment that can simulate complex physical characteristics and sensor data, helping users test and optimize robotic systems in a virtual environment.
Wherein Rviz is a visualization tool of ROS (Robot Operating System) for displaying and analyzing sensor data, robot status and environmental information in a robot operating system.
In the present invention Gazebo provides a simulation platform to simulate the operation of a robot, while Rviz provides a tool to visualize and debug the data and behavior in the simulation. The two are matched for use, so that the efficient development and verification of the robot system can be realized.
S302, developing a bridging node from the simulation environment to the Gazebo-like real environment and determining the data conversion relationship between the simulation environment and the real environment, wherein the bridging node is used for creating a ROS node and for subscribing to and publishing topics.
In ROS (Robot Operating System), the nodes are basic communication units, each of which is an execution unit responsible for handling a particular function or task. The nodes may be independent processes or may run on the same machine, executing different parts of the robotic system.
In the invention, the compatibility and consistency of the data format between the simulation environment and the reality environment are ensured through the bridging node and the data conversion mechanism. This consistency is the basis for ensuring that the model behaves consistently in both environments, enabling the model to seamlessly apply knowledge gained in the simulation environment in the real-world environment.
S303, the simulation-to-reality node is responsible for converting discrete actions output in the simulation environment into continuous actions executable in the real environment.
And S304, converting the topic data related to the sensor in the real environment into topic data identifiable by the robot in the simulation environment by using the real-to-simulation node.
S305, creating a simulation environment Ros node and receiving topic data.
S306, integrating topic data and topic data of sensors in the simulation environment, updating the strategy, and outputting an action instruction.
S307, the action instructions are respectively input into the simulation environment and the real environment for training.
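Steps S302 to S304 can be illustrated with a single minimal bridging-node sketch in rospy; the topic names, the discrete action set and the velocity values below are assumptions made only for illustration, since the method only prescribes that the node creates a ROS node and subscribes to and publishes topics:

```python
#!/usr/bin/env python
import rospy
from std_msgs.msg import Int32
from sensor_msgs.msg import LaserScan
from geometry_msgs.msg import Twist

# S303: map discrete actions from the simulation policy to continuous commands.
DISCRETE_TO_TWIST = {
    0: (0.15, 0.0),    # forward
    1: (0.0, 0.75),    # turn left
    2: (0.0, -0.75),   # turn right
    3: (0.0, 0.0),     # stop
}

class Sim2RealBridge:
    def __init__(self):
        rospy.init_node("sim2real_bridge")
        # Real-robot command publisher and simulation-side scan re-publisher.
        self.cmd_pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)
        self.sim_scan_pub = rospy.Publisher("/sim/scan", LaserScan, queue_size=1)
        # S302: subscribe to simulation-side actions and real-side sensor topics.
        rospy.Subscriber("/sim/discrete_action", Int32, self.on_action)
        rospy.Subscriber("/scan", LaserScan, self.on_scan)

    def on_action(self, msg):
        # S303: discrete action index -> continuous velocity command.
        linear, angular = DISCRETE_TO_TWIST.get(msg.data, (0.0, 0.0))
        cmd = Twist()
        cmd.linear.x = linear
        cmd.angular.z = angular
        self.cmd_pub.publish(cmd)

    def on_scan(self, msg):
        # S304: forward real sensor data in a form the simulated stack can read.
        self.sim_scan_pub.publish(msg)

if __name__ == "__main__":
    Sim2RealBridge()
    rospy.spin()
```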
According to the invention, through mutually converting the data in the simulation environment and the data in the Real environment, the Sim2Real model can be exposed to more diversified data in training, and the overfitting of the model to a single environment is reduced, so that the adaptability and the robustness of the model in different environments are enhanced. Meanwhile, in actual operation, an error of the robot or the autopilot system may cause danger or damage. Through advanced testing and optimization in the simulation environment, errors possibly occurring in the real environment can be reduced to the greatest extent, and the safety of the whole system is improved.
Further, by storing data in the simulation and reality environment into a buffer pool, the system is able to efficiently multiplex the data for subsequent analysis and evaluation. The method not only improves the utilization rate of the data, but also improves the model or strategy through further analysis, thereby improving the overall performance of the system.
S4, performing field self-adaptive training on the Sim2Real model by using a reinforcement learning algorithm and taking the minimum weighted difference between the simulation environment and the Real environment index as a target to obtain a final Sim2Real model.
Reinforcement learning (Reinforcement Learning, RL) is a method of machine learning, among other things, aimed at learning how to choose actions by interacting with the environment to maximize the cumulative rewards. The core of reinforcement learning is to constantly optimize strategies through exploration and utilization so that the intelligent agent can make optimal decisions in different environments.
It should be noted that, the field-adaptive-based method collects relevant data of the real environment, such as sensor readings, environment layout, and similar dynamic elements. The data and the data in the simulation environment form a more comprehensive training set, and the more comprehensive training set and the data are added into the training process of the model together, so that the model can be contacted with more diversified environmental characteristics. And adjusting the ROS topics sent by the real environment and the action data calculated by the strategy network to enable the ROS topics to be fused with the simulation environment data, and adjusting training logic to enable the intelligent agent to receive the simulation and the real environment data at the same time and update the strategy model.
In one possible embodiment, S4 specifically includes:
S401, acquiring an original ROS image message sent by a real environment.
A ROS image refers to a message format used for representing and transmitting image data in the Robot Operating System (ROS). ROS images are commonly used in robotic vision systems, where image data captured by cameras or other sensors may be published, subscribed to, and processed in the ROS network.
S402, converting the original ROS image message into an OpenCV image format.
OpenCV (Open Source Computer Vision Library) is an open source computer vision library, which is widely used for real-time image processing and computer vision tasks. The system provides rich functions and tools, and supports applications such as image and video processing, analysis, recognition and the like.
S403, designating the data type of the converted image as 32-bit floating point number, and adjusting the image size to 256 pixels in height and width.
S404, storing the original ROS image message into an observation data queue through a callback function.
A callback function is a programming technique for executing a predefined function when an event occurs. Callback functions are a mechanism for handling asynchronous operations, event-driven programming, and reactive programming, and are widely used in many programming languages and frameworks.
S405, judging whether the observation data queue is empty. If yes, continuing the training process of the simulation environment data. Otherwise, the real environment data is taken out from the observation data queue and is used as the data set together with the simulation data to complete the updating of the strategy.
In the invention, the integration of real environment data into the training process helps to narrow the gap between simulation and reality. By including the sensor readings and the environment layout information of the real environment in the training set, the model can learn the actual situation in the real environment, thereby better coping with the real challenges that the simulation environment cannot be completely simulated.
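Steps S401 to S405 can be sketched as follows; the camera topic name, the queue capacity and the batch handling are assumptions made only for illustration:

```python
#!/usr/bin/env python
from collections import deque
import cv2
import numpy as np
import rospy
from cv_bridge import CvBridge
from sensor_msgs.msg import Image

bridge = CvBridge()
observation_queue = deque(maxlen=100)

def image_callback(msg):
    # S402: raw ROS image message -> OpenCV image.
    cv_image = bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")
    # S403: 32-bit floating point, 256 x 256 pixels.
    obs = cv2.resize(cv_image, (256, 256)).astype(np.float32)
    # S404: store the observation via the callback for the training loop.
    observation_queue.append(obs)

rospy.init_node("real_observation_listener")
rospy.Subscriber("/camera/rgb/image_raw", Image, image_callback)

def next_training_batch(sim_batch):
    # S405: if no real data has arrived, train on simulation data alone;
    # otherwise mix the queued real observation into the data set.
    if not observation_queue:
        return sim_batch
    real_obs = observation_queue.popleft()
    return sim_batch + [real_obs]
```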
S406, obtaining integer action data through calculation of an Actor-Critic network.
According to the invention, the action data is migrated from the simulation environment to the real environment through the calculation of the Actor-Critic network, so that the action execution of the robot in the real environment is consistent with the simulation. Thus, the strategy learned by the model can be accurately realized in a real environment, and errors caused by action differences are reduced.
And S407, transmitting the action data to the real environment so that the robot in the real environment executes the action data and updates the environment state.
In the Sim2Real model construction method based on reinforcement learning, a joint training strategy is adopted. In particular, the simulation environment can quickly provide a large amount of data for efficient training of reinforcement learning, while the real environment provides critical correction data. By training the model on a large scale in a simulation environment, in combination with small scale fine tuning in a real environment, the model can be optimized simultaneously in both fields, thus performing better in the real world. The method of field self-adaption is introduced to realize alignment of simulation and real state distribution, and the simulation state gradually approaches to the real state, so that the difference in distribution is reduced.
The state space represents the current state of the system and is used for quantifying the difference between the simulation environment and the real environment. In this process, the state space includes a plurality of evaluation metrics reflecting the degree of discrepancy between the simulation and reality. The set state space is S, which can be expressed as:
S = {E_S, E_r}
where S represents the state space, E_S is the evaluation index of the simulation environment, and E_r is the evaluation index of the real environment.
The state space consists of the robot's speed v_s, angle θ_s and sensor data d_s in the simulation environment, together with the corresponding data (v_r, θ_r, d_r) in the real environment. The state space therefore contains not only the robot's attributes in the simulation environment but also the comparison with the real world, quantifying the accuracy of the simulation model on the different indexes.
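For illustration, such a state can be represented as a simple data structure (the field names and the vectorization below are illustrative assumptions):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Sim2RealState:
    """One state in S = {E_S, E_r}: the robot's attributes in simulation
    (v_s, theta_s, d_s) paired with the corresponding real-environment
    readings (v_r, theta_r, d_r)."""
    v_sim: float
    theta_sim: float
    scan_sim: np.ndarray   # d_s: simulated range-sensor readings
    v_real: float
    theta_real: float
    scan_real: np.ndarray  # d_r: real range-sensor readings

    def as_vector(self):
        # Flatten into the vector form a policy/critic network would consume.
        return np.concatenate(([self.v_sim, self.theta_sim], self.scan_sim,
                               [self.v_real, self.theta_real], self.scan_real))
```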
The design of the action space is directed to how to adjust parameters in the simulation environment to reduce the gap between simulation and reality. The action space is set to a, expressed as:
A = {a_1, a_2, …, a_m}
where A denotes the action space, a_j denotes the j-th action in the action space, j = 1, 2, …, m, and m denotes the total number of actions.
Each a j represents a specific parameter that can be adjusted, and by changing these parameters, the performance of the simulation environment is affected, thereby affecting the adaptability of the Sim2Real model to the Real scene.
In the training of Sim2Real, the design of the reward function is tied to the goal of minimizing the weighted difference between the simulation environment and the real environment. By introducing a time dimension, the difference between the simulation and the real environment at different moments is measured, helping the model adapt to long-term dynamic changes of the environment. The reward function is set as follows:
R = −Σ_{t=1}^{T} Σ_{i=1}^{n} ω_i · |E_s(i, t) − E_r(i, t)|
where R represents the reward function, T represents the time span of the whole task, ω_i represents the weight of the i-th index, n represents the total number of indexes, E_s(i, t) represents the value of the i-th index at time t in the simulation environment, and E_r(i, t) represents the value of the i-th index at time t in the real environment.
By introducing a dimension of time, the reward function can be measured based on the state difference at each moment, and the passing of the reward value reduces the difference accumulated value in time, so that the performance of the simulation environment changing along with the time can gradually approach to the real environment.
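A sketch of this reward, assuming it is the negative of the time-accumulated weighted index difference as described above (the function name and the example values are illustrative):

```python
import numpy as np

def sim2real_reward(E_sim, E_real, weights):
    """Reward for domain-adaptive training.

    E_sim, E_real -- arrays of shape (T, n): value of each of the n indexes
                     at every time step t in the simulation / real environment
    weights       -- array of shape (n,): weight omega_i of each index

    The reward is the negative weighted, time-accumulated index difference,
    so maximizing it drives the simulation indexes toward the real ones.
    """
    E_sim = np.asarray(E_sim, dtype=float)
    E_real = np.asarray(E_real, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(-np.sum(weights * np.abs(E_sim - E_real)))

# Example with T = 2 time steps and n = 2 indexes (SPL, success rate).
print(sim2real_reward([[0.80, 0.90], [0.82, 0.92]],
                      [[0.60, 0.70], [0.65, 0.75]],
                      [0.5, 0.5]))  # -0.37
```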
According to the invention, the Sim2Real model is subjected to field self-adaptive training through the reinforcement learning algorithm, and the Real environment data and the simulation environment data are combined, so that the performance of the model in the Real environment is improved, the gap between simulation and reality is reduced, the generalization capability of a strategy is enhanced, and the training efficiency and accuracy are optimized. Meanwhile, the robustness of the model can be improved in the training process, the effectiveness of the strategy model is improved, and a more reliable and efficient solution is provided for the intelligent agent in practical application.
The technical scheme provided by the embodiment of the invention has the beneficial effects that at least:
(1) According to the invention, the difference between the simulation environment and the Real environment is determined by acquiring the evaluation indexes of the two, the weighted difference between the evaluation indexes is quantified by using a linear weighting method, and the data layer is adjusted according to the difference between the simulation environment and the Real environment by constructing a Sim2Real model capable of carrying out data conversion between the simulation environment and the Real environment, so that the migration error from the simulation environment to the Real environment is reduced, and the complexity and dynamic change of the Real environment are effectively reflected.
(2) In the invention, a reinforcement learning algorithm is utilized, the field self-adaptive training is carried out on the Sim2Real model by taking the minimum weighted difference between the simulation environment and the Real environment index as a target, and the final Sim2Real model is obtained. The model is continuously optimized to be excellent in a wider environment, so that the adaptability of the model in the face of various changes in reality is enhanced, and the generalization capability of strategies is improved.
Referring to fig. 2 of the specification, a schematic structural diagram of a Sim2Real model building device based on reinforcement learning is shown.
The invention also provides a Sim2Real model construction device 20 based on reinforcement learning, comprising:
an acquisition module 201, configured to acquire evaluation indexes of a simulation environment and a real environment;
In one possible implementation, the evaluation index includes a successfully weighted path length and a navigation success rate;
the successfully weighted path length is specifically:
SPL = (1/N) · Σ_{b=1}^{N} S_b · L_b / max(P_b, L_b)
where SPL represents the successfully weighted path length, N represents the number of training episodes, S_b represents whether navigation succeeded or failed in the current episode, L_b represents the optimal shortest path length to the target point in the current episode, and P_b represents the path length actually travelled by the robot in the test;
the navigation success rate is specifically:
Success = S / T
where Success represents the navigation success rate, S represents the number of successful navigations, and T represents the total number of navigation tests.
A quantization module 202, configured to quantize a weighted difference between the simulation environment and the real environment index according to the evaluation index by using a linear weighting method;
In one possible implementation, the weighted difference between the simulation-environment and real-environment indexes quantified by the quantization module is specifically:
Sim2RealGap = ω1 · |SPL_sim − SPL_real| + ω2 · |Success_sim − Success_real|
where Sim2RealGap represents the weighted difference between the simulation-environment and real-environment indexes, ω1 represents the weight coefficient of the successfully weighted path length, ω2 represents the weight coefficient of the success rate, SPL_sim represents the successfully weighted path length in the simulation environment, SPL_real represents the successfully weighted path length in the real environment, Success_sim represents the navigation success rate in the simulation environment, and Success_real represents the navigation success rate in the real environment.
The building module 203 is configured to build a Sim2Real model that performs inter-conversion on data between a simulation environment and a Real environment;
in one possible implementation, the building module 203 is configured to:
Installing Gazebo and Rviz in the Sim2Real model, and starting up a simulation environment of the TurtleBot robot;
developing a bridging node from the simulation environment to Gazebo types of real environments, and determining a data conversion relationship between the simulation environment and the real environments, wherein the bridging node is used for creating a Ros node and subscribing and publishing topics;
The node from simulation to reality is responsible for converting discrete actions output in the simulation environment into continuous actions executable in the real environment;
the realization-to-simulation node converts topic data related to a sensor in a real environment into topic data identifiable by a robot in a simulation environment;
creating a simulation environment Ros node, and receiving the topic data;
integrating the topic data with topic data of sensors in a simulation environment, updating a strategy, and outputting an action instruction;
and respectively inputting the action instructions into a simulation environment and a real environment for training.
And the training module 204 is configured to perform field adaptive training on the Sim2Real model by using a reinforcement learning algorithm with a minimum weighted difference between the simulation environment and the Real environment index as a target, so as to obtain a final Sim2Real model.
In one possible implementation, the training module 204 is configured to:
acquiring an original ROS image message sent by a real environment;
converting the original ROS image message into an OpenCV image format;
designating the data type of the converted image as 32-bit floating point number, and adjusting the image size to 256 pixel points in height and width;
Storing the original ROS image information into an observation data queue through a callback function;
judging whether the observed data queue is empty or not, if so, continuing the training process of the simulation environment data, otherwise, taking out the real environment data from the observed data queue and using the real environment data and the simulation data together as a data set to finish the updating of the strategy;
Obtaining integer action data through calculation of an Actor-Critic network;
and sending the action data to the real environment so that the robot in the real environment executes the action data and updates the environment state.
The Sim2Real model construction device 20 based on reinforcement learning provided by the present invention can execute the Sim2Real model construction method based on reinforcement learning, and achieve the same or similar technical effects, and in order to avoid repetition, the present invention is not repeated.
The technical scheme provided by the embodiment of the invention has the beneficial effects that at least:
(1) According to the invention, the difference between the simulation environment and the Real environment is determined by acquiring the evaluation indexes of the two, the weighted difference between the evaluation indexes is quantified by using a linear weighting method, and the data layer is adjusted according to the difference between the simulation environment and the Real environment by constructing a Sim2Real model capable of carrying out data conversion between the simulation environment and the Real environment, so that the migration error from the simulation environment to the Real environment is reduced, and the complexity and dynamic change of the Real environment are effectively reflected.
(2) In the invention, a reinforcement learning algorithm is utilized, the field self-adaptive training is carried out on the Sim2Real model by taking the minimum weighted difference between the simulation environment and the Real environment index as a target, and the final Sim2Real model is obtained. The model is continuously optimized to be excellent in a wider environment, so that the adaptability of the model in the face of various changes in reality is enhanced, and the generalization capability of strategies is improved.
It should be appreciated that the processor in embodiments of the invention may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
It should also be appreciated that the memory in embodiments of the present invention may be volatile memory or nonvolatile memory, or may include both. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
The above embodiments may be implemented in whole or in part by software, hardware (e.g., circuitry), firmware, or any other combination. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer instructions or computer programs. When the computer instructions or computer program are loaded or executed on a computer, the processes or functions described in accordance with embodiments of the present invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website site, computer, server, or data center to another website site, computer, server, or data center by wired (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains one or more sets of available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. The semiconductor medium may be a solid state disk.
It should be understood that the term "and/or" is merely an association relationship describing the associated object, and means that three relationships may exist, for example, a and/or B, and may mean that a exists alone, while a and B exist alone, and B exists alone, wherein a and B may be singular or plural. In addition, the character "/" herein generally indicates that the associated object is an "or" relationship, but may also indicate an "and/or" relationship, and may be understood by referring to the context.
In the present invention, "at least one" means one or more, and "a plurality" means two or more. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (a, b, or c) of a, b, c, a-b, a-c, b-c, or a-b-c may be represented, wherein a, b, c may be single or plural.
It should be understood that, in various embodiments of the present invention, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the apparatus, device and unit described above may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another device, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The storage medium includes a U disk, a removable hard disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, an optical disk, or other various media capable of storing program codes.
Embodiments of the present invention provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a reinforcement learning based Sim2Real model construction method as described in the method embodiments.
The computer readable storage medium provided by the invention can realize the steps and effects of the Sim2Real model construction method based on reinforcement learning in the method embodiment, and in order to avoid repetition, the invention is not repeated.
The technical scheme provided by the embodiment of the invention has the beneficial effects that at least:
(1) According to the invention, the difference between the simulation environment and the Real environment is determined by acquiring the evaluation indexes of the two, the weighted difference between the evaluation indexes is quantified by using a linear weighting method, and the data layer is adjusted according to the difference between the simulation environment and the Real environment by constructing a Sim2Real model capable of carrying out data conversion between the simulation environment and the Real environment, so that the migration error from the simulation environment to the Real environment is reduced, and the complexity and dynamic change of the Real environment are effectively reflected.
(2) In the invention, a reinforcement learning algorithm is utilized, the field self-adaptive training is carried out on the Sim2Real model by taking the minimum weighted difference between the simulation environment and the Real environment index as a target, and the final Sim2Real model is obtained. The model is continuously optimized to be excellent in a wider environment, so that the adaptability of the model in the face of various changes in reality is enhanced, and the generalization capability of strategies is improved.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
The following points need to be described:
(1) The drawings of the embodiments of the present invention relate only to the structures related to the embodiments of the present invention, and other structures may refer to the general designs.
(2) In the drawings for describing embodiments of the present invention, the thickness of layers or regions is exaggerated or reduced for clarity, i.e., the drawings are not drawn to actual scale. It will be understood that when an element such as a layer, film, region or substrate is referred to as being "on" or "under" another element, it can be "directly on" or "under" the other element or intervening elements may be present.
(3) The embodiments of the invention and the features of the embodiments can be combined with each other to give new embodiments without conflict.
The present invention is not limited to the above embodiments, but the scope of the invention is defined by the claims.

Claims (8)

1.一种基于强化学习的Sim2Real模型构建方法,其特征在于,包括:1. A Sim2Real model construction method based on reinforcement learning, characterized by comprising: S1:获取仿真环境与现实环境的评估指标;S1: Obtain evaluation indicators of simulation environment and real environment; S2:利用线性加权法,根据所述评估指标,量化仿真环境与现实环境指标之间的加权差异;S2: using a linear weighted method, according to the evaluation index, quantifying the weighted difference between the simulation environment and the real environment index; S3:搭建对仿真环境与真实环境之间的数据进行相互转换的Sim2Real模型;S3: Build a Sim2Real model to convert data between the simulation environment and the real environment; S4:利用强化学习算法,以仿真环境与现实环境指标之间的加权差异最小为目标,对所述Sim2Real模型进行领域自适应训练,得到最终的Sim2Real模型;S4: Using a reinforcement learning algorithm, with the goal of minimizing the weighted difference between the simulated environment and the real environment indicators, the Sim2Real model is trained for domain adaptation to obtain a final Sim2Real model; 其中,所述S2中量化仿真环境与现实环境指标之间的加权差异具体为:The weighted difference between the quantitative simulation environment and the real environment indicators in S2 is specifically: ; 其中,表示仿真环境与现实环境指标之间的加权差异,ω 1表示成功加权路径长度的权重系数,ω 2表示成功率的权重系数,表示仿真环境中成功加权路径长度,表示真实环境中成功加权路径长度,表示成功率,表示仿真环境中的导航成功率,表示在真实环境中的导航成功率;in, represents the weighted difference between the indicators of the simulation environment and the real environment, ω1 represents the weight coefficient of the successful weighted path length, ω2 represents the weight coefficient of the success rate, represents the length of the successful weighted path in the simulation environment, represents the length of the successful weighted path in the real environment, Indicates the success rate, represents the navigation success rate in the simulation environment, Indicates the navigation success rate in the real environment; 其中,在所述Sim2Real模型的训练中,设定奖励函数为:In the training of the Sim2Real model, the reward function is set as: ; 其中,R表示奖励函数,T表示整个任务的时间跨度,ω i 表示第i个指标的权重,n表示指标的总数,表示仿真环境中t时刻第i个指标的值,表示现实环境中t时刻第i个指标的值。Where R represents the reward function, T represents the time span of the entire task, ωi represents the weight of the i -th indicator , and n represents the total number of indicators. represents the value of the i- th indicator at time t in the simulation environment, Represents the value of the i- th indicator at time t in the real environment. 2.根据权利要求1所述的基于强化学习的Sim2Real模型构建方法,其特征在于,所述评估指标包括:成功加权路径长度以及导航成功率;2. The Sim2Real model construction method based on reinforcement learning according to claim 1, wherein the evaluation indicators include: success weighted path length and navigation success rate; 所述成功加权路径长度具体为:The success weighted path length is specifically: ; 其中,SPL表示成功加权路径长度,N表示训练的节数,S b 表示在当前节数下导航成功或失败,L b 表示在当前节数下到达目标点的最优最短路径长度,P b 表示实际测试中机器人走过的路径长度;Where SPL represents the successful weighted path length, N represents the number of training segments, Sb represents the success or failure of navigation under the current number of segments, Lb represents the optimal shortest path length to the target point under the current number of segments , and Pb represents the path length traversed by the robot in the actual test; 所述导航成功率具体为:The navigation success rate is specifically: ; 其中,表示成功率,S表示导航成功次数,T表示导航测试总次数之比。in, represents the success rate, S represents the number of successful navigations, and T represents the ratio of the total number of navigation tests. 3.根据权利要求1所述的基于强化学习的Sim2Real模型构建方法,其特征在于,所述S3具体包括:3. 
The Sim2Real model construction method based on reinforcement learning according to claim 1, wherein S3 specifically includes: S301:在所述Sim2Real模型中安装Gazebo和Rviz,并启动TurtleBot3机器人的仿真环境;S301: Install Gazebo and Rviz in the Sim2Real model and start the simulation environment of the TurtleBot3 robot; S302:开发所述仿真环境到Gazebo类真实环境的桥接节点,确定仿真环境与真实环境之间的数据相互转换关系,所述桥接节点用于创建Ros节点并订阅和发布话题;S302: Develop a bridge node from the simulation environment to the Gazebo-like real environment, determine the data conversion relationship between the simulation environment and the real environment, and the bridge node is used to create a Ros node and subscribe to and publish topics; S303:仿真到现实的节点负责将仿真环境中输出的离散动作转换为真实环境中可执行的连续动作;S303: The simulation-to-reality node is responsible for converting discrete actions output in the simulation environment into continuous actions that can be executed in the real environment; S304:现实到仿真的节点将真实环境中传感器相关话题数据转换为仿真环境中机器人可识别的话题数据;S304: The reality-to-simulation node converts the sensor-related topic data in the real environment into topic data that can be recognized by the robot in the simulation environment; S305:创建仿真环境Ros节点,接收所述话题数据;S305: Create a simulation environment Ros node to receive the topic data; S306:融合所述话题数据以及仿真环境中传感器的话题数据,对策略进行更新,输出动作指令;S306: Fusing the topic data and the topic data of sensors in the simulation environment, updating the strategy, and outputting action instructions; S307:将所述动作指令分别输入到仿真环境和真实环境中,以供训练使用。S307: Inputting the action instructions into the simulation environment and the real environment respectively for training. 4.根据权利要求1所述的基于强化学习的Sim2Real模型构建方法,其特征在于,S4具体包括:4. The Sim2Real model construction method based on reinforcement learning according to claim 1, wherein S4 specifically includes: S401:获取真实环境发送的原始ROS图像消息;S401: Obtain the original ROS image message sent by the real environment; S402:将所述原始ROS图像消息转换为OpenCV的图像格式;S402: Convert the original ROS image message into an OpenCV image format; S403:将转换后的图像的数据类型指定为32位浮点数,图像尺寸调整为高度和宽度均为256个像素点;S403: specifying the data type of the converted image as a 32-bit floating point number, and adjusting the image size to 256 pixels in height and width; S404:通过回调函数将所述原始ROS图像消息存放进观察数据队列中;S404: storing the original ROS image message into the observation data queue through the callback function; S405:判断所述观察数据队列是否为空;若是,继续进行仿真环境数据的训练过程;否则,从观察数据队列中取出真实环境数据,并与仿真数据共同作为数据集完成策略的更新;S405: Determine whether the observed data queue is empty; if so, continue the training process of the simulated environment data; otherwise, take the real environment data from the observed data queue and use it together with the simulated data as a data set to complete the strategy update; S406:通过Actor-Critic网络的计算得到整型的动作数据;S406: Obtaining integer action data through calculation of the Actor-Critic network; S407:将所述动作数据发送至真实环境,以使真实环境中的机器人执行所述动作数据,并更新环境状态。S407: Sending the motion data to the real environment, so that the robot in the real environment executes the motion data and updates the environment state. 5.一种基于强化学习的Sim2Real模型构建装置,其特征在于,包括:5. 
A Sim2Real model construction device based on reinforcement learning, characterized by comprising: 获取模块,用于获取仿真环境与现实环境的评估指标;The acquisition module is used to obtain the evaluation indicators of the simulation environment and the real environment; 量化模块,用于利用线性加权法,根据所述评估指标,量化仿真环境与现实环境指标之间的加权差异;A quantification module, configured to quantify the weighted difference between the simulation environment and the real environment indicators according to the evaluation indicators using a linear weighting method; 搭建模块,用于搭建对仿真环境与真实环境之间的数据进行相互转换的Sim2Real模型;Building a module for building a Sim2Real model that converts data between the simulation environment and the real environment; 训练模块,用于用强化学习算法,以仿真环境与现实环境指标之间的加权差异最小为目标,对所述Sim2Real模型进行领域自适应训练,得到最终的Sim2Real模型;A training module is used to perform domain adaptive training on the Sim2Real model using a reinforcement learning algorithm with the goal of minimizing the weighted difference between the simulated environment and the real environment indicators to obtain a final Sim2Real model; 其中,所述量化模块中量化仿真环境与现实环境指标之间的加权差异具体为:The weighted difference between the quantitative simulation environment and the real environment indicators in the quantification module is specifically: ; 其中,表示仿真环境与现实环境指标之间的加权差异,ω 1表示成功加权路径长度的权重系数,ω 2表示成功率的权重系数,表示仿真环境中成功加权路径长度,表示真实环境中成功加权路径长度,表示成功率,表示仿真环境中的导航成功率,表示在真实环境中的导航成功率;in, represents the weighted difference between the indicators of the simulation environment and the real environment, ω1 represents the weight coefficient of the successful weighted path length, ω2 represents the weight coefficient of the success rate, represents the length of the successful weighted path in the simulation environment, represents the length of the successful weighted path in the real environment, Indicates the success rate, represents the navigation success rate in the simulation environment, Indicates the navigation success rate in the real environment; 其中,在所述Sim2Real模型的训练中,设定奖励函数为:In the training of the Sim2Real model, the reward function is set as: ; 其中,R表示奖励函数,T表示整个任务的时间跨度,ω i 表示第i个指标的权重,n表示指标的总数,表示仿真环境中t时刻第i个指标的值,表示现实环境中t时刻第i个指标的值。Where R represents the reward function, T represents the time span of the entire task, ωi represents the weight of the i -th indicator , and n represents the total number of indicators. represents the value of the i- th indicator at time t in the simulation environment, Represents the value of the i- th indicator at time t in the real environment. 6.根据权利要求5所述的基于强化学习的Sim2Real模型构建装置,其特征在于,所述评估指标包括:成功加权路径长度以及导航成功率;6. The Sim2Real model construction device based on reinforcement learning according to claim 5, wherein the evaluation indicators include: successful weighted path length and navigation success rate; 所述成功加权路径长度具体为:The success weighted path length is specifically: ; 其中,SPL表示成功加权路径长度,N表示训练的节数,S b 表示在当前节数下导航成功或失败,L b 表示在当前节数下到达目标点的最优最短路径长度,P b 表示实际测试中机器人走过的路径长度;Where SPL represents the successful weighted path length, N represents the number of training segments, Sb represents the success or failure of navigation under the current number of segments, Lb represents the optimal shortest path length to the target point under the current number of segments , and Pb represents the path length traversed by the robot in the actual test; 所述导航成功率具体为:The navigation success rate is specifically: ; 其中,表示成功率,S表示导航成功次数,T表示导航测试总次数之比。in, represents the success rate, S represents the number of successful navigations, and T represents the ratio of the total number of navigation tests. 
7. The Sim2Real model construction device based on reinforcement learning according to claim 5, characterized in that the building module is configured to:
install Gazebo and Rviz in the Sim2Real model and launch the simulation environment of the TurtleBot3 robot;
develop a bridge node from the simulation environment to the Gazebo-like real environment and determine the mutual data conversion relationship between the simulation environment and the real environment, the bridge node being used to create ROS nodes and to subscribe to and publish topics;
wherein the simulation-to-reality node converts the discrete actions output in the simulation environment into continuous actions executable in the real environment;
and the reality-to-simulation node converts the sensor-related topic data in the real environment into topic data recognizable by the robot in the simulation environment;
create a simulation-environment ROS node to receive the topic data;
fuse the topic data with the topic data of the sensors in the simulation environment, update the policy, and output action instructions;
input the action instructions into the simulation environment and the real environment respectively for training.
8. The Sim2Real model construction device based on reinforcement learning according to claim 5, characterized in that the training module is configured to:
obtain the original ROS image message sent by the real environment;
convert the original ROS image message into the OpenCV image format;
specify the data type of the converted image as a 32-bit floating-point number and resize the image to 256 pixels in both height and width;
store the original ROS image message in the observation data queue through a callback function;
determine whether the observation data queue is empty; if so, continue the training process with simulation environment data; otherwise, take real environment data from the observation data queue and use it together with the simulation data as a data set to complete the policy update;
obtain integer action data through the computation of the Actor-Critic network;
send the action data to the real environment so that the robot in the real environment executes the action data and updates the environment state.
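The training-module step of claim 8, in which an Actor-Critic network turns a 256x256 observation (taken from the real-environment queue or from the simulator) into integer action data, could look roughly like the following sketch; the convolutional architecture, layer sizes, and four-action space are assumptions made purely for illustration.

```python
# Illustrative Actor-Critic action selection; not the patented network.
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, num_actions=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        feat_dim = 32 * 30 * 30  # flattened feature size for a 256x256 RGB input
        self.actor = nn.Linear(feat_dim, num_actions)   # action logits
        self.critic = nn.Linear(feat_dim, 1)            # state value

    def forward(self, obs):
        feat = self.encoder(obs)
        return self.actor(feat), self.critic(feat)

def select_action(model, obs_image):
    # obs_image: float32 array of shape (256, 256, 3) from the observation queue
    # (real data) or from the simulator.
    obs = torch.from_numpy(obs_image).permute(2, 0, 1).unsqueeze(0)
    logits, value = model(obs)
    action = torch.distributions.Categorical(logits=logits).sample()
    return int(action.item())  # integer action data sent to both environments
```

select_action would be called once per control step; the returned integer is what a bridge node of the kind sketched earlier maps to a continuous velocity command for the real robot.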
CN202411610785.4A 2024-11-12 2024-11-12 A Sim2Real model construction method and device based on reinforcement learning Active CN119669952B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411610785.4A CN119669952B (en) 2024-11-12 2024-11-12 A Sim2Real model construction method and device based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411610785.4A CN119669952B (en) 2024-11-12 2024-11-12 A Sim2Real model construction method and device based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN119669952A CN119669952A (en) 2025-03-21
CN119669952B true CN119669952B (en) 2025-10-24

Family

ID=94997661

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411610785.4A Active CN119669952B (en) 2024-11-12 2024-11-12 A Sim2Real model construction method and device based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN119669952B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120116237B (en) * 2025-05-14 2025-08-01 深圳市众擎机器人科技有限公司 A humanoid robot motion training method and device based on real scenes

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117406706A (en) * 2023-08-11 2024-01-16 汕头大学 Multi-agent obstacle avoidance method and system combining causal model and deep reinforcement learning

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3884432A1 (en) * 2018-11-21 2021-09-29 Amazon Technologies, Inc. Reinforcement learning model training through simulation
CN113609786B (en) * 2021-08-27 2022-08-19 中国人民解放军国防科技大学 Mobile robot navigation method, device, computer equipment and storage medium
US12124230B2 (en) * 2021-12-10 2024-10-22 Mitsubishi Electric Research Laboratories, Inc. System and method for polytopic policy optimization for robust feedback control during learning
CN114290339B (en) * 2022-03-09 2022-06-21 南京大学 A Robot Reality Transfer Method Based on Reinforcement Learning and Residual Modeling
CN116679711A (en) * 2023-06-16 2023-09-01 浙江润琛科技有限公司 Robot obstacle avoidance method based on model-based reinforcement learning and model-free reinforcement learning
CN117596700A (en) * 2023-11-17 2024-02-23 重庆邮电大学 A transmission scheduling method for Internet of Vehicles based on transfer reinforcement learning
CN117733863A (en) * 2023-12-29 2024-03-22 科大讯飞股份有限公司 Robot motion control method, device, equipment, robot and storage medium
CN118311976B (en) * 2024-06-05 2024-09-27 汕头大学 CFS-based multi-unmanned aerial vehicle obstacle avoidance method, system, device and medium
CN118730145B (en) * 2024-06-05 2025-09-23 合肥工业大学 A path planning method for park logistics vehicles based on map-free navigation

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117406706A (en) * 2023-08-11 2024-01-16 汕头大学 Multi-agent obstacle avoidance method and system combining causal model and deep reinforcement learning

Also Published As

Publication number Publication date
CN119669952A (en) 2025-03-21

Similar Documents

Publication Publication Date Title
CN111311685B (en) An unsupervised method for moving scene reconstruction based on IMU and monocular images
CN110692066B (en) Selecting actions using multimodal input
WO2021164276A1 (en) Target tracking method and apparatus, computer device, and storage medium
CN112313672B (en) Stacked Convolutional Long Short-Term Memory for Model-Free Reinforcement Learning
CN110705407B (en) Face beauty prediction method and device based on multi-task transfer
CN112258565B (en) Image processing method and device
US20220366246A1 (en) Controlling agents using causally correct environment models
CN119669952B (en) A Sim2Real model construction method and device based on reinforcement learning
CN115578570A (en) Image processing method, device, readable medium and electronic equipment
CN113095129A (en) Attitude estimation model training method, attitude estimation device and electronic equipment
CN118393900B (en) Automatic driving decision control method, device, system, equipment and storage medium
CN116188366B (en) Knowledge and data-driven brain network computing method, device, electronic device and storage medium
CN114889638A (en) Trajectory prediction method and system in automatic driving system
CN114626505A (en) Mobile robot deep reinforcement learning control method
JP7340055B2 (en) How to train a reinforcement learning policy
Yan et al. Hierarchical reinforcement learning for handling sparse rewards in multi-goal navigation
CN119692470A (en) Environmental spatial relationship reasoning method, medium and device based on multi-agent debate
CN112069445A (en) A 2D SLAM Algorithm Evaluation and Quantification Method
Wu et al. Adaptive Cross-Modal Experts Network with Uncertainty-Driven Fusion for Vision–Language Navigation
CN119691992B (en) A robot evaluation method based on minimal scenarios
Li et al. Segm: A novel semantic evidential grid map by fusing multiple sensors
CN119962292B (en) A Fast Stress Field Calculation Method Based on Graph Network Agent Model
US20240142960A1 (en) Automated simulation method based on database in semiconductor design process, automated simulation generation device and semiconductor design automation system performing the same, and manufacturing method of semiconductor device using the same
CN120125796B (en) A method, device, storage medium and electronic device for automatically adjusting parameters of an adaptive optical system based on reinforcement learning
US20250371223A1 (en) Simulating physical environments with discontinuous dynamics using graph neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant