CN118999570A - Robot path planning method, device and medium based on quantum search and reinforcement learning - Google Patents
Robot path planning method, device and medium based on quantum search and reinforcement learning
- Publication number
- CN118999570A CN202411105626.9A
- Authority
- CN
- China
- Prior art keywords
- robot
- quantum
- search
- model
- reinforcement learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01C—MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
- G01C21/00—Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
- G01C21/20—Instruments for performing navigational calculations
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N10/00—Quantum computing, i.e. information processing based on quantum-mechanical phenomena
- G06N10/60—Quantum algorithms, e.g. based on quantum optimisation, quantum Fourier or Hadamard transforms
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Remote Sensing (AREA)
- Radar, Positioning & Navigation (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Medical Informatics (AREA)
- Condensed Matter Physics & Semiconductors (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Computational Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Automation & Control Theory (AREA)
- Feedback Control In General (AREA)
- Manipulator (AREA)
Abstract
The invention discloses a robot path planning method, device and medium based on quantum search and reinforcement learning. The method comprises: quantizing the real environment and the robot state by dynamically mapping them to establish a quantum domain; searching the quantum domain using the quantum superposition principle based on a preset quantum search model to obtain a plurality of search samples; determining one or more target search samples from the plurality of search samples according to the probability that the robot reaches the target position; and predetermining a basic reinforcement learning model, training it with the quantum search model and the target search samples, and taking the trained reinforcement learning model as the robot path planning model. The method can quickly obtain high-quality target search samples, effectively reduce the number of training rounds for robot path planning, and improve the quality and speed of robot path planning.
Description
Technical Field
The invention relates to the field of robot path planning, in particular to a method, a device and a medium for robot path planning based on quantum search and reinforcement learning.
Background
With the development of sensor, battery, and artificial intelligence (AI) technologies, robots have taken on important tasks in various unstructured environments, such as rescue at disaster sites, search operations within building clusters, and package management in indoor cargo areas. However, while robots can function in these complex and diverse environments, their adaptation speed is still limited, mainly due to limited on-board computing power and dynamic disturbances from the environment, such as dynamic obstacles and variations in battery power. These factors lead to inaccurate robot planning and delayed responses in emergency situations.
Reinforcement learning (Reinforcement Learning, RL) is a machine learning method that allows an agent (e.g., a robot) to learn how to take actions that maximize a cumulative reward by interacting with the environment. Reinforcement learning offers several key advantages for robot adaptation to dynamic environments. Adaptivity: reinforcement learning algorithms allow robots to learn and adapt in unknown or changing environments; through trial and error, the robot gradually understands the patterns in the environment and learns to cope effectively with changes in those patterns. Exploration-exploitation balance: reinforcement learning algorithms are typically designed with a mechanism that balances exploration and exploitation, meaning the robot not only exploits the best known strategy but also explores unknown possibilities to find better strategies. However, although introducing reinforcement learning and environmental feedback to enhance the robot's exploration ability can increase its adaptation speed, such learning requires a large amount of interaction with the environment; it is therefore time-consuming and costly, and it also carries certain risks, such as collisions between the robot and obstacles.
The above disclosure of background art is only intended to aid understanding of the inventive concept and technical solution of the present application; it does not necessarily belong to the prior art of the present application, nor does it necessarily provide technical teaching. The above background should not be used to assess the novelty and inventiveness of the present application in the absence of clear evidence that the above content was disclosed before the filing date of the present application.
Disclosure of Invention
The invention aims to provide a robot path planning method, device and medium based on quantum search and reinforcement learning. Unlike traditional quantum reinforcement learning methods, which replace the neural network model with a quantum network model, the method is based on real-environment quantization and a quantum superposition mechanism and quickly obtains high-quality target search samples for path planning strategy training. This improves the accuracy of robot path planning and, without changing the underlying reinforcement learning algorithm framework, reduces the number of interactions between reinforcement learning and the environment, effectively reduces the number of training rounds, and improves the efficiency and speed of robot path planning.
In order to achieve the above purpose, the invention adopts the following technical scheme:
a robot path planning method based on quantum search and reinforcement learning comprises the following steps:
quantizing the real environment and the robot state, including dynamically mapping the real-world environment and the robot state to establish a quantum domain;
Searching in the quantum domain by utilizing a quantum superposition principle based on a preset quantum search model and obtaining a plurality of search samples, wherein each search sample corresponds to the probability of the robot reaching the target position;
determining one or more target search samples from a plurality of search samples according to the probability that the robot reaches the target position;
And a basic reinforcement learning model is predetermined, the basic reinforcement learning model is trained by utilizing the quantum search model and the target search sample, and the trained reinforcement learning model is determined to be a robot path planning model.
Further, any one or a combination of the foregoing aspects, in a process of quantizing a robot state, encoding the robot state, and representing the robot state as a quantum superposition state;
Based on a preset quantum search model, carrying out parallel search in the quantum domain by utilizing the quantum superposition state and obtaining a plurality of search samples, wherein the parallel search is used for controlling a robot to explore a plurality of positions in the quantum domain at the same time.
Further, according to any one or a combination of the above-mentioned technical solutions, the real-world environment and the robot state are dynamically mapped to create a quantum domain, wherein the quantum domain includes a quantized simulation environment, quantum state parameters and quantum state initialization information;
The method for establishing the quantized simulation environment comprises the following steps: based on the accessibility of the robot in the real environment, each position in the environment is provided with a unique binary code by the environment code, and the binary code corresponds to an orthogonal base of the quantum code to obtain the quantized simulation environment;
The quantum state parameters are obtained by mapping elements in a real-world reinforcement learning framework into the quantized simulation environment, wherein the elements in the real-world reinforcement learning framework comprise the current-time state s_t, the current-time action a_t, the next-time state s_{t+1} and the reward r;
The quantum state initialization information is obtained by mapping an initial position and a target position of the robot in the real environment and obstacle distribution information in the real environment into the quantized simulation environment.
Further, any one or a combination of the foregoing, the accessibility of the robot in the real environment is determined by the adjacency matrix and expressed as the following formula:
A = A_surround ⊙ A_obs ⊙ A_robot
Wherein A represents the accessibility matrix of the robot in the real environment, A_surround represents the accessibility matrix of the robot under the influence of the characteristics of the surrounding environment, A_obs represents the accessibility matrix of the robot under the influence of the obstacle distribution, A_robot represents the accessibility matrix of the robot under the influence of the robot settings, and ⊙ represents the element-wise product operator.
Further, any one or a combination of the above technical solutions may further include setting the probability value corresponding to the robot's position at the initial moment to be greater than the probability values corresponding to other positions; and/or,
And setting an exit node, so that when the robot reaches a target position, the robot is transferred to the exit node with 100% probability at the next moment, and when the robot is positioned at the exit node, the robot is not transferred to the target position at the next moment.
Further, any one or a combination of the foregoing technical solutions, based on an open system main equation, the quantum search model is constructed by using a quantum random walk model and a classical random walk model, and the quantum search model is represented by the following formula:
dρ/dt = (1-p)·L_QW(ρ) + p·L_CRW(ρ) + L_target(ρ)
Wherein ρ represents the probability distribution of the robot state, 1-p represents the weighting coefficient of the quantum random walk model, p represents the weighting coefficient of the classical random walk model, L_QW(ρ) = -i[A′, ρ] represents the quantum random walk model, where A′ represents the reachability matrix of the robot in the quantized simulation environment; L_CRW(ρ) = Σ_ij ( L_ij ρ L_ij† - ½{L_ij† L_ij, ρ} ) represents the classical random walk model, where L_ij = (A′_ij/d_j)|i⟩⟨j|, d_j represents the degree of position j, and |i⟩ represents the i-th orthogonal basis of the Hilbert space; L_target(ρ) = 2|n+1⟩⟨n| ρ |n⟩⟨n+1| - {|n⟩⟨n|, ρ} represents that when the robot state is at the target position (n), it will irreversibly transition to the exit node (n+1) at the next moment.
Further, in the quantized simulation environment, the probability that the robot is located in each grid is represented by the diagonal elements ρ_ii of the probability distribution ρ (i.e., the elements ρ_ij with i = j);
After a parallel search of duration T, the probability p(T) that the robot reaches the target position in the quantized simulation environment is expressed as p(T) = Σ_n ρ_nn(T), where n ranges over all target locations in the environment.
Further, any one or a combination of the foregoing, determining one or more target search samples from a plurality of the search samples by comparing probabilities of arrival of the robot at the target position, includes:
Determining a search sample corresponding to the highest probability of the robot reaching the target position as the target search sample; or alternatively
Determining a search sample corresponding to each probability that the robot reaches the target position is greater than a preset threshold as the target search sample; or alternatively
Or sorting the probabilities of the robot reaching the target position from high to low, and determining the search samples corresponding to the top k probabilities to be the target search samples, wherein k ≥ 2 and k is a positive integer.
Further, in any one or a combination of the foregoing aspects, the structure of the target search sample is (s_t_b, a_t_b, s_{t+1}_b, r_b), where s_t_b is the current-time state in the quantized simulation environment corresponding to the target search sample, a_t_b is the current action of the robot in the quantized simulation environment corresponding to the target search sample, s_{t+1}_b is the next-time state in the quantized simulation environment corresponding to the target search sample, including the robot state and the environment state, and r_b is the reward returned by the real environment corresponding to the target search sample.
Further, according to any one or a combination of the foregoing aspects, training the basic reinforcement learning model using the quantum search model and the target search sample includes the steps of:
Packaging the quantum search model and the target search sample into a quantum meditation function module, wherein the input of the quantum meditation function module is the current state and its output is the optimal robot action;
And adding the quantum meditation function module as a robot action selection module into a basic reinforcement learning model, and training the basic reinforcement learning model by taking a target search sample as a training set.
Further, any one or a combination of the foregoing, the optimal robot action is determined by:
Based on the quantum search model, taking the current state as input, the robot traverses all possible actions in the quantum domain to obtain the probability of reaching the target position when selecting each different action, and the robot action with the maximum probability of reaching the target position at the current moment is determined to be the optimal robot action by comparing the probabilities of reaching the target position corresponding to the respective robot actions.
Further, any one or a combination of the foregoing aspects, a planned path of the robot from the initial position to the target position is obtained using the robot path planning model.
According to another aspect of the present invention, there is provided a robot path planning apparatus in which a robot path planning model is provided; wherein the robot path planning model is obtained by:
quantizing the real environment and the robot state, including dynamically mapping the real-world environment and the robot state to establish a quantum domain;
Searching in the quantum domain by utilizing a quantum superposition principle based on a preset quantum search model and obtaining a plurality of search samples, wherein each search sample corresponds to the probability of the robot reaching the target position;
determining one or more target search samples from a plurality of search samples according to the probability that the robot reaches the target position;
And a basic reinforcement learning model is predetermined, the basic reinforcement learning model is trained by utilizing the quantum search model and the target search sample, and the trained reinforcement learning model is determined to be a robot path planning model.
Further, in any one or a combination of the foregoing technical solutions, the robot path planning model is obtained by the robot path planning method based on quantum search and reinforcement learning according to any one or a combination of the foregoing technical solutions.
According to another aspect of the present invention there is provided a computer readable storage medium storing program instructions configured to be invoked by a processor to perform the steps of a method as set out in any one or a combination of the above.
The technical scheme provided by the invention has the following beneficial effects:
a. The invention is based on real-environment quantization and a quantum superposition mechanism and can quickly obtain high-quality target search samples for path planning strategy training. This not only improves the accuracy of robot path planning, but also reduces the number of interactions between reinforcement learning and the environment without changing the underlying reinforcement learning algorithm architecture, effectively reducing the number of training rounds and improving the efficiency and speed of robot path planning;
b. In the process of quantizing the robot state, the invention represents the robot state as a quantum superposition state, so that a parallel search can be realized and the robot can explore a plurality of positions at the same time, which effectively improves the efficiency of acquiring each search sample and increases the speed of robot path planning;
c. By combining quantum search with parallel search, the method can easily obtain useful, high-quality target search samples, that is, samples whose reward is non-zero or that provide more target position information;
d. The method determines the reachability of the robot in the real environment based on the characteristics of the surrounding environment, the obstacle distribution and the robot settings, and maps it to the quantum domain to construct a quantized simulation environment, which can improve the accuracy of robot path planning;
e. The quantum search model is constructed through a weighted combination of a quantum random walk model and a classical random walk model, which ensures fast searching while improving the anti-interference capability of the search process.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings may be obtained according to the drawings without inventive effort to those skilled in the art.
FIG. 1 is a flow chart of a method for robotic path planning based on quantum search and reinforcement learning provided by an exemplary embodiment of the present invention;
FIG. 2 is a schematic diagram of factors affecting real environment quantization provided by an exemplary embodiment of the present invention;
FIG. 3 is a flow chart of a quantum search process provided by an exemplary embodiment of the present invention;
FIG. 4 is a framework diagram of a quantum search and reinforcement learning fusion provided by an exemplary embodiment of the present invention;
FIG. 5 is a schematic comparison of the number of reinforcement learning training rounds required by the quantum-search model and by other models, provided by an exemplary embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, apparatus, article, or device that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or device.
Based on the defects of the prior art, the invention provides a robot path planning method, device and medium based on quantum search and reinforcement learning. Based on a quantum superposition mechanism, the method and device quickly obtain high-quality target search samples for path planning strategy training, which can improve the accuracy of robot path planning, effectively reduce the number of training rounds, and increase the speed of robot path planning.
In one embodiment of the present invention, a method for planning a robot path based on quantum search and reinforcement learning is provided, see fig. 1, the method comprising the steps of:
quantizing the real environment and the robot state, including dynamically mapping the real-world environment and the robot state to establish a quantum domain;
Searching in the quantum domain by utilizing a quantum superposition principle based on a preset quantum search model and obtaining a plurality of search samples, wherein each search sample corresponds to the probability of the robot reaching the target position;
determining one or more target search samples from a plurality of search samples according to the probability that the robot reaches the target position;
And a basic reinforcement learning model is predetermined, the basic reinforcement learning model is trained by utilizing the quantum search model and the target search sample, and the trained reinforcement learning model is determined to be a robot path planning model.
In one embodiment of the invention, in the process of quantizing robot states, the robot states are encoded, and the robot states are represented as quantum superposition states configured to represent probabilities that the robot is at various positions in the quantum domain at the same time. Based on a preset quantum search model, carrying out parallel search in the quantum domain by utilizing the quantum superposition state and obtaining a plurality of search samples, wherein the parallel search is used for controlling a robot to explore a plurality of positions in the quantum domain at the same time.
Embodiments of the present invention are further described below with reference to fig. 1-5.
The real-world environment and the robot state are quantized, which includes dynamically mapping them to create a quantum domain. The quantum domain comprises a quantized simulation environment, quantum state parameters and quantum state initialization information.
S1, real environment quantization, namely dynamically mapping the real world environment to the quantum field, and establishing a quantized simulation environment.
S1.1, in one embodiment of the invention, an adjacency matrix is used to represent the reachability of the robot in the real environment, and the real-world environment dynamics are described in the quantum domain based on this reachability matrix for use by the quantum search model built later. Of course, the method of implementing real-environment quantization is not limited to the adjacency matrix; any method that achieves the same technical effect as the adjacency matrix method, such as a list, tree or graph describing the relationship between adjacent robot states, can also be used for real-environment quantization.
The method of using the adjacency matrix to represent the accessibility of the robot in a real environment is a preferred embodiment provided by the present invention. In this embodiment, the accessibility of the robot in the real environment is determined by encoding the environment, and a quantized simulation environment is established according to the accessibility of the robot in the real environment, in which each position has a unique binary code corresponding to the orthogonal basis of the quantum code.
S1.1.1, for example, a 400m×400m real environment is divided into a 13×13 two-dimensional grid. Of course, in other embodiments, different grid sizes and dimensions may be chosen according to the size of the real environment, the requirements, and the desired granularity.
Then, for the path planning task, as shown in fig. 2, the following three aspects affect the accessibility of the robot in a real environment:
(1) Characteristics of the surrounding environment: for example, to ensure robot safety, in unstructured and uncertain environments the robot needs to be kept a first distance (e.g. 1.0m or more) from the obstacle, whereas in structured environments (obstacles are simple geometric structures such as rectangular, triangular, circular etc.) the robot needs to be kept a second distance (e.g. 0.5m or more) from the obstacle, said second distance being smaller than said first distance; it should be noted that, 1.0m and 0.5m in this paragraph are only exemplary, and not limiting the scope of the present invention, in other embodiments or specific applications, the first distance may be determined to be 0.5m, 2m, 3m, etc. according to actual needs, and the second distance may be determined to be 0.1m, 0.2m, 1m, etc.;
(2) Barrier distribution: such as the layout of buildings, trees, stones, etc., can affect the accessibility of the robot;
(3) Robot setting: the capabilities, fuel level and condition of the robot can affect its reachability. For example, a small robot cannot traverse a muddy road, and a robot with little remaining fuel cannot reach a distant location. Of course, in other embodiments, the reachability of the robot may be determined based on other empirical knowledge.
Based on the above analysis, the present invention proposes the following formula for determining the reachability of a robot in a real environment:
A = A_surround ⊙ A_obs ⊙ A_robot
Wherein A represents the reachability matrix of the robot in the real environment, A_surround represents the reachability matrix of the robot under the influence of the characteristics of the surrounding environment, A_obs represents the reachability matrix of the robot under the influence of the obstacle distribution, A_robot represents the reachability matrix of the robot under the influence of the robot settings, and ⊙ represents the element-wise product operator. In the reachability matrix, if the robot can reach the environment position corresponding to a matrix element, the value of that element is set to 1; otherwise it is set to 0.
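As an illustration only, the reachability matrix of the above formula might be assembled as follows (a minimal NumPy sketch; the 13×13 grid, the specific obstacle cells and the matrix contents are assumptions used for demonstration, not values fixed by the invention):

```python
import numpy as np

# Assumed 13x13 grid; each matrix marks a cell with 1 if the robot may occupy it
# under that single criterion, and 0 otherwise.
grid = (13, 13)
A_surround = np.ones(grid, dtype=int)   # safety-margin criterion (distance to obstacles)
A_obs = np.ones(grid, dtype=int)        # obstacle-layout criterion (buildings, trees, stones)
A_robot = np.ones(grid, dtype=int)      # robot-setting criterion (capability, fuel, condition)

# Hypothetical constraints, for illustration only
A_obs[5, 3:7] = 0                       # a wall of obstacle cells
A_surround[4, 3:7] = 0                  # cells violating the required clearance
A_robot[12, 12] = 0                     # a cell the robot cannot reach with its remaining fuel

# Element-wise product: a cell is reachable only if all three criteria allow it
A = A_surround * A_obs * A_robot
print("reachable cells:", int(A.sum()), "of", A.size)
```

The resulting binary matrix A is what S1.1.2 then converts into the maze-like simulation environment.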
S1.1.2, a simulation environment is then built based on the reachability matrix A using the Python programming language. As shown in the upper right of FIG. 3, the simulation environment has a maze structure, where a reachability matrix element value of 1 corresponds to a white (passable) position in the simulation environment and a value of 0 corresponds to a black (blocked) position. The cyan position in the simulated environment represents the initial robot position and the red position represents the target position. Of course, in other embodiments, other software and programming languages may be used to build the simulation environment.
S1.2, the state of the robot is quantized, the state of the robot in the real environment is mapped into the quantized simulation environment, and the state of the robot in the quantum domain is expressed as a quantum superposition state by encoding the state of the robot, so that the robot can explore a plurality of positions in the quantized simulation environment at the same time.
S1.2.1, in one embodiment of the present invention, in order to use the efficient parallel search principle of the quantum superposition mechanism and to facilitate later fusion with the reinforcement learning framework, elements of the real-world reinforcement learning framework (including the current-time state s_t, the current-time action a_t, the next-time state s_{t+1} and the reward r) are mapped into the quantum domain to obtain the quantum state parameters. The quantum state parameters include the current-time state |s_t⟩, the current-time action |a_t⟩, the next-time state |s_{t+1}⟩ and the reward |r⟩ in the quantized simulation environment, wherein |s_t⟩ is the current-time state in the quantized simulation environment, including the robot state and the environment state, |a_t⟩ is the current action of the robot in the quantized simulation environment, |s_{t+1}⟩ is the next-time state in the quantized simulation environment, including the robot state and the environment state, and |r⟩ is the reward returned in the quantum domain in superposition.
In the quantum domain, in order to accelerate computation using quantum superposition, the quantum-dynamics-based quantum search model takes the superposed state |s_t⟩ as input; the robot can select a plurality of actions |a_t⟩ in superposition, obtain the superposed next-time state |s_{t+1}⟩, and obtain the corresponding superposed reward |r⟩. In the quantum domain, the robot may be at multiple positions at each moment, so a discrete probability distribution ρ is used to represent the superposed state |s_t⟩. The number of possible values in the discrete probability distribution ρ is equal to the number of simulation environment grids, which is 13×13 = 169 in one embodiment of the invention.
S1.2.2 each grid in the quantized simulation environment has a unique binary code, corresponding to the orthogonal basis of the quantum code, and the required number of quantum bits is obtained by the following equation:
N_bit = ⌈log₂(N_cell)⌉, where N_bit represents the number of qubits and N_cell represents the number of grids; the number of qubits required in one embodiment of the present invention is 8 (since ⌈log₂ 169⌉ = 8), although in other embodiments a suitable number of qubits may be selected according to the actual situation. In the quantum domain, the robot action |a_t⟩ in the quantized simulation environment is another probability distribution over all possible actions of the robot; a discrete uniform distribution is used to represent the robot action |a_t⟩ in one embodiment of the invention, although other probability distributions may be used in other embodiments.
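A brief sketch of the qubit-count calculation and the per-cell binary encoding described above (the helper name and the printed output are illustrative only):

```python
import math

def num_qubits(n_cells: int) -> int:
    # N_bit = ceil(log2(N_cell)): the smallest number of qubits whose basis
    # states can give every grid cell a unique binary code
    return math.ceil(math.log2(n_cells))

n_cells = 13 * 13                               # 169 grid cells in the example environment
n_bit = num_qubits(n_cells)                     # -> 8
# Each cell index maps to one computational-basis state (orthogonal basis of the encoding)
codes = {cell: format(cell, f"0{n_bit}b") for cell in range(n_cells)}
print(n_bit, codes[0], codes[168])              # 8 00000000 10101000
```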
S1.3, according to the initial position and the target position of the robot in the real environment and the obstacle distribution information in the real environment, quantum state initialization information is determined, wherein the quantum state initialization information comprises the initial position, the target position and the initial position of the obstacle of the robot in the quantized simulation environment.
And each grid position of the quantized simulation environment is provided with a binary code, the binary codes are in one-to-one correspondence with the quantum code orthogonal bases, and when the robot searches in parallel, each moment can be positioned at a plurality of grid positions, so that the probability distribution on the quantum code orthogonal bases can be used for describing the robot search in parallel. Thus, by quantum state initialization, the initial positions of the robot, the target point, and the obstacle can be set.
As shown in the lower right of FIG. 3, the probability value corresponding to the robot's initial position is set higher than the probabilities of the other positions. An Exit Node is added at the position corresponding to the target point; the exit node is usually a virtual node, and when the robot reaches the target position during the parallel search, it is transferred to the exit node with 100% probability at the next moment; if the robot is at the exit node, it will not be transferred back to the target position at the next moment and will remain at the exit node. Therefore, after a period of searching, the probability that the robot is at the exit node represents the total probability that the robot can reach the target point within the search period, and this total probability is used to judge the quality of the searched sample. Thus, after the quantum random walk parallel search, the probability value of the exit node is taken as the future cumulative reward |r⟩ obtained after the robot takes an action from the initial position.
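A possible sketch of the quantum-state initialization just described, assuming the 169-cell grid plus one appended virtual exit node (the specific start and target indices and the size of the initial-probability boost are assumptions for illustration):

```python
import numpy as np

n_cells = 13 * 13            # grid positions
exit_node = n_cells          # extra virtual node appended after the grid
dim = n_cells + 1

start_cell = 14              # hypothetical initial robot position
target_cell = 154            # hypothetical target position (linked to the exit node)

# Diagonal of the distribution rho: most of the probability mass on the start
# cell, a small uniform remainder elsewhere, none on the exit node yet
rho_diag = np.full(dim, 0.1 / (n_cells - 1))
rho_diag[exit_node] = 0.0
rho_diag[start_cell] = 0.9
rho_diag /= rho_diag.sum()   # renormalize so the probabilities sum to 1
```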
In the present invention, sample quality refers to how much information a sample provides about the target point or its location. In conventional reinforcement learning, a sample obtained by searching the environment is likely to be useless (a reward of 0 or no positional information about the target point), but the present invention can easily obtain useful samples (i.e., samples whose reward is non-zero or that provide more target position information) through quantum search combined with parallel search. Training reinforcement learning with the high-quality samples thus obtained improves the efficiency of path planning training and the accuracy and speed of robot path planning.
S2, a parallel search is performed in the quantized simulation environment using the quantum superposition principle based on the preset quantum search model to obtain a plurality of search samples. During the parallel search, each search process simultaneously searches a different path, taking the initial position of the robot as the starting point and the target position as the end point; when a search process reaches the exit node, it ends, yielding the search sample corresponding to that process.
One or more target search samples are determined from a plurality of the search samples by comparing probabilities of arrival of the robot at the target position, and the target search sample in the present invention is a high-quality search sample. There are various ways to determine high quality target search samples.
The first way to determine the target search sample is to take the search sample corresponding to the highest probability of the robot reaching the target position as the target search sample. Another way is to take each search sample whose probability of the robot reaching the target position is greater than a preset threshold as a target search sample. In addition, the target search samples may be determined by sorting the probabilities of the robot reaching the target position from high to low and taking the search samples corresponding to the top k probabilities (for example, the top three) as the target search samples, wherein k ≥ 2 and k is a positive integer.
The first way of determining the target search sample is optimal, which not only ensures that an effective target search sample can be obtained, but also obtains the target search sample with the highest quality.
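The three selection rules above might be sketched as follows (a pure-Python illustration; the sample tuple layout follows the (s_t, a_t, s_{t+1}, r) structure used in this description, and the threshold and k values are assumptions):

```python
def select_target_samples(samples, probs, mode="best", threshold=0.5, k=3):
    """samples: list of (s_t, a_t, s_t1, r) tuples; probs: probability of reaching
    the target position for each sample. Returns the target search samples."""
    paired = list(zip(samples, probs))
    if mode == "best":                    # single sample with the highest probability
        return [max(paired, key=lambda x: x[1])[0]]
    if mode == "threshold":               # every sample above a preset threshold
        return [s for s, p in paired if p > threshold]
    if mode == "topk":                    # top-k samples by probability, k >= 2
        paired.sort(key=lambda x: x[1], reverse=True)
        return [s for s, _ in paired[:k]]
    raise ValueError(f"unknown mode: {mode}")
```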
In this embodiment, the open-system master equation (Lindblad master equation) may be used to describe the environmental dynamics of the robot in the quantized simulation environment and to construct a quantum search model for fast searching of high-quality samples. Of course, the method of constructing the quantum search model is not limited to the Lindblad master equation; any method that achieves the same technical effect can be used, such as Redfield relaxation and Floquet theory, or numerical techniques such as the Suzuki-Trotter expansion and sparse solvers.
In a preferred embodiment of the present invention, step S2 comprises the following steps.
S2.1, a quantum search model is constructed from a quantum random walk model and a classical random walk model based on the open-system master equation. The classical random walk model increases the anti-interference capability of the quantum search model. During the search of the quantized simulation environment, the evolution of the discrete probability distribution ρ over time is governed by the following open-system (Lindblad) master equation, which is a weighted combination of the quantum random walk model and the classical random walk model, ensuring fast searching while improving the anti-interference capability of the quantum search model:
dρ/dt = (1-p)·L_QW(ρ) + p·L_CRW(ρ) + L_target(ρ)
Wherein ρ represents the discrete probability distribution of the robot state, 1-p represents the weighting coefficient of the quantum random walk model, p represents the weighting coefficient of the classical random walk model, L_QW(ρ) = -i[A′, ρ] represents the quantum random walk model, where A′ represents the reachability matrix of the robot in the quantized simulation environment, obtained by converting the reachability matrix A of the robot in the real environment into the quantum domain; L_CRW(ρ) = Σ_ij ( L_ij ρ L_ij† - ½{L_ij† L_ij, ρ} ) represents the classical random walk model, where L_ij = (A′_ij/d_j)|i⟩⟨j|, A′_ij represents the value of the (i, j) element of the reachability matrix A′, indicating whether the robot can move between the positions corresponding to i and j, d_j represents the degree of position j, and |i⟩ represents the i-th orthogonal basis of the Hilbert space; L_target(ρ) = 2|n+1⟩⟨n| ρ |n⟩⟨n+1| - {|n⟩⟨n|, ρ} represents that when the robot state is at the target position (n), it will irreversibly transition to the exit node (n+1) at the next moment.
In the quantized simulation environment, the probability that the robot is located in each grid is represented by the diagonal elements ρ_ii of the probability distribution ρ (i.e., the elements ρ_ij with i = j). After a parallel search of duration T, the probability p(T) that the robot reaches the target position in the quantized simulation environment is expressed as p(T) = Σ_n ρ_nn(T), where n ranges over all target locations in the environment.
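As a rough numerical illustration of the master-equation search in S2.1, the following sketch performs one Euler integration step of the equation and reads off p(T) from the diagonal of ρ (the dense NumPy representation, the Euler integrator and the time step are simplifying assumptions; the description itself notes that sparse solvers or Suzuki-Trotter expansions can be used instead):

```python
import numpy as np

def qsw_step(rho, A_prime, p, targets, exit_node, dt=0.01):
    """One Euler step of drho/dt = (1-p)*L_QW(rho) + p*L_CRW(rho) + L_target(rho)."""
    dim = A_prime.shape[0]
    # Quantum random walk term: -i[A', rho]
    L_qw = -1j * (A_prime @ rho - rho @ A_prime)
    # Classical random walk term built from jump operators L_ij = (A'_ij / d_j)|i><j|
    L_crw = np.zeros_like(rho)
    deg = A_prime.sum(axis=0)
    for i in range(dim):
        for j in range(dim):
            if A_prime[i, j] != 0 and deg[j] > 0:
                L_ij = np.zeros((dim, dim), dtype=complex)
                L_ij[i, j] = A_prime[i, j] / deg[j]
                LdL = L_ij.conj().T @ L_ij
                L_crw += L_ij @ rho @ L_ij.conj().T - 0.5 * (LdL @ rho + rho @ LdL)
    # Target term: irreversible transfer from each target position n to the exit node
    L_tgt = np.zeros_like(rho)
    for n in targets:
        ket_n = np.zeros((dim, 1)); ket_n[n, 0] = 1.0
        ket_e = np.zeros((dim, 1)); ket_e[exit_node, 0] = 1.0
        P_n = ket_n @ ket_n.T
        L_tgt += 2 * (ket_e @ ket_n.T) @ rho @ (ket_n @ ket_e.T) - (P_n @ rho + rho @ P_n)
    return rho + dt * ((1 - p) * L_qw + p * L_crw + L_tgt)

def prob_reach_target(rho, targets):
    # p(T): sum of the diagonal elements rho_nn over all target positions n
    return float(np.real(sum(rho[n, n] for n in targets)))
```

For the 13×13 example grid this double loop over jump operators is slow but workable; a practical implementation would precompute the operators once or use a sparse solver.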
S2.2, after the robot selects different actions, the probability that the robot will reach the target position in the future corresponding to each search sample is compared, and the search sample with the maximum probability value is determined to be the target search sample. A search sample in the invention has the same structure as a sample in traditional reinforcement learning and consists of the current-time state s_t, the current-time action a_t, the next-time state s_{t+1} and the reward r. At the current state s_t, the robot traverses all possible actions and, according to the quantum search model, obtains the probability of reaching the target position when selecting each different action a; the optimal robot action at the current moment is then selected as a_t according to the probability of reaching the target position corresponding to each robot action.
S3, a basic reinforcement learning model is determined in advance, the basic reinforcement learning model is trained by utilizing the quantum search model and the target search sample, and the trained reinforcement learning model is determined to be a robot path planning model. According to the invention, the high-quality target search sample is integrated into the reinforcement learning process, and the basic reinforcement learning model is trained, so that the learning efficiency and the path planning strategy quality can be improved.
In a preferred embodiment of the invention, step S3 comprises the following sub-steps.
S3.1, the quantum search model and the target search sample are packaged into a quantum meditation function module, wherein the input of the quantum meditation function module is the current state and its output is the optimal robot action.
As shown in FIG. 3, the input of the quantum meditation function module is the current state s_t and its output is the optimal robot action a_t. Given the input s_t, the robot traverses all possible actions in the quantum domain and, according to the quantum search model, obtains the probability of reaching the target position when selecting each different action a; the optimal robot action at the current moment is selected according to the probability of reaching the target position corresponding to each robot action, marked as a_t, and output. Preferably, the optimal robot action is the robot action corresponding to the maximum probability of reaching the target position.
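A sketch of the action-selection step performed by the quantum meditation function module, reusing the hypothetical qsw_step and prob_reach_target helpers from the sketch above (the callable action encoding, the fixed search horizon and the weighting coefficient are assumptions):

```python
import numpy as np

def quantum_meditation_action(s_t, actions, A_prime, targets, exit_node,
                              p=0.1, steps=200, dt=0.01):
    """Return the action whose quantum search yields the highest probability of
    reaching a target position when taken from the current state s_t."""
    best_action, best_prob = None, -1.0
    dim = A_prime.shape[0]
    for a in actions:
        next_cell = a(s_t)                          # a: callable mapping a cell to its successor
        if next_cell is None:                       # action blocked by a wall or obstacle
            continue
        rho = np.zeros((dim, dim), dtype=complex)
        rho[next_cell, next_cell] = 1.0             # start the parallel search after the action
        for _ in range(steps):
            rho = qsw_step(rho, A_prime, p, targets, exit_node, dt)
        # Mass already at a target plus mass that has drained into the exit node
        prob = prob_reach_target(rho, targets) + float(np.real(rho[exit_node, exit_node]))
        if prob > best_prob:
            best_action, best_prob = a, prob
    return best_action, best_prob
```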
And S3.2, adding the quantum meditation function module serving as a robot action selection module into a basic reinforcement learning model, and training the basic reinforcement learning model by taking a target search sample as a training set. The quantum meditation function is fused with the traditional reinforcement learning flow, so that the learning training efficiency and the path planning strategy quality can be improved.
As shown in FIG. 4, the quantum meditation function is fused into the conventional reinforcement learning process as the robot action selection module. During reinforcement learning training, the current state s_t is passed to the quantum meditation function module, and the optimal robot action a_t is obtained. Then, under the reinforcement learning framework, the robot performs the optimal action in the simulation environment and obtains the reward r and the next-time state s_{t+1}; the high-quality target search sample (s_t, a_t, s_{t+1}, r) is stored for training the path planning strategy.
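The fusion with the reinforcement learning loop of FIG. 4 could then be organized roughly as follows; only the sample-collection part of the loop is shown, and the env interface and the quantum_meditation_action helper above are assumptions, not an interface defined by the invention:

```python
def collect_samples(env, actions, A_prime, targets, exit_node, episodes=10):
    """Run episodes in which the quantum meditation function picks each action,
    and store the resulting (s_t, a_t, s_t1, r) samples for strategy training."""
    replay_buffer = []
    for _ in range(episodes):
        s_t = env.reset()
        done = False
        while not done:
            a_t, _ = quantum_meditation_action(s_t, actions, A_prime, targets, exit_node)
            s_t1, r, done = env.step(a_t)   # the (simulated) real environment returns the reward
            replay_buffer.append((s_t, a_t, s_t1, r))
            s_t = s_t1
    return replay_buffer
```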
It should be noted that, in the quantum domain, the reward is a probability value, and high-quality target search samples are obtained through this probability; the reward in the reinforcement learning process is the reward returned by the real environment and is not a probability value. The reward function in the reinforcement learning training provided by the invention considers not only the probability of the robot reaching the target position but also the length of the path and the time to reach the target point, as described in detail below.
In a preferred embodiment of the invention, the basic reinforcement learning model may be trained using the temporal-difference prediction error (TD prediction error). The temporal-difference prediction error and the reinforcement learning parameter update are shown in the following equations:
TD_error = r + γV(s_{t+1}) - V(s_t)
V(s_t) ← V(s_t) + α[r + γV(s_{t+1}) - V(s_t)]
Where r represents the reward returned by the real environment, γ represents the discount factor for future cumulative rewards, α represents the reinforcement learning rate, V(s_{t+1}) represents the state value predicted at time t+1 in the reinforcement learning process, V(s_t) represents the state value predicted at time t in the reinforcement learning process, and TD_error represents the difference between the reinforcement learning target value and the predicted value.
In one example of the present invention, the environmental reward r is set as follows: the robot receives a reward of +1 for reaching the target position and a reward of -0.01 otherwise. The reinforcement learning rate α is set to 0.01 in one embodiment of the invention.
The loss function of the reinforcement learning training described above may be minimized iteratively using the Adam optimization method.
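The temporal-difference update and the reward scheme above can be sketched as follows (a tabular value-function illustration of the update rule only; the discount factor value is an assumption, while the learning rate and reward values are taken from the text, and the actual embodiment uses a DQN-style model trained iteratively with Adam):

```python
GAMMA = 0.99      # discount factor for future cumulative reward (assumed value)
ALPHA = 0.01      # reinforcement learning rate, as given in the text

def env_reward(reached_target: bool) -> float:
    # +1 when the robot reaches the target position, -0.01 otherwise
    return 1.0 if reached_target else -0.01

def td_update(V, s_t, s_t1, r):
    """V: dict mapping states to predicted state values; returns the TD error."""
    td_error = r + GAMMA * V.get(s_t1, 0.0) - V.get(s_t, 0.0)
    V[s_t] = V.get(s_t, 0.0) + ALPHA * td_error
    return td_error
```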
Referring to FIG. 5, the abscissa represents different environment dimensions and the ordinate represents the number of training rounds required. Model one is a deep reinforcement learning reference model with a replay buffer (abbreviated DQN.w), model two is a deep reinforcement learning reference model with a replay buffer and Monte Carlo search (abbreviated MCT DQN.w), and the model proposed by the present invention is a deep reinforcement learning model with a replay buffer and quantum search (abbreviated Quantum DQN.w). As can be seen from FIG. 5, for different environment dimensions the number of training rounds required by the path planning method provided by the invention is smaller than the number required by model one and model two, which indicates that the method can quickly obtain high-quality target search samples for path planning strategy training, reduce the number of training rounds, and improve the accuracy and speed of the robot path planning method.
In one embodiment of the present invention, a planned path of the robot from the initial position to the target position is obtained using the robot path planning model determined in any of the above embodiments.
In one embodiment of the invention, a robot path planning device is provided, wherein a robot path planning model is arranged in the robot path planning device; wherein the robot path planning model is obtained by:
quantizing the real environment and the robot state, including dynamically mapping the real-world environment and the robot state to establish a quantum domain;
Searching in the quantum domain by utilizing a quantum superposition principle based on a preset quantum search model and obtaining a plurality of search samples, wherein each search sample corresponds to the probability of the robot reaching the target point position;
Determining one or more target search samples from a plurality of search samples according to the probability that the robot reaches the target point position;
And a basic reinforcement learning model is predetermined, the basic reinforcement learning model is trained by utilizing the quantum search model and the target search sample, and the trained reinforcement learning model is determined to be a robot path planning model.
In one embodiment of the invention, a robot path planning model training apparatus is provided that includes a quantization module, a quantum search module, and a reinforcement learning module. Wherein:
the quantization module is configured to quantize the real-world environment and the robot state, including dynamically mapping them to create a quantum domain;
The quantum search module is preset with a quantum search model, is configured to search in the quantum domain by utilizing a quantum superposition principle and obtain a plurality of search samples, each search sample corresponds to the probability of a robot reaching a target point position, and is further configured to determine one or more target search samples from the plurality of search samples according to the probability of the robot reaching the target point position;
The reinforcement learning module is preset with a basic reinforcement learning model, and is configured to train the basic reinforcement learning model by utilizing the quantum search model and the target search sample, and determine that the trained reinforcement learning model is a robot path planning model.
In one embodiment of the invention, a computer readable storage medium is provided for storing program instructions configured to be invoked by a processor to perform the steps of the method of any of the embodiments above.
The robot path planning apparatus, robot path planning model training apparatus, and computer-readable storage medium embodiments described above share the same inventive concept as the embodiments of the robot path planning method based on quantum search and reinforcement learning, and the entire contents of the method embodiments are incorporated into the apparatus and storage medium embodiments by reference.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing is merely illustrative of the embodiments of this application and it will be appreciated by those skilled in the art that variations and modifications may be made without departing from the principles of the application, and it is intended to cover all modifications and variations as fall within the scope of the application.
Claims (15)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411105626.9A CN118999570B (en) | 2024-08-12 | 2024-08-12 | Robot path planning method, device and medium based on quantum search and reinforcement learning |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411105626.9A CN118999570B (en) | 2024-08-12 | 2024-08-12 | Robot path planning method, device and medium based on quantum search and reinforcement learning |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN118999570A true CN118999570A (en) | 2024-11-22 |
| CN118999570B CN118999570B (en) | 2025-04-22 |
Family
ID=93477576
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202411105626.9A Active CN118999570B (en) | 2024-08-12 | 2024-08-12 | Robot path planning method, device and medium based on quantum search and reinforcement learning |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN118999570B (en) |
Patent Citations (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN1672171A (en) * | 2002-07-31 | 2005-09-21 | 雅马哈发动机株式会社 | Intelligent electromechanical control suspension system based on quantum soft computing |
| US20210365606A1 (en) * | 2019-12-09 | 2021-11-25 | Nanjing Normal University | Quantum computing method for expressway traffic flow distribution simulation considering destination selection |
| CN111260065A (en) * | 2020-01-09 | 2020-06-09 | 湖北师范大学 | A two-dimensional quantum convolution calculation method |
| CN112651508A (en) * | 2020-01-10 | 2021-04-13 | 腾讯科技(深圳)有限公司 | Prediction method, device, equipment and storage medium of adiabatic evolution path |
| CN111896006A (en) * | 2020-08-11 | 2020-11-06 | 燕山大学 | A path planning method and system based on reinforcement learning and heuristic search |
| CN115907016A (en) * | 2021-09-30 | 2023-04-04 | 合肥本源量子计算科技有限责任公司 | A method and related device for searching target range value based on quantum computing |
| CN115409187A (en) * | 2022-08-03 | 2022-11-29 | 中国人民解放军战略支援部队信息工程大学 | Quantum machine learning simulation method and system based on OpenMP parallel model |
| CN118426455A (en) * | 2023-06-12 | 2024-08-02 | 中科视拓(北京)科技有限公司 | Underwater path planning method based on quantum behavior particle swarm algorithm |
| CN117313887A (en) * | 2023-10-27 | 2023-12-29 | 成都信息工程大学 | Quantum measurement learning method based on fuzzy learning |
| CN118014123A (en) * | 2024-01-12 | 2024-05-10 | 深圳新奥能源科技有限公司 | Energy consumption data early warning method, system, equipment and medium based on smart home |
Non-Patent Citations (1)
| Title |
|---|
| CHAO HUANG et al.: "Quantum Exploration-based Reinforcement Learning for Efficient Robot Path Planning in Sparse-Reward Environment", 2024 33rd IEEE International Conference on Robot and Human Interactive Communication, 30 August 2024 (2024-08-30) * |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN120279557A (en) * | 2025-03-19 | 2025-07-08 | 北京科曼达电子技术有限公司 | Mobile terminal information extraction system and method based on OCR and local large language model |
Also Published As
| Publication number | Publication date |
|---|---|
| CN118999570B (en) | 2025-04-22 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Mousavi et al. | Traffic light control using deep policy‐gradient and value‐function‐based reinforcement learning | |
| CN109540150B (en) | A multi-robot path planning method applied to hazardous chemicals environment | |
| Husbands et al. | Controlling Autonomous Robots | |
| Zhao et al. | ASPW-DRL: assembly sequence planning for workpieces via a deep reinforcement learning approach | |
| CN111664852A (en) | Unmanned aerial vehicle path planning method and device | |
| Nearchou | Adaptive navigation of autonomous vehicles using evolutionary algorithms | |
| CN117371895A (en) | Multi-ground unmanned vehicle path planning method, system and medium in unknown environment | |
| CN118999570B (en) | Robot path planning method, device and medium based on quantum search and reinforcement learning | |
| Wang et al. | Research on dynamic path planning of wheeled robot based on deep reinforcement learning on the slope ground | |
| Zhao et al. | An improved ant colony algorithm based on Q-Learning for route planning of autonomous vehicle | |
| Handley | The genetic planner: The automatic generation of plans for a mobile robot via genetic programming | |
| Paul et al. | Deep reinforcement learning based cooperative control of traffic signal for multi‐intersection network in intelligent transportation system using edge computing | |
| Sinkar et al. | Multi-agent path finding using dynamic distributed deep learning model | |
| Jin et al. | WOA-AGA algorithm design for robot path planning | |
| CN117688823B (en) | Rock-soil particle track prediction method, electronic equipment and medium | |
| Liu | Shortest path selection algorithm for cold chain logistics transportation based on improved artificial bee colony | |
| Cao et al. | Research on Optimization Model of Supply Chain Robot Task Allocation Based on Genetic Algorithm and Software Practice | |
| Fihri et al. | Particle Swarm Algorithm Setting using Deep Reinforcement Learning in the Artificial Neural Network Optimization Learning Process. | |
| Talamini et al. | Selfish vs. global behavior promotion in car controller evolution | |
| Setiawan et al. | Exploring Deep Q-Network for Autonomous Driving Simulation Across Different Driving Modes | |
| CN118670400B (en) | Multi-agent route planning method and device based on artificial potential field and PPO | |
| Enache et al. | Creating an Artificial Evolutionary Intelligence Using a Graphic Engine | |
| Davis | DEVELOPMENT OF A 3D SIMULATION TOOL LEVERAGING GENETIC ALGORITHMS FOR NEURAL NETWORK OPTIMIZATION IN AUTOMATED DRIVING SYSTEMS: AN APPLICATION FOR SCENARIO ANALYSIS | |
| Ibrahim et al. | Features of microscopic pedestrian movement in a panic situation based on cellular automata model | |
| Sarti | Neuroevolution trajectory networks: illuminating the evolution of artificial neural networks |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |