Disclosure of Invention
The invention aims to overcome the limitations of existing unmanned aerial vehicle track planning and provides a method for planning a pseudo path of an unmanned aerial vehicle based on deep reinforcement learning. The invention plans a pseudo flight path for the case in which the unmanned aerial vehicle must avoid a no-fly area; when the actually planned flight path of the unmanned aerial vehicle conflicts with the no-fly area, the pseudo flight path is used to guide the unmanned aerial vehicle around the no-fly area, thereby ensuring the flight safety of the unmanned aerial vehicle in the airspace and the normal operation of other areas.
The technical scheme adopted by the invention is as follows: a method for planning pseudo paths of an unmanned aerial vehicle based on deep reinforcement learning is characterized by comprising the following steps:
step 1: dividing boundary coordinates of a no-fly area on a flight map, and marking coordinates of a starting point and an end point of the unmanned aerial vehicle flight;
step 2: sensing the current environment state of the unmanned aerial vehicle before executing a flight task, wherein the current environment state comprises low-altitude and high-altitude climate data, the flight height of the unmanned aerial vehicle and the flight position coordinates of the unmanned aerial vehicle; based on the current environment state information, selecting a flight deflection angle and an action in the current environment according to the obtained Q function value by using a deep reinforcement learning algorithm; the unmanned aerial vehicle continuously receives position data given by the ground base station transmitting equipment during flight and interacts with the environment to obtain reward returns that update the Q function;
and step 3: in the flight process, the no-fly area is used as a virtual barrier, and whether the unmanned aerial vehicle flies according to a normal air route or not is judged;
if the unmanned aerial vehicle is far away from the no-fly zone, the unmanned aerial vehicle continues to interactively plan a path with the environment, and the step 2 is executed;
if the unmanned aerial vehicle approaches the edge of the no-fly zone, guiding the unmanned aerial vehicle to plan a pseudo navigation route through a reward function of deep reinforcement learning, and avoiding the no-fly zone;
step 4: if the unmanned aerial vehicle reaches the end point, the flight ends; otherwise, step 2 is executed again.
The invention has the advantages that:
1. the invention can realize the path planning of the unmanned aerial vehicle in a complex environment, so that the unmanned aerial vehicle can efficiently fly to a target position to complete subsequent tasks.
2. The invention can plan a flight pseudo path for the unmanned aerial vehicle to avoid the no-fly airspace by using a deep reinforcement learning method, ensuring that the unmanned aerial vehicle does not mistakenly fly into the aviation no-fly area or the radar monitoring area even in the absence of solid obstacles and avoiding interference with the normal work of other airspaces; the method is efficient, safe and intelligent.
Detailed Description
In order to facilitate the understanding and implementation of the present invention for those of ordinary skill in the art, the present invention is further described in detail with reference to the accompanying drawings and examples, it is to be understood that the embodiments described herein are merely illustrative and explanatory of the present invention and are not restrictive thereof.
The invention adopts a method for planning the pseudo path of the unmanned aerial vehicle based on deep reinforcement learning to avoid the danger of the unmanned aerial vehicle mistakenly entering an aviation no-fly zone during flight. It uses a deep reinforcement learning algorithm combined with grid map positioning and takes the no-fly airspace as a virtual barrier; when the flight path planned by the unmanned aerial vehicle would mistakenly enter the no-fly zone, a pseudo path is replanned for the unmanned aerial vehicle through the reinforcement learning algorithm so that the unmanned aerial vehicle avoids the aviation no-fly zone. This ensures the flight safety of the unmanned aerial vehicle and the normal operation of other aviation zones, and at the same time improves the efficiency and safety of the route planning of the unmanned aerial vehicle.
Referring to fig. 1, the method for planning the pseudo path of the unmanned aerial vehicle based on deep reinforcement learning provided by the invention comprises the following steps:
step 1: dividing boundary coordinates of a no-fly area on a flight map, and marking coordinates of a starting point and an end point of the unmanned aerial vehicle flight;
the no-fly area in the embodiment comprises a normal civil aviation area and a radar area;
in the embodiment, a flight map is simulated as a grid environment model, which divides the flight environment of the unmanned aerial vehicle into a series of cells carrying binary information, of the same or different sizes, some of which are designated as no-fly areas; the boundary coordinates of the no-fly area are explicitly marked on the grid environment model as {(x_i, y_i), (x_{i+1}, y_{i+1}), (x_{i+2}, y_{i+2}), ..., (x_{i+m}, y_{i+n}) | m, n > 0, i ≥ 1}; the position coordinates of the starting point (X_start, Y_start) and the end point (X_end, Y_end) of the unmanned aerial vehicle flight are marked on the flight map.
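For illustration only, the following Python sketch shows one possible grid environment of the kind described above, with no-fly cells marked as binary information and the start and end coordinates recorded; the class name GridEnvironment, the cell layout and the example coordinates are assumptions for demonstration, not the specific implementation of the embodiment.

```python
import numpy as np

class GridEnvironment:
    """Grid model of the flight map: 0 = free cell, 1 = no-fly (virtual barrier) cell."""
    def __init__(self, width, height, no_fly_cells, start, goal):
        self.grid = np.zeros((height, width), dtype=np.int8)
        for (x, y) in no_fly_cells:
            self.grid[y, x] = 1          # mark no-fly cells
        self.start = start               # (X_start, Y_start)
        self.goal = goal                 # (X_end, Y_end)

    def is_no_fly(self, x, y):
        """True if the cell lies inside the marked no-fly area."""
        return self.grid[y, x] == 1

# Example: a small 10 x 10 map with a 2 x 2 no-fly block
no_fly = [(4, 4), (4, 5), (5, 4), (5, 5)]
env = GridEnvironment(width=10, height=10, no_fly_cells=no_fly,
                      start=(0, 0), goal=(9, 9))
```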
Step 2: sensing the current environment state of the unmanned aerial vehicle before executing a flight task, wherein the current environment state comprises low-altitude and high-altitude climate data, the flight height of the unmanned aerial vehicle and the flight position coordinates of the unmanned aerial vehicle; based on the current environment state information, selecting a flight deflection angle and an action in the current environment according to the obtained Q function value by using a deep reinforcement learning algorithm; the unmanned aerial vehicle updates the Q function according to the reward return obtained by continuously receiving position data given by the ground base station transmitting equipment during flight and interacting with the environment;
in this embodiment, the deep reinforcement learning network is Double DQN, a deep reinforcement learning network that uses two DQN neural networks;
Double DQN is an improved deep network combining the convolutional neural network of deep learning with the Q-learning algorithm of reinforcement learning;
the deep reinforcement learning network comprises a state set {S_1, S_2, S_3, ..., S_t | t ≥ 1} of the unmanned aerial vehicle in flight, an action set {a_1, a_2, a_3, ..., a_t | t ≥ 1}, a reward function R(s) and a deep reinforcement learning network weight θ;
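As a purely illustrative sketch, the state and action sets above could be encoded as follows; the specific fields (climate data, altitude, position) and the eight deflection angles are assumptions for demonstration, not requirements of the embodiment.

```python
from dataclasses import dataclass

@dataclass
class UavState:
    wind_low: float    # low-altitude climate data
    wind_high: float   # high-altitude climate data
    altitude: float    # flight height of the UAV
    x: float           # flight position coordinate x
    y: float           # flight position coordinate y

# Discrete action set: candidate flight deflection angles (degrees)
ACTIONS = [0, 45, 90, 135, 180, 225, 270, 315]
```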
deep reinforcement learning substitutes the state set, the action set and the reward function into the state action value function Q_t(s_t, a_t); the update of Q_t(s_t, a_t) is:
Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α[R_t + γ·max_a Q_t(s_{t+1}, a) − Q_t(s_t, a_t)]
wherein Q_{t+1}(s_t, a_t) is the Q value corresponding to time t+1, Q_t(s_t, a_t) is the Q value at time t, α ∈ (0.5, 1] is the learning rate, γ ∈ (0, 1) is the discount factor, and R_t is the return value obtained when the action at time t is executed; max_a Q_t(s_{t+1}, a) is the maximum Q value over the actions available in the next state; if the state s reaches the target point grid cell after action a, then R(s, a) = 1; if the state s reaches a barrier grid cell after action a, then R(s, a) = -1; otherwise, R(s, a) = 0.
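The update rule and reward function above can be illustrated with the following minimal tabular sketch; the table-based Q representation, the integer state indices and the chosen values of α and γ are assumptions for demonstration only.

```python
import numpy as np

ALPHA = 0.7   # learning rate alpha in (0.5, 1]
GAMMA = 0.9   # discount factor gamma in (0, 1)

def reward(next_cell, goal_cell, is_barrier):
    """R(s, a): +1 on reaching the target cell, -1 on a barrier cell, 0 otherwise."""
    if next_cell == goal_cell:
        return 1.0
    if is_barrier:
        return -1.0
    return 0.0

def q_update(Q, s, a, r, s_next):
    """Q_{t+1}(s,a) = Q_t(s,a) + alpha * (R_t + gamma * max_a' Q_t(s',a') - Q_t(s,a))."""
    td_target = r + GAMMA * np.max(Q[s_next])
    Q[s, a] += ALPHA * (td_target - Q[s, a])
    return Q

# Example: 10 x 10 grid flattened to 100 states, 8 deflection-angle actions
Q = np.zeros((100, 8))
Q = q_update(Q, s=0, a=3, r=0.0, s_next=1)
```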
After the target network weight θ is added, the action behavior value function is updated as:
Q_{t+1}(s_t, a_t; θ) = Q_t(s_t, a_t; θ) + α[R_t + γ·V_{t+1} − Q_t(s_t, a_t; θ)]
wherein V_{t+1} is the behavior value obtained from the current state behavior value function Q_t(s_t, a_t; θ) and is used to update the state behavior value at time t+1; in the deep reinforcement learning Double DQN, the selection of the action and the evaluation of the action are realized by different value functions;
the value function formula when the action is selected is as follows:
Y_t^Q = R_{t+1} + γ·max_a Q(S_{t+1}, a; θ);
when a selection is made with this value function, an action a* is first selected such that Q(S_{t+1}, a) is maximized over the actions a in state S_{t+1}; wherein R_{t+1} represents the reward value at time t+1;
the value function used in action evaluation selects the maximizing action a* and then evaluates it with a different network weight θ', giving the action evaluation formula:
Y_t^DoubleQ = R_{t+1} + γ·Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ); θ')
wherein θ' is the weight of the second (estimation) network; the value of the state action value function calculated in this way by the Double DQN deep reinforcement learning network is then used.
In this embodiment, the deep reinforcement learning network weight θ is selected using prioritized experience replay; referring to fig. 2, the specific implementation includes the following sub-steps:
step 2.1: the unmanned aerial vehicle trains in an air flight environment, and a state action data set is collected in the interaction between the unmanned aerial vehicle and the environment and is put into a playback memory unit;
step 2.2: the neural network of the deep reinforcement learning is divided into a real network and an estimation network; when the experience data stored in the playback memory unit exceed the set data set size, the agent (the learning brain in reinforcement learning) starts training;
step 2.3: the unmanned aerial vehicle interacts with the environment and selects an action according to the current state, wherein the real network has the same structure as the estimation network and differs only in the neural network parameters used for training; the real network is trained on the current state of the unmanned aerial vehicle to obtain the maximum state behavior value Q(s, a; θ), while the estimation network is trained to obtain the state behavior value max_{a'} Q(s', a'; θ') for the next state; the error function between the real network and the estimation network is obtained, and the maximum state behavior value function argmax_a Q(s, a; θ) under a greedy strategy is obtained by stochastic gradient descent; the unmanned aerial vehicle selects the next action according to the state behavior value function and continues to interact with the environment (a minimal training-loop sketch is given below).
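The training-loop sketch below covers sub-steps 2.1-2.3 under the assumption of a PyTorch implementation; the network architecture, the buffer and batch sizes, and the use of uniform sampling (in place of the prioritized replay described above) are simplifying assumptions for demonstration.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class QNet(nn.Module):
    """The real and estimation networks share this structure; only the weights differ."""
    def __init__(self, n_states, n_actions):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def forward(self, x):
        return self.layers(x)

memory = deque(maxlen=10000)  # playback memory unit (step 2.1)

def train_step(online_net, target_net, optimizer, batch_size=32, gamma=0.9):
    if len(memory) < batch_size:  # step 2.2: wait for enough stored experience
        return
    batch = random.sample(list(memory), batch_size)  # uniform sampling for brevity
    s, a, r, s_next = (torch.stack(t) for t in zip(*batch))
    # step 2.3: Double DQN target - select with weights theta, evaluate with theta'
    a_star = online_net(s_next).argmax(dim=1, keepdim=True)
    y = r + gamma * target_net(s_next).gather(1, a_star).squeeze(1)
    q = online_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q, y.detach())  # error between the two networks
    optimizer.zero_grad()
    loss.backward()   # stochastic gradient descent step
    optimizer.step()
```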
In this embodiment, the unmanned aerial vehicle continuously interacts with the environment during flight, and the state behavior value function Q_t(s_t, a_t; θ) is continuously updated according to the Double DQN algorithm, updating the route trajectory.
And step 3: in the flight process, the no-fly area is used as a virtual barrier, and whether the unmanned aerial vehicle flies according to a normal air route or not is judged;
if the unmanned aerial vehicle is far away from the no-fly zone, the unmanned aerial vehicle continues to interactively plan a path with the environment, and the step 2 is executed;
if the unmanned aerial vehicle approaches the edge of the no-fly zone, guiding the unmanned aerial vehicle to plan a pseudo navigation route through a reward function of deep reinforcement learning, and avoiding the no-fly zone;
step 4: if the unmanned aerial vehicle reaches the end point, the flight ends; otherwise, step 2 is executed again.
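For illustration, steps 2-4 can be summarized by the following high-level Python sketch; the function names (sense_state, choose_action, near_no_fly_edge, plan_pseudo_route, reached_goal) are hypothetical placeholders rather than defined interfaces of the embodiment.

```python
def fly(env, agent):
    state = agent.sense_state(env)               # climate data, altitude, position
    while True:
        action = agent.choose_action(state)      # deflection angle with the largest Q value
        next_state, r = env.step(action)         # base-station position data plus reward
        agent.update_q(state, action, r, next_state)
        if env.near_no_fly_edge(next_state):
            # negative reward steers the UAV onto a pseudo route around the virtual barrier
            agent.plan_pseudo_route(next_state)
        if env.reached_goal(next_state):          # step 4: end the flight at the end point
            break
        state = next_state                        # otherwise continue with step 2
```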
In this embodiment, a pseudo path diagram of the flight path planning is shown in fig. 3.
The invention plans the unmanned aerial vehicle pseudo path by using a deep reinforcement learning method that combines the neural networks of deep learning with reinforcement learning, obtains the policy function value through the interaction of an agent with the environment, and guides the selection of the flight actions of the unmanned aerial vehicle; the method has strong convergence and generalization capability and improves the degree of intelligence of unmanned aerial vehicle flight.
It should be understood that parts of the specification not set forth in detail are prior art; the above description of the preferred embodiments is intended to be illustrative, and not to be construed as limiting the scope of the invention, which is defined by the appended claims, and all changes and modifications that fall within the metes and bounds of the claims, or equivalences of such metes and bounds are therefore intended to be embraced by the appended claims.