Disclosure of Invention
The invention aims to overcome the limitations of existing unmanned aerial vehicle track planning and provides a method for planning a pseudo path of an unmanned aerial vehicle based on deep reinforcement learning. The invention plans a pseudo flight path for the case in which the unmanned aerial vehicle must avoid a no-fly area; when the actually planned flight path of the unmanned aerial vehicle conflicts with the no-fly area, the pseudo flight path is used to guide the unmanned aerial vehicle around the no-fly area, thereby ensuring the flight safety of the unmanned aerial vehicle in the airspace and the normal operation of other areas.
The technical scheme adopted by the invention is as follows: a method for planning pseudo paths of an unmanned aerial vehicle based on deep reinforcement learning is characterized by comprising the following steps:
step 1: dividing boundary coordinates of a no-fly area on a flight map, and marking coordinates of a starting point and an end point of the unmanned aerial vehicle flight;
step 2: sensing the current environment state of the unmanned aerial vehicle before executing a flight task, wherein the current environment state comprises low-altitude and high-altitude climate data, the flight height of the unmanned aerial vehicle and the flight position coordinates of the unmanned aerial vehicle; based on the current environment state information, selecting a flight deflection angle and an action in the current environment according to the obtained Q function value by using a deep reinforcement learning algorithm; the unmanned aerial vehicle continuously receives position data given by the ground base station transmitting equipment during flight and interacts with the environment to obtain reward returns that update the Q function;
and step 3: in the flight process, the no-fly area is used as a virtual barrier, and whether the unmanned aerial vehicle flies according to a normal air route or not is judged;
if the unmanned aerial vehicle is far away from the no-fly zone, the unmanned aerial vehicle continues to interactively plan a path with the environment, and the step 2 is executed;
if the unmanned aerial vehicle approaches the edge of the no-fly zone, guiding the unmanned aerial vehicle to plan a pseudo navigation route through a reward function of deep reinforcement learning, and avoiding the no-fly zone;
step 4: if the unmanned aerial vehicle reaches the end point, the flight ends; otherwise, step 2 is executed again.
The invention has the advantages that:
1. the invention can realize the path planning of the unmanned aerial vehicle in a complex environment, so that the unmanned aerial vehicle can efficiently fly to a target position to complete subsequent tasks.
2. The invention can plan a flight pseudo path for the unmanned aerial vehicle to avoid the no-fly airspace by using a deep reinforcement learning method, ensuring that the unmanned aerial vehicle does not mistakenly fly into the aviation no-fly area or the radar monitoring area even in the absence of solid obstacles and avoiding interference with the normal work of other airspaces; the method is efficient, safe and intelligent.
Detailed Description
In order to facilitate the understanding and implementation of the present invention for those of ordinary skill in the art, the present invention is further described in detail with reference to the accompanying drawings and examples, it is to be understood that the embodiments described herein are merely illustrative and explanatory of the present invention and are not restrictive thereof.
The invention adopts a method for planning the pseudo path of the unmanned aerial vehicle based on deep reinforcement learning to avoid the danger of the unmanned aerial vehicle mistakenly entering an aviation no-fly zone during flight. It uses a deep reinforcement learning algorithm combined with grid map positioning and takes the no-fly airspace as a virtual barrier; when the flight path planned by the unmanned aerial vehicle would mistakenly enter the no-fly zone, a pseudo path is replanned for the unmanned aerial vehicle through the reinforcement learning algorithm so that the unmanned aerial vehicle avoids the aviation no-fly zone. This ensures the flight safety of the unmanned aerial vehicle and the normal operation of other aviation zones, and at the same time improves the efficiency and safety of the route planning of the unmanned aerial vehicle.
Referring to fig. 1, the method for planning the pseudo path of the unmanned aerial vehicle based on deep reinforcement learning provided by the invention comprises the following steps:
step 1: dividing boundary coordinates of a no-fly area on a flight map, and marking coordinates of a starting point and an end point of the unmanned aerial vehicle flight;
the no-fly area in the embodiment comprises a normal civil aviation area and a radar area;
in the embodiment, a flight map is simulated as a grid environment model, which divides the flight environment of the unmanned aerial vehicle into a series of cells carrying binary information, of the same or different sizes, some of which are designated as no-fly areas; the boundary coordinates of the no-fly area are explicitly marked on the grid environment model as {(x_i, y_i), (x_{i+1}, y_{i+1}), (x_{i+2}, y_{i+2}), ..., (x_{i+m}, y_{i+n}) | m, n > 0, i ≥ 1}; the position coordinates of the starting point (X_start, Y_start) and the end point (X_end, Y_end) of the unmanned aerial vehicle flight are marked on the flight map.
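For illustration only, the following Python sketch shows one possible grid environment of the kind described above, with no-fly cells marked as binary information and the start and end coordinates recorded; the class name GridEnvironment, the cell layout and the example coordinates are assumptions for demonstration, not the specific implementation of the embodiment.

```python
import numpy as np

class GridEnvironment:
    """Grid model of the flight map: 0 = free cell, 1 = no-fly (virtual barrier) cell."""
    def __init__(self, width, height, no_fly_cells, start, goal):
        self.grid = np.zeros((height, width), dtype=np.int8)
        for (x, y) in no_fly_cells:
            self.grid[y, x] = 1          # mark no-fly cells
        self.start = start               # (X_start, Y_start)
        self.goal = goal                 # (X_end, Y_end)

    def is_no_fly(self, x, y):
        """True if the cell lies inside the marked no-fly area."""
        return self.grid[y, x] == 1

# Example: a small 10 x 10 map with a 2 x 2 no-fly block
no_fly = [(4, 4), (4, 5), (5, 4), (5, 5)]
env = GridEnvironment(width=10, height=10, no_fly_cells=no_fly,
                      start=(0, 0), goal=(9, 9))
```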
Step 2: sensing the current environment state of the unmanned aerial vehicle before executing a flight task, wherein the current environment state comprises low-altitude and high-altitude climate data, the flight height of the unmanned aerial vehicle and the flight position coordinates of the unmanned aerial vehicle; based on the current environment state information, selecting a flight deflection angle and an action in the current environment according to the obtained Q function value by using a deep reinforcement learning algorithm; the unmanned aerial vehicle updates the Q function according to the reward return obtained by continuously receiving position data given by the ground base station transmitting equipment during flight and interacting with the environment;
in this embodiment, the deep reinforcement learning network is Double DQN, a deep reinforcement learning network that uses two DQN neural networks;
Double DQN is an improved deep network combining the convolutional neural network of deep learning with the Q-learning algorithm of reinforcement learning;
the deep reinforcement learning network comprises a state set {S_1, S_2, S_3, ..., S_t | t ≥ 1} of the unmanned aerial vehicle in flight, an action set {a_1, a_2, a_3, ..., a_t | t ≥ 1}, a reward function R(s) and a deep reinforcement learning network weight θ;
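As a purely illustrative sketch, the state and action sets above could be encoded as follows; the specific fields (climate data, altitude, position) and the eight deflection angles are assumptions for demonstration, not requirements of the embodiment.

```python
from dataclasses import dataclass

@dataclass
class UavState:
    wind_low: float    # low-altitude climate data
    wind_high: float   # high-altitude climate data
    altitude: float    # flight height of the UAV
    x: float           # flight position coordinate x
    y: float           # flight position coordinate y

# Discrete action set: candidate flight deflection angles (degrees)
ACTIONS = [0, 45, 90, 135, 180, 225, 270, 315]
```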
deep reinforcement learning substitutes the state set, the action set and the reward function into the state action value function Q_t(s_t, a_t); the update of Q_t(s_t, a_t) is:
Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α[R_t + γ·max_a Q_t(s_{t+1}, a) − Q_t(s_t, a_t)]
wherein Q_{t+1}(s_t, a_t) is the Q value corresponding to time t+1, Q_t(s_t, a_t) is the Q value at time t, α ∈ (0.5, 1] is the learning rate, γ ∈ (0, 1) is the discount factor, and R_t is the return value obtained when the action at time t is executed; max_a Q_t(s_{t+1}, a) is the maximum Q value over the actions available in the next state; if the state s reaches the target point grid cell after action a, then R(s, a) = 1; if the state s reaches a barrier grid cell after action a, then R(s, a) = -1; otherwise, R(s, a) = 0.
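The update rule and reward function above can be illustrated with the following minimal tabular sketch; the table-based Q representation, the integer state indices and the chosen values of α and γ are assumptions for demonstration only.

```python
import numpy as np

ALPHA = 0.7   # learning rate alpha in (0.5, 1]
GAMMA = 0.9   # discount factor gamma in (0, 1)

def reward(next_cell, goal_cell, is_barrier):
    """R(s, a): +1 on reaching the target cell, -1 on a barrier cell, 0 otherwise."""
    if next_cell == goal_cell:
        return 1.0
    if is_barrier:
        return -1.0
    return 0.0

def q_update(Q, s, a, r, s_next):
    """Q_{t+1}(s,a) = Q_t(s,a) + alpha * (R_t + gamma * max_a' Q_t(s',a') - Q_t(s,a))."""
    td_target = r + GAMMA * np.max(Q[s_next])
    Q[s, a] += ALPHA * (td_target - Q[s, a])
    return Q

# Example: 10 x 10 grid flattened to 100 states, 8 deflection-angle actions
Q = np.zeros((100, 8))
Q = q_update(Q, s=0, a=3, r=0.0, s_next=1)
```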
After the target network weight θ is added, the action behavior value function is updated as:
Q_{t+1}(s_t, a_t; θ) = Q_t(s_t, a_t; θ) + α[R_t + γ·V_{t+1} − Q_t(s_t, a_t; θ)]
wherein V_{t+1} is the behavior value obtained from the current state behavior value function Q_t(s_t, a_t; θ) and is used to update the state behavior value at time t+1; in the deep reinforcement learning Double DQN, the selection of the action and the evaluation of the action are realized by different value functions;
the value function formula when the action is selected is as follows:
Y_t^Q = R_{t+1} + γ·max_a Q(S_{t+1}, a; θ);
when a selection is made with this value function, an action a* is first selected such that Q(S_{t+1}, a) is maximized over the actions a in state S_{t+1}; wherein R_{t+1} represents the reward value at time t+1;
the value function used in action evaluation selects the maximizing action a* and then evaluates it with a different network weight θ', giving the action evaluation formula:
Y_t^DoubleQ = R_{t+1} + γ·Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ); θ')
wherein θ' is the weight of the second (estimation) network; the value of the state action value function calculated in this way by the Double DQN deep reinforcement learning network is then used.
In this embodiment, the deep reinforcement learning network weight θ is selected using prioritized experience replay; referring to fig. 2, the specific implementation includes the following sub-steps:
step 2.1: the unmanned aerial vehicle trains in an air flight environment, and a state action data set is collected in the interaction between the unmanned aerial vehicle and the environment and is put into a playback memory unit;
step 2.2: the neural network of the deep reinforcement learning is divided into a real network and an estimation network; when the experience data stored in the playback memory unit exceed the set data set size, the agent (the learning brain in reinforcement learning) starts training;
step 2.3: the unmanned aerial vehicle interacts with the environment and selects an action according to the current state, wherein the real network has the same structure as the estimation network and differs only in the neural network parameters used for training; the real network is trained on the current state of the unmanned aerial vehicle to obtain the maximum state behavior value Q(s, a; θ), while the estimation network is trained to obtain the state behavior value max_{a'} Q(s', a'; θ') for the next state; the error function between the real network and the estimation network is obtained, and the maximum state behavior value function argmax_a Q(s, a; θ) under a greedy strategy is obtained by stochastic gradient descent; the unmanned aerial vehicle selects the next action according to the state behavior value function and continues to interact with the environment (a minimal training-loop sketch is given below).
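The training-loop sketch below covers sub-steps 2.1-2.3 under the assumption of a PyTorch implementation; the network architecture, the buffer and batch sizes, and the use of uniform sampling (in place of the prioritized replay described above) are simplifying assumptions for demonstration.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class QNet(nn.Module):
    """The real and estimation networks share this structure; only the weights differ."""
    def __init__(self, n_states, n_actions):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def forward(self, x):
        return self.layers(x)

memory = deque(maxlen=10000)  # playback memory unit (step 2.1)

def train_step(online_net, target_net, optimizer, batch_size=32, gamma=0.9):
    if len(memory) < batch_size:  # step 2.2: wait for enough stored experience
        return
    batch = random.sample(list(memory), batch_size)  # uniform sampling for brevity
    s, a, r, s_next = (torch.stack(t) for t in zip(*batch))
    # step 2.3: Double DQN target - select with weights theta, evaluate with theta'
    a_star = online_net(s_next).argmax(dim=1, keepdim=True)
    y = r + gamma * target_net(s_next).gather(1, a_star).squeeze(1)
    q = online_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q, y.detach())  # error between the two networks
    optimizer.zero_grad()
    loss.backward()   # stochastic gradient descent step
    optimizer.step()
```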
In this embodiment, the unmanned aerial vehicle continuously interacts with the environment during flight, and the state behavior value function Q_t(s_t, a_t; θ) is continuously updated according to the Double DQN algorithm, updating the route trajectory.
And step 3: in the flight process, the no-fly area is used as a virtual barrier, and whether the unmanned aerial vehicle flies according to a normal air route or not is judged;
if the unmanned aerial vehicle is far away from the no-fly zone, the unmanned aerial vehicle continues to interactively plan a path with the environment, and the step 2 is executed;
if the unmanned aerial vehicle approaches the edge of the no-fly zone, guiding the unmanned aerial vehicle to plan a pseudo navigation route through a reward function of deep reinforcement learning, and avoiding the no-fly zone;
step 4: if the unmanned aerial vehicle reaches the end point, the flight ends; otherwise, step 2 is executed again.
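For illustration, steps 2-4 can be summarized by the following high-level Python sketch; the function names (sense_state, choose_action, near_no_fly_edge, plan_pseudo_route, reached_goal) are hypothetical placeholders rather than defined interfaces of the embodiment.

```python
def fly(env, agent):
    state = agent.sense_state(env)               # climate data, altitude, position
    while True:
        action = agent.choose_action(state)      # deflection angle with the largest Q value
        next_state, r = env.step(action)         # base-station position data plus reward
        agent.update_q(state, action, r, next_state)
        if env.near_no_fly_edge(next_state):
            # negative reward steers the UAV onto a pseudo route around the virtual barrier
            agent.plan_pseudo_route(next_state)
        if env.reached_goal(next_state):          # step 4: end the flight at the end point
            break
        state = next_state                        # otherwise continue with step 2
```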
In this embodiment, a pseudo path diagram of the flight path planning is shown in fig. 3.
The invention plans the unmanned aerial vehicle pseudo path by using a deep reinforcement learning method that combines the neural networks of deep learning with reinforcement learning, obtains the policy function value through the interaction of an agent with the environment, and guides the selection of the flight actions of the unmanned aerial vehicle; the method has strong convergence and generalization capability and improves the degree of intelligence of unmanned aerial vehicle flight.
It should be understood that parts of the specification not set forth in detail are prior art; the above description of the preferred embodiments is intended to be illustrative, and not to be construed as limiting the scope of the invention, which is defined by the appended claims, and all changes and modifications that fall within the metes and bounds of the claims, or equivalences of such metes and bounds are therefore intended to be embraced by the appended claims.