CN118485134A - Target searching method and device based on incremental reinforcement learning - Google Patents
Target searching method and device based on incremental reinforcement learning
- Publication number
- CN118485134A
- Authority
- CN
- China
- Prior art keywords
- unmanned aerial
- aerial vehicle
- target
- search
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/092—Reinforcement learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The application discloses a target searching method based on incremental reinforcement learning, which comprises the following steps: performing gridding processing on a target space in which the unmanned aerial vehicle executes a first search task to generate an environment data network; designing a reward function based on the current state of the unmanned aerial vehicle in the environment data network, the action taken in the current state and the next state, and generating a reinforcement learning training model based on the reward function and a preset searching strategy; under the condition that the unmanned aerial vehicle executes a second search task, in response to the result of the second search task's evaluation of the reinforcement learning training model being that the model needs to be retrained, taking the reinforcement learning training model as a first network branch and the model generated by training the model parameters corresponding to the second search task as a second network branch, designing the network structure, and generating a new task adaptation model based on incremental learning; and searching for the target based on the new task adaptation model. By applying the method, efficient target searching can be realized.
Description
Technical Field
The application relates to the field of target searching and tracking, in particular to a target searching method based on incremental reinforcement learning.
Background
Target search refers to the process of searching for an optimal solution in a given environment, and involves defining a search space, designing a search algorithm, establishing an environment model and other aspects; reinforcement learning is a machine learning method in which target search is achieved through interactive learning between an agent and the environment.
The application of conventional reinforcement learning methods in the field of target search generally includes the following steps. (1) Environment modeling: the environment of the search task is modeled, including defining the state space, the action space and the reward function. (2) Strategy design: the policy of the agent is designed, i.e. how the optimal action is selected in a given state; this may involve various reinforcement learning algorithms, such as Q-learning and deep Q-networks (DQN). (3) Training and optimization: training and optimization are performed through interaction between the agent and the environment, so that the agent learns an optimal strategy; this typically involves back-propagation based on the reward signal, as well as parameter adjustment and model updating. While traditional reinforcement learning methods perform well in some situations, they also have problems: for example, "catastrophic forgetting" occurs when facing new search tasks, so that learned knowledge is forgotten, which affects their application.
In view of this, how to provide a target searching method that better adapts to a new search environment while ensuring the search effect on the current task is a technical problem that currently needs to be solved.
Disclosure of Invention
The embodiment of the application provides a target searching method based on incremental reinforcement learning, a target searching device based on the incremental reinforcement learning, a computer readable storage medium and a computer device, which are used for solving the problems of low searching efficiency, poor adaptability to environmental changes and the like in the traditional method.
In a first aspect of an embodiment of the present application, there is provided a target search method based on incremental reinforcement learning, including:
Performing gridding processing on a target space where the unmanned aerial vehicle executes a first search task to generate an environment data network;
Designing a reward function based on a current state, actions taken by the current state and a next state corresponding to the unmanned aerial vehicle in the environment data network, and generating a reinforcement learning training model based on the reward function and a preset search strategy, wherein the search strategy is an optimal search strategy corresponding to the defined unmanned aerial vehicle when searching for a target in a target space;
Under the condition that the unmanned aerial vehicle executes a second search task, in response to the result of the second search task's evaluation of the reinforcement learning training model being that the model needs to be retrained, taking the reinforcement learning training model as a first network branch and a model generated by training the model parameters corresponding to the second search task as a second network branch, performing network structure design, and generating a new task adaptation model based on incremental learning;
And searching the target based on the new task adaptation model.
In a second aspect of the embodiments of the present application, there is provided a target search apparatus based on incremental reinforcement learning, including:
the processing module is configured to grid the target space where the unmanned aerial vehicle executes the first search task, and generate an environment data network;
The first model generation module is configured to design a reward function based on a current state, an action taken by the current state and a next state corresponding to the unmanned aerial vehicle in the environment data network, and generate a reinforcement learning training model based on the reward function and a preset search strategy, wherein the search strategy is an optimal search strategy corresponding to the unmanned aerial vehicle when the unmanned aerial vehicle searches for a target in a target space;
The second model generating module is configured to retrain the model in response to the evaluation result of the evaluation of the reinforcement learning training model by the second search task under the condition that the unmanned aerial vehicle executes the second search task, and then, the reinforcement learning training model is taken as a first network branch, the model generated by training the model parameters corresponding to the second search task is taken as a second network branch, the design of a network structure is carried out, and a new task adaptation model based on incremental learning is generated;
And the searching module is configured to search the target based on the new task adaptation model.
In a third aspect of the embodiments of the present application, there is provided a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the incremental reinforcement learning-based target search method described above.
In a fourth aspect of embodiments of the present application, there is provided a computer device comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions which, when executed by the processor, implement the steps of the incremental reinforcement learning-based target search method described above.
The embodiment of the application provides a target searching method based on incremental reinforcement learning, which comprises the following steps: firstly, performing gridding processing on a target space in which an unmanned aerial vehicle executes a first search task to generate an environment data network; then, designing a reward function based on the current state of the unmanned aerial vehicle in the environment data network, the action taken in the current state and the next state, and generating a reinforcement learning training model based on the reward function and a preset search strategy, wherein the search strategy is the defined optimal search strategy for the unmanned aerial vehicle when searching for a target in the target space; secondly, under the condition that the unmanned aerial vehicle executes a second search task, when the evaluation result of the second search task on the reinforcement learning training model is that the model needs to be retrained, taking the reinforcement learning training model as a first network branch and the model generated by training the model parameters corresponding to the second search task as a second network branch, designing the network structure, and generating a new task adaptation model based on incremental learning; and finally, searching for the target based on the new task adaptation model.
By applying the method provided by the embodiment of the application, the advantage of reinforcement learning is utilized, and incremental learning is introduced, so that the current task searching effect is ensured, and meanwhile, the method can be better adapted to a new searching environment, thereby realizing efficient target searching.
The foregoing is only an overview of the technical solution of the present application; it is provided so that the technical means of the application can be understood more clearly and implemented in accordance with the description, and so that the above and other objects, features and advantages of the application become more readily apparent.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a schematic flow chart of a target searching method based on incremental reinforcement learning according to an embodiment of the present application;
FIG. 2 is a schematic diagram corresponding to a reward function of a target search method based on incremental reinforcement learning according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of a reinforcement learning training model in a target search method based on incremental reinforcement learning according to an embodiment of the present application;
FIG. 4 is a schematic diagram of reinforcement learning structure in a target search method based on incremental reinforcement learning according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an incremental learning network connection mechanism in a target search method based on incremental reinforcement learning according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a new task adaptation flow of an incremental learning training model in an incremental reinforcement learning-based target search method according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a target search device based on incremental reinforcement learning according to an embodiment of the present application;
FIG. 8 is a block diagram of a computing device provided by an embodiment of the application.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Embodiments of the present application provide a target search method based on incremental reinforcement learning, a target search apparatus based on incremental reinforcement learning, a computing device, and a computer-readable storage medium, which are described in detail in the following embodiments one by one.
In the target searching method based on incremental reinforcement learning provided by the embodiment of the application, an unmanned aerial vehicle is arranged to perform target searching in a target space 1: the target space 1 is first gridded to facilitate subsequent data acquisition; the acquired data are then used for training to obtain a model 1; next, the target searching effect of model 1 on task 2 is tested: if the effect is good, training continues, and if the effect is poor, a model 2 is redesigned and trained, wherein model 2 adopts a network architecture based on the incremental-learning idea, so that a good effect is achieved on task 2 while the good effect on task 1 is retained.
Referring to fig. 1, fig. 1 is a schematic flow chart of a target searching method based on incremental reinforcement learning according to an embodiment of the present application. As shown in fig. 1, the following steps are specifically included.
Step S102: and performing gridding processing on a target space where the unmanned aerial vehicle executes the first search task to generate an environment data network.
Specifically, the performing gridding processing on the target space where the unmanned aerial vehicle executes the first search task to generate an environmental data network includes: dividing the target space into grid cells with the same specification based on a preset grid cell specification, wherein the grid cells are corresponding to unique identifiers, and when the unmanned aerial vehicle moves to each grid cell, the unmanned aerial vehicle respectively corresponds to a current state of the unmanned aerial vehicle; designing an identification symbol corresponding to a moving position of the unmanned aerial vehicle in a target space, wherein the moving position comprises: upward movement, downward movement, leftward movement, and rightward movement; an environmental data network is generated based on the grid cells and the identification symbol.
In practical application, the target space is subjected to gridding treatment, so that the subsequent unmanned aerial vehicle can be facilitated to search.
In the embodiment of the present application, a target space is defined first. Assuming that the spatial range of the target space is W×H, the target space is divided into N×M grid cells, where each grid cell has a unique identifier (i, j). Each grid cell is used as a current state of the unmanned aerial vehicle, and the state space is represented by S, where S = {(i, j) | i ∈ [1, N], j ∈ [1, M]}. Therefore, the grid cell where the unmanned aerial vehicle is located is its current state.
Secondly, a corresponding unmanned aerial vehicle action space (namely the moving positions) is designed in the target space, wherein the action space comprises the moving directions of the unmanned aerial vehicle in the target space, such as upward, downward, leftward and rightward. In the embodiment of the present application, the action space is denoted by A, where A = {a1, a2, a3, a4}; a1 represents upward movement, a2 represents downward movement, a3 represents leftward movement, and a4 represents rightward movement.
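For illustration only, the gridded state space and action space described above can be sketched in Python as follows; the concrete grid dimensions and the coordinate convention for each moving direction are assumptions of this sketch and are not prescribed by the application.

```python
from itertools import product

# Minimal sketch of the gridded environment described above. The grid dimensions
# N, M and the direction conventions are assumptions chosen for illustration.
N, M = 10, 8

# State space S = {(i, j) | i in [1, N], j in [1, M]}: one state per grid cell.
states = [(i, j) for i, j in product(range(1, N + 1), range(1, M + 1))]

# Action space A = {a1, a2, a3, a4}: upward, downward, leftward, rightward movement.
actions = {
    "a1": (0, 1),    # upward movement (assumed to increase j)
    "a2": (0, -1),   # downward movement
    "a3": (-1, 0),   # leftward movement
    "a4": (1, 0),    # rightward movement
}

def step(state, action):
    """Apply a moving action and clamp the result to the grid boundary."""
    di, dj = actions[action]
    i = min(max(state[0] + di, 1), N)
    j = min(max(state[1] + dj, 1), M)
    return (i, j)
```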
Step S104: designing a reward function based on the current state, the action taken by the current state and the next state corresponding to the unmanned aerial vehicle in the environment data network, and generating a reinforcement learning training model based on the reward function and a preset search strategy, wherein the search strategy is an optimal search strategy corresponding to the defined unmanned aerial vehicle when searching for a target in a target space.
In the embodiment of the application, for the reinforcement learning training model, a learning environment, a definition strategy and a reward function are required to be constructed, and the method specifically comprises the following steps:
specifically, designing a reward function based on the current state, the action taken by the current state and the next state corresponding to the unmanned aerial vehicle in the environmental data network, and generating a reinforcement learning training model based on the reward function and a preset search strategy, including:
Designing a reward function based on the current state corresponding to the unmanned aerial vehicle, the action taken by the current state and the next state;
Defining a search strategy of the unmanned aerial vehicle for searching targets in a target space by using a value function;
starting from the initial state, determining actions taken by the current state based on the current state of the unmanned aerial vehicle, and observing rewards and the next state after the unmanned aerial vehicle executes the actions until the reinforcement learning training model converges;
And testing the reinforcement learning training model and generating a test result, wherein the test result is the effect of searching the target in the scene by the unmanned aerial vehicle.
The reward function is used to evaluate how well the unmanned aerial vehicle performs a specific action in a specific state. In an embodiment of the application, the reward function is represented by R(s, a, s'), where s represents the current state, a represents the action taken, and s' represents the next state.
In the embodiment of the application, firstly, setting of the reward function is carried out, and the method specifically comprises the following steps.
The method for designing the reward function based on the current state, the action taken by the current state and the next state corresponding to the unmanned aerial vehicle comprises the following steps:
Determining that target searching is successful in response to the unmanned aerial vehicle moving to a first preset range of a cell where a distance searching target is located, and setting rewards as first positive values, wherein the first preset range is a range where the unmanned aerial vehicle moves to a cell where the distance searching target is located;
And responding to the unmanned aerial vehicle moving to a second preset range of a cell where the distance searching target is, determining that the target searching fails, and setting the rewards to be negative, wherein the second preset range is a range where the unmanned aerial vehicle moves to a position outside one cell of the distance searching target, and the first preset range and the second preset range are disjoint spaces.
More specifically, the determining that the target search fails and setting the reward to be negative in response to the unmanned aerial vehicle moving to be within a second preset range from the cell where the search target is located includes:
Setting a reward to a first negative value in response to the unmanned aerial vehicle moving into a first spatial range within the second preset range, and setting the reward to a second negative value based on the number of moving steps of the unmanned aerial vehicle in response to the unmanned aerial vehicle moving within the first spatial range, wherein the first spatial range comprises at least one grid cell within the second preset range, and the second negative value is determined based on the number of moving steps of the unmanned aerial vehicle within the first spatial range;
And setting the reward to a first negative value in response to the drone moving to a spatial boundary of the target space.
In addition, in response to the drone moving from within the first spatial range to within a second spatial range of the second preset range, the reward is set to a second positive value.
In practical application, if the unmanned aerial vehicle moves to within one grid cell of the target, the target is considered to have been found: an additional positive reward is given, the search process ends, and the reward is set to a positive value, indicating that the unmanned aerial vehicle has successfully found the target.
If the unmanned aerial vehicle touches a space boundary or an illegal area, the reward is set to be negative, which indicates that the unmanned aerial vehicle should avoid collision.
The reward is otherwise a small negative value to encourage the drone to find the target as soon as possible.
| Event | Reward |
|---|---|
| Finding the target | +5 |
| Entering an unreached area | +0.1 |
| Entering an already-reached area | -0.1 |
| Hitting a wall or obstacle | -0.1 |
| Consumption per step | -0.01 |
It should be noted that the event "finding the target" means that the unmanned aerial vehicle has moved into the first preset range around the cell where the search target is located, so the target search succeeds and a positive reward of +5 is given; the event "entering an already-reached area" means that the unmanned aerial vehicle has moved into the first spatial range of the second preset range and, not yet having found the target, keeps moving within that first spatial range, and a negative reward of -0.1 is given; the event "entering an unreached area" means that the unmanned aerial vehicle, still searching for the target, has moved from the first spatial range into the second spatial range of the second preset range, and a positive reward of +0.1 is given; the event "hitting a wall or obstacle" means that the unmanned aerial vehicle has moved to the spatial boundary of the target space, and a negative reward of -0.1 is given; the event "consumption per step" gives a reward of -0.01 for each step the unmanned aerial vehicle moves within the spatial ranges of the second preset range (e.g. the first spatial range and the second spatial range).
Referring to fig. 2, fig. 2 is a schematic diagram corresponding to a reward function of a target search method based on incremental reinforcement learning according to an embodiment of the present application.
As shown in fig. 2, the search target is denoted by "Q"; the first preset range is denoted by "P"; the second preset range is denoted by "O"; the first spatial range of the second preset range is denoted by "S"; the second spatial range of the second preset range is denoted by "T"; it should be noted that, the division of the spatial range in the second preset range may be performed according to actual needs, and in the embodiment of the present application, the division of the second preset range into two spatial ranges of the first spatial range and the second spatial range is performed as an example (for example, the third spatial range U may also be included).
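For illustration only, a reward function following the example values in the table above can be sketched in Python as follows; the helper arguments (the set of already-reached cells, the target cell and the obstacle predicate) and the stacking of the per-step consumption with the event rewards are assumptions of this sketch.

```python
def reward(state, action, next_state, visited, target, is_obstacle):
    """Sketch of R(s, a, s') using the example values from the table above.

    `visited` (the set of already-reached cells), `target` (the cell of the search
    target) and `is_obstacle` (a predicate for walls/obstacles) are hypothetical
    helpers introduced only for this illustration.
    """
    r = -0.01                                          # consumption per step
    di = abs(next_state[0] - target[0])
    dj = abs(next_state[1] - target[1])
    if max(di, dj) <= 1:                               # within one cell of the target
        return r + 5.0                                 # target found, search ends
    if is_obstacle(next_state):                        # hitting a wall or obstacle
        return r - 0.1
    if next_state in visited:                          # entering an already-reached area
        return r - 0.1
    return r + 0.1                                     # entering an unreached area
```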
Then, in reinforcement learning, the value function Q(s, a) is used to guide the decision process. It should be noted that the update rule of the algorithm is as follows:

Q(s_i, a_i) ← Q(s_i, a_i) + α[R(s_i, a_i, s_{i+1}) + γ max_a' Q(s_{i+1}, a') - Q(s_i, a_i)]

where Q(s_i, a_i) is the Q value of taking action a_i in state s_i, α is the learning rate, and γ is the discount factor.
Secondly, the training process of the model specifically comprises the following steps: starting from the initial state s_0, an action a is selected according to the current state, and after the action is performed, the reward R(s, a, s') and the next state s' are observed. This process is repeated until convergence or until a predetermined number of training rounds is reached.
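For illustration only, this training loop can be sketched in Python as follows; the environment interface (reset, step, actions), the exploration scheme and the numeric values of the learning rate α, the discount factor γ and the number of training rounds are assumptions of this sketch and are not prescribed by the application.

```python
import random
from collections import defaultdict

def train_q_learning(env, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning sketch for the search task.

    `env` is assumed to expose reset() -> s0, step(s, a) -> (s', r, done) and a
    list of actions; these names are illustrative and not part of the application.
    """
    Q = defaultdict(float)                         # Q[(state, action)], initially 0
    for _ in range(episodes):
        s = env.reset()                            # start from the initial state s0
        done = False
        while not done:
            # epsilon-greedy selection over A = {a1, a2, a3, a4}
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(s, a)
            # Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
            best_next = max(Q[(s_next, x)] for x in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```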
And then, testing the generated model: after training is completed, the effect of searching targets in the scene by the unmanned aerial vehicle can be tested. According to the learned strategy, the unmanned aerial vehicle selects actions and executes, and observes rewards and search results.
And finally, judging whether the current model is feasible in the new task or not, namely judging whether the current trained model is suitable for the new task or not. The method comprises the following specific steps:
(1) Effect evaluation: and evaluating the performance of the current model in the new task, including indexes such as search efficiency and the like.
(2) Model training judgment: and judging whether the model needs to be retrained according to the performance of the model in the current task.
Referring to fig. 3, fig. 3 is a schematic flow chart of a reinforcement learning training model in a target searching method based on incremental reinforcement learning according to an embodiment of the present application.
As shown in fig. 3, the training process of the task 1 model includes: gridding the space; collecting data; processing the data; and training the reinforcement learning model and outputting model 1.
The target searching process comprises: updating the target object information; determining a target search behavior based on the output model 1; performing the target search behavior; and judging whether the detected object is the target: if so, outputting the position of the target object, otherwise updating the target object information and re-executing the target searching process.
Referring to fig. 4, fig. 4 is a schematic structural diagram of reinforcement learning in a target search method based on incremental reinforcement learning according to an embodiment of the present application.
As shown in fig. 4, the reinforcement learning structure includes two parts, namely the agent and the environment. The environment receives an action (A) of the agent, determines the current state (S) and the reward (R), and transmits them to the agent, so that the agent can search for the target based on the received current state and reward.
By applying the strategy optimization algorithm aiming at the target search task in the incremental reinforcement learning target search method provided by the embodiment of the application, the strategy of the intelligent agent can be dynamically adjusted, and a better search effect is realized.
Step S106: under the condition that the unmanned aerial vehicle executes a second search task, in response to the result of the second search task's evaluation of the reinforcement learning training model being that the model needs to be retrained, taking the reinforcement learning training model as a first network branch and the model generated by training the model parameters corresponding to the second search task as a second network branch, designing the network structure, and generating a new task adaptation model based on incremental learning.
In practical applications, when the evaluation result of the second search task on the reinforcement learning training model is that the model needs to be retrained, the environment data of the second task are likewise gridded, and the reward function is designed and the reinforcement learning training model is generated in the same way as described above for the environment data network, which is not described in detail again here.
Specifically, in the network design process, the design of a network structure mainly focuses on the learning of a new task in the training process, and adaptation and optimization of the new task are realized by utilizing a shared partial network structure and a newly added module. The network structure designed by the embodiment of the application comprises two branches, namely a first network branch TaskA and a second network branch Task B. The Task A branch keeps the original network structure and model parameters and is used for processing learning and decision of the original Task. The branch mainly focuses on the optimization of the original task in the training process, and the performance of the model on the original task is maintained. The Task B branch has the same structure as the Task A branch network, and the model parameters are retrained and are used for adapting to the characteristics and requirements of new tasks.
Referring to fig. 5, fig. 5 is a schematic diagram of an incremental learning network connection mechanism in a target search method based on incremental reinforcement learning according to an embodiment of the present application.
As shown in fig. 5, the incremental learning network connection mechanism includes an input layer and two task branches, Task A and Task B. Task A includes: output 1, network layer 2 and network layer 1; Task B includes: output 2, network layer 2 and network layer 1. In addition, the output of network layer 2 of Task A is connected, across branches, to output 2 of Task B, and the output of network layer 1 of Task A is connected to network layer 2 of Task B.
In the new task adaptation model, the first network branch and the second network branch adopt cross-layer connections, so that information sharing and transfer between Task A and Task B are realized within the network. The weights corresponding to the first network branch and the second network branch are dynamically adjusted, and task balance and optimization are realized according to the importance of the tasks.
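For illustration only, the two-branch structure with cross-layer connections and dynamically adjustable branch weights can be sketched as follows (here with PyTorch); the layer types, layer widths and the sigmoid-weighted combination of the branch outputs are assumptions of this sketch and are not prescribed by the application.

```python
import torch
import torch.nn as nn

class IncrementalSearchNet(nn.Module):
    """Sketch of the Task A / Task B two-branch structure with cross-layer connections.

    The Task A branch keeps the parameters trained on the first search task and is
    frozen; the Task B branch is trained on the second task and additionally receives
    the hidden activations of the Task A branch. Layer types and widths are
    illustrative assumptions.
    """
    def __init__(self, state_dim=2, hidden=64, n_actions=4):
        super().__init__()
        # Task A branch: original structure and parameters, frozen.
        self.a_layer1 = nn.Linear(state_dim, hidden)
        self.a_layer2 = nn.Linear(hidden, hidden)
        self.a_out = nn.Linear(hidden, n_actions)
        for p in list(self.a_layer1.parameters()) + list(self.a_layer2.parameters()) \
                + list(self.a_out.parameters()):
            p.requires_grad = False
        # Task B branch: same structure, retrained; layer 2 and the output also
        # receive the corresponding Task A activations (cross-layer connections).
        self.b_layer1 = nn.Linear(state_dim, hidden)
        self.b_layer2 = nn.Linear(2 * hidden, hidden)
        self.b_out = nn.Linear(2 * hidden, n_actions)
        # Dynamically adjustable weight for balancing the two branches.
        self.branch_weight = nn.Parameter(torch.tensor(0.5))

    def forward(self, x):
        h1_a = torch.relu(self.a_layer1(x))
        h2_a = torch.relu(self.a_layer2(h1_a))
        out_a = self.a_out(h2_a)
        h1_b = torch.relu(self.b_layer1(x))
        h2_b = torch.relu(self.b_layer2(torch.cat([h1_b, h1_a], dim=-1)))
        out_b = self.b_out(torch.cat([h2_b, h2_a], dim=-1))
        w = torch.sigmoid(self.branch_weight)
        return (1 - w) * out_a + w * out_b
```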
During training, starting from the initial state s0 in a new task, selecting an action a according to the current state, and observing rewards and the next state s' after executing the action. This process is repeated until convergence or a predetermined number of training rounds is reached.
(3) Experimental verification: model performance is compared in experiments with and without the modified network, and the effect of the unmanned aerial vehicle searching for the target with the model is tested in both the new and the old scenes.
Referring to fig. 6, fig. 6 is a schematic diagram of a new task adaptation flow of an incremental learning training model in an incremental reinforcement learning-based target search method according to an embodiment of the present application.
As shown in fig. 6, after task 2 is started, a target search behavior is first determined, then model 1 is adapted to execute the target search behavior, and the search effect is then studied and judged. If the search efficiency is high, no retraining is needed, and it is further judged whether the detected object is the target: if so, the target object position is output; if not, task 2 is executed again. If the search efficiency is poor, retraining is needed, and the training process of the task 2 model is as follows: first, the task 2 space is gridded; then data are collected; next, the data are processed; then a new network module is added and the reinforcement learning model is trained; finally, model 2 is output. The target searching process then comprises: starting task 2; determining the target search behavior based on model 2; performing the target search behavior with model 2; and judging whether the detected object is the target: if so, the target object position is output; if not, task 2 is executed again.
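For illustration only, the adaptation decision shown in fig. 6 can be sketched in Python as follows; the efficiency metric, the threshold value and the caller-supplied helpers are assumptions of this sketch.

```python
def adapt_to_new_task(model_1, task2_env, evaluate, build_incremental_model, train,
                      efficiency_threshold=0.8):
    """Sketch of the new-task adaptation decision shown in fig. 6.

    `evaluate`, `build_incremental_model` and `train` are caller-supplied callables
    standing in for the evaluation, network-extension and training steps described
    above; the efficiency metric and threshold are assumptions of this sketch.
    """
    efficiency = evaluate(model_1, task2_env)       # study and judge the search effect
    if efficiency >= efficiency_threshold:
        return model_1                              # model 1 already handles task 2 well
    # Poor efficiency: grid the task-2 space, collect and process data, add the new
    # network branch, and retrain to obtain model 2.
    model_2 = build_incremental_model(model_1)
    train(model_2, task2_env)
    return model_2
```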
By applying the incremental reinforcement learning target search method provided by the embodiment of the application, the incremental learning algorithm is combined with a memory module, so that the agent can learn a new task while retaining learned knowledge, effectively solving the "catastrophic forgetting" problem of traditional reinforcement learning methods. In addition, the embodiment of the application also introduces a performance evaluation and adjustment mechanism for monitoring the performance of the agent in different tasks and adjusting and optimizing the strategy according to the feedback information; this mechanism improves the adaptability and robustness of the agent.
Step S108: and searching the target based on the new task adaptation model.
The incremental reinforcement learning target searching method provided by the embodiment of the application aims to solve the problem that traditional target searching strategies are inefficient when facing various complex environments. By introducing the incremental reinforcement learning method, the agent can dynamically learn and optimize its strategy during the search task, thereby improving search efficiency; this helps the user find the optimal solution in a complex environment more quickly and saves time and resources. Secondly, the method aims to improve its adaptability so that it can better adapt to a new search environment. The traditional reinforcement learning method may suffer from "catastrophic forgetting" when facing new tasks: learned knowledge is forgotten, which affects performance. Through the design of the incremental learning algorithm, the agent can learn a new task while retaining existing knowledge, which improves adaptability. This advantage helps users cope with changing environments and maintains the stability and robustness of the search method.
Corresponding to the above method embodiment, the present disclosure further provides an embodiment of a target search device based on incremental reinforcement learning, and fig. 7 is a schematic structural diagram of a target search device based on incremental reinforcement learning according to an embodiment of the present disclosure. As shown in fig. 7, the apparatus includes:
The processing module 702 is configured to perform gridding processing on a target space where the unmanned aerial vehicle executes the first search task, and generate an environment data network;
A first model generating module 704, configured to design a reward function based on a current state, an action taken by the current state and a next state corresponding to the unmanned aerial vehicle in the environmental data network, and generate a reinforcement learning training model based on the reward function and a preset search strategy, wherein the search strategy is an optimal search strategy corresponding to the defined unmanned aerial vehicle when searching for a target in a target space;
The second model generating module 706 is configured to, in a case where the unmanned aerial vehicle performs a second search task, retrain the model in response to an evaluation result of the second search task for evaluating the reinforcement learning training model, and perform design of a network structure by taking the reinforcement learning training model as a first network branch and a model generated by training model parameters corresponding to the second search task as a second network branch, so as to generate a new task adaptation model based on incremental learning;
a search module 708 configured to search for targets based on the new task adaptation model.
Further, the processing module 702 is further configured to:
Dividing the target space into grid cells with the same specification based on a preset grid cell specification, wherein the grid cells are corresponding to unique identifiers, and when the unmanned aerial vehicle moves to each grid cell, the unmanned aerial vehicle respectively corresponds to a current state of the unmanned aerial vehicle;
Designing an identification symbol corresponding to a moving position of the unmanned aerial vehicle in a target space, wherein the moving position comprises: upward movement, downward movement, leftward movement, and rightward movement;
an environmental data network is generated based on the grid cells and the identification symbol.
Further, the first model generation module 704 is further configured to:
Designing a reward function based on the current state corresponding to the unmanned aerial vehicle, the action taken by the current state and the next state;
Defining a search strategy of the unmanned aerial vehicle for searching targets in a target space by using a value function;
starting from the initial state, determining actions taken by the current state based on the current state of the unmanned aerial vehicle, and observing rewards and the next state after the unmanned aerial vehicle executes the actions until the reinforcement learning training model converges;
And testing the reinforcement learning training model and generating a test result, wherein the test result is the effect of searching the target in the scene by the unmanned aerial vehicle.
Further, the first model generation module 704 is further configured to:
Determining that target searching is successful in response to the unmanned aerial vehicle moving to a first preset range of a cell where a distance searching target is located, and setting rewards as first positive values, wherein the first preset range is a range where the unmanned aerial vehicle moves to a cell where the distance searching target is located;
And responding to the unmanned aerial vehicle moving to a second preset range of a cell where the distance searching target is, determining that the target searching fails, and setting the rewards to be negative, wherein the second preset range is a range where the unmanned aerial vehicle moves to a position outside one cell of the distance searching target, and the first preset range and the second preset range are disjoint spaces.
Further, the first model generation module 704 is further configured to:
Setting a reward to a first negative value in response to the unmanned aerial vehicle moving into a first spatial range within the second preset range, and setting the reward to a second negative value based on the number of moving steps of the unmanned aerial vehicle in response to the unmanned aerial vehicle moving within the first spatial range, wherein the first spatial range comprises at least one grid cell within the second preset range, and the second negative value is determined based on the number of moving steps of the unmanned aerial vehicle within the first spatial range;
And setting the reward to a first negative value in response to the drone moving to a spatial boundary of the target space.
Further, the first model generation module 704 is further configured to:
and setting the reward to be a second positive value in response to the unmanned aerial vehicle moving from the first spatial range to the second spatial range of the second preset range.
Further, the second model generation module 706 is further configured to:
In the new task adaptation model, the first network branch and the second network branch are connected in a cross-layer mode, and the weight corresponding to the first network branch and the weight corresponding to the second network branch are dynamically adjusted to optimize the new task adaptation model.
The incremental reinforcement learning target searching device provided by the embodiment of the application aims to solve the problem that traditional target searching strategies are inefficient when facing various complex environments. By introducing the incremental reinforcement learning method, the agent can dynamically learn and optimize its strategy during the search task, thereby improving search efficiency; this helps the user find the optimal solution in a complex environment more quickly and saves time and resources. Secondly, the device aims to improve its adaptability so that it can better adapt to a new search environment. The traditional reinforcement learning method may suffer from "catastrophic forgetting" when facing new tasks: learned knowledge is forgotten, which affects performance. Through the design of the incremental learning algorithm, the agent can learn a new task while retaining existing knowledge, which improves adaptability. This advantage helps users cope with changing environments and maintains the stability and robustness of the search method.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for the object searching apparatus of incremental reinforcement learning, since it is substantially similar to the object searching method embodiment based on incremental reinforcement learning, the description is relatively simple, and the relevant points are only referred to the partial explanation of the object searching method embodiment based on incremental reinforcement learning.
FIG. 8 is a block diagram of a computing device provided by an embodiment of the application. The components of computing device 800 include, but are not limited to, memory 810 and processor 820. Processor 820 is coupled to memory 810 through bus 830 and database 850 is used to hold data.
Computing device 800 also includes access device 840, access device 840 enabling computing device 800 to communicate via one or more networks 860. Examples of such networks include public switched telephone networks (PSTN, public Switched Telephone Network), local area networks (LAN, local Area Network), wide area networks (WAN, wide Area Network), personal area networks (PAN, personal Area Network), or combinations of communication networks such as the internet. The access device 840 may include one or more of any type of network interface, wired or wireless, such as a network interface card (NIC, network interface controller), such as an IEEE802.11 wireless local area network (WLAN, wireless Local Area Network) wireless interface, a worldwide interoperability for microwave access (Wi-MAX, worldwide Interoperability for Microwave Access) interface, an ethernet interface, a universal serial bus (USB, universal Serial Bus) interface, a cellular network interface, a bluetooth interface, near Field Communication (NFC).
In one embodiment of the present description, the above-described components of computing device 800, as well as other components not shown in FIG. 8, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device illustrated in FIG. 8 is for exemplary purposes only and is not intended to limit the scope of the present description. Those skilled in the art may add or replace other components as desired.
Computing device 800 may be any type of stationary or mobile computing device including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smart phone), wearable computing device (e.g., smart watch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or personal computer (PC, personal Computer). Computing device 800 may also be a mobile or stationary server.
Wherein the processor 820 is configured to execute computer-executable instructions that, when executed by the processor, perform the steps of the incremental reinforcement learning-based target search method described above.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for computing device embodiments, since they are substantially similar to incremental reinforcement learning-based target search method embodiments, the description is relatively simple, with reference to a partial description of incremental reinforcement learning-based target search method embodiments.
An embodiment of the present disclosure also provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the incremental reinforcement learning-based target search method described above.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for computer readable storage medium embodiments, since they are substantially similar to the incremental reinforcement learning based target search method embodiments, the description is relatively simple, with reference to the partial description of the incremental reinforcement learning based target search method embodiments.
An embodiment of the present disclosure further provides a computer program, where the computer program, when executed in a computer, causes the computer to perform the steps of the above-described target search method based on incremental reinforcement learning.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for the computer program embodiment, since it is substantially similar to the incremental reinforcement learning-based target search method embodiment, the description is relatively simple, and reference is made to a partial explanation of the incremental reinforcement learning-based target search method embodiment.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer readable medium may be appropriately adjusted according to the requirements of legislation and patent practice in the relevant jurisdictions; for example, in certain jurisdictions, according to legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunication signals.
It should be noted that the foregoing describes specific embodiments of the present invention. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous. Further, those skilled in the art will appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily all required for the embodiments described in the specification.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are merely used to help clarify the present specification. Alternative embodiments are not intended to be exhaustive or to limit the invention to the precise form disclosed. Obviously, many modifications and variations are possible in light of the teaching of the embodiments. The embodiments were chosen and described in order to best explain the principles of the embodiments and the practical application, to thereby enable others skilled in the art to best understand and utilize the invention. This specification is to be limited only by the claims and the full scope and equivalents thereof.
Claims (10)
1. The target searching method based on incremental reinforcement learning is characterized by comprising the following steps of:
Performing gridding processing on a target space where the unmanned aerial vehicle executes a first search task to generate an environment data network;
Designing a reward function based on a current state, actions taken by the current state and a next state corresponding to the unmanned aerial vehicle in the environment data network, and generating a reinforcement learning training model based on the reward function and a preset search strategy, wherein the search strategy is an optimal search strategy corresponding to the defined unmanned aerial vehicle when searching for a target in a target space;
Under the condition that the unmanned aerial vehicle executes a second search task, in response to the result of the second search task's evaluation of the reinforcement learning training model being that the model needs to be retrained, taking the reinforcement learning training model as a first network branch and a model generated by training the model parameters corresponding to the second search task as a second network branch, performing network structure design, and generating a new task adaptation model based on incremental learning;
And searching the target based on the new task adaptation model.
2. The method of claim 1, wherein the meshing the target space in which the first search task is performed by the unmanned aerial vehicle to generate the environmental data network comprises:
Dividing the target space into grid cells with the same specification based on a preset grid cell specification, wherein the grid cells are corresponding to unique identifiers, and when the unmanned aerial vehicle moves to each grid cell, the unmanned aerial vehicle respectively corresponds to a current state of the unmanned aerial vehicle;
Designing an identification symbol corresponding to a moving position of the unmanned aerial vehicle in a target space, wherein the moving position comprises: upward movement, downward movement, leftward movement, and rightward movement;
an environmental data network is generated based on the grid cells and the identification symbol.
3. The method of claim 1, wherein designing a reward function based on a current state corresponding to the unmanned aerial vehicle in the environmental data network, an action taken by the current state, and a next state, and generating a reinforcement learning training model based on the reward function and a preset search strategy, comprises:
Designing a reward function based on the current state corresponding to the unmanned aerial vehicle, the action taken by the current state and the next state;
Defining a search strategy of the unmanned aerial vehicle for searching targets in a target space by using a value function;
starting from the initial state, determining actions taken by the current state based on the current state of the unmanned aerial vehicle, and observing rewards and the next state after the unmanned aerial vehicle executes the actions until the reinforcement learning training model converges;
And testing the reinforcement learning training model and generating a test result, wherein the test result is the effect of searching the target in the scene by the unmanned aerial vehicle.
4. A method according to claim 3, wherein designing a reward function based on the current state of the drone, the action taken by the current state, and the next state, comprises:
Determining that target searching is successful in response to the unmanned aerial vehicle moving to a first preset range of a cell where a distance searching target is located, and setting rewards as first positive values, wherein the first preset range is a range where the unmanned aerial vehicle moves to a cell where the distance searching target is located;
And responding to the unmanned aerial vehicle moving to a second preset range of a cell where the distance searching target is, determining that the target searching fails, and setting the rewards to be negative, wherein the second preset range is a range where the unmanned aerial vehicle moves to a position outside one cell of the distance searching target, and the first preset range and the second preset range are disjoint spaces.
5. The method of claim 4, wherein the determining that the target search failed and setting the reward to a negative value in response to the drone moving within a second predetermined range from the cell in which the search target is located comprises:
Setting a reward to a first negative value in response to the unmanned aerial vehicle moving into a first spatial range within the second preset range, and setting the reward to a second negative value based on the number of moving steps of the unmanned aerial vehicle in response to the unmanned aerial vehicle moving within the first spatial range, wherein the first spatial range comprises at least one grid cell within the second preset range, and the second negative value is determined based on the number of moving steps of the unmanned aerial vehicle within the first spatial range;
And setting the reward to a first negative value in response to the drone moving to a spatial boundary of the target space.
6. The method of claim 4, wherein designing a reward function based on the current state of the drone, the action taken by the current state, and the next state, further comprises:
and setting the reward to be a second positive value in response to the unmanned aerial vehicle moving from the first spatial range to the second spatial range of the second preset range.
7. The method according to claim 1, wherein in the new task adaptation model, the first network branch and the second network branch are connected in a cross-layer manner, and optimization of the new task adaptation model is achieved by dynamically adjusting weights corresponding to the first network branch and the second network branch.
8. A target search device based on incremental reinforcement learning, comprising:
the processing module is configured to grid the target space where the unmanned aerial vehicle executes the first search task, and generate an environment data network;
The first model generation module is configured to design a reward function based on a current state, an action taken by the current state and a next state corresponding to the unmanned aerial vehicle in the environment data network, and generate a reinforcement learning training model based on the reward function and a preset search strategy, wherein the search strategy is an optimal search strategy corresponding to the unmanned aerial vehicle when the unmanned aerial vehicle searches for a target in a target space;
The second model generating module is configured to retrain the model in response to the evaluation result of the evaluation of the reinforcement learning training model by the second search task under the condition that the unmanned aerial vehicle executes the second search task, and then, the reinforcement learning training model is taken as a first network branch, the model generated by training the model parameters corresponding to the second search task is taken as a second network branch, the design of a network structure is carried out, and a new task adaptation model based on incremental learning is generated;
And the searching module is configured to search the target based on the new task adaptation model.
9. A computer readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, implements the steps of the method according to any of claims 1 to 7.
10. A computer device, the computer device comprising: memory, a processor and a computer program stored on the memory and executable on the processor, which when executed by the processor, performs the steps of the method according to any one of claims 1 to 7.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410682100.0A CN118485134A (en) | 2024-05-29 | 2024-05-29 | Target searching method and device based on incremental reinforcement learning |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410682100.0A CN118485134A (en) | 2024-05-29 | 2024-05-29 | Target searching method and device based on incremental reinforcement learning |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN118485134A true CN118485134A (en) | 2024-08-13 |
Family
ID=92187670
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202410682100.0A Pending CN118485134A (en) | 2024-05-29 | 2024-05-29 | Target searching method and device based on incremental reinforcement learning |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN118485134A (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119165873A (en) * | 2024-09-13 | 2024-12-20 | 中山大学 | Two-stage target search and tracking method for UAV based on deep reinforcement learning |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |