Disclosure of Invention
In order to solve the technical problems, the invention provides an intelligent task dividing method for collaborative operation of unmanned sanitation vehicles.
In order to achieve the above purpose, the technical scheme of the invention is as follows:
in a first aspect, the invention discloses an intelligent task dividing method for unmanned sanitation vehicle collaborative operation, which comprises the following steps:
step S1: clustering all task road sections by using a clustering algorithm, wherein the clustering quantity is the preset subtask quantity, and an initial clustering center is obtained;
Step S2: initializing a policy function table and a value function table;
the probability that each task section selects each subtask is stored in the strategy function table;
the value function table stores the value function of each task section, wherein the value function is the expected return which can be obtained when the corresponding subtask is selected;
step S3: selecting a corresponding subtask for each task section according to the current strategy function table;
step S4: calculating a reward value according to the selection result by using a reward function, and evaluating the expected return of the selection result;
The reward function relates to one or more factors of the distance from the task section to the corresponding clustering center, the optimal path cost of the subtasks, the shortest distance from the task section to the replenishment station and the distance from the task section to the warehouse;
step S5: updating the value function and the strategy function of the current task section according to the reward value and the value function of the next task section, based on the updating rule of the Actor-Critic;
step S6: and repeatedly executing the steps S3-S5, and updating the strategy function table and the value function table until the optimal subtask is allocated to each task section.
On the basis of the technical scheme, the following improvement can be made:
as a preferred solution, a greedy algorithm is used in the reward function to calculate the optimal path cost of the subtask, including the following steps:
step A: acquiring all task road sections corresponding to the subtasks and the connection relation between the task road sections, and regarding each task road section as a node;
step B: initializing a Boolean list for marking whether each node has been visited;
initializing the path length to 0, and setting the starting node to node 0;
step C: starting from the current node, searching for the next unvisited node with the shortest distance from the current node, and marking the current node as visited;
traversing all unvisited nodes, calculating the distance from the current node to each unvisited node, and updating the minimum distance and the next node, iterating until the shortest path is found;
step D: updating the path length by adding the minimum distance to it.
As a preferable scheme, the Euclidean distance is adopted to calculate the distance from the task road section to the corresponding clustering center.
Preferably, the calculation formula of the reward function R is as follows:
R = -Σ_{i=1}^{n} dis(x_i, c_j) - p - Σ_{i=1}^{n} d_i^sup - Σ_{i=1}^{n} d_i^dep
wherein x_i is the coordinate or feature vector of the i-th task section;
c_j is the coordinate of the j-th cluster center;
p is the optimal path length of the subtask calculated by the greedy algorithm;
n is the number of task sections of the current subtask;
d_i^sup is the shortest distance from the i-th task section to the replenishment station;
d_i^dep is the distance from the i-th task section to the warehouse;
dis(a, b) is the Euclidean distance between a and b.
As a preferred embodiment, in step S5, the value function is updated by the following formula;
V(s)←V(s)+α*(r+γ*V(s′)-V(s))
wherein V(s) is the value function of the current task section;
V(s′) is the value function of the next task section;
r is the reward value;
α is the learning rate;
γ is the discount factor;
the policy function table is updated by the following formula:
π(a|s)←π(a|s)+α*(r+γ*V(s′)-V(s))
where π(a|s) is the probability of selecting the a-th subtask in the s-th task section.
As a preferred approach, the probabilities in the policy function table are processed using softmax functions.
In a second aspect, the invention also discloses an intelligent task dividing device for unmanned sanitation truck collaborative operation, which comprises:
the clustering module is used for clustering all task road sections by using a clustering algorithm, wherein the clustering quantity is the preset subtask quantity, and an initial clustering center is obtained;
The initialization module is used for initializing the strategy function table and the value function table;
the probability that each task section selects each subtask is stored in the strategy function table;
the value function table stores the value function of each task section, wherein the value function is the expected return which can be obtained when the corresponding subtask is selected;
The selecting module is used for selecting a corresponding subtask for each task section according to the current strategy function table;
The reward value calculation module is used for calculating a reward value by using a reward function according to the selection result, and evaluating the expected return of the selection result;
The reward function relates to one or more factors of the distance from the task section to the corresponding clustering center, the optimal path cost of the subtasks, the shortest distance from the task section to the replenishment station and the distance from the task section to the warehouse;
The updating module is used for updating the value function and the strategy function of the current task section according to the reward value and the value function of the next task section, based on the updating rule of the Actor-Critic;
and the repeating module is used for repeatedly executing the methods in the selecting module, the reward value calculation module and the updating module, and updating the strategy function table and the value function table until the optimal subtask is allocated to each task section.
As a preferred solution, calculating, by an optimal path cost calculation unit, an optimal path cost of a subtask in a reward function by adopting a greedy algorithm, including:
The acquisition unit is used for acquiring all task road sections corresponding to the subtasks and the connection relation among the task road sections, and treating each task road section as a node;
the greedy algorithm initializing unit is used for initializing a Boolean list for marking whether each node has been visited;
initializing the path length to 0, and setting the starting node to node 0;
the iteration searching unit is used for searching, starting from the current node, for the next unvisited node with the shortest distance from the current node, and marking the current node as visited;
traversing all unvisited nodes, calculating the distance from the current node to each unvisited node, and updating the minimum distance and the next node, iterating until the shortest path is found;
and a path updating unit for updating the path length by adding the minimum distance to it.
As a preferable scheme, the Euclidean distance is adopted to calculate the distance from the task road section to the corresponding clustering center.
Preferably, the calculation formula of the reward function R is as follows:
R = -Σ_{i=1}^{n} dis(x_i, c_j) - p - Σ_{i=1}^{n} d_i^sup - Σ_{i=1}^{n} d_i^dep
wherein x_i is the coordinate or feature vector of the i-th task section;
c_j is the coordinate of the j-th cluster center;
p is the optimal path length of the subtask calculated by the greedy algorithm;
n is the number of task sections of the current subtask;
d_i^sup is the shortest distance from the i-th task section to the replenishment station;
d_i^dep is the distance from the i-th task section to the warehouse;
dis(a, b) is the Euclidean distance between a and b.
As a preferred embodiment, in step S5, the value function is updated by the following formula;
V(s)←V(s)+α*(r+γ*V(s′)-V(s))
wherein V(s) is the value function of the current task section;
V(s′) is the value function of the next task section;
r is the reward value;
α is the learning rate;
γ is the discount factor;
the policy function table is updated by the following formula:
π(a|s)←π(a|s)+α*(r+γ*V(s′)-V(s))
where π(a|s) is the probability of selecting the a-th subtask in the s-th task section.
As a preferred approach, the probabilities in the policy function table are processed using softmax functions.
In a third aspect, the present invention also discloses a storage medium, where one or more computer readable programs are stored, where the one or more programs include instructions, where the instructions are adapted to be loaded by a memory and execute any of the above-mentioned intelligent task partitioning methods for unmanned sanitation truck collaborative operations.
The invention discloses an intelligent task dividing method for unmanned sanitation vehicle collaborative operation, which has the following beneficial effects:
The invention divides the task section into a plurality of subtasks, and optimizes the division mode of each subtask by using an Actor-Critic reinforcement learning algorithm, thereby realizing the optimization of the division of the whole task section. Meanwhile, the distance from the task road section to the corresponding clustering center, the optimal path cost of the subtasks, the shortest distance from the task road section to the replenishment station, the distance from the task road section to the warehouse and other factors are comprehensively considered.
According to the invention, through using the Actor-Critic reinforcement learning model and combining greedy algorithm, clustering algorithm and the like, the tasks of the urban sanitation vehicle are effectively divided, so that the problems of route planning and task allocation when the urban sanitation vehicle executes tasks such as cleaning and garbage collection can be solved, the working efficiency is improved, the energy consumption is reduced, and the urban sanitation service is optimized.
Detailed Description
Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The expression "comprising" an element is an "open" expression which merely means that there is a corresponding component or step and should not be interpreted as excluding the inclusion of additional components or steps.
In order to achieve the purpose of the present invention, in some embodiments of a task intelligent dividing method for collaborative operations of unmanned sanitation vehicles, as shown in fig. 1, the task intelligent dividing method includes:
step S1: clustering all task road sections by using a clustering algorithm, wherein the clustering quantity is the preset subtask quantity, and an initial clustering center is obtained;
Step S2: initializing a policy function table and a value function table;
the probability that each task section selects each subtask is stored in the strategy function table;
the value function table stores the value function of each task section, wherein the value function is the expected return which can be obtained when the corresponding subtask is selected;
step S3: selecting a corresponding subtask for each task section according to the current strategy function table;
step S4: calculating a reward value according to the selection result by using a reward function, and evaluating the expected return of the selection result;
The reward function relates to one or more factors of the distance from the task section to the corresponding clustering center, the optimal path cost of the subtasks, the shortest distance from the task section to the replenishment station and the distance from the task section to the warehouse;
step S5: updating the value function and the strategy function of the current task section according to the reward value and the value function of the next task section, based on the updating rule of the Actor-Critic;
step S6: and repeatedly executing the steps S3-S5, and updating the strategy function table and the value function table until the optimal subtask is allocated to each task section.
The present invention will be described in detail below.
Step S1: clustering all task road sections by using a K-means clustering algorithm, wherein the number of clusters is the preset number of subtasks, and initial cluster centers are obtained.
K-means is a common clustering algorithm used to divide samples in a dataset into a preset number of different groups or clusters of subtasks. The basic principle is that the center point of the cluster is found through iterative optimization, so that the sum of distances from the sample point to the center point of the cluster to which the sample point belongs is minimized.
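As an illustrative (non-limiting) sketch, the K-means clustering of step S1 may be written in Python as follows, assuming each task road section is represented by a two-dimensional coordinate (for example, the midpoint of the section) and using the KMeans class of the scikit-learn library; the function and variable names are merely illustrative.

import numpy as np
from sklearn.cluster import KMeans

def cluster_task_sections(section_coords, num_subtasks, seed=0):
    # Cluster the task road sections into the preset number of subtasks and
    # return the initial cluster centers and the cluster label of each section.
    coords = np.asarray(section_coords, dtype=float)    # shape: (n_sections, 2)
    km = KMeans(n_clusters=num_subtasks, n_init=10, random_state=seed)
    labels = km.fit_predict(coords)
    return km.cluster_centers_, labels

# Example: 6 task sections divided into 2 subtasks.
centers, labels = cluster_task_sections(
    [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)], num_subtasks=2)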
In recent years, with the development of artificial intelligence and machine learning techniques, reinforcement learning has been widely applied to various optimization problems as a method capable of learning an optimal strategy through interaction with the environment. In particular, the Actor-Critic model has shown excellent performance in many problems as a reinforcement learning algorithm that combines both methods of value iteration and strategy iteration.
However, how to apply reinforcement learning, especially the Actor-Critic model, to the task partitioning problem of urban sanitation vehicles remains a challenging problem.
The invention uses reinforcement learning to update the task division strategy according to the reward value of each task section (namely, state), thereby finding an optimal task division scheme.
The invention uses an Actor-Critic reinforcement learning model to select corresponding subtasks (i.e., actions), an update value function table and a policy function table. The Actor-Critic model combines the advantages of the two methods of value iteration and strategy iteration, and can effectively balance exploration and utilization, so that the task division effect is improved.
As shown in FIG. 2, the Actor-Critic model is a reinforcement learning algorithm that combines two methods, value iteration (Critic) and strategy iteration (Actor). In the Actor-Critic model, "Actor" and "Critic" are two major components, each of which has different responsibilities: the role of the Actor is to select an action according to the current policy. Policies are generally expressed as the probability of each possible action being performed in a given state. In each step, the Actor selects an action according to the current policy. Critic's responsibility is to evaluate the action selected by the Actor. It calculates the expected return for each action and provides feedback to the Actor. This feedback is used to update the policy of the Actor so that the probability of actions that expect a higher return being selected in the future is greater.
The main steps of the Actor-Critic model are mainly embodied in the steps S2-S6.
Step S2: initializing a policy function table and a value function table;
the probability that each task section selects each subtask is stored in the strategy function table;
The value function table stores the value function of each task section, wherein the value function is the expected return which can be obtained when the corresponding subtask is selected.
Step S3: and selecting a corresponding subtask for each task section according to the current strategy function table, wherein the step is completed by an Actor.
Step S4: calculating a reward value according to the selection result by using a reward function, and evaluating the expected return of the selection result, wherein the step is completed by Critic;
The reward function relates to one or more factors of the distance of the task segment to the corresponding cluster center, the optimal path cost of the subtask, the shortest distance of the task segment to the replenishment station, and the distance of the task segment to the warehouse.
Step S5: based on the updating rule of the Actor-Critic and the feedback of the Critic, updating the value function and the strategy function of the current task section according to the reward value and the value function of the next task section.
Step S6: and repeatedly executing the steps S3-S5, and updating the strategy function table and the value function table until the optimal subtask is allocated to each task section.
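For illustration only, the overall S2-S6 loop may be sketched in Python as follows. The reward function here is a toy placeholder (the actual reward terms are described later in this text), the next state is assumed to be the next task section in index order, and all names are illustrative; the individual components are sketched in more detail below.

import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))          # subtract max for numerical stability
    return e / e.sum()

def train(num_sections, num_subtasks, reward_fn, episodes=200, alpha=0.1, gamma=0.9):
    # Step S2: uniform policy table and zero value table.
    policy = np.full((num_sections, num_subtasks), 1.0 / num_subtasks)
    value = np.zeros(num_sections)
    for _ in range(episodes):                                   # step S6: repeat S3-S5
        # Step S3: sample a subtask for every task section.
        assignment = [np.random.choice(num_subtasks, p=softmax(policy[s]))
                      for s in range(num_sections)]
        for s in range(num_sections):
            a = assignment[s]
            r = reward_fn(s, a, assignment)                     # step S4: reward
            s_next = (s + 1) % num_sections                     # assumed next state
            td_error = r + gamma * value[s_next] - value[s]     # step S5: TD error
            value[s] += alpha * td_error
            policy[s, a] += alpha * td_error
    return policy, value

# Toy example: the reward prefers subtask id equal to (section index mod 2).
policy, value = train(num_sections=6, num_subtasks=2,
                      reward_fn=lambda s, a, assignment: 1.0 if a == s % 2 else -1.0)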
In particular, it is desirable to increase the probability that an action with a higher expected return is selected and decrease the probability that an action with a lower expected return is selected. At the same time, the present invention updates the value function to more accurately estimate the expected return for each state.
In the Actor-Critic model, the present invention maintains two tables: a value function table and a policy function table. The value function table is used to store a value function for each state that indicates the expected return that can be obtained in that state. The policy function table is used to store the probability of performing each possible action in each state.
With respect to the policy function table. The policy function table is a two-dimensional array in which each row corresponds to a state and each column corresponds to an action. Each element of the policy function table represents the probability of performing a given action in a given state. For example, assuming that there are 3 states and 2 actions, the policy function table may be as follows:
[[0.7,0.3],
[0.4,0.6],
[0.5,0.5]]
This means that the probability of performing action 1 in state 1 is 0.7 and the probability of performing action 2 is 0.3; the probability of performing action 1 in state 2 is 0.4, and the probability of performing action 2 is 0.6; the probability of performing both action 1 and action 2 in state 3 is 0.5.
With respect to the value function table. The value function table is a one-dimensional array in which each element corresponds to a state. Each element of the value function table represents the expected return that can be obtained in that state.
For example, assuming that there are 3 states, the value function table may be as follows:
[0.5,0.6,0.7]
This means that the expected return in state 1 is 0.5, the expected return in state 2 is 0.6, and the expected return in state 3 is 0.7.
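As an illustrative sketch, the two tables may be initialized as arrays in Python as follows (the sizes are arbitrary examples).

import numpy as np

num_sections = 3    # states: task road sections
num_subtasks = 2    # actions: subtask ids

# Policy function table: one row per state, uniform probabilities initially.
policy_table = np.full((num_sections, num_subtasks), 1.0 / num_subtasks)
# e.g. [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]

# Value function table: expected return of each state, initialized to zero.
value_table = np.zeros(num_sections)
# e.g. [0.0, 0.0, 0.0]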
In reinforcement learning, the goal is to find the optimal policy function table and value function table by learning so that the maximum cumulative return can be obtained by following the policy function table selection action from any state.
The state represents a task section, and the execution action represents a selection of a subtask.
The following is a detailed description of the update process of the policy function table and the value function table:
In each step, the value function table and the policy function table are updated according to the updating rule of the Actor-Critic. In the Actor-Critic model, critic's feedback is used to update the policy function table and the value function table. Specifically, the present invention uses the following update rules:
In the updating process of the value function table, a temporal difference error (Temporal Difference Error, abbreviated as TD Error) is calculated first. The TD error is the actual reward plus the discounted value function of the next state, minus the value function of the current state. This error is then multiplied by the learning rate and added to the value function of the current state to update the value function.
Specifically, in step S5, the value function is updated by the following formula;
V(s)←V(s)+α*(r+γ*V(s′)-V(s))
wherein V(s) is the value function of the current task section;
V(s′) is the value function of the next task section;
r is the reward value;
α is the learning rate;
γ is the discount factor.
In the updating process of the strategy function table, the method adopts the same time difference error as the updating of the value function table. Specifically, this error is multiplied by the learning rate and then added to the probability of performing the current action in the current state.
Specifically, in step S5, the policy function table is updated by the following formula;
π(a|s)←π(a|s)+α*(r+γ*V(s′)-V(s))
where π(a|s) is the probability of selecting the a-th subtask in the s-th task section.
Through the process, the strategy function and the value function can be gradually improved, so that the reinforcement learning algorithm can better solve the problem.
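For illustration, the two update formulas above may be sketched in Python as follows, assuming the tables are numpy arrays as in the earlier sketch; all names are illustrative.

import numpy as np

def actor_critic_update(value_table, policy_table, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # TD error: (reward + discounted next-state value) minus current-state value.
    td_error = r + gamma * value_table[s_next] - value_table[s]
    value_table[s] += alpha * td_error        # V(s) <- V(s) + alpha * TD error
    policy_table[s, a] += alpha * td_error    # pi(a|s) <- pi(a|s) + alpha * TD error
    return td_error

# Example with 3 task sections and 2 subtasks.
value_table = np.zeros(3)
policy_table = np.full((3, 2), 0.5)
actor_critic_update(value_table, policy_table, s=0, a=1, r=-2.0, s_next=1)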
The invention uses a combination of a softmax function and a random selection function for action selection.
The invention uses softmax functions to process probabilities in a policy function table. The softmax function may convert a set of values into a probability distribution such that the probabilities satisfy the properties of the probability distribution (i.e., the sum of all probabilities is 1, each probability is between 0 and 1).
The softmax function is formulated as:
P(j|x) = exp(x·w_j) / Σ_{k=1}^{K} exp(x·w_k)
where x is the input vector, w_j and w_k are weight vectors, and K is the total number of categories.
Further, the present invention employs a random selection algorithm (e.g., the numpy.random.choice function) to randomly select an action according to the probability distribution produced by the softmax function. The random selection algorithm randomly selects an element according to a given probability distribution, thereby ensuring that the probability of each action being selected is consistent with its probability in the policy function table. This approach provides an efficient way to balance exploration and exploitation, thereby achieving optimization in a complex decision-making environment.
The method combining the softmax function and the random selection function can ensure both the randomness and the quality of the strategy: actions with larger value functions are selected with higher probability, while actions with smaller value functions still have a chance of being selected. Thus, the method can balance exploration and exploitation to a certain extent and improve the reinforcement learning effect.
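As an illustrative sketch, the combination of the softmax function and random selection may be written in Python as follows; the function name is illustrative.

import numpy as np

def select_subtask(policy_row):
    # Convert one row of the policy table into a probability distribution
    # with softmax, then sample a subtask id from that distribution.
    prefs = np.asarray(policy_row, dtype=float)
    exp_prefs = np.exp(prefs - np.max(prefs))   # subtract max for stability
    probs = exp_prefs / exp_prefs.sum()         # sums to 1, each value in (0, 1)
    return int(np.random.choice(len(probs), p=probs))

# Example: the first row of the 3x2 policy table shown earlier.
action = select_subtask([0.7, 0.3])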
In reinforcement learning, the design of the reward function is very critical. The reward function of the present invention takes into account a number of factors, including: the distance from the task section to the corresponding cluster center, the optimal path cost of the subtasks, the shortest distance from the task section to the replenishment station, and the distance from the task section to the warehouse. The design of a reward function comprehensively considering a plurality of factors ensures that the task division scheme better meets actual requirements.
The goal of the greedy algorithm (greedy_shortest_path) is to calculate the optimal path cost for a given subtask. The greedy algorithm selects, at each step, the unvisited task section with the lowest path cost among the task sections of the subtask. The input of the function is the matrix corresponding to a subtask, and the output is the optimal path cost of that subtask.
The method specifically comprises the following steps:
step A: acquiring all task road sections corresponding to the subtasks and the connection relation between the task road sections, and regarding each task road section as a node;
step B: initializing a Boolean list for marking whether each node has been visited;
initializing the path length to 0, and setting the starting node to node 0;
step C: starting from the current node, searching for the next unvisited node with the shortest distance from the current node, and marking the current node as visited;
traversing all unvisited nodes, calculating the distance from the current node to each unvisited node, and updating the minimum distance and the next node, iterating until the shortest path is found;
step D: updating the path length by adding the minimum distance to it.
The core idea of the algorithm is to select the node with the shortest distance from the current node to the non-accessed node as the node to be accessed in the next step each time until all the nodes are accessed, thereby obtaining the shortest path length of the whole subtask.
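For illustration, a minimal Python sketch of the greedy routine is given below, assuming the subtask is supplied as a symmetric matrix of distances between its task sections; the names are illustrative.

def greedy_shortest_path(dist_matrix):
    n = len(dist_matrix)
    visited = [False] * n                     # step B: Boolean visited list
    current, path_length = 0, 0.0             # start at node 0, length 0
    visited[current] = True
    for _ in range(n - 1):                    # step C: pick nearest unvisited node
        best_dist, best_node = float("inf"), None
        for node in range(n):
            if not visited[node] and dist_matrix[current][node] < best_dist:
                best_dist, best_node = dist_matrix[current][node], node
        path_length += best_dist              # step D: add the minimum distance
        visited[best_node] = True
        current = best_node
    return path_length

# Example with 3 task sections: expected cost 2 + 6 = 8.
cost = greedy_shortest_path([[0, 2, 9], [2, 0, 6], [9, 6, 0]])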
Further, the Euclidean distance is adopted to calculate the distance from the task road section to the corresponding clustering center.
The main steps of Euclidean distance acquisition are as follows:
First, two coordinates are acquired from the input parameters.
Each coordinate is a tuple containing two elements (x and y coordinates).
Then, the difference between the two coordinates in the x-axis and the y-axis is calculated.
Next, the distance between the two coordinates is calculated using the euclidean distance formula.
The Euclidean distance formula is:
d = √((x1 - x2)² + (y1 - y2)²)
wherein (x1, y1) and (x2, y2) are the two coordinates.
And finally, returning the calculated distance.
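For illustration, the Euclidean distance calculation may be written in Python as follows.

import math

def euclidean_distance(p1, p2):
    # p1 and p2 are (x, y) coordinate tuples.
    (x1, y1), (x2, y2) = p1, p2
    return math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)

d = euclidean_distance((0, 0), (3, 4))   # returns 5.0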
The following is a detailed description of the reward calculation process:
In the present invention, the calculation of rewards encompasses a number of key factors including the distance of the task segment to the cluster center, the optimal path cost of the subtasks, the shortest distance of the task segment to the replenishment station, and the distance of the task segment to the warehouse.
First, regarding the distance of the task section to the cluster center. The distance is calculated by the Euclidean distance formula, and the goal is to minimize this distance. Thus, the resulting reward is the negative of this distance.
Second, the optimal path cost of the subtask. The optimal path cost of a subtask is computed over the task sections belonging to the same subtask using the greedy algorithm, and the goal is to minimize this length. Thus, the resulting reward is the negative of this length.
Third, the shortest distance of the task section to the replenishment station; the goal is to minimize this distance. Thus, the resulting reward is the negative of this distance.
Fourth, the distance of the task section to the warehouse; the goal is to make this distance as small as possible. Thus, the resulting reward is the negative of this distance.
It is worth emphasizing that these rewards are all negative, as the present invention aims to minimize these distances and lengths. Within the framework of reinforcement learning, a strategy is sought that maximizes the cumulative reward, so minimizing these distances and lengths is essentially equivalent to maximizing the cumulative reward.
Further, the calculation formula of the reward function R is as follows:
R = -Σ_{i=1}^{n} dis(x_i, c_j) - p - Σ_{i=1}^{n} d_i^sup - Σ_{i=1}^{n} d_i^dep
wherein x_i is the coordinate or feature vector of the i-th task section;
c_j is the coordinate of the j-th cluster center;
p is the optimal path length of the subtask calculated by the greedy algorithm;
n is the number of task sections of the current subtask;
d_i^sup is the shortest distance from the i-th task section to the replenishment station;
d_i^dep is the distance from the i-th task section to the warehouse;
dis(a, b) is the Euclidean distance, expressed as dis(a, b) = √((a_x - b_x)² + (a_y - b_y)²).
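As an illustrative sketch only, a reward of this additive form may be computed in Python as follows; the exact weighting used by the invention may differ, and all function and parameter names are illustrative.

import math

def euclidean(a, b):
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

def subtask_reward(section_coords, cluster_center, path_cost,
                   supply_distances, depot_distances):
    # section_coords: coordinates x_i of the n task sections of the subtask
    # cluster_center: c_j; path_cost: p from the greedy algorithm
    # supply_distances / depot_distances: d_i^sup and d_i^dep for each section
    r = -sum(euclidean(x, cluster_center) for x in section_coords)  # cluster term
    r -= path_cost                                                  # path cost term
    r -= sum(supply_distances)                                      # replenishment term
    r -= sum(depot_distances)                                       # warehouse term
    return r

# Example with two task sections.
r = subtask_reward([(0, 0), (1, 0)], cluster_center=(0.5, 0.0), path_cost=1.0,
                   supply_distances=[2.0, 2.5], depot_distances=[3.0, 3.5])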
The invention has wide application fields including but not limited to the problems of path planning and logistics distribution of urban unmanned vehicles. For each task section, the id of one subtask is selected. The basis for selecting subtasks is a comprehensive consideration of a plurality of factors, including: the distance from the task section to the corresponding clustering center, the optimal path length of the subtask corresponding cost, the shortest distance from the task section to the replenishment station and the distance from the task section to the warehouse. The invention can find an optimal task division scheme by considering the distance between the task section and the replenishment station and the warehouse while ensuring the task efficiency. The scheme can not only improve the working efficiency of the urban sanitation vehicle and reduce the energy consumption, but also optimize the urban sanitation service and make contribution to the cleaning and environmental protection of the city.
The invention can be realized by technical means such as computer software, hardware and the like, and specific operation steps can be realized by programming in an intelligent vehicle control system. In addition, optimization and management of job task division can be achieved through cloud service and other modes. This diversified embodiment provides a flexible and efficient solution to meet different application requirements and environmental conditions.
The invention can also evaluate the obtained task allocation scheme by adopting an evaluation method. The assessment method is that each subtask is distributed to an urban unmanned sanitation vehicle, and index comparison is carried out after path planning. The evaluation method can comprehensively reflect the effect of the task allocation scheme and provide a powerful reference for further optimization.
The sample set adopted by the embodiment of the invention contains 3 real-scene work tasks in total, and as shown in fig. 3, the sample set comprises a road network, a warehouse position, a position of a replenishment station and the like.
The specific experimental process is as follows:
1. performance index
The embodiment of the invention uses the following performance indices: the overall non-working path ratio; the number of vehicles; the proportion of the non-working path of each unmanned vehicle per 1 km driven; and the total system cost, i.e., the optimal number of unmanned vehicles for completing the 1 km work task and the total cost required by the corresponding optimal path.
Wherein:
F is the fixed trip cost of the unmanned sanitation vehicle, with a value range of 0-1000;
Q is the number of unmanned vehicles participating in the work task, i.e., the number of subtasks;
β is an adjustment coefficient, generally set to 10000;
w is the energy consumption per unit working distance, generally set to 1;
r is the ratio of the energy consumption in transfer mode to that in working mode, with a value range of 0-1;
Path_w is the working path;
Path_r is the transfer path.
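For illustration, the non-working path ratio (the ratio of the transfer path to the total path) may be computed as follows; the total cost formula of the embodiment is not reproduced here, and the names are illustrative.

def non_working_ratio(path_w, path_r):
    # path_w: total working path length; path_r: total transfer (non-working) path length
    return path_r / (path_w + path_r)

ratio = non_working_ratio(path_w=80.0, path_r=20.0)   # 0.2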
2. Comparative test
The present embodiment employs the following methods for comparison evaluation:
K-means_ant: the method takes a k-means algorithm as a task dividing part, and adopts an ant colony algorithm to find an optimal path.
K-means_greedy: the method adopts a K-means algorithm to carry out task allocation, and adopts a greedy algorithm to find an optimal path.
Naive_greedy: the method is a greedy algorithm without task allocation.
Actor_Critic_ant: the method of the invention.
Fig. 4 is a schematic diagram of the number of vehicles versus the proportion of the total non-working path when the operation is completed (10×) in the three real test scenes of the embodiment of the present invention.
Fig. 5 shows the proportion of the non-working path of each unmanned vehicle per 1 km driven in the three real test scenes; the scatter points in the figure are the single-vehicle non-working path proportions corresponding to the different algorithms. It can be seen that the Actor-Critic model also performs better in terms of the single-vehicle non-working path ratio.
The total cost (Cost) in fig. 6 refers to the total cost of the optimal number of unmanned vehicles (Q) and their corresponding optimal paths required to complete a 1 km work task.
As can be seen from fig. 4-6, the present invention performs best on four criteria:
index one: the overall inactive path duty cycle, i.e., the ratio of inactive path to total path.
Index II: the number of vehicles, i.e. the number of unmanned vehicles involved in a work task.
Index III: the proportion of the non-working path of each unmanned vehicle per 1 km driven, i.e., the proportion of the non-working path within each 1 km of travel for a single unmanned vehicle.
Index IV: the optimal number of unmanned vehicles for completing the 1 km work task and the total cost required by the corresponding optimal path.
The method provided by the invention is excellent in all indexes, and the effectiveness and superiority of the method in practical application are proved.
In addition, the embodiment of the invention also discloses an intelligent task dividing device for the unmanned sanitation truck collaborative operation, which comprises the following components:
the clustering module is used for clustering all task road sections by using a clustering algorithm, wherein the clustering quantity is the preset subtask quantity, and an initial clustering center is obtained;
The initialization module is used for initializing the strategy function table and the value function table;
the probability that each task section selects each subtask is stored in the strategy function table;
the value function table stores the value function of each task section, wherein the value function is the expected return which can be obtained when the corresponding subtask is selected;
The selecting module is used for selecting a corresponding subtask for each task section according to the current strategy function table;
The reward value calculation module is used for calculating a reward value by using a reward function according to the selection result, and evaluating the expected return of the selection result;
The reward function relates to one or more factors of the distance from the task section to the corresponding clustering center, the optimal path cost of the subtasks, the shortest distance from the task section to the replenishment station and the distance from the task section to the warehouse;
The updating module is used for updating the value function and the strategy function of the current task section according to the reward value and the value function of the next task section, based on the updating rule of the Actor-Critic;
and the repeating module is used for repeatedly executing the methods in the selecting module, the reward value calculation module and the updating module, and updating the strategy function table and the value function table until the optimal subtask is allocated to each task section.
Further, calculating, by the optimal path cost calculation unit, an optimal path cost of the subtask in the reward function by adopting a greedy algorithm, including:
The acquisition unit is used for acquiring all task road sections corresponding to the subtasks and the connection relation among the task road sections, and treating each task road section as a node;
the greedy algorithm initializing unit is used for initializing a Boolean list for marking whether each node has been visited;
initializing the path length to 0, and setting the starting node to node 0;
the iteration searching unit is used for searching, starting from the current node, for the next unvisited node with the shortest distance from the current node, and marking the current node as visited;
traversing all unvisited nodes, calculating the distance from the current node to each unvisited node, and updating the minimum distance and the next node, iterating until the shortest path is found;
and a path updating unit for updating the path length by adding the minimum distance to it.
Further, the Euclidean distance is adopted to calculate the distance from the task road section to the corresponding clustering center.
Further, the calculation formula of the reward function R is as follows:
R = -Σ_{i=1}^{n} dis(x_i, c_j) - p - Σ_{i=1}^{n} d_i^sup - Σ_{i=1}^{n} d_i^dep
wherein x_i is the coordinate or feature vector of the i-th task section;
c_j is the coordinate of the j-th cluster center;
p is the optimal path length of the subtask calculated by the greedy algorithm;
n is the number of task sections of the current subtask;
d_i^sup is the shortest distance from the i-th task section to the replenishment station;
d_i^dep is the distance from the i-th task section to the warehouse;
dis(a, b) is the Euclidean distance between a and b.
Further, in step S5, the value function is updated by the following formula;
V(s)←V(s)+α*(r+γ*V(s′)-V(s))
wherein V(s) is the value function of the current task section;
V(s′) is the value function of the next task section;
r is the reward value;
α is the learning rate;
γ is the discount factor;
the policy function table is updated by the following formula:
π(a|s)←π(a|s)+α*(r+γ*V(s′)-V(s))
where π(a|s) is the probability of selecting the a-th subtask in the s-th task section.
Further, the probabilities in the policy function table are processed using a softmax function.
In this embodiment, the specific content of the task intelligent dividing device for the unmanned sanitation truck collaborative operation is similar to the content of the task intelligent dividing method for the unmanned sanitation truck collaborative operation disclosed in the above embodiment, and will not be described herein.
The embodiment of the invention also discloses a storage medium, which stores one or more computer-readable programs, wherein the one or more programs comprise instructions, and the instructions are suitable for being loaded by a memory and executing the task intelligent division method for the unmanned sanitation truck collaborative operation disclosed in any embodiment.
The invention discloses an intelligent task dividing method for unmanned sanitation vehicle collaborative operation, which has the following beneficial effects:
The invention divides the task section into a plurality of subtasks, and optimizes the division mode of each subtask by using an Actor-Critic reinforcement learning algorithm, thereby realizing the optimization of the division of the whole task section. Meanwhile, the distance from the task road section to the corresponding clustering center, the optimal path cost of the subtasks, the shortest distance from the task road section to the replenishment station, the distance from the task road section to the warehouse and other factors are comprehensively considered.
According to the invention, through using the Actor-Critic reinforcement learning model and combining greedy algorithm, clustering algorithm and the like, the tasks of the urban sanitation vehicle are effectively divided, so that the problems of route planning and task allocation when the urban sanitation vehicle executes tasks such as cleaning and garbage collection can be solved, the working efficiency is improved, the energy consumption is reduced, and the urban sanitation service is optimized.
It should be understood that the various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions of the methods and apparatus of the present invention, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention.
While the basic principles and main features of the present invention and advantages of the present invention have been shown and described, it will be understood by those skilled in the art that the present invention is not limited by the foregoing embodiments, which are described in the foregoing specification merely illustrate the principles of the present invention, and various changes and modifications may be made therein without departing from the spirit and scope of the invention, which is defined in the appended claims and their equivalents.