
CN118966684A - Intelligent task division method for collaborative operation of unmanned sanitation vehicles - Google Patents


Info

Publication number
CN118966684A
Authority
CN
China
Prior art keywords
task
function
distance
value
subtask
Prior art date
Legal status
Granted
Application number
CN202411038769.2A
Other languages
Chinese (zh)
Other versions
CN118966684B (en)
Inventor
何弢
吕丰
赵凌子
廖文龙
徐超
Current Assignee
Kuwa Technology Co ltd
Central South University
Original Assignee
Kuwa Technology Co ltd
Central South University
Priority date
Filing date
Publication date
Application filed by Kuwa Technology Co ltd and Central South University
Priority to CN202411038769.2A
Publication of CN118966684A
Application granted
Publication of CN118966684B
Status: Active
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06311Scheduling, planning or task assignment for a person or group
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06315Needs-based resource requirements planning or analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0633Workflow analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/10Office automation; Time management
    • G06Q10/103Workflow collaboration or project management
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/26Government or public services

Landscapes

  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Economics (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Tourism & Hospitality (AREA)
  • General Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Educational Administration (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Data Mining & Analysis (AREA)
  • Development Economics (AREA)
  • Game Theory and Decision Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract


The present invention discloses a method for intelligent task division of unmanned sanitation vehicle collaborative operation, comprising: clustering all task sections using a clustering algorithm, the number of clusters being a preset number of subtasks, and obtaining an initial cluster center; initializing a strategy function table and a value function table; selecting a corresponding subtask for each task section according to the current strategy function table; calculating a reward value using a reward function according to the selection result, and evaluating the expected return of the selection result; the reward function involves multiple factors; based on the update rule of Actor-Critic, updating the value function and strategy function of the current task section according to the reward value and the value function of the next task section; repeating the above steps, updating the strategy function table and the value function table, until the optimal subtask is assigned to each task section. The present invention can improve work efficiency, reduce energy consumption, and optimize urban sanitation services.

Description

Intelligent task division method for unmanned sanitation vehicle collaborative operation
Technical Field
The invention belongs to the technical field of task division of unmanned sanitation vehicles, and particularly relates to an intelligent task division method for unmanned sanitation vehicle collaborative operation.
Background
With the vigorous development of automatic driving technology, a large number of unmanned operation vehicles, such as unmanned sweeping vehicles and unmanned sprinkler vehicles, are emerging on the market. These vehicles play an important role in improving efficiency, reducing cost, improving the working environment and promoting environmental protection, and bring revolutionary changes to modern city management and traffic. How to effectively plan routes and distribute tasks when urban sanitation vehicles execute tasks such as cleaning and garbage collection is an important and complex problem. Conventional methods typically rely on manual experience or simple rules, such as nearest-neighbour rules or the k-means algorithm, which often fail to obtain globally optimal solutions and have difficulty coping with complex and dynamically changing environments.
Disclosure of Invention
In order to solve the technical problems, the invention provides an intelligent task dividing method for collaborative operation of unmanned sanitation vehicles.
In order to achieve the above purpose, the technical scheme of the invention is as follows:
in a first aspect, the invention discloses an intelligent task dividing method for unmanned sanitation vehicle collaborative operation, which comprises the following steps:
step S1: clustering all task road sections by using a clustering algorithm, wherein the clustering quantity is the preset subtask quantity, and an initial clustering center is obtained;
Step S2: initializing a policy function table and a value function table;
the probability that each task section selects each subtask is stored in the strategy function table;
the value function table stores the value function of each task section, wherein the value function is the expected return which can be obtained when the corresponding subtask is selected;
step S3: selecting a corresponding subtask for each task section according to the current strategy function table;
step S4: calculating a reward value according to the selection result by using a reward function, and evaluating the expected return of the selection result;
The reward function relates to one or more factors of the distance from the task section to the corresponding clustering center, the optimal path cost of the subtasks, the shortest distance from the task section to the replenishment station and the distance from the task section to the warehouse;
step S5: updating the value function and the strategy function of the current task section according to the rewarding value and the value function of the next task section based on the updating rule of the Actor-Critic;
step S6: and repeatedly executing the steps S3-S5, and updating the strategy function table and the value function table until the optimal subtask is allocated to each task section.
On the basis of the technical scheme, the following improvement can be made:
As a preferred solution, a greedy algorithm is used in the reward function to calculate the optimal path cost of the subtask, comprising the following steps:
Step A: acquiring all task road sections corresponding to the subtask and the connection relations between those task road sections, and regarding each task road section as a node;
Step B: initializing a Boolean list used to mark whether a node has been visited; initializing the path length to 0, and setting the starting node to 0;
Step C: starting from the current node, searching for the next unvisited node such that the distance to that node is shortest, and marking the current node as visited; traversing all unvisited nodes, calculating the distance from the current node to each unvisited node, and updating the minimum distance and the next node until the shortest path is found iteratively;
Step D: updating the path length by adding the minimum distance to it.
As a preferable scheme, the Euclidean distance is adopted to calculate the distance from the task road section to the corresponding clustering center.
Preferably, the reward function R is calculated as follows:
R = -Σ_{i=1..n} Dis(x_i, c_j) - P - Σ_{i=1..n} d_i^s - Σ_{i=1..n} d_i^w
wherein x_i is the coordinate or feature vector of the i-th task section;
c_j is the coordinate of the j-th cluster center;
P is the optimal path length of the subtask calculated by the greedy algorithm;
n is the number of task sections of the current subtask;
d_i^s is the shortest distance from the i-th task section to the replenishment station;
d_i^w is the distance from the i-th task section to the warehouse;
Dis(a, b) denotes the Euclidean distance between a and b.
As a preferred embodiment, in step S5, the value function is updated by the following formula:
V(s) ← V(s) + α*(r + γ*V(s′) - V(s))
wherein V(s) is the value function of the current task section;
V(s′) is the value function of the next task section;
r is the reward value;
α is the learning rate;
γ is the discount factor;
and the policy function table is updated by the following formula:
π(a|s) ← π(a|s) + α*(r + γ*V(s′) - V(s))
where π(a|s) is the probability of selecting the a-th subtask for the s-th task section.
As a preferred approach, the probabilities in the policy function table are processed using softmax functions.
In a second aspect, the invention also discloses an intelligent task dividing device for unmanned sanitation truck collaborative operation, which comprises:
the clustering module is used for clustering all task road sections by using a clustering algorithm, wherein the clustering quantity is the preset subtask quantity, and an initial clustering center is obtained;
The initialization module is used for initializing the strategy function table and the value function table;
the probability that each task section selects each subtask is stored in the strategy function table;
the value function table stores the value function of each task section, wherein the value function is the expected return which can be obtained when the corresponding subtask is selected;
The selecting module is used for selecting a corresponding subtask for each task section according to the current strategy function table;
The rewarding value calculation module is used for calculating rewarding values by using a rewarding function according to the selection result and evaluating expected returns of the selection result;
The reward function relates to one or more factors of the distance from the task section to the corresponding clustering center, the optimal path cost of the subtasks, the shortest distance from the task section to the replenishment station and the distance from the task section to the warehouse;
The updating module is used for updating the value function and the strategy function of the current task section according to the rewarding value and the value function of the next task section based on the updating rule of the Actor-Critic;
and the repeating module is used for repeatedly executing the method in the selecting module, the rewarding value calculating module and the updating module, and updating the strategy function table and the value function table until the optimal subtasks are allocated to each task section.
As a preferred solution, an optimal path cost calculation unit calculates the optimal path cost of the subtask in the reward function using a greedy algorithm, and comprises:
an acquisition unit, used to acquire all task road sections corresponding to the subtask and the connection relations between those task road sections, and to treat each task road section as a node;
a greedy-algorithm initialization unit, used to initialize a Boolean list marking whether a node has been visited, to initialize the path length to 0, and to set the starting node to 0;
an iterative search unit, used to search, starting from the current node, for the next unvisited node such that the distance to that node is shortest, and to mark the current node as visited; traversing all unvisited nodes, calculating the distance from the current node to each unvisited node, and updating the minimum distance and the next node until the shortest path is found iteratively;
a path update unit, used to update the path length by adding the minimum distance to it.
As a preferable scheme, the Euclidean distance is adopted to calculate the distance from the task road section to the corresponding clustering center.
Preferably, the reward function R is calculated as follows:
R = -Σ_{i=1..n} Dis(x_i, c_j) - P - Σ_{i=1..n} d_i^s - Σ_{i=1..n} d_i^w
wherein x_i is the coordinate or feature vector of the i-th task section;
c_j is the coordinate of the j-th cluster center;
P is the optimal path length of the subtask calculated by the greedy algorithm;
n is the number of task sections of the current subtask;
d_i^s is the shortest distance from the i-th task section to the replenishment station;
d_i^w is the distance from the i-th task section to the warehouse;
Dis(a, b) denotes the Euclidean distance between a and b.
As a preferred embodiment, in step S5, the value function is updated by the following formula:
V(s) ← V(s) + α*(r + γ*V(s′) - V(s))
wherein V(s) is the value function of the current task section;
V(s′) is the value function of the next task section;
r is the reward value;
α is the learning rate;
γ is the discount factor;
and the policy function table is updated by the following formula:
π(a|s) ← π(a|s) + α*(r + γ*V(s′) - V(s))
where π(a|s) is the probability of selecting the a-th subtask for the s-th task section.
As a preferred approach, the probabilities in the policy function table are processed using softmax functions.
In a third aspect, the present invention also discloses a storage medium, where one or more computer readable programs are stored, where the one or more programs include instructions, where the instructions are adapted to be loaded by a memory and execute any of the above-mentioned intelligent task partitioning methods for unmanned sanitation truck collaborative operations.
The invention discloses an intelligent task dividing method for unmanned sanitation vehicle collaborative operation, which has the following beneficial effects:
The invention divides the task section into a plurality of subtasks, and optimizes the division mode of each subtask by using an Actor-Critic reinforcement learning algorithm, thereby realizing the optimization of the division of the whole task section. Meanwhile, the distance from the task road section to the corresponding clustering center, the optimal path cost of the subtasks, the shortest distance from the task road section to the replenishment station, the distance from the task road section to the warehouse and other factors are comprehensively considered.
According to the invention, through using the Actor-Critic reinforcement learning model and combining greedy algorithm, clustering algorithm and the like, the tasks of the urban sanitation vehicle are effectively divided, so that the problems of route planning and task allocation when the urban sanitation vehicle executes tasks such as cleaning and garbage collection can be solved, the working efficiency is improved, the energy consumption is reduced, and the urban sanitation service is optimized.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a task intelligent dividing method provided by an embodiment of the present invention.
FIG. 2 is a block diagram of an Actor-Critic reinforcement learning model according to an embodiment of the present invention.
Fig. 3 (a), (b), and (c) are schematic diagrams of urban environment models of scene one, scene two, and scene three according to the embodiments of the present invention.
Fig. 4 (a), (b), and (c) are schematic views of the number of vehicles according to the proportion of the total non-working paths when the operation is completed in 10 x in three real test scenarios according to the embodiment of the present invention.
Fig. 5 (a), (b), and (c) are respectively proportion diagrams of a non-working path of each unmanned vehicle running for 1km in three real test scenes according to an embodiment of the present invention.
Fig. 6 (a), (b), and (c) are schematic diagrams of total Cost (Cost) in three real test scenarios according to an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The expression "comprising" an element is an "open" expression which merely means that there is a corresponding component or step and should not be interpreted as excluding the inclusion of additional components or steps.
In order to achieve the purpose of the present invention, in some embodiments of a task intelligent dividing method for collaborative operations of unmanned sanitation vehicles, as shown in fig. 1, the task intelligent dividing method includes:
step S1: clustering all task road sections by using a clustering algorithm, wherein the clustering quantity is the preset subtask quantity, and an initial clustering center is obtained;
Step S2: initializing a policy function table and a value function table;
the probability that each task section selects each subtask is stored in the strategy function table;
the value function table stores the value function of each task section, wherein the value function is the expected return which can be obtained when the corresponding subtask is selected;
step S3: selecting a corresponding subtask for each task section according to the current strategy function table;
step S4: calculating a reward value according to the selection result by using a reward function, and evaluating the expected return of the selection result;
The reward function relates to one or more factors of the distance from the task section to the corresponding clustering center, the optimal path cost of the subtasks, the shortest distance from the task section to the replenishment station and the distance from the task section to the warehouse;
step S5: updating the value function and the strategy function of the current task section according to the rewarding value and the value function of the next task section based on the updating rule of the Actor-Critic;
step S6: and repeatedly executing the steps S3-S5, and updating the strategy function table and the value function table until the optimal subtask is allocated to each task section.
The present invention will be described in detail below.
And S1, clustering all task road sections by using a K-means clustering algorithm, wherein the clustering quantity is the preset subtask quantity, and an initial clustering center is obtained.
K-means is a common clustering algorithm used to divide samples in a dataset into a preset number of different groups or clusters of subtasks. The basic principle is that the center point of the cluster is found through iterative optimization, so that the sum of distances from the sample point to the center point of the cluster to which the sample point belongs is minimized.
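For illustration only, the initial clustering of step S1 can be sketched in Python as follows; representing each task road section by a 2-D coordinate (e.g. its midpoint) and using scikit-learn's KMeans are assumptions of this sketch, not requirements of the invention:

import numpy as np
from sklearn.cluster import KMeans

def initial_clusters(segment_coords, num_subtasks, seed=0):
    # segment_coords: array of shape (num_segments, 2), one coordinate per task road section (assumed representation)
    km = KMeans(n_clusters=num_subtasks, n_init=10, random_state=seed)
    labels = km.fit_predict(np.asarray(segment_coords, dtype=float))
    return labels, km.cluster_centers_   # initial assignment and initial cluster centers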
In recent years, with the development of artificial intelligence and machine learning techniques, reinforcement learning has been widely applied to various optimization problems as a method capable of learning an optimal strategy through interaction with the environment. In particular, the Actor-Critic model has shown excellent performance in many problems as a reinforcement learning algorithm that combines both methods of value iteration and strategy iteration.
However, how to apply reinforcement learning, especially the Actor-Critic model, to the task partitioning problem of urban sanitation vehicles remains a challenging problem.
The invention updates the task division strategy according to the rewarding value of each task section (namely, state) by using reinforcement learning, thereby finding an optimal task division scheme.
The invention uses an Actor-Critic reinforcement learning model to select the corresponding subtasks (i.e., actions) and to update the value function table and the policy function table. The Actor-Critic model combines the advantages of value iteration and policy iteration and can effectively balance exploration and exploitation, thereby improving the task division effect.
As shown in FIG. 2, the Actor-Critic model is a reinforcement learning algorithm that combines two methods, value iteration (Critic) and strategy iteration (Actor). In the Actor-Critic model, "Actor" and "Critic" are two major components, each of which has different responsibilities: the role of the Actor is to select an action according to the current policy. Policies are generally expressed as the probability of each possible action being performed in a given state. In each step, the Actor selects an action according to the current policy. Critic's responsibility is to evaluate the action selected by the Actor. It calculates the expected return for each action and provides feedback to the Actor. This feedback is used to update the policy of the Actor so that the probability of actions that expect a higher return being selected in the future is greater.
The main steps of the Actor-Critic model are mainly embodied in the steps S2-S6.
Step S2: initializing a policy function table and a value function table;
the probability that each task section selects each subtask is stored in the strategy function table;
The value function table stores the value function of each task section, wherein the value function is the expected return which can be obtained when the corresponding subtask is selected.
Step S3: and selecting a corresponding subtask for each task section according to the current strategy function table, wherein the step is completed by an Actor.
Step S4: calculating a reward value according to the selection result by using a reward function, and evaluating the expected return of the selection result, wherein the step is completed by Critic;
The reward function relates to one or more factors of the distance of the task segment to the corresponding cluster center, the optimal path cost of the subtask, the shortest distance of the task segment to the replenishment station, and the distance of the task segment to the warehouse.
Step S5: and updating the value function and the strategy function of the current task section according to the reward value and the value function of the next task section based on the updating rule of the Actor-Critic according to the feedback of Critic.
Step S6: and repeatedly executing the steps S3-S5, and updating the strategy function table and the value function table until the optimal subtask is allocated to each task section.
In particular, it is desirable to increase the probability that an action with a higher expected return is selected and decrease the probability that an action with a lower expected return is selected. At the same time, the present invention updates the value function to more accurately estimate the expected return for each state.
In the Actor-Critic model, the present invention maintains two tables: a value function table and a policy function table. The value function table is used to store a value function for each state that indicates the expected return that can be obtained in that state. The policy function table is used to store the probability of performing each possible action in each state.
With respect to the policy function table. The policy function table is a two-dimensional array in which each row corresponds to a state and each column corresponds to an action. Each element of the policy function table represents the probability of performing a given action in a given state. For example, assuming that there are 3 states and 2 actions, the policy function table may be as follows:
[[0.7,0.3],
[0.4,0.6],
[0.5,0.5]]
This means that the probability of performing action 1 in state 1 is 0.7 and the probability of performing action 2 is 0.3; the probability of performing action 1 in state 2 is 0.4, and the probability of performing action 2 is 0.6; the probability of performing both action 1 and action 2 in state 3 is 0.5.
With respect to the value function table. The value function table is a one-dimensional array in which each element corresponds to a state. Each element of the value function table represents the expected return that can be obtained in that state.
For example, assuming that there are 3 states, the value function table may be as follows:
[0.5,0.6,0.7]
This means that the expected return in state 1 is 0.5, the expected return in state 2 is 0.6, and the expected return in state 3 is 0.7.
In reinforcement learning, the goal is to find the optimal policy function table and value function table by learning so that the maximum cumulative return can be obtained by following the policy function table selection action from any state.
The state represents a task section, and the execution action represents a selection of a subtask.
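A minimal sketch of these two tables, assuming a uniform initial policy and zero-initialized values (the concrete initial values are an assumption of this sketch):

import numpy as np

def init_tables(num_segments, num_subtasks):
    # policy function table: one row per task section (state), one column per subtask (action)
    policy_table = np.full((num_segments, num_subtasks), 1.0 / num_subtasks)
    # value function table: one entry per task section, holding its expected return
    value_table = np.zeros(num_segments)
    return policy_table, value_table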
The following is a detailed description of the update process of the policy function table and the value function table:
In each step, the value function table and the policy function table are updated according to the updating rule of the Actor-Critic. In the Actor-Critic model, critic's feedback is used to update the policy function table and the value function table. Specifically, the present invention uses the following update rules:
In the updating process of the value function table, a temporal difference error (TD error) is calculated first. The TD error is the reward plus the discounted value function of the next state, minus the value function of the current state. This error is then multiplied by the learning rate and added to the value function of the current state to update the value function.
Specifically, in step S5, the value function is updated by the following formula:
V(s) ← V(s) + α*(r + γ*V(s′) - V(s))
wherein V(s) is the value function of the current task section;
V(s′) is the value function of the next task section;
r is the reward value;
α is the learning rate;
γ is the discount factor.
In the updating process of the strategy function table, the method adopts the same time difference error as the updating of the value function table. Specifically, this error is multiplied by the learning rate and then added to the probability of performing the current action in the current state.
Specifically, in step S5, the policy function table is updated by the following formula;
π(a|s)←π(a|s)+α*(r+γ*V(s′)-V(s))
where π(a|s) is the probability of selecting the a-th subtask for the s-th task section.
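A sketch of these two tabular update rules might read as follows; re-normalizing the policy row with the softmax function before the next selection (described in the following paragraphs) is how this sketch keeps the entries usable as probabilities:

def actor_critic_update(policy_table, value_table, s, a, r, s_next, alpha, gamma):
    # Critic: temporal difference error r + gamma*V(s') - V(s)
    td_error = r + gamma * value_table[s_next] - value_table[s]
    # update the value function of the current task section
    value_table[s] += alpha * td_error
    # Actor: shift the preference of the chosen subtask by the same TD error
    policy_table[s, a] += alpha * td_error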
Through the process, the strategy function and the value function can be gradually improved, so that the reinforcement learning algorithm can better solve the problem.
The invention uses a combination of a softmax function and a random selection function for action selection.
The invention uses softmax functions to process probabilities in a policy function table. The softmax function may convert a set of values into a probability distribution such that the probabilities satisfy the properties of the probability distribution (i.e., the sum of all probabilities is 1, each probability is between 0 and 1).
The softmax function is formulated as:
P(j | x) = exp(x·w_j) / Σ_{k=1..K} exp(x·w_k)
where x is the input vector, w_j and w_k are weight vectors, and K is the total number of categories.
Further, the present invention employs a random selection algorithm (e.g., the numpy.random.choice function) to randomly select an action from the probability distribution produced by the softmax function. The random selection algorithm selects an element according to a given probability distribution, ensuring that the probability of each action being selected is consistent with its probability in the policy function table. This provides an efficient way to balance the needs of exploration and exploitation, thereby achieving optimization in a complex decision-making environment.
The combination of the softmax function and random selection preserves both the randomness and the quality of the policy: actions with larger value functions are selected with higher probability, while actions with smaller value functions still retain some probability of being selected. In this way, the method balances exploration and exploitation to a certain extent and improves the reinforcement learning effect.
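A sketch of this selection step, assuming the policy row is passed through a plain (temperature-free) softmax before sampling:

import numpy as np

def select_actions(policy_table):
    actions = []
    for row in policy_table:
        z = row - row.max()                       # shift for numerical stability
        probs = np.exp(z) / np.exp(z).sum()       # softmax over the subtask preferences
        actions.append(np.random.choice(len(probs), p=probs))  # sample per the distribution
    return np.array(actions)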
In reinforcement learning, the design of the bonus function is very critical. The bonus function of the present invention takes into account a number of factors, including: the distance from the task section to the corresponding cluster center, the optimal path cost of the subtasks, the shortest distance from the task section to the replenishment station, and the distance from the task section to the warehouse. The design of the reward function comprehensively considering a plurality of factors ensures that the task division scheme meets the actual requirements better.
The goal of the greedy algorithm (greedy_shortest_path) is to calculate the optimal path cost for a given subtask. The greedy algorithm repeatedly selects, from the not-yet-visited task road sections of the subtask, the section with the lowest path cost. The input of the function is a matrix corresponding to the subtask, and the output is the optimal path cost of that subtask.
The method specifically comprises the following steps:
Step A: acquiring all task road sections corresponding to the subtask and the connection relations between those task road sections, and regarding each task road section as a node;
Step B: initializing a Boolean list used to mark whether a node has been visited; initializing the path length to 0, and setting the starting node to 0;
Step C: starting from the current node, searching for the next unvisited node such that the distance to that node is shortest, and marking the current node as visited; traversing all unvisited nodes, calculating the distance from the current node to each unvisited node, and updating the minimum distance and the next node until the shortest path is found iteratively;
Step D: updating the path length by adding the minimum distance to it.
The core idea of the algorithm is, at each step, to select the unvisited node closest to the current node as the next node to visit, until all nodes have been visited, thereby obtaining the shortest path length of the whole subtask.
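A sketch of this greedy computation, under the assumption that the matrix passed in is a symmetric matrix of distances between the task road sections of one subtask:

def greedy_shortest_path(dist_matrix):
    n = len(dist_matrix)
    visited = [False] * n            # Boolean list marking whether each node has been visited
    path_length = 0.0
    current = 0                      # start from node 0
    visited[current] = True
    for _ in range(n - 1):
        best_next, best_dist = None, float('inf')
        for j in range(n):           # find the nearest unvisited node from the current node
            if not visited[j] and dist_matrix[current][j] < best_dist:
                best_next, best_dist = j, dist_matrix[current][j]
        visited[best_next] = True
        path_length += best_dist     # add the minimum distance to the path length
        current = best_next
    return path_length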
Further, the Euclidean distance is adopted to calculate the distance from the task road section to the corresponding clustering center.
The main steps of Euclidean distance acquisition are as follows:
First, two coordinates are acquired from the input parameters.
Each coordinate is a tuple containing two elements (x and y coordinates).
Then, the difference between the two coordinates in the x-axis and the y-axis is calculated.
Next, the distance between the two coordinates is calculated using the euclidean distance formula.
The Euclidean distance formula is:
Dis = sqrt((x_1 - x_2)^2 + (y_1 - y_2)^2)
wherein (x_1, y_1) and (x_2, y_2) are the two coordinates.
And finally, returning the calculated distance.
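The steps above amount to the following small helper:

import math

def euclidean_distance(p1, p2):
    # p1 and p2 are (x, y) coordinate tuples
    return math.hypot(p1[0] - p2[0], p1[1] - p2[1])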
The following is a detailed description of the reward calculation process:
In the present invention, the calculation of the reward covers a number of key factors, including the distance of the task section to the cluster center, the optimal path cost of the subtask, the shortest distance of the task section to the replenishment station, and the distance of the task section to the warehouse.
First, regarding the distance of the task section to the cluster center: the distance is calculated by the Euclidean distance formula, and the goal is to minimize this distance. Thus, the resulting reward is the negative of this distance.
Second, the optimal path cost of the subtask: the optimal path cost is obtained from the task sections belonging to the same subtask, and a greedy algorithm is used to calculate it, the goal being to minimize this length. Thus, the resulting reward is the negative of this length.
Third, the shortest distance of the task section to the replenishment station: the aim is to minimize this distance. Thus, the resulting reward is the negative of this distance.
Fourth, the distance of the task section to the warehouse: the aim is to make this distance as small as possible. Thus, the resulting reward is the negative of this distance.
It is worth emphasizing that these reward terms are all negative, because the present invention aims to minimize these distances and lengths. Within the framework of reinforcement learning, the strategy sought is the one that maximizes the cumulative reward, so minimizing these distances and lengths is essentially equivalent to maximizing the cumulative reward.
Further, the reward function R is calculated as follows:
R = -Σ_{i=1..n} Dis(x_i, c_j) - P - Σ_{i=1..n} d_i^s - Σ_{i=1..n} d_i^w
wherein x_i is the coordinate or feature vector of the i-th task section;
c_j is the coordinate of the j-th cluster center;
P is the optimal path length of the subtask calculated by the greedy algorithm;
n is the number of task sections of the current subtask;
d_i^s is the shortest distance from the i-th task section to the replenishment station;
d_i^w is the distance from the i-th task section to the warehouse;
Dis(a, b) is the Euclidean distance between a and b, i.e. Dis(a, b) = sqrt((a_x - b_x)^2 + (a_y - b_y)^2).
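Under the formula above, a sketch of the reward computation for a single subtask might read as follows; the equal (unit) weighting of the four terms is an assumption of this sketch:

def compute_reward(segment_coords, cluster_center, supply_dists, depot_dists, dist_matrix):
    # segment_coords: coordinates of the task road sections assigned to this subtask
    # supply_dists[i]: shortest distance from section i to a replenishment station
    # depot_dists[i]:  distance from section i to the warehouse
    cluster_term = sum(euclidean_distance(x, cluster_center) for x in segment_coords)
    path_term = greedy_shortest_path(dist_matrix)    # optimal path cost P of the subtask
    supply_term = sum(supply_dists)
    depot_term = sum(depot_dists)
    # every term enters negatively: shorter distances mean a larger reward
    return -(cluster_term + path_term + supply_term + depot_term)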
The invention has wide application fields, including but not limited to path planning and logistics distribution for urban unmanned vehicles. For each task section, the id of one subtask is selected. The basis for selecting a subtask is a comprehensive consideration of a plurality of factors, including: the distance from the task section to the corresponding cluster center, the optimal path cost of the corresponding subtask, the shortest distance from the task section to the replenishment station, and the distance from the task section to the warehouse. By considering the distances between the task sections and the replenishment station and the warehouse while ensuring task efficiency, the invention can find an optimal task division scheme. Such a scheme can not only improve the working efficiency of urban sanitation vehicles and reduce energy consumption, but also optimize urban sanitation services and contribute to cleaner and more environmentally friendly cities.
The invention can be realized by technical means such as computer software, hardware and the like, and specific operation steps can be realized by programming in an intelligent vehicle control system. In addition, optimization and management of job task division can be achieved through cloud service and other modes. This diversified embodiment provides a flexible and efficient solution to meet different application requirements and environmental conditions.
The invention can also evaluate the obtained task allocation scheme by adopting an evaluation method. The assessment method is that each subtask is distributed to an urban unmanned sanitation vehicle, and index comparison is carried out after path planning. The evaluation method can comprehensively reflect the effect of the task allocation scheme and provide a powerful reference for further optimization.
The sample set adopted by the embodiment of the invention contains 3 real-scene work tasks in total, and as shown in fig. 3, the sample set comprises a road network, a warehouse position, a position of a replenishment station and the like.
The specific experimental process is as follows:
1. Performance indices
The embodiment of the invention uses: the overall non-working path ratio; the number of vehicles and the non-working path ratio of each unmanned vehicle per 1 km travelled; and the total system cost, i.e. the optimal number of unmanned vehicles for completing the 1 km work task and the total cost required by the corresponding optimal path.
The total system cost depends on the following quantities:
F is the fixed trip cost of an unmanned sanitation vehicle, with a value range of 0-1000;
Q is the number of unmanned vehicles participating in the work task, i.e. the number of subtasks;
β is an adjustment coefficient, typically taken as 10000;
w is the energy consumption per unit working distance, typically taken as 1;
r is the ratio of the energy consumption of the transfer mode to that of the working mode, with a value range of 0-1;
Path_w is the working path;
Path_r is the transfer path.
2. Comparative tests
The present embodiment employs the following methods for comparative evaluation:
K-means_ant: this method uses the k-means algorithm for task division and the ant colony algorithm to find the optimal path.
K-means_greedy: this method uses the k-means algorithm for task allocation and a greedy algorithm to find the optimal path.
Naive_greedy: a greedy algorithm without task allocation.
Actor_Critic_ant: the method of the present invention.
Fig. 4 is a schematic diagram of the number of vehicles according to the proportion of the total non-working path when the operation is completed in 10 x in three real test scenes in the embodiment of the present invention.
Fig. 5 shows the non-working path proportion of each unmanned vehicle per 1 km travelled in the three real test scenes; the scatter points in the figure are the single-vehicle non-working path proportions under the different algorithms. It can be seen that the Actor-Critic model also performs better in the single-vehicle non-working path ratio.
The total Cost (Cost) in fig. 6 refers to the total Cost of the optimal number of drones (Q) and their corresponding optimal paths required to complete a1 km work mission.
As can be seen from figs. 4-6, the present invention performs best on four criteria:
Index 1: the overall non-working path ratio, i.e. the ratio of the non-working path to the total path.
Index 2: the number of vehicles, i.e. the number of unmanned vehicles involved in the work task.
Index 3: the non-working path ratio of each unmanned vehicle per 1 km travelled, i.e. the proportion of non-working path within each 1 km of travel.
Index 4: the optimal number of unmanned vehicles for completing the 1 km work task and the total cost required by the corresponding optimal path.
The method provided by the invention is excellent in all indexes, and the effectiveness and superiority of the method in practical application are proved.
In addition, the embodiment of the invention also discloses an intelligent task dividing device for the unmanned sanitation truck collaborative operation, which comprises the following components:
the clustering module is used for clustering all task road sections by using a clustering algorithm, wherein the clustering quantity is the preset subtask quantity, and an initial clustering center is obtained;
The initialization module is used for initializing the strategy function table and the value function table;
the probability that each task section selects each subtask is stored in the strategy function table;
the value function table stores the value function of each task section, wherein the value function is the expected return which can be obtained when the corresponding subtask is selected;
The selecting module is used for selecting a corresponding subtask for each task section according to the current strategy function table;
The rewarding value calculation module is used for calculating rewarding values by using a rewarding function according to the selection result and evaluating expected returns of the selection result;
The reward function relates to one or more factors of the distance from the task section to the corresponding clustering center, the optimal path cost of the subtasks, the shortest distance from the task section to the replenishment station and the distance from the task section to the warehouse;
The updating module is used for updating the value function and the strategy function of the current task section according to the rewarding value and the value function of the next task section based on the updating rule of the Actor-Critic;
and the repeating module is used for repeatedly executing the method in the selecting module, the rewarding value calculating module and the updating module, and updating the strategy function table and the value function table until the optimal subtasks are allocated to each task section.
Further, an optimal path cost calculation unit calculates the optimal path cost of the subtask in the reward function using a greedy algorithm, and comprises:
an acquisition unit, used to acquire all task road sections corresponding to the subtask and the connection relations between those task road sections, and to treat each task road section as a node;
a greedy-algorithm initialization unit, used to initialize a Boolean list marking whether a node has been visited, to initialize the path length to 0, and to set the starting node to 0;
an iterative search unit, used to search, starting from the current node, for the next unvisited node such that the distance to that node is shortest, and to mark the current node as visited; traversing all unvisited nodes, calculating the distance from the current node to each unvisited node, and updating the minimum distance and the next node until the shortest path is found iteratively;
a path update unit, used to update the path length by adding the minimum distance to it.
Further, the Euclidean distance is adopted to calculate the distance from the task road section to the corresponding clustering center.
Further, the reward function R is calculated as follows:
R = -Σ_{i=1..n} Dis(x_i, c_j) - P - Σ_{i=1..n} d_i^s - Σ_{i=1..n} d_i^w
wherein x_i is the coordinate or feature vector of the i-th task section;
c_j is the coordinate of the j-th cluster center;
P is the optimal path length of the subtask calculated by the greedy algorithm;
n is the number of task sections of the current subtask;
d_i^s is the shortest distance from the i-th task section to the replenishment station;
d_i^w is the distance from the i-th task section to the warehouse;
Dis(a, b) denotes the Euclidean distance between a and b.
Further, in step S5, the value function is updated by the following formula:
V(s) ← V(s) + α*(r + γ*V(s′) - V(s))
wherein V(s) is the value function of the current task section;
V(s′) is the value function of the next task section;
r is the reward value;
α is the learning rate;
γ is the discount factor;
and the policy function table is updated by the following formula:
π(a|s) ← π(a|s) + α*(r + γ*V(s′) - V(s))
where π(a|s) is the probability of selecting the a-th subtask for the s-th task section.
Further, the probabilities in the policy function table are processed using a softmax function.
In this embodiment, the specific content of the task intelligent dividing device for the unmanned sanitation truck collaborative operation is similar to the content of the task intelligent dividing method for the unmanned sanitation truck collaborative operation disclosed in the above embodiment, and will not be described herein.
The embodiment of the invention also discloses a storage medium, which stores one or more computer-readable programs, wherein the one or more programs comprise instructions, and the instructions are suitable for being loaded by a memory and executing the task intelligent division method for the unmanned sanitation truck collaborative operation disclosed in any embodiment.
The invention discloses an intelligent task dividing method for unmanned sanitation vehicle collaborative operation, which has the following beneficial effects:
The invention divides the task section into a plurality of subtasks, and optimizes the division mode of each subtask by using an Actor-Critic reinforcement learning algorithm, thereby realizing the optimization of the division of the whole task section. Meanwhile, the distance from the task road section to the corresponding clustering center, the optimal path cost of the subtasks, the shortest distance from the task road section to the replenishment station, the distance from the task road section to the warehouse and other factors are comprehensively considered.
According to the invention, through using the Actor-Critic reinforcement learning model and combining greedy algorithm, clustering algorithm and the like, the tasks of the urban sanitation vehicle are effectively divided, so that the problems of route planning and task allocation when the urban sanitation vehicle executes tasks such as cleaning and garbage collection can be solved, the working efficiency is improved, the energy consumption is reduced, and the urban sanitation service is optimized.
It should be understood that the various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions of the methods and apparatus of the present invention, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention.
While the basic principles and main features of the present invention and advantages of the present invention have been shown and described, it will be understood by those skilled in the art that the present invention is not limited by the foregoing embodiments, which are described in the foregoing specification merely illustrate the principles of the present invention, and various changes and modifications may be made therein without departing from the spirit and scope of the invention, which is defined in the appended claims and their equivalents.

Claims (13)

1.无人环卫车协同作业的任务智能划分方法,其特征在于,包括:1. A method for intelligently dividing tasks of unmanned sanitation vehicles for collaborative operation, characterized by comprising: 步骤S1:利用聚类算法对所有的任务路段进行聚类,聚类数量为预设的子任务数量,得到初始的聚类中心;Step S1: cluster all task sections using a clustering algorithm, the number of clusters being the preset number of subtasks, and obtaining the initial cluster center; 步骤S2:初始化策略函数表和值函数表;Step S2: Initialize the strategy function table and the value function table; 所述策略函数表内存储每个任务路段选择每个子任务的概率;The strategy function table stores the probability of selecting each subtask for each task section; 所述值函数表内存储每个任务路段的值函数,所述值函数为选择对应子任务时,所能够获得的预期回报;The value function table stores the value function of each task section, and the value function is the expected return that can be obtained when the corresponding subtask is selected; 步骤S3:根据当前的策略函数表为每个任务路段选择一个对应子任务;Step S3: Select a corresponding subtask for each task section according to the current strategy function table; 步骤S4:根据选择结果,利用奖励函数计算奖励值,并评估该选择结果的预期回报;Step S4: Calculate the reward value using the reward function according to the selection result, and evaluate the expected return of the selection result; 所述奖励函数涉及任务路段到对应聚类中心的距离、子任务的最优路径成本、任务路段到补给站的最短距离以及任务路段到仓库的距离中的一种或多种因素;The reward function involves one or more factors of the distance from the task section to the corresponding cluster center, the optimal path cost of the subtask, the shortest distance from the task section to the supply station, and the distance from the task section to the warehouse; 步骤S5:基于Actor-Critic的更新规则,根据奖励值和下一个任务路段的值函数,更新当前任务段的值函数和策略函数;Step S5: Based on the Actor-Critic update rule, the value function and strategy function of the current task segment are updated according to the reward value and the value function of the next task segment; 步骤S6:重复执行步骤S3-步骤S5,对策略函数表和值函数表进行更新,直至为每个任务路段分配最优的子任务。Step S6: Repeat steps S3 to S5 to update the strategy function table and the value function table until the optimal subtask is assigned to each task section. 2.根据权利要求1所述的任务智能划分方法,其特征在于,在奖励函数中采用贪心算法计算子任务的最优路径成本,包括以下步骤:2. The task intelligent division method according to claim 1 is characterized in that a greedy algorithm is used in the reward function to calculate the optimal path cost of the subtask, comprising the following steps: 步骤A:获取子任务所对应的所有任务路段以及该任务路段之间的连接关系,并将每个任务路段视为一个节点;Step A: Obtain all task sections corresponding to the subtasks and the connection relationship between the task sections, and regard each task section as a node; 步骤B:初始化布尔型列表,用于标记节点是否已经访问过;Step B: Initialize a Boolean list to mark whether a node has been visited; 初始化路径长度为0,将起始节点设置为0;Initialize the path length to 0 and set the starting node to 0; 步骤C:从当前节点开始,寻找下一个未访问过的节点,使得到达该节点的距离最短,并将当前节点标记为已访问过;Step C: Starting from the current node, find the next unvisited node so that the distance to the node is the shortest, and mark the current node as visited; 遍历所有未访问过的节点,计算当前节点到每个未访问过节点的距离,并更新最小距离和下一个节点,直至迭代寻找到最短路径;Traverse all unvisited nodes, calculate the distance from the current node to each unvisited node, and update the minimum distance and the next node until the shortest path is found iteratively; 步骤D:更新路径长度,将最小距离添加到路径长度中。Step D: Update the path length by adding the minimum distance to the path length. 3.根据权利要求1所述的任务智能划分方法,其特征在于,采用欧氏距离计算任务路段到对应聚类中心的距离。3. The task intelligent division method according to claim 1 is characterized in that the Euclidean distance is used to calculate the distance from the task section to the corresponding cluster center. 4.根据权利要求1所述的任务智能划分方法,其特征在于,所述奖励函数R的计算公式如下:4. 
The task intelligent division method according to claim 1 is characterized in that the calculation formula of the reward function R is as follows: 其中,xi为第i任务路段的坐标或特征向量;Where, xi is the coordinate or feature vector of the i-th task section; cj为第j个聚类中心的坐标;c j is the coordinate of the jth cluster center; P为贪心算法计算的子任务的最优路径长度;P is the optimal path length of the subtask calculated by the greedy algorithm; n为当前子任务的任务路段数量;n is the number of task sections of the current subtask; 为第i个任务路段到补给站的最短距离; is the shortest distance from the i-th mission section to the supply station; 为第i个任务路段到仓库的距离。 is the distance from the i-th task section to the warehouse. 5.根据权利要求1所述的任务智能划分方法,其特征在于,在所述步骤S5中,通过以下公式更新值函数;5. The task intelligent division method according to claim 1, characterized in that, in step S5, the value function is updated by the following formula; V(s)←V(s)+α*(r+γ*V(s′)-V(s))V(s)←V(s)+α*(r+γ*V(s′)-V(s)) 其中,V(s)是当前任务路段的值函数;Among them, V(s) is the value function of the current task section; V(s′)是下一个任务路段的值函数;V(s′) is the value function of the next task segment; r是奖励值;r is the reward value; α是学习率;α is the learning rate; γ是折扣因子;γ is the discount factor; 通过以下公式更新策略函数表;Update the strategy function table by the following formula; π(a|s)←π(a|s)+α*(r+γ*V(s′)-V(s))π(a|s)←π(a|s)+α*(r+γ*V(s′)-V(s)) 其中,π(a|s)是在第s个任务路段下选择第a个子任务的概率。Among them, π(a|s) is the probability of selecting the ath subtask under the sth task segment. 6.根据权利要求5所述的任务智能划分方法,其特征在于,使用softmax函数处理策略函数表中的概率。6. The task intelligent division method according to claim 5 is characterized in that a softmax function is used to process the probabilities in the strategy function table. 7.无人环卫车协同作业的任务智能划分设备,其特征在于,包括:7. An intelligent task division device for unmanned sanitation vehicles to work together, characterized by comprising: 聚类模块,用于利用聚类算法对所有的任务路段进行聚类,聚类数量为预设的子任务数量,得到初始的聚类中心;The clustering module is used to cluster all task sections using a clustering algorithm. The number of clusters is the preset number of subtasks to obtain the initial cluster center. 
7. A device for intelligent task division for collaborative operation of unmanned sanitation vehicles, characterized by comprising:

a clustering module, configured to cluster all task road sections by a clustering algorithm, the number of clusters being a preset number of subtasks, to obtain initial cluster centers;

an initialization module, configured to initialize a policy function table and a value function table; wherein the policy function table stores, for each task road section, the probability of selecting each subtask; and the value function table stores a value function for each task road section, the value function being the expected return obtainable when the corresponding subtask is selected;

a selection module, configured to select a corresponding subtask for each task road section according to the current policy function table;

a reward value calculation module, configured to calculate a reward value with a reward function according to the selection result and to evaluate the expected return of the selection result; wherein the reward function involves one or more of the following factors: the distance from a task road section to its corresponding cluster center, the optimal path cost of a subtask, the shortest distance from a task road section to the supply station, and the distance from a task road section to the warehouse;

an update module, configured to update the value function and the policy function of the current task road section according to the reward value and the value function of the next task road section, based on the Actor-Critic update rule; and

a repetition module, configured to repeatedly execute the methods of the selection module, the reward value calculation module and the update module, updating the policy function table and the value function table until an optimal subtask is assigned to each task road section.

8. The intelligent task division device according to claim 7, characterized in that an optimal path cost calculation unit uses a greedy algorithm to calculate the optimal path cost of a subtask in the reward function, the unit comprising:

an acquisition unit, configured to obtain all task road sections corresponding to the subtask and the connection relationships between those task road sections, and to treat each task road section as a node;

a greedy algorithm initialization unit, configured to initialize a Boolean list used to mark whether each node has been visited, to initialize the path length to 0, and to set the starting node to node 0;

an iterative search unit, configured to start from the current node, find the next unvisited node that is at the shortest distance, and mark the current node as visited; and to traverse all unvisited nodes, compute the distance from the current node to each unvisited node, and update the minimum distance and the next node, until the shortest path is found iteratively; and

a path update unit, configured to update the path length by adding the minimum distance to the path length.

9. The intelligent task division device according to claim 7, characterized in that the Euclidean distance is used to calculate the distance from a task road section to its corresponding cluster center.
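A minimal sketch of the clustering step performed by the clustering module (step S1 / claim 7), assuming plain k-means over 2-D representative points of the task road sections; scikit-learn and the chosen parameters are illustrative assumptions, since the claims only require "a clustering algorithm" with a preset number of subtasks:

import numpy as np
from sklearn.cluster import KMeans

def init_cluster_centers(segment_points, num_subtasks, seed=0):
    """Cluster all task road sections into the preset number of subtasks
    and return the initial cluster centers plus an initial assignment."""
    X = np.asarray(segment_points, dtype=float)   # one point per road section
    km = KMeans(n_clusters=num_subtasks, n_init=10, random_state=seed).fit(X)
    return km.cluster_centers_, km.labels_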
10. The intelligent task division device according to claim 7, characterized in that the reward function R is computed by a formula whose terms are defined as follows: x_i is the coordinate or feature vector of the i-th task road section; c_j is the coordinate of the j-th cluster center; P is the optimal path length of the subtask as calculated by the greedy algorithm; n is the number of task road sections of the current subtask; and the remaining two terms are the shortest distance from the i-th task road section to the supply station and the distance from the i-th task road section to the warehouse, respectively.

11. The intelligent task division device according to claim 7, characterized in that the update module updates the value function by the following formula:

V(s) ← V(s) + α*(r + γ*V(s′) − V(s))

where V(s) is the value function of the current task road section, V(s′) is the value function of the next task road section, r is the reward value, α is the learning rate, and γ is the discount factor;

and updates the policy function table by the following formula:

π(a|s) ← π(a|s) + α*(r + γ*V(s′) − V(s))

where π(a|s) is the probability of selecting the a-th subtask for the s-th task road section.

12. The intelligent task division device according to claim 11, characterized in that a softmax function is used to normalize the probabilities in the policy function table.

13. A storage medium, characterized in that the storage medium stores one or more computer-readable programs, the one or more programs comprising instructions adapted to be loaded from a memory and executed to perform the method for intelligent task division for collaborative operation of unmanned sanitation vehicles according to any one of claims 1 to 6.
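The formula image referenced in claims 4 and 10 is not reproduced in this text, so the sketch below only illustrates one plausible way to combine the listed terms (distance to the cluster center, greedy path length P over n road sections, distance to the supply station, distance to the warehouse) into a single penalty-style reward; the weights, the averaging of P, and the negative sign are assumptions, not the patented formula:

import math

def reward(segment_xy, center_xy, path_len_P, n_segments,
           dist_supply, dist_warehouse, w=(1.0, 1.0, 1.0, 1.0)):
    """Illustrative reward only: smaller distances and a shorter greedy path
    yield a larger (less negative) reward. The exact formula in the patent
    is not reproduced here."""
    d_center = math.dist(segment_xy, center_xy)        # Euclidean distance, claim 3
    return -(w[0] * d_center
             + w[1] * path_len_P / max(n_segments, 1)  # path cost normalised by n
             + w[2] * dist_supply
             + w[3] * dist_warehouse)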
CN202411038769.2A 2024-07-31 2024-07-31 Intelligent task division method for collaborative operation of unmanned sanitation vehicles Active CN118966684B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411038769.2A CN118966684B (en) 2024-07-31 2024-07-31 Intelligent task division method for collaborative operation of unmanned sanitation vehicles

Publications (2)

Publication Number Publication Date
CN118966684A 2024-11-15
CN118966684B 2025-06-20

Family

ID=93400440

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411038769.2A Active CN118966684B (en) 2024-07-31 2024-07-31 Intelligent task division method for collaborative operation of unmanned sanitation vehicles

Country Status (1)

Country Link
CN (1) CN118966684B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022120970A1 (en) * 2020-12-10 2022-06-16 中国科学院深圳先进技术研究院 Method and system for order dispatch based on interactive reinforcement learning
KR20230142202A (en) * 2022-04-01 2023-10-11 전북대학교산학협력단 Autonomous flight platform using actor-critic deep reinforcement learning-based target point estimation and collision avoidance technique for intelligent autonomous flight
CN115481987A (en) * 2022-10-24 2022-12-16 成都秦川物联网科技股份有限公司 Smart city street management method, internet of things system, device and storage medium
CN116611635A (en) * 2023-04-23 2023-08-18 暨南大学 Sanitation robot car scheduling method and system based on car-road cooperation and reinforcement learning
CN116494247A (en) * 2023-06-14 2023-07-28 西安电子科技大学广州研究院 Robotic arm path planning method and system based on deep deterministic policy gradient
CN117314049A (en) * 2023-09-01 2023-12-29 中国电子科技集团公司第五十四研究所 Satellite network intelligent resource scheduling method based on reinforcement learning
CN117553803A (en) * 2024-01-09 2024-02-13 大连海事大学 A multi-UAV intelligent path planning method based on deep reinforcement learning
CN118192577A (en) * 2024-03-29 2024-06-14 杭州智元研究院有限公司 Unmanned vehicle autonomous operation decision-making method and system based on reinforcement learning

Also Published As

Publication number Publication date
CN118966684B (en) 2025-06-20

Similar Documents

Publication Publication Date Title
Qin et al. Ride-hailing order dispatching at didi via reinforcement learning
CN110264062B (en) Distributed multi-AGV dynamic task allocation and path planning method and system thereof
Park et al. Schedulenet: Learn to solve multi-agent scheduling problems with reinforcement learning
Liu et al. Energy-efficient UAV crowdsensing with multiple charging stations by deep learning
Cheng et al. A modified ant colony system for solving the travelling salesman problem with time windows
Huang et al. Large scale real-time ridesharing with service guarantee on road networks
Luo et al. Improved shuffled frog leaping algorithm and its multi-phase model for multi-depot vehicle routing problem
CN113051815B (en) An Agile Imaging Satellite Mission Planning Method Based on Independent Pointer Network
CN112132312A (en) Path planning method based on evolution multi-objective multi-task optimization
Geng et al. An improved ant colony optimization algorithm for nonlinear resource-leveling problems
Feng et al. Flexible job shop scheduling based on deep reinforcement learning
CN118857324B (en) Path planning method for collaborative operation of unmanned sanitation vehicles
CN114237222A (en) Method for planning route of delivery vehicle based on reinforcement learning
Carić et al. A modelling and optimization framework for real-world vehicle routing problems
CN113848970A (en) Multi-target collaborative path planning method for vehicle and unmanned aerial vehicle
Lin et al. Development of new features of ant colony optimization for flowshop scheduling
CN115293623A (en) Training method and device for production scheduling model, electronic equipment and medium
Foa et al. Solving the vehicle routing problem with deep reinforcement learning
CN117073684A (en) An urban vehicle navigation method based on multi-task ant colony algorithm
CN118966684B (en) Intelligent task division method for collaborative operation of unmanned sanitation vehicles
WO2021252683A1 (en) Systems and methods for controlling automated systems using integer programming and column generation techniques
Zhang et al. CT-ACO-hybridizing ant colony optimization with cyclic transfer search for the vehicle routing problem
Chen et al. Mcaaco: a multi-objective strategy heuristic search algorithm for solving capacitated vehicle routing problems
CN117492446A (en) A multi-agent cooperative planning method and system based on combinatorial hybrid optimization
Haghani et al. Integer programming for multi-robot planning: A column generation approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant