CN116703641A - Flexible multi-resource workshop dynamic scheduling method based on deep reinforcement learning - Google Patents
Flexible multi-resource workshop dynamic scheduling method based on deep reinforcement learning
- Publication number: CN116703641A
- Application number: CN202310703139.1A
- Authority: CN (China)
- Prior art keywords: workpiece, time, workshop, machine, decision
- Prior art date: 2023-06-14
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G06Q50/04 — Manufacturing (ICT specially adapted for implementation of business processes of specific business sectors)
- G06N3/045 — Combinations of networks (neural network architectures; computing arrangements based on biological models)
- G06N3/092 — Reinforcement learning (neural network learning methods)
- G06Q10/06313 — Resource planning in a project environment (operations research; resource planning, allocation, distribution or scheduling for enterprises or organisations)
Description
Technical Field
The present invention relates to a flexible multi-resource workshop dynamic scheduling method based on deep reinforcement learning, and belongs to the technical field of dynamic scheduling.
Background Art
The job shop problem is one of the earliest and most widely studied production scheduling problems. Briefly: a workshop holds a set of workpieces, each with a fixed process route; every operation on a route must be performed on a machine, and the operations assigned to each machine and their processing sequence must be determined so as to meet one or more production objectives. The flexible multi-resource workshop scheduling problem is a more general case of this problem: compared with the job shop problem, it adds multi-resource constraints and relaxes the machine-assignment constraint. "Multi-resource" means that, besides workpieces and machine tools, transport resources such as Automated Guided Vehicles (AGVs) must also be considered; with the machine-assignment constraint relaxed, each operation of a workpiece may be processed on any one of several candidate machines. As flexible workshops have developed, workshop scheduling algorithms have been continuously enriched and extended.

At present, traditional scheduling methods fall into two main types: rule-based methods and heuristic algorithms. Dispatching rules can be simple or composite. Simple rules include first come first served, shortest processing time first, earliest due date first and shortest remaining processing time first; composite rules are combinations of several simple rules. At each scheduling moment, a rule-based method selects one workpiece for processing on one machine and one AGV to transport it. Such methods respond quickly and in real time, but they are inefficient over a long horizon and do not achieve good long-term results.

Commonly used heuristic algorithms include the Genetic Algorithm (GA), Particle Swarm Optimization (PSO), Simulated Annealing (SA) and Ant Colony Optimization (ACO). Heuristics search efficiently and can obtain a good solution within a limited time, but when facing dynamic problems they can only decompose the dynamic problem into a sequence of static sub-problems: an initial schedule is computed at the start of scheduling, and at every rescheduling moment the unscheduled resources are recomputed to obtain a new schedule. This requires a large amount of computation time and offers poor real-time performance, which makes it hard to apply to multi-resource scheduling problems in a changing environment.

An algorithm is therefore needed that balances responsiveness and optimization quality. In recent years, learning-based methods have gradually been applied to this problem. Q-learning is the most commonly used: it stores the Q-function in a table whose rows are states, whose columns are actions, and whose entries are the Q-values of the corresponding state-action pairs; at each scheduling moment, the action with the highest value in the current state is simply looked up in the table. In a real workshop, however, the state space is far too complex for building and maintaining such a table to be practical: it would not only reduce the efficiency of the algorithm but also introduce many useless states.
Summary of the Invention
To solve the problem that existing techniques cannot balance efficiency and performance, the main purpose of the present invention is to provide a flexible multi-resource workshop dynamic scheduling method based on deep reinforcement learning. The method defines the workshop state, the workshop scheduling actions and the decision reward function, and constructs the architecture and learning procedure of a neural network used for decision-making. At each decision moment of the workshop, the real-time production state is computed and fed into the trained neural network, which outputs a scheduling action according to the input state; the action is executed and the resulting reward is fed back to the network. In this way the total delay of the workpieces in the workshop is reduced and efficient production is achieved.

The purpose of the present invention is achieved through the following technical solution.

The flexible multi-resource workshop dynamic scheduling method based on deep reinforcement learning disclosed by the present invention comprises the following steps:

Step 1: Taking minimization of the total delay time of the workpieces as the optimization objective, construct the scheduling optimization problem, which determines the set of machines assigned to each operation, the processing sequence on each machine, and the AGV that transports each workpiece.

Step 2: Solve the scheduling optimization problem constructed in Step 1 with the D3QN (Dueling Double DQN) method to obtain a decision result. The decision result comprises the workpiece to be processed at the current time, the machine that processes it, and the AGV that transports it. Executing the decision result reduces the total delay time of all workpieces and improves the production efficiency of the workshop.
Step 2.1: The D3QN algorithm has two neural networks, the Main Q Network and the Target Q Network, with identical structures: an input layer, hidden layers and a dual-output subnet. The input dimension of the input layer equals the dimension of the workshop state; the hidden part uses one or more fully connected layers. The dual-output subnet comprises a state-value network Vn, which evaluates the value of the state s, and an advantage network An, which evaluates the relative merit of each action a in the current state. The value output Q of the network is obtained as the linear combination of Vn and An:

Q(s, a; ω, α, β) = Vn(s; ω, α) + An(s, a; ω, β) − (1/|A|) Σ_{a'∈A} An(s, a'; ω, β)

where ω denotes the parameters of the shared part of the dual-output subnet, α and β denote the parameters of the state-value network Vn and of the advantage network An respectively, and A is the set of all actions, a' ∈ A.
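By way of illustration, the following is a minimal PyTorch sketch of this dueling architecture. It is a sketch only: the class name, the two hidden layers and their sizes are assumptions chosen for illustration; the 14-dimensional input and 30 actions match the state features and composite dispatching rules defined in Steps 2.6 and 2.7 below.

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Sketch of the dual-output architecture of Step 2.1.
    state_dim=14 matches the workshop state; action_dim=30 matches
    the composite dispatching rules. Layer sizes are illustrative."""
    def __init__(self, state_dim=14, action_dim=30, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.value_head = nn.Linear(hidden, 1)              # Vn: value of state s
        self.advantage_head = nn.Linear(hidden, action_dim) # An: advantage of each action

    def forward(self, s):
        h = self.backbone(s)
        v = self.value_head(h)       # shape (batch, 1)
        a = self.advantage_head(h)   # shape (batch, action_dim)
        # Combine V and A; subtracting the mean advantage keeps Q identifiable
        return v + a - a.mean(dim=1, keepdim=True)
```

The Target Q Network would be a second instance of the same class, updated by copying the Main Q Network's weights as described in Step 2.14.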
Step 2.2: Parameter setting. Set the parameters relevant to the flexible multi-resource workshop dynamic scheduling problem, including the hyperparameters and the neural-network parameters.

Step 2.3: Initialize the training episode counter episode = 0, the current time t = 0, and the state of the workshop resources.

Step 2.4: Judge whether the current episode has ended; if so, start the next episode and re-initialize the time and the workshop resource state, otherwise continue. The episode ends when all workpieces have finished processing.

Step 2.5: Judge whether the current moment is a decision point. A decision point is a moment at which there is a newly arrived workpiece, or a workpiece that has finished its previous operation and is waiting to be processed. If it is a decision point, make a decision and proceed to Step 2.6; otherwise jump to Step 2.15.
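A minimal sketch of the decision-point test of Step 2.5 is given below; the attribute names (`arrival_time`, `finished_previous_op`, `done`) are illustrative assumptions about how the simulation tracks workpieces.

```python
def is_decision_point(workshop, t):
    """A decision is made when a new workpiece has just arrived, or a
    workpiece has finished its previous operation and is waiting."""
    new_arrivals = any(j.arrival_time == t for j in workshop.jobs)
    ready_jobs = any(j.finished_previous_op and not j.done for j in workshop.jobs)
    return new_arrivals or ready_jobs
```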
Step 2.6: Compute the three classes of state features of the current workshop: the state of the arrived workpieces, the state of the machines and the state of the AGVs. Together the three classes comprise 14 features, as follows:

(1) Average completion rate of the arrived workpieces:

f1 = mean{jcr}

with workpiece completion rate jcr_i = OC_i / OP_i, where OP_i is the total number of operations of workpiece J_i and OC_i is the number of operations of J_i already completed;

(2) Standard deviation of the completion rate of the arrived workpieces:

f2 = std{jcr}

(3) Average completion rate of the arrived operations:

f3 = Σ_i OC_i / Σ_i OP_i

(4) Average transport waiting rate of the arrived operations:

f4 = mean{otwr}

with transport waiting rate of an operation otwr_ij = wtt_ij / mean_r{tt_ijr}, where wtt_ij is the transport waiting time of operation O_ij, tt_ijr is the transport time of O_ij on AGV A_r, and O_i = {O_i1, O_i2, ..., O_ij} is the operation set of workpiece J_i;

(5) Average processing waiting rate of the arrived operations:

f5 = mean{opwr}

with processing waiting rate opwr_ij = wpt_ij / mean_{M_k∈M_ij}{pt_ijk}, where wpt_ij is the processing waiting time of operation O_ij, pt_ijk is the processing time of O_ij on machine M_k, and M_ij is the set of machines available for O_ij;

(6) Average expected tardiness rate of the current operations of the arrived workpieces:

f6 = mean{cotr}

with expected tardiness rate of the current operation cotr_i = (ct_ic − dt_ic) / (C_i − at_i), where ct_ic is the completion time of operation O_ic, dt_ic is the operation due date of O_ic, C_i is the due date of workpiece J_i, O_ic is the operation of J_i currently being processed, and at_i is the arrival time of J_i;

(7) Standard deviation of the expected tardiness rate of the arrived workpieces:

f7 = std{cotr}

(8) Maximum expected tardiness rate of the arrived workpieces:

f8 = max{cotr}

(9) Average machine utilization:

f9 = mean{mur}

with machine utilization mur_k = RT_k / t, where RT_k is the cumulative working time of machine M_k and t is the current time;

(10) Standard deviation of machine utilization:

f10 = std{mur}

(11) Average remaining workload of the machines:

f11 = (1/m) Σ_k Σ_i Σ_j rpt_ij · x_ijk

where rpt_ij is the remaining processing time of operation O_ij, x_ijk indicates whether O_ij is processed on machine M_k, and m is the number of machines;

(12) Average AGV utilization:

f12 = mean{aur}

with AGV utilization aur_r = RT_r / t, where a is the number of AGVs and RT_r is the cumulative working time of AGV A_r;

(13) Standard deviation of AGV utilization:

f13 = std{aur}

(14) Average remaining task load of the AGVs:

f14 = (1/a) Σ_r Σ_i Σ_j rtt_ij · y_ijr

where rtt_ij is the remaining transport time of operation O_ij and y_ijr indicates whether O_ij is transported by AGV A_r.

The workshop state finally fed into the neural network is:

s = {f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11, f12, f13, f14}
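As an illustration, here is a minimal NumPy sketch that computes a subset of these features (f1, f2 and f9–f14). The dictionary field names (`completed_ops`, `busy_time`, `remaining_work`, `remaining_transport`) are assumptions about the simulation's bookkeeping, and the waiting-rate and tardiness-rate features f3–f8 are elided for brevity.

```python
import numpy as np

def workshop_state(jobs, machines, agvs, t):
    """Sketch of part of the 14-dimensional state of Step 2.6.
    `jobs`, `machines` and `agvs` are assumed lists of dicts."""
    jcr = np.array([j["completed_ops"] / j["total_ops"] for j in jobs])
    mur = np.array([m["busy_time"] / t for m in machines]) if t > 0 else np.zeros(len(machines))
    aur = np.array([g["busy_time"] / t for g in agvs]) if t > 0 else np.zeros(len(agvs))
    f = [jcr.mean(), jcr.std()]                               # f1, f2
    # ... f3-f8 (operation completion, waiting and tardiness rates) elided ...
    f += [mur.mean(), mur.std()]                              # f9, f10
    f += [np.mean([m["remaining_work"] for m in machines])]   # f11
    f += [aur.mean(), aur.std()]                              # f12, f13
    f += [np.mean([g["remaining_transport"] for g in agvs])]  # f14
    return np.array(f, dtype=np.float32)
```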
Step 2.7: Make a decision. Generate a random number between 0 and 1; if it is less than or equal to the current exploration rate, select a composite dispatching rule at random; otherwise feed the current workshop state into the Main Q Network and select the composite dispatching rule with the highest value.

The composite dispatching rules are a set of rules generated by combining three classes of dispatching rules; each individual composite rule can directly select the corresponding workpiece, machine and AGV.

The three classes of dispatching rules are: rules for selecting the workpiece to process, rules for selecting the processing machine, and rules for selecting the transport AGV.
Rules for selecting the workpiece to process:

(1) Select the workpiece whose current operation has the shortest mean processing time:

J_d = argmin_{J_i∈L_t} (1/am_ic) Σ_{M_k∈M_ic} pt_ick

where J_d is the selected workpiece, L_t is the set of workpieces awaiting scheduling at the current time t, pt_ick is the processing time of operation O_ic on machine M_k, M_ic is the set of machines available for O_ic, and am_ic is the number of machines available for O_ic;

(2) Select the workpiece whose current operation has the longest mean processing time:

J_d = argmax_{J_i∈L_t} (1/am_ic) Σ_{M_k∈M_ic} pt_ick

(3) Select the workpiece with the shortest remaining processing time:

J_d = argmin_{J_i∈L_t} Σ_{O_ij remaining} (1/am_ij) Σ_{M_k∈M_ij} pt_ijk

where M_ij is the set of machines available for operation O_ij and am_ij is the number of machines available for O_ij;

(4) Select the workpiece with the longest remaining processing time:

J_d = argmax_{J_i∈L_t} Σ_{O_ij remaining} (1/am_ij) Σ_{M_k∈M_ij} pt_ijk

(5) Select the workpiece with the earliest due date:

J_d = argmin_{J_i∈L_t} D_i

where D_i is the due date of workpiece J_i.

Rules for selecting the processing machine:

(1) Among the available machines, select the one with the least remaining working time:

M_d = argmin_{M_k∈M_ic} mrt_k

where M_d is the selected machine and mrt_k is the remaining working time of machine M_k;

(2) Among the available machines, select the one with the lowest utilization:

M_d = argmin_{M_k∈M_ic} mur_k

(3) Among the available machines, select the fastest one:

M_d = argmin_{M_k∈M_ic} pt_ick

Rules for selecting the transport AGV:

(1) Select the AGV with the least remaining transport time:

A_d = argmin_{A_r} art_r

where A_d is the selected AGV and art_r is the remaining working time of AGV A_r;

(2) Select the AGV with the lowest utilization:

A_d = argmin_{A_r} aur_r

The five workpiece-selection rules, three machine-selection rules and two AGV-selection rules above are combined by permutation to generate thirty composite dispatching rules, as sketched below.
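A minimal sketch of how the thirty composite rules and the ε-greedy decision of Step 2.7 might be implemented follows; the short rule names and the calling convention of `main_q_network` (one value per rule) are illustrative assumptions.

```python
import itertools
import random

JOB_RULES = ["SPT", "LPT", "SRPT", "LRPT", "EDD"]       # five workpiece-selection rules
MACHINE_RULES = ["MinRemaining", "MinUtil", "Fastest"]  # three machine-selection rules
AGV_RULES = ["MinRemaining", "MinUtil"]                 # two AGV-selection rules

# 5 x 3 x 2 = 30 composite dispatching rules (the action space)
COMPOSITE_RULES = list(itertools.product(JOB_RULES, MACHINE_RULES, AGV_RULES))

def select_rule(state, main_q_network, epsilon):
    """Epsilon-greedy decision: explore with probability epsilon,
    otherwise pick the rule with the highest predicted Q-value."""
    if random.random() <= epsilon:
        return random.randrange(len(COMPOSITE_RULES))
    q_values = main_q_network(state)  # assumed to return one Q-value per rule
    return int(q_values.argmax())
```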
Step 2.8: Execute the dispatching rule. Use the composite rule decided upon to select the corresponding workpiece, machine and AGV.

Step 2.9: Compute the workshop state features after executing the dispatching rule, using the method of Step 2.6.

Step 2.10: Compute the reward. The optimization objective is to minimize the Total Delay Time (TDT), so the reward function must be designed so that the trend of maximizing the reward coincides with the optimization trend of the objective. Concretely: if the expected completion time of the operation of the workpiece selected by the decision is no later than the due date of that operation, the reward is 0; otherwise the reward is the negative of the difference between the expected completion time and the operation due date:
TDT = Σ_i max(0, C_i − D_i)

r_ij = −max(0, ct_ij − dt_ij)

where TDT is the total delay, C_i is the completion time of workpiece J_i, D_i is the due date of J_i, r_ij is the reward obtained from this decision, ct_ij is the completion time of operation O_ij, and dt_ij is the due date of operation O_ij.
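A one-function sketch of this reward is given below; both argument names are illustrative. Taking the reward as the negative tardiness of the selected operation means that maximizing the cumulative return minimizes the total delay time.

```python
def step_reward(expected_completion, op_due_date):
    """Reward of Step 2.10 (sketch): zero when the operation is expected
    to finish on time, otherwise the negative of its tardiness."""
    return -max(0.0, expected_completion - op_due_date)
```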
Step 2.11: Judge whether the current decision is the last decision of the episode; if so, done = 1, otherwise done = 0.

Step 2.12: Store each scheduling experience in the experience pool in the form (s_t, a_t, s_{t+1}, r, done), providing experience for the subsequent learning and training of the neural network. Here s_t is the workshop state before the decision, a_t is the dispatching rule decided upon, r is the reward obtained from the decision, s_{t+1} is the workshop state after executing the decision, and done indicates whether this decision is the last one.

Step 2.13: Experience replay. If the number of experiences stored in the pool exceeds the minimum batch size, randomly draw one batch of experiences from the pool for training and updating the neural network; otherwise do nothing.
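A minimal sketch of such an experience pool follows, assuming the capacity of 4000 and batch size of 32 used in the embodiment below; the class and method names are illustrative.

```python
import random
from collections import deque

class ReplayBuffer:
    """Sketch of the experience pool of Steps 2.12-2.13."""
    def __init__(self, capacity=4000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are evicted

    def store(self, s, a, s_next, r, done):
        self.buffer.append((s, a, s_next, r, done))

    def sample(self, batch_size=32):
        # Uniform sampling U(B); caller should check len(buffer) >= batch_size
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```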
Step 2.14: Update the Main Q Network of D3QN. Feed the state s_t into the Main Q Network to estimate its Q-value; feed the state s_{t+1} into the Target Q Network and combine it with the reward r to obtain the target value Y. The parameters of the Main Q Network are updated by gradient descent, while the parameters of the Target Q Network are obtained by copying the parameters of the Main Q Network at fixed intervals: every twenty decisions, the weights of the Target Q Network are updated.

Y = r + γ Q_t(s_{t+1}, argmax_{a_m} Q(s_{t+1}, a_m; ω, α, β); ω_t, α_t, β_t) · (1 − done)

L = E_{(s_t, a_t, s_{t+1}, r, done) ~ U(B)} [(Y − Q(s_t, a_t; ω, α, β))²]

where E[·]² denotes the mean-squared error, (s_t, a_t, s_{t+1}, r, done) ~ U(B) denotes a sample drawn uniformly from the experience pool B, γ is the discount factor, Q_t(·; ω_t, α_t, β_t) is the Target Q Network, and argmax_{a_m} Q(s_{t+1}, a_m; ω, α, β) is the action a_m of greatest value in the next state s_{t+1} as evaluated by the Main Q Network.
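The following is a minimal PyTorch sketch of this update, not the patented implementation itself: the Double-DQN target uses the Main Q Network to choose the next action and the Target Q Network to evaluate it, the loss is the mean-squared error above, and the target weights are hard-copied every twenty decisions. Tensor shapes and the optimizer are assumptions.

```python
import torch
import torch.nn.functional as F

def d3qn_update(main_q, target_q, optimizer, batch, gamma=0.99, step=0):
    """One gradient step of Step 2.14 (sketch).
    Assumed shapes: s, s_next (B, 14); a long (B,); r, done float (B,)."""
    s, a, s_next, r, done = batch
    q_sa = main_q(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s_t, a_t)
    with torch.no_grad():
        a_max = main_q(s_next).argmax(dim=1, keepdim=True)  # argmax_a Q(s_{t+1}, a)
        y = r + gamma * target_q(s_next).gather(1, a_max).squeeze(1) * (1 - done)
    loss = F.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 20 == 0:  # periodic hard update of the Target Q Network
        target_q.load_state_dict(main_q.state_dict())
```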
Step 2.15: Update the states of the workshop workpieces, machines and AGVs, and advance the current time to t = t + 1.

Step 2.16: Judge whether the number of training episodes has been reached. If training is complete, save the parameters of the neural network to obtain the trained network; otherwise jump back to Step 2.4.

Step 2.17: Feed the current state of the workshop into the trained neural network obtained in Step 2.16 to obtain the decision result. The decision result comprises the workpiece to be processed at the current time, the machine that processes it and the AGV that transports it. Executing the decision result reduces the total delay time of all workpieces and improves the production efficiency of the workshop.
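At deployment, the trained Main Q Network is queried greedily, with exploration disabled. A minimal sketch is given below, reusing the helpers assumed in the earlier sketches (`workshop_state`, `COMPOSITE_RULES`) together with an assumed `workshop.execute` method.

```python
import torch

def dispatch(main_q, workshop):
    """Sketch of the deployed decision loop of Step 2.17: compute the current
    state, query the trained network, and execute the chosen composite rule."""
    s = torch.as_tensor(workshop_state(workshop.jobs, workshop.machines,
                                       workshop.agvs, workshop.t)).unsqueeze(0)
    with torch.no_grad():
        rule = COMPOSITE_RULES[int(main_q(s).argmax())]
    workshop.execute(rule)  # selects the workpiece, machine and AGV
    return rule
```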
Beneficial effects:

1. The method for computing the real-time production state of the workshop disclosed by the present invention designs separate state features for the workpieces, machines and AGVs; through these feature values the production state of the workshop can be grasped quickly, supporting workshop decision-making.

2. The set of composite workshop dispatching rules disclosed by the present invention combines single-resource dispatching rules into composite rules. Each individual composite rule can be used directly for workshop decisions, directly selecting the corresponding workpiece, machine and AGV; the variety of composite rules can be chosen according to the situation, improving the flexibility of scheduling.

3. The reward function disclosed by the present invention reduces the total delay time of the workpieces by minimizing the delay of every operation.

4. The flexible multi-resource workshop dynamic scheduling method based on deep reinforcement learning disclosed by the present invention computes the workshop state features, feeds the state values into the neural network, decides the corresponding composite dispatching rule with the D3QN algorithm, executes the rule and computes the reward function to obtain the reward. It thereby realizes dynamic scheduling of the flexible multi-resource workshop, reduces the total delay time of the workpieces to a certain extent and improves the production efficiency of the workshop.
Brief Description of the Drawings

Fig. 1 shows the D3QN reward curve;

Fig. 2 is a flow chart of the flexible multi-resource workshop dynamic scheduling method based on deep reinforcement learning disclosed by the present invention.
Detailed Description of the Embodiments

The present invention is described in detail below with reference to the drawings and an embodiment, together with the technical problem solved by the technical solution and its beneficial effects. It should be noted that the described embodiment is intended only to aid understanding of the invention and does not limit it in any way.

The flexible job-shop scheduling problem of this embodiment is as follows. The workshop has 10 machines for processing workpieces and 7 Automated Guided Vehicles (AGVs) for transport tasks. A batch of 60 workpieces is to be processed: 15 are present at the initial moment and the remaining 45 arrive over time. Each workpiece has a fixed process route consisting of several operations that must be executed one at a time in a given order, and the transport of workpieces is performed by the AGVs. Initially the AGVs start from the warehouse, where the workpieces are also located. As time passes, subsequently arriving workpieces enter the workshop; their arrival times, process routes, processing times, candidate machines and other information are not known in advance, but only once the workpieces have entered the workshop.
The workpiece information is shown in the following table:

Table: Workpiece information

where dt_ij' is the due date of operation O_ij'.

The locations of the warehouse and the machines are shown in the following table:

Table: Workshop facility locations
From the locations of the workshop facilities, the travel distance and travel time between any two facilities can be computed:

t_kk' = d_kk' = |x_k − x_k'| + |y_k − y_k'|

where t_kk' and d_kk' are the AGV travel time and the distance between machines M_k and M_k' respectively, and (x_k, y_k) are the position coordinates of machine M_k.
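A trivial sketch of this travel-time computation follows, assuming (as the formula implies) unit AGV speed, so that travel time equals Manhattan distance; the function name and argument layout are illustrative.

```python
def travel_time(pos_k, pos_k2):
    """t_kk' = d_kk' = |x_k - x_k'| + |y_k - y_k'| (Manhattan metric)."""
    (xk, yk), (xk2, yk2) = pos_k, pos_k2
    return abs(xk - xk2) + abs(yk - yk2)

# e.g. travel_time((0, 0), (3, 4)) == 7
```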
To distinguish the above scheduling problem from other scheduling problems that consider different factors, the following assumptions are made for the flexible multi-resource workshop dynamic scheduling method based on deep reinforcement learning disclosed by the present invention:

A. The processing times of the operations and the travel times of the AGVs are deterministic;

B. Setup time is included in the operation processing time, and the loading and unloading times of workpieces are included in the AGV transport time;

C. Processing of a workpiece on a machine is continuous and cannot be preempted by other workpieces; transport of a workpiece on an AGV is likewise continuous and non-preemptive;

D. Each machine processes at most one workpiece at a time, and each AGV transports at most one workpiece at a time;

E. All AGVs have identical performance, so any one of them may be chosen for a given transport task;

F. Each machine and AGV has a buffer large enough to hold waiting workpieces;
G. Machine failures, machine shutdowns, low AGV battery, AGV failures and AGV congestion are not considered.

The method of the present invention is applied to this instance following Steps 1 and 2 above, with minimization of the total delay time of the workpieces as the optimization objective. The Main Q Network and the Target Q Network have the structure described in Step 2.1: an input layer followed by three fully connected layers and an output layer, with 256 nodes in each fully connected layer and ReLU as the activation function. The hyperparameters are set as follows: the size of the experience pool is 4000, the sampling batch size is 32, the discount factor is 0.99, the initial exploration rate is 1, the decay rate of the exploration rate is 0.999, the minimum exploration rate is 0.04, the learning rate is 0.00005, and the number of training episodes is 5000. Steps 2.3 through 2.17 — the training loop, the 14 workshop state features, the thirty composite dispatching rules, the reward function and the D3QN network update — are then executed exactly as described above.
Calculation results and analysis:

Under the conditions of the above embodiment, the reward curve obtained by the proposed method is shown in Fig. 1. The reward curve rises steadily as the number of training episodes increases and finally converges stably. The effect of the D3QN algorithm is compared with that of the thirty composite dispatching rules designed above, as shown in the following table:

Table: Comparison of algorithm performance

Analysis of the table shows that, on the problem of this embodiment, the D3QN algorithm improves on the best composite rule by about 19%. The D3QN algorithm therefore clearly outperforms the composite rules and effectively reduces the total delay; for the flexible multi-resource workshop scheduling problem of this embodiment, it balances responsiveness and optimization quality.

In analyzing the scheduling of the multiple resources of a flexible multi-resource job shop — workpieces, machines and AGVs — the present invention takes into account process-flexibility constraints and the dynamic event of urgent order insertion, proposes a scheduling method based on deep reinforcement learning, and optimizes the allocation of AGV transport tasks and machine processing tasks. When solving such problems it converges quickly and yields stable results, offers certain advantages over traditional algorithms, improves production efficiency and reduces the total delay.

The specific description above further explains the purpose, technical solution and beneficial effects of the invention in detail. It should be understood that the above is merely a specific embodiment of the present invention and is not intended to limit its scope of protection; any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (1)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310703139.1A | 2023-06-14 | 2023-06-14 | Flexible multi-resource workshop dynamic scheduling method based on deep reinforcement learning |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN116703641A (en) | 2023-09-05 |
Family
ID=87825294
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310703139.1A | Flexible multi-resource workshop dynamic scheduling method based on deep reinforcement learning | 2023-06-14 | 2023-06-14 |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN116703641A (en) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118966715A (en) * | 2024-10-15 | 2024-11-15 | 贵州师范大学 | Dynamic intelligent scheduling method for multi-objective flexible workshop in industrial green microgrid |
| WO2025236661A1 (en) * | 2024-05-11 | 2025-11-20 | 浙江大学 | Workshop scheduling method and apparatus considering skill levels and fatigue degrees of workers |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |