
CN111917642B - SDN network intelligent routing data transmission method based on distributed deep reinforcement learning - Google Patents

SDN network intelligent routing data transmission method based on distributed deep reinforcement learning

Info

Publication number
CN111917642B
CN111917642B
Authority
CN
China
Prior art keywords
network
actor
parameters
local
evaluator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN202010673851.8A
Other languages
Chinese (zh)
Other versions
CN111917642A (en)
Inventor
刘宇涛
崔金鹏
章小宁
贺元林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202010673851.8A priority Critical patent/CN111917642B/en
Publication of CN111917642A publication Critical patent/CN111917642A/en
Application granted granted Critical
Publication of CN111917642B publication Critical patent/CN111917642B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00Routing or path finding of packets in data switching networks
    • H04L45/12Shortest path evaluation
    • H04L45/124Shortest path evaluation using a combination of metrics
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00Routing or path finding of packets in data switching networks
    • H04L45/12Shortest path evaluation
    • H04L45/121Shortest path evaluation by minimising delays
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00Routing or path finding of packets in data switching networks
    • H04L45/12Shortest path evaluation
    • H04L45/125Shortest path evaluation based on throughput or bandwidth

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses an SDN network intelligent routing data transmission method based on distributed deep reinforcement learning. The method computes routing paths quickly, maximizes throughput while meeting delay requirements, and addresses the slow convergence and low throughput of traditional algorithms. The reinforcement learning algorithm reduces route computation to a simple input-output mapping and avoids the repeated iterations of conventional computation, so routing paths are calculated quickly; the faster routing decision lowers forwarding delay, so packets that would otherwise be dropped when their TTL expires are more likely to survive and be forwarded successfully, which increases network throughput. The method comprises an offline training stage and an online training stage, and updates its parameters in a dynamic environment to select the optimal path, so it adapts to topology changes.

Description

SDN Network Intelligent Routing Data Transmission Method Based on Distributed Deep Reinforcement Learning

Technical Field

The invention belongs to the field of data transmission, and in particular relates to an SDN network intelligent routing data transmission method based on distributed deep reinforcement learning.

Background Art

Information technology has now entered a mature stage. In the SDN (Software Defined Network) architecture, data flows are flexible and controllable, and the controller has a network-wide view and can sense network state changes (such as traffic distribution, congestion and link utilization) in real time. In practice, the routing problem is usually solved with shortest-path algorithms: a few simple network parameters (such as hop count or delay) serve as the optimization metrics, and the path with the fewest hops or the smallest delay is taken as the final objective. Such a single metric and optimization goal easily leads to congestion on a few key links and to an unbalanced network load. Although a Lagrangian-relaxation-based shortest-path algorithm can find an optimal path under multiple constraints when allocating paths for multiple services, this kind of heuristic routing algorithm must go through many iterations to compute the optimal path, so its convergence is slow, its timeliness is poor and its throughput is low.

Summary of the Invention

In view of the above deficiencies in the prior art, the SDN network intelligent routing data transmission method based on distributed deep reinforcement learning provided by the present invention solves the above problems in the prior art.

To achieve the above purpose of the invention, the technical solution adopted by the present invention is an SDN network intelligent routing data transmission method based on distributed deep reinforcement learning, comprising the following steps:

S1. Construct a reward function and a deep reinforcement learning model comprising an actor network and an evaluator network, and deploy the deep reinforcement learning model in the application layer of the SDN network;

S2. Randomly initialize the actor network parameters θ_a and the evaluator network parameters θ_c of the deep reinforcement learning model;

S3. Randomly initialize the local actor parameters θ'_a of the actor network and the local evaluator parameters θ'_c of the evaluator network on the i-th local GPU_i in the control layer of the SDN network;

S4. According to the reward function, the actor network parameters θ_a, the evaluator network parameters θ_c, the local actor parameters θ'_a and the local evaluator parameters θ'_c, train the deep reinforcement learning model on the i-th local GPU_i offline with the A3C algorithm, and update the actor network parameters θ_a and the evaluator network parameters θ_c;

S5. Apply the updated actor network parameters θ_a and the updated evaluator network parameters θ_c to the whole SDN network, and transmit data using the SDN network with the updated parameters;

S6. Periodically detect whether the topology of the SDN network has changed; if so, go to step S7, otherwise repeat step S6;

S7. Train the deep reinforcement learning model online, update the actor network parameters θ_a and the evaluator network parameters θ_c with the adaptive running algorithm, apply the actor network parameters θ_a and the evaluator network parameters θ_c to the whole SDN network, and transmit data using the SDN network with the updated parameters;

where i = 1, 2, ..., L, and L is the total number of local GPUs.

Further, in step S1 the actor network is a fully connected neural network, and the evaluator network is a combination of a fully connected neural network and a CNN convolutional neural network. The inputs of both the actor network and the evaluator network include the network state of the SDN network, which comprises the current node information, the destination node information, the bandwidth requirement and the delay requirement; the input of the evaluator network additionally includes the network features of the SDN network extracted by the CNN convolutional neural network. The CNN convolutional neural network comprises an input layer, a convolution layer, a pooling layer, a fully connected layer and an output layer connected in sequence.
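
By way of illustration only, the following Python sketch shows one possible way to encode the network state described above (current node, destination node, bandwidth requirement and delay requirement) as a fixed-length input vector for the actor and evaluator networks; the encoding, the normalization constants and the function names are assumptions of this sketch and are not specified by the invention.

# Illustrative sketch (not from the patent): one way to encode the network state
# described above as a fixed-length input vector for the actor/evaluator networks.
import numpy as np

def encode_state(current_node: int, dest_node: int,
                 bandwidth_req: float, delay_req: float,
                 num_nodes: int) -> np.ndarray:
    """One-hot encode the current and destination nodes and append the
    bandwidth/delay requirements (normalization constants are assumptions)."""
    cur = np.zeros(num_nodes); cur[current_node] = 1.0
    dst = np.zeros(num_nodes); dst[dest_node] = 1.0
    # Hypothetical normalization: the patent does not specify units or scaling.
    reqs = np.array([bandwidth_req / 1000.0, delay_req / 100.0])
    return np.concatenate([cur, dst, reqs]).astype(np.float32)

# Example: node 3 forwarding towards node 7 in a 10-node topology.
state = encode_state(3, 7, bandwidth_req=200.0, delay_req=20.0, num_nodes=10)
print(state.shape)  # (22,)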

Further, the reward function in step S1 is:

r(s_n, a_n) = f(g, a_1, a_2, c(n), c(m), c(l), d(n), d(m))   (the exact expression is given only as an equation image in the original document)

where r(s_n, a_n) denotes the reward obtained after the n-th routing node in the SDN network, in state s_n, performs action a_n and forwards to the m-th routing node; g denotes the action penalty, a_1 the first weight and a_2 the second weight; c(n) denotes the remaining capacity of the n-th routing node, c(m) the remaining capacity of the m-th routing node, and c(l) the remaining capacity of the l-th link in the SDN network; d(n) denotes the degree of difference between the traffic load of the n-th routing node and that of its neighboring nodes, and d(m) denotes the same quantity for the m-th routing node. The state s_n comprises: the node where the data packet currently resides (the n-th routing node), the final destination node of the data packet, the forwarding bandwidth requirement of the data packet and the delay requirement of the data packet. The action a_n represents all forwarding operations that can be taken in state s_n.

Further, step S4 comprises the following sub-steps:

S41. Set the first counter t = 0, the second counter T = 0, the maximum number of iterations T_max and the routing hop limit t_max;

S42. Set dθ_a = 0 and dθ_c = 0, and synchronize the local parameters with the global parameters: set the local actor parameters θ'_a to the value of the actor network parameters θ_a, and set the local evaluator parameters θ'_c to the value of the evaluator network parameters θ_c;

S43. Set the first intermediate count value t_start = t, and read the current state s_t through the local GPU_i;

S44. Obtain the policy π(a_t|s_t; θ'_a) through the actor network and execute the action a_t according to the policy π(a_t|s_t; θ'_a), where π(a_t|s_t; θ'_a) indicates that a_t is the action to be executed given the state s_t and the local actor parameters θ'_a on the local GPU_i;

S45. Obtain the reward value r_t and the new state s_{t+1} after executing action a_t, and increase the count value of the first counter t by one;

S46. Judge whether the new state s_t satisfies the condition defined by the final state; if so, set the update reward value R = 0 and go to step S48, otherwise go to step S47;

S47. Judge whether t - t_start is greater than the routing hop limit t_max; if so, set the update reward value R = V(s_t; θ'_c) and go to step S48, otherwise return to step S44, where V(s_t; θ'_c) is the evaluator network's evaluation, under the local evaluator parameters θ'_c, of the routing policy that reached state s_t;

S48. Set the third counter z = t - 1 and the gradient update reward value R_update = r_z + γR, and initialize the gradient Δθ_a of the actor network parameters and the gradient Δθ_c of the evaluator network parameters to 0;

S49. According to the gradient update reward value R_update, the local actor parameters θ'_a and the local evaluator parameters θ'_c, obtain the updated values of the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c as:

Δθ_a_update = Δθ_a + ∇_{θ'_a} log π(a_z|s_z; θ'_a) · (R_update − V(s_z; θ'_c))

Δθ_c_update = Δθ_c + ∂(R_update − V(s_z; θ'_c))² / ∂θ'_c

where Δθ_a_update is the updated value of the gradient Δθ_a; ∇_{θ'_a} denotes the derivative with respect to the local actor parameters θ'_a; log π(a_z|s_z; θ'_a) is the logarithm of the probability, under the parameters θ'_a and the state s_z, of the policy that executes action a_z; r_z is the reward value for executing action a_z; γ is the reward discount rate; V(s_z; θ'_c) is the evaluator network's evaluation, under the local evaluator parameters θ'_c, of the routing policy that reached state s_z; Δθ_c_update is the updated value of the gradient Δθ_c; and ∂(R_update − V(s_z; θ'_c))²/∂θ'_c denotes the partial derivative of (R_update − V(s_z; θ'_c))² with respect to θ'_c;

S410. Set Δθ_a = Δθ_a_update, Δθ_c = Δθ_c_update and R = R_update, and judge whether the third counter z equals the first intermediate count value t_start; if so, go to step S411, otherwise decrease the count value of the third counter z by one, update the gradient update reward value R_update to r_z + γR, and return to step S49;

S411. Judge whether the second counter T is greater than or equal to the maximum number of iterations T_max; if so, update the actor network parameters θ_a and the evaluator network parameters θ_c with the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c respectively, and end the update process; otherwise increase the count value of the second counter T by one and return to step S42.

Further, in step S411 the formulas for updating the actor network parameters θ_a and the evaluator network parameters θ_c with the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c are:

θ_a_update = θ_a + βΔθ_a

θ_c_update = θ_c + βΔθ_c

where θ_a_update denotes the updated actor network parameters θ_a, θ_c_update denotes the updated evaluator network parameters θ_c, and β denotes the weight of the local GPU_i in the SDN network.
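
For readers who want to relate steps S41-S411 to code, the following condensed Python/PyTorch sketch shows one worker's offline A3C update under several assumptions: a toy environment object exposing reset() and step(), small fully connected actor and critic networks, and an assumed step size; none of these details (layer sizes, learning rate, environment interface) come from the invention.

# Condensed sketch of one worker's offline update (steps S41-S411), assuming PyTorch
# and a toy environment; network sizes and hyper-parameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Policy network pi(a|s; theta_a): fully connected, one output per candidate action."""
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim, 64), nn.ReLU(), nn.Linear(64, a_dim))
    def forward(self, s):
        return F.softmax(self.net(s), dim=-1)

class Critic(nn.Module):
    """Value network V(s; theta_c): single scalar output."""
    def __init__(self, s_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim, 64), nn.ReLU(), nn.Linear(64, 1))
    def forward(self, s):
        return self.net(s).squeeze(-1)

def worker_update(global_actor, global_critic, local_actor, local_critic,
                  env, t_max=8, gamma=0.99, beta=1.0, lr=1e-3):
    # S42: synchronise local parameters with the global ones and clear gradients.
    local_actor.load_state_dict(global_actor.state_dict())
    local_critic.load_state_dict(global_critic.state_dict())
    local_actor.zero_grad(); local_critic.zero_grad()

    # S43-S47: roll out at most t_max hops; env.reset()/env.step() is an assumed interface.
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    for _ in range(t_max):
        probs = local_actor(torch.as_tensor(s, dtype=torch.float32))
        a = torch.multinomial(probs, 1).item()
        s_next, r, done = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next
        if done:
            break

    # S46/S47: R = 0 on a terminal state, otherwise bootstrap with V(s_t; theta'_c).
    R = 0.0 if done else local_critic(torch.as_tensor(s, dtype=torch.float32)).item()

    # S48-S410: walk the rollout backwards, accumulating the policy and value gradients.
    actor_loss, critic_loss = 0.0, 0.0
    for s_z, a_z, r_z in zip(reversed(states), reversed(actions), reversed(rewards)):
        R = r_z + gamma * R
        s_z = torch.as_tensor(s_z, dtype=torch.float32)
        advantage = R - local_critic(s_z)
        actor_loss = actor_loss - torch.log(local_actor(s_z)[a_z]) * advantage.detach()
        critic_loss = critic_loss + advantage.pow(2)
    (actor_loss + critic_loss).backward()

    # S411: push the accumulated local gradients into the global parameters
    # (theta <- theta + beta * delta-theta in the text); lr is an assumed step size,
    # and doing this every call condenses the outer T/T_max loop of the description.
    with torch.no_grad():
        for gp, lp in zip(list(global_actor.parameters()) + list(global_critic.parameters()),
                          list(local_actor.parameters()) + list(local_critic.parameters())):
            gp -= beta * lr * lp.grad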

Further, step S7 comprises the following sub-steps:

S71. Set the fourth counter j = 1 and collect a routing request task f;

S72. Assign the routing request task f to an idle GPU in the SDN network, denoted GPU_idle;

S73. Set dθ_a = 0 and dθ_c = 0, synchronize the local actor parameters θ'_a of GPU_idle to the value of the actor network parameters θ_a, and synchronize the local evaluator parameters θ'_c to the value of the evaluator network parameters θ_c;

S74. Set the second intermediate count value j_start = j, and read the initial state s_j at the current moment;

S75. Obtain through the actor network the policy π(a_j|s_j; θ'_a) for executing action a_j given the state s_j and the local actor parameters θ'_a, and execute the policy π(a_j|s_j; θ'_a);

S76. Obtain the reward value r_j and the new state s_{j+1} after executing action a_j, increase the count value of the fourth counter j by one, and add the action a_j to the action set A;

S77. Judge whether the new state s_j satisfies the condition defined by the final state of the routing request task f; if so, go to step S78, otherwise return to step S75;

S78. Obtain the routing path p from the action set A and judge whether the routing request task f matches the routing path p; if so, set the update reward value R = 0 and go to step S79, otherwise set the update reward value R = V(s_j; θ'_c) and go to step S79;

S79. Set the fifth counter k = j - 1 and the gradient update reward value R_update = r_k + γR, and initialize the gradient Δθ_a of the actor network parameters and the gradient Δθ_c of the evaluator network parameters to 0;

S710. According to the gradient update reward value R_update, the local actor parameters θ'_a and the local evaluator parameters θ'_c, obtain the updated values of the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c as:

Δθ_a_update = Δθ_a + ∇_{θ'_a} log π(a_k|s_k; θ'_a) · (R_update − V(s_k; θ'_c))

Δθ_c_update = Δθ_c + ∂(R_update − V(s_k; θ'_c))² / ∂θ'_c

where Δθ_a_update is the updated value of the gradient Δθ_a; ∇_{θ'_a} denotes the derivative with respect to the local actor parameters θ'_a; log π(a_k|s_k; θ'_a) is the logarithm of the probability, under the parameters θ'_a and the state s_k, of the policy that executes action a_k; r_k is the reward value for executing action a_k; γ is the reward discount rate; V(s_k; θ'_c) is the evaluator network's evaluation, under the local evaluator parameters θ'_c, of the routing policy that reached state s_k; Δθ_c_update is the updated value of the gradient Δθ_c; and ∂(R_update − V(s_k; θ'_c))²/∂θ'_c denotes the partial derivative of (R_update − V(s_k; θ'_c))² with respect to θ'_c;

S711. Set Δθ_a = Δθ_a_update, Δθ_c = Δθ_c_update and R = R_update, and judge whether the fifth counter k equals the second intermediate count value j_start; if so, go to step S712, otherwise decrease the count value of the fifth counter k by one, update the gradient update reward value R_update to r_k + γR, and return to step S710;

S712. Update the actor network parameters θ_a and the evaluator network parameters θ_c with the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c respectively, apply the actor network parameters θ_a and the evaluator network parameters θ_c to the whole SDN network, and transmit data using the SDN network with the updated parameters.
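
A corresponding sketch of the online-adaptation stage (steps S71-S712) is given below, under the same assumptions as the offline sketch above; the helper path_satisfies() and the environment interface are illustrative placeholders rather than parts of the invention.

# Minimal sketch of the online-adaptation stage (steps S71-S712); the matching rule
# and the environment interface are assumptions of this sketch.
import torch

def path_satisfies(request, path):
    # Hypothetical matching rule: the rolled-out path is accepted when it ends at the
    # requested destination; the patent's actual match between task f and path p is
    # not spelled out here.
    return bool(path) and path[-1] == request.get("dst")

def online_adapt(global_actor, global_critic, local_actor, local_critic,
                 env, request, max_hops=64, gamma=0.99, beta=1.0, lr=1e-3):
    # S73: synchronise the idle GPU's local parameters with the global ones.
    local_actor.load_state_dict(global_actor.state_dict())
    local_critic.load_state_dict(global_critic.state_dict())
    local_actor.zero_grad(); local_critic.zero_grad()

    # S74-S77: roll out hops until the request's terminal condition (or a hop cap) is reached.
    s, done = env.reset(request), False
    states, actions, rewards, path = [], [], [], []
    for _ in range(max_hops):
        probs = local_actor(torch.as_tensor(s, dtype=torch.float32))
        a = torch.multinomial(probs, 1).item()
        s_next, r, done = env.step(a)
        states.append(s); actions.append(a); rewards.append(r); path.append(a)
        s = s_next
        if done:
            break

    # S78: R = 0 if the produced path matches the request, otherwise bootstrap with V(s_j).
    if path_satisfies(request, path):
        R = 0.0
    else:
        R = local_critic(torch.as_tensor(s, dtype=torch.float32)).item()

    # S79-S712: accumulate gradients exactly as in the offline stage and push them
    # into the global parameters (theta <- theta + beta * delta-theta).
    loss = 0.0
    for s_k, a_k, r_k in zip(reversed(states), reversed(actions), reversed(rewards)):
        R = r_k + gamma * R
        s_k = torch.as_tensor(s_k, dtype=torch.float32)
        adv = R - local_critic(s_k)
        loss = loss - torch.log(local_actor(s_k)[a_k]) * adv.detach() + adv.pow(2)
    loss.backward()
    with torch.no_grad():
        for gp, lp in zip(list(global_actor.parameters()) + list(global_critic.parameters()),
                          list(local_actor.parameters()) + list(local_critic.parameters())):
            gp -= beta * lr * lp.grad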

The beneficial effects of the present invention are:

(1) The invention realizes fast routing-path computation, maximizes throughput while guaranteeing delay, and solves the slow speed and low throughput of traditional algorithms.

(2) The invention uses a reinforcement learning algorithm that reduces route computation to a simple input-output mapping, avoiding the repeated iterations of conventional computation and thereby enabling fast calculation of routing paths. The faster routing algorithm lowers forwarding delay, so packets that would otherwise be discarded when their TTL expires are more likely to survive and be forwarded successfully, which increases network throughput.

(3) The invention includes two training stages, offline training and online training, and updates its parameters in a dynamic environment to select the optimal path, so it is topology-adaptive.

(4) The invention defines a reward function so that node and link loads, routing requirements and network topology information better constrain the reinforcement learning training process, enabling the trained deep reinforcement learning model to perform routing tasks more accurately.

Brief Description of the Drawings

Fig. 1 is a flowchart of the SDN network intelligent routing data transmission method based on distributed deep reinforcement learning proposed by the present invention;

Fig. 2 is a schematic diagram of the CNN convolutional neural network in the present invention;

Fig. 3 is a schematic diagram of the deep reinforcement learning model in the present invention.

Detailed Description of the Embodiments

Specific embodiments of the present invention are described below to help those skilled in the art understand the invention, but it should be clear that the invention is not limited to the scope of these specific embodiments. For those of ordinary skill in the art, as long as the various changes fall within the spirit and scope of the invention as defined and determined by the appended claims, such changes are obvious, and all inventions and creations making use of the inventive concept are protected.

Embodiments of the present invention are described in detail below with reference to the accompanying drawings.

As shown in Fig. 1, an SDN network intelligent routing data transmission method based on distributed deep reinforcement learning comprises the following steps:

S1. Construct a reward function and a deep reinforcement learning model comprising an actor network and an evaluator network, and deploy the deep reinforcement learning model in the application layer of the SDN network;

S2. Randomly initialize the actor network parameters θ_a and the evaluator network parameters θ_c of the deep reinforcement learning model;

S3. Randomly initialize the local actor parameters θ'_a of the actor network and the local evaluator parameters θ'_c of the evaluator network on the i-th local GPU_i in the control layer of the SDN network;

S4. According to the reward function, the actor network parameters θ_a, the evaluator network parameters θ_c, the local actor parameters θ'_a and the local evaluator parameters θ'_c, train the deep reinforcement learning model on the i-th local GPU_i offline with the A3C algorithm, and update the actor network parameters θ_a and the evaluator network parameters θ_c;

S5. Apply the updated actor network parameters θ_a and the updated evaluator network parameters θ_c to the whole SDN network, and transmit data using the SDN network with the updated parameters;

S6. Periodically detect whether the topology of the SDN network has changed; if so, go to step S7, otherwise repeat step S6;

S7. Train the deep reinforcement learning model online, update the actor network parameters θ_a and the evaluator network parameters θ_c with the adaptive running algorithm, apply the actor network parameters θ_a and the evaluator network parameters θ_c to the whole SDN network, and transmit data using the SDN network with the updated parameters;

where i = 1, 2, ..., L, and L is the total number of local GPUs.

In step S1, the actor network is a fully connected neural network, and the evaluator network is a combination of a fully connected neural network and a CNN convolutional neural network. The inputs of both the actor network and the evaluator network include the network state of the SDN network, which comprises the current node information, the destination node information, the bandwidth requirement and the delay requirement; the input of the evaluator network additionally includes the network features of the SDN network extracted by the CNN convolutional neural network.

As shown in Fig. 2, the CNN convolutional neural network comprises an input layer, a convolution layer, a pooling layer, a fully connected layer and an output layer connected in sequence.
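
As an illustration of the layer order described above (input, convolution, pooling, fully connected, output), the following PyTorch sketch builds a small convolutional feature extractor over a node-by-node link matrix; the channel counts, kernel size and choice of input matrix are assumptions of this sketch, not values given by the invention.

# Sketch only: a small convolutional feature extractor following the described layer
# order; all sizes and the use of a residual-capacity link matrix are assumptions.
import torch
import torch.nn as nn

class TopologyCNN(nn.Module):
    def __init__(self, num_nodes: int, feat_dim: int = 32):
        super().__init__()
        self.conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)   # convolution layer
        self.pool = nn.MaxPool2d(2)                              # pooling layer
        side = num_nodes // 2
        self.fc = nn.Linear(8 * side * side, feat_dim)           # fully connected layer
        self.out = nn.Linear(feat_dim, feat_dim)                 # output layer

    def forward(self, link_matrix: torch.Tensor) -> torch.Tensor:
        # link_matrix: (batch, 1, num_nodes, num_nodes), e.g. residual link capacities.
        x = torch.relu(self.conv(link_matrix))
        x = self.pool(x)
        x = torch.relu(self.fc(x.flatten(1)))
        return self.out(x)

features = TopologyCNN(num_nodes=10)(torch.rand(1, 1, 10, 10))
print(features.shape)  # torch.Size([1, 32])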

The reward function in step S1 is:

r(s_n, a_n) = f(g, a_1, a_2, c(n), c(m), c(l), d(n), d(m))   (the exact expression is given only as an equation image in the original document)

where r(s_n, a_n) denotes the reward obtained after the n-th routing node in the SDN network, in state s_n, performs action a_n and forwards to the m-th routing node; g denotes the action penalty, a_1 the first weight and a_2 the second weight; c(n) denotes the remaining capacity of the n-th routing node, c(m) the remaining capacity of the m-th routing node, and c(l) the remaining capacity of the l-th link in the SDN network; d(n) denotes the degree of difference between the traffic load of the n-th routing node and that of its neighboring nodes, and d(m) denotes the same quantity for the m-th routing node. The state s_n comprises: the node where the data packet currently resides (the n-th routing node), the final destination node of the data packet, the forwarding bandwidth requirement of the data packet and the delay requirement of the data packet. The action a_n represents all forwarding operations that can be taken in state s_n.

Step S4 comprises the following sub-steps:

S41. Set the first counter t = 0, the second counter T = 0, the maximum number of iterations T_max and the routing hop limit t_max;

S42. Set dθ_a = 0 and dθ_c = 0, and synchronize the local parameters with the global parameters: set the local actor parameters θ'_a to the value of the actor network parameters θ_a, and set the local evaluator parameters θ'_c to the value of the evaluator network parameters θ_c;

S43. Set the first intermediate count value t_start = t, and read the current state s_t through the local GPU_i;

S44. Obtain the policy π(a_t|s_t; θ'_a) through the actor network and execute the action a_t according to the policy π(a_t|s_t; θ'_a), where π(a_t|s_t; θ'_a) indicates that a_t is the action to be executed given the state s_t and the local actor parameters θ'_a on the local GPU_i;

S45. Obtain the reward value r_t and the new state s_{t+1} after executing action a_t, and increase the count value of the first counter t by one;

S46. Judge whether the new state s_t satisfies the condition defined by the final state; if so, set the update reward value R = 0 and go to step S48, otherwise go to step S47;

S47. Judge whether t - t_start is greater than the routing hop limit t_max; if so, set the update reward value R = V(s_t; θ'_c) and go to step S48, otherwise return to step S44, where V(s_t; θ'_c) is the evaluator network's evaluation, under the local evaluator parameters θ'_c, of the routing policy that reached state s_t;

S48. Set the third counter z = t - 1 and the gradient update reward value R_update = r_z + γR, and initialize the gradient Δθ_a of the actor network parameters and the gradient Δθ_c of the evaluator network parameters to 0;

S49. According to the gradient update reward value R_update, the local actor parameters θ'_a and the local evaluator parameters θ'_c, obtain the updated values of the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c as:

Δθ_a_update = Δθ_a + ∇_{θ'_a} log π(a_z|s_z; θ'_a) · (R_update − V(s_z; θ'_c))

Δθ_c_update = Δθ_c + ∂(R_update − V(s_z; θ'_c))² / ∂θ'_c

where Δθ_a_update is the updated value of the gradient Δθ_a; ∇_{θ'_a} denotes the derivative with respect to the local actor parameters θ'_a; log π(a_z|s_z; θ'_a) is the logarithm of the probability, under the parameters θ'_a and the state s_z, of the policy that executes action a_z; r_z is the reward value for executing action a_z; γ is the reward discount rate; V(s_z; θ'_c) is the evaluator network's evaluation, under the local evaluator parameters θ'_c, of the routing policy that reached state s_z; Δθ_c_update is the updated value of the gradient Δθ_c; and ∂(R_update − V(s_z; θ'_c))²/∂θ'_c denotes the partial derivative of (R_update − V(s_z; θ'_c))² with respect to θ'_c;

S410. Set Δθ_a = Δθ_a_update, Δθ_c = Δθ_c_update and R = R_update, and judge whether the third counter z equals the first intermediate count value t_start; if so, go to step S411, otherwise decrease the count value of the third counter z by one, update the gradient update reward value R_update to r_z + γR, and return to step S49;

S411. Judge whether the second counter T is greater than or equal to the maximum number of iterations T_max; if so, update the actor network parameters θ_a and the evaluator network parameters θ_c with the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c respectively, and end the update process; otherwise increase the count value of the second counter T by one and return to step S42.

In step S411 the formulas for updating the actor network parameters θ_a and the evaluator network parameters θ_c with the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c are:

θ_a_update = θ_a + βΔθ_a

θ_c_update = θ_c + βΔθ_c

where θ_a_update denotes the updated actor network parameters θ_a, θ_c_update denotes the updated evaluator network parameters θ_c, and β denotes the weight of the local GPU_i in the SDN network.

Step S7 comprises the following sub-steps:

S71. Set the fourth counter j = 1 and collect a routing request task f;

S72. Assign the routing request task f to an idle GPU in the SDN network, denoted GPU_idle;

S73. Set dθ_a = 0 and dθ_c = 0, synchronize the local actor parameters θ'_a of GPU_idle to the value of the actor network parameters θ_a, and synchronize the local evaluator parameters θ'_c to the value of the evaluator network parameters θ_c;

S74. Set the second intermediate count value j_start = j, and read the initial state s_j at the current moment;

S75. Obtain through the actor network the policy π(a_j|s_j; θ'_a) for executing action a_j given the state s_j and the local actor parameters θ'_a, and execute the policy π(a_j|s_j; θ'_a);

S76. Obtain the reward value r_j and the new state s_{j+1} after executing action a_j, increase the count value of the fourth counter j by one, and add the action a_j to the action set A;

S77. Judge whether the new state s_j satisfies the condition defined by the final state of the routing request task f; if so, go to step S78, otherwise return to step S75;

S78. Obtain the routing path p from the action set A and judge whether the routing request task f matches the routing path p; if so, set the update reward value R = 0 and go to step S79, otherwise set the update reward value R = V(s_j; θ'_c) and go to step S79;

S79. Set the fifth counter k = j - 1 and the gradient update reward value R_update = r_k + γR, and initialize the gradient Δθ_a of the actor network parameters and the gradient Δθ_c of the evaluator network parameters to 0;

S710. According to the gradient update reward value R_update, the local actor parameters θ'_a and the local evaluator parameters θ'_c, obtain the updated values of the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c as:

Δθ_a_update = Δθ_a + ∇_{θ'_a} log π(a_k|s_k; θ'_a) · (R_update − V(s_k; θ'_c))

Δθ_c_update = Δθ_c + ∂(R_update − V(s_k; θ'_c))² / ∂θ'_c

where Δθ_a_update is the updated value of the gradient Δθ_a; ∇_{θ'_a} denotes the derivative with respect to the local actor parameters θ'_a; log π(a_k|s_k; θ'_a) is the logarithm of the probability, under the parameters θ'_a and the state s_k, of the policy that executes action a_k; r_k is the reward value for executing action a_k; γ is the reward discount rate; V(s_k; θ'_c) is the evaluator network's evaluation, under the local evaluator parameters θ'_c, of the routing policy that reached state s_k; Δθ_c_update is the updated value of the gradient Δθ_c; and ∂(R_update − V(s_k; θ'_c))²/∂θ'_c denotes the partial derivative of (R_update − V(s_k; θ'_c))² with respect to θ'_c;

S711. Set Δθ_a = Δθ_a_update, Δθ_c = Δθ_c_update and R = R_update, and judge whether the fifth counter k equals the second intermediate count value j_start; if so, go to step S712, otherwise decrease the count value of the fifth counter k by one, update the gradient update reward value R_update to r_k + γR, and return to step S710;

S712. Update the actor network parameters θ_a and the evaluator network parameters θ_c with the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c respectively, apply the actor network parameters θ_a and the evaluator network parameters θ_c to the whole SDN network, and transmit data using the SDN network with the updated parameters.

As shown in Fig. 3, in this embodiment the deep reinforcement learning model comprises an actor and a critic (evaluator) pair, both built from neural networks (NN). The actor network outputs a probability distribution over all actions in a given state, i.e. the routing policy, and is therefore a multi-output neural network. The critic network uses the temporal-difference error to evaluate the actor's policy and is a single-output neural network. The actor network is a fully connected neural network: after the current node information, destination node information, bandwidth requirement and delay requirement are fed in, each neuron computes a weighted sum followed by an activation function, and multiple outputs are produced. The actor network gives the next action according to the current state; since several actions are available, the network has multiple outputs, each being the probability of one routing choice. The evaluator network takes the same four items of network information plus an additional network-feature input, and its output is the evaluation of the actor network's policy, so it has a single output. The extra network-feature input carries the network's change information; incorporating real-time network state changes when evaluating the actor's policy makes the intelligent routing adaptive.
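
To make the structure in Fig. 3 concrete, the following PyTorch sketch pairs a multi-output actor (one probability per candidate next hop) with a single-output evaluator that concatenates the request state with the CNN-derived network features; all layer sizes are illustrative assumptions, not values given by the invention.

# Sketch only: a multi-output actor and a single-output evaluator with an extra
# network-feature input, as described above; sizes are assumptions.
import torch
import torch.nn as nn

class ActorNet(nn.Module):
    def __init__(self, state_dim: int, num_next_hops: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, num_next_hops))
    def forward(self, state):
        return torch.softmax(self.net(state), dim=-1)   # one probability per next hop

class EvaluatorNet(nn.Module):
    def __init__(self, state_dim: int, topo_feat_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + topo_feat_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, state, topo_features):
        return self.net(torch.cat([state, topo_features], dim=-1))  # single output: V(s)

actor = ActorNet(state_dim=22, num_next_hops=4)
critic = EvaluatorNet(state_dim=22, topo_feat_dim=32)
probs = actor(torch.rand(1, 22))
value = critic(torch.rand(1, 22), torch.rand(1, 32))
print(probs.shape, value.shape)  # torch.Size([1, 4]) torch.Size([1, 1])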

Claims (6)

1. An SDN network intelligent routing data transmission method based on distributed deep reinforcement learning, characterized by comprising the following steps:

S1, constructing a reward function and a deep reinforcement learning model comprising an actor network and an evaluator network, and deploying the deep reinforcement learning model in an application layer of the SDN network;

S2, randomly initializing the actor network parameters θ_a and the evaluator network parameters θ_c of the deep reinforcement learning model;

S3, randomly initializing the local actor parameters θ'_a of the actor network and the local evaluator parameters θ'_c of the evaluator network on the i-th local GPU_i in the control layer of the SDN network;

S4, according to the reward function, the actor network parameters θ_a, the evaluator network parameters θ_c, the local actor parameters θ'_a and the local evaluator parameters θ'_c, training the deep reinforcement learning model on the i-th local GPU_i offline with the A3C algorithm, and updating the actor network parameters θ_a and the evaluator network parameters θ_c;

S5, applying the updated actor network parameters θ_a and the updated evaluator network parameters θ_c to the whole SDN network, and transmitting data using the SDN network with the updated parameters;

S6, periodically detecting whether the topology of the SDN network has changed; if so, entering step S7, otherwise repeating step S6;

S7, training the deep reinforcement learning model online, updating the actor network parameters θ_a and the evaluator network parameters θ_c with the adaptive running algorithm, applying the actor network parameters θ_a and the evaluator network parameters θ_c to the whole SDN network, and transmitting data using the SDN network with the updated parameters;

wherein i = 1, 2, ..., L, and L represents the total number of local GPUs.
2. The SDN network intelligent routing data transmission method based on distributed deep reinforcement learning according to claim 1, wherein the actor network in step S1 is a fully connected neural network, and the evaluator network in step S1 is a combination network of a fully connected neural network and a CNN convolutional neural network; the inputs of both the actor network and the evaluator network comprise network states of the SDN network, the network states comprising current node information, destination node information, bandwidth requirements and delay requirements, and the input of the evaluator network further comprising network features of the SDN network processed by the CNN convolutional neural network; the CNN convolutional neural network comprises an input layer, a convolution layer, a pooling layer, a fully connected layer and an output layer which are connected in sequence.
3. The SDN network intelligent routing data transmission method based on distributed deep reinforcement learning according to claim 1, wherein the reward function in step S1 is:

r(s_n, a_n) = f(g, a_1, a_2, c(n), c(m), c(l), d(n), d(m))   (the exact expression is given only as an equation image in the original document)

wherein r(s_n, a_n) represents the reward value obtained after the n-th routing node in the SDN network, in the state s_n, performs the action a_n towards the m-th routing node; g represents the action penalty, a_1 represents the first weight, a_2 represents the second weight; c(n) represents the remaining capacity of the n-th routing node, c(m) represents the remaining capacity of the m-th routing node, and c(l) represents the remaining capacity of the l-th link in the SDN network; d(n) represents the degree of difference between the traffic load of the n-th routing node and that of its neighboring nodes, and d(m) represents the degree of difference between the traffic load of the m-th routing node and that of its neighboring nodes; the state s_n comprises: the n-th routing node where the data packet is located, the final destination node of the data packet, the forwarding bandwidth requirement of the data packet and the delay requirement of the data packet; the action a_n represents all forwarding operations that may be taken in the state s_n.
4. The SDN network intelligent routing data transmission method based on distributed deep reinforcement learning according to claim 1, wherein the step S4 comprises the following sub-steps:

S41, setting a first counter t = 0, a second counter T = 0, a maximum number of iterations T_max and a routing hop limit t_max;

S42, setting dθ_a = 0 and dθ_c = 0, and synchronizing the local parameters with the global parameters: synchronizing the value of the local actor parameters θ'_a to the value of the actor network parameters θ_a, and synchronizing the value of the local evaluator parameters θ'_c to the value of the evaluator network parameters θ_c;

S43, setting the first intermediate count value t_start = t, and reading the state s_t at the current moment through the local GPU_i;

S44, obtaining the policy π(a_t|s_t; θ'_a) through the actor network and executing the action a_t according to the policy π(a_t|s_t; θ'_a), wherein π(a_t|s_t; θ'_a) indicates that the action to be executed given the state s_t and the local actor parameters θ'_a on the local GPU_i is a_t;

S45, obtaining the reward value r_t and the new state s_{t+1} after executing the action a_t, and increasing the count value of the first counter t by one;

S46, judging whether the new state s_t reaches the condition defined by the final state; if so, setting the update reward value R = 0 and proceeding to step S48, otherwise proceeding to step S47;

S47, judging whether t - t_start is greater than the routing hop limit t_max; if so, setting the update reward value R = V(s_t; θ'_c) and proceeding to step S48, otherwise returning to step S44, wherein V(s_t; θ'_c) represents the evaluator network's evaluation value, under the local evaluator parameters θ'_c, of the routing policy that reached the state s_t;

S48, setting a third counter z = t - 1 and the gradient update reward value R_update = r_z + γR, and initializing the gradient Δθ_a of the actor network parameters and the gradient Δθ_c of the evaluator network parameters to 0;

S49, according to the gradient update reward value R_update, the local actor parameters θ'_a and the local evaluator parameters θ'_c, obtaining the updated values of the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c as:

Δθ_a_update = Δθ_a + ∇_{θ'_a} log π(a_z|s_z; θ'_a) · (R_update − V(s_z; θ'_c))

Δθ_c_update = Δθ_c + ∂(R_update − V(s_z; θ'_c))² / ∂θ'_c

wherein Δθ_a_update represents the updated value of the gradient Δθ_a, ∇_{θ'_a} represents the derivative with respect to the local actor parameters θ'_a, log π(a_z|s_z; θ'_a) represents the logarithm of the probability, under the parameters θ'_a and the state s_z, of the policy that executes the action a_z, r_z represents the reward value for executing the action a_z, γ represents the reward discount rate, V(s_z; θ'_c) represents the evaluator network's evaluation value, under the local evaluator parameters θ'_c, of the routing policy that reached the state s_z, Δθ_c_update represents the updated value of the gradient Δθ_c, and ∂(R_update − V(s_z; θ'_c))²/∂θ'_c represents the partial derivative of (R_update − V(s_z; θ'_c))² with respect to θ'_c;

S410, setting Δθ_a = Δθ_a_update, Δθ_c = Δθ_c_update and R = R_update, and judging whether the third counter z equals the first intermediate count value t_start; if so, proceeding to step S411, otherwise decreasing the count value of the third counter z by one, updating the gradient update reward value R_update to r_z + γR, and returning to step S49;

S411, judging whether the second counter T is greater than or equal to the maximum number of iterations T_max; if so, updating the actor network parameters θ_a and the evaluator network parameters θ_c with the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c respectively, and ending the update process; otherwise increasing the count value of the second counter T by one and returning to step S42.
5. The SDN network intelligent routing data transmission method based on distributed deep reinforcement learning according to claim 4, wherein the formulas in step S411 for updating the actor network parameters θ_a and the evaluator network parameters θ_c with the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c respectively are:

θ_a_update = θ_a + βΔθ_a

θ_c_update = θ_c + βΔθ_c

wherein θ_a_update represents the updated actor network parameters θ_a, θ_c_update represents the updated evaluator network parameters θ_c, and β represents the weight of the local GPU_i in the SDN network.
6. The SDN network smart routing data transmission method of distributed deep reinforcement learning according to claim 4, wherein the step S7 includes the following sub-steps:
s71, setting a fourth counterj=1, and collects route request tasksf
S72, routing request taskfAllocated to idle in SDN networkGPUIs idleGPUIs composed of
Figure 683205DEST_PATH_IMAGE109
S73, setting
Figure 103428DEST_PATH_IMAGE110
And
Figure 872451DEST_PATH_IMAGE111
and will be
Figure 937621DEST_PATH_IMAGE112
Local actor parameters of
Figure 164947DEST_PATH_IMAGE113
Synchronizing to actor network parameters
Figure 352477DEST_PATH_IMAGE114
Parameter value, local evaluator parameter
Figure 948410DEST_PATH_IMAGE115
Synchronizing evaluator network parameters
Figure 220866DEST_PATH_IMAGE116
A parameter value;
s74, calculating the second intermediate count value
Figure 977732DEST_PATH_IMAGE117
And reading the initial state of the current time
Figure 745618DEST_PATH_IMAGE118
S75, obtaining the state through the actor network
Figure 483373DEST_PATH_IMAGE119
And local actor parameters
Figure 615978DEST_PATH_IMAGE120
Perform an action
Figure 430612DEST_PATH_IMAGE121
Strategy (2)
Figure 160277DEST_PATH_IMAGE122
And execute the policy
Figure 125609DEST_PATH_IMAGE123
S76, acquiring and executing action
Figure 867431DEST_PATH_IMAGE124
Value of the reward after
Figure 189435DEST_PATH_IMAGE125
And new state
Figure 904317DEST_PATH_IMAGE126
Let the fourth counterjAnd increases the count value of (d) by one and acts on
Figure 947491DEST_PATH_IMAGE127
Adding an action set A;
s77, judging the new state
Figure 552391DEST_PATH_IMAGE128
Whether to reach the route request taskfIf so, go to step S78, otherwise return to step S75;
s78, obtaining the routing path according to the action set ApAnd judging the routing request taskfWhether to communicate with a routing pathpIf matching, then order to update the reward valueR=0, and proceed to step S79, otherwise, let the prize value be updated
Figure 403935DEST_PATH_IMAGE129
And proceeds to step S79;
s79, setting the fifth counterk=j-1 and gradient update prize value
Figure 684524DEST_PATH_IMAGE130
Initializing gradients of actor network parameters
Figure 353010DEST_PATH_IMAGE131
And gradient of evaluator network parameters
Figure 303211DEST_PATH_IMAGE132
Is 0;
s710, updating the reward value according to the gradient
Figure 478103DEST_PATH_IMAGE133
Local actor parameters
Figure 392576DEST_PATH_IMAGE134
And local evaluator parameters
Figure 554217DEST_PATH_IMAGE135
Obtaining local actor parameter gradients
Figure 845128DEST_PATH_IMAGE136
And local actor parameter gradient
Figure 671263DEST_PATH_IMAGE137
The update values of (a) are:
Figure 898850DEST_PATH_IMAGE138
Figure 214031DEST_PATH_IMAGE139
wherein dθ on the left-hand side of the first expression represents the updated value of the gradient dθ, ∇_θ' represents the derivative with respect to the local actor parameters θ', log π(a_k | s_k; θ') represents the logarithm of the probability of performing action a_k in state s_k under the parameters θ', r_k represents the reward value of executing action a_k, γ represents the discount rate of the reward, V(s_k; θ_v') represents the evaluation value that the evaluator network, under the local evaluator parameters θ_v', assigns to the routing policy reaching state s_k, dθ_v on the left-hand side of the second expression represents the updated value of the gradient dθ_v, and ∂(r_k + γR − V(s_k; θ_v'))²/∂θ_v' represents the partial derivative of (r_k + γR − V(s_k; θ_v'))² with respect to θ_v';
S711, letting dθ and dθ_v take the update values obtained in step S710, and judging whether the fifth counter k equals the second intermediate count value j_start; if so, going to step S712; otherwise, updating the gradient-update reward value R to r_k + γR, decreasing the count value of the fifth counter k by one, and returning to step S710;
S712, updating the actor network parameters θ and the evaluator network parameters θ_v respectively through the local actor parameter gradient dθ and the local evaluator parameter gradient dθ_v, applying the updated actor network parameters θ and evaluator network parameters θ_v to the whole SDN network, and transmitting data by using the SDN network with the updated parameters.
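The sub-steps above follow the general pattern of an asynchronous advantage actor-critic worker: roll out a path for one routing request with local copies of the actor and evaluator, accumulate policy and value gradients backwards over the trajectory, and push the accumulated gradients to the global parameters. The following is a minimal, self-contained Python sketch of that control flow only, not the patented implementation: it uses a tabular softmax actor and tabular evaluator, toy hop dynamics and rewards, and illustrative names (policy, rollout, d_theta, d_theta_v, GAMMA) that do not come from the patent:

    import numpy as np

    # Toy sizes and parameters for illustration only: a tabular softmax actor
    # theta_local[state, action] and a tabular evaluator theta_v_local[state].
    # The patent uses neural networks; every name and value here is assumed.
    N_STATES, N_ACTIONS = 6, 3
    GAMMA = 0.9                      # discount rate of the reward (value assumed)
    rng = np.random.default_rng(0)
    theta_local = rng.normal(0.0, 0.1, (N_STATES, N_ACTIONS))  # local actor parameters theta'
    theta_v_local = np.zeros(N_STATES)                          # local evaluator parameters theta_v'

    def policy(s):
        # Softmax policy pi(. | s; theta') of the local actor.
        z = np.exp(theta_local[s] - theta_local[s].max())
        return z / z.sum()

    def rollout(start=0, destination=N_STATES - 1):
        # S74-S77: act with the local actor until the destination of the request is reached.
        s, traj = start, []
        while s != destination:
            a = rng.choice(N_ACTIONS, p=policy(s))
            s_next = min(s + a + 1, destination)  # toy "next hop" dynamics (illustrative)
            r = -1.0                              # toy per-hop reward (illustrative)
            traj.append((s, a, r))
            s = s_next
        return traj, s

    traj, final_state = rollout()

    # S78: R = 0 if the rolled-out path matches the request; otherwise bootstrap from
    # the evaluator's value of the final state. Here we take the bootstrap branch.
    R = theta_v_local[final_state]

    # S79-S711: accumulate gradients backwards over the trajectory.
    d_theta = np.zeros_like(theta_local)      # gradient of actor network parameters
    d_theta_v = np.zeros_like(theta_v_local)  # gradient of evaluator network parameters
    for s_k, a_k, r_k in reversed(traj):
        R = r_k + GAMMA * R                   # gradient-update reward value
        adv = R - theta_v_local[s_k]          # advantage R - V(s_k; theta_v')
        grad_log = -policy(s_k)               # grad of log pi(a_k | s_k; theta') w.r.t. theta'[s_k]
        grad_log[a_k] += 1.0
        d_theta[s_k] += grad_log * adv        # policy-gradient (ascent) accumulation
        d_theta_v[s_k] += -2.0 * adv          # grad of (R - V(s_k; theta_v'))^2 w.r.t. theta_v'[s_k]

    # S712: d_theta and d_theta_v would now be pushed asynchronously to the global
    # actor and evaluator parameters and applied to the whole SDN network.

In a distributed deployment, one such worker loop would run per idle GPU, each handling a different routing request and pushing its accumulated gradients to the shared actor and evaluator parameters asynchronously.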

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010673851.8A CN111917642B (en) 2020-07-14 2020-07-14 SDN network intelligent routing data transmission method based on distributed deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN111917642A (en) 2020-11-10
CN111917642B (en) 2021-04-27

Family

ID=73280083

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010673851.8A Expired - Fee Related CN111917642B (en) 2020-07-14 2020-07-14 SDN network intelligent routing data transmission method based on distributed deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN111917642B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818788B (en) * 2021-01-25 2022-05-03 电子科技大学 A Distributed Convolutional Neural Network Hierarchical Matching Method Based on UAV Swarm
CN113316216B (en) * 2021-05-26 2022-04-08 电子科技大学 A routing method for micro-nano satellite network
CN113537628B (en) * 2021-08-04 2023-08-22 郭宏亮 Universal reliable shortest path method based on distributed reinforcement learning
CN114051272A (en) * 2021-10-30 2022-02-15 西南电子技术研究所(中国电子科技集团公司第十研究所) Intelligent routing method for dynamic topological network
CN115660030B (en) * 2022-11-04 2025-12-09 中国科学技术大学 Robot cloud and end cooperative computing processing method, equipment and storage medium

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150269479A1 (en) * 2014-03-24 2015-09-24 Qualcomm Incorporated Conversion of neuron types to hardware
CN106873585B (en) * 2017-01-18 2019-12-03 上海器魂智能科技有限公司 A kind of navigation method for searching, robot and system
US10396919B1 (en) * 2017-05-12 2019-08-27 Virginia Tech Intellectual Properties, Inc. Processing of communications signals using machine learning
US10375585B2 (en) * 2017-07-06 2019-08-06 Futurwei Technologies, Inc. System and method for deep learning and wireless network optimization using deep learning
CN108600104B (en) * 2018-04-28 2019-10-01 电子科技大学 A kind of SDN Internet of Things flow polymerization based on tree-shaped routing
EP3769264B1 (en) * 2018-05-18 2025-08-27 DeepMind Technologies Limited Meta-gradient updates for training return functions for reinforcement learning systems
US10940863B2 (en) * 2018-11-01 2021-03-09 GM Global Technology Operations LLC Spatial and temporal attention-based deep reinforcement learning of hierarchical lane-change policies for controlling an autonomous vehicle
CN109803344B (en) * 2018-12-28 2019-10-11 北京邮电大学 A joint construction method of UAV network topology and routing
CN110611619B (en) * 2019-09-12 2020-10-09 西安电子科技大学 An Intelligent Routing Decision Method Based on DDPG Reinforcement Learning Algorithm
CN110515303B (en) * 2019-09-17 2022-09-09 余姚市浙江大学机器人研究中心 DDQN-based self-adaptive dynamic path planning method
CN111010294B (en) * 2019-11-28 2022-07-12 国网甘肃省电力公司电力科学研究院 Electric power communication network routing method based on deep reinforcement learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110770761A (en) * 2017-07-06 2020-02-07 华为技术有限公司 Deep learning systems and methods and wireless network optimization using deep learning
CN111316295A (en) * 2017-10-27 2020-06-19 渊慧科技有限公司 Reinforcement learning using distributed prioritized playback
CN108803615A (en) * 2018-07-03 2018-11-13 东南大学 A kind of visual human's circumstances not known navigation algorithm based on deeply study
CN109343341A (en) * 2018-11-21 2019-02-15 北京航天自动控制研究所 An intelligent control method for vertical recovery of launch vehicle based on deep reinforcement learning
CN110472738A (en) * 2019-08-16 2019-11-19 北京理工大学 A kind of unmanned boat Real Time Obstacle Avoiding algorithm based on deeply study
CN110472880A (en) * 2019-08-20 2019-11-19 李峰 Evaluate the method, apparatus and storage medium of collaborative problem resolution ability

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Multi-task Deep Reinforcement Learning for Scalable Parallel Task Scheduling; Lingxin Zhang; 2019 IEEE International Conference on Big Data (Big Data); 2020-02-24; pp. 1-10 *
Research on a new two-layer mapping system in name-address separation networks; Zhang Xiaoning; Journal of Electronics & Information Technology; 2014-10-30 (Vol. 36, No. 10); pp. 1-7 *

Also Published As

Publication number Publication date
CN111917642A (en) 2020-11-10

Similar Documents

Publication Publication Date Title
CN111917642B (en) SDN network intelligent routing data transmission method based on distributed deep reinforcement learning
CN111010294B (en) Electric power communication network routing method based on deep reinforcement learning
CN112437020B (en) Data center network load balancing method based on deep reinforcement learning
CN111988225B (en) Multi-path routing method based on reinforcement learning and transfer learning
CN113467952B (en) Distributed federal learning collaborative computing method and system
CN108111335B (en) A method and system for scheduling and linking virtual network functions
CN110611619A (en) An Intelligent Routing Decision-Making Method Based on DDPG Reinforcement Learning Algorithm
CN113784410B (en) Vertical Handoff Method for Heterogeneous Wireless Networks Based on Reinforcement Learning TD3 Algorithm
CN113194034A (en) Route optimization method and system based on graph neural network and deep reinforcement learning
CN116527567A (en) Intelligent network path optimization method and system based on deep reinforcement learning
CN113098714A (en) Low-delay network slicing method based on deep reinforcement learning
CN113570039B (en) A blockchain system with optimized consensus based on reinforcement learning
CN113395207B (en) A routing optimization architecture and method based on deep reinforcement learning under SDN architecture
WO2020172825A1 (en) Method and apparatus for determining transmission policy
CN113887748B (en) Online federated learning task assignment method, device, federated learning method and system
CN116669068A (en) GCN-based delay service end-to-end slice deployment method and system
CN114629543A (en) Satellite network adaptive traffic scheduling method based on deep supervised learning
CN116234073A (en) Routing method of distributed unmanned aerial vehicle ad hoc network based on deep reinforcement learning
CN111340192B (en) Network path allocation model training method, path allocation method and device
CN119201470A (en) A computing network resource scheduling optimization method based on multi-agent deep reinforcement learning
CN118474013A (en) Intelligent routing method for intention network based on DRL-GNN
CN113177636A (en) Network dynamic routing method and system based on multiple constraint conditions
CN114726770B (en) Traffic engineering method applied to segmented routing network environment
CN117938959A (en) Multi-target SFC deployment method based on deep reinforcement learning and genetic algorithm
Alliche et al. Prisma: a packet routing simulator for multi-agent reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210427