
CN111917642B - SDN network intelligent routing data transmission method based on distributed deep reinforcement learning - Google Patents

SDN network intelligent routing data transmission method based on distributed deep reinforcement learning

Info

Publication number
CN111917642B
CN111917642B
Authority
CN
China
Prior art keywords
network
actor
parameters
local
evaluator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN202010673851.8A
Other languages
Chinese (zh)
Other versions
CN111917642A (en)
Inventor
刘宇涛
崔金鹏
章小宁
贺元林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202010673851.8A priority Critical patent/CN111917642B/en
Publication of CN111917642A publication Critical patent/CN111917642A/en
Application granted granted Critical
Publication of CN111917642B publication Critical patent/CN111917642B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00Routing or path finding of packets in data switching networks
    • H04L45/12Shortest path evaluation
    • H04L45/124Shortest path evaluation using a combination of metrics
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00Routing or path finding of packets in data switching networks
    • H04L45/12Shortest path evaluation
    • H04L45/121Shortest path evaluation by minimising delays
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00Routing or path finding of packets in data switching networks
    • H04L45/12Shortest path evaluation
    • H04L45/125Shortest path evaluation based on throughput or bandwidth

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses an SDN network intelligent routing data transmission method based on distributed deep reinforcement learning. The method computes routing paths quickly, maximizes throughput while meeting delay requirements, and addresses the slow convergence and low throughput of traditional algorithms. The reinforcement learning algorithm reduces route computation to a simple input-output mapping and avoids the repeated iterations of conventional computation, so routing paths are calculated quickly; the faster routing decision lowers forwarding delay, so packets that would otherwise be dropped when their TTL expires are more likely to survive and be forwarded successfully, which increases network throughput. The method comprises an offline training stage and an online training stage, and updates its parameters in a dynamic environment to select the optimal path, so it adapts to topology changes.

Description

SDN Network Intelligent Routing Data Transmission Method Based on Distributed Deep Reinforcement Learning

Technical Field

The invention belongs to the field of data transmission, and in particular relates to an SDN network intelligent routing data transmission method based on distributed deep reinforcement learning.

Background Art

Information technology has now entered a mature stage. In the SDN (Software Defined Network) architecture, data flows are flexible and controllable, and the controller has a network-wide view and can sense network state changes (such as traffic distribution, congestion and link utilization) in real time. In practice, the routing problem is usually solved with shortest-path algorithms: a few simple network parameters (such as hop count or delay) serve as the optimization metrics, and the path with the fewest hops or the smallest delay is taken as the final objective. Such a single metric and optimization goal easily leads to congestion on a few key links and to an unbalanced network load. Although a Lagrangian-relaxation-based shortest-path algorithm can find an optimal path under multiple constraints when allocating paths for multiple services, this kind of heuristic routing algorithm must go through many iterations to compute the optimal path, so its convergence is slow, its timeliness is poor and its throughput is low.

Summary of the Invention

In view of the above deficiencies in the prior art, the SDN network intelligent routing data transmission method based on distributed deep reinforcement learning provided by the present invention solves the above problems in the prior art.

To achieve the above purpose of the invention, the technical solution adopted by the present invention is an SDN network intelligent routing data transmission method based on distributed deep reinforcement learning, comprising the following steps:

S1. Construct a reward function and a deep reinforcement learning model comprising an actor network and an evaluator network, and deploy the deep reinforcement learning model in the application layer of the SDN network;

S2. Randomly initialize the actor network parameters θ_a and the evaluator network parameters θ_c of the deep reinforcement learning model;

S3. Randomly initialize the local actor parameters θ'_a of the actor network and the local evaluator parameters θ'_c of the evaluator network on the i-th local GPU_i in the control layer of the SDN network;

S4. According to the reward function, the actor network parameters θ_a, the evaluator network parameters θ_c, the local actor parameters θ'_a and the local evaluator parameters θ'_c, train the deep reinforcement learning model on the i-th local GPU_i offline with the A3C algorithm, and update the actor network parameters θ_a and the evaluator network parameters θ_c;

S5. Apply the updated actor network parameters θ_a and the updated evaluator network parameters θ_c to the whole SDN network, and transmit data using the SDN network with the updated parameters;

S6. Periodically detect whether the topology of the SDN network has changed; if so, go to step S7, otherwise repeat step S6;

S7. Train the deep reinforcement learning model online, update the actor network parameters θ_a and the evaluator network parameters θ_c with the adaptive running algorithm, apply the actor network parameters θ_a and the evaluator network parameters θ_c to the whole SDN network, and transmit data using the SDN network with the updated parameters;

where i = 1, 2, ..., L, and L is the total number of local GPUs.

Further, in step S1 the actor network is a fully connected neural network, and the evaluator network is a combination of a fully connected neural network and a CNN convolutional neural network. The inputs of both the actor network and the evaluator network include the network state of the SDN network, which comprises the current node information, the destination node information, the bandwidth requirement and the delay requirement; the input of the evaluator network additionally includes the network features of the SDN network extracted by the CNN convolutional neural network. The CNN convolutional neural network comprises an input layer, a convolution layer, a pooling layer, a fully connected layer and an output layer connected in sequence.
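
By way of illustration only, the following Python sketch shows one possible way to encode the network state described above (current node, destination node, bandwidth requirement and delay requirement) as a fixed-length input vector for the actor and evaluator networks; the encoding, the normalization constants and the function names are assumptions of this sketch and are not specified by the invention.

# Illustrative sketch (not from the patent): one way to encode the network state
# described above as a fixed-length input vector for the actor/evaluator networks.
import numpy as np

def encode_state(current_node: int, dest_node: int,
                 bandwidth_req: float, delay_req: float,
                 num_nodes: int) -> np.ndarray:
    """One-hot encode the current and destination nodes and append the
    bandwidth/delay requirements (normalization constants are assumptions)."""
    cur = np.zeros(num_nodes); cur[current_node] = 1.0
    dst = np.zeros(num_nodes); dst[dest_node] = 1.0
    # Hypothetical normalization: the patent does not specify units or scaling.
    reqs = np.array([bandwidth_req / 1000.0, delay_req / 100.0])
    return np.concatenate([cur, dst, reqs]).astype(np.float32)

# Example: node 3 forwarding towards node 7 in a 10-node topology.
state = encode_state(3, 7, bandwidth_req=200.0, delay_req=20.0, num_nodes=10)
print(state.shape)  # (22,)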

Further, the reward function in step S1 is:

r(s_n, a_n) = f(g, a_1, a_2, c(n), c(m), c(l), d(n), d(m))   (the exact expression is given only as an equation image in the original document)

where r(s_n, a_n) denotes the reward obtained after the n-th routing node in the SDN network, in state s_n, performs action a_n and forwards to the m-th routing node; g denotes the action penalty, a_1 the first weight and a_2 the second weight; c(n) denotes the remaining capacity of the n-th routing node, c(m) the remaining capacity of the m-th routing node, and c(l) the remaining capacity of the l-th link in the SDN network; d(n) denotes the degree of difference between the traffic load of the n-th routing node and that of its neighboring nodes, and d(m) denotes the same quantity for the m-th routing node. The state s_n comprises: the node where the data packet currently resides (the n-th routing node), the final destination node of the data packet, the forwarding bandwidth requirement of the data packet and the delay requirement of the data packet. The action a_n represents all forwarding operations that can be taken in state s_n.

Further, step S4 comprises the following sub-steps:

S41. Set the first counter t = 0, the second counter T = 0, the maximum number of iterations T_max and the routing hop limit t_max;

S42. Set dθ_a = 0 and dθ_c = 0, and synchronize the local parameters with the global parameters: set the local actor parameters θ'_a to the value of the actor network parameters θ_a, and set the local evaluator parameters θ'_c to the value of the evaluator network parameters θ_c;

S43. Set the first intermediate count value t_start = t, and read the current state s_t through the local GPU_i;

S44. Obtain the policy π(a_t|s_t; θ'_a) through the actor network and execute the action a_t according to the policy π(a_t|s_t; θ'_a), where π(a_t|s_t; θ'_a) indicates that a_t is the action to be executed given the state s_t and the local actor parameters θ'_a on the local GPU_i;

S45. Obtain the reward value r_t and the new state s_{t+1} after executing action a_t, and increase the count value of the first counter t by one;

S46. Judge whether the new state s_t satisfies the condition defined by the final state; if so, set the update reward value R = 0 and go to step S48, otherwise go to step S47;

S47. Judge whether t - t_start is greater than the routing hop limit t_max; if so, set the update reward value R = V(s_t; θ'_c) and go to step S48, otherwise return to step S44, where V(s_t; θ'_c) is the evaluator network's evaluation, under the local evaluator parameters θ'_c, of the routing policy that reached state s_t;

S48. Set the third counter z = t - 1 and the gradient update reward value R_update = r_z + γR, and initialize the gradient Δθ_a of the actor network parameters and the gradient Δθ_c of the evaluator network parameters to 0;

S49. According to the gradient update reward value R_update, the local actor parameters θ'_a and the local evaluator parameters θ'_c, obtain the updated values of the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c as:

Δθ_a_update = Δθ_a + ∇_{θ'_a} log π(a_z|s_z; θ'_a) · (R_update − V(s_z; θ'_c))

Δθ_c_update = Δθ_c + ∂(R_update − V(s_z; θ'_c))² / ∂θ'_c

where Δθ_a_update is the updated value of the gradient Δθ_a; ∇_{θ'_a} denotes the derivative with respect to the local actor parameters θ'_a; log π(a_z|s_z; θ'_a) is the logarithm of the probability, under the parameters θ'_a and the state s_z, of the policy that executes action a_z; r_z is the reward value for executing action a_z; γ is the reward discount rate; V(s_z; θ'_c) is the evaluator network's evaluation, under the local evaluator parameters θ'_c, of the routing policy that reached state s_z; Δθ_c_update is the updated value of the gradient Δθ_c; and ∂(R_update − V(s_z; θ'_c))²/∂θ'_c denotes the partial derivative of (R_update − V(s_z; θ'_c))² with respect to θ'_c;

S410. Set Δθ_a = Δθ_a_update, Δθ_c = Δθ_c_update and R = R_update, and judge whether the third counter z equals the first intermediate count value t_start; if so, go to step S411, otherwise decrease the count value of the third counter z by one, update the gradient update reward value R_update to r_z + γR, and return to step S49;

S411. Judge whether the second counter T is greater than or equal to the maximum number of iterations T_max; if so, update the actor network parameters θ_a and the evaluator network parameters θ_c with the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c respectively, and end the update process; otherwise increase the count value of the second counter T by one and return to step S42.

Further, in step S411 the formulas for updating the actor network parameters θ_a and the evaluator network parameters θ_c with the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c are:

θ_a_update = θ_a + βΔθ_a

θ_c_update = θ_c + βΔθ_c

where θ_a_update denotes the updated actor network parameters θ_a, θ_c_update denotes the updated evaluator network parameters θ_c, and β denotes the weight of the local GPU_i in the SDN network.
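
For readers who want to relate steps S41-S411 to code, the following condensed Python/PyTorch sketch shows one worker's offline A3C update under several assumptions: a toy environment object exposing reset() and step(), small fully connected actor and critic networks, and an assumed step size; none of these details (layer sizes, learning rate, environment interface) come from the invention.

# Condensed sketch of one worker's offline update (steps S41-S411), assuming PyTorch
# and a toy environment; network sizes and hyper-parameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Policy network pi(a|s; theta_a): fully connected, one output per candidate action."""
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim, 64), nn.ReLU(), nn.Linear(64, a_dim))
    def forward(self, s):
        return F.softmax(self.net(s), dim=-1)

class Critic(nn.Module):
    """Value network V(s; theta_c): single scalar output."""
    def __init__(self, s_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim, 64), nn.ReLU(), nn.Linear(64, 1))
    def forward(self, s):
        return self.net(s).squeeze(-1)

def worker_update(global_actor, global_critic, local_actor, local_critic,
                  env, t_max=8, gamma=0.99, beta=1.0, lr=1e-3):
    # S42: synchronise local parameters with the global ones and clear gradients.
    local_actor.load_state_dict(global_actor.state_dict())
    local_critic.load_state_dict(global_critic.state_dict())
    local_actor.zero_grad(); local_critic.zero_grad()

    # S43-S47: roll out at most t_max hops; env.reset()/env.step() is an assumed interface.
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    for _ in range(t_max):
        probs = local_actor(torch.as_tensor(s, dtype=torch.float32))
        a = torch.multinomial(probs, 1).item()
        s_next, r, done = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next
        if done:
            break

    # S46/S47: R = 0 on a terminal state, otherwise bootstrap with V(s_t; theta'_c).
    R = 0.0 if done else local_critic(torch.as_tensor(s, dtype=torch.float32)).item()

    # S48-S410: walk the rollout backwards, accumulating the policy and value gradients.
    actor_loss, critic_loss = 0.0, 0.0
    for s_z, a_z, r_z in zip(reversed(states), reversed(actions), reversed(rewards)):
        R = r_z + gamma * R
        s_z = torch.as_tensor(s_z, dtype=torch.float32)
        advantage = R - local_critic(s_z)
        actor_loss = actor_loss - torch.log(local_actor(s_z)[a_z]) * advantage.detach()
        critic_loss = critic_loss + advantage.pow(2)
    (actor_loss + critic_loss).backward()

    # S411: push the accumulated local gradients into the global parameters
    # (theta <- theta + beta * delta-theta in the text); lr is an assumed step size,
    # and doing this every call condenses the outer T/T_max loop of the description.
    with torch.no_grad():
        for gp, lp in zip(list(global_actor.parameters()) + list(global_critic.parameters()),
                          list(local_actor.parameters()) + list(local_critic.parameters())):
            gp -= beta * lr * lp.grad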

Further, step S7 comprises the following sub-steps:

S71. Set the fourth counter j = 1 and collect a routing request task f;

S72. Assign the routing request task f to an idle GPU in the SDN network, denoted GPU_idle;

S73. Set dθ_a = 0 and dθ_c = 0, synchronize the local actor parameters θ'_a of GPU_idle to the value of the actor network parameters θ_a, and synchronize the local evaluator parameters θ'_c to the value of the evaluator network parameters θ_c;

S74. Set the second intermediate count value j_start = j, and read the initial state s_j at the current moment;

S75. Obtain through the actor network the policy π(a_j|s_j; θ'_a) for executing action a_j given the state s_j and the local actor parameters θ'_a, and execute the policy π(a_j|s_j; θ'_a);

S76. Obtain the reward value r_j and the new state s_{j+1} after executing action a_j, increase the count value of the fourth counter j by one, and add the action a_j to the action set A;

S77. Judge whether the new state s_j satisfies the condition defined by the final state of the routing request task f; if so, go to step S78, otherwise return to step S75;

S78. Obtain the routing path p from the action set A and judge whether the routing request task f matches the routing path p; if so, set the update reward value R = 0 and go to step S79, otherwise set the update reward value R = V(s_j; θ'_c) and go to step S79;

S79. Set the fifth counter k = j - 1 and the gradient update reward value R_update = r_k + γR, and initialize the gradient Δθ_a of the actor network parameters and the gradient Δθ_c of the evaluator network parameters to 0;

S710. According to the gradient update reward value R_update, the local actor parameters θ'_a and the local evaluator parameters θ'_c, obtain the updated values of the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c as:

Δθ_a_update = Δθ_a + ∇_{θ'_a} log π(a_k|s_k; θ'_a) · (R_update − V(s_k; θ'_c))

Δθ_c_update = Δθ_c + ∂(R_update − V(s_k; θ'_c))² / ∂θ'_c

where Δθ_a_update is the updated value of the gradient Δθ_a; ∇_{θ'_a} denotes the derivative with respect to the local actor parameters θ'_a; log π(a_k|s_k; θ'_a) is the logarithm of the probability, under the parameters θ'_a and the state s_k, of the policy that executes action a_k; r_k is the reward value for executing action a_k; γ is the reward discount rate; V(s_k; θ'_c) is the evaluator network's evaluation, under the local evaluator parameters θ'_c, of the routing policy that reached state s_k; Δθ_c_update is the updated value of the gradient Δθ_c; and ∂(R_update − V(s_k; θ'_c))²/∂θ'_c denotes the partial derivative of (R_update − V(s_k; θ'_c))² with respect to θ'_c;

S711. Set Δθ_a = Δθ_a_update, Δθ_c = Δθ_c_update and R = R_update, and judge whether the fifth counter k equals the second intermediate count value j_start; if so, go to step S712, otherwise decrease the count value of the fifth counter k by one, update the gradient update reward value R_update to r_k + γR, and return to step S710;

S712. Update the actor network parameters θ_a and the evaluator network parameters θ_c with the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c respectively, apply the actor network parameters θ_a and the evaluator network parameters θ_c to the whole SDN network, and transmit data using the SDN network with the updated parameters.
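
A corresponding sketch of the online-adaptation stage (steps S71-S712) is given below, under the same assumptions as the offline sketch above; the helper path_satisfies() and the environment interface are illustrative placeholders rather than parts of the invention.

# Minimal sketch of the online-adaptation stage (steps S71-S712); the matching rule
# and the environment interface are assumptions of this sketch.
import torch

def path_satisfies(request, path):
    # Hypothetical matching rule: the rolled-out path is accepted when it ends at the
    # requested destination; the patent's actual match between task f and path p is
    # not spelled out here.
    return bool(path) and path[-1] == request.get("dst")

def online_adapt(global_actor, global_critic, local_actor, local_critic,
                 env, request, max_hops=64, gamma=0.99, beta=1.0, lr=1e-3):
    # S73: synchronise the idle GPU's local parameters with the global ones.
    local_actor.load_state_dict(global_actor.state_dict())
    local_critic.load_state_dict(global_critic.state_dict())
    local_actor.zero_grad(); local_critic.zero_grad()

    # S74-S77: roll out hops until the request's terminal condition (or a hop cap) is reached.
    s, done = env.reset(request), False
    states, actions, rewards, path = [], [], [], []
    for _ in range(max_hops):
        probs = local_actor(torch.as_tensor(s, dtype=torch.float32))
        a = torch.multinomial(probs, 1).item()
        s_next, r, done = env.step(a)
        states.append(s); actions.append(a); rewards.append(r); path.append(a)
        s = s_next
        if done:
            break

    # S78: R = 0 if the produced path matches the request, otherwise bootstrap with V(s_j).
    if path_satisfies(request, path):
        R = 0.0
    else:
        R = local_critic(torch.as_tensor(s, dtype=torch.float32)).item()

    # S79-S712: accumulate gradients exactly as in the offline stage and push them
    # into the global parameters (theta <- theta + beta * delta-theta).
    loss = 0.0
    for s_k, a_k, r_k in zip(reversed(states), reversed(actions), reversed(rewards)):
        R = r_k + gamma * R
        s_k = torch.as_tensor(s_k, dtype=torch.float32)
        adv = R - local_critic(s_k)
        loss = loss - torch.log(local_actor(s_k)[a_k]) * adv.detach() + adv.pow(2)
    loss.backward()
    with torch.no_grad():
        for gp, lp in zip(list(global_actor.parameters()) + list(global_critic.parameters()),
                          list(local_actor.parameters()) + list(local_critic.parameters())):
            gp -= beta * lr * lp.grad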

The beneficial effects of the present invention are:

(1) The invention realizes fast routing-path computation, maximizes throughput while guaranteeing delay, and solves the slow speed and low throughput of traditional algorithms.

(2) The invention uses a reinforcement learning algorithm that reduces route computation to a simple input-output mapping, avoiding the repeated iterations of conventional computation and thereby enabling fast calculation of routing paths. The faster routing algorithm lowers forwarding delay, so packets that would otherwise be discarded when their TTL expires are more likely to survive and be forwarded successfully, which increases network throughput.

(3) The invention includes two training stages, offline training and online training, and updates its parameters in a dynamic environment to select the optimal path, so it is topology-adaptive.

(4) The invention defines a reward function so that node and link loads, routing requirements and network topology information better constrain the reinforcement learning training process, enabling the trained deep reinforcement learning model to perform routing tasks more accurately.

Brief Description of the Drawings

Fig. 1 is a flowchart of the SDN network intelligent routing data transmission method based on distributed deep reinforcement learning proposed by the present invention;

Fig. 2 is a schematic diagram of the CNN convolutional neural network in the present invention;

Fig. 3 is a schematic diagram of the deep reinforcement learning model in the present invention.

Detailed Description of the Embodiments

Specific embodiments of the present invention are described below to help those skilled in the art understand the invention, but it should be clear that the invention is not limited to the scope of these specific embodiments. For those of ordinary skill in the art, as long as the various changes fall within the spirit and scope of the invention as defined and determined by the appended claims, such changes are obvious, and all inventions and creations making use of the inventive concept are protected.

Embodiments of the present invention are described in detail below with reference to the accompanying drawings.

As shown in Fig. 1, an SDN network intelligent routing data transmission method based on distributed deep reinforcement learning comprises the following steps:

S1. Construct a reward function and a deep reinforcement learning model comprising an actor network and an evaluator network, and deploy the deep reinforcement learning model in the application layer of the SDN network;

S2. Randomly initialize the actor network parameters θ_a and the evaluator network parameters θ_c of the deep reinforcement learning model;

S3. Randomly initialize the local actor parameters θ'_a of the actor network and the local evaluator parameters θ'_c of the evaluator network on the i-th local GPU_i in the control layer of the SDN network;

S4. According to the reward function, the actor network parameters θ_a, the evaluator network parameters θ_c, the local actor parameters θ'_a and the local evaluator parameters θ'_c, train the deep reinforcement learning model on the i-th local GPU_i offline with the A3C algorithm, and update the actor network parameters θ_a and the evaluator network parameters θ_c;

S5. Apply the updated actor network parameters θ_a and the updated evaluator network parameters θ_c to the whole SDN network, and transmit data using the SDN network with the updated parameters;

S6. Periodically detect whether the topology of the SDN network has changed; if so, go to step S7, otherwise repeat step S6;

S7. Train the deep reinforcement learning model online, update the actor network parameters θ_a and the evaluator network parameters θ_c with the adaptive running algorithm, apply the actor network parameters θ_a and the evaluator network parameters θ_c to the whole SDN network, and transmit data using the SDN network with the updated parameters;

where i = 1, 2, ..., L, and L is the total number of local GPUs.

In step S1, the actor network is a fully connected neural network, and the evaluator network is a combination of a fully connected neural network and a CNN convolutional neural network. The inputs of both the actor network and the evaluator network include the network state of the SDN network, which comprises the current node information, the destination node information, the bandwidth requirement and the delay requirement; the input of the evaluator network additionally includes the network features of the SDN network extracted by the CNN convolutional neural network.

As shown in Fig. 2, the CNN convolutional neural network comprises an input layer, a convolution layer, a pooling layer, a fully connected layer and an output layer connected in sequence.
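
As an illustration of the layer order described above (input, convolution, pooling, fully connected, output), the following PyTorch sketch builds a small convolutional feature extractor over a node-by-node link matrix; the channel counts, kernel size and choice of input matrix are assumptions of this sketch, not values given by the invention.

# Sketch only: a small convolutional feature extractor following the described layer
# order; all sizes and the use of a residual-capacity link matrix are assumptions.
import torch
import torch.nn as nn

class TopologyCNN(nn.Module):
    def __init__(self, num_nodes: int, feat_dim: int = 32):
        super().__init__()
        self.conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)   # convolution layer
        self.pool = nn.MaxPool2d(2)                              # pooling layer
        side = num_nodes // 2
        self.fc = nn.Linear(8 * side * side, feat_dim)           # fully connected layer
        self.out = nn.Linear(feat_dim, feat_dim)                 # output layer

    def forward(self, link_matrix: torch.Tensor) -> torch.Tensor:
        # link_matrix: (batch, 1, num_nodes, num_nodes), e.g. residual link capacities.
        x = torch.relu(self.conv(link_matrix))
        x = self.pool(x)
        x = torch.relu(self.fc(x.flatten(1)))
        return self.out(x)

features = TopologyCNN(num_nodes=10)(torch.rand(1, 1, 10, 10))
print(features.shape)  # torch.Size([1, 32])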

The reward function in step S1 is:

r(s_n, a_n) = f(g, a_1, a_2, c(n), c(m), c(l), d(n), d(m))   (the exact expression is given only as an equation image in the original document)

where r(s_n, a_n) denotes the reward obtained after the n-th routing node in the SDN network, in state s_n, performs action a_n and forwards to the m-th routing node; g denotes the action penalty, a_1 the first weight and a_2 the second weight; c(n) denotes the remaining capacity of the n-th routing node, c(m) the remaining capacity of the m-th routing node, and c(l) the remaining capacity of the l-th link in the SDN network; d(n) denotes the degree of difference between the traffic load of the n-th routing node and that of its neighboring nodes, and d(m) denotes the same quantity for the m-th routing node. The state s_n comprises: the node where the data packet currently resides (the n-th routing node), the final destination node of the data packet, the forwarding bandwidth requirement of the data packet and the delay requirement of the data packet. The action a_n represents all forwarding operations that can be taken in state s_n.

Step S4 comprises the following sub-steps:

S41. Set the first counter t = 0, the second counter T = 0, the maximum number of iterations T_max and the routing hop limit t_max;

S42. Set dθ_a = 0 and dθ_c = 0, and synchronize the local parameters with the global parameters: set the local actor parameters θ'_a to the value of the actor network parameters θ_a, and set the local evaluator parameters θ'_c to the value of the evaluator network parameters θ_c;

S43. Set the first intermediate count value t_start = t, and read the current state s_t through the local GPU_i;

S44. Obtain the policy π(a_t|s_t; θ'_a) through the actor network and execute the action a_t according to the policy π(a_t|s_t; θ'_a), where π(a_t|s_t; θ'_a) indicates that a_t is the action to be executed given the state s_t and the local actor parameters θ'_a on the local GPU_i;

S45. Obtain the reward value r_t and the new state s_{t+1} after executing action a_t, and increase the count value of the first counter t by one;

S46. Judge whether the new state s_t satisfies the condition defined by the final state; if so, set the update reward value R = 0 and go to step S48, otherwise go to step S47;

S47. Judge whether t - t_start is greater than the routing hop limit t_max; if so, set the update reward value R = V(s_t; θ'_c) and go to step S48, otherwise return to step S44, where V(s_t; θ'_c) is the evaluator network's evaluation, under the local evaluator parameters θ'_c, of the routing policy that reached state s_t;

S48. Set the third counter z = t - 1 and the gradient update reward value R_update = r_z + γR, and initialize the gradient Δθ_a of the actor network parameters and the gradient Δθ_c of the evaluator network parameters to 0;

S49. According to the gradient update reward value R_update, the local actor parameters θ'_a and the local evaluator parameters θ'_c, obtain the updated values of the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c as:

Δθ_a_update = Δθ_a + ∇_{θ'_a} log π(a_z|s_z; θ'_a) · (R_update − V(s_z; θ'_c))

Δθ_c_update = Δθ_c + ∂(R_update − V(s_z; θ'_c))² / ∂θ'_c

where Δθ_a_update is the updated value of the gradient Δθ_a; ∇_{θ'_a} denotes the derivative with respect to the local actor parameters θ'_a; log π(a_z|s_z; θ'_a) is the logarithm of the probability, under the parameters θ'_a and the state s_z, of the policy that executes action a_z; r_z is the reward value for executing action a_z; γ is the reward discount rate; V(s_z; θ'_c) is the evaluator network's evaluation, under the local evaluator parameters θ'_c, of the routing policy that reached state s_z; Δθ_c_update is the updated value of the gradient Δθ_c; and ∂(R_update − V(s_z; θ'_c))²/∂θ'_c denotes the partial derivative of (R_update − V(s_z; θ'_c))² with respect to θ'_c;

S410. Set Δθ_a = Δθ_a_update, Δθ_c = Δθ_c_update and R = R_update, and judge whether the third counter z equals the first intermediate count value t_start; if so, go to step S411, otherwise decrease the count value of the third counter z by one, update the gradient update reward value R_update to r_z + γR, and return to step S49;

S411. Judge whether the second counter T is greater than or equal to the maximum number of iterations T_max; if so, update the actor network parameters θ_a and the evaluator network parameters θ_c with the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c respectively, and end the update process; otherwise increase the count value of the second counter T by one and return to step S42.

In step S411 the formulas for updating the actor network parameters θ_a and the evaluator network parameters θ_c with the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c are:

θ_a_update = θ_a + βΔθ_a

θ_c_update = θ_c + βΔθ_c

where θ_a_update denotes the updated actor network parameters θ_a, θ_c_update denotes the updated evaluator network parameters θ_c, and β denotes the weight of the local GPU_i in the SDN network.

Step S7 comprises the following sub-steps:

S71. Set the fourth counter j = 1 and collect a routing request task f;

S72. Assign the routing request task f to an idle GPU in the SDN network, denoted GPU_idle;

S73. Set dθ_a = 0 and dθ_c = 0, synchronize the local actor parameters θ'_a of GPU_idle to the value of the actor network parameters θ_a, and synchronize the local evaluator parameters θ'_c to the value of the evaluator network parameters θ_c;

S74. Set the second intermediate count value j_start = j, and read the initial state s_j at the current moment;

S75. Obtain through the actor network the policy π(a_j|s_j; θ'_a) for executing action a_j given the state s_j and the local actor parameters θ'_a, and execute the policy π(a_j|s_j; θ'_a);

S76. Obtain the reward value r_j and the new state s_{j+1} after executing action a_j, increase the count value of the fourth counter j by one, and add the action a_j to the action set A;

S77. Judge whether the new state s_j satisfies the condition defined by the final state of the routing request task f; if so, go to step S78, otherwise return to step S75;

S78. Obtain the routing path p from the action set A and judge whether the routing request task f matches the routing path p; if so, set the update reward value R = 0 and go to step S79, otherwise set the update reward value R = V(s_j; θ'_c) and go to step S79;

S79. Set the fifth counter k = j - 1 and the gradient update reward value R_update = r_k + γR, and initialize the gradient Δθ_a of the actor network parameters and the gradient Δθ_c of the evaluator network parameters to 0;

S710. According to the gradient update reward value R_update, the local actor parameters θ'_a and the local evaluator parameters θ'_c, obtain the updated values of the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c as:

Δθ_a_update = Δθ_a + ∇_{θ'_a} log π(a_k|s_k; θ'_a) · (R_update − V(s_k; θ'_c))

Δθ_c_update = Δθ_c + ∂(R_update − V(s_k; θ'_c))² / ∂θ'_c

where Δθ_a_update is the updated value of the gradient Δθ_a; ∇_{θ'_a} denotes the derivative with respect to the local actor parameters θ'_a; log π(a_k|s_k; θ'_a) is the logarithm of the probability, under the parameters θ'_a and the state s_k, of the policy that executes action a_k; r_k is the reward value for executing action a_k; γ is the reward discount rate; V(s_k; θ'_c) is the evaluator network's evaluation, under the local evaluator parameters θ'_c, of the routing policy that reached state s_k; Δθ_c_update is the updated value of the gradient Δθ_c; and ∂(R_update − V(s_k; θ'_c))²/∂θ'_c denotes the partial derivative of (R_update − V(s_k; θ'_c))² with respect to θ'_c;

S711. Set Δθ_a = Δθ_a_update, Δθ_c = Δθ_c_update and R = R_update, and judge whether the fifth counter k equals the second intermediate count value j_start; if so, go to step S712, otherwise decrease the count value of the fifth counter k by one, update the gradient update reward value R_update to r_k + γR, and return to step S710;

S712. Update the actor network parameters θ_a and the evaluator network parameters θ_c with the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c respectively, apply the actor network parameters θ_a and the evaluator network parameters θ_c to the whole SDN network, and transmit data using the SDN network with the updated parameters.

As shown in Fig. 3, in this embodiment the deep reinforcement learning model comprises an actor and a critic (evaluator) pair, both built from neural networks (NN). The actor network outputs a probability distribution over all actions in a given state, i.e. the routing policy, and is therefore a multi-output neural network. The critic network uses the temporal-difference error to evaluate the actor's policy and is a single-output neural network. The actor network is a fully connected neural network: after the current node information, destination node information, bandwidth requirement and delay requirement are fed in, each neuron computes a weighted sum followed by an activation function, and multiple outputs are produced. The actor network gives the next action according to the current state; since several actions are available, the network has multiple outputs, each being the probability of one routing choice. The evaluator network takes the same four items of network information plus an additional network-feature input, and its output is the evaluation of the actor network's policy, so it has a single output. The extra network-feature input carries the network's change information; incorporating real-time network state changes when evaluating the actor's policy makes the intelligent routing adaptive.
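
To make the structure in Fig. 3 concrete, the following PyTorch sketch pairs a multi-output actor (one probability per candidate next hop) with a single-output evaluator that concatenates the request state with the CNN-derived network features; all layer sizes are illustrative assumptions, not values given by the invention.

# Sketch only: a multi-output actor and a single-output evaluator with an extra
# network-feature input, as described above; sizes are assumptions.
import torch
import torch.nn as nn

class ActorNet(nn.Module):
    def __init__(self, state_dim: int, num_next_hops: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, num_next_hops))
    def forward(self, state):
        return torch.softmax(self.net(state), dim=-1)   # one probability per next hop

class EvaluatorNet(nn.Module):
    def __init__(self, state_dim: int, topo_feat_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + topo_feat_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, state, topo_features):
        return self.net(torch.cat([state, topo_features], dim=-1))  # single output: V(s)

actor = ActorNet(state_dim=22, num_next_hops=4)
critic = EvaluatorNet(state_dim=22, topo_feat_dim=32)
probs = actor(torch.rand(1, 22))
value = critic(torch.rand(1, 22), torch.rand(1, 32))
print(probs.shape, value.shape)  # torch.Size([1, 4]) torch.Size([1, 1])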

Claims (6)

1. An SDN network intelligent routing data transmission method based on distributed deep reinforcement learning, characterized by comprising the following steps:

S1, constructing a reward function and a deep reinforcement learning model comprising an actor network and an evaluator network, and deploying the deep reinforcement learning model in an application layer of the SDN network;

S2, randomly initializing the actor network parameters θ_a and the evaluator network parameters θ_c of the deep reinforcement learning model;

S3, randomly initializing the local actor parameters θ'_a of the actor network and the local evaluator parameters θ'_c of the evaluator network on the i-th local GPU_i in the control layer of the SDN network;

S4, according to the reward function, the actor network parameters θ_a, the evaluator network parameters θ_c, the local actor parameters θ'_a and the local evaluator parameters θ'_c, training the deep reinforcement learning model on the i-th local GPU_i offline with the A3C algorithm, and updating the actor network parameters θ_a and the evaluator network parameters θ_c;

S5, applying the updated actor network parameters θ_a and the updated evaluator network parameters θ_c to the whole SDN network, and transmitting data using the SDN network with the updated parameters;

S6, periodically detecting whether the topology of the SDN network has changed; if so, entering step S7, otherwise repeating step S6;

S7, training the deep reinforcement learning model online, updating the actor network parameters θ_a and the evaluator network parameters θ_c with the adaptive running algorithm, applying the actor network parameters θ_a and the evaluator network parameters θ_c to the whole SDN network, and transmitting data using the SDN network with the updated parameters;

wherein i = 1, 2, ..., L, and L represents the total number of local GPUs.
2. The SDN network intelligent routing data transmission method based on distributed deep reinforcement learning according to claim 1, wherein the actor network in step S1 is a fully connected neural network, and the evaluator network in step S1 is a combination network of a fully connected neural network and a CNN convolutional neural network; the inputs of both the actor network and the evaluator network comprise network states of the SDN network, the network states comprising current node information, destination node information, bandwidth requirements and delay requirements, and the input of the evaluator network further comprising network features of the SDN network processed by the CNN convolutional neural network; the CNN convolutional neural network comprises an input layer, a convolution layer, a pooling layer, a fully connected layer and an output layer which are connected in sequence.
3. The SDN network intelligent routing data transmission method based on distributed deep reinforcement learning according to claim 1, wherein the reward function in step S1 is:

r(s_n, a_n) = f(g, a_1, a_2, c(n), c(m), c(l), d(n), d(m))   (the exact expression is given only as an equation image in the original document)

wherein r(s_n, a_n) represents the reward value obtained after the n-th routing node in the SDN network, in the state s_n, performs the action a_n towards the m-th routing node; g represents the action penalty, a_1 represents the first weight, a_2 represents the second weight; c(n) represents the remaining capacity of the n-th routing node, c(m) represents the remaining capacity of the m-th routing node, and c(l) represents the remaining capacity of the l-th link in the SDN network; d(n) represents the degree of difference between the traffic load of the n-th routing node and that of its neighboring nodes, and d(m) represents the degree of difference between the traffic load of the m-th routing node and that of its neighboring nodes; the state s_n comprises: the n-th routing node where the data packet is located, the final destination node of the data packet, the forwarding bandwidth requirement of the data packet and the delay requirement of the data packet; the action a_n represents all forwarding operations that may be taken in the state s_n.
4. The SDN network intelligent routing data transmission method based on distributed deep reinforcement learning according to claim 1, wherein the step S4 comprises the following sub-steps:

S41, setting a first counter t = 0, a second counter T = 0, a maximum number of iterations T_max and a routing hop limit t_max;

S42, setting dθ_a = 0 and dθ_c = 0, and synchronizing the local parameters with the global parameters: synchronizing the value of the local actor parameters θ'_a to the value of the actor network parameters θ_a, and synchronizing the value of the local evaluator parameters θ'_c to the value of the evaluator network parameters θ_c;

S43, setting the first intermediate count value t_start = t, and reading the state s_t at the current moment through the local GPU_i;

S44, obtaining the policy π(a_t|s_t; θ'_a) through the actor network and executing the action a_t according to the policy π(a_t|s_t; θ'_a), wherein π(a_t|s_t; θ'_a) indicates that the action to be executed given the state s_t and the local actor parameters θ'_a on the local GPU_i is a_t;

S45, obtaining the reward value r_t and the new state s_{t+1} after executing the action a_t, and increasing the count value of the first counter t by one;

S46, judging whether the new state s_t reaches the condition defined by the final state; if so, setting the update reward value R = 0 and proceeding to step S48, otherwise proceeding to step S47;

S47, judging whether t - t_start is greater than the routing hop limit t_max; if so, setting the update reward value R = V(s_t; θ'_c) and proceeding to step S48, otherwise returning to step S44, wherein V(s_t; θ'_c) represents the evaluator network's evaluation value, under the local evaluator parameters θ'_c, of the routing policy that reached the state s_t;

S48, setting a third counter z = t - 1 and the gradient update reward value R_update = r_z + γR, and initializing the gradient Δθ_a of the actor network parameters and the gradient Δθ_c of the evaluator network parameters to 0;

S49, according to the gradient update reward value R_update, the local actor parameters θ'_a and the local evaluator parameters θ'_c, obtaining the updated values of the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c as:

Δθ_a_update = Δθ_a + ∇_{θ'_a} log π(a_z|s_z; θ'_a) · (R_update − V(s_z; θ'_c))

Δθ_c_update = Δθ_c + ∂(R_update − V(s_z; θ'_c))² / ∂θ'_c

wherein Δθ_a_update represents the updated value of the gradient Δθ_a, ∇_{θ'_a} represents the derivative with respect to the local actor parameters θ'_a, log π(a_z|s_z; θ'_a) represents the logarithm of the probability, under the parameters θ'_a and the state s_z, of the policy that executes the action a_z, r_z represents the reward value for executing the action a_z, γ represents the reward discount rate, V(s_z; θ'_c) represents the evaluator network's evaluation value, under the local evaluator parameters θ'_c, of the routing policy that reached the state s_z, Δθ_c_update represents the updated value of the gradient Δθ_c, and ∂(R_update − V(s_z; θ'_c))²/∂θ'_c represents the partial derivative of (R_update − V(s_z; θ'_c))² with respect to θ'_c;

S410, setting Δθ_a = Δθ_a_update, Δθ_c = Δθ_c_update and R = R_update, and judging whether the third counter z equals the first intermediate count value t_start; if so, proceeding to step S411, otherwise decreasing the count value of the third counter z by one, updating the gradient update reward value R_update to r_z + γR, and returning to step S49;

S411, judging whether the second counter T is greater than or equal to the maximum number of iterations T_max; if so, updating the actor network parameters θ_a and the evaluator network parameters θ_c with the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c respectively, and ending the update process; otherwise increasing the count value of the second counter T by one and returning to step S42.
5. The SDN network intelligent routing data transmission method based on distributed deep reinforcement learning according to claim 4, wherein the formulas in step S411 for updating the actor network parameters θ_a and the evaluator network parameters θ_c with the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c respectively are:

θ_a_update = θ_a + βΔθ_a

θ_c_update = θ_c + βΔθ_c

wherein θ_a_update represents the updated actor network parameters θ_a, θ_c_update represents the updated evaluator network parameters θ_c, and β represents the weight of the local GPU_i in the SDN network.
6. The SDN network smart routing data transmission method of distributed deep reinforcement learning according to claim 4, wherein the step S7 includes the following sub-steps:
s71, setting a fourth counterj=1, and collects route request tasksf
S72, routing request taskfAllocated to idle in SDN networkGPUIs idleGPUIs composed of
Figure 683205DEST_PATH_IMAGE109
S73, setting
Figure 103428DEST_PATH_IMAGE110
And
Figure 872451DEST_PATH_IMAGE111
and will be
Figure 937621DEST_PATH_IMAGE112
Local actor parameters of
Figure 164947DEST_PATH_IMAGE113
Synchronizing to actor network parameters
Figure 352477DEST_PATH_IMAGE114
Parameter value, local evaluator parameter
Figure 948410DEST_PATH_IMAGE115
Synchronizing evaluator network parameters
Figure 220866DEST_PATH_IMAGE116
A parameter value;
s74, calculating the second intermediate count value
Figure 977732DEST_PATH_IMAGE117
And reading the initial state of the current time
Figure 745618DEST_PATH_IMAGE118
S75, obtaining the state through the actor network
Figure 483373DEST_PATH_IMAGE119
And local actor parameters
Figure 615978DEST_PATH_IMAGE120
Perform an action
Figure 430612DEST_PATH_IMAGE121
Strategy (2)
Figure 160277DEST_PATH_IMAGE122
And execute the policy
Figure 125609DEST_PATH_IMAGE123
S76, acquiring and executing action
Figure 867431DEST_PATH_IMAGE124
Value of the reward after
Figure 189435DEST_PATH_IMAGE125
And new state
Figure 904317DEST_PATH_IMAGE126
Let the fourth counterjAnd increases the count value of (d) by one and acts on
Figure 947491DEST_PATH_IMAGE127
Adding an action set A;
s77, judging the new state
Figure 552391DEST_PATH_IMAGE128
Whether to reach the route request taskfIf so, go to step S78, otherwise return to step S75;
s78, obtaining the routing path according to the action set ApAnd judging the routing request taskfWhether to communicate with a routing pathpIf matching, then order to update the reward valueR=0, and proceed to step S79, otherwise, let the prize value be updated
Figure 403935DEST_PATH_IMAGE129
And proceeds to step S79;
s79, setting the fifth counterk=j-1 and gradient update prize value
Figure 684524DEST_PATH_IMAGE130
Initializing gradients of actor network parameters
Figure 353010DEST_PATH_IMAGE131
And gradient of evaluator network parameters
Figure 303211DEST_PATH_IMAGE132
Is 0;
s710, updating the reward value according to the gradient
Figure 478103DEST_PATH_IMAGE133
Local actor parameters
Figure 392576DEST_PATH_IMAGE134
And local evaluator parameters
Figure 554217DEST_PATH_IMAGE135
Obtaining local actor parameter gradients
Figure 845128DEST_PATH_IMAGE136
And local actor parameter gradient
Figure 671263DEST_PATH_IMAGE137
The update values of (a) are:
Figure 898850DEST_PATH_IMAGE138
Figure 214031DEST_PATH_IMAGE139
wherein dθ on the left-hand side of the first expression represents the updated value of the gradient dθ, ∇_θ' represents the derivative with respect to the local actor parameters θ', log π(a_k | s_k; θ') represents the logarithm of the probability of performing action a_k in state s_k under the parameters θ', r_k represents the reward value of executing action a_k, γ represents the discount rate of the reward, V(s_k; θ_v') represents the evaluation value that the evaluator network, under the local evaluator parameters θ_v', assigns to the routing policy reaching state s_k, dθ_v on the left-hand side of the second expression represents the updated value of the gradient dθ_v, and ∂(r_k + γR − V(s_k; θ_v'))²/∂θ_v' represents the partial derivative of (r_k + γR − V(s_k; θ_v'))² with respect to θ_v';
S711, letting dθ and dθ_v take the update values obtained in step S710, and judging whether the fifth counter k equals the second intermediate count value j_start; if so, going to step S712; otherwise, updating the gradient-update reward value R to r_k + γR, decreasing the count value of the fifth counter k by one, and returning to step S710;
S712, updating the actor network parameters θ and the evaluator network parameters θ_v respectively through the local actor parameter gradient dθ and the local evaluator parameter gradient dθ_v, applying the updated actor network parameters θ and evaluator network parameters θ_v to the whole SDN network, and transmitting data by using the SDN network with the updated parameters.
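The sub-steps above follow the general pattern of an asynchronous advantage actor-critic worker: roll out a path for one routing request with local copies of the actor and evaluator, accumulate policy and value gradients backwards over the trajectory, and push the accumulated gradients to the global parameters. The following is a minimal, self-contained Python sketch of that control flow only, not the patented implementation: it uses a tabular softmax actor and tabular evaluator, toy hop dynamics and rewards, and illustrative names (policy, rollout, d_theta, d_theta_v, GAMMA) that do not come from the patent:

    import numpy as np

    # Toy sizes and parameters for illustration only: a tabular softmax actor
    # theta_local[state, action] and a tabular evaluator theta_v_local[state].
    # The patent uses neural networks; every name and value here is assumed.
    N_STATES, N_ACTIONS = 6, 3
    GAMMA = 0.9                      # discount rate of the reward (value assumed)
    rng = np.random.default_rng(0)
    theta_local = rng.normal(0.0, 0.1, (N_STATES, N_ACTIONS))  # local actor parameters theta'
    theta_v_local = np.zeros(N_STATES)                          # local evaluator parameters theta_v'

    def policy(s):
        # Softmax policy pi(. | s; theta') of the local actor.
        z = np.exp(theta_local[s] - theta_local[s].max())
        return z / z.sum()

    def rollout(start=0, destination=N_STATES - 1):
        # S74-S77: act with the local actor until the destination of the request is reached.
        s, traj = start, []
        while s != destination:
            a = rng.choice(N_ACTIONS, p=policy(s))
            s_next = min(s + a + 1, destination)  # toy "next hop" dynamics (illustrative)
            r = -1.0                              # toy per-hop reward (illustrative)
            traj.append((s, a, r))
            s = s_next
        return traj, s

    traj, final_state = rollout()

    # S78: R = 0 if the rolled-out path matches the request; otherwise bootstrap from
    # the evaluator's value of the final state. Here we take the bootstrap branch.
    R = theta_v_local[final_state]

    # S79-S711: accumulate gradients backwards over the trajectory.
    d_theta = np.zeros_like(theta_local)      # gradient of actor network parameters
    d_theta_v = np.zeros_like(theta_v_local)  # gradient of evaluator network parameters
    for s_k, a_k, r_k in reversed(traj):
        R = r_k + GAMMA * R                   # gradient-update reward value
        adv = R - theta_v_local[s_k]          # advantage R - V(s_k; theta_v')
        grad_log = -policy(s_k)               # grad of log pi(a_k | s_k; theta') w.r.t. theta'[s_k]
        grad_log[a_k] += 1.0
        d_theta[s_k] += grad_log * adv        # policy-gradient (ascent) accumulation
        d_theta_v[s_k] += -2.0 * adv          # grad of (R - V(s_k; theta_v'))^2 w.r.t. theta_v'[s_k]

    # S712: d_theta and d_theta_v would now be pushed asynchronously to the global
    # actor and evaluator parameters and applied to the whole SDN network.

In a distributed deployment, one such worker loop would run per idle GPU, each handling a different routing request and pushing its accumulated gradients to the shared actor and evaluator parameters asynchronously.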

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010673851.8A CN111917642B (en) 2020-07-14 2020-07-14 SDN network intelligent routing data transmission method based on distributed deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN111917642A (en) 2020-11-10
CN111917642B (en) 2021-04-27

Family

ID=73280083

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010673851.8A Expired - Fee Related CN111917642B (en) 2020-07-14 2020-07-14 SDN network intelligent routing data transmission method based on distributed deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN111917642B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818788B (en) * 2021-01-25 2022-05-03 电子科技大学 A Distributed Convolutional Neural Network Hierarchical Matching Method Based on UAV Swarm
CN113316216B (en) * 2021-05-26 2022-04-08 电子科技大学 A routing method for micro-nano satellite network
CN113537628B (en) * 2021-08-04 2023-08-22 郭宏亮 Universal reliable shortest path method based on distributed reinforcement learning
CN114051272A (en) * 2021-10-30 2022-02-15 西南电子技术研究所(中国电子科技集团公司第十研究所) Intelligent routing method for dynamic topological network
CN115660030B (en) * 2022-11-04 2025-12-09 中国科学技术大学 Robot cloud and end cooperative computing processing method, equipment and storage medium

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150269479A1 (en) * 2014-03-24 2015-09-24 Qualcomm Incorporated Conversion of neuron types to hardware
CN106873585B (en) * 2017-01-18 2019-12-03 上海器魂智能科技有限公司 A kind of navigation method for searching, robot and system
US10396919B1 (en) * 2017-05-12 2019-08-27 Virginia Tech Intellectual Properties, Inc. Processing of communications signals using machine learning
US10375585B2 (en) * 2017-07-06 2019-08-06 Futurwei Technologies, Inc. System and method for deep learning and wireless network optimization using deep learning
CN108600104B (en) * 2018-04-28 2019-10-01 电子科技大学 A kind of SDN Internet of Things flow polymerization based on tree-shaped routing
EP3769264B1 (en) * 2018-05-18 2025-08-27 DeepMind Technologies Limited Meta-gradient updates for training return functions for reinforcement learning systems
US10940863B2 (en) * 2018-11-01 2021-03-09 GM Global Technology Operations LLC Spatial and temporal attention-based deep reinforcement learning of hierarchical lane-change policies for controlling an autonomous vehicle
CN109803344B (en) * 2018-12-28 2019-10-11 北京邮电大学 A joint construction method of UAV network topology and routing
CN110611619B (en) * 2019-09-12 2020-10-09 西安电子科技大学 An Intelligent Routing Decision Method Based on DDPG Reinforcement Learning Algorithm
CN110515303B (en) * 2019-09-17 2022-09-09 余姚市浙江大学机器人研究中心 DDQN-based self-adaptive dynamic path planning method
CN111010294B (en) * 2019-11-28 2022-07-12 国网甘肃省电力公司电力科学研究院 Electric power communication network routing method based on deep reinforcement learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110770761A (en) * 2017-07-06 2020-02-07 华为技术有限公司 Deep learning systems and methods and wireless network optimization using deep learning
CN111316295A (en) * 2017-10-27 2020-06-19 渊慧科技有限公司 Reinforcement learning using distributed prioritized playback
CN108803615A (en) * 2018-07-03 2018-11-13 东南大学 A kind of visual human's circumstances not known navigation algorithm based on deeply study
CN109343341A (en) * 2018-11-21 2019-02-15 北京航天自动控制研究所 An intelligent control method for vertical recovery of launch vehicle based on deep reinforcement learning
CN110472738A (en) * 2019-08-16 2019-11-19 北京理工大学 A kind of unmanned boat Real Time Obstacle Avoiding algorithm based on deeply study
CN110472880A (en) * 2019-08-20 2019-11-19 李峰 Evaluate the method, apparatus and storage medium of collaborative problem resolution ability

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Multi-task Deep Reinforcement Learning for Scalable Parallel Task Scheduling; Lingxin Zhang; 2019 IEEE International Conference on Big Data (Big Data); 2020-02-24; pp. 1-10 *
Research on a new two-layer mapping system in name-address separation networks; Zhang Xiaoning; Journal of Electronics & Information Technology; 2014-10-30 (Vol. 36, No. 10); pp. 1-7 *

Also Published As

Publication number Publication date
CN111917642A (en) 2020-11-10

Similar Documents

Publication Publication Date Title
CN111917642B (en) SDN network intelligent routing data transmission method based on distributed deep reinforcement learning
CN111010294B (en) Electric power communication network routing method based on deep reinforcement learning
CN112437020B (en) Data center network load balancing method based on deep reinforcement learning
CN111988225B (en) Multi-path routing method based on reinforcement learning and transfer learning
CN113467952B (en) Distributed federal learning collaborative computing method and system
CN108111335B (en) A method and system for scheduling and linking virtual network functions
CN110611619A (en) An Intelligent Routing Decision-Making Method Based on DDPG Reinforcement Learning Algorithm
CN113784410B (en) Vertical Handoff Method for Heterogeneous Wireless Networks Based on Reinforcement Learning TD3 Algorithm
CN113194034A (en) Route optimization method and system based on graph neural network and deep reinforcement learning
CN116527567A (en) Intelligent network path optimization method and system based on deep reinforcement learning
CN113098714A (en) Low-delay network slicing method based on deep reinforcement learning
CN113570039B (en) A blockchain system with optimized consensus based on reinforcement learning
CN113395207B (en) A routing optimization architecture and method based on deep reinforcement learning under SDN architecture
WO2020172825A1 (en) Method and apparatus for determining transmission policy
CN113887748B (en) Online federated learning task assignment method, device, federated learning method and system
CN116669068A (en) GCN-based delay service end-to-end slice deployment method and system
CN114629543A (en) Satellite network adaptive traffic scheduling method based on deep supervised learning
CN116234073A (en) Routing method of distributed unmanned aerial vehicle ad hoc network based on deep reinforcement learning
CN111340192B (en) Network path allocation model training method, path allocation method and device
CN119201470A (en) A computing network resource scheduling optimization method based on multi-agent deep reinforcement learning
CN118474013A (en) Intelligent routing method for intention network based on DRL-GNN
CN113177636A (en) Network dynamic routing method and system based on multiple constraint conditions
CN114726770B (en) Traffic engineering method applied to segmented routing network environment
CN117938959A (en) Multi-target SFC deployment method based on deep reinforcement learning and genetic algorithm
Alliche et al. Prisma: a packet routing simulator for multi-agent reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210427