CN111917642B - SDN network intelligent routing data transmission method based on distributed deep reinforcement learning - Google Patents
SDN network intelligent routing data transmission method based on distributed deep reinforcement learning
- Publication number
- CN111917642B (granted publication of application CN202010673851.8A / CN202010673851A)
- Authority
- CN
- China
- Prior art keywords
- network
- actor
- parameters
- local
- evaluator
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L45/00—Routing or path finding of packets in data switching networks
- H04L45/12—Shortest path evaluation
- H04L45/124—Shortest path evaluation using a combination of metrics
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L45/00—Routing or path finding of packets in data switching networks
- H04L45/12—Shortest path evaluation
- H04L45/121—Shortest path evaluation by minimising delays
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L45/00—Routing or path finding of packets in data switching networks
- H04L45/12—Shortest path evaluation
- H04L45/125—Shortest path evaluation based on throughput or bandwidth
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
Description
Technical Field
The invention belongs to the field of data transmission, and in particular relates to an SDN network intelligent routing data transmission method based on distributed deep reinforcement learning.
Background Art
Information technology has now entered a mature stage. In the SDN (Software Defined Network) architecture, data flows are flexible and controllable, and the controller has a network-wide view and can sense network state changes (such as traffic distribution, congestion status and link utilization) in real time. In practice, the routing problem is usually solved with a shortest-path algorithm that takes a few simple network parameters (such as path hop count and delay) as its optimization metrics, so that finding the path with the fewest hops or the smallest delay becomes the final objective of the algorithm. A single metric and optimization objective easily leads to congestion on some key links and to an unbalanced network load. Although a Lagrangian-relaxation-based shortest-path algorithm can find the optimal path under compound multi-constraint conditions when allocating paths for multiple services, this type of heuristic routing algorithm must run many iterations to compute the optimal path, so it converges slowly, has poor timeliness and achieves low throughput.
Summary of the Invention
In view of the above deficiencies in the prior art, the SDN network intelligent routing data transmission method based on distributed deep reinforcement learning provided by the present invention solves the above problems existing in the prior art.
To achieve the above object of the invention, the technical solution adopted by the present invention is an SDN network intelligent routing data transmission method based on distributed deep reinforcement learning, comprising the following steps:
S1. Construct a reward function and a deep reinforcement learning model comprising an actor network and an evaluator network, and deploy the deep reinforcement learning model in the application layer of the SDN network;
S2. Randomly initialize the actor network parameters θ_a and the evaluator network parameters θ_c of the deep reinforcement learning model;
S3. Randomly initialize the local actor parameters θ′_a of the actor network and the local evaluator parameters θ′_c of the evaluator network on the i-th local GPU (GPU_i) in the control layer of the SDN network;
S4. According to the reward function, the actor network parameters θ_a, the evaluator network parameters θ_c, the local actor parameters θ′_a and the local evaluator parameters θ′_c, train the deep reinforcement learning model on the i-th local GPU_i offline using the A3C algorithm, and update the actor network parameters θ_a and the evaluator network parameters θ_c;
S5. Apply the updated actor network parameters θ_a and the updated evaluator network parameters θ_c to the whole SDN network, and use the SDN network with the updated parameters for data transmission;
S6. Periodically detect whether the topology of the SDN network has changed; if so, go to step S7, otherwise repeat step S6;
S7. Train the deep reinforcement learning model online, update the actor network parameters θ_a and the evaluator network parameters θ_c using the adaptive running algorithm, apply the actor network parameters θ_a and the evaluator network parameters θ_c to the whole SDN network, and use the SDN network with the updated parameters for data transmission;
where i = 1, 2, ..., L, and L denotes the total number of local GPUs.
Further, in step S1 the actor network is a fully connected neural network, and the evaluator network is a combination of a fully connected neural network and a CNN convolutional neural network. The inputs of both the actor network and the evaluator network include the network state of the SDN network, where the network state includes current node information, destination node information, bandwidth requirements and delay requirements; the input of the evaluator network further includes network features of the SDN network processed by the CNN convolutional neural network. The CNN convolutional neural network comprises an input layer, a convolutional layer, a pooling layer, a fully connected layer and an output layer connected in sequence.
Further, the reward function in step S1 is:
where the reward term denotes the reward value obtained after the n-th routing node in the SDN network takes action a_n towards the m-th routing node in state s_n; g denotes the action penalty, a_1 denotes the first weight, a_2 denotes the second weight, c(n) denotes the remaining capacity of the n-th routing node, c(m) denotes the remaining capacity of the m-th routing node, c(l) denotes the remaining capacity of the l-th link in the SDN network, d(n) denotes the degree of difference between the traffic load of the n-th routing node and that of its adjacent nodes, and d(m) denotes the degree of difference between the traffic load of the m-th routing node and that of its adjacent nodes. The state s_n includes: the node where the data packet currently resides being the n-th routing node, the final destination node of the data packet, the forwarding bandwidth requirement of the data packet, and the delay requirement of the data packet. The action a_n represents all forwarding operations that can be taken in state s_n.
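The reward formula itself appears as an image in the original publication and is not reproduced above. A minimal LaTeX sketch of a form consistent with the variables defined here (remaining-capacity terms weighted by a_1, load-imbalance terms weighted by a_2, and an action penalty g) is given below; the exact combination of terms is an assumption rather than a copy of the patent formula.

```latex
% Assumed form only: a reward that favours remaining node/link capacity and
% penalises load imbalance and each forwarding action (penalty g).
r^{s_n}_{a_n} = a_1\bigl(c(n) + c(m) + c(l)\bigr) - a_2\bigl(d(n) + d(m)\bigr) - g
```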
Further, step S4 comprises the following sub-steps:
S41. Set a first counter t = 0, a second counter T = 0, a maximum number of iterations T_max and a routing hop limit t_max;
S42. Set dθ_a = 0 and dθ_c = 0, and synchronize the local parameters with the global parameters: synchronize the value of the local actor parameters θ′_a to the value of the actor network parameters θ_a, and synchronize the value of the local evaluator parameters θ′_c to the value of the evaluator network parameters θ_c;
S43. Set a first intermediate count value t_start = t, and read the state s_t at the current moment through the local GPU_i;
S44. Obtain the policy π(a_t|s_t; θ′_a) through the actor network and execute action a_t according to the policy π(a_t|s_t; θ′_a), where π(a_t|s_t; θ′_a) indicates that the action to be executed, given state s_t and the local actor parameters θ′_a on local GPU_i, is a_t;
S45. Obtain the reward value r_t and the new state s_{t+1} after executing action a_t, and increment the first counter t by one;
S46. Determine whether the new state s_t satisfies the condition defined by the final state; if so, set the update reward value R = 0 and go to step S48, otherwise go to step S47;
S47. Determine whether t - t_start is greater than the routing hop limit t_max; if so, set the update reward value R = V(s_t, θ′_c) and go to step S48, otherwise return to step S44, where V(s_t, θ′_c) denotes the evaluator network's evaluation value, under the local evaluator parameters θ′_c, of the routing policy that reaches state s_t;
S48. Set a third counter z = t - 1 and a gradient update reward value R_update = r_z + γR, and initialize the gradient Δθ_a of the actor network parameters and the gradient Δθ_c of the evaluator network parameters to 0;
S49. According to the gradient update reward value R_update, the local actor parameters θ′_a and the local evaluator parameters θ′_c, obtain the updated value of the local actor parameter gradient Δθ_a and the updated value of the local evaluator parameter gradient Δθ_c as:
where Δθ_a_update denotes the updated value of the gradient Δθ_a, ∇_θ′_a denotes the derivative with respect to the local actor parameters θ′_a, log π(a_z|s_z; θ′_a) denotes the logarithm of the probability, under the parameters θ′_a and state s_z, of taking action a_z, r_z denotes the reward value of executing action a_z, γ denotes the reward discount rate, V(s_z; θ′_c) denotes the evaluator network's evaluation value, under the local evaluator parameters θ′_c, of the routing policy that reaches state s_z, Δθ_c_update denotes the updated value of the gradient Δθ_c, and the remaining term denotes the partial derivative of (R_update - V(s_z; θ′_c))^2 with respect to θ′_c;
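The two update formulas themselves appear as images in the original publication and are not reproduced above. The variable definitions match the standard A3C gradient-accumulation rules; a LaTeX sketch of that standard form, offered as an assumed reconstruction rather than a verbatim copy of the patent formulas, is:

```latex
% Assumed reconstruction of the standard A3C accumulation described by the
% surrounding variable definitions.
\Delta\theta_{a\_update} = \Delta\theta_a
  + \nabla_{\theta'_a} \log \pi(a_z \mid s_z;\, \theta'_a)\,\bigl(R_{update} - V(s_z;\, \theta'_c)\bigr)
\Delta\theta_{c\_update} = \Delta\theta_c
  + \frac{\partial \bigl(R_{update} - V(s_z;\, \theta'_c)\bigr)^{2}}{\partial \theta'_c}
```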
S410. Set Δθ_a = Δθ_a_update, Δθ_c = Δθ_c_update and R = R_update, and determine whether the third counter z is equal to the first intermediate count value t_start; if so, go to step S411, otherwise decrement the third counter z by one, update the gradient update reward value R_update to r_z + γR, and return to step S49;
S411. Determine whether the second counter T is greater than or equal to the maximum number of iterations T_max; if so, update the actor network parameters θ_a and the evaluator network parameters θ_c using the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c respectively, and end the update procedure; otherwise increment the second counter T by one and return to step S42.
Further, in step S411 the formulas for updating the actor network parameters θ_a and the evaluator network parameters θ_c using the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c respectively are:
θ_a_update = θ_a + βΔθ_a
θ_c_update = θ_c + βΔθ_c
where θ_a_update denotes the updated actor network parameters θ_a, θ_c_update denotes the updated evaluator network parameters θ_c, and β denotes the weight of the local GPU_i in the SDN network.
Further, step S7 comprises the following sub-steps:
S71. Set a fourth counter j = 1, and collect a routing request task f;
S72. Assign the routing request task f to an idle GPU in the SDN network, the idle GPU being GPU_idle;
S73. Set dθ_a = 0 and dθ_c = 0, synchronize the local actor parameters θ′_a of GPU_idle to the value of the actor network parameters θ_a, and synchronize the local evaluator parameters θ′_c to the value of the evaluator network parameters θ_c;
S74. Set a second intermediate count value j_start = j, and read the initial state s_j at the current moment;
S75. Obtain, through the actor network, the policy π(a_j|s_j; θ′_a) for executing action a_j given state s_j and the local actor parameters θ′_a, and execute the policy π(a_j|s_j; θ′_a);
S76. Obtain the reward value r_j and the new state s_{j+1} after executing action a_j, increment the fourth counter j by one, and add action a_j to the action set A;
S77. Determine whether the new state s_j satisfies the condition defined by the final state of the routing request task f; if so, go to step S78, otherwise return to step S75;
S78. Obtain the routing path p from the action set A, and determine whether the routing request task f matches the routing path p; if so, set the update reward value R = 0 and go to step S79, otherwise set the update reward value R = V(s_j, θ′_c) and go to step S79;
S79. Set a fifth counter k = j - 1 and a gradient update reward value R_update = r_k + γR, and initialize the gradient Δθ_a of the actor network parameters and the gradient Δθ_c of the evaluator network parameters to 0;
S710. According to the gradient update reward value R_update, the local actor parameters θ′_a and the local evaluator parameters θ′_c, obtain the updated value of the local actor parameter gradient Δθ_a and the updated value of the local evaluator parameter gradient Δθ_c as:
where Δθ_a_update denotes the updated value of the gradient Δθ_a, ∇_θ′_a denotes the derivative with respect to the local actor parameters θ′_a, log π(a_k|s_k; θ′_a) denotes the logarithm of the probability, under the parameters θ′_a and state s_k, of taking action a_k, r_k denotes the reward value of executing action a_k, γ denotes the reward discount rate, V(s_k; θ′_c) denotes the evaluator network's evaluation value, under the local evaluator parameters θ′_c, of the routing policy that reaches state s_k, Δθ_c_update denotes the updated value of the gradient Δθ_c, and the remaining term denotes the partial derivative of (R_update - V(s_k; θ′_c))^2 with respect to θ′_c;
S711. Set Δθ_a = Δθ_a_update, Δθ_c = Δθ_c_update and R = R_update, and determine whether the fifth counter k is equal to the second intermediate count value j_start; if so, go to step S712, otherwise decrement the fifth counter k by one, update the gradient update reward value R_update to r_k + γR, and return to step S710;
S712. Update the actor network parameters θ_a and the evaluator network parameters θ_c using the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c respectively, apply the actor network parameters θ_a and the evaluator network parameters θ_c to the whole SDN network, and use the SDN network with the updated parameters for data transmission.
The beneficial effects of the present invention are:
(1) The present invention realizes fast computation of routing paths, maximizes throughput while guaranteeing delay, and solves the problems of slow speed and low throughput of traditional algorithms.
(2) The present invention uses a reinforcement learning algorithm that reduces the routing computation to a simple input-output mapping, avoiding the repeated iterations of conventional computation and thereby achieving fast routing path calculation. The faster routing algorithm reduces forwarding delay, so data packets that would otherwise be discarded when their TTL expires have a greater probability of surviving and being forwarded successfully, which increases network throughput.
(3) The present invention provides two training stages, offline training and online training, and updates the parameters in a dynamic environment to select the optimal path, so it is topology-adaptive.
(4) The present invention provides a reward function so that node or link load, routing requirements and network topology information better constrain the reinforcement learning training process, enabling the trained deep reinforcement learning model to perform routing tasks more accurately.
Brief Description of the Drawings
Figure 1 is a flowchart of the SDN network intelligent routing data transmission method based on distributed deep reinforcement learning proposed by the present invention;
Figure 2 is a schematic diagram of the CNN convolutional neural network in the present invention;
Figure 3 is a schematic diagram of the deep reinforcement learning model in the present invention.
Detailed Description of the Embodiments
Specific embodiments of the present invention are described below to help those skilled in the art understand the present invention, but it should be clear that the present invention is not limited to the scope of the specific embodiments. For those of ordinary skill in the art, various changes are obvious as long as they fall within the spirit and scope of the present invention defined and determined by the appended claims, and all inventions and creations making use of the inventive concept are within the scope of protection.
Embodiments of the present invention are described in detail below with reference to the accompanying drawings.
As shown in Figure 1, an SDN network intelligent routing data transmission method based on distributed deep reinforcement learning comprises the following steps:
S1. Construct a reward function and a deep reinforcement learning model comprising an actor network and an evaluator network, and deploy the deep reinforcement learning model in the application layer of the SDN network;
S2. Randomly initialize the actor network parameters θ_a and the evaluator network parameters θ_c of the deep reinforcement learning model;
S3. Randomly initialize the local actor parameters θ′_a of the actor network and the local evaluator parameters θ′_c of the evaluator network on the i-th local GPU (GPU_i) in the control layer of the SDN network;
S4. According to the reward function, the actor network parameters θ_a, the evaluator network parameters θ_c, the local actor parameters θ′_a and the local evaluator parameters θ′_c, train the deep reinforcement learning model on the i-th local GPU_i offline using the A3C algorithm, and update the actor network parameters θ_a and the evaluator network parameters θ_c;
S5. Apply the updated actor network parameters θ_a and the updated evaluator network parameters θ_c to the whole SDN network, and use the SDN network with the updated parameters for data transmission;
S6. Periodically detect whether the topology of the SDN network has changed; if so, go to step S7, otherwise repeat step S6;
S7. Train the deep reinforcement learning model online, update the actor network parameters θ_a and the evaluator network parameters θ_c using the adaptive running algorithm, apply the actor network parameters θ_a and the evaluator network parameters θ_c to the whole SDN network, and use the SDN network with the updated parameters for data transmission;
where i = 1, 2, ..., L, and L denotes the total number of local GPUs.
In step S1 the actor network is a fully connected neural network, and the evaluator network is a combination of a fully connected neural network and a CNN convolutional neural network. The inputs of both the actor network and the evaluator network include the network state of the SDN network, where the network state includes current node information, destination node information, bandwidth requirements and delay requirements; the input of the evaluator network further includes network features of the SDN network processed by the CNN convolutional neural network.
As shown in Figure 2, the CNN convolutional neural network comprises an input layer, a convolutional layer, a pooling layer, a fully connected layer and an output layer connected in sequence.
The reward function in step S1 is:
where the reward term denotes the reward value obtained after the n-th routing node in the SDN network takes action a_n towards the m-th routing node in state s_n; g denotes the action penalty, a_1 denotes the first weight, a_2 denotes the second weight, c(n) denotes the remaining capacity of the n-th routing node, c(m) denotes the remaining capacity of the m-th routing node, c(l) denotes the remaining capacity of the l-th link in the SDN network, d(n) denotes the degree of difference between the traffic load of the n-th routing node and that of its adjacent nodes, and d(m) denotes the degree of difference between the traffic load of the m-th routing node and that of its adjacent nodes. The state s_n includes: the node where the data packet currently resides being the n-th routing node, the final destination node of the data packet, the forwarding bandwidth requirement of the data packet, and the delay requirement of the data packet. The action a_n represents all forwarding operations that can be taken in state s_n.
Step S4 comprises the following sub-steps:
S41. Set a first counter t = 0, a second counter T = 0, a maximum number of iterations T_max and a routing hop limit t_max;
S42. Set dθ_a = 0 and dθ_c = 0, and synchronize the local parameters with the global parameters: synchronize the value of the local actor parameters θ′_a to the value of the actor network parameters θ_a, and synchronize the value of the local evaluator parameters θ′_c to the value of the evaluator network parameters θ_c;
S43. Set a first intermediate count value t_start = t, and read the state s_t at the current moment through the local GPU_i;
S44. Obtain the policy π(a_t|s_t; θ′_a) through the actor network and execute action a_t according to the policy π(a_t|s_t; θ′_a), where π(a_t|s_t; θ′_a) indicates that the action to be executed, given state s_t and the local actor parameters θ′_a on local GPU_i, is a_t;
S45. Obtain the reward value r_t and the new state s_{t+1} after executing action a_t, and increment the first counter t by one;
S46. Determine whether the new state s_t satisfies the condition defined by the final state; if so, set the update reward value R = 0 and go to step S48, otherwise go to step S47;
S47. Determine whether t - t_start is greater than the routing hop limit t_max; if so, set the update reward value R = V(s_t, θ′_c) and go to step S48, otherwise return to step S44, where V(s_t, θ′_c) denotes the evaluator network's evaluation value, under the local evaluator parameters θ′_c, of the routing policy that reaches state s_t;
S48. Set a third counter z = t - 1 and a gradient update reward value R_update = r_z + γR, and initialize the gradient Δθ_a of the actor network parameters and the gradient Δθ_c of the evaluator network parameters to 0;
S49. According to the gradient update reward value R_update, the local actor parameters θ′_a and the local evaluator parameters θ′_c, obtain the updated value of the local actor parameter gradient Δθ_a and the updated value of the local evaluator parameter gradient Δθ_c as:
where Δθ_a_update denotes the updated value of the gradient Δθ_a, ∇_θ′_a denotes the derivative with respect to the local actor parameters θ′_a, log π(a_z|s_z; θ′_a) denotes the logarithm of the probability, under the parameters θ′_a and state s_z, of taking action a_z, r_z denotes the reward value of executing action a_z, γ denotes the reward discount rate, V(s_z; θ′_c) denotes the evaluator network's evaluation value, under the local evaluator parameters θ′_c, of the routing policy that reaches state s_z, Δθ_c_update denotes the updated value of the gradient Δθ_c, and the remaining term denotes the partial derivative of (R_update - V(s_z; θ′_c))^2 with respect to θ′_c;
S410. Set Δθ_a = Δθ_a_update, Δθ_c = Δθ_c_update and R = R_update, and determine whether the third counter z is equal to the first intermediate count value t_start; if so, go to step S411, otherwise decrement the third counter z by one, update the gradient update reward value R_update to r_z + γR, and return to step S49;
S411. Determine whether the second counter T is greater than or equal to the maximum number of iterations T_max; if so, update the actor network parameters θ_a and the evaluator network parameters θ_c using the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c respectively, and end the update procedure; otherwise increment the second counter T by one and return to step S42.
In step S411 the formulas for updating the actor network parameters θ_a and the evaluator network parameters θ_c using the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c respectively are:
θ_a_update = θ_a + βΔθ_a
θ_c_update = θ_c + βΔθ_c
where θ_a_update denotes the updated actor network parameters θ_a, θ_c_update denotes the updated evaluator network parameters θ_c, and β denotes the weight of the local GPU_i in the SDN network.
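For illustration, the following is a minimal Python sketch of one offline A3C worker implementing steps S41 through S411, assuming PyTorch-style actor and evaluator modules and an environment object exposing read_state and step; the module interfaces, the discount value GAMMA and the weight BETA are illustrative assumptions and are not taken from the patent, and the evaluator's additional CNN feature input is omitted here for brevity.

```python
import torch

GAMMA = 0.9   # reward discount rate (value assumed for illustration)
BETA = 0.25   # weight of this local GPU in the global update (assumed)

def offline_worker(env, global_actor, global_evaluator, local_actor, local_evaluator,
                   t_max_hops, T_max):
    """One A3C worker (steps S41-S411): roll out up to t_max_hops routing hops,
    accumulate local gradients, then fold them into the global parameters."""
    t, T = 0, 0                                                    # S41
    while T < T_max:                                               # S411 loop condition
        local_actor.load_state_dict(global_actor.state_dict())     # S42: sync with global
        local_evaluator.load_state_dict(global_evaluator.state_dict())
        local_actor.zero_grad(); local_evaluator.zero_grad()

        t_start = t
        state = env.read_state()                                   # S43: current state s_t
        trajectory = []                                            # (state, action, reward)
        while True:
            probs = local_actor(state)                             # S44: pi(a|s; theta'_a)
            action = torch.multinomial(probs, 1).item()
            reward, next_state, done = env.step(action)            # S45
            trajectory.append((state, action, reward))
            t += 1
            state = next_state
            if done:                                               # S46: final state reached
                R = 0.0
                break
            if t - t_start > t_max_hops:                           # S47: hop limit exceeded
                R = local_evaluator(state).item()
                break

        actor_loss = torch.zeros(())                               # S48
        evaluator_loss = torch.zeros(())
        for s, a, r in reversed(trajectory):                       # S49-S410, backwards
            R = r + GAMMA * R                                      # R_update
            advantage = R - local_evaluator(s)
            actor_loss = actor_loss - torch.log(local_actor(s)[a]) * advantage.detach()
            evaluator_loss = evaluator_loss + advantage.pow(2)
        (actor_loss + evaluator_loss).backward()

        # S411: fold the accumulated local gradients into the global parameters,
        # scaled by the local GPU's weight beta.
        with torch.no_grad():
            for gp, lp in zip(global_actor.parameters(), local_actor.parameters()):
                gp -= BETA * lp.grad
            for gp, lp in zip(global_evaluator.parameters(), local_evaluator.parameters()):
                gp -= BETA * lp.grad
        T += 1
```

In an actual deployment, L such workers, one per local GPU, would run concurrently and push their updates to the shared global parameters asynchronously, which is the distributed aspect of the A3C training.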
Step S7 comprises the following sub-steps:
S71. Set a fourth counter j = 1, and collect a routing request task f;
S72. Assign the routing request task f to an idle GPU in the SDN network, the idle GPU being GPU_idle;
S73. Set dθ_a = 0 and dθ_c = 0, synchronize the local actor parameters θ′_a of GPU_idle to the value of the actor network parameters θ_a, and synchronize the local evaluator parameters θ′_c to the value of the evaluator network parameters θ_c;
S74. Set a second intermediate count value j_start = j, and read the initial state s_j at the current moment;
S75. Obtain, through the actor network, the policy π(a_j|s_j; θ′_a) for executing action a_j given state s_j and the local actor parameters θ′_a, and execute the policy π(a_j|s_j; θ′_a);
S76. Obtain the reward value r_j and the new state s_{j+1} after executing action a_j, increment the fourth counter j by one, and add action a_j to the action set A;
S77. Determine whether the new state s_j satisfies the condition defined by the final state of the routing request task f; if so, go to step S78, otherwise return to step S75;
S78. Obtain the routing path p from the action set A, and determine whether the routing request task f matches the routing path p; if so, set the update reward value R = 0 and go to step S79, otherwise set the update reward value R = V(s_j, θ′_c) and go to step S79;
S79. Set a fifth counter k = j - 1 and a gradient update reward value R_update = r_k + γR, and initialize the gradient Δθ_a of the actor network parameters and the gradient Δθ_c of the evaluator network parameters to 0;
S710. According to the gradient update reward value R_update, the local actor parameters θ′_a and the local evaluator parameters θ′_c, obtain the updated value of the local actor parameter gradient Δθ_a and the updated value of the local evaluator parameter gradient Δθ_c as:
where Δθ_a_update denotes the updated value of the gradient Δθ_a, ∇_θ′_a denotes the derivative with respect to the local actor parameters θ′_a, log π(a_k|s_k; θ′_a) denotes the logarithm of the probability, under the parameters θ′_a and state s_k, of taking action a_k, r_k denotes the reward value of executing action a_k, γ denotes the reward discount rate, V(s_k; θ′_c) denotes the evaluator network's evaluation value, under the local evaluator parameters θ′_c, of the routing policy that reaches state s_k, Δθ_c_update denotes the updated value of the gradient Δθ_c, and the remaining term denotes the partial derivative of (R_update - V(s_k; θ′_c))^2 with respect to θ′_c;
S711. Set Δθ_a = Δθ_a_update, Δθ_c = Δθ_c_update and R = R_update, and determine whether the fifth counter k is equal to the second intermediate count value j_start; if so, go to step S712, otherwise decrement the fifth counter k by one, update the gradient update reward value R_update to r_k + γR, and return to step S710;
S712. Update the actor network parameters θ_a and the evaluator network parameters θ_c using the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c respectively, apply the actor network parameters θ_a and the evaluator network parameters θ_c to the whole SDN network, and use the SDN network with the updated parameters for data transmission.
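A corresponding Python sketch of the online adaptation of steps S71 through S712 (a single routing request is rolled out on an idle GPU, the resulting path is checked against the request, and one update is pushed to the global parameters) follows the same conventions as the offline worker above; the match_path helper and all other names are illustrative assumptions.

```python
import torch

GAMMA = 0.9   # reward discount rate (assumed, as in the offline sketch)
BETA = 0.25   # weight of the idle GPU in the global update (assumed)

def online_adaptation(env, request_f, global_actor, global_evaluator,
                      local_actor, local_evaluator, match_path):
    """Steps S71-S712: serve one routing request on an idle GPU and adapt the
    global actor/evaluator parameters to the changed topology."""
    local_actor.load_state_dict(global_actor.state_dict())            # S73: sync with global
    local_evaluator.load_state_dict(global_evaluator.state_dict())
    local_actor.zero_grad(); local_evaluator.zero_grad()

    state = env.read_state()                                          # S74: initial state s_j
    trajectory, actions = [], []
    done = False
    while not done:                                                    # S75-S77
        probs = local_actor(state)
        action = torch.multinomial(probs, 1).item()
        reward, next_state, done = env.step(action)
        trajectory.append((state, action, reward))
        actions.append(action)
        state = next_state

    path = actions                                                     # S78: routing path p
    R = 0.0 if match_path(request_f, path) else local_evaluator(state).item()

    actor_loss = torch.zeros(())                                       # S79-S711
    evaluator_loss = torch.zeros(())
    for s, a, r in reversed(trajectory):
        R = r + GAMMA * R                                              # R_update
        advantage = R - local_evaluator(s)
        actor_loss = actor_loss - torch.log(local_actor(s)[a]) * advantage.detach()
        evaluator_loss = evaluator_loss + advantage.pow(2)
    (actor_loss + evaluator_loss).backward()

    with torch.no_grad():                                              # S712: global update
        for gp, lp in zip(global_actor.parameters(), local_actor.parameters()):
            gp -= BETA * lp.grad
        for gp, lp in zip(global_evaluator.parameters(), local_evaluator.parameters()):
            gp -= BETA * lp.grad
```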
As shown in Figure 3, in this embodiment the deep reinforcement learning model comprises an actor-evaluator pair, both constructed with neural networks (NN). The actor network outputs, for a given state, a probability distribution over all actions, i.e. the routing policy, and is therefore a multi-output neural network. The evaluator network uses the temporal-difference error to evaluate the actor's policy and is a single-output neural network. The actor network is a fully connected neural network: after data such as the current node, destination node information, bandwidth requirement and delay requirement are input, each network node computes a weighted sum and applies an activation function, producing multiple outputs. The actor network gives the next action according to the current state; since there are several candidate actions, it is a multi-output neural network whose outputs are the probabilities of the candidate routing choices. The evaluator network takes, in addition to the four items of network state information, an input of network features, and its output is an evaluation of the actor network's policy, so it has a single output. This additional network-feature input carries the change information of the network, so that real-time network state changes are taken into account when evaluating the actor network's policy, which makes the intelligent routing adaptive.
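A minimal PyTorch sketch of the two networks described above, assuming a four-element state vector (current node, destination node, bandwidth requirement, delay requirement) and a traffic-matrix-style image as the network-feature input of the evaluator's CNN branch; all layer sizes, channel counts and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ActorNet(nn.Module):
    """Fully connected actor: network state in, probabilities over candidate next hops out."""
    def __init__(self, state_dim=4, num_actions=8, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions), nn.Softmax(dim=-1),
        )

    def forward(self, state):
        return self.mlp(state)              # routing policy pi(a | s)

class EvaluatorNet(nn.Module):
    """Evaluator: fully connected branch for the state plus a CNN branch
    (input, convolution, pooling, fully connected) for the network features;
    a single scalar output evaluates the actor's policy."""
    def __init__(self, state_dim=4, feature_channels=1, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(feature_channels, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten(),
            nn.LazyLinear(hidden), nn.ReLU(),
        )
        self.state_fc = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, state, features):
        h = torch.cat([self.state_fc(state), self.cnn(features)], dim=-1)
        return self.head(h)

# Example shapes (batch of 1): a 4-dim state vector and an 8x8 network-feature map.
actor, evaluator = ActorNet(), EvaluatorNet()
next_hop_probs = actor(torch.rand(1, 4))
policy_value = evaluator(torch.rand(1, 4), torch.rand(1, 1, 8, 8))
```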
Claims (6)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010673851.8A CN111917642B (en) | 2020-07-14 | 2020-07-14 | SDN network intelligent routing data transmission method based on distributed deep reinforcement learning |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010673851.8A CN111917642B (en) | 2020-07-14 | 2020-07-14 | SDN network intelligent routing data transmission method based on distributed deep reinforcement learning |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN111917642A CN111917642A (en) | 2020-11-10 |
| CN111917642B true CN111917642B (en) | 2021-04-27 |
Family
ID=73280083
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010673851.8A Expired - Fee Related CN111917642B (en) | 2020-07-14 | 2020-07-14 | SDN network intelligent routing data transmission method based on distributed deep reinforcement learning |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN111917642B (en) |
Families Citing this family (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112818788B (en) * | 2021-01-25 | 2022-05-03 | 电子科技大学 | A Distributed Convolutional Neural Network Hierarchical Matching Method Based on UAV Swarm |
| CN113316216B (en) * | 2021-05-26 | 2022-04-08 | 电子科技大学 | A routing method for micro-nano satellite network |
| CN113537628B (en) * | 2021-08-04 | 2023-08-22 | 郭宏亮 | Universal reliable shortest path method based on distributed reinforcement learning |
| CN114051272A (en) * | 2021-10-30 | 2022-02-15 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Intelligent routing method for dynamic topological network |
| CN115660030B (en) * | 2022-11-04 | 2025-12-09 | 中国科学技术大学 | Robot cloud and end cooperative computing processing method, equipment and storage medium |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108803615A (en) * | 2018-07-03 | 2018-11-13 | 东南大学 | A kind of visual human's circumstances not known navigation algorithm based on deeply study |
| CN109343341A (en) * | 2018-11-21 | 2019-02-15 | 北京航天自动控制研究所 | An intelligent control method for vertical recovery of launch vehicle based on deep reinforcement learning |
| CN110472880A (en) * | 2019-08-20 | 2019-11-19 | 李峰 | Evaluate the method, apparatus and storage medium of collaborative problem resolution ability |
| CN110472738A (en) * | 2019-08-16 | 2019-11-19 | 北京理工大学 | A kind of unmanned boat Real Time Obstacle Avoiding algorithm based on deeply study |
| CN110770761A (en) * | 2017-07-06 | 2020-02-07 | 华为技术有限公司 | Deep learning systems and methods and wireless network optimization using deep learning |
| CN111316295A (en) * | 2017-10-27 | 2020-06-19 | 渊慧科技有限公司 | Reinforcement learning using distributed prioritized playback |
Family Cites Families (11)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20150269479A1 (en) * | 2014-03-24 | 2015-09-24 | Qualcomm Incorporated | Conversion of neuron types to hardware |
| CN106873585B (en) * | 2017-01-18 | 2019-12-03 | 上海器魂智能科技有限公司 | A kind of navigation method for searching, robot and system |
| US10396919B1 (en) * | 2017-05-12 | 2019-08-27 | Virginia Tech Intellectual Properties, Inc. | Processing of communications signals using machine learning |
| US10375585B2 (en) * | 2017-07-06 | 2019-08-06 | Futurwei Technologies, Inc. | System and method for deep learning and wireless network optimization using deep learning |
| CN108600104B (en) * | 2018-04-28 | 2019-10-01 | 电子科技大学 | A kind of SDN Internet of Things flow polymerization based on tree-shaped routing |
| EP3769264B1 (en) * | 2018-05-18 | 2025-08-27 | DeepMind Technologies Limited | Meta-gradient updates for training return functions for reinforcement learning systems |
| US10940863B2 (en) * | 2018-11-01 | 2021-03-09 | GM Global Technology Operations LLC | Spatial and temporal attention-based deep reinforcement learning of hierarchical lane-change policies for controlling an autonomous vehicle |
| CN109803344B (en) * | 2018-12-28 | 2019-10-11 | 北京邮电大学 | A joint construction method of UAV network topology and routing |
| CN110611619B (en) * | 2019-09-12 | 2020-10-09 | 西安电子科技大学 | An Intelligent Routing Decision Method Based on DDPG Reinforcement Learning Algorithm |
| CN110515303B (en) * | 2019-09-17 | 2022-09-09 | 余姚市浙江大学机器人研究中心 | DDQN-based self-adaptive dynamic path planning method |
| CN111010294B (en) * | 2019-11-28 | 2022-07-12 | 国网甘肃省电力公司电力科学研究院 | Electric power communication network routing method based on deep reinforcement learning |
- 2020-07-14: application CN202010673851.8A filed in CN; granted as patent CN111917642B (status: not active, Expired - Fee Related)
Patent Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110770761A (en) * | 2017-07-06 | 2020-02-07 | 华为技术有限公司 | Deep learning systems and methods and wireless network optimization using deep learning |
| CN111316295A (en) * | 2017-10-27 | 2020-06-19 | 渊慧科技有限公司 | Reinforcement learning using distributed prioritized playback |
| CN108803615A (en) * | 2018-07-03 | 2018-11-13 | 东南大学 | A kind of visual human's circumstances not known navigation algorithm based on deeply study |
| CN109343341A (en) * | 2018-11-21 | 2019-02-15 | 北京航天自动控制研究所 | An intelligent control method for vertical recovery of launch vehicle based on deep reinforcement learning |
| CN110472738A (en) * | 2019-08-16 | 2019-11-19 | 北京理工大学 | A kind of unmanned boat Real Time Obstacle Avoiding algorithm based on deeply study |
| CN110472880A (en) * | 2019-08-20 | 2019-11-19 | 李峰 | Evaluate the method, apparatus and storage medium of collaborative problem resolution ability |
Non-Patent Citations (2)
| Title |
|---|
| Multi-task Deep Reinforcement Learning for Scalable Parallel Task Scheduling; Lingxin Zhang; 2019 IEEE International Conference on Big Data (Big Data); 2020-02-24; pp. 1-10 * |
| Research on a New Two-Layer Mapping System in Locator/Identifier Separation Networks; Zhang Xiaoning; Journal of Electronics & Information Technology; 2014-10-30 (Vol. 36, No. 10); pp. 1-7 * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN111917642A (en) | 2020-11-10 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN111917642B (en) | SDN network intelligent routing data transmission method based on distributed deep reinforcement learning | |
| CN111010294B (en) | Electric power communication network routing method based on deep reinforcement learning | |
| CN112437020B (en) | Data center network load balancing method based on deep reinforcement learning | |
| CN111988225B (en) | Multi-path routing method based on reinforcement learning and transfer learning | |
| CN113467952B (en) | Distributed federal learning collaborative computing method and system | |
| CN108111335B (en) | A method and system for scheduling and linking virtual network functions | |
| CN110611619A (en) | An Intelligent Routing Decision-Making Method Based on DDPG Reinforcement Learning Algorithm | |
| CN113784410B (en) | Vertical Handoff Method for Heterogeneous Wireless Networks Based on Reinforcement Learning TD3 Algorithm | |
| CN113194034A (en) | Route optimization method and system based on graph neural network and deep reinforcement learning | |
| CN116527567A (en) | Intelligent network path optimization method and system based on deep reinforcement learning | |
| CN113098714A (en) | Low-delay network slicing method based on deep reinforcement learning | |
| CN113570039B (en) | A blockchain system with optimized consensus based on reinforcement learning | |
| CN113395207B (en) | A routing optimization architecture and method based on deep reinforcement learning under SDN architecture | |
| WO2020172825A1 (en) | Method and apparatus for determining transmission policy | |
| CN113887748B (en) | Online federated learning task assignment method, device, federated learning method and system | |
| CN116669068A (en) | GCN-based delay service end-to-end slice deployment method and system | |
| CN114629543A (en) | Satellite network adaptive traffic scheduling method based on deep supervised learning | |
| CN116234073A (en) | Routing method of distributed unmanned aerial vehicle ad hoc network based on deep reinforcement learning | |
| CN111340192B (en) | Network path allocation model training method, path allocation method and device | |
| CN119201470A (en) | A computing network resource scheduling optimization method based on multi-agent deep reinforcement learning | |
| CN118474013A (en) | Intelligent routing method for intention network based on DRL-GNN | |
| CN113177636A (en) | Network dynamic routing method and system based on multiple constraint conditions | |
| CN114726770B (en) | Traffic engineering method applied to segmented routing network environment | |
| CN117938959A (en) | Multi-target SFC deployment method based on deep reinforcement learning and genetic algorithm | |
| Alliche et al. | Prisma: a packet routing simulator for multi-agent reinforcement learning |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant | ||
| CF01 | Termination of patent right due to non-payment of annual fee | ||
| CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20210427 |