CN111507506A

CN111507506A - Consensus embedding-based complex network community discovery method

Info

Publication number: CN111507506A
Application number: CN202010202056.0A
Authority: CN
Inventors: 曾湘祥; 杜妍孜; 刘向荣
Original assignee: Xiamen University
Current assignee: Xiamen University
Priority date: 2020-03-20
Filing date: 2020-03-20
Publication date: 2020-08-07

Abstract

A consensus embedding-based complex network community discovery method relates to a multi-objective optimization technology. The method comprises the following steps: 1) giving a maximum algebra maxgen and a particle swarm size pop; 2) given network G ═ V, E, the network size is n, carry on the network and represent and study; 3) representing a learning result by using a network, initializing a particle swarm to obtain 100 particles POP, wherein the iteration number t is 1; 4) carrying out updating and mutation based on consensus embedding on the POP; 5) stopping conditions are as follows: and if t is not more than maxgen, t ← t +1 and go to step 3), otherwise, stopping and returning pareto front edge solution, namely a plurality of community division results. The updating process is more efficient and accurate, and the obtained pareto frontier effect is more competitive; the accuracy rate of community discovery is improved, the convergence time of the method is effectively reduced, and the method has good practicability in practical application of function prediction, recommendation systems and the like.

Description

A Community Discovery Method for Complex Networks Based on Consensus Embedding

技术领域technical field

本发明涉及多目标优化技术，具体是涉及可应用于功能预测、推荐系统等领域的一种基于共识嵌入的复杂网络社区发现方法。The invention relates to a multi-objective optimization technology, in particular to a complex network community discovery method based on consensus embedding, which can be applied to the fields of function prediction, recommendation system and the like.

背景技术Background technique

具有自组织、自相似、吸引子、小世界、无标度中部分或全部性质的网络称为复杂网络(Complex Network)。复杂网络中相同类型节点之间连接较多，构成一个一个的小社区，不同类型节点之间连接较少，但这些连接将成为沟通不同社区的重要通道。A network with some or all of the properties of self-organization, self-similarity, attractor, small world, and scale-free is called a complex network. In a complex network, there are many connections between the same type of nodes, forming a small community, and there are fewer connections between different types of nodes, but these connections will become an important channel for communication between different communities.

为了探索复杂网络的结构特点，进而了解复杂网络的功能，人们对复杂网络的社区结构进行了广泛的研究，提出了众多社区发现方法，主要分为四种方法：凝聚方法、分裂方法、优化方法和模拟方法。在众多方法中这四种方法并不是独立的，一种方法可能会同时体现多种思想。In order to explore the structural characteristics of complex networks and then understand the functions of complex networks, people have carried out extensive research on the community structure of complex networks, and proposed many community discovery methods, which are mainly divided into four methods: agglomeration method, splitting method, and optimization method. and simulation methods. Among the many methods, these four methods are not independent, and one method may reflect multiple ideas at the same time.

复杂网络社区发现方法被广泛应用到许多基础应用当中，比如推荐系统、疾病预测与预防等。粒子可以定义为多目标粒子群中的一个解决方案solution，每个解决方案可以表示复杂网络中的一种社区划分。最近几年，多目标优化方法取得了很好的成功，因此，在基于多目标粒子群算法上做改进有着重要的意义。其中代表的方法有Maoguo Gong等人(M.Gong,Q.Cai,X.Chen,and L.Ma,“Complex network clustering by multiobjectivediscrete particle swarm optimization based on decomposition,”IEEETransactions on Evolutionary Computation,vol.18,no.1,pp.82–97,2014)的基于分解的多目标离散粒子群算法在复杂网络聚类中的应用(MODPSO)。但算法初始化时产生大量无用解，使得搜索空间过大，冗余度很高，算法只能在相对较小的网络中表现良好，在较大规模的网络上表现不佳。Complex network community discovery methods are widely used in many basic applications, such as recommender systems, disease prediction and prevention, etc. A particle can be defined as a solution in a multi-objective particle swarm, and each solution can represent a kind of community partition in a complex network. In recent years, the multi-objective optimization method has achieved good success, so it is of great significance to improve the multi-objective particle swarm optimization. Among the representative methods are Maoguo Gong et al. (M.Gong, Q.Cai, X.Chen, and L.Ma, "Complex network clustering by multiobjectivediscrete particle swarm optimization based on decomposition," IEEE Transactions on Evolutionary Computation, vol.18, no.1, pp.82–97, 2014) Application of Decomposition-Based Multi-objective Discrete Particle Swarm Optimization to Complex Network Clustering (MODPSO). However, when the algorithm is initialized, a large number of useless solutions are generated, which makes the search space too large and the redundancy is very high. The algorithm can only perform well in relatively small networks, and does not perform well in large-scale networks.

发明内容SUMMARY OF THE INVENTION

本发明的目的在于针对现有技术存在的收敛速度过慢，存在大量冗余解，没有深层次挖掘网络中潜藏的底层结构信息等问题，提供一种基于共识嵌入的复杂网络社区发现方法。The purpose of the present invention is to provide a complex network community discovery method based on consensus embedding, aiming at the problems of the prior art such as too slow convergence speed, a large number of redundant solutions, and no deep mining of the underlying structure information hidden in the network.

本发明包括以下步骤：The present invention includes the following steps:

1)给定最大代数maxgen，粒子群规模pop；1) Given the maximum algebra maxgen, the particle swarm size pop;

2)给定的网络G＝(V,E)，网络规模为n，进行网络表示学习；2) Given a network G=(V, E), the network scale is n, and network representation learning is performed;

3)利用网络表示学习结果，进行粒子群的初始化得到100个粒子POP，迭代次数t＝1；3) Using the network to represent the learning results, initialize the particle swarm to obtain 100 particle POPs, and the number of iterations is t=1;

4)将POP进行基于共识嵌入的更新、变异；4) Update and mutate POP based on consensus embedding;

5)停止条件：若满足t≤maxgen，则t←t+1并转向步骤3)，否则停止并返回帕累托前沿解，即多个社区划分结果。5) Stop condition: if t≤maxgen is satisfied, then t←t+1 and turn to step 3), otherwise stop and return to the Pareto frontier solution, that is, the result of dividing multiple communities.

在步骤2)中，所述进行网络表示学习的具体步骤可为：In step 2), the specific steps of performing network representation learning may be:

采用基于奇异值分解和特征值分解框架的保留任意阶相似度的网络嵌入方法AROPE，将高维的邻接矩阵映射到低维的连续的特征空间中，挖掘网络中潜藏的底层结构信息，得到节点的特征向量E＝{e₁,e₂,e₃,...,e_n}，其中e_i＝{e_i1,e_i2,e_i3,...,e_id}，d为降维后的维度。Using the network embedding method AROPE which preserves the similarity of any order based on singular value decomposition and eigenvalue decomposition framework, the high-dimensional adjacency matrix is mapped to the low-dimensional continuous feature space, and the underlying structural information hidden in the network is mined to obtain nodes. The feature vector E={e ₁ ,e ₂ ,e ₃ ,...,e _n }, where e _i ={e _i1 ,e _i2 ,e _i3 ,...,e _id }, d is the dimension reduction dimension.

在步骤3)中，所述利用网络表示学习结果，进行粒子群的初始化得到100个粒子POP的具体步骤为：In step 3), the specific steps of using the network to represent the learning result and initializing the particle swarm to obtain 100 particle POPs are:

计算相似度得到

利用k-means得到基于相似度的初始化的粒子群POP^t＝{x₁,x₂,x₃,...,x_pop}^t；粒子的局部最优解也初始化为Pbest＝POP^t。Calculate the similarity to get

The initialized particle swarm POP ^t ={x ₁ ,x ₂ ,x ₃ ,...,x _pop } ^t is obtained by using k-means; the local optimal solution of the particle is also initialized as Pbest=POP ^t .

在步骤4)中，所述将POP进行基于共识嵌入的更新、变异的具体步骤为：In step 4), the specific steps for updating and mutating the POP based on consensus embedding are:

在每代中进行基于共识的更新；Consensus-based updates in each generation;

每隔10代进行共识嵌入；Consensus embedding every 10 generations;

每代中进行以变异概率为pm的对粒子进行基于邻域的单点变异。Neighborhood-based single-point mutation of particles with mutation probability pm is performed in each generation.

将粒子的更新变异过程迭代进行maxgen次。Iterates the particle update mutation process for maxgen times.

在每代中进行基于共识的更新时，将gbest中的社区作为共识，随机选择其中一个嵌入到当前的粒子中；每隔10代进行共识嵌入时，每隔10代提取当前粒子群中的特征解，从中提取所有二节点社区，在当前粒子群中进行投票，若支持度≥70％，则选定为共识社区，全部嵌入到当前的粒子中；When consensus-based updates are performed in each generation, the community in gbest is used as consensus, and one of them is randomly selected to be embedded in the current particle; when consensus embedding is performed every 10 generations, the features in the current particle swarm are extracted every 10 generations. solution, extract all two-node communities from it, and vote in the current particle swarm. If the support is ≥ 70%, it is selected as a consensus community and all embedded in the current particle;

其中，从解集中提取的特征解，是三种代表性的解，分别是：Among them, the characteristic solutions extracted from the solution set are three representative solutions, namely:

a.具有最小KKM的解a. The solution with minimal KKM

b.具有最小RC的解b. The solution with minimal RC

c.具有最小曼哈顿距离的局部knee解c. Local knee solution with minimum Manhattan distance

每代中进行以变异概率为pm的对粒子进行基于邻域的单点变异时，对每个需要变异的粒子，随机选择一个位置i，其标签用其邻域中的另一个标签替换，这保证了新的可能解的产生。When performing neighborhood-based single-point mutation of particles with mutation probability pm in each generation, for each particle that needs to be mutated, a position i is randomly selected, and its label is replaced with another label in its neighborhood. The generation of new possible solutions is guaranteed.

本发明采用的目标函数是KKM和RC的变体：The objective function adopted in the present invention is a variant of KKM and RC:

其中V_h∈V，

所以

和

是集合V_h中节点的内部和外部相似度之和，“|·|”表示集合的大小。where V _h ∈ V,

so

and

is the sum of the internal and external similarity of nodes in the set V _h , and “|·|” represents the size of the set.

在粒子的选择过程中采用的是非支配排序选择策略。The non-dominated sorting selection strategy is adopted in the selection process of particles.

与现有技术相比，本发明具有以下突出的技术效果：Compared with the prior art, the present invention has the following outstanding technical effects:

相比MODPSO的进化过程，本发明提出的基于共识嵌入的复杂网络社区发现算法粒子的更新方式不一样，在更新过程中更加高效、准确。实验数据也表明，本发明方法得到的帕累托前沿效果更有竞争力；基于本发明需要的收敛时间也更少。本发明的方法提高了社区发现的准确率，同时又能有效减少方法的收敛时间，在功能预测、推荐系统等实际应用当中有很好的实用性。比如在生物领域的新陈代谢网络的分析、基因调控网络的分析、主控基因的识别；在疾病的传播演化和预测防控领域找出传染病的关键社区、关键节点以预测传播途径及时切断传播途径；在互联网中实现广告的精准投放，在电子商务系统上建立更可靠的推荐系统，在搜索引擎中提供更加个性化的搜索结果。Compared with the evolution process of MODPSO, the update method of the algorithm particles of the complex network community discovery based on consensus embedding proposed by the present invention is different, and the update process is more efficient and accurate. The experimental data also show that the Pareto front effect obtained by the method of the present invention is more competitive; the convergence time required by the present invention is also less. The method of the invention improves the accuracy rate of community discovery, and at the same time can effectively reduce the convergence time of the method, and has good practicability in practical applications such as function prediction and recommendation system. For example, in the biological field, the analysis of the metabolic network, the analysis of the gene regulation network, and the identification of the master gene; in the field of disease transmission and evolution, prediction and prevention, find out the key communities and key nodes of infectious diseases to predict the transmission route and cut off the transmission route in time. ; Realize the precise placement of advertisements in the Internet, establish a more reliable recommendation system on the e-commerce system, and provide more personalized search results in the search engine.

附图说明Description of drawings

图1为在每代中粒子的更新方式示意图。Figure 1 is a schematic diagram of how particles are updated in each generation.

图2为每隔10代粒子的基于共识嵌入的更新方式示意图。Figure 2 is a schematic diagram of the update method based on consensus embedding of particles every 10 generations.

图3本发明的方法(NE-PSO)与MODPSO算法在小规模网络数据集上的运行时间的对比图。FIG. 3 is a comparison diagram of the running time of the method of the present invention (NE-PSO) and the MODPSO algorithm on a small-scale network dataset.

图4本发明的方法(NE-PSO)在大规模网络数据集上的运行时间的折线图。Figure 4. Line graph of the running time of the method of the present invention (NE-PSO) on a large-scale network dataset.

具体实施方式Detailed ways

以下实施例将结合附图对本发明作进一步的说明。The following embodiments will further illustrate the present invention in conjunction with the accompanying drawings.

本发明所述基于共识嵌入的复杂网络社区发现方法实施例，具体包括以下步骤：The embodiment of the method for discovering complex network communities based on consensus embedding according to the present invention specifically includes the following steps:

1)给定网络G＝(V,E)，最大代数maxgen，粒子群规模pop，网络规模n，变异概率pm；1) Given network G=(V, E), maximum algebra maxgen, particle swarm size pop, network size n, mutation probability pm;

2)利用基于奇异值分解和特征值分解框架的保留任意阶相似度的网络嵌入方法AROPE对网络G进行网络表示学习，将高维的邻接矩阵映射到低维的连续的特征空间中，挖掘网络中潜藏的底层结构信息，得到节点的特征向量E＝{e₁,e₂,e₃,...,e_n}，其中e_i＝{e_i1,e_i2,e_i3,...,e_id}，d为降维后的维度；2) Using the network embedding method AROPE that retains any order similarity based on singular value decomposition and eigenvalue decomposition framework, network representation learning is performed on the network G, and the high-dimensional adjacency matrix is mapped into the low-dimensional continuous feature space, and the network is mined. The underlying structure information hidden in the node is obtained, and the feature vector E={e ₁ ,e ₂ ,e ₃ ,..., _{en } of the node is obtained, where e i} ₌ {e _i1 ,e _i2 ,e _i3 ,..., e _id }, d is the dimension after dimension reduction;

3)利用网络表示学习结果和节点的特征向量，计算节点之间的cos相似度得到

利用k-means得到基于相似度的初始化的粒子群POP^t＝{x₁,x₂,x₃,...,x_pop}^t；粒子的局部最优解也初始化为Pbest＝POP^t；迭代次数t＝1；3) Using the network to represent the learning results and the feature vectors of nodes, calculate the cos similarity between nodes to get

Using k-means to obtain the initialized particle swarm POP ^t ={x ₁ ,x ₂ ,x ₃ ,...,x _pop } ^t based on similarity; the local optimal solution of the particle is also initialized as Pbest=POP ^t ; iterative times t=1;

4)将POP进行基于共识嵌入的更新、变异，其中：4) Update and mutate POP based on consensus embedding, where:

在每代中进行基于共识的更新(如图1)，将gbest中的社区作为共识，随机选择其中一个嵌入到当前的粒子中；Perform consensus-based updates in each generation (as shown in Figure 1), take the community in gbest as consensus, and randomly select one of them to embed into the current particle;

每隔10代进行共识嵌入(如图2)，每隔10代提取当前粒子群中的特征解，从中提取所有二节点社区，在当前粒子群中进行投票，若支持度≥70％，则选定为共识社区，全部嵌入到当前的粒子中；Consensus embedding is performed every 10 generations (as shown in Figure 2), the feature solution in the current particle swarm is extracted every 10 generations, all two-node communities are extracted from it, and voting is performed in the current particle swarm. Set as a consensus community, all embedded in the current particle;

每代中进行以变异概率为pm的对粒子进行基于邻域的单点变异，对每个需要变异的粒子，随机选择一个位置i，其标签用其邻域中的另一个标签替换，这保证了新的可能解的产生。In each generation, a neighborhood-based single-point mutation is performed on the particles with the mutation probability pm. For each particle that needs to be mutated, a position i is randomly selected, and its label is replaced by another label in its neighborhood, which ensures that generation of new possible solutions.

5)停止条件：若满足t≤maxgen，则t←t+1并转向步骤3，否则停止算法并返回帕累托前沿解，即多个社区划分结果。5) Stop condition: if t≤maxgen is satisfied, then t←t+1 and turn to step 3, otherwise stop the algorithm and return to the Pareto frontier solution, that is, the result of multiple community divisions.

图1为在每代中粒子的更新方式示意图。在每一代更新中，所有的粒子都会根据非支配关系进行排序，维护一个全局最优解gbest，gbest中划分的社区可以被看作是一个个共识社区，当前代的粒子随机被一个共识社区嵌入，并与原粒子做非支配排序选择，留下的即为产生的新粒子。Figure 1 is a schematic diagram of how particles are updated in each generation. In each generation update, all particles will be sorted according to the non-dominant relationship to maintain a global optimal solution gbest. The communities divided in gbest can be regarded as a consensus community, and the particles of the current generation are randomly embedded by a consensus community , and make a non-dominated sorting selection with the original particle, and what is left is the new particle generated.

图2为每隔10代粒子的基于共识嵌入的更新方式示意图。在每隔10代的更新中，在所有粒子组成的帕累托前沿中，选出局部knee点(即极值点)和KKM最小值点、RC最小值点，在这h个点中提取共识社区(本发明中设置的支持度为70％)，并将得到的共识社区嵌入到当前代所有的粒子中，更新所有粒子。Figure 2 is a schematic diagram of the update method based on consensus embedding of particles every 10 generations. In the update every 10 generations, in the Pareto frontier composed of all particles, select the local knee points (ie extreme points), KKM minimum points, RC minimum points, and extract consensus among these h points community (the support degree set in the present invention is 70%), and the obtained consensus community is embedded into all particles of the current generation, and all particles are updated.

表1给出本发明的方法(NE-PSO)与其他基准算法在11个真实世界网络数据集上的模块度Q(最大值和平均值)的对比。Table 1 presents the comparison of the modularity Q (maximum and average) of the method of the present invention (NE-PSO) with other benchmark algorithms on 11 real-world network datasets.

表1Table 1

注：“——”说明对应的方法无法在给定的时间内得到结果。“？”说明对应的方法未公开导致对应的实验结果未知。Note: "——" indicates that the corresponding method cannot get the result within the given time. "?" indicates that the corresponding method is not disclosed, resulting in unknown experimental results.

从表1可以看出，本发明的方法在采用的11个数据集中有9个效果都比其他的基准算法好，特别是在网络节点多的数据集上非常有竞争力。并且根据图3和4在小规模网络数据集和大规模网络数据集上运行时间的折线图，可以看出，NE-PSO还明显减少了多目标粒子群算法的收敛时间。因此，本发明提高了社区发现的准确率，同时又能有效减少算法收敛时间，在实际应用当中有很好的实用性。It can be seen from Table 1 that the method of the present invention has better effects than other benchmark algorithms in 9 of the 11 data sets used, especially in the data set with many network nodes. And according to the line graphs of running time on small-scale network datasets and large-scale network datasets in Figures 3 and 4, it can be seen that NE-PSO also significantly reduces the convergence time of multi-objective particle swarm optimization. Therefore, the present invention improves the accuracy of community discovery, and at the same time can effectively reduce the algorithm convergence time, and has good practicability in practical applications.

上述实施例仅为本发明的较佳实施例，不能被认为用于限定本发明的实施范围。凡依本发明申请范围所作的均等变化与改进等，均应仍归属于本发明的专利涵盖范围之内。The above-mentioned embodiments are only preferred embodiments of the present invention, and should not be construed as limiting the scope of implementation of the present invention. All equivalent changes and improvements made according to the scope of the application of the present invention should still belong to the scope of the patent of the present invention.

Claims

1. A consensus embedding-based complex network community discovery method is characterized by comprising the following steps:

1) giving a maximum algebra maxgen and a particle swarm size pop;

2) given network G ═ V, E, the network size is n, carry on the network and represent and study;

3) representing a learning result by using a network, initializing a particle swarm to obtain 100 particles POP, wherein the iteration number t is 1;

4) carrying out updating and mutation based on consensus embedding on the POP;

5) stopping conditions are as follows: and if t is not more than maxgen, t ← t +1 and go to step 3), otherwise, stopping and returning pareto front edge solution, namely a plurality of community division results.

2. The method for discovering complex web communities based on consensus embedding as claimed in claim 1, wherein in step 2), the specific steps of performing web representation learning are:

the method comprises the steps of adopting a network embedding method AROPE based on singular value decomposition and eigenvalue decomposition frames and keeping any order of similarity, mapping a high-dimensional adjacent matrix into a low-dimensional continuous eigenspace, mining underlying structure information hidden in a network, and obtaining an eigenvector E ═ { E } of a node₁,e₂,e₃,...,e_nIn which e_i＝{e_i1,e_i2,e_i3,...,e_idD is the dimensionality after dimensionality reduction.

3. The method as claimed in claim 1, wherein in step 3), the specific steps of using the network to represent the learning result and performing initialization of the particle swarm to obtain 100 particle POPs are as follows:

calculating similarity to obtain S ═ S_ij)_n*n(ii) a Particle swarm POP capable of obtaining initialization based on similarity by using k-means^t＝{x₁,x₂,x₃,...,x_pop}^t(ii) a The local optimal solution for the particle is also initialized to Pbest ═ POP^t。

4. The method as claimed in claim 1, wherein in step 4), the step of updating and mutating the POP based on consensus embedding comprises:

performing consensus-based updates in each generation;

performing consensus embedding every 10 generations;

and performing neighborhood-based single-point mutation on the particles with mutation probability of pm in each generation.

5. The method for discovering complex network communities based on consensus embedding as claimed in claim 1, wherein in step 4), the updating and mutation iterates the updating and mutation process of the particles for maxgen times;

when updating based on consensus is carried out in each generation, taking the communities in the gbest as consensus, and randomly selecting one of the communities to be embedded into the current particle; when the consensus embedding is carried out every 10 generations, extracting the characteristic solution in the current particle swarm every 10 generations, extracting all two-node communities from the characteristic solution, voting in the current particle swarm, and if the support degree is more than or equal to 70%, selecting the characteristic solution as the consensus community and completely embedding the characteristic solution into the current particle;

wherein, the characteristic solutions extracted from the solution set are three representative solutions, which are respectively:

(1) solutions with minimum KKM

(2) Solutions with minimum RC

(3) Local knee solution with minimum manhattan distance

When single-point variation based on neighborhood is carried out on the particles with variation probability pm in each generation, a position i is randomly selected for each particle needing variation, and the label of the position i is replaced by another label in the neighborhood, so that the generation of a new possible solution is ensured.