CN105930457A

CN105930457A - Distributed architecture-based data flow frequent item mining method

Info

Publication number: CN105930457A
Application number: CN201610254621.1A
Authority: CN
Inventors: 张玉; 徐敬东; 张建忠; 于博文; 陈正阳
Original assignee: Nankai University
Current assignee: Nankai University
Priority date: 2016-04-21
Filing date: 2016-04-21
Publication date: 2016-09-07

Abstract

The invention discloses a distributed architecture-based data flow frequent item mining method. According to the method, a two-layer tree-shaped communication structure which comprises m leaf nodes and 1 root node is adopted, wherein the leaf nodes are responsible for processing data items in data flows and sending frequency increments to the root node when increments of frequencies of the data items exceed a threshold value; the root node is responsible for collecting updating transferred by the leaf nodes. The method is low in communication overhead, and can be used for responding frequent item query requests initiated by users in real time.

Description

Mining method of frequent items in data stream based on distributed architecture

技术领域technical field

本发明属于数据挖掘技术领域，涉及一种频繁项挖掘方法，特别是涉及适用于分布式架构的频繁项挖掘方法。The invention belongs to the technical field of data mining, and relates to a frequent item mining method, in particular to a frequent item mining method applicable to a distributed architecture.

背景技术Background technique

在关联规则挖掘、序列模式挖掘、相关性挖掘、多层模式挖掘等数据挖掘问题中，频繁项挖掘既是基本步骤,也是关键步骤。目前有许多在单一节点上数据流的频繁项挖掘的研究，这些研究大致可以划分两类：基于计数器的方法和基于sketch的方法。In data mining problems such as association rule mining, sequential pattern mining, correlation mining, and multi-layer pattern mining, frequent item mining is not only a basic step, but also a key step. At present, there are many researches on frequent item mining of data streams on a single node, which can be roughly divided into two categories: counter-based methods and sketch-based methods.

基于计数器的频繁项挖掘方法会维护一组计数器用来统计数据项的频率。每个计数器包含两个参数，分别是数据项名称和数据项频率。当数据流中有数据项到来时，如果维护的计数器中存在这个数据项，则相应增加该计数器所记录的频率；否则增加一个新的计数器用来存储新数据项或者替换一个原有的计数器。通常基于计数器的频繁项挖掘方法处理单个数据项的开销很低，但是需要周期性地对所有计数器进行排序。基于计数器的频繁项挖掘方法有Frequent、Space Saving和Lossy Counting等。The counter-based frequent item mining method maintains a set of counters to count the frequency of data items. Each counter contains two parameters, the data item name and the data item frequency. When a data item arrives in the data stream, if the data item exists in the maintained counter, the frequency recorded by the counter is increased accordingly; otherwise, a new counter is added to store the new data item or replace an original counter. Usually counter-based frequent item mining methods have low overhead for processing individual data items, but need to sort all counters periodically. Counter-based frequent item mining methods include Frequent, Space Saving, and Lossy Counting.

基于sketch的方法采用由一维或者两维计数器数组构成的哈希表来估计数据流中各个数据项的频率。这类方法通常采用哈希技术将每个数据项映射到对应的多个计数器上，单个计数器可能被多个数据项所共享，即具有相同哈希值的数据项共享同一个计数器。当一个数据项到达时，只需修改对应的计数器的值。当用户提交查询请求时，使用对应的计数器的值来估计频率，能够以较高的置信度将误差控制在一定的范围之内。一般而言，基于哈希的方法需要额外数据结构的支持，如使用一个堆来跟踪候选的频繁项。基于Skech频繁项挖掘方法有CountSketch、CountMin Sketch和hCount等。Sketch-based methods use hash tables composed of one-dimensional or two-dimensional counter arrays to estimate the frequency of individual data items in a data stream. Such methods usually use hash technology to map each data item to corresponding multiple counters, and a single counter may be shared by multiple data items, that is, data items with the same hash value share the same counter. When a data item arrives, only the value of the corresponding counter needs to be modified. When a user submits a query request, the corresponding counter value is used to estimate the frequency, and the error can be controlled within a certain range with a high degree of confidence. In general, hash-based methods require the support of additional data structures, such as using a heap to track candidate frequent items. Skech-based frequent item mining methods include CountSketch, CountMin Sketch and hCount, etc.

传统的频繁项挖掘方法大多只考虑将单一数据源在单一节点上进行处理，然而目前在很多实际应用中，需要挖掘的数据是大规模分布式的，例如网络监控中的流检测、DDOS攻击检测等。目前针对分布式流环境下的频繁项挖掘方法的研究主要包括A.Manjhi提出的适用于树形或是多路径图形的拓扑结构的Tributary-Delta方法；G.Cormode提出的一种可以连续跟踪高速度分布式流中的频繁项的近似方法；B.Babcock和C.Olston提出的分布式top-k监听的方法等。这些方法存在的主要不足包括：通信开销过大以及不支持实时查询。Most of the traditional frequent item mining methods only consider processing a single data source on a single node. However, in many practical applications, the data to be mined is large-scale distributed, such as flow detection in network monitoring, DDOS attack detection Wait. At present, the research on frequent item mining methods in the distributed flow environment mainly includes the Tributary-Delta method proposed by A. Manjhi, which is suitable for tree or multi-path graph topology; Approximation methods for frequent items in velocity distributed streams; methods for distributed top-k monitoring proposed by B.Babcock and C.Olston, etc. The main disadvantages of these methods include: communication overhead is too large and real-time query is not supported.

发明内容Contents of the invention

本发明目的是解决现有方法存在通信开销过大以及不支持实时查询的问题，提供一种基于分布式架构的数据流频繁项挖掘方法，可用以提高传统频繁项挖掘方法在分布式架构上的处理能力。The purpose of the present invention is to solve the problems of excessive communication overhead and unsupported real-time query in existing methods, and provide a data stream frequent item mining method based on a distributed architecture, which can be used to improve the traditional frequent item mining method on a distributed architecture. processing power.

本发明提供了基于分布式架构的数据流频繁项挖掘方法，该方法是一种带权值的分布式数据流频繁项挖掘方法的ε-近似方法。该方法采用2层树形的通信结构，包括m个叶子节点和1个根节点；所述叶子节点负责处理数据流中的数据项，将数据流中的数据项频率存储在叶子节点的最小堆中，并在数据项频率增量大于阈值时，将数据项频率增量发送至根节点；所述根节点负责计算数据项在整体架构中的频率估计值，将数据项频率估计值存储在根节点的最小堆中；所述叶子节点的最小堆中存储的条目包括数据项名称、数据项频率以及数据项频率增量；所述根节点的最小堆中存储的条目包括数据项名称以及数据项频率估计值。The invention provides a method for mining frequent items of data streams based on a distributed architecture, which is an ε-approximation method of a method for mining frequent items of distributed data streams with weights. The method adopts a two-layer tree communication structure, including m leaf nodes and a root node; the leaf nodes are responsible for processing data items in the data stream, and store the frequency of data items in the data stream in the minimum heap of the leaf nodes , and when the data item frequency increment is greater than the threshold, the data item frequency increment is sent to the root node; the root node is responsible for calculating the frequency estimate of the data item in the overall architecture, and storing the data item frequency estimate in the root In the minimum heap of the node; the entries stored in the minimum heap of the leaf node include the data item name, the data item frequency and the data item frequency increment; the entries stored in the root node's minimum heap include the data item name and the data item frequency estimates.

本发明技术方案：Technical scheme of the present invention:

步骤1)、每个叶子节点i从所收到的数据流中依次取出数据项，所述数据项包括数据项名称v_t及数据项频率c_v，t；Step 1), each leaf node i takes out data item successively from received data flow, and described data item comprises data item name v _t and data item frequency c _{v, t} ;

步骤2)、更新所述叶子节点的数据项频率之和N_i＝N_i+c_v，t以及数据项频率之和的增量Δ_i＝Δ_i+c_v，t，其中的等号表述赋值，下同；Step 2), updating the sum of data item frequencies of the leaf nodes N _i =N _i +c _v,t and the increment of the sum of data item frequencies Δ _i =Δ _i +c _v,t , where the equal sign expresses assignment, the same below;

步骤3)、根据步骤1)所取出的数据项的数据项名称v_t和数据项频率c_v，t在所述的叶子节点的最小堆H_i中找出合适的条目，并为该条目中的数据项名称、数据项频率以及数据项频率增量赋值；该步骤包括：Step 3), according to the data item name v _t and the data item frequency c _{v, t} of the data item taken out in step 1), find out a suitable entry in the minimum heap H _i of the leaf node, and for this entry The data item name, data item frequency and data item frequency increment assignment; this step includes:

步骤3-1)、判断所述的叶子节点的最小堆中是否存在数据项名称为v_t的条目，若存在执行下一步，否则，执行步骤3-5)；Step 3-1), judging whether there is an entry whose data item name is v _t in the minimum heap of the leaf node, if there is, execute the next step, otherwise, execute step 3-5);

步骤3-2)、判断所述的叶子节点的最小堆H_i是否已满，若已满，执行下一步，否则，执行步骤3-4)；Step 3-2), judging whether the minimum heap H _i of the leaf node is full, if full, execute the next step, otherwise, execute step 3-4);

步骤3-3)、从所述的叶子节点的最小堆H_i中取出数据项频率最小的条目item_min，对该条目重新赋值，然后执行步骤4)；其中，对该条目赋值包括：Step 3-3), taking out the entry item _min with the smallest data item frequency from the minimum heap H _i of the leaf node, reassigning the entry, and then performing step 4); wherein, assigning the entry includes:

令v＝v_t，c_v＝c_v+c_v，t，Δ_v＝c_v，t；Let v=vt, _cv = _cv + _cv,t _, _Δv = _cv,t ;

所述的v表示取出条目的数据项名称，所述的v_t表示取出的数据项的数据项名称，所述的c_v表示取出条目的数据项频率，所述的c_v，t表示取出的数据项的数据项频率，所述的Δ_v表示取出条目的数据项频率增量；The v represents the data item name of the fetched entry, the v _t represents the data item name of the fetched data item, the c _v represents the data item frequency of the fetched entry, and the c _{v, t} represents the fetched The data item frequency of the data item, the _Δv represents the frequency increment of the data item of the fetched item;

步骤3-4)、创建一个新条目并为新条目赋值，将新条目插入所述的叶子节点的最小堆H_i中，然后执行步骤4)；其中，对新条目赋值包括：Step 3-4), creating a new entry and assigning a value to the new entry, inserting the new entry into the minimum heap H _i of the leaf node, and then performing step 4); wherein, assigning a value to the new entry includes:

令v＝v_t，c_v＝c_v，t，Δ_v＝c_v，t；Let v=vt, _cv = _cv,t _, _Δv = _cv,t ;

步骤3-5)、从所述的叶子节点的最小堆H_i中取出已存在的条目并对该条目进行更新，然后执行步骤4)；其中，对该条目更新包括：Step 3-5), take out the existing entry from the minimum heap H _i of the leaf node And update the entry, and then perform step 4); wherein, updating the entry includes:

令c_v＝c_v+c_v，t，Δ_v＝Δ_v+c_v，t；Let _cv = _cv + _cv,t , _Δv = _Δv + _cv,t ;

步骤4)、判断所述的数据项频率之和的增量以及所述条目的数据项频率增量是否大于阈值，如果大于阈值，向根节点传递更新；该步骤包括：Step 4), judging whether the increment of the sum of the data item frequencies and the data item frequency increment of the entry is greater than a threshold, if greater than the threshold, transfer an update to the root node; this step includes:

步骤4-1)、判断所述的数据项频率之和的增量Δ_i是否满足Δ_i＞β_iN_i，如果满足，执行下一步，否则，执行步骤4-3)；其中，Step 4-1), judging whether the increment Δ _i of the sum of data item frequencies satisfies Δ _i > β _i N _i , if so, execute the next step, otherwise, execute step 4-3); wherein,

所述的β_i表示用户定义的叶子节点的更新延迟系数，所述的N_i表示所述叶子节点的数据项频率之和；The β _i represents the update delay coefficient of the leaf node defined by the user, and the N _i represents the sum of the data item frequencies of the leaf node;

步骤4-2)、所述的叶子节点向根节点发送0-msg更新，然后将Δ_i的值置为0；其中Step 4-2), the leaf node sends a 0-msg update to the root node, and then sets the value of _Δi to 0; wherein

所述的0-msg更新发送的内容包括所述的数据项频率之和的增量Δ_i；The content sent by the 0-msg update includes the increment _Δi of the frequency sum of the data items;

步骤4-3)、所述的条目的数据项频率增量Δ_v是否满足Δ_v＞β_iN_i，如果满足，执行下一步，否则，执行步骤5)；Step 4-3), whether the data item frequency increment Δ _v of the entry satisfies Δ _v > β _i N _i , if so, execute the next step, otherwise, execute step 5);

步骤4-4)、所述的叶子节点向根节点发送数据项更新，然后将Δ_v的值置为0；其中Step 4-4), the leaf node sends a data item update to the root node, and then sets the value of _Δv to 0; wherein

所述的数据项更新发送的内容包括所述的条目的数据项名称以及所述的条目的数据项频率增量Δ_v；The content sent by the data item update includes the data item name of the entry and the data item frequency increment _Δv of the entry;

步骤5)、所述的根节点从所述的叶子节点发送的更新中依次取出更新，并根据取出的更新维护相应的数据；该步骤包括：Step 5), the root node sequentially fetches updates from the updates sent by the leaf nodes, and maintains corresponding data according to the fetched updates; this step includes:

步骤5-1)、判断根节点取出的所述叶子节点发送的更新的类型，如果是0-msg更新，执行下一步，如果是数据项更新，执行步骤5-3)；Step 5-1), determine the type of update sent by the leaf node taken out by the root node, if it is a 0-msg update, perform the next step, if it is a data item update, perform step 5-3);

步骤5-2)、更新所述的根节点的数据项频率之和的估计值N＝N+Δ_i，其中等号表示赋值，然后执行步骤6)；其中，Step 5-2), updating the estimated value N=N+Δ _i of the sum of the data item frequencies of the root node, wherein the equal sign indicates assignment, and then perform step 6); wherein,

所述的N表示根节点的数据项频率之和的估计值，所述的Δ_i表示所述的叶子节点发送的0-msg更新的频率；The N represents the estimated value of the sum of the data item frequencies of the root node, and the _Δi represents the frequency of the 0-msg update sent by the leaf node;

步骤5-3)、更新所述的根节点的最小堆H₀；该步骤包括：Step 5-3), updating the minimum heap H ₀ of the root node; this step includes:

步骤5-3-1)、取出所述的叶子节点发送的更新中的数据项名称v_t以及数据项频率增量Δ_v，t；Step 5-3-1), taking out the data item name v _t and data item frequency increment Δ _{v, t} in the update sent by the leaf node;

步骤5-3-2)、判断所述的根节点的最小堆中是否存在数据项名称为v_t的条目item_v，若存在执行下一步，否则，执行步骤5-3-4)；Step 5-3-2), judging whether there is an entry item _v whose data item name is v _t in the minimum heap of the root node, if there is, execute the next step, otherwise, execute step 5-3-4);

步骤5-3-3)、取出所述的条目item_v，对并对该条目进行更新，然后执行步骤6)；其中，对该条目更新包括：Step 5-3-3), take out the entry item _v , update and update the entry, and then perform step 6); wherein, updating the entry includes:

令c_v＝c_v+Δ_v,t，其中等号表示赋值；Let c _v =c _v +Δ _v,t , where the equal sign indicates assignment;

所述的v表示取出条目的数据项名称，所述的c_v表示取出条目的数据项频率，所述的Δ_v，t表示取出数据项更新的数据项频率增量；The v represents the data item name of the entry, the c _v represents the data item frequency of the entry, and the Δv _{, t} represents the frequency increment of the data item for the update of the data item;

步骤5-3-4)、判断所述的根节点的最小堆H₀是否已满，若已满，执行下一步，否则执行5-3-6)；Step _5-3-4 ), judging whether the minimum heap H0 of the root node is full, if full, execute the next step, otherwise execute 5-3-6);

步骤5-3-5)、取出所述的根节点的最小堆H₀中数据项频率最小的条目item_min，对该条目重新赋值，然后执行步骤6)；其中，对该条目赋值包括：Step 5-3-5), taking out the entry item _min with the smallest data item frequency in the minimum heap _H0 of the root node, reassigning the entry, and then performing step 6); wherein, assigning the entry includes:

令v＝v_t，c_v＝c_v+Δ_v，t；Let v=vt, _cv = _cv +Δv _,t _;

所述的v_t表示取出的数据项更新的数据项名称；The v _t represents the data item name of the updated data item taken out;

步骤5-3-6)、创建一个新条目并为新条目赋值，将新条目插入所述的根节点维护的最小堆中，然后执行步骤6)；其中，对新条目赋值包括：Step 5-3-6), creating a new entry and assigning a value to the new entry, inserting the new entry into the minimum heap maintained by the root node, and then performing step 6); wherein, assigning a value to the new entry includes:

令v＝v_t，c_v＝Δ_v，t；Let v=vt, _cv =Δv _,t _;

步骤6)、根据用户的请求，根节点遍历所述的最小堆H₀，返回所有数据项频率的条目为所要挖掘的频繁项。Step 6), according to the user's request, the root node traverses the minimum heap H ₀ , and returns the frequency of all data items The entries of are the frequent items to be mined.

本发明在所述的步骤4)和步骤5)之间还包括对叶子节点的最小堆H_i进行按照数据项的频率进行排序的操作步骤。The present invention also includes an operation step of sorting the minimum heap H _i of the leaf nodes according to the frequency of data items between step 4) and step 5).

以及，在所述的步骤5)和步骤6)之间还包括对根节点的最小堆H₀进行按照数据项的频率进行排序的操作步骤。And, an operation step of sorting the minimum heap H0 of the root node according to the frequency of the data items is also included between the steps ₅ ) and 6).

本发明的优点和有益效果：Advantages and beneficial effects of the present invention:

本发明提供的方法输出的频繁项的数据项频率的估计值与真实值之间的误差不大于εN，单条链路上的最大通信开销不大于可以支持用户实时的频繁项查询。The error between the estimated value of the data item frequency of the frequent item output by the method provided by the present invention and the real value is not greater than εN, and the maximum communication overhead on a single link is not greater than It can support users to query frequent items in real time.

附图说明Description of drawings

图1是基于分布式架构的数据流频繁项挖掘方法的通信结构。Figure 1 is the communication structure of the method for mining frequent items in data streams based on distributed architecture.

图2是基于分布式架构的数据流频繁项挖掘方法的平均相对误差。Figure 2 is the average relative error of the data stream frequent item mining method based on the distributed architecture.

图3是基于分布式架构的数据流频繁项挖掘方法的单链路通信开销。Figure 3 shows the single-link communication overhead of the data stream frequent item mining method based on the distributed architecture.

图4是基于分布式架构的数据流频繁项挖掘方法的初始化时间。Figure 4 is the initialization time of the data stream frequent item mining method based on the distributed architecture.

具体实施方式detailed description

为了更清晰直观的表达本发明的方法思路，下面对基于分布式架构的数据流频繁项挖掘方法的细节进行详细说明：In order to express the method idea of the present invention more clearly and intuitively, the details of the method for mining frequent items of data streams based on the distributed architecture are described in detail below:

1.确定参数1. Determine the parameters

分布式频繁项挖掘方法需要确定的参数包括：The parameters that need to be determined in the distributed frequent item mining method include:

(1)数据项支持度和误差度 (1) Data item support and degree of error

(2)叶子节点数据m；(2) Leaf node data m;

(3)每个叶子节点i和根节点的最小堆H_i和H₀的大小为和 (3) The size of the minimum heap H _i and H ₀ of each leaf node i and root node is and

(4)每个叶子节点i的延迟更新系数β_i(0＜β_i＜ε)。(4) Delay update coefficient β _i (0<β _i <ε) of each leaf node i.

在本实施例中叶子节点数量m＝8，叶子节点和根节点的最小堆的大小均为10000，即α₀＝α_i＝0.0001，β_i∈[0.001，0.005]，ε∈[0.0001，0.0005]， In this embodiment, the number of leaf nodes is m=8, and the minimum heap sizes of leaf nodes and root nodes are both 10000, that is, α ₀ =α _i =0.0001, β _i ∈[0.001,0.005], ε∈[0.0001,0.0005] ],

2.初始化2. Initialization

确定每个叶子节点i的初始化时间为B_i＞0为叶子节点i与根节点之间链路的带宽，单位为数据包/秒。在本发明方法运行的最初时刻，每个叶子节点i会处理收到的来自数据流S_i的数据项更新，但却不会将任何消息传递给它的根节点直至初始化状态结束。Determine the initialization time of each leaf node i as B _i >0 is the bandwidth of the link between the leaf node i and the root node, and the unit is data packet/second. At the initial moment of the operation of the method of the present invention, each leaf node i will process the data item update received from the data flow S _i , but will not pass any message to its root node until the initialization state ends.

3.叶子节点处理数据流中的数据项3. Leaf nodes process data items in the data stream

当数据流S_i中有新数据项(v，c_v，t)到达叶子节点i时，首先更新数据流S_i中数据项频率之和N_i＝N_i+c_v，t，其中等号表示赋值，下同，以及数据项频率之和的频率估计值增量Δ_i＝Δ_i+c_v，t。其次更新相应的数据项频率：如果v∈H_i，则增加相应数据项条目的频率估计值c_v＝c_v+c_v，t和数据项v的频率估计值增量Δ_v＝Δ_v+c_v，t；否则找到H_i中频率估计值最小的数据项item_min，将item_min的数据项名替换为v，并更新其数据项频率c_v＝c_v+c_v，t和频率估计值增量Δ_v＝c_v，t。最后检查是否满足条件向根节点传递数据项更新：如果Δ_i＞β_iN_i，则发送更新(0，Δ_i)给根节点，并重置频率估计值增量Δ_i＝0；如果Δ_v＞β_iN_i，则发送更新(v，Δ_v）给根节点，并重置频率估计值增量Δ_v＝0。When a new data item (v, c _{v, t} ) in the data stream S _i arrives at the leaf node i, first update the sum of the frequency of the data item in the data stream S _i N _i =N _i +c _{v, t} , where the equal sign Indicates the assignment, the same below, and the frequency estimation value increment Δ _i =Δ _i +c _v,t of the sum of the frequencies of the data items. Then update the corresponding data item frequency: if v∈H _i , then increase the frequency estimate value of the corresponding data item entry _cv = _cv + _{cv, t} and the frequency estimate value increment of data item _v Δv = _Δv + c _{v, t} ; otherwise, find the data item _min with the smallest frequency estimate in H _i , replace the data item name of item _min with v, and update its data item frequency c _v = c _v + c _{v, t} and frequency estimate Value increment Δ _v =c _v,t . Finally, check whether the condition is satisfied and transmit the data item update to the root node: if Δ _i > β _i N _i , then send the update (0, Δ _i ) to the root node, and reset the frequency estimation increment Δ _i = 0; if Δ _v >β _i N _i , then send an update (v, Δ _v ) to the root node, and reset the increment of the frequency estimation value Δ _v =0.

4.根节点处理叶子节点发送的更新4. The root node processes the updates sent by the leaf nodes

当根节点收到叶子节点传递的数据项更新(v，Δ_v)时，如果v∈H₀，则增加相应数据项条目的频率估计值c_v＝c_v+Δ_v；否则找到H₀中频率估计值最小的数据项item_min，将item_min的数据项名替换为v，并更新其数据项频率c_v＝c_v+Δ_v。当根节点收到叶子节点传递的0-msg更新(0，Δ_i)时，更新根节点对数据项频率之和的估计值N₀＝N₀+Δ_i。When the root node receives the data item update (v, Δ _v ) delivered by the leaf node, if v∈H ₀ , then increase the frequency estimation value of the corresponding data item entry c _v ₌ c _v +Δ _v ; otherwise, find For the data item item _min with the smallest estimated frequency value, replace the data item name of item _min with v, and update its data item frequency c _v =c _v +Δ _v . When the root node receives the 0-msg update (0, Δ _i ) delivered by the leaf node, it updates the root node's estimated value N ₀ =N ₀ +Δ _i of the sum of data item frequencies.

5.根节点处理用户发起的频繁项查询5. The root node handles frequent item queries initiated by users

当用户向根节点提交频繁项查询时，根节点扫描最小堆H₀中维护的每个数据项条目(v，c_v)。如果cv_≥ωN₀，则认为v是频繁项并将v输出，其中ω为输出阈值，有为用户定义的数据项支持度，α_max＝max(α_i)，β_max＝max(β_i)。When a user submits a frequent item query to the root node, the root node scans each data item entry (v, c _v ) maintained in the min-heap H ₀ . If cv _≥ ωN ₀ , v is considered to be a frequent item and v will be output, where ω is the output threshold, there is is the user-defined data item support, α _max =max(α _i ), β _max =max(β _i ).

本发明采用现实数据与计算机模拟的方式实施。The present invention is implemented by means of real data and computer simulation.

本发明选择3组真实的网络环境下采集的网络流量数据集作为实施例中的数据源。这3组数据集分别是：CERNET数据集，是在CERNET(China Education and Research Network)的OC-48链路上采集的TCP双向数据集；CAIDA48数据集，是在OC-48west coast peering link上采集的匿名数据集；CAIDA192数据集，是在OC-192链路上采集的单项匿名数据集。本发明将网络流量数据集中IP数据包的五元组(源IP地址，目的IP地址，源端口，目的端口，传输层协议)定义为数据项名，将数据包负载的长度定义为数据项频率。The present invention selects 3 groups of network traffic data sets collected in a real network environment as data sources in the embodiment. These three sets of data sets are: CERNET data set, which is a TCP bidirectional data set collected on the OC-48 link of CERNET (China Education and Research Network); CAIDA48 data set, which is collected on the OC-48west coast peering link The anonymous data set of CAIDA192 data set is a single anonymous data set collected on the OC-192 link. The present invention defines the quintuple (source IP address, destination IP address, source port, destination port, transport layer protocol) of the IP data packet in the network flow data set as the data item name, and defines the length of the data packet load as the data item frequency .

定义更新延迟系数β的相对值为图2显示了本发明方法处理3组不同的网络数据集的平均相对误差。可以观测到，在ε∈[0.0001，0.0005]时，方法的平均相对误差均小于当前ε的值与N乘积。Define the relative value of the update delay coefficient β as Fig. 2 shows the average relative error of the method of the present invention for processing 3 different network datasets. It can be observed that when ε∈[0.0001, 0.0005], the average relative error of the method is smaller than the product of the current ε value and N.

图3显示了本发明方法处理3组不同的网络数据集的单链路开销。对于图3的每幅子图，分别有4条曲线，由上至下每条曲线分别表示处理当前网络数据集的单链路通信开销的理论最大值单链路通信开销的实际最大值、实际平均值以及实际最小值。可以观测到，单条链路上的实际最大通信开销不大于 Fig. 3 shows the single-link overhead of the method of the present invention for processing three different sets of network data sets. For each sub-graph in Figure 3, there are 4 curves, and each curve from top to bottom represents the theoretical maximum value of the single-link communication overhead for processing the current network data set Actual maximum, actual average, and actual minimum of single-link communication overhead. It can be observed that the actual maximum communication overhead on a single link is not greater than

图4显示了本发明方法的初始化时间。可以观测到，更新延迟系数β的相对值越大，本发明方法所需的初始化时间越少。Figure 4 shows the initialization time of the method of the present invention. It can be observed that the larger the relative value of the update delay coefficient β, the less the initialization time required by the method of the present invention.

Claims

1. A data flow frequent item mining method based on a distributed architecture adopts a 2-layer tree-shaped communication structure, and comprises m leaf nodes and 1 root node; the leaf node is responsible for processing data items in the data stream, storing the frequency of the data items in the data stream in a minimum heap of the leaf node, and sending the frequency increment of the data items to the root node when the frequency increment of the data items is greater than a threshold value; the root node is responsible for calculating the frequency estimation value of the data item in the whole framework and storing the frequency estimation value of the data item in the minimum heap of the root node; the entries stored in the minimal heap of leaf nodes include data item names, data item frequencies, and data item frequency increments; the entries stored in the minimum heap of the root node include data item names and data item frequency estimates; the method comprises the following steps:

step 1), each leaf node i sequentially takes out data items from the received data stream, wherein the data items comprise data item names v_tAnd frequency c of data items_v，t；

Step 2), updating the sum N of the data item frequencies of the leaf nodes_i＝N_i+c_v，tAnd the increment delta of the sum of the frequencies of the data items_i＝Δ_i+c_v，tWherein the equal sign represents the assignment, the same as follows;

step 3) of retrieving the data item name v of the data item according to step 1)_tAnd data item frequency c_v，tMinimum heap H at said leaf node_iFinding out a proper entry, and assigning values to the name of the data item, the frequency of the data item and the frequency increment of the data item in the entry; the method comprises the following steps:

step 3-1), judging whether the data item name v exists in the minimum heap of the leaf nodes_tIf the item (3) exists, executing the next step, otherwise, executing the step 3-5);

step 3-2), judging the minimum pile H of the leaf nodes_iWhether the container is full, if so, executing the next step, otherwise, executing the step 3-4);

step 3-3), from said minimum heap of leaf nodes H_iItem with minimum data item frequency_minReassigns the entry and then performs step 4); wherein assigning the entry comprises:

let v equal v_t，c_v＝c_v+c_v，t，Δ_v＝c_v，t；

Said v representing the name of the data item from which the entry was fetched, said v_tA data item name indicating a retrieved data item, said c_vFrequency of data items representing fetched entries, said c_v，tData item frequency, said Δ, representing a retrieved data item_vA data item frequency increment representing a fetched entry;

step 3-4), creating a new entry and assigning a value to the new entry, and inserting the new entry into the minimum heap H of the leaf nodes_iThen step 4) is performed; wherein assigning the new entry comprises:

let v equal v_t，c_v＝c_v，t，Δ_v＝c_v，t；

Step 3-5), from said minimum heap of leaf nodes H_iFetching an existing entryUpdating the entry, and then executing the step 4); wherein updating the entry comprises:

let c_v＝c_v+c_v，t，Δ_v＝Δ_v+c_v，t；

Step 4), judging whether the increment of the sum of the data item frequencies and the data item frequency increment of the entry are greater than a threshold value, and if so, transmitting the update to the root node; the method comprises the following steps:

step 4-1), judging increment delta of the sum of the data item frequencies_iWhether or not Δ is satisfied_i＞β_iN_iIf yes, executing the next step, otherwise, executing the step 4-3); wherein,

β as described_iAn update delay factor representing a user-defined leaf node, said N_iA sum of data item frequencies representing the leaf nodes;

step 4-2), the leaf node sends 0-msg update to the root node, and then the delta is updated_iIs set to 0; wherein

Said 0-msg update transmitted content comprises an increment delta of the sum of said data item frequencies_i；

Step 4-3), data item frequency increment delta of said item_vWhether or not Δ is satisfied_v＞β_iN_iIf yes, executing the next step, otherwise, executing the step 5);

step 4-4), the leaf node sends data item update to the root node, and then the delta is sent_vIs set to 0; wherein

The data item updating transmission content comprises the data item name of the item and the data item frequency increment delta of the item_v；

Step 5), the root node sequentially takes out updates from the updates sent by the leaf nodes and maintains corresponding data according to the taken-out updates; the method comprises the following steps:

step 5-1), judging the type of the update sent by the leaf node taken out by the root node, if the type is 0-msg update, executing the next step, and if the type is data item update, executing step 5-3);

step 5-2), updating the estimated value N of the sum of the data item frequencies of the root node to be N + delta_iWherein the equal sign represents the assignment, and then step 6) is performed; wherein,

said N representing an estimate of the sum of the data item frequencies of the root node, said Δ_iIndicating a frequency of 0-msg updates sent by said leaf node;

step 5-3), updating the minimum heap H of the root node₀(ii) a The method comprises the following steps:

step 5-3-1), the data item name v in the update sent by the leaf node is taken out_tAnd data item frequency increment delta_v，t；

Step 5-3-2), judging whether the data item name v exists in the minimum heap of the root node_tItem of_vIf yes, executing the next step, otherwise, executing the step 5-3-4);

step 5-3-3), the item is taken out_vUpdating the entry, and then executing step 6); wherein updating the entry comprises:

let c_v＝c_v+Δ_v，tWherein the equal sign represents the assignment;

saidv denotes the data item name of the fetched entry, said c_vFrequency of data items representing fetched entries, said Δ_v，tA data item frequency increment representing a fetch data item update;

step 5-3-4), judging the minimum heap H of the root node₀Whether the container is full, if so, executing the next step, otherwise, executing 5-3-6);

step 5-3-5), taking out the minimum heap H of the root node₀Item with minimum frequency of medium data item_minReassigns the entry and then performs step 6); wherein assigning the entry comprises:

let v equal v_t，c_v＝c_v+Δ_v，t；

V is_tA data item name indicating a retrieved data item update;

step 5-3-6), creating a new entry and assigning a value to the new entry, inserting the new entry into the minimum heap maintained by the root node, and then executing step 6); wherein assigning the new entry comprises:

let v equal v_t，c_v＝Δ_v，t；

Step 6), according to the request of the user, the root node traverses the minimum heap H₀Frequency of all data items returnedThe entries of (1) are frequent items to be mined.

2. The method for mining frequent items of data flow based on distributed architecture as claimed in claim 1, further comprising a minimum heap H of leaf nodes between said steps 4) and 5)_iThe operation steps of sorting by frequency of the data items are performed.

3. The method for mining frequent items of data flow based on distributed architecture as claimed in claim 1, further comprising a minimum heap of root nodes between said steps 5) and 6)H₀The operation steps of sorting by frequency of the data items are performed.