[go: up one dir, main page]

CN105930457A - Distributed architecture-based data flow frequent item mining method - Google Patents

Distributed architecture-based data flow frequent item mining method Download PDF

Info

Publication number
CN105930457A
CN105930457A CN201610254621.1A CN201610254621A CN105930457A CN 105930457 A CN105930457 A CN 105930457A CN 201610254621 A CN201610254621 A CN 201610254621A CN 105930457 A CN105930457 A CN 105930457A
Authority
CN
China
Prior art keywords
data item
data
frequency
entry
root node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610254621.1A
Other languages
Chinese (zh)
Inventor
张玉
徐敬东
张建忠
于博文
陈正阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nankai University
Original Assignee
Nankai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nankai University filed Critical Nankai University
Priority to CN201610254621.1A priority Critical patent/CN105930457A/en
Publication of CN105930457A publication Critical patent/CN105930457A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2246Trees, e.g. B+trees
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a distributed architecture-based data flow frequent item mining method. According to the method, a two-layer tree-shaped communication structure which comprises m leaf nodes and 1 root node is adopted, wherein the leaf nodes are responsible for processing data items in data flows and sending frequency increments to the root node when increments of frequencies of the data items exceed a threshold value; the root node is responsible for collecting updating transferred by the leaf nodes. The method is low in communication overhead, and can be used for responding frequent item query requests initiated by users in real time.

Description

基于分布式架构的数据流频繁项挖掘方法Mining method of frequent items in data stream based on distributed architecture

技术领域technical field

本发明属于数据挖掘技术领域,涉及一种频繁项挖掘方法,特别是涉及适用于分布式架构的频繁项挖掘方法。The invention belongs to the technical field of data mining, and relates to a frequent item mining method, in particular to a frequent item mining method applicable to a distributed architecture.

背景技术Background technique

在关联规则挖掘、序列模式挖掘、相关性挖掘、多层模式挖掘等数据挖掘问题中,频繁项挖掘既是基本步骤,也是关键步骤。目前有许多在单一节点上数据流的频繁项挖掘的研究,这些研究大致可以划分两类:基于计数器的方法和基于sketch的方法。In data mining problems such as association rule mining, sequential pattern mining, correlation mining, and multi-layer pattern mining, frequent item mining is not only a basic step, but also a key step. At present, there are many researches on frequent item mining of data streams on a single node, which can be roughly divided into two categories: counter-based methods and sketch-based methods.

基于计数器的频繁项挖掘方法会维护一组计数器用来统计数据项的频率。每个计数器包含两个参数,分别是数据项名称和数据项频率。当数据流中有数据项到来时,如果维护的计数器中存在这个数据项,则相应增加该计数器所记录的频率;否则增加一个新的计数器用来存储新数据项或者替换一个原有的计数器。通常基于计数器的频繁项挖掘方法处理单个数据项的开销很低,但是需要周期性地对所有计数器进行排序。基于计数器的频繁项挖掘方法有Frequent、Space Saving和Lossy Counting等。The counter-based frequent item mining method maintains a set of counters to count the frequency of data items. Each counter contains two parameters, the data item name and the data item frequency. When a data item arrives in the data stream, if the data item exists in the maintained counter, the frequency recorded by the counter is increased accordingly; otherwise, a new counter is added to store the new data item or replace an original counter. Usually counter-based frequent item mining methods have low overhead for processing individual data items, but need to sort all counters periodically. Counter-based frequent item mining methods include Frequent, Space Saving, and Lossy Counting.

基于sketch的方法采用由一维或者两维计数器数组构成的哈希表来估计数据流中各个数据项的频率。这类方法通常采用哈希技术将每个数据项映射到对应的多个计数器上,单个计数器可能被多个数据项所共享,即具有相同哈希值的数据项共享同一个计数器。当一个数据项到达时,只需修改对应的计数器的值。当用户提交查询请求时,使用对应的计数器的值来估计频率,能够以较高的置信度将误差控制在一定的范围之内。一般而言,基于哈希的方法需要额外数据结构的支持,如使用一个堆来跟踪候选的频繁项。基于Skech频繁项挖掘方法有CountSketch、CountMin Sketch和hCount等。Sketch-based methods use hash tables composed of one-dimensional or two-dimensional counter arrays to estimate the frequency of individual data items in a data stream. Such methods usually use hash technology to map each data item to corresponding multiple counters, and a single counter may be shared by multiple data items, that is, data items with the same hash value share the same counter. When a data item arrives, only the value of the corresponding counter needs to be modified. When a user submits a query request, the corresponding counter value is used to estimate the frequency, and the error can be controlled within a certain range with a high degree of confidence. In general, hash-based methods require the support of additional data structures, such as using a heap to track candidate frequent items. Skech-based frequent item mining methods include CountSketch, CountMin Sketch and hCount, etc.

传统的频繁项挖掘方法大多只考虑将单一数据源在单一节点上进行处理,然而目前在很多实际应用中,需要挖掘的数据是大规模分布式的,例如网络监控中的流检测、DDOS攻击检测等。目前针对分布式流环境下的频繁项挖掘方法的研究主要包括A.Manjhi提出的适用于树形或是多路径图形的拓扑结构的Tributary-Delta方法;G.Cormode提出的一种可以连续跟踪高速度分布式流中的频繁项的近似方法;B.Babcock和C.Olston提出的分布式top-k监听的方法等。这些方法存在的主要不足包括:通信开销过大以及不支持实时查询。Most of the traditional frequent item mining methods only consider processing a single data source on a single node. However, in many practical applications, the data to be mined is large-scale distributed, such as flow detection in network monitoring, DDOS attack detection Wait. At present, the research on frequent item mining methods in the distributed flow environment mainly includes the Tributary-Delta method proposed by A. Manjhi, which is suitable for tree or multi-path graph topology; Approximation methods for frequent items in velocity distributed streams; methods for distributed top-k monitoring proposed by B.Babcock and C.Olston, etc. The main disadvantages of these methods include: communication overhead is too large and real-time query is not supported.

发明内容Contents of the invention

本发明目的是解决现有方法存在通信开销过大以及不支持实时查询的问题,提供一种基于分布式架构的数据流频繁项挖掘方法,可用以提高传统频繁项挖掘方法在分布式架构上的处理能力。The purpose of the present invention is to solve the problems of excessive communication overhead and unsupported real-time query in existing methods, and provide a data stream frequent item mining method based on a distributed architecture, which can be used to improve the traditional frequent item mining method on a distributed architecture. processing power.

本发明提供了基于分布式架构的数据流频繁项挖掘方法,该方法是一种带权值的分布式数据流频繁项挖掘方法的ε-近似方法。该方法采用2层树形的通信结构,包括m个叶子节点和1个根节点;所述叶子节点负责处理数据流中的数据项,将数据流中的数据项频率存储在叶子节点的最小堆中,并在数据项频率增量大于阈值时,将数据项频率增量发送至根节点;所述根节点负责计算数据项在整体架构中的频率估计值,将数据项频率估计值存储在根节点的最小堆中;所述叶子节点的最小堆中存储的条目包括数据项名称、数据项频率以及数据项频率增量;所述根节点的最小堆中存储的条目包括数据项名称以及数据项频率估计值。The invention provides a method for mining frequent items of data streams based on a distributed architecture, which is an ε-approximation method of a method for mining frequent items of distributed data streams with weights. The method adopts a two-layer tree communication structure, including m leaf nodes and a root node; the leaf nodes are responsible for processing data items in the data stream, and store the frequency of data items in the data stream in the minimum heap of the leaf nodes , and when the data item frequency increment is greater than the threshold, the data item frequency increment is sent to the root node; the root node is responsible for calculating the frequency estimate of the data item in the overall architecture, and storing the data item frequency estimate in the root In the minimum heap of the node; the entries stored in the minimum heap of the leaf node include the data item name, the data item frequency and the data item frequency increment; the entries stored in the root node's minimum heap include the data item name and the data item frequency estimates.

本发明技术方案:Technical scheme of the present invention:

步骤1)、每个叶子节点i从所收到的数据流中依次取出数据项,所述数据项包括数据项名称vt及数据项频率cv,tStep 1), each leaf node i takes out data item successively from received data flow, and described data item comprises data item name v t and data item frequency c v, t ;

步骤2)、更新所述叶子节点的数据项频率之和Ni=Ni+cv,t以及数据项频率之和的增量Δi=Δi+cv,t,其中的等号表述赋值,下同;Step 2), updating the sum of data item frequencies of the leaf nodes N i =N i +c v,t and the increment of the sum of data item frequencies Δ ii +c v,t , where the equal sign expresses assignment, the same below;

步骤3)、根据步骤1)所取出的数据项的数据项名称vt和数据项频率cv,t在所述的叶子节点的最小堆Hi中找出合适的条目,并为该条目中的数据项名称、数据项频率以及数据项频率增量赋值;该步骤包括:Step 3), according to the data item name v t and the data item frequency c v, t of the data item taken out in step 1), find out a suitable entry in the minimum heap H i of the leaf node, and for this entry The data item name, data item frequency and data item frequency increment assignment; this step includes:

步骤3-1)、判断所述的叶子节点的最小堆中是否存在数据项名称为vt的条目,若存在执行下一步,否则,执行步骤3-5);Step 3-1), judging whether there is an entry whose data item name is v t in the minimum heap of the leaf node, if there is, execute the next step, otherwise, execute step 3-5);

步骤3-2)、判断所述的叶子节点的最小堆Hi是否已满,若已满,执行下一步,否则,执行步骤3-4);Step 3-2), judging whether the minimum heap H i of the leaf node is full, if full, execute the next step, otherwise, execute step 3-4);

步骤3-3)、从所述的叶子节点的最小堆Hi中取出数据项频率最小的条目itemmin,对该条目重新赋值,然后执行步骤4);其中,对该条目赋值包括:Step 3-3), taking out the entry item min with the smallest data item frequency from the minimum heap H i of the leaf node, reassigning the entry, and then performing step 4); wherein, assigning the entry includes:

令v=vt,cv=cv+cv,t,Δv=cv,tLet v=vt, cv = cv + cv,t , Δv = cv,t ;

所述的v表示取出条目的数据项名称,所述的vt表示取出的数据项的数据项名称,所述的cv表示取出条目的数据项频率,所述的cv,t表示取出的数据项的数据项频率,所述的Δv表示取出条目的数据项频率增量;The v represents the data item name of the fetched entry, the v t represents the data item name of the fetched data item, the c v represents the data item frequency of the fetched entry, and the c v, t represents the fetched The data item frequency of the data item, the Δv represents the frequency increment of the data item of the fetched item;

步骤3-4)、创建一个新条目并为新条目赋值,将新条目插入所述的叶子节点的最小堆Hi中,然后执行步骤4);其中,对新条目赋值包括:Step 3-4), creating a new entry and assigning a value to the new entry, inserting the new entry into the minimum heap H i of the leaf node, and then performing step 4); wherein, assigning a value to the new entry includes:

令v=vt,cv=cv,t,Δv=cv,tLet v=vt, cv = cv,t , Δv = cv,t ;

步骤3-5)、从所述的叶子节点的最小堆Hi中取出已存在的条目并对该条目进行更新,然后执行步骤4);其中,对该条目更新包括:Step 3-5), take out the existing entry from the minimum heap H i of the leaf node And update the entry, and then perform step 4); wherein, updating the entry includes:

令cv=cv+cv,t,Δv=Δv+cv,tLet cv = cv + cv,t , Δv = Δv + cv,t ;

步骤4)、判断所述的数据项频率之和的增量以及所述条目的数据项频率增量是否大于阈值,如果大于阈值,向根节点传递更新;该步骤包括:Step 4), judging whether the increment of the sum of the data item frequencies and the data item frequency increment of the entry is greater than a threshold, if greater than the threshold, transfer an update to the root node; this step includes:

步骤4-1)、判断所述的数据项频率之和的增量Δi是否满足Δi>βiNi,如果满足,执行下一步,否则,执行步骤4-3);其中,Step 4-1), judging whether the increment Δ i of the sum of data item frequencies satisfies Δ i > β i N i , if so, execute the next step, otherwise, execute step 4-3); wherein,

所述的βi表示用户定义的叶子节点的更新延迟系数,所述的Ni表示所述叶子节点的数据项频率之和;The β i represents the update delay coefficient of the leaf node defined by the user, and the N i represents the sum of the data item frequencies of the leaf node;

步骤4-2)、所述的叶子节点向根节点发送0-msg更新,然后将Δi的值置为0;其中Step 4-2), the leaf node sends a 0-msg update to the root node, and then sets the value of Δi to 0; wherein

所述的0-msg更新发送的内容包括所述的数据项频率之和的增量ΔiThe content sent by the 0-msg update includes the increment Δi of the frequency sum of the data items;

步骤4-3)、所述的条目的数据项频率增量Δv是否满足Δv>βiNi,如果满足,执行下一步,否则,执行步骤5);Step 4-3), whether the data item frequency increment Δ v of the entry satisfies Δ v > β i N i , if so, execute the next step, otherwise, execute step 5);

所述的βi表示用户定义的叶子节点的更新延迟系数,所述的Ni表示所述叶子节点的数据项频率之和;The β i represents the update delay coefficient of the leaf node defined by the user, and the N i represents the sum of the data item frequencies of the leaf node;

步骤4-4)、所述的叶子节点向根节点发送数据项更新,然后将Δv的值置为0;其中Step 4-4), the leaf node sends a data item update to the root node, and then sets the value of Δv to 0; wherein

所述的数据项更新发送的内容包括所述的条目的数据项名称以及所述的条目的数据项频率增量ΔvThe content sent by the data item update includes the data item name of the entry and the data item frequency increment Δv of the entry;

步骤5)、所述的根节点从所述的叶子节点发送的更新中依次取出更新,并根据取出的更新维护相应的数据;该步骤包括:Step 5), the root node sequentially fetches updates from the updates sent by the leaf nodes, and maintains corresponding data according to the fetched updates; this step includes:

步骤5-1)、判断根节点取出的所述叶子节点发送的更新的类型,如果是0-msg更新,执行下一步,如果是数据项更新,执行步骤5-3);Step 5-1), determine the type of update sent by the leaf node taken out by the root node, if it is a 0-msg update, perform the next step, if it is a data item update, perform step 5-3);

步骤5-2)、更新所述的根节点的数据项频率之和的估计值N=N+Δi,其中等号表示赋值,然后执行步骤6);其中,Step 5-2), updating the estimated value N=N+Δ i of the sum of the data item frequencies of the root node, wherein the equal sign indicates assignment, and then perform step 6); wherein,

所述的N表示根节点的数据项频率之和的估计值,所述的Δi表示所述的叶子节点发送的0-msg更新的频率;The N represents the estimated value of the sum of the data item frequencies of the root node, and the Δi represents the frequency of the 0-msg update sent by the leaf node;

步骤5-3)、更新所述的根节点的最小堆H0;该步骤包括:Step 5-3), updating the minimum heap H 0 of the root node; this step includes:

步骤5-3-1)、取出所述的叶子节点发送的更新中的数据项名称vt以及数据项频率增量Δv,tStep 5-3-1), taking out the data item name v t and data item frequency increment Δ v, t in the update sent by the leaf node;

步骤5-3-2)、判断所述的根节点的最小堆中是否存在数据项名称为vt的条目itemv,若存在执行下一步,否则,执行步骤5-3-4);Step 5-3-2), judging whether there is an entry item v whose data item name is v t in the minimum heap of the root node, if there is, execute the next step, otherwise, execute step 5-3-4);

步骤5-3-3)、取出所述的条目itemv,对并对该条目进行更新,然后执行步骤6);其中,对该条目更新包括:Step 5-3-3), take out the entry item v , update and update the entry, and then perform step 6); wherein, updating the entry includes:

令cv=cvv,t,其中等号表示赋值;Let c v =c vv,t , where the equal sign indicates assignment;

所述的v表示取出条目的数据项名称,所述的cv表示取出条目的数据项频率,所述的Δv,t表示取出数据项更新的数据项频率增量;The v represents the data item name of the entry, the c v represents the data item frequency of the entry, and the Δv , t represents the frequency increment of the data item for the update of the data item;

步骤5-3-4)、判断所述的根节点的最小堆H0是否已满,若已满,执行下一步,否则执行5-3-6);Step 5-3-4 ), judging whether the minimum heap H0 of the root node is full, if full, execute the next step, otherwise execute 5-3-6);

步骤5-3-5)、取出所述的根节点的最小堆H0中数据项频率最小的条目itemmin,对该条目重新赋值,然后执行步骤6);其中,对该条目赋值包括:Step 5-3-5), taking out the entry item min with the smallest data item frequency in the minimum heap H0 of the root node, reassigning the entry, and then performing step 6); wherein, assigning the entry includes:

令v=vt,cv=cvv,tLet v=vt, cv = cv +Δv ,t ;

所述的vt表示取出的数据项更新的数据项名称;The v t represents the data item name of the updated data item taken out;

步骤5-3-6)、创建一个新条目并为新条目赋值,将新条目插入所述的根节点维护的最小堆中,然后执行步骤6);其中,对新条目赋值包括:Step 5-3-6), creating a new entry and assigning a value to the new entry, inserting the new entry into the minimum heap maintained by the root node, and then performing step 6); wherein, assigning a value to the new entry includes:

令v=vt,cv=Δv,tLet v=vt, cv =Δv ,t ;

步骤6)、根据用户的请求,根节点遍历所述的最小堆H0,返回所有数据项频率的条目为所要挖掘的频繁项。Step 6), according to the user's request, the root node traverses the minimum heap H 0 , and returns the frequency of all data items The entries of are the frequent items to be mined.

本发明在所述的步骤4)和步骤5)之间还包括对叶子节点的最小堆Hi进行按照数据项的频率进行排序的操作步骤。The present invention also includes an operation step of sorting the minimum heap H i of the leaf nodes according to the frequency of data items between step 4) and step 5).

以及,在所述的步骤5)和步骤6)之间还包括对根节点的最小堆H0进行按照数据项的频率进行排序的操作步骤。And, an operation step of sorting the minimum heap H0 of the root node according to the frequency of the data items is also included between the steps 5 ) and 6).

本发明的优点和有益效果:Advantages and beneficial effects of the present invention:

本发明提供的方法输出的频繁项的数据项频率的估计值与真实值之间的误差不大于εN,单条链路上的最大通信开销不大于可以支持用户实时的频繁项查询。The error between the estimated value of the data item frequency of the frequent item output by the method provided by the present invention and the real value is not greater than εN, and the maximum communication overhead on a single link is not greater than It can support users to query frequent items in real time.

附图说明Description of drawings

图1是基于分布式架构的数据流频繁项挖掘方法的通信结构。Figure 1 is the communication structure of the method for mining frequent items in data streams based on distributed architecture.

图2是基于分布式架构的数据流频繁项挖掘方法的平均相对误差。Figure 2 is the average relative error of the data stream frequent item mining method based on the distributed architecture.

图3是基于分布式架构的数据流频繁项挖掘方法的单链路通信开销。Figure 3 shows the single-link communication overhead of the data stream frequent item mining method based on the distributed architecture.

图4是基于分布式架构的数据流频繁项挖掘方法的初始化时间。Figure 4 is the initialization time of the data stream frequent item mining method based on the distributed architecture.

具体实施方式detailed description

为了更清晰直观的表达本发明的方法思路,下面对基于分布式架构的数据流频繁项挖掘方法的细节进行详细说明:In order to express the method idea of the present invention more clearly and intuitively, the details of the method for mining frequent items of data streams based on the distributed architecture are described in detail below:

1.确定参数1. Determine the parameters

分布式频繁项挖掘方法需要确定的参数包括:The parameters that need to be determined in the distributed frequent item mining method include:

(1)数据项支持度和误差度 (1) Data item support and degree of error

(2)叶子节点数据m;(2) Leaf node data m;

(3)每个叶子节点i和根节点的最小堆Hi和H0的大小为 (3) The size of the minimum heap H i and H 0 of each leaf node i and root node is and

(4)每个叶子节点i的延迟更新系数βi(0<βi<ε)。(4) Delay update coefficient β i (0<β i <ε) of each leaf node i.

在本实施例中叶子节点数量m=8,叶子节点和根节点的最小堆的大小均为10000,即α0=αi=0.0001,βi∈[0.001,0.005],ε∈[0.0001,0.0005], In this embodiment, the number of leaf nodes is m=8, and the minimum heap sizes of leaf nodes and root nodes are both 10000, that is, α 0i =0.0001, β i ∈[0.001,0.005], ε∈[0.0001,0.0005] ],

2.初始化2. Initialization

确定每个叶子节点i的初始化时间为Bi>0为叶子节点i与根节点之间链路的带宽,单位为数据包/秒。在本发明方法运行的最初时刻,每个叶子节点i会处理收到的来自数据流Si的数据项更新,但却不会将任何消息传递给它的根节点直至初始化状态结束。Determine the initialization time of each leaf node i as B i >0 is the bandwidth of the link between the leaf node i and the root node, and the unit is data packet/second. At the initial moment of the operation of the method of the present invention, each leaf node i will process the data item update received from the data flow S i , but will not pass any message to its root node until the initialization state ends.

3.叶子节点处理数据流中的数据项3. Leaf nodes process data items in the data stream

当数据流Si中有新数据项(v,cv,t)到达叶子节点i时,首先更新数据流Si中数据项频率之和Ni=Ni+cv,t,其中等号表示赋值,下同,以及数据项频率之和的频率估计值增量Δi=Δi+cv,t。其次更新相应的数据项频率:如果v∈Hi,则增加相应数据项条目的频率估计值cv=cv+cv,t和数据项v的频率估计值增量Δv=Δv+cv,t;否则找到Hi中频率估计值最小的数据项itemmin,将itemmin的数据项名替换为v,并更新其数据项频率cv=cv+cv,t和频率估计值增量Δv=cv,t。最后检查是否满足条件向根节点传递数据项更新:如果Δi>βiNi,则发送更新(0,Δi)给根节点,并重置频率估计值增量Δi=0;如果Δv>βiNi,则发送更新(v,Δv)给根节点,并重置频率估计值增量Δv=0。When a new data item (v, c v, t ) in the data stream S i arrives at the leaf node i, first update the sum of the frequency of the data item in the data stream S i N i =N i +c v, t , where the equal sign Indicates the assignment, the same below, and the frequency estimation value increment Δ ii +c v,t of the sum of the frequencies of the data items. Then update the corresponding data item frequency: if v∈H i , then increase the frequency estimate value of the corresponding data item entry cv = cv + cv, t and the frequency estimate value increment of data item v Δv = Δv + c v, t ; otherwise, find the data item min with the smallest frequency estimate in H i , replace the data item name of item min with v, and update its data item frequency c v = c v + c v, t and frequency estimate Value increment Δ v =c v,t . Finally, check whether the condition is satisfied and transmit the data item update to the root node: if Δ i > β i N i , then send the update (0, Δ i ) to the root node, and reset the frequency estimation increment Δ i = 0; if Δ vi N i , then send an update (v, Δ v ) to the root node, and reset the increment of the frequency estimation value Δ v =0.

4.根节点处理叶子节点发送的更新4. The root node processes the updates sent by the leaf nodes

当根节点收到叶子节点传递的数据项更新(v,Δv)时,如果v∈H0,则增加相应数据项条目的频率估计值cv=cvv;否则找到H0中频率估计值最小的数据项itemmin,将itemmin的数据项名替换为v,并更新其数据项频率cv=cvv。当根节点收到叶子节点传递的0-msg更新(0,Δi)时,更新根节点对数据项频率之和的估计值N0=N0iWhen the root node receives the data item update (v, Δ v ) delivered by the leaf node, if v∈H 0 , then increase the frequency estimation value of the corresponding data item entry c v = c vv ; otherwise, find For the data item item min with the smallest estimated frequency value, replace the data item name of item min with v, and update its data item frequency c v =c vv . When the root node receives the 0-msg update (0, Δ i ) delivered by the leaf node, it updates the root node's estimated value N 0 =N 0i of the sum of data item frequencies.

5.根节点处理用户发起的频繁项查询5. The root node handles frequent item queries initiated by users

当用户向根节点提交频繁项查询时,根节点扫描最小堆H0中维护的每个数据项条目(v,cv)。如果cvωN0,则认为v是频繁项并将v输出,其中ω为输出阈值,有 为用户定义的数据项支持度,αmax=max(αi),βmax=max(βi)。When a user submits a frequent item query to the root node, the root node scans each data item entry (v, c v ) maintained in the min-heap H 0 . If cv ωN 0 , v is considered to be a frequent item and v will be output, where ω is the output threshold, there is is the user-defined data item support, α max =max(α i ), β max =max(β i ).

本发明采用现实数据与计算机模拟的方式实施。The present invention is implemented by means of real data and computer simulation.

本发明选择3组真实的网络环境下采集的网络流量数据集作为实施例中的数据源。这3组数据集分别是:CERNET数据集,是在CERNET(China Education and Research Network)的OC-48链路上采集的TCP双向数据集;CAIDA48数据集,是在OC-48west coast peering link上采集的匿名数据集;CAIDA192数据集,是在OC-192链路上采集的单项匿名数据集。本发明将网络流量数据集中IP数据包的五元组(源IP地址,目的IP地址,源端口,目的端口,传输层协议)定义为数据项名,将数据包负载的长度定义为数据项频率。The present invention selects 3 groups of network traffic data sets collected in a real network environment as data sources in the embodiment. These three sets of data sets are: CERNET data set, which is a TCP bidirectional data set collected on the OC-48 link of CERNET (China Education and Research Network); CAIDA48 data set, which is collected on the OC-48west coast peering link The anonymous data set of CAIDA192 data set is a single anonymous data set collected on the OC-192 link. The present invention defines the quintuple (source IP address, destination IP address, source port, destination port, transport layer protocol) of the IP data packet in the network flow data set as the data item name, and defines the length of the data packet load as the data item frequency .

定义更新延迟系数β的相对值为图2显示了本发明方法处理3组不同的网络数据集的平均相对误差。可以观测到,在ε∈[0.0001,0.0005]时,方法的平均相对误差均小于当前ε的值与N乘积。Define the relative value of the update delay coefficient β as Fig. 2 shows the average relative error of the method of the present invention for processing 3 different network datasets. It can be observed that when ε∈[0.0001, 0.0005], the average relative error of the method is smaller than the product of the current ε value and N.

图3显示了本发明方法处理3组不同的网络数据集的单链路开销。对于图3的每幅子图,分别有4条曲线,由上至下每条曲线分别表示处理当前网络数据集的单链路通信开销的理论最大值单链路通信开销的实际最大值、实际平均值以及实际最小值。可以观测到,单条链路上的实际最大通信开销不大于 Fig. 3 shows the single-link overhead of the method of the present invention for processing three different sets of network data sets. For each sub-graph in Figure 3, there are 4 curves, and each curve from top to bottom represents the theoretical maximum value of the single-link communication overhead for processing the current network data set Actual maximum, actual average, and actual minimum of single-link communication overhead. It can be observed that the actual maximum communication overhead on a single link is not greater than

图4显示了本发明方法的初始化时间。可以观测到,更新延迟系数β的相对值越大,本发明方法所需的初始化时间越少。Figure 4 shows the initialization time of the method of the present invention. It can be observed that the larger the relative value of the update delay coefficient β, the less the initialization time required by the method of the present invention.

Claims (3)

1. A data flow frequent item mining method based on a distributed architecture adopts a 2-layer tree-shaped communication structure, and comprises m leaf nodes and 1 root node; the leaf node is responsible for processing data items in the data stream, storing the frequency of the data items in the data stream in a minimum heap of the leaf node, and sending the frequency increment of the data items to the root node when the frequency increment of the data items is greater than a threshold value; the root node is responsible for calculating the frequency estimation value of the data item in the whole framework and storing the frequency estimation value of the data item in the minimum heap of the root node; the entries stored in the minimal heap of leaf nodes include data item names, data item frequencies, and data item frequency increments; the entries stored in the minimum heap of the root node include data item names and data item frequency estimates; the method comprises the following steps:
step 1), each leaf node i sequentially takes out data items from the received data stream, wherein the data items comprise data item names vtAnd frequency c of data itemsv,t
Step 2), updating the sum N of the data item frequencies of the leaf nodesi=Ni+cv,tAnd the increment delta of the sum of the frequencies of the data itemsi=Δi+cv,tWherein the equal sign represents the assignment, the same as follows;
step 3) of retrieving the data item name v of the data item according to step 1)tAnd data item frequency cv,tMinimum heap H at said leaf nodeiFinding out a proper entry, and assigning values to the name of the data item, the frequency of the data item and the frequency increment of the data item in the entry; the method comprises the following steps:
step 3-1), judging whether the data item name v exists in the minimum heap of the leaf nodestIf the item (3) exists, executing the next step, otherwise, executing the step 3-5);
step 3-2), judging the minimum pile H of the leaf nodesiWhether the container is full, if so, executing the next step, otherwise, executing the step 3-4);
step 3-3), from said minimum heap of leaf nodes HiItem with minimum data item frequencyminReassigns the entry and then performs step 4); wherein assigning the entry comprises:
let v equal vt,cv=cv+cv,t,Δv=cv,t
Said v representing the name of the data item from which the entry was fetched, said vtA data item name indicating a retrieved data item, said cvFrequency of data items representing fetched entries, said cv,tData item frequency, said Δ, representing a retrieved data itemvA data item frequency increment representing a fetched entry;
step 3-4), creating a new entry and assigning a value to the new entry, and inserting the new entry into the minimum heap H of the leaf nodesiThen step 4) is performed; wherein assigning the new entry comprises:
let v equal vt,cv=cv,t,Δv=cv,t
Step 3-5), from said minimum heap of leaf nodes HiFetching an existing entryUpdating the entry, and then executing the step 4); wherein updating the entry comprises:
let cv=cv+cv,t,Δv=Δv+cv,t
Step 4), judging whether the increment of the sum of the data item frequencies and the data item frequency increment of the entry are greater than a threshold value, and if so, transmitting the update to the root node; the method comprises the following steps:
step 4-1), judging increment delta of the sum of the data item frequenciesiWhether or not Δ is satisfiedi>βiNiIf yes, executing the next step, otherwise, executing the step 4-3); wherein,
β as describediAn update delay factor representing a user-defined leaf node, said NiA sum of data item frequencies representing the leaf nodes;
step 4-2), the leaf node sends 0-msg update to the root node, and then the delta is updatediIs set to 0; wherein
Said 0-msg update transmitted content comprises an increment delta of the sum of said data item frequenciesi
Step 4-3), data item frequency increment delta of said itemvWhether or not Δ is satisfiedv>βiNiIf yes, executing the next step, otherwise, executing the step 5);
β as describediAn update delay factor representing a user-defined leaf node, said NiA sum of data item frequencies representing the leaf nodes;
step 4-4), the leaf node sends data item update to the root node, and then the delta is sentvIs set to 0; wherein
The data item updating transmission content comprises the data item name of the item and the data item frequency increment delta of the itemv
Step 5), the root node sequentially takes out updates from the updates sent by the leaf nodes and maintains corresponding data according to the taken-out updates; the method comprises the following steps:
step 5-1), judging the type of the update sent by the leaf node taken out by the root node, if the type is 0-msg update, executing the next step, and if the type is data item update, executing step 5-3);
step 5-2), updating the estimated value N of the sum of the data item frequencies of the root node to be N + deltaiWherein the equal sign represents the assignment, and then step 6) is performed; wherein,
said N representing an estimate of the sum of the data item frequencies of the root node, said ΔiIndicating a frequency of 0-msg updates sent by said leaf node;
step 5-3), updating the minimum heap H of the root node0(ii) a The method comprises the following steps:
step 5-3-1), the data item name v in the update sent by the leaf node is taken outtAnd data item frequency increment deltav,t
Step 5-3-2), judging whether the data item name v exists in the minimum heap of the root nodetItem ofvIf yes, executing the next step, otherwise, executing the step 5-3-4);
step 5-3-3), the item is taken outvUpdating the entry, and then executing step 6); wherein updating the entry comprises:
let cv=cvv,tWherein the equal sign represents the assignment;
saidv denotes the data item name of the fetched entry, said cvFrequency of data items representing fetched entries, said Δv,tA data item frequency increment representing a fetch data item update;
step 5-3-4), judging the minimum heap H of the root node0Whether the container is full, if so, executing the next step, otherwise, executing 5-3-6);
step 5-3-5), taking out the minimum heap H of the root node0Item with minimum frequency of medium data itemminReassigns the entry and then performs step 6); wherein assigning the entry comprises:
let v equal vt,cv=cvv,t
V istA data item name indicating a retrieved data item update;
step 5-3-6), creating a new entry and assigning a value to the new entry, inserting the new entry into the minimum heap maintained by the root node, and then executing step 6); wherein assigning the new entry comprises:
let v equal vt,cv=Δv,t
Step 6), according to the request of the user, the root node traverses the minimum heap H0Frequency of all data items returnedThe entries of (1) are frequent items to be mined.
2. The method for mining frequent items of data flow based on distributed architecture as claimed in claim 1, further comprising a minimum heap H of leaf nodes between said steps 4) and 5)iThe operation steps of sorting by frequency of the data items are performed.
3. The method for mining frequent items of data flow based on distributed architecture as claimed in claim 1, further comprising a minimum heap of root nodes between said steps 5) and 6)H0The operation steps of sorting by frequency of the data items are performed.
CN201610254621.1A 2016-04-21 2016-04-21 Distributed architecture-based data flow frequent item mining method Pending CN105930457A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610254621.1A CN105930457A (en) 2016-04-21 2016-04-21 Distributed architecture-based data flow frequent item mining method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610254621.1A CN105930457A (en) 2016-04-21 2016-04-21 Distributed architecture-based data flow frequent item mining method

Publications (1)

Publication Number Publication Date
CN105930457A true CN105930457A (en) 2016-09-07

Family

ID=56838834

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610254621.1A Pending CN105930457A (en) 2016-04-21 2016-04-21 Distributed architecture-based data flow frequent item mining method

Country Status (1)

Country Link
CN (1) CN105930457A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115473933A (en) * 2022-10-10 2022-12-13 国网江苏省电力有限公司南通供电分公司 Network system associated service discovery method based on frequent subgraph mining
CN115982159A (en) * 2022-12-23 2023-04-18 河北工业大学 A Database Frequency Estimation Method
CN116881338A (en) * 2023-09-07 2023-10-13 北京傲星科技有限公司 Data mining method and related equipment for data stream based on large model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030028531A1 (en) * 2000-01-03 2003-02-06 Jiawei Han Methods and system for mining frequent patterns
US20090112863A1 (en) * 2007-10-26 2009-04-30 Industry-Academic Cooperation Foundation, Yonsei University Method and apparatus for finding maximal frequent itmesets over data streams
CN101650730A (en) * 2009-09-08 2010-02-17 中国科学院计算技术研究所 Method and system for discovering weighted-value frequent-item in data flow
CN103258049A (en) * 2013-05-27 2013-08-21 重庆邮电大学 Association rule mining method based on mass data
CN103731738A (en) * 2014-01-23 2014-04-16 哈尔滨理工大学 Video recommendation method and device based on user group behavioral analysis
CN104376365A (en) * 2014-11-28 2015-02-25 国家电网公司 Method for constructing information system running rule libraries on basis of association rule mining

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030028531A1 (en) * 2000-01-03 2003-02-06 Jiawei Han Methods and system for mining frequent patterns
US20090112863A1 (en) * 2007-10-26 2009-04-30 Industry-Academic Cooperation Foundation, Yonsei University Method and apparatus for finding maximal frequent itmesets over data streams
CN101650730A (en) * 2009-09-08 2010-02-17 中国科学院计算技术研究所 Method and system for discovering weighted-value frequent-item in data flow
CN103258049A (en) * 2013-05-27 2013-08-21 重庆邮电大学 Association rule mining method based on mass data
CN103731738A (en) * 2014-01-23 2014-04-16 哈尔滨理工大学 Video recommendation method and device based on user group behavioral analysis
CN104376365A (en) * 2014-11-28 2015-02-25 国家电网公司 Method for constructing information system running rule libraries on basis of association rule mining

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Y ZHANG 等: "Parallel Optimization of Frequent Algorithm on Multi-core Processors", 《 INTERNATIONAL CONFERENCE ON CONTROL ENGINEERING AND COMMUNICATION TECHNOLOGY》 *
YU ZHANG 等: "An efficient framework for parallel and continuous frequent item monitoring", 《 CONCURRENCY & COMPUTATION PRACTICE & EXPERIENCE》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115473933A (en) * 2022-10-10 2022-12-13 国网江苏省电力有限公司南通供电分公司 Network system associated service discovery method based on frequent subgraph mining
CN115982159A (en) * 2022-12-23 2023-04-18 河北工业大学 A Database Frequency Estimation Method
CN115982159B (en) * 2022-12-23 2025-07-22 河北工业大学 Database frequency estimation method
CN116881338A (en) * 2023-09-07 2023-10-13 北京傲星科技有限公司 Data mining method and related equipment for data stream based on large model
CN116881338B (en) * 2023-09-07 2024-01-26 北京傲星科技有限公司 Data mining method and related equipment for data stream based on large model

Similar Documents

Publication Publication Date Title
CN103401777B (en) The parallel search method and system of Openflow
CN110572274B (en) Named data network method for optimizing deployment and management of edge computing nodes
CN113472420B (en) A satellite network cache placement method based on regional user interest perception
CN112070240A (en) Layered federal learning framework for efficient communication and optimization method and system thereof
US9535954B2 (en) Join processing device, data management device, and string similarity join system
CN108848032B (en) An Implementation Method of Named Object Network Supporting Multi-interest Type Processing
CN102970150A (en) Extensible multicast forwarding method and device for data center (DC)
CN102970242B (en) Method for achieving load balancing
CN103107945B (en) A kind of system and method for fast finding IPV6 route
CN106326308B (en) A method and system for deduplication in network based on SDN
CN105262833B (en) A kind of the cross-layer caching method and its node of content center network
CN103457700A (en) Data packet content name coding compression method in NDN/CCN
CN102045392A (en) Interest-based adaptive topology optimization method for unstructured P2P (peer-to-peer) network
CN105930457A (en) Distributed architecture-based data flow frequent item mining method
CN110334290A (en) A Fast Retrieval Method for Spatiotemporal Data Based on MF-Octree
CN101286920A (en) A Scalable Resource Search Method for Structured Peer-to-Peer Networks
CN103729461A (en) Releasing and subscribing method based on history recorded data mining
CN108173903B (en) Application method of autonomous system cooperative caching strategy in CCN
CN101500012A (en) Packet classification method and system
CN102378407A (en) Object name resolution system and method in internet of things
CN108521373B (en) Multipath routing method in named data network
CN110046286A (en) Method and apparatus for search engine caching
CN114520838A (en) K-nearest neighbor-based network message matching method for custom protocol application layer
CN117411886A (en) A collaborative edge request scheduling method based on digital twins
CN102075563B (en) Duplicate copy method for unstructured peer-to-peer (P2P) network

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160907