[go: up one dir, main page]

WO2024255927A1 - Cuckoo filter, data insertion method, data query method and data deletion method - Google Patents

Cuckoo filter, data insertion method, data query method and data deletion method Download PDF

Info

Publication number
WO2024255927A1
WO2024255927A1 PCT/CN2024/111746 CN2024111746W WO2024255927A1 WO 2024255927 A1 WO2024255927 A1 WO 2024255927A1 CN 2024111746 W CN2024111746 W CN 2024111746W WO 2024255927 A1 WO2024255927 A1 WO 2024255927A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
fingerprint
bucket
candidate
slot
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/CN2024/111746
Other languages
French (fr)
Chinese (zh)
Inventor
赵川
魏宇楠
赵圣楠
荆山
陈贞翔
杨波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Quan Cheng Laboratory
Original Assignee
Quan Cheng Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Quan Cheng Laboratory filed Critical Quan Cheng Laboratory
Priority to JP2025501692A priority Critical patent/JP2025523303A/en
Publication of WO2024255927A1 publication Critical patent/WO2024255927A1/en
Anticipated expiration legal-status Critical
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2282Tablespace storage structures; Management thereof
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the invention belongs to the technical field of computer information representation and retrieval, and in particular relates to a cuckoo filter and a data inserting, querying and deleting method.
  • the approximate membership query data structure stores the probabilistic representation of a key set S on a data domain U in a compact format, and supports data insertion and query operations. Some AMQs support data deletion operations. For queries on existing elements in a set, it can efficiently complete set membership queries. For queries on elements outside the set, it has a controllable false alarm probability (hereinafter referred to as the false positive rate), that is, when querying an element that does not exist in the set, there is a certain probability that the element is returned to the set.
  • the biggest feature of AMQ is its high space efficiency. Under an acceptable false positive rate, AMQ can work on devices with limited memory resources such as network routers, switches or loT devices.
  • Bloom Filter is a typical example of AMQ, which supports insert and query operations on a set S of keys. If the query is for a key that exists in set S, the Bloom filter can quickly complete the query; for keys outside set S, since it is a probabilistic structure, the probability of returning "not present" is at least 1- ⁇ , where n is the number of elements that have been added, k is The number of hash functions is used, and m represents the length of the Bloom filter, which indicates that BF provides a controllable false positive rate ⁇ . It provides a trade-off mechanism between space efficiency and query accuracy, that is, the length of the Bloom filter directly affects the false positive rate. The longer the Bloom filter, the greater its false positive rate.
  • BF has been widely used in packet classification, payload inspection of deep packet inspection (DPI), reducing disk I/O, avoiding database cache penetration, and data services on mobile terminals and IoT devices - distributed connections and semi-connections, indexing, auxiliary metadata and query processing problems.
  • DPI deep packet inspection
  • the main advantage of its application is that a large number of classification rules can be stored and accessed in a very compact form in dedicated hardware such as FPGA under the premise of limited space.
  • Bloom filters have been a common solution when hardware storage space is limited or frequent access to external memory leads to high latency.
  • CBF Counteriting Bloom Filter
  • Cuckoo Filter supports dynamic deletion of data.
  • CF uses a hash function to calculate and save the fingerprint of the original set data instead of the original data, which ensures a low false positive rate and occupies less space.
  • CF uses cuckoo hash to calculate the element insertion position, but in the process of element insertion, due to the existence of hash collision, There is a certain probability that data relocation is needed; a large amount of research work has further optimized the insertion performance and query performance of the cuckoo filter, reducing the probability of relocation during element insertion and the memory usage of the filter itself; the cuckoo filter structure determines that there is a contradictory relationship between the false positive rate and space efficiency, and the two are mutually exclusive. Therefore, the optimization of the cuckoo filter needs to comprehensively consider the false positive rate of its structure and the required memory space.
  • the present invention provides a cuckoo filter and a method for inserting, querying and deleting data.
  • the constructed cuckoo filter consists of a fingerprint record table (Fingerprint Record Table, FRT) and a position flag table (Position Flag Table, PFT).
  • FCT is used to save the fingerprint of the inserted data
  • PFT is used to record the insertion position information of the fingerprint of the data in the fingerprint table, so as to solve the technical problem of false positive query in the cuckoo filter.
  • one or more embodiments of the present invention provide the following technical solutions:
  • a first aspect of the present invention provides a cuckoo filter.
  • a cuckoo filter consisting of a fingerprint record table and a position mark table
  • the fingerprint record table is composed of m buckets for storing inserted data fingerprints, wherein each bucket has b slots for storing data fingerprints;
  • the position mark table is composed of m vectors, and a vector is created for each bucket to record the insertion position information of the data fingerprint in the fingerprint record table;
  • each data fingerprint corresponds to two candidate buckets, and a slot of one bucket is selected to save the data fingerprint. If and only if it is finally saved in the second candidate bucket, the position subscript value of the slot is recorded in the vector corresponding to the second candidate bucket.
  • the number m of the buckets is an exponential power of 2.
  • ha ( ⁇ ) represents a hash function with a fixed-length output
  • the low-order part of the data summary is directly used as the data fingerprint of the data to be inserted.
  • x represents the data to be inserted
  • dig x represents the data summary
  • f x represents the data fingerprint
  • l is the fingerprint length
  • h b ( ⁇ ) represents the hash function with fixed-length output
  • m represents the number of buckets.
  • a second aspect of the present invention provides a data inserting method.
  • a data insertion method when inserting a data x, firstly calculates the data summary dig x corresponding to x, the data fingerprint f x and two candidate buckets Then insert, the insertion process is:
  • the position index value j of slot e j is stored in the bucket The corresponding vector.
  • the data fingerprint is stored in any empty slot e j of the bucket, j ⁇ [0, b), where j is the position index value of slot e j .
  • the evicted element f x' calculates the dual position of the current insertion position. If there is an empty slot in the bucket, f x' is inserted into the empty slot and the position subscript value of the corresponding vector is updated. If there is no empty slot, the eviction process is repeated until all items are stored in the filter.
  • a third aspect of the present invention provides a data search method.
  • a data search method when searching for a data y, firstly calculates the data summary dig x corresponding to y, the data fingerprint f y and two candidate buckets Then match the data fingerprint f y with all the fingerprints in the two buckets:
  • a fourth aspect of the present invention provides a data deletion method.
  • a data deletion method based on a cuckoo filter provided in the first aspect, when deleting a data z, searching for the data fingerprint of the deleted data z, and performing subsequent operations in two cases:
  • the data fingerprint is deleted from the corresponding position in the fingerprint record table, and the position mark in the corresponding vector is deleted, and a deletion success is returned;
  • deletion operation when the search is successful is specifically as follows:
  • the present invention has higher data fingerprint calculation efficiency: in the data insertion stage, the low-order part of the data summary is directly used when obtaining the data fingerprint, which reduces a hash operation and reduces the data insertion delay.
  • the present invention has a lower false positive rate of data query: by only recording the position subscript value stored in the second candidate bucket, the array subscript is cleverly integrated into the data fingerprint, and the fingerprint comparison length is significantly increased without increasing the fingerprint length, and the false positive rate of the cuckoo filter is reduced by m times (m is the number of buckets in the fingerprint record table), and when the fingerprint length is 8 bits, in actual operation, there is an average of 93.1% probability that the false positive rate is 0; when the fingerprint length is 12 bits, compared with 0.04% of CF, the filter has a 99.3% probability that the false positive rate is 0.
  • the present invention has a smaller space cost under the same false positive rate: under the premise of a false positive rate of 0, the fingerprint length required by the filter is only 8.76993 bits, while CF requires 34.3597 bits.
  • FIG. 1 is a schematic structural diagram of a cuckoo filter according to a first embodiment.
  • FIG. 2 is a schematic diagram of the structure of the fingerprint record table of the first embodiment.
  • FIG. 4 is a schematic diagram of the structure of the position mark table of the first embodiment.
  • FIG5 is a flow chart of a method according to the second embodiment.
  • FIG6 is a flow chart of a method according to the third embodiment.
  • FIG. 7 is a flow chart of a method according to a fourth embodiment.
  • FIG1 is a schematic diagram of the structure of the cuckoo filter.
  • the filter is composed of a fingerprint record table (Fingerprint Record Table, FRT) and a position flag table (Position Flag Table, PFT).
  • the fingerprint record table FCT is used to store the fingerprint of the inserted data
  • the position flag table PFT is used to record the insertion position information of the data fingerprint in the fingerprint table.
  • FIG2 is a schematic diagram of the structure of the fingerprint record table.
  • the fingerprint record table FCT is composed of m arrays (hereinafter referred to as buckets), where m must be an exponential power of 2.
  • buckets m arrays
  • m must be an exponential power of 2.
  • the benefit of this requirement is that when performing an XOR operation, it can be guaranteed that the calculated index must fall within the array; each bucket has b storage units e (hereinafter referred to as slots) that can store data fingerprints f.
  • the data summary, data fingerprint and two candidate buckets corresponding to the data must be calculated.
  • ha (x) is a hash function that outputs a binary string of a specific length.
  • h b (x) is a hash function with fixed-length output.
  • the position marking table PFT in this embodiment does not use the conventional matrix Fm ⁇ b composed of 0 and 1 to record the storage position information of the data fingerprint in the fingerprint recording table FCT, but only records the position subscript value stored in the second candidate bucket. Under the premise of not increasing the fingerprint length, the fingerprint comparison length is significantly increased, thereby reducing the false positive rate of data query.
  • 1 bit is used to indicate the position of the bucket where each data fingerprint is inserted, 0 and 1 respectively indicate the p1 position and p2 position, and the position marker table PFT is a matrix Fm ⁇ b composed of 0 and 1, each row occupies b bits, and the memory space required to store the position marker table PFT is mb bits. Due to the collision of the hash function, the load of the filter cannot reach 100%, so there are many empty slots in the fingerprint record table FRT, and F is a zero matrix when initialized, so bit 0 also indicates that the corresponding position of the fingerprint record table FRT is empty, which means that a lot of invalid position information is stored in F.
  • FIG. 4 is a schematic diagram of the structure of the position marking table.
  • the location information of the position tag table PFT With the help of the location information of the position tag table PFT, during the data query process, if a matching fingerprint is found in the fingerprint record table FRT, it is also necessary to confirm in the corresponding vector of the position tag table PFT that the two fingerprints are from the same bucket position, that is, whether the subscripts of the two fingerprints in the bucket can be queried in the corresponding vector, as shown in Figure 4.
  • the fingerprint of the data y to be queried is hashed to bucket[i+1], and bucket[i+1] is the p 2 position of y.
  • the position subscript in the bucket of x is found in vector[i+1], indicating that the current bucket is also the p 2 position of x, which means that the entire data summary of x and y is the same, so the search is returned successful.
  • a data insertion method based on a cuckoo filter is disclosed.
  • the cuckoo filter based on data fingerprint position mark provided in the first embodiment is used.
  • the data summary dig x corresponding to x is first calculated, and then the fingerprint f x of x and two candidate buckets are calculated.
  • the data insertion process is shown in Figure 5, and the insertion algorithm is shown in Table 1. The insertion process is divided into three cases:
  • the evicted element f x' calculates the dual position of the current insertion position If there is an empty slot in the bucket, f x' is inserted into the empty slot and the position index value of the corresponding vector in the PFT is updated. If there is no empty slot, the above eviction process is repeated until all items are stored in the filter. There is a maximum number of evictions MAXKICKNUM. If the eviction process exceeds MAXKICKNUM, the element insertion fails and the algorithm exits.
  • a data search method based on a cuckoo filter is disclosed.
  • the cuckoo filter based on data fingerprint position mark provided in the first embodiment is used.
  • the data search process is shown in FIG6 , and the search algorithm is shown in Table 2.
  • the data summary dig y and the data fingerprint f y of y are first calculated, and then the two insertion candidate bucket positions of y are calculated. Match the data fingerprint f y with all the fingerprints in the two buckets:
  • a data deletion method based on a cuckoo filter is disclosed.
  • the cuckoo filter based on data fingerprint position marking provided in the first embodiment is used.
  • the deletion process is shown in FIG7 , and the deletion algorithm is shown in Table 3.
  • the element z to be deleted is first searched in the filter by the search method provided in the third embodiment, and corresponding operations are performed according to the search results:
  • the data fingerprint to be deleted is in the second candidate bucket, the data fingerprint is deleted from the second candidate bucket, and the position mark in the vector corresponding to the second candidate bucket is deleted.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Collating Specific Patterns (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to the technical field of computer information representation and retrieval. Provided are a cuckoo filter, a data insertion method, a data query method and a data deletion method. The cuckoo filter is composed of a fingerprint record table and a position flag table. The fingerprint record table is composed of m buckets and is used for storing inserted data fingerprints, each bucket having b slots for storing the data fingerprints. The position flag table is composed of m vectors, with one vector being created for each bucket, and the position flag table is used for recording insertion position information of the data fingerprints in the fingerprint record table. Each data fingerprint corresponds to two candidate buckets, one slot of one bucket is selected from the two candidate buckets to store the data fingerprint, and when and only when the data fingerprint is finally stored in a second candidate bucket, a position subscript value of the slot is recorded in a vector corresponding to the second candidate bucket. The present invention has a higher data fingerprint calculation efficiency and a lower data query false positive rate, and solves the technical problem of query false positives occurring in a cuckoo filter.

Description

一种布谷鸟过滤器及数据插入、查询、删除方法A cuckoo filter and data insertion, query and deletion method

相关申请的交叉引用CROSS-REFERENCE TO RELATED APPLICATIONS

发明要求于2023年6月15日提交中国国家知识产权局、申请号为202310712462.5、发明名称为“一种布谷鸟过滤器及数据插入、查询、删除方法”的中国专利申请的优先权,其全部内容通过引用结合在本发明中并构成本发明的一部分,用于所有目的。The invention claims priority to the Chinese patent application filed with the State Intellectual Property Office of China on June 15, 2023, with application number 202310712462.5 and invention name “A Cuckoo Filter and Data Insertion, Query and Deletion Method”, the entire contents of which are incorporated by reference into the present invention and constitute a part of the present invention for all purposes.

技术领域Technical Field

本发明属于计算机信息表示与检索技术领域,尤其涉及一种布谷鸟过滤器及数据插入、查询、删除方法。The invention belongs to the technical field of computer information representation and retrieval, and in particular relates to a cuckoo filter and a data inserting, querying and deleting method.

背景技术Background Art

本部分的陈述仅仅是提供了与本发明相关的背景技术信息,不必然构成在先技术。The statements in this section merely provide background information related to the present invention and do not necessarily constitute prior art.

近似成员查询数据结构(approximate membership query data structure,AMQ)以一种紧凑(compact)格式保存了数据域U上的一个键集S的概率表示,并支持数据的插入与查询操作,部分AMQs支持数据删除操作;对于集合中已有元素的查询,它能够高效地完成集合成员查询,对于集合外元素的查询,它存在一个可控的误报概率(以下称为假阳率),即查询一个不存在于集合内的元素时存在一定概率返回该元素存在于集合中;AMQ最大的特点是其高效的空间效率,在一个可接受的假阳率下,AMQ可以在内存资源有限的设备如网络路由器、交换机或loT设备上工作。The approximate membership query data structure (AMQ) stores the probabilistic representation of a key set S on a data domain U in a compact format, and supports data insertion and query operations. Some AMQs support data deletion operations. For queries on existing elements in a set, it can efficiently complete set membership queries. For queries on elements outside the set, it has a controllable false alarm probability (hereinafter referred to as the false positive rate), that is, when querying an element that does not exist in the set, there is a certain probability that the element is returned to the set. The biggest feature of AMQ is its high space efficiency. Under an acceptable false positive rate, AMQ can work on devices with limited memory resources such as network routers, switches or loT devices.

布隆过滤器(Bloom Filter,BF)是AMQ的典型示例,它支持对一组键的集合S进行插入和查询操作,如果查询集合S中存在的键,布隆过滤器能够快速完成查询;而对于集合S以外的键,由于它是一种概率结构,查找返回“不存在”的概率至少为1-ε,其中n是已经添加元素的数量,k为 使用哈希函数的数量,m表示布隆过滤器的长度,这表明BF提供了可控的假阳率ε,它在空间效率和查询准确率之间提供了一种权衡机制,即布隆过滤器的长度会直接影响误报率,布隆过滤器越长其假阳率越大;另外,哈希函数的个数也需要权衡,个数越多布隆过滤器的效率越低,但是如果太少,假阳率就会变高。近年来,BF被广泛应用于数据包分类、深度包检测(DPI)的有效负载检查、减少磁盘I/O、避免数据库缓存穿透以及移动终端以及loT设备上的数据服务——分布式连接和半连接、索引、辅助元数据和查询处理问题等,其应用的主要优势是在空间有限的前提下,可以在FPGA等专用硬件中以非常紧凑的形式存储和访问大量分类规则,在过去的十年中,当硬件存储空间受限或频繁访问外存导致高延迟时,布隆过滤器是一种常见的解决方案。Bloom Filter (BF) is a typical example of AMQ, which supports insert and query operations on a set S of keys. If the query is for a key that exists in set S, the Bloom filter can quickly complete the query; for keys outside set S, since it is a probabilistic structure, the probability of returning "not present" is at least 1-ε, where n is the number of elements that have been added, k is The number of hash functions is used, and m represents the length of the Bloom filter, which indicates that BF provides a controllable false positive rate ε. It provides a trade-off mechanism between space efficiency and query accuracy, that is, the length of the Bloom filter directly affects the false positive rate. The longer the Bloom filter, the greater its false positive rate. In addition, the number of hash functions also needs to be weighed. The more the number, the lower the efficiency of the Bloom filter, but if there are too few, the false positive rate will become higher. In recent years, BF has been widely used in packet classification, payload inspection of deep packet inspection (DPI), reducing disk I/O, avoiding database cache penetration, and data services on mobile terminals and IoT devices - distributed connections and semi-connections, indexing, auxiliary metadata and query processing problems. The main advantage of its application is that a large number of classification rules can be stored and accessed in a very compact form in dedicated hardware such as FPGA under the premise of limited space. In the past decade, Bloom filters have been a common solution when hardware storage space is limited or frequent access to external memory leads to high latency.

与普通哈希表或二叉树相比,BF的主要优点是大小固定以及独立于结构内元素数量的恒定查询和插入效率;BF的主要缺点在于它不支持数据删除操作,计数布隆过滤器(Counting Bloom Filter,CBF)解决了BF不支持数据删除的问题,但是它需要比BF多三到四倍的空间去维持与BF相同的假阳率,而且一旦所需的存储空间比RAM大,过滤器性能就会显著下降,因为BF使用随机读写,不能有效地扩展到外存,例如闪存当中;其次,目前过滤器的假阳率已经被降至非常低的程度,但仍旧有一定概率发生查询误报的情况。Compared with ordinary hash tables or binary trees, the main advantages of BF are fixed size and constant query and insertion efficiency independent of the number of elements in the structure; the main disadvantage of BF is that it does not support data deletion operations. Counting Bloom Filter (CBF) solves the problem that BF does not support data deletion, but it requires three to four times more space than BF to maintain the same false positive rate as BF, and once the required storage space is larger than RAM, the filter performance will drop significantly, because BF uses random read and write and cannot be effectively expanded to external memory, such as flash memory; secondly, the false positive rate of the filter has been reduced to a very low level, but there is still a certain probability of false query alarms.

最近,有学者提出了带有假阳率消除域(False Positive Free Zone,FPFZ)的BF,他们通过使用元素到过滤器中具有特殊属性的位置的映射,实现了过滤器中插入的元素数量少于给定阈值时会完全消除给定域内的假阳率;但是在所有情况下FPFZ都很小,受支持的域和数据量也很有限,极大限制了该技术的可应用范围。Recently, some scholars proposed BF with False Positive Free Zone (FPFZ). By using the mapping of elements to positions with special properties in the filter, they achieved that when the number of elements inserted in the filter is less than a given threshold, the false positive rate in a given domain will be completely eliminated; however, in all cases, FPFZ is very small, and the supported domains and data amounts are also limited, which greatly limits the scope of application of this technology.

相较于BF,布谷鸟过滤器(Cuckoo Filter,CF)支持数据的动态删除。在空间效率方面,CF通过使用一个哈希函数计算并保存原始集合数据的指纹而非原始数据,在保证低假阳率的同时具有更小的空间占用;在时间效率方面,CF使用布谷鸟哈希计算元素插入位置,但是在元素插入过程中由于哈希碰撞的存 在有一定概率需要进行数据重定位;大量研究工作对布谷鸟过滤器的插入性能和查询性能又进行了进一步优化,降低了元素插入过程中重定位的发生概率和过滤器本身的内存占用;布谷鸟过滤器结构决定了假阳率和空间效率本身存在矛盾关系,二者此消彼长,因此对布谷鸟过滤器的优化需要综合考虑其结构的假阳率和所需的内存空间。Compared with BF, Cuckoo Filter (CF) supports dynamic deletion of data. In terms of space efficiency, CF uses a hash function to calculate and save the fingerprint of the original set data instead of the original data, which ensures a low false positive rate and occupies less space. In terms of time efficiency, CF uses cuckoo hash to calculate the element insertion position, but in the process of element insertion, due to the existence of hash collision, There is a certain probability that data relocation is needed; a large amount of research work has further optimized the insertion performance and query performance of the cuckoo filter, reducing the probability of relocation during element insertion and the memory usage of the filter itself; the cuckoo filter structure determines that there is a contradictory relationship between the false positive rate and space efficiency, and the two are mutually exclusive. Therefore, the optimization of the cuckoo filter needs to comprehensively consider the false positive rate of its structure and the required memory space.

目前,很多的专家学者针对于不同应用场景对CF进行了结构和算法上的改进,在平衡布谷鸟过滤器的存储空间和假阳率方面,据了解没有工作消除了布谷鸟过滤器的假阳率,并且多数工作提出的布谷鸟过滤器变种存在严重的效率问题,因此仍需要对CF的结构进行进一步研究和优化。At present, many experts and scholars have made structural and algorithmic improvements to CF for different application scenarios. In terms of balancing the storage space and false positive rate of the cuckoo filter, it is understood that no work has eliminated the false positive rate of the cuckoo filter, and the cuckoo filter variants proposed in most works have serious efficiency problems. Therefore, further research and optimization of the structure of CF is still needed.

发明内容Summary of the invention

为克服上述现有技术的不足,本发明提供了一种布谷鸟过滤器及数据插入、查询、删除方法,构造的布谷鸟过滤器由指纹记录表(Fingerprint Record Table,FRT)和位置标记表(Position Flag Table,PFT)构成,FCT用来保存插入数据的指纹,PFT用来记录指纹表中数据指纹的插入位置信息,解决布谷鸟过滤器存在查询假阳的技术问题。In order to overcome the deficiencies of the above-mentioned prior art, the present invention provides a cuckoo filter and a method for inserting, querying and deleting data. The constructed cuckoo filter consists of a fingerprint record table (Fingerprint Record Table, FRT) and a position flag table (Position Flag Table, PFT). FCT is used to save the fingerprint of the inserted data, and PFT is used to record the insertion position information of the fingerprint of the data in the fingerprint table, so as to solve the technical problem of false positive query in the cuckoo filter.

为实现上述目的,本发明的一个或多个实施例提供了如下技术方案:To achieve the above objectives, one or more embodiments of the present invention provide the following technical solutions:

本发明第一方面提供了一种布谷鸟过滤器。A first aspect of the present invention provides a cuckoo filter.

一种布谷鸟过滤器,由指纹记录表和位置标记表构成;A cuckoo filter, consisting of a fingerprint record table and a position mark table;

所述指纹记录表,由m个桶组成,用来保存插入的数据指纹,其中,每个桶有b个保存数据指纹的槽;The fingerprint record table is composed of m buckets for storing inserted data fingerprints, wherein each bucket has b slots for storing data fingerprints;

所述位置标记表,由m个向量组成,为每个桶创建一个向量,用来记录数据指纹在指纹记录表中的插入位置信息;The position mark table is composed of m vectors, and a vector is created for each bucket to record the insertion position information of the data fingerprint in the fingerprint record table;

其中,每个数据指纹对应两个候选桶,从中选择一个桶的一个槽来保存数据指纹,当且仅当最终保存在第二个候选桶中时,在第二个候选桶对应的向量中,记录槽的位置下标值。Among them, each data fingerprint corresponds to two candidate buckets, and a slot of one bucket is selected to save the data fingerprint. If and only if it is finally saved in the second candidate bucket, the position subscript value of the slot is recorded in the vector corresponding to the second candidate bucket.

进一步的,所述桶的个数m,是2的指数次方。 Furthermore, the number m of the buckets is an exponential power of 2.

进一步的,所述数据指纹的计算方式为:Furthermore, the data fingerprint is calculated as follows:

获取待插入数据;Get the data to be inserted;

通过哈希函数和取模操作,计算待插入数据的数据摘要:
digx=ha(x)mod2n
The data summary of the data to be inserted is calculated through the hash function and modulus operation:
dig x = ha (x) mod2n

其中,n为数据摘要长度,ha(·)表示定长输出的哈希函数;Where n is the length of the data summary, and ha (·) represents a hash function with a fixed-length output;

基于待插入数据和数据摘要,直接取用数据摘要的低位部分作为待插入数据的数据指纹。Based on the data to be inserted and the data summary, the low-order part of the data summary is directly used as the data fingerprint of the data to be inserted.

进一步的,所述两个候选桶的计算方式,具体为:

Furthermore, the two candidate buckets are calculated as follows:

其中,表示第一个候选桶,表示第二个候选桶,x表示待插入的数据,digx表示数据摘要,fx表示数据指纹,l为指纹长度,hb(·)表示定长输出的哈希函数,m表示桶的个数。in, represents the first candidate bucket, represents the second candidate bucket, x represents the data to be inserted, dig x represents the data summary, f x represents the data fingerprint, l is the fingerprint length, h b (·) represents the hash function with fixed-length output, and m represents the number of buckets.

本发明第二方面提供了一种数据插入方法。A second aspect of the present invention provides a data inserting method.

一种数据插入方法,基于第一方面提供的一种布谷鸟过滤器,当插入一个数据x时,首先计算x对应的数据摘要digx、数据指纹fx和两个候选桶然后进行插入,插入过程为:A data insertion method, based on a cuckoo filter provided in the first aspect, when inserting a data x, firstly calculates the data summary dig x corresponding to x, the data fingerprint f x and two candidate buckets Then insert, the insertion process is:

从两个候选桶中选择一个桶的一个空槽ej来保存数据指纹,如果两个候选桶都没有空槽,则选择一个槽ej,通过驱逐过程驱逐槽ej中的原有数据指纹fx',将数据指纹fx保存到空出来的槽ej中;Select an empty slot e j from the two candidate buckets to save the data fingerprint. If there are no empty slots in the two candidate buckets, select a slot e j , evict the original data fingerprint f x' in the slot e j through the eviction process, and save the data fingerprint f x to the empty slot e j ;

如果数据指纹fx最终保存在第二个候选桶中,则将槽ej的位置下标值j存入桶对应的向量中。If the data fingerprint f x is finally saved in the second candidate bucket, the position index value j of slot e j is stored in the bucket The corresponding vector.

进一步的,所述从两个候选桶中选择一个桶的一个空槽ej来保存数据指纹,具体分为两种情况:Furthermore, the process of selecting an empty slot ej of one bucket from two candidate buckets to store the data fingerprint is divided into two cases:

(1)如果两个侯选桶都有空槽,则随机选择一个桶,将数据指纹存入任意一个空槽ej,j∈[0,b),j是槽ej的位置下标值; (1) If both candidate buckets have empty slots, a bucket is randomly selected and the data fingerprint is stored in any empty slot e j , j∈[0,b), where j is the position index value of slot e j ;

(2)若两个候选桶中只有一个桶中有空槽,则将数据指纹存入该桶的任意一个空槽ej,j∈[0,b),j是槽ej的位置下标值。(2) If only one of the two candidate buckets has an empty slot, the data fingerprint is stored in any empty slot e j of the bucket, j ∈ [0, b), where j is the position index value of slot e j .

进一步的,所述两个候选桶都已经没有空槽,则驱逐过程具体为:Furthermore, if there are no empty slots in the two candidate buckets, the eviction process is as follows:

从两个候选桶中随机选择一个桶,随机驱逐桶中任意一个槽ej,j∈[0,b)中的数据指纹fx',并将fx存入ej中,更新桶对应向量的位置下标值;Randomly select a bucket from the two candidate buckets, randomly evict the data fingerprint f x' in any slot e j , j∈[0,b) in the bucket, and store f x in e j , and update the bucket Corresponding vector The position subscript value of ;

被驱逐元素fx'计算当前插入位置的对偶位置,如果该桶中存在空槽,则将fx'插入该空槽中并更新对应向量的位置下标值,如果没有空槽,则重复驱逐过程,直到所有项都被存入过滤器中。The evicted element f x' calculates the dual position of the current insertion position. If there is an empty slot in the bucket, f x' is inserted into the empty slot and the position subscript value of the corresponding vector is updated. If there is no empty slot, the eviction process is repeated until all items are stored in the filter.

本发明第三方面提供了一种数据查找方法。A third aspect of the present invention provides a data search method.

一种数据查找方法,基于第一方面提供的一种布谷鸟过滤器,当查找一个数据y时,首先计算y对应的数据摘要digx、数据指纹fy和两个候选桶然后将数据指纹fy和这两个桶内的所有指纹进行匹配:A data search method, based on a cuckoo filter provided in the first aspect, when searching for a data y, firstly calculates the data summary dig x corresponding to y, the data fingerprint f y and two candidate buckets Then match the data fingerprint f y with all the fingerprints in the two buckets:

如果第一个候选桶中槽ej,j∈[0,b)保存的指纹与fy匹配,则在中查找j,如果不存在,则返回查找成功,否则返回查找失败;If the first candidate bucket If the fingerprint stored in slot e j ,j∈[0,b) matches f y , then Search for j in the search, if it does not exist, return a successful search, otherwise return a failed search;

如果第二个候选桶中槽ek,k∈[0,b)保存的指纹与fy匹配,则在中查找k,如果存在,则返回查找成功,否则返回查找失败。If the second candidate bucket If the fingerprint stored in slot e k ,k∈[0,b) matches f y , then Search for k in , if it exists, return a success, otherwise return a failure.

本发明第四方面提供了一种数据删除方法。A fourth aspect of the present invention provides a data deletion method.

一种数据删除方法,基于第一方面提供的一种布谷鸟过滤器,当删除一个数据z时,查找被删除数据z的数据指纹,分为两种情况进行后续操作:A data deletion method, based on a cuckoo filter provided in the first aspect, when deleting a data z, searching for the data fingerprint of the deleted data z, and performing subsequent operations in two cases:

如果查找成功,则将数据指纹从指纹记录表中的对应位置删除,并删除对应向量中的位置标记,返回删除成功;If the search is successful, the data fingerprint is deleted from the corresponding position in the fingerprint record table, and the position mark in the corresponding vector is deleted, and a deletion success is returned;

如果查找失败,则说明该元素不存在于过滤器中,返回删除失败。If the search fails, it means that the element does not exist in the filter and the deletion fails.

进一步的,所述查找成功时的删除操作,具体为:Furthermore, the deletion operation when the search is successful is specifically as follows:

如果待删除的数据指纹在第一个候选桶中,则将数据指纹从第一个候选桶中删除;If the data fingerprint to be deleted is in the first candidate bucket, delete the data fingerprint from the first candidate bucket;

如果待删除的数据指纹在第二个候选桶中,则将数据指纹从第二个候选桶 中删除,并删除第二个候选桶对应向量中的位置标记。If the data fingerprint to be deleted is in the second candidate bucket, remove the data fingerprint from the second candidate bucket. , and delete the position mark in the vector corresponding to the second candidate bucket.

以上一个或多个技术方案存在以下有益效果:One or more of the above technical solutions have the following beneficial effects:

本发明具有更高的数据指纹计算效率:在数据插入阶段,在获得数据指纹时直接取用数据摘要的低位部分,减少了一个哈希操作,降低了数据插入时延。The present invention has higher data fingerprint calculation efficiency: in the data insertion stage, the low-order part of the data summary is directly used when obtaining the data fingerprint, which reduces a hash operation and reduces the data insertion delay.

本发明具有更低的数据查询假阳率:通过仅记录保存在第二个候选桶中的位置下标值的方式,将数组下标巧妙地融入了数据指纹中,在不增加指纹长度的前提下显著增加了指纹比对长度,将布谷鸟过滤器的假阳率降低了m倍(m是指纹记录表的桶数),并且在指纹长度为8bits时,在实际操作中,平均有93.1%的概率假阳率为0;在指纹长度为12bits,相较于CF的0.04%,本过滤器有99.3%的概率的假阳率为0。The present invention has a lower false positive rate of data query: by only recording the position subscript value stored in the second candidate bucket, the array subscript is cleverly integrated into the data fingerprint, and the fingerprint comparison length is significantly increased without increasing the fingerprint length, and the false positive rate of the cuckoo filter is reduced by m times (m is the number of buckets in the fingerprint record table), and when the fingerprint length is 8 bits, in actual operation, there is an average of 93.1% probability that the false positive rate is 0; when the fingerprint length is 12 bits, compared with 0.04% of CF, the filter has a 99.3% probability that the false positive rate is 0.

本发明在相同假阳率下具有更小的空间成本:在假阳率为0的前提下,本过滤器需要的指纹长度仅为8.76993bits,而CF则需要34.3597bits。The present invention has a smaller space cost under the same false positive rate: under the premise of a false positive rate of 0, the fingerprint length required by the filter is only 8.76993 bits, while CF requires 34.3597 bits.

本发明附加方面的优点将在下面的描述中部分给出,部分将从下面的描述中变得明显,或通过本发明的实践了解到。Advantages of additional aspects of the present invention will be given in part in the following description, and in part will become obvious from the following description, or will be learned through practice of the present invention.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

构成本发明的一部分的说明书附图用来提供对本发明的进一步理解,本发明的示意性实施例及其说明用于解释本发明,并不构成对本发明的不当限定。The accompanying drawings in the specification, which constitute a part of the present invention, are used to provide a further understanding of the present invention. The exemplary embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute improper limitations on the present invention.

图1为第一个实施例布谷鸟过滤器的结构示意图。FIG. 1 is a schematic structural diagram of a cuckoo filter according to a first embodiment.

图2为第一个实施例指纹记录表的结构示意图。FIG. 2 is a schematic diagram of the structure of the fingerprint record table of the first embodiment.

图3为第一个实施例在槽数b=2,4,8和桶数m=215,220,225时1在矩阵F所有元素中的占比图。3 is a diagram showing the proportion of 1 in all elements of the matrix F when the number of slots b=2, 4, 8 and the number of buckets m=2 15 , 2 20 , 2 25 in the first embodiment.

图4为第一个实施例位置标记表的结构示意图。FIG. 4 is a schematic diagram of the structure of the position mark table of the first embodiment.

图5为第二个实施例的方法流程图。FIG5 is a flow chart of a method according to the second embodiment.

图6为第三个实施例的方法流程图。FIG6 is a flow chart of a method according to the third embodiment.

图7为第四个实施例的方法流程图。 FIG. 7 is a flow chart of a method according to a fourth embodiment.

具体实施方式DETAILED DESCRIPTION

应该指出,以下详细说明都是例示性的,旨在对本申请提供进一步的说明。除非另有指明,本发明使用的所有技术和科学术语具有与本申请所属技术领域的普通技术人员通常理解的相同含义。It should be noted that the following detailed descriptions are illustrative and are intended to provide further explanation of the present application. Unless otherwise specified, all technical and scientific terms used in the present invention have the same meanings as those commonly understood by those skilled in the art to which the present application belongs.

需要注意的是,这里所使用的术语仅是为了描述具体实施方式,而非意图限制根据本申请的示例性实施方式。如在这里所使用的,除非上下文另外明确指出,否则单数形式也意图包括复数形式,此外,还应当理解的是,当在本说明书中使用术语“包含”和/或“包括”时,其指明存在特征、步骤、操作、器件、组件和/或它们的组合。It should be noted that the terms used herein are only for describing specific embodiments and are not intended to limit the exemplary embodiments according to the present application. As used herein, unless the context clearly indicates otherwise, the singular form is also intended to include the plural form. In addition, it should be understood that when the terms "comprise" and/or "include" are used in this specification, it indicates the presence of features, steps, operations, devices, components and/or combinations thereof.

实施例一Embodiment 1

在一个或多个实施方式中,公开了一种基于数据指纹位置标记的布谷鸟过滤器,图1是布谷鸟过滤器的结构示意图,如图1所示,由指纹记录表(Fingerprint Record Table,FRT)和位置标记表(Position Flag Table,PFT)构成,指纹记录表FCT用来保存插入数据的指纹,位置标记表PFT用来记录指纹表中数据指纹的插入位置信息。In one or more embodiments, a cuckoo filter based on data fingerprint position marking is disclosed. FIG1 is a schematic diagram of the structure of the cuckoo filter. As shown in FIG1 , the filter is composed of a fingerprint record table (Fingerprint Record Table, FRT) and a position flag table (Position Flag Table, PFT). The fingerprint record table FCT is used to store the fingerprint of the inserted data, and the position flag table PFT is used to record the insertion position information of the data fingerprint in the fingerprint table.

(1)指纹记录表(1) Fingerprint Record Form

图2是指纹记录表的结构示意图,如图2所示,指纹记录表FCT由m个数组(以下称为桶bucket)组成,其中,要求m必须是2的指数次方,这个要求带来的好处是进行异或运算时,可以保证计算出来的下标一定是落在数组中的;每个桶有b个可以保存数据指纹f的存储单元e(以下称为槽)。FIG2 is a schematic diagram of the structure of the fingerprint record table. As shown in FIG2 , the fingerprint record table FCT is composed of m arrays (hereinafter referred to as buckets), where m must be an exponential power of 2. The benefit of this requirement is that when performing an XOR operation, it can be guaranteed that the calculated index must fall within the array; each bucket has b storage units e (hereinafter referred to as slots) that can store data fingerprints f.

在布谷鸟过滤器中插入或查找或删除数据前,都要计算数据对应的数据摘要、数据指纹和两个候选桶,以待插入数据x为例,首先,计算待插入数据x的数据摘要:
digx=ha(x)mod2n       (1)
Before inserting, searching or deleting data in the cuckoo filter, the data summary, data fingerprint and two candidate buckets corresponding to the data must be calculated. Taking the data x to be inserted as an example, first, the data summary of the data x to be inserted is calculated:
dig x = ha (x) mod 2 n (1)

其中,n为数据摘要长度,ha(x)为输出特定长度二进制串的哈希函数。Where n is the length of the data summary, and ha (x) is a hash function that outputs a binary string of a specific length.

然后,基于待插入数据和数据摘要,直接取用数据摘要的低位部分作为待插 入数据的数据指纹,公式表示为:
fx=digxmod2l          (2)
Then, based on the data to be inserted and the data digest, the low-order part of the data digest is directly used as the The data fingerprint of the input data is expressed as follows:
f x = dig x mod 2 l (2)

其中,l为指纹长度;直接取用数据摘要的低位部分作为待插入数据的数据指纹,减少了一个哈希操作,降低了数据插入时延,具有更高的数据指纹计算效率。Where l is the fingerprint length; directly taking the low-order part of the data summary as the data fingerprint of the data to be inserted reduces one hash operation, reduces the data insertion delay, and has higher data fingerprint calculation efficiency.

最后,确定待插入数据x的两个不同候选桶位置:

Finally, determine the two different candidate bucket locations for inserting data x:

其中,hb(x)为定长输出的哈希函数。Wherein, h b (x) is a hash function with fixed-length output.

(2)位置标记表PFT(2) Position marker table PFT

本实施例中的位置标记表PFT并没有采用常规的由0和1构成的矩阵Fm×b来记录数据指纹在指纹记录表FCT中的保存位置信息,而是通过仅记录保存在第二个候选桶中的位置下标值的方式,在不增加指纹长度的前提下显著增加了指纹比对长度,降低数据查询假阳率。The position marking table PFT in this embodiment does not use the conventional matrix Fm ×b composed of 0 and 1 to record the storage position information of the data fingerprint in the fingerprint recording table FCT, but only records the position subscript value stored in the second candidate bucket. Under the premise of not increasing the fingerprint length, the fingerprint comparison length is significantly increased, thereby reducing the false positive rate of data query.

常规思路下,使用1比特表示每个数据指纹被插入的桶的位置,0和1分别表示p1位置和p2位置,位置标记表PFT是一个由0和1构成的矩阵Fm×b,每一行占用b bits,存储位置标记表PFT所需要的内存空间为mb bits。由于哈希函数存在碰撞的情况,导致过滤器的负载不能达到100%,因此在指纹记录表FRT中存在许多空槽,并且F在初始化的时候为零矩阵,因此比特0还表示指纹记录表FRT对应位置为空,这意味着F中保存了许多无效的位置信息。In the conventional way of thinking, 1 bit is used to indicate the position of the bucket where each data fingerprint is inserted, 0 and 1 respectively indicate the p1 position and p2 position, and the position marker table PFT is a matrix Fm×b composed of 0 and 1, each row occupies b bits, and the memory space required to store the position marker table PFT is mb bits. Due to the collision of the hash function, the load of the filter cannot reach 100%, so there are many empty slots in the fingerprint record table FRT, and F is a zero matrix when initialized, so bit 0 also indicates that the corresponding position of the fingerprint record table FRT is empty, which means that a lot of invalid position information is stored in F.

通过观察并分析现有布谷鸟过滤器数据插入的特点,发现大多数数据指纹都会被插入到它们的p1位置,图3是在槽数b=2,4,8和桶数m=215,220,225时1在矩阵F所有元素中的占比图,这意味着在矩阵F中0的数量要远多于1,极端情况下矩阵F近似为稀疏矩阵。By observing and analyzing the characteristics of data insertion in the existing cuckoo filter, it is found that most data fingerprints will be inserted into their p 1 position. Figure 3 shows the proportion of 1 in all elements of the matrix F when the number of slots b = 2, 4, 8 and the number of buckets m = 2 15 , 2 20 , 2 25. This means that the number of 0s in the matrix F is much greater than 1. In extreme cases, the matrix F is approximately a sparse matrix.

基于此结论,只保存每个桶中被插入第二个候选桶p2位置的数据指纹在桶中的下标,使用m个向量(vector)组成位置标记表PFT,每个向量保存指纹记录 表FRT对应桶中插入p2位置的数据指纹的位置下标,每个下标大小为log b bits,图4是位置标记表的结构图示意图。Based on this conclusion, only the subscript of the data fingerprint inserted into the second candidate bucket p 2 in each bucket is saved, and the position marking table PFT is composed of m vectors, each vector saves the fingerprint record Table FRT corresponds to the position subscripts of the data fingerprint inserted into the p 2 position in the bucket, and the size of each subscript is log b bits. Figure 4 is a schematic diagram of the structure of the position marking table.

借助位置标记表PFT的位置信息,在数据查询过程中,如果在指纹记录表FRT中发现了匹配的指纹,还需要在位置标记表PFT的对应向量中确认两个指纹是来自同一个桶位置,即在对应向量中能否查询到这两个指纹在桶内的下标,如图4所示。待查询数据y的指纹被哈希到bucket[i+1],并且bucket[i+1]是y的p2位置,假设位置标记表FRT中x的指纹与y相匹配,此时在vector[i+1]中找到x桶内的位置下标,表明当前桶也是x的p2位置,意味着x和y整个数据摘要都相同,因此返回查找成功。With the help of the location information of the position tag table PFT, during the data query process, if a matching fingerprint is found in the fingerprint record table FRT, it is also necessary to confirm in the corresponding vector of the position tag table PFT that the two fingerprints are from the same bucket position, that is, whether the subscripts of the two fingerprints in the bucket can be queried in the corresponding vector, as shown in Figure 4. The fingerprint of the data y to be queried is hashed to bucket[i+1], and bucket[i+1] is the p 2 position of y. Assuming that the fingerprint of x in the position tag table FRT matches y, the position subscript in the bucket of x is found in vector[i+1], indicating that the current bucket is also the p 2 position of x, which means that the entire data summary of x and y is the same, so the search is returned successful.

实施例二Embodiment 2

在一个或多个实施例中,公开了一种基于布谷鸟过滤器的数据插入方法,采用实施例一所提供的基于数据指纹位置标记的布谷鸟过滤器,当插入一个数据x时,首先计算x对应的数据摘要digx,然后计算x的指纹fx和两个候选桶数据插入流程如图5所示,插入算法如表1所示,其中插入过程分为三种情况:In one or more embodiments, a data insertion method based on a cuckoo filter is disclosed. The cuckoo filter based on data fingerprint position mark provided in the first embodiment is used. When inserting a data x, the data summary dig x corresponding to x is first calculated, and then the fingerprint f x of x and two candidate buckets are calculated. The data insertion process is shown in Figure 5, and the insertion algorithm is shown in Table 1. The insertion process is divided into three cases:

(1)如果两个侯选桶都有空槽,则随机选择一个桶将指纹fx存入任意一个空槽ej,j∈[0,b),如果i=2,即数据指纹fx最终保存在第二个候选桶中,则将j存入 (1) If two candidate buckets and There are empty slots, so a bucket is randomly selected. Store the fingerprint f x in any empty slot e j , j∈[0,b). If i=2, that is, the data fingerprint f x is finally stored in the second candidate bucket, then store j in

(2)若两个候选桶中只有一个桶中剩余有空槽,则将fx存入该桶的任意一个空槽ej,j∈[0,b)。如果i=2,即数据指纹fx最终保存在第二个候选桶中,则将j存入 (2) If there is only one bucket among the two candidate buckets If there are empty slots left in the bucket, f x is stored in any empty slot e j of the bucket, j∈[0,b). If i=2, that is, the data fingerprint f x is finally stored in the second candidate bucket, then j is stored in

(3)若两个候选桶都已经没有空槽,则从中随机选择一个桶随机驱逐中任意一个槽ej,j∈[0,b)中的数据指纹fx'并将fx存入ej中,若等于并且等于则在向量中插入j;但当等于并且等于则在向量中删除j;如果i=2,即数据指纹fx最终保存在第二个候选桶中,则将j存入 (3) If both candidate buckets have no empty slots, and Randomly select a bucket Random eviction The data fingerprint f x' in any slot e j ,j∈[0,b) in the , and stores f x in e j . equal and equal Then in the vector Insert j in equal and equal Then in the vector If i = 2, that is, the data fingerprint f x is finally saved in the second candidate bucket, then j is stored in

被驱逐元素fx'计算当前插入位置的对偶位置如果该桶中存在空槽,则将fx'插入该空槽中并更新PFT中对应向量的位置下标值,如没有空槽,则重复上述驱逐过程,直到所有项都被存入过滤器中;此处存在一个最大驱逐次数MAXKICKNUM,如果驱逐过程超过MAXKICKNUM,则元素插入失败,算法退出。The evicted element f x' calculates the dual position of the current insertion position If there is an empty slot in the bucket, f x' is inserted into the empty slot and the position index value of the corresponding vector in the PFT is updated. If there is no empty slot, the above eviction process is repeated until all items are stored in the filter. There is a maximum number of evictions MAXKICKNUM. If the eviction process exceeds MAXKICKNUM, the element insertion fails and the algorithm exits.

表1插入算法
Table 1 Insertion algorithm

实施例三Embodiment 3

在一个或多个实施例中,公开了一种基于布谷鸟过滤器的数据查找方法,采用实施例一所提供的基于数据指纹位置标记的布谷鸟过滤器,数据查找流程如图6所示,查找算法如表2所示,当查找一个数据y时,首先计算y的数据摘要digy和数据指纹fy,然后计算y的两个插入候选桶位置将数据指纹fy和这两个桶内的所有指纹进行匹配:In one or more embodiments, a data search method based on a cuckoo filter is disclosed. The cuckoo filter based on data fingerprint position mark provided in the first embodiment is used. The data search process is shown in FIG6 , and the search algorithm is shown in Table 2. When searching for a data y, the data summary dig y and the data fingerprint f y of y are first calculated, and then the two insertion candidate bucket positions of y are calculated. Match the data fingerprint f y with all the fingerprints in the two buckets:

如果桶中槽ej,j∈[0,b)保存的指纹与fy匹配,则在中查找j,如果不存在,则返回查找成功,否则返回查找失败。If the bucket If the fingerprint stored in slot e j ,j∈[0,b) matches f y , then Search for j in , if it does not exist, return a success, otherwise return a failure.

如果桶中槽ek,k∈[0,b)保存的指纹与fy匹配,则在中查找k,如果存在,则返回查找成功,否则返回查找失败。If the bucket If the fingerprint stored in slot e k ,k∈[0,b) matches f y , then Search for k in , if it exists, return a success, otherwise return a failure.

表2查找算法
Table 2 Search algorithm

实施例四Embodiment 4

在一个或多个实施例中,公开了一种基于布谷鸟过滤器的数据删除方法,采用实施例一所提供的基于数据指纹位置标记的布谷鸟过滤器,删除流程如图7所示,删除算法如表3所示,当删除一个数据z时,首先通过实施例三提供的查找方法在过滤器中查找待删除元素z,依据查找结果进行相应操作:In one or more embodiments, a data deletion method based on a cuckoo filter is disclosed. The cuckoo filter based on data fingerprint position marking provided in the first embodiment is used. The deletion process is shown in FIG7 , and the deletion algorithm is shown in Table 3. When deleting a data z, the element z to be deleted is first searched in the filter by the search method provided in the third embodiment, and corresponding operations are performed according to the search results:

(1)如果返回查找成功,则将该指纹从FRT中的对应位置删除,并删除 PFT对应向量中的位置标记,返回删除成功,具体的:(1) If the search is successful, the fingerprint is deleted from the corresponding position in the FRT and the The position mark in the PFT corresponding vector returns the deletion success, specifically:

如果待删除的数据指纹在第一个候选桶中,则将数据指纹从第一个候选桶中删除;If the data fingerprint to be deleted is in the first candidate bucket, delete the data fingerprint from the first candidate bucket;

如果待删除的数据指纹在第二个候选桶中,则将数据指纹从第二个候选桶中删除,并删除第二个候选桶对应向量中的位置标记。If the data fingerprint to be deleted is in the second candidate bucket, the data fingerprint is deleted from the second candidate bucket, and the position mark in the vector corresponding to the second candidate bucket is deleted.

(2)如果查找失败,则说明该元素不存在于过滤器中,返回删除失败。(2) If the search fails, it means that the element does not exist in the filter and the deletion fails.

表3删除算法
Table 3 Deletion algorithm

以上所述仅为本发明的优选实施例而已,并不用于限制本发明,对于本领域的技术人员来说,本发明可以有各种更改和变化。凡在本发明的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本发明的保护范围之内。 The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention shall be included in the protection scope of the present invention.

Claims (9)

基于数据指纹位置标记的布谷鸟过滤器,其特征在于,由指纹记录表和位置标记表构成;The cuckoo filter based on data fingerprint position marking is characterized by being composed of a fingerprint record table and a position marking table; 所述指纹记录表,由m个桶组成,用来保存插入的数据指纹,其中,每个桶有b个保存数据指纹的槽;The fingerprint record table is composed of m buckets for storing inserted data fingerprints, wherein each bucket has b slots for storing data fingerprints; 所述位置标记表,由m个向量组成,为每个桶创建一个向量,用来记录数据指纹在指纹记录表中的插入位置信息;The position mark table is composed of m vectors, and a vector is created for each bucket to record the insertion position information of the data fingerprint in the fingerprint record table; 其中,每个数据指纹对应两个候选桶,所述两个候选桶的计算方式,具体为:

Each data fingerprint corresponds to two candidate buckets, and the calculation method of the two candidate buckets is specifically as follows:

其中,表示第一个候选桶,表示第二个候选桶,x表示待插入的数据,digx表示数据摘要,fx表示数据指纹,l为指纹长度,hb(·)表示定长输出的哈希函数,m表示桶的个数;从中选择一个桶的一个槽来保存数据指纹,当且仅当最终保存在第二个候选桶中时,在第二个候选桶对应的向量中,记录槽的位置下标值。in, represents the first candidate bucket, represents the second candidate bucket, x represents the data to be inserted, dig x represents the data summary, f x represents the data fingerprint, l is the fingerprint length, h b (·) represents the hash function with fixed-length output, and m represents the number of buckets; a slot of a bucket is selected to save the data fingerprint. If and only if it is finally saved in the second candidate bucket, the position subscript value of the slot is recorded in the vector corresponding to the second candidate bucket.
如权利要求1所述的基于数据指纹位置标记的布谷鸟过滤器,其特征在于,所述桶的个数m,是2的指数次方。The cuckoo filter based on data fingerprint position marking as described in claim 1 is characterized in that the number of buckets m is an exponential power of 2. 如权利要求1所述的基于数据指纹位置标记的布谷鸟过滤器,其特征在于,所述数据指纹的计算方式为:The cuckoo filter based on data fingerprint position marking according to claim 1, characterized in that the data fingerprint is calculated in the following manner: 获取待插入数据;Get the data to be inserted; 通过哈希函数和取模操作,计算待插入数据的数据摘要:
digx=ha(x)mod 2n
The data summary of the data to be inserted is calculated through the hash function and modulus operation:
dig x = ha (x) mod2n
其中,n为数据摘要长度,ha(·)表示定长输出的哈希函数;Where n is the length of the data summary, and ha (·) represents a hash function with a fixed-length output; 基于待插入数据和数据摘要,直接取用数据摘要的低位部分作为待插入数据的数据指纹。Based on the data to be inserted and the data summary, the low-order part of the data summary is directly used as the data fingerprint of the data to be inserted.
一种基于布谷鸟过滤器的数据插入方法,其特征在于,所述布谷鸟过滤器为如权利要求1-3任一项所述的基于数据指纹位置标记的布谷鸟过滤器,当插入一个数据x时,首先计算x对应的数据摘要digx、数据指纹fx和两个候选桶然后进行插入,插入过程为:A data insertion method based on a cuckoo filter, characterized in that the cuckoo filter is a cuckoo filter based on data fingerprint position marking according to any one of claims 1 to 3, when inserting a data x, firstly calculate the data summary dig x corresponding to x, the data fingerprint f x and two candidate buckets Then insert, the insertion process is: 从两个候选桶中选择一个桶的一个空槽ej来保存数据指纹,如果两个候选桶都没有空槽,则选择一个槽ej,通过驱逐过程驱逐槽ej中的原有数据指纹fx',将数据指纹fx保存到空出来的槽ej中;Select an empty slot e j from the two candidate buckets to save the data fingerprint. If there are no empty slots in the two candidate buckets, select a slot e j , evict the original data fingerprint f x' in the slot e j through the eviction process, and save the data fingerprint f x to the empty slot e j ; 如果数据指纹fx最终保存在第二个候选桶中,则将槽ej的位置下标值j存入桶对应的向量中。If the data fingerprint f x is finally saved in the second candidate bucket, the position index value j of slot e j is stored in the bucket The corresponding vector. 如权利要求4所述的一种基于布谷鸟过滤器的数据插入方法,其特征在于,所述从两个候选桶中选择一个桶的一个空槽ej来保存数据指纹,具体分为两种情况:The data insertion method based on the cuckoo filter according to claim 4 is characterized in that the selecting an empty slot e j of one bucket from the two candidate buckets to store the data fingerprint is specifically divided into two cases: (1)如果两个侯选桶都有空槽,则随机选择一个桶,将数据指纹存入任意一个空槽ej,j∈[0,b),j是槽ej的位置下标值;(1) If both candidate buckets have empty slots, a bucket is randomly selected and the data fingerprint is stored in any empty slot e j , j∈[0,b), where j is the position subscript value of slot e j ; (2)若两个候选桶中只有一个桶中有空槽,则将数据指纹存入该桶的任意一个空槽ej,j∈[0,b),j是槽ej的位置下标值。(2) If only one of the two candidate buckets has an empty slot, the data fingerprint is stored in any empty slot e j of the bucket, j ∈ [0, b), where j is the position index value of slot e j . 如权利要求4所述的一种基于布谷鸟过滤器的数据插入方法,其特征在于,所述两个候选桶都已经没有空槽,则驱逐过程具体为:The data insertion method based on cuckoo filter according to claim 4 is characterized in that, if both candidate buckets have no empty slots, the eviction process is specifically as follows: 从两个候选桶中随机选择一个桶,随机驱逐桶中任意一个槽ej,j∈[0,b)中的数据指纹fx',并将fx存入ej中,更新桶对应向量的位置下标值; Randomly select a bucket from the two candidate buckets, randomly evict the data fingerprint f x' in any slot e j , j∈[0,b) in the bucket, and store f x in e j , and update the bucket Corresponding vector The position subscript value of ; 被驱逐元素fx'计算当前插入位置的对偶位置,如果该桶中存在空槽,则将fx'插入该空槽中并更新对应向量的位置下标值,如果没有空槽,则重复驱逐过程,直到所有项都被存入过滤器中。The evicted element f x' calculates the dual position of the current insertion position. If there is an empty slot in the bucket, f x' is inserted into the empty slot and the position subscript value of the corresponding vector is updated. If there is no empty slot, the eviction process is repeated until all items are stored in the filter. 一种基于布谷鸟过滤器的数据查找方法,其特征在于,所述布谷鸟过滤器为如权利要求1-3任一项所述的基于数据指纹位置标记的布谷鸟过滤器,当查找一个数据y时,首先计算y对应的数据摘要digx、数据指纹fy和两个候选桶然后将数据指纹fy和这两个桶内的所有指纹进行匹配:A data search method based on a cuckoo filter, characterized in that the cuckoo filter is a cuckoo filter based on data fingerprint position marking as described in any one of claims 1 to 3, when searching for a data y, firstly calculate the data summary dig x corresponding to y, the data fingerprint f y and two candidate buckets Then match the data fingerprint f y with all the fingerprints in the two buckets: 如果第一个候选桶中槽ej,j∈[0,b)保存的指纹与fy匹配,则在中查找j,如果不存在,则返回查找成功,否则返回查找失败;If the first candidate bucket If the fingerprint stored in slot e j ,j∈[0,b) matches f y , then Search for j in the search, if it does not exist, return a successful search, otherwise return a failed search; 如果第二个候选桶中槽ek,k∈[0,b)保存的指纹与fy匹配,则在中查找k,如果存在,则返回查找成功,否则返回查找失败。If the second candidate bucket If the fingerprint stored in slot e k ,k∈[0,b) matches f y , then Search for k in , if it exists, return a success, otherwise return a failure. 一种基于布谷鸟过滤器的数据删除方法,其特征在于,所述布谷鸟过滤器为如权利要求1-3任一项所述的基于数据指纹位置标记的布谷鸟过滤器,当删除一个数据z时,查找被删除数据z的数据指纹,分为两种情况进行后续操作:A data deletion method based on a cuckoo filter, characterized in that the cuckoo filter is a cuckoo filter based on a data fingerprint position mark as claimed in any one of claims 1 to 3, when deleting a data z, searching for the data fingerprint of the deleted data z, and performing subsequent operations in two cases: 如果查找成功,则将数据指纹从指纹记录表中的对应位置删除,并删除对应向量中的位置标记,返回删除成功;If the search is successful, the data fingerprint is deleted from the corresponding position in the fingerprint record table, and the position mark in the corresponding vector is deleted, and a deletion success is returned; 如果查找失败,则说明被删除数据z不存在于过滤器中,返回删除失败。If the search fails, it means that the deleted data z does not exist in the filter, and the deletion fails. 如权利要求8所述的一种基于布谷鸟过滤器的数据删除方法,其特征在于,所述查找成功时的删除操作,具体为:The data deletion method based on the cuckoo filter according to claim 8 is characterized in that the deletion operation when the search is successful is specifically: 如果待删除的数据指纹在第一个候选桶中,则将数据指纹从第一个候选桶中删除;If the data fingerprint to be deleted is in the first candidate bucket, delete the data fingerprint from the first candidate bucket; 如果待删除的数据指纹在第二个候选桶中,则将数据指纹从第二个候选桶中删除,并删除第二个候选桶对应向量中的位置标记。 If the data fingerprint to be deleted is in the second candidate bucket, the data fingerprint is deleted from the second candidate bucket, and the position mark in the vector corresponding to the second candidate bucket is deleted.
PCT/CN2024/111746 2023-06-15 2024-08-13 Cuckoo filter, data insertion method, data query method and data deletion method Pending WO2024255927A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2025501692A JP2025523303A (en) 2023-06-15 2024-08-13 Cuckoo filters and how to insert, query, and delete data

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202310712462.5A CN116701440B (en) 2023-06-15 2023-06-15 A cuckoo filter and data insertion, query and deletion method
CN202310712462.5 2023-06-15

Publications (1)

Publication Number Publication Date
WO2024255927A1 true WO2024255927A1 (en) 2024-12-19

Family

ID=87842966

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2024/111746 Pending WO2024255927A1 (en) 2023-06-15 2024-08-13 Cuckoo filter, data insertion method, data query method and data deletion method

Country Status (3)

Country Link
JP (1) JP2025523303A (en)
CN (1) CN116701440B (en)
WO (1) WO2024255927A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116701440B (en) * 2023-06-15 2024-04-16 泉城省实验室 A cuckoo filter and data insertion, query and deletion method
CN117891858B (en) * 2024-03-14 2024-07-05 苏州大学 A time- and space-efficient parallel approximate membership query method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190266252A1 (en) * 2018-02-27 2019-08-29 Advanced Micro Devices, Inc. Cuckoo filters and cuckoo hash tables with biasing, compression, and decoupled logical sparsity
CN110222088A (en) * 2019-05-20 2019-09-10 华中科技大学 Data approximation set representation method and system based on insertion position selection
CN113535705A (en) * 2021-08-03 2021-10-22 佛山赛思禅科技有限公司 SFAD cuckoo filter and data de-duplication method based on SFAD cuckoo filter
CN114625719A (en) * 2022-03-18 2022-06-14 中国人民解放军国防科技大学 Dynamic set management method and system based on mobile filtering framework
CN116701440A (en) * 2023-06-15 2023-09-05 泉城省实验室 Cuckoo filter and data insertion, query and deletion method

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10222987B2 (en) * 2016-02-11 2019-03-05 Dell Products L.P. Data deduplication with augmented cuckoo filters
EP3418909A1 (en) * 2017-06-19 2018-12-26 Thomson Licensing A method for accessing a key in a cuckoo hash table
US12032548B2 (en) * 2018-11-28 2024-07-09 Advanced Micro Devices, Inc. System and method for self-resizing associative probabilistic hash-based data structures
CN111552693B (en) * 2020-04-30 2023-04-07 南方科技大学 Tag cuckoo filter
CN112148928B (en) * 2020-09-18 2024-02-20 鹏城实验室 Cuckoo filter based on fingerprint family
CN113535706B (en) * 2021-08-03 2023-05-23 佛山赛思禅科技有限公司 Two-stage cuckoo filter and repeated data deleting method based on two-stage cuckoo filter
CN116126928A (en) * 2021-11-11 2023-05-16 中国科学院声学研究所 An Information Finding System Based on Variable Fingerprint Cuckoo Filter
CN115827623B (en) * 2022-09-30 2025-09-16 中国人民解放军国防科技大学 Data filtering method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190266252A1 (en) * 2018-02-27 2019-08-29 Advanced Micro Devices, Inc. Cuckoo filters and cuckoo hash tables with biasing, compression, and decoupled logical sparsity
CN110222088A (en) * 2019-05-20 2019-09-10 华中科技大学 Data approximation set representation method and system based on insertion position selection
CN113535705A (en) * 2021-08-03 2021-10-22 佛山赛思禅科技有限公司 SFAD cuckoo filter and data de-duplication method based on SFAD cuckoo filter
CN114625719A (en) * 2022-03-18 2022-06-14 中国人民解放军国防科技大学 Dynamic set management method and system based on mobile filtering framework
CN116701440A (en) * 2023-06-15 2023-09-05 泉城省实验室 Cuckoo filter and data insertion, query and deletion method

Also Published As

Publication number Publication date
JP2025523303A (en) 2025-07-18
CN116701440B (en) 2024-04-16
CN116701440A (en) 2023-09-05

Similar Documents

Publication Publication Date Title
CN100498740C (en) Data cache processing method, system and data cache device
WO2024255927A1 (en) Cuckoo filter, data insertion method, data query method and data deletion method
US8635402B2 (en) Storage system and storage access method and program
US8868926B2 (en) Cryptographic hash database
CN116450656B (en) Data processing method, device, equipment and storage medium
US8086641B1 (en) Integrated search engine devices that utilize SPM-linked bit maps to reduce handle memory duplication and methods of operating same
CN103345472A (en) Redundancy removal file system based on limited binary tree bloom filter and construction method of redundancy removal file system
CN110888886A (en) An index structure and construction method, key-value storage system and request processing method
CN114398007B (en) LSM-tree-based caching optimization method for KV storage system read performance
CN112000846A (en) Method for grouping LSM tree indexes based on GPU
CN108287840A (en) A kind of data storage and query method based on matrix Hash
WO2018205151A1 (en) Data updating method and storage device
CN114647658A (en) Data retrieval method, device, equipment and machine-readable storage medium
CN113535670B (en) Virtual resource mirror image storage system and implementation method thereof
CN112527804B (en) File storage method, file reading method and data storage system
WO2013075306A1 (en) Data access method and device
US7987205B1 (en) Integrated search engine devices having pipelined node maintenance sub-engines therein that support database flush operations
CN114416741A (en) KV data writing and reading method and device based on multi-level index and storage medium
CN113867627A (en) A method and system for optimizing the performance of a storage system
Zhang et al. Succinct range filters
CN118092812B (en) Key-value storage and reading and writing method based on memory table index and iterator reduction mechanism
KR20230026946A (en) Key value storage device with hashing
CN109299143B (en) Knowledge fast indexing method of data interoperation test knowledge base based on Redis cache
US7953721B1 (en) Integrated search engine devices that support database key dumping and methods of operating same
KR20200029431A (en) Method and apparatus for providing efficient indexing and computer program included in computer readable medium therefor

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 2025501692

Country of ref document: JP

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24822872

Country of ref document: EP

Kind code of ref document: A1