
CN106649676A - Duplication eliminating method and device based on HDFS storage file - Google Patents

Duplication eliminating method and device based on HDFS storage file

Info

Publication number
CN106649676A
Authority
CN
China
Prior art keywords
file
identifier
storage
storage node
deduplicated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611159251.XA
Other languages
Chinese (zh)
Other versions
CN106649676B (en)
Inventor
张为锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ruian Technology Co Ltd
Original Assignee
Beijing Ruian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ruian Technology Co Ltd filed Critical Beijing Ruian Technology Co Ltd
Priority to CN201611159251.XA priority Critical patent/CN106649676B/en
Publication of CN106649676A publication Critical patent/CN106649676A/en
Application granted
Publication of CN106649676B publication Critical patent/CN106649676B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/174Redundancy elimination performed by the file system
    • G06F16/1748De-duplication implemented within the file system, e.g. based on file segments

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention discloses a deduplication method and device for files stored in HDFS. The method compares the file fingerprint of a file to be deduplicated with the fingerprints of stored files; if the fingerprints are the same, a link identifier is computed from the file identifier of the file to be deduplicated; the link identifier and the storage address of the identical stored file in the storage node then replace the file content of the file to be deduplicated and are stored in the storage node as the value keyed by the file identifier of the file to be deduplicated. The technical scheme effectively removes files with duplicate content, reduces the number of files, saves storage space, and improves system performance.

Description

A Deduplication Method and Device for Files Stored in HDFS

Technical Field

Embodiments of the present invention relate to unstructured data storage technology, and in particular to a deduplication method and device for files stored in HDFS.

Background

The Hadoop Distributed File System (HDFS) provides reliable storage for very large data sets. It is built around a "write once, read many" access model and provides high-bandwidth input and output data streams to user applications. HDFS is highly fault-tolerant and can run on clusters of inexpensive hardware. It uses a master/slave architecture: an HDFS cluster consists of one Namenode (management node) and multiple Datanodes (storage nodes). The management node is a central server responsible for the file system metadata and for client access to files. Because the management node holds the file metadata, its memory capacity limits the number of files. By default, HDFS splits files into blocks (for example, 64 MB per block), stores the blocks on the storage nodes as key-value pairs, and keeps the key-value mapping in memory. Each file, block, and index directory is represented in memory as an object of roughly 150 bytes. For example, 1,000,000 small files, each occupying one block, require at least 300 MB of management-node memory; storing 100 million or more files requires 20 GB or more. One workaround is to deploy a cluster-capable in-memory database, but that increases system cost. When there are too many small files, they consume excessive memory and degrade cluster performance, so small files must be merged to reduce the file count.

In real Internet applications, however, there are vast numbers of small files, especially since the rise of blogs, microblogs, Facebook, and other social networking sites changed the way content is stored on the Internet. Users have essentially become the creators of Internet content; their data is massive, diverse, and constantly changing, producing huge numbers of small files such as status files, user profiles, and avatars. By storage format, this data can be divided into structured and unstructured data. Structured data has a uniform hierarchical or grid structure and can be described with numbers or text, while some information cannot be represented with numbers or a uniform structure, for example scanned images, faxes, photographs, computer-generated reports, word-processing documents, spreadsheets, presentations, audio, and video; this is unstructured data. After structured information has been extracted from unstructured data, the original files still need to be kept for later use.

In many fields, the proportion of unstructured data is far higher than that of structured data. The volume of unstructured data is very large; storing it directly in a database not only greatly increases database capacity but also reduces the efficiency of maintenance and applications. Unstructured data obtained from the Internet is particularly repetitive: a hot event attracts the attention of large numbers of users within a short time, so a small amount of unstructured content is reused heavily in that time and occupies system storage space. In the prior art, compression is used to shrink data by a certain ratio, but unstructured data has no strict structure, is harder to standardize than structured information, and is more difficult to manage. At present, the massive small unstructured files stored in HDFS are merged into large files with the MapFile mechanism but are not compressed, so they occupy a large amount of storage space. How to remove duplicate content from massive unstructured data and save storage space is therefore an urgent problem.

Summary of the Invention

Embodiments of the present invention provide a deduplication method and device for files stored in HDFS, so that HDFS can deduplicate effectively and save storage space when handling the massive small unstructured files it stores.

In a first aspect, an embodiment of the present invention provides a deduplication method for files stored in HDFS, including:

comparing the file fingerprint of a file to be deduplicated with the file fingerprints of stored files;

if the comparison result is identical, computing a link identifier from the file identifier of the file to be deduplicated; and

replacing the file content of the file to be deduplicated with the link identifier and the storage address of the identical stored file in the storage node, and storing the result in the storage node as the value keyed by the file identifier of the file to be deduplicated.

Preferably, before the file fingerprint of the file to be deduplicated is compared with the file fingerprints of stored files, the method further includes:

storing received files in a designated area of the storage node and marking it as a not-yet-deduplicated area; and

fetching files one by one from the not-yet-deduplicated area as files to be deduplicated.

Preferably, storing received files in a designated area of the storage node includes:

generating a primary key for each received file as its file identifier; and

converting the file content into binary data and storing it, keyed by the file identifier, in the designated area of the storage node.

Preferably, storing received files in a designated area of the storage node includes:

storing received files in different designated areas of the storage node according to the date on which they were received.

Preferably, computing the link identifier from the file identifier of the file to be deduplicated includes:

computing a 32-character MD5 value of the file identifier of the file to be deduplicated as the link identifier.

Preferably, after the file content of the file to be deduplicated has been replaced with the link identifier and the storage address of the identical stored file and stored in the storage node as the value keyed by the file identifier of the file to be deduplicated, the method further includes:

rewriting the index file of the storage node according to the storage locations of the file identifiers and their corresponding values in the storage node.

Preferably, the method further includes:

obtaining the file identifier of a file to be read according to a received file read request;

computing the corresponding link identifier from the file identifier;

reading, according to the file identifier, the designated leading bytes of the corresponding value from the storage node;

if the link identifier matches the designated leading bytes, reading the storage address from the value; and

locating the corresponding file in the storage node according to the storage address, reading it, and responding to the file read request.

In a second aspect, an embodiment of the present invention further provides a deduplication device for files stored in HDFS, including:

a fingerprint comparison module, configured to compare the file fingerprint of a file to be deduplicated with the file fingerprints of stored files;

a link identifier calculation module, configured to compute, if the comparison result is identical, a link identifier from the file identifier of the file to be deduplicated; and

a content replacement module, configured to replace the file content of the file to be deduplicated with the link identifier and the storage address of the identical stored file in the storage node, and to store the result in the storage node as the value keyed by the file identifier of the file to be deduplicated.

Preferably, the device further includes:

a file storage module, configured to store received files in a designated area of the storage node and mark it as a not-yet-deduplicated area before the file fingerprint of the file to be deduplicated is compared with the file fingerprints of stored files; and

a file acquisition module, configured to fetch files one by one from the not-yet-deduplicated area as files to be deduplicated.

Preferably, the file storage module includes:

a primary key generation unit, configured to generate a primary key for each received file as its file identifier; and

a content conversion unit, configured to convert the file content into binary data and store it, keyed by the file identifier, in the designated area of the storage node.

Preferably, the file storage module is specifically configured to:

store received files in different designated areas of the storage node according to the date on which they were received.

Preferably, the link identifier calculation module is specifically configured to:

compute a 32-character MD5 value of the file identifier of the file to be deduplicated as the link identifier.

Preferably, the device further includes:

an index rewriting module, configured to rewrite the index file of the storage node according to the storage locations of the file identifiers and their corresponding values after the file content of the file to be deduplicated has been replaced with the link identifier and the storage address of the identical stored file and stored in the storage node as the value keyed by the file identifier of the file to be deduplicated.

Preferably, the device further includes:

a file identifier reading module, configured to obtain the file identifier of a file to be read according to a received file read request;

a corresponding identifier calculation module, configured to compute the corresponding link identifier from the file identifier;

a designated-bytes reading module, configured to read, according to the file identifier, the designated leading bytes of the corresponding value from the storage node;

a matching module, configured to read the storage address from the value if the link identifier matches the designated leading bytes; and

a file lookup module, configured to locate the corresponding file in the storage node according to the storage address, read it, and respond to the file read request.

For massive unstructured files with identical content in HDFS, the embodiments of the present invention keep only one copy of each distinct content: the content of a file whose fingerprint matches a stored file is deleted and replaced with a link identifier and link address. This effectively removes files with duplicate content, reduces the number of files, saves a large amount of storage space, frees memory resources, and improves system performance, while still meeting the requirements of fast storage and correct reading.

Brief Description of the Drawings

FIG. 1A is a flowchart of a deduplication method for files stored in HDFS according to Embodiment 1 of the present invention;

FIG. 1B is a schematic diagram of a deduplication method for files stored in HDFS according to Embodiment 1 of the present invention;

FIG. 2 is a flowchart of a deduplication method for files stored in HDFS according to Embodiment 2 of the present invention;

FIG. 3 is a flowchart of a deduplication method for files stored in HDFS according to Embodiment 3 of the present invention;

FIG. 4A is a schematic structural diagram of a deduplication device for files stored in HDFS according to Embodiment 4 of the present invention;

FIG. 4B is a schematic structural diagram of a deduplication device for files stored in HDFS according to Embodiment 4 of the present invention.

Detailed Description

The present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here merely explain the present invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the complete structure.

Embodiment 1

FIG. 1A is a flowchart of a deduplication method for files stored in HDFS according to Embodiment 1 of the present invention. This embodiment is applicable to a Hadoop distributed file system, which generally includes a management node and multiple storage nodes. The method may be executed by a deduplication device for HDFS-stored files; the device may be implemented in software and/or hardware and is generally integrated into the management node of the Hadoop distributed file system.

The method of Embodiment 1 of the present invention specifically includes:

S101. Compare the file fingerprint of the file to be deduplicated with the file fingerprints of stored files.

The file to be deduplicated is a received file. It may first be stored in a storage node and then deduplicated offline as described in this embodiment, or it may be deduplicated online as it is received. Because online deduplication consumes considerable resources, runs slowly, and has a long response time, offline deduplication is preferred: files that have not yet been deduplicated are fetched from the storage node as files to be deduplicated.

Specifically, the file fingerprint is computed from the content of each file; no matter how the file name changes, the fingerprint remains the same as long as the content does not change. If the file to be deduplicated has the same content as a stored file, the computed fingerprints are identical. The fingerprint may be computed with the Message-Digest Algorithm 5 (MD5), the Secure Hash Algorithm 1 (SHA-1), or a cyclic redundancy check (CRC32). The MD5 value is highly discrete: a tiny change in the original content causes a large change in the MD5 value, so it is highly reliable. In this embodiment, the first 1 KB and the last 1 KB of the file's binary data are preferably taken and their MD5 value is computed as the file fingerprint.
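
As an illustration only, the following minimal Python sketch computes a fingerprint the way this paragraph describes (MD5 over the first and last 1 KB of the file's binary data). The function name and the handling of files shorter than 2 KB are assumptions, not part of the patent.

```python
import hashlib

def file_fingerprint(path, chunk=1024):
    """Fingerprint a file by hashing its first and last `chunk` bytes with MD5."""
    with open(path, "rb") as f:
        head = f.read(chunk)              # first 1 KB of binary data
        f.seek(0, 2)                      # move to the end to learn the file size
        size = f.tell()
        f.seek(max(size - chunk, 0))      # last 1 KB (may overlap the head for tiny files)
        tail = f.read(chunk)
    return hashlib.md5(head + tail).hexdigest()
```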

In the offline state, the fingerprints of files to be deduplicated are compared with the fingerprints of stored files at regular intervals. After midnight each day, the MapReduce computing model of the Hadoop distributed file system is used to compare, offline, the fingerprints of the files to be deduplicated with those of the stored files, to filter out the files to be deduplicated whose content is identical to stored files, and to obtain the corresponding stored files and their storage addresses on the data storage nodes.
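
The patent runs this comparison as a nightly MapReduce job; purely as a sketch of the grouping logic rather than of the Hadoop job itself, fingerprints can be bucketed so that each pending file whose fingerprint already exists is reported together with the address of the kept copy. All names below are illustrative assumptions.

```python
def find_duplicates(stored, pending):
    """stored/pending: iterables of (key, fingerprint, address) tuples.
    Returns (duplicate_key, canonical_address) pairs for pending files whose
    content matches an already stored (or earlier pending) file."""
    canonical = {}                                    # fingerprint -> address of the kept copy
    for _key, fp, addr in stored:
        canonical.setdefault(fp, addr)
    duplicates = []
    for key, fp, addr in pending:
        if fp in canonical:
            duplicates.append((key, canonical[fp]))   # same content already stored
        else:
            canonical[fp] = addr                      # first copy becomes the canonical one
    return duplicates
```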

S102. If the comparison result is identical, compute a link identifier from the file identifier of the file to be deduplicated.

Specifically, when a file is written into the Hadoop distributed file system, it is stored in a MapFile as a Key-Value pair: the primary key Key is the file identifier, a string assigned at storage time that uniquely identifies the file, and the value Value is the binary data corresponding to the Key, i.e., all the binary data of the file content. If the fingerprint comparison shows a match, a link identifier is computed from the file identifier Key of the file to be deduplicated; the link identifier serves as a special marker for files that have been deduplicated. In the file read phase, if a link identifier rather than actual binary data is read from the file's value, the file has been deduplicated. If the fingerprints differ, the file's content differs from the stored files, so its content is kept and no deduplication is performed.

Preferably, step S102 includes:

computing a 32-character MD5 value of the file identifier of the file to be deduplicated as the link identifier.

In this embodiment, a 32-character MD5 value is computed from the file identifier Key of the file to be deduplicated and used as the link identifier. Similar to an encryption step, the deduplicated file is tagged with this 32-character MD5 value, which is checked again when a file read request is served. In the read phase, the link identifier can be recomputed from the file identifier, which makes it possible to tell whether the file has been deduplicated.
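
A one-line sketch of this step, assuming the file identifier (Key) is a string: the hexadecimal MD5 digest is always exactly 32 characters long, which is what the description calls the 32-character MD5 value.

```python
import hashlib

def link_identifier(file_key: str) -> str:
    """32-character hexadecimal MD5 of the file identifier, used as the link identifier."""
    return hashlib.md5(file_key.encode("utf-8")).hexdigest()
```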

S103. Replace the file content of the file to be deduplicated with the link identifier and the storage address of the identical stored file in the storage node, and store the result in the storage node as the value keyed by the file identifier of the file to be deduplicated.

In this embodiment, the value of a deduplicated file no longer stores the binary data of the file content; it is replaced with the link identifier and a storage address, and the content stored at that address is identical to the content of the deduplicated file. As shown in FIG. 1B, suppose the content of the file for Key2 is identical to a stored file: the binary data of Key2's content is read out, and the link identifier and storage address (the 32-character MD5 value and the actual address of the identical stored file) are written in its place, completing the replacement of the content of the file to be deduplicated. The contents of the files for Key1 and Key3 differ from the stored files, so the binary data for Key1 and Key3 is kept.
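
Assuming, for illustration only, that the store behaves like an in-memory map from Key to bytes and that the storage address can be represented as a string (here simply the Key of the kept copy), the replacement step might look like this; `store`, `dup_key`, and `canonical_key` are hypothetical names.

```python
import hashlib

def deduplicate_entry(store, dup_key, canonical_key):
    """Replace the duplicate's value with its link identifier plus the address of the kept copy."""
    link_id = hashlib.md5(dup_key.encode("utf-8")).hexdigest()        # 32-char MD5 of the Key
    # The 'storage address' is modeled here as the key of the canonical copy in the same map.
    store[dup_key] = link_id.encode("ascii") + canonical_key.encode("utf-8")
```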

Preferably, step S103 further includes:

rewriting the index file of the storage node according to the storage locations of the file identifiers and their corresponding values in the storage node.

Specifically, data can be located quickly through the index file. Because the value data corresponding to some file identifiers in the storage node has been replaced, the original index file no longer represents the new mapping correctly; the index file of the storage node must be rewritten according to the post-replacement storage locations of the file identifiers and their corresponding values.
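
As a toy illustration of why the index must be rewritten, suppose the MapFile is serialized as an ordered sequence of (key, value) records: once some values have been replaced by the much shorter link-identifier-plus-address form, the byte offsets shift and the key-to-offset index has to be rebuilt. The record layout assumed below (a fixed 8-byte length header per record) is an invention of this sketch, not of the patent.

```python
def rebuild_index(records):
    """records: ordered list of (key, value_bytes) pairs as laid out on disk.
    Returns a fresh {key: byte_offset} index reflecting the current layout."""
    index, offset = {}, 0
    for key, value in records:
        index[key] = offset                                  # where this record now starts
        offset += 8 + len(key.encode("utf-8")) + len(value)  # assumed 8-byte length header
    return index
```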

The deduplication method for HDFS-stored files provided by Embodiment 1 of the present invention compares file fingerprints and deduplicates data offline. The processing time can therefore be extended as needed, which increases system reliability, saves memory resources, and lowers the hardware requirements, thereby saving substantial equipment cost, while effectively removing files with duplicate content, reducing the number of files, and saving storage space.

Embodiment 2

FIG. 2 is a flowchart of a deduplication method for files stored in HDFS according to Embodiment 2 of the present invention. Embodiment 2 optimizes and refines Embodiment 1 and further explains the offline deduplication procedure. As shown in FIG. 2, Embodiment 2 specifically includes:

S201. Store received files in a designated area of the storage node and mark it as a not-yet-deduplicated area.

In this embodiment, the Hadoop distributed file system contains multiple MapFiles, which are used to archive massive small unstructured files and to record the mapping for each archived file. The system continuously receives files and buffers them; when the buffer reaches a capacity threshold or the receiving time reaches a preset limit, the system writes the unstructured files, in the order received, into the MapFiles of the storage nodes and marks them as not yet deduplicated. The capacity threshold may be set between 128 MB and 2 GB, the preset time limit between 5 and 20 minutes, and the writes may be performed concurrently by multiple threads to guarantee write speed.
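
A sketch of the buffering policy described above, flushing when either the capacity threshold or the time limit is reached; the class name, the default thresholds, and the flush callback are illustrative assumptions within the ranges the paragraph gives.

```python
import time

class WriteBuffer:
    """Buffers received files and flushes them in arrival order once a threshold is hit."""

    def __init__(self, flush_fn, capacity_bytes=128 * 1024 * 1024, max_wait_s=5 * 60):
        self.flush_fn = flush_fn            # e.g. writes the batch out, marked as not yet deduplicated
        self.capacity_bytes = capacity_bytes
        self.max_wait_s = max_wait_s
        self.buffer, self.size, self.started = [], 0, time.time()

    def add(self, key, content: bytes):
        self.buffer.append((key, content))
        self.size += len(content)
        if self.size >= self.capacity_bytes or time.time() - self.started >= self.max_wait_s:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)      # write in the order the files were received
        self.buffer, self.size, self.started = [], 0, time.time()
```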

Preferably, step S201 includes:

generating a primary key for each received file as its file identifier; and

converting the file content into binary data and storing it, keyed by the file identifier, in the designated area of the storage node.

Specifically, the system generates a primary key Key for each received file as its file identifier and indexes the file by that Key. The file content is converted into binary data and used as the value Value corresponding to the Key, and the Key and its Value are stored as a key-value pair in the MapFile of the storage node.
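
A minimal sketch of this step; using a random UUID as the primary key is an assumption for illustration only — the patent only requires a string that uniquely identifies the file.

```python
import uuid

def store_received_file(store, path):
    """Assign a unique Key to the file and store its binary content under that Key."""
    key = uuid.uuid4().hex                  # assumed primary-key scheme; any unique string works
    with open(path, "rb") as f:
        store[key] = f.read()               # value = the file content as binary data
    return key
```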

Preferably, step S201 further includes:

storing received files in different designated areas of the storage node according to the date on which they were received.

Specifically, when files are stored, they are placed into different MapFiles on the storage node according to the date of receipt. A new directory is created in the Hadoop distributed file system for each day's files, so storage is partitioned by day.

S202. Fetch files one by one from the not-yet-deduplicated area as files to be deduplicated.

S203. Compare the file fingerprint of the file to be deduplicated with the file fingerprints of stored files.

S204. If the comparison result is identical, compute a link identifier from the file identifier of the file to be deduplicated.

S205. Replace the file content of the file to be deduplicated with the link identifier and the storage address of the identical stored file in the storage node, and store the result in the storage node as the value keyed by the file identifier of the file to be deduplicated.

The deduplication method for HDFS-stored files provided by Embodiment 2 of the present invention stores files in partitions according to their date of receipt, which facilitates offline processing. Files stored on the current day are not deduplicated immediately, which guarantees storage efficiency, meets the need for fast data storage, and improves the real-time performance of data storage.

Embodiment 3

FIG. 3 is a flowchart of a deduplication method for files stored in HDFS according to Embodiment 3 of the present invention. Embodiment 3 optimizes and refines Embodiment 2 and further explains how file content is retrieved after a file has been deduplicated. As shown in FIG. 3, Embodiment 3 specifically includes:

S301. Obtain the file identifier of the file to be read according to a received file read request.

S302. Compute the corresponding link identifier from the file identifier.

S303. Read, according to the file identifier, the designated leading bytes of the corresponding value from the storage node.

S304. If the link identifier matches the designated leading bytes, read the storage address from the value.

S305. Locate the corresponding file in the storage node according to the storage address, read it, and respond to the file read request.

In this embodiment, the internal processing flow of retrieving file content is hidden from the file user. According to the received file read request, the system obtains the primary key Key of the file to be read and computes the link identifier corresponding to that Key, which can be obtained by computing its MD5 value. The first 32 characters of the corresponding value are then read from the storage node by Key and compared with the MD5 value of the link identifier. If they match, the file to be read has been deduplicated: its value stores the storage address, in the storage node, of a stored file with identical content rather than the real content of the file itself. The first 32 characters of the value are stripped, the storage address is read from the value, the corresponding file is located in the storage node by that address, and it is read to answer the file read request. If the MD5 value of the link identifier does not match the first 32 characters that were read, the file to be read has not been deduplicated, and the file content is read from the value to answer the read request.
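
Reusing the simplified key-value model from the earlier sketches (the storage address is represented as the Key of the kept copy), the read path described above can be sketched as follows; all names are illustrative.

```python
import hashlib

def read_file(store, key):
    """Return the real content for `key`, following the link if the entry was deduplicated."""
    value = store[key]
    link_id = hashlib.md5(key.encode("utf-8")).hexdigest()   # recompute the link identifier
    if value[:32] == link_id.encode("ascii"):                 # leading 32 bytes match -> deduplicated
        address = value[32:].decode("utf-8")                  # address of the identical stored file
        return store[address]                                 # address modeled as another key here
    return value                                              # otherwise the value is the content itself
```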

The deduplication method for HDFS-stored files provided by Embodiment 3 of the present invention stores only the corresponding storage address for duplicate unstructured files and hides the internal processing flow from readers, which satisfies the requirement of correct reading, saves storage space, and improves system performance.

Embodiment 4

FIG. 4A is a schematic structural diagram of a deduplication device for files stored in HDFS according to Embodiment 4 of the present invention; the device is applied to a Hadoop distributed file system. As shown in FIG. 4A, the device includes:

a fingerprint comparison module 401, configured to compare the file fingerprint of a file to be deduplicated with the file fingerprints of stored files;

a link identifier calculation module 402, configured to compute, if the comparison result is identical, a link identifier from the file identifier of the file to be deduplicated; and

a content replacement module 403, configured to replace the file content of the file to be deduplicated with the link identifier and the storage address of the identical stored file in the storage node, and to store the result in the storage node as the value keyed by the file identifier of the file to be deduplicated.

Preferably, the link identifier calculation module is specifically configured to:

compute a 32-character MD5 value of the file identifier of the file to be deduplicated as the link identifier.

Preferably, the device further includes:

an index rewriting module 404, configured to rewrite the index file of the storage node according to the storage locations of the file identifiers and their corresponding values after the file content of the file to be deduplicated has been replaced with the link identifier and the storage address of the identical stored file and stored in the storage node as the value keyed by the file identifier of the file to be deduplicated.

Specifically, in the offline state, the fingerprint comparison module compares the fingerprint of the file to be deduplicated with the fingerprints of stored files, filters out the files to be deduplicated whose content is identical to stored files, and obtains the corresponding stored files and their storage addresses on the data storage nodes. If the fingerprints match, the link identifier calculation module computes a 32-character MD5 value from the file identifier Key of the file to be deduplicated as the link identifier, which marks files that have been deduplicated. The content replacement module replaces the file content of the file to be deduplicated with the link identifier and the storage address of the identical stored file and stores the result in the storage node as the value keyed by the file identifier of the file to be deduplicated. The index rewriting module then rewrites the index file of the storage node according to the storage locations of the file identifiers and their corresponding values in the storage node.

Preferably, as shown in FIG. 4A, the device further includes:

a file storage module 405, configured to store received files in a designated area of the storage node and mark it as a not-yet-deduplicated area before the file fingerprint of the file to be deduplicated is compared with the file fingerprints of stored files; and

a file acquisition module 406, configured to fetch files one by one from the not-yet-deduplicated area as files to be deduplicated.

Preferably, the file storage module includes:

a primary key generation unit, configured to generate a primary key for each received file as its file identifier; and

a content conversion unit, configured to convert the file content into binary data and store it, keyed by the file identifier, in the designated area of the storage node.

Preferably, the file storage module is specifically configured to:

store received files in different designated areas of the storage node according to the date on which they were received.

Specifically, the file storage module continuously receives files and buffers them; when the buffer reaches the capacity threshold or the receiving time reaches the preset limit, the system writes the unstructured files, grouped by date of receipt and in the order received, into the MapFiles of the storage nodes using multiple concurrent threads. The capacity threshold may be set between 128 MB and 2 GB, and the preset time limit between 5 and 20 minutes. The primary key generation unit generates a primary key Key for each received file as its file identifier, and the content conversion unit converts the file content into binary data as the value Value corresponding to the Key; the Key and its Value are stored as a key-value pair in the MapFile of the storage node. The file acquisition module then fetches files one by one from the not-yet-deduplicated area as files to be deduplicated.

Preferably, as shown in FIG. 4B, the device further includes:

a file identifier reading module 407, configured to obtain the file identifier of the file to be read according to a received file read request;

a corresponding identifier calculation module 408, configured to compute the corresponding link identifier from the file identifier;

a designated-bytes reading module 409, configured to read, according to the file identifier, the designated leading bytes of the corresponding value from the storage node;

a matching module 410, configured to read the storage address from the value if the link identifier matches the designated leading bytes; and

a file lookup module 411, configured to locate the corresponding file in the storage node according to the storage address, read it, and respond to the file read request.

Specifically, the file identifier reading module obtains the primary key Key of the file to be read according to the received file read request, and the corresponding identifier calculation module computes the MD5 value corresponding to that Key. The designated-bytes reading module reads the first 32 characters of the corresponding value from the storage node by Key, and the matching module compares the MD5 value of the link identifier with the 32 characters that were read. If they match, the file to be read has been deduplicated: its value stores the storage address of a stored file with identical content rather than the real content of the file itself, so the first 32 characters are stripped, and the file lookup module reads the storage address from the value, locates the corresponding file in the storage node by that address, and reads it to answer the file read request. If they do not match, the file to be read has not been deduplicated, and the file content is read from the value to answer the read request.

The deduplication device for HDFS-stored files provided by Embodiment 4 of the present invention effectively removes files with duplicate content, reduces the number of files, saves a large amount of storage space, frees memory resources, and improves system performance, while meeting the requirements of fast storage and correct reading.

The device provided by the embodiments of the present invention can execute the method provided by any embodiment of the present invention and has the functional modules and beneficial effects corresponding to that method.

Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described here, and that various obvious changes, adjustments, and substitutions can be made without departing from the scope of protection of the present invention. Therefore, although the present invention has been described in considerable detail through the above embodiments, it is not limited to them; other equivalent embodiments may be included without departing from the inventive concept, and the scope of the present invention is determined by the appended claims.

Claims (14)

1. A deduplication method for files stored in HDFS, comprising: comparing the file fingerprint of a file to be deduplicated with the file fingerprints of stored files; if the fingerprints are the same, computing a link identifier from the file identifier of the file to be deduplicated; and replacing the file content of the file to be deduplicated with the link identifier and the storage address of the identical stored file in a storage node, and storing the result in the storage node as the value keyed by the file identifier of the file to be deduplicated.

2. The method according to claim 1, wherein before comparing the file fingerprint of the file to be deduplicated with the file fingerprints of stored files, the method further comprises: storing received files in a designated area of the storage node and marking it as a not-yet-deduplicated area; and fetching files one by one from the not-yet-deduplicated area as files to be deduplicated.

3. The method according to claim 2, wherein storing received files in a designated area of the storage node comprises: generating a primary key for each received file as its file identifier; and converting the file content into binary data and storing it, keyed by the file identifier, in the designated area of the storage node.

4. The method according to claim 2, wherein storing received files in a designated area of the storage node comprises: storing received files in different designated areas of the storage node according to the date on which they were received.

5. The method according to claim 1, wherein computing the link identifier from the file identifier of the file to be deduplicated comprises: computing a 32-character MD5 value of the file identifier of the file to be deduplicated as the link identifier.

6. The method according to claim 1, wherein after replacing the file content of the file to be deduplicated with the link identifier and the storage address of the identical stored file in the storage node and storing the result in the storage node as the value keyed by the file identifier of the file to be deduplicated, the method further comprises: rewriting the index file of the storage node according to the storage locations of the file identifiers and their corresponding values in the storage node.

7. The method according to any one of claims 1 to 6, further comprising: obtaining the file identifier of a file to be read according to a received file read request; computing the corresponding link identifier from the file identifier; reading, according to the file identifier, the designated leading bytes of the corresponding value from the storage node; if the link identifier matches the designated leading bytes, reading the storage address from the value; and locating the corresponding file in the storage node according to the storage address, reading it, and responding to the file read request.

8. A deduplication device for files stored in HDFS, comprising: a fingerprint comparison module, configured to compare the file fingerprint of a file to be deduplicated with the file fingerprints of stored files; a link identifier calculation module, configured to compute, if the fingerprints are the same, a link identifier from the file identifier of the file to be deduplicated; and a content replacement module, configured to replace the file content of the file to be deduplicated with the link identifier and the storage address of the identical stored file in a storage node, and to store the result in the storage node as the value keyed by the file identifier of the file to be deduplicated.

9. The device according to claim 8, further comprising: a file storage module, configured to store received files in a designated area of the storage node and mark it as a not-yet-deduplicated area before the file fingerprint of the file to be deduplicated is compared with the file fingerprints of stored files; and a file acquisition module, configured to fetch files one by one from the not-yet-deduplicated area as files to be deduplicated.

10. The device according to claim 9, wherein the file storage module comprises: a primary key generation unit, configured to generate a primary key for each received file as its file identifier; and a content conversion unit, configured to convert the file content into binary data and store it, keyed by the file identifier, in the designated area of the storage node.

11. The device according to claim 9, wherein the file storage module is specifically configured to: store received files in different designated areas of the storage node according to the date on which they were received.

12. The device according to claim 8, wherein the link identifier calculation module is specifically configured to: compute a 32-character MD5 value of the file identifier of the file to be deduplicated as the link identifier.

13. The device according to claim 8, further comprising: an index rewriting module, configured to rewrite the index file of the storage node according to the storage locations of the file identifiers and their corresponding values after the file content of the file to be deduplicated has been replaced with the link identifier and the storage address of the identical stored file and stored in the storage node as the value keyed by the file identifier of the file to be deduplicated.

14. The device according to any one of claims 8 to 13, further comprising: a file identifier reading module, configured to obtain the file identifier of a file to be read according to a received file read request; a corresponding identifier calculation module, configured to compute the corresponding link identifier from the file identifier; a designated-bytes reading module, configured to read, according to the file identifier, the designated leading bytes of the corresponding value from the storage node; a matching module, configured to read the storage address from the value if the link identifier matches the designated leading bytes; and a file lookup module, configured to locate the corresponding file in the storage node according to the storage address, read it, and respond to the file read request.
CN201611159251.XA 2016-12-15 2016-12-15 HDFS (Hadoop distributed File System) -based duplicate removal method and device for stored files Expired - Fee Related CN106649676B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611159251.XA CN106649676B (en) 2016-12-15 2016-12-15 HDFS (Hadoop distributed File System) -based duplicate removal method and device for stored files

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611159251.XA CN106649676B (en) 2016-12-15 2016-12-15 HDFS (Hadoop distributed File System) -based duplicate removal method and device for stored files

Publications (2)

Publication Number Publication Date
CN106649676A true CN106649676A (en) 2017-05-10
CN106649676B CN106649676B (en) 2020-06-19

Family

ID=58822292

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611159251.XA Expired - Fee Related CN106649676B (en) 2016-12-15 2016-12-15 HDFS (Hadoop distributed File System) -based duplicate removal method and device for stored files

Country Status (1)

Country Link
CN (1) CN106649676B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101706825A * 2009-12-10 2010-05-12 华中科技大学 Data deduplication method based on file content types
US9367397B1 (en) * 2011-12-20 2016-06-14 Emc Corporation Recovering data lost in data de-duplication system
CN104410692A (en) * 2014-11-28 2015-03-11 上海爱数软件有限公司 Method and system for uploading duplicated files

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107590109A * 2017-07-24 2018-01-16 深圳市元征科技股份有限公司 Text processing method and electronic device
CN108563649A * 2017-12-12 2018-09-21 南京富士通南大软件技术有限公司 Offline deduplication method based on the GlusterFS distributed file system
CN111522502B (en) * 2019-02-01 2022-04-29 阿里巴巴集团控股有限公司 Data deduplication method and device, electronic equipment and computer-readable storage medium
CN111522502A (en) * 2019-02-01 2020-08-11 阿里巴巴集团控股有限公司 Data deduplication method and device, electronic equipment and computer-readable storage medium
CN110413960B (en) * 2019-06-19 2023-03-28 平安银行股份有限公司 File comparison method and device, computer equipment and computer readable storage medium
CN110413960A * 2019-06-19 2019-11-05 平安银行股份有限公司 File comparison method and device, computer equipment and computer readable storage medium
CN110442845B (en) * 2019-07-08 2022-12-20 新华三信息安全技术有限公司 File repetition rate calculation method and device
CN110442845A * 2019-07-08 2019-11-12 新华三信息安全技术有限公司 File repetition rate calculation method and device
CN110535835A * 2019-08-09 2019-12-03 西藏宁算科技集团有限公司 Cloud storage method and system supporting multi-cloud sharing based on the MD5 message digest algorithm
CN111522791A * 2020-04-30 2020-08-11 电子科技大学 Distributed file data deduplication system and method
CN111522791B * 2020-04-30 2023-05-30 电子科技大学 Distributed file data deduplication system and method
CN112084179A (en) * 2020-09-02 2020-12-15 北京锐安科技有限公司 Data processing method, device, equipment and storage medium
CN112084179B (en) * 2020-09-02 2023-11-07 北京锐安科技有限公司 A data processing method, device, equipment and storage medium
CN113656363A (en) * 2021-07-16 2021-11-16 济南浪潮数据技术有限公司 A data deduplication method, system, device and storage medium based on HDFS
CN113656363B (en) * 2021-07-16 2025-01-07 济南浪潮数据技术有限公司 A data deduplication method, system, device and storage medium based on HDFS
CN113781015A (en) * 2021-09-27 2021-12-10 深圳法大大网络科技有限公司 File signing method and device, computer equipment and storage medium
WO2023070462A1 (en) * 2021-10-28 2023-05-04 华为技术有限公司 File deduplication method and apparatus, and device
CN115203159A (en) * 2022-07-25 2022-10-18 北京字跳网络技术有限公司 A data storage method, apparatus, computer equipment and storage medium
CN115203159B (en) * 2022-07-25 2024-06-04 北京字跳网络技术有限公司 Data storage method, device, computer equipment and storage medium
CN119377731A (en) * 2024-10-09 2025-01-28 南京坤金网络科技有限公司 A data management method and management system for cloud computing system
CN119377731B (en) * 2024-10-09 2025-04-08 南京坤金网络科技有限公司 Data management method and management system of cloud computing system

Also Published As

Publication number Publication date
CN106649676B (en) 2020-06-19

Similar Documents

Publication Publication Date Title
CN106649676B (en) HDFS (Hadoop distributed File System) -based duplicate removal method and device for stored files
CN103577123B (en) A kind of small documents optimization based on HDFS stores method
CN104021132B (en) Master/slave data storehouse data consistency verifies backup method and its system
US11182256B2 (en) Backup item metadata including range information
US9251160B1 (en) Data transfer between dissimilar deduplication systems
CN107436725B (en) Data writing and reading methods and devices and distributed object storage cluster
US8631052B1 (en) Efficient content meta-data collection and trace generation from deduplicated storage
US10949405B2 (en) Data deduplication device, data deduplication method, and data deduplication program
CN103955530B (en) Data reconstruction and optimization method of on-line repeating data deletion system
KR102187127B1 (en) Deduplication method using data association and system thereof
US8667032B1 (en) Efficient content meta-data collection and trace generation from deduplicated storage
CN103714123A (en) Methods for deleting duplicated data and controlling reassembly versions of cloud storage segmented objects of enterprise
CN102915278A (en) Data deduplication method
CN103561057A (en) Data storage method based on distributed hash table and erasure codes
CN106484820A (en) A kind of renaming method, access method and device
CN102033924A (en) Data storage method and system
CN112965939A (en) File merging method, device and equipment
CN111767287A (en) Data import method, device, device and computer storage medium
WO2017028690A1 (en) File processing method and system based on etl
CN101673289A (en) Method and device for constructing distributed file storage framework
CN106055678A (en) Hadoop-based panoramic big data distributed storage method
CN104965835B (en) A kind of file read/write method and device of distributed file system
CN112416879B (en) NTFS file system-based block-level data deduplication method
CN110633261A (en) Picture storage method, picture query method and device
CN106776795A (en) Method for writing data and device based on Hbase databases

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee (Granted publication date: 20200619)