CN116737684A - An acceleration method for HDFS short-circuit writing

Info

Publication number: CN116737684A
Application number: CN202310633693.7A
Authority: CN
Inventors: 卢山; 曹俊亮; 赵智峰; 王刚; 孟李鹏; 陈超群; 丁军峰; 刘伟; 高超; 程丽红
Original assignee: Xi'an Fenghuo Software Technology Co ltd
Current assignee: Xi'an Fenghuo Software Technology Co ltd
Priority date: 2023-05-31
Filing date: 2023-05-31
Publication date: 2023-09-12

Abstract

The invention relates to the technical field of big data storage and provides an acceleration method for HDFS short-circuit writing. The method involves an HDFS client and an HDFS cluster that has two types of nodes and operates in a manager-worker mode, namely one NameNode (manager) and a plurality of DataNodes (workers). The acceleration method comprises the following steps: S1: creating a file and applying for a Block; S2: creating Block information according to a Block allocation strategy; S3: creating the RBW files of the Block and Meta; S4: creating a local Java stream to write the Block and Meta files; and S5: archiving and recording the Block information. By stripping out the DataNode's file-writing logic and writing directly to disk through a local output stream, the invention shortens the logical link for writing data and simplifies the data packaging and checking flow, thereby improving data writing efficiency.

Description

An acceleration method for HDFS short-circuit writing

Technical field

The invention relates to the technical field of big data storage, and more specifically to an acceleration method for HDFS short-circuit writing.

Background

HDFS, the Hadoop Distributed File System, is an open-source distributed file system optimized for data appending and reading. It is portable, highly fault-tolerant, and horizontally scalable to large deployments. As the underlying platform for massive data, HDFS stores huge volumes of structured and unstructured data and supports rich application scenarios such as complex queries and interactive analysis. HDFS performance problems affect all big data systems and applications built on it, so optimizing HDFS storage performance is crucial.

Currently, HDFS writes data through a Socket service established by DataXceiverServer, which accepts various requests from clients. Each request type has a different operation code, and the server determines the request type from the operation code. However, this native HDFS write flow is relatively complicated, and the client and server perform a large amount of packet encapsulation, unpacking, and data checksum computation, which further degrades performance. In addition, even if the client and the DataNode are on the same physical machine, the data still has to travel through the local loopback network (lo), causing additional bandwidth overhead on the machine. The HDFS write process is therefore limited by unpacking, verification, and network pressure; it cannot fully exploit disk performance, writes are slow, and the impact on the business is significant. The present invention therefore provides an acceleration method for HDFS short-circuit writing.

Summary of the invention

To solve the above technical problems, the present invention provides an acceleration method for HDFS short-circuit writing. By moving part of the DataNode's file-writing flow to the local client, it shortens the logical link for writing data, simplifies the data packaging and verification flow, reduces network bandwidth consumption, and improves write efficiency, thereby addressing the complexity, low performance, and extra bandwidth consumption of the native HDFS write flow.

The specific technical solution of the present invention is as follows:

An acceleration method for HDFS short-circuit writing involves an HDFS client and an HDFS cluster. The HDFS cluster has two types of nodes and runs in a manager-worker mode, namely one NameNode (manager) and multiple DataNodes (workers). The steps of the acceleration method are as follows:

S1: Create a file and apply for a Block. The HDFS client initiates an RPC to the NameNode, creates a file, and applies for the first Block;

S2: Create Block information according to the Block allocation policy. The NameNode creates the Block information according to the Block allocation policy and returns it to the client;

S3: Create the RBW files of the Block and Meta. The client constructs the short-circuit write output stream ShortCircuitWriteOutputStream and sends a request to the local DataNode1 to create the RBW files of the Block and Meta. DataNode1 returns the descriptor information of the Block and Meta files to the client;

S4: Create a local Java stream to write the Block and Meta files. The client creates a local Java stream and writes directly to the Block and Meta files on the local disk. After the client finishes writing the Block and Meta, it submits the Block information to the local DataNode1;

S5: Archive and record the Block information. The client submits the Block information to the NameNode; the NameNode accepts the Block submission request and completes the archiving record of the Block information.

As a technical solution of the present invention, in step S1, the Block information includes the BlockID, the BlockLocation, and other information.

As a technical solution of the present invention, in step S3, after the client obtains the Block information from the NameNode, it determines from the Block's location information (BlockLocation) whether the locations include the local node. If so, it calls the interface in ShortCircuitWriteOutputStream to carry out the write flow; if not, it falls back to the DFSOutputStream interface and uses the native write flow.
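The locality check in step S3 can be illustrated with a minimal Java sketch. It is not the patent's implementation: the class `WritePathSelector`, the enum `WritePath`, and the representation of block locations as a list of host names are all assumptions made for illustration only.

```java
import java.util.List;

final class WritePathSelector {
    // Hypothetical labels for the two write flows described in the text.
    enum WritePath { SHORT_CIRCUIT, NATIVE }

    /**
     * Returns SHORT_CIRCUIT when any block location matches the local host,
     * otherwise NATIVE (the fallback to the ordinary pipeline write).
     */
    static WritePath choose(List<String> blockLocations, String localHost) {
        for (String host : blockLocations) {
            if (host.equalsIgnoreCase(localHost)) {
                return WritePath.SHORT_CIRCUIT;
            }
        }
        return WritePath.NATIVE;
    }

    public static void main(String[] args) {
        // Assumed host names; a real client would read them from the Block information.
        System.out.println(choose(List.of("dn1.example", "dn2.example"), "dn1.example"));
        System.out.println(choose(List.of("dn2.example", "dn3.example"), "dn1.example"));
    }
}
```

In this sketch the decision is made once per Block, which matches the text: the check runs after the Block information is returned by the NameNode and before any output stream is constructed.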

As a technical solution of the present invention, in step S4, the DataNode's file-writing logic is stripped out and a local output stream writes directly to disk, shortening the logical link for writing data.
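As a rough illustration of step S4, the following Java sketch writes data to a block file and per-chunk checksums to a meta file using only local streams. It rests on several assumptions: the class name `LocalBlockWriter` is hypothetical, the files are opened by path rather than through descriptors returned by a DataNode, the 512-byte chunk size is chosen for illustration, and the meta file here holds bare CRC32 values rather than the actual HDFS meta file format.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

final class LocalBlockWriter {
    static final int CHUNK_SIZE = 512; // assumed bytes per checksum

    /** Writes data to blockFile and one CRC32 per chunk to metaFile; returns the chunk count. */
    static int write(byte[] data, Path blockFile, Path metaFile) throws IOException {
        int chunks = 0;
        try (OutputStream block = new BufferedOutputStream(Files.newOutputStream(blockFile));
             DataOutputStream meta = new DataOutputStream(
                     new BufferedOutputStream(Files.newOutputStream(metaFile)))) {
            CRC32 crc = new CRC32();
            for (int off = 0; off < data.length; off += CHUNK_SIZE) {
                int len = Math.min(CHUNK_SIZE, data.length - off);
                block.write(data, off, len);          // data goes straight to the local file
                crc.reset();
                crc.update(data, off, len);
                meta.writeInt((int) crc.getValue());  // 4-byte checksum per chunk
                chunks++;
            }
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("rbw");
        Path block = dir.resolve("blk_1");
        Path meta = dir.resolve("blk_1.meta");
        int chunks = write(new byte[1300], block, meta);
        System.out.println(chunks + " chunks, block=" + Files.size(block)
                + " bytes, meta=" + Files.size(meta) + " bytes");
    }
}
```

The point of the sketch is that no socket, packet header, or second checksum pass is involved: the client computes each checksum once and both files are written through ordinary local output streams.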

Compared with the prior art, the present invention has the following beneficial effects:

1. The present invention strips out the DataNode's file-writing logic and instead uses a local output stream to write directly to disk. This shortens the logical link for writing data and simplifies the data packaging and verification flow, thereby improving write efficiency.

2. By stripping the data-writing logic from the DataNode, the present invention reduces the processing pressure on the DataNode. By replacing the native pipelined streaming transfer with a local stream that writes directly to disk, it reduces the bandwidth pressure on the local loopback network and improves disk I/O throughput.

Description of the drawings

Figure 1 is a flow chart of the native HDFS write;

Figure 2 is an overall architecture diagram of the short-circuit write of the present invention;

Figure 3 is a detailed flow chart of the short-circuit write of the present invention.

Detailed description

The embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples illustrate the invention but are not intended to limit its scope.

As shown in Figures 2 and 3, an acceleration method for HDFS short-circuit writing involves an HDFS client. The steps of the acceleration method are as follows:

S4: Create a local Java stream to write the Block and Meta files. The client creates a local Java stream and writes directly to the Block and Meta files on the local disk. After the client finishes writing the Block and Meta, it submits the Block information to the local DataNode1. Stripping out the DataNode's file-writing logic and writing directly to disk through a local output stream shortens the logical link for writing data and simplifies the data packaging and verification flow, thereby improving write efficiency; moving the write logic off the DataNode also reduces the processing pressure on the DataNode;

S5: Archive and record the Block information. The client submits the Block information to the NameNode; the NameNode accepts the Block submission request and completes the archiving record of the Block information. At this point, the Block is visible to external readers. Meanwhile, the NameNode detects that a Block replica is missing (the default is 2 replicas, and the short-circuit write scheme writes only 1 replica, to the local DataNode) and initiates a Block replica recovery request. DataNode1 then copies the Block to DataNode2 as the second replica, completing the two-replica write of the Block.
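The replica recovery described in step S5 can be modelled with a small Java sketch. This is a simplified illustration and not how the NameNode is implemented: the class `ReplicaRecoveryPlanner`, the method `planTargets`, and the use of plain node names are assumptions, and real placement policies also weigh racks, load, and free space.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class ReplicaRecoveryPlanner {
    /**
     * Picks DataNodes that do not yet hold the Block until the target
     * replica count is reached. Returns an empty list when nothing is missing.
     */
    static List<String> planTargets(Set<String> holders, List<String> liveNodes, int targetReplicas) {
        List<String> targets = new ArrayList<>();
        int missing = targetReplicas - holders.size();
        for (String node : liveNodes) {
            if (missing <= 0) {
                break;
            }
            if (!holders.contains(node)) {
                targets.add(node);
                missing--;
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        // After a short-circuit write only the local DataNode1 holds the Block.
        List<String> targets = planTargets(
                Set.of("DataNode1"),
                List.of("DataNode1", "DataNode2", "DataNode3"),
                2);
        System.out.println(targets); // prints [DataNode2]
    }
}
```

With a target of 2 replicas and a single holder, exactly one copy target is selected, which corresponds to DataNode1 copying the Block to DataNode2 in the text.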

The present invention moves the writing of the Block and Meta files out of the DataNode and uses a local stream to write the data to the files quickly. This removes much of the data verification performed in the native write flow, reduces the request pressure on the DataNode, and brings write performance close to that of the local file system. At the same time, the invention relies on HDFS's distributed replica detection mechanism: HDFS automatically restores the second replica of the Block, which ensures data reliability. Because the Block becomes visible externally as soon as the first replica is written, whereas in the native flow it becomes visible only after both replicas are written, file writing is further accelerated.

The table below compares the performance of the native HDFS local write with that of the modified HDFS short-circuit local write; the new short-circuit write method improves HDFS write performance. Table 1 compares the single-machine performance of the native HDFS write and the modified short-circuit write at different concurrency levels, with each concurrent writer writing 10 GB of data:

Table 1

As shown in Figure 1, the native HDFS write flow is as follows:

(1) The HDFS client initiates an RPC (remote procedure call) to the NameNode, creates a file, and applies for the first Block (BlockID, BlockLocation, and other information);

(2) The NameNode creates the Block information according to the Block allocation policy and returns it to the client;

(3) The client constructs the output stream DFSOutputStream, establishes a pipeline stream based on the Block's location information, packages the data into packet objects, and sends them through the pipeline stream to each DataNode in the pipeline;

(4) The client sends each packet through the pipeline to the first DataNode, DataNode1. After receiving a packet from the client, DataNode1 verifies the validity of the data again and forwards the packet to the next DataNode in the pipeline. At the same time, DataNode1 writes the data and the checksum in the packet to the block file and the meta file, respectively, in its local RBW directory. Every DataNode in the pipeline performs the same write operation;

(5) After a packet has been written to all DataNodes in the pipeline, success is recorded in the response sent to the client, indicating that the packet was written to all replicas successfully. After all data has been written, the client closes the pipeline stream;

(6) The client submits the Block to the NameNode. Once the NameNode changes the Block's state, the Block has been completely written and is visible to external readers.
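To make the per-packet overhead in steps (3) and (4) concrete, the Java sketch below packs data with a checksum on the client side and recomputes it on the receiving side. It is a deliberately simplified model: the class `PacketChecksumDemo` and its `Packet` type are hypothetical, a single CRC32 covers the whole payload, and real HDFS packets carry headers and per-chunk checksums that are not reproduced here.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

final class PacketChecksumDemo {
    /** Hypothetical packet: payload plus the checksum computed by the sender. */
    static final class Packet {
        final byte[] data;
        final long checksum;

        Packet(byte[] data, long checksum) {
            this.data = data;
            this.checksum = checksum;
        }
    }

    static long crc(byte[] data) {
        CRC32 c = new CRC32();
        c.update(data, 0, data.length);
        return c.getValue();
    }

    /** Client side: copy the payload and attach its checksum. */
    static Packet pack(byte[] data) {
        return new Packet(data.clone(), crc(data));
    }

    /** DataNode side: recompute the checksum and compare before writing. */
    static boolean verify(Packet p) {
        return crc(p.data) == p.checksum;
    }

    public static void main(String[] args) {
        Packet p = pack("hello hdfs".getBytes(StandardCharsets.UTF_8));
        System.out.println(verify(p)); // true
        p.data[0] ^= 1;                // simulate corruption in transit
        System.out.println(verify(p)); // false
    }
}
```

In this model every payload is checksummed twice, once when packed and once when verified, and is copied into a packet object in between; this is the kind of work the text refers to as packet encapsulation, unpacking, and data verification.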

The native HDFS write flow described above is relatively complicated, and the client and server perform a large amount of packet encapsulation, unpacking, and data checksum computation, which further degrades performance. In addition, even if the client and the DataNode are on the same physical machine, the data still has to travel through the local loopback network (lo), causing additional bandwidth overhead on the machine. The HDFS write process is therefore limited by unpacking, verification, and network pressure; it cannot fully exploit disk performance, writes are slow, and the impact on the business is significant.

In the present invention, by contrast, after the HDFS client obtains the Block information from the NameNode, it determines from the Block's location information (BlockLocation) whether the locations include the local node. As shown in Figure 2, if a local DataNode exists, the HDFS client constructs the short-circuit write output stream ShortCircuitWriteOutputStream and completes the short-circuit write flow; if no local DataNode exists, the HDFS client calls the DFSOutputStream interface and uses the native write flow.

It should be noted that the short-circuit write scheme still uses the HDFS API. Short-circuit write can be enabled flexibly, for example through a configuration switch, for specific configured directories, or by calling a newly added interface, and the original distributed file system directory mechanism (HdfsDirectory) remains unchanged.
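One way to picture the switch-based and directory-based enablement is the Java sketch below. The source does not name any configuration keys or classes, so everything here is hypothetical: the class `ShortCircuitWritePolicy`, its constructor arguments, and the rule that an empty directory list means "enabled everywhere" are assumptions for illustration.

```java
import java.util.List;

final class ShortCircuitWritePolicy {
    private final boolean enabled;           // assumed global switch
    private final List<String> enabledDirs;  // assumed directory allow-list; empty means all paths

    ShortCircuitWritePolicy(boolean enabled, List<String> enabledDirs) {
        this.enabled = enabled;
        this.enabledDirs = enabledDirs;
    }

    /** Returns true when a file at filePath should use the short-circuit write flow. */
    boolean appliesTo(String filePath) {
        if (!enabled) {
            return false;
        }
        if (enabledDirs.isEmpty()) {
            return true;
        }
        for (String dir : enabledDirs) {
            // Compare against "dir/" so that /a/fast does not match /a/faster.
            String prefix = dir.endsWith("/") ? dir : dir + "/";
            if (filePath.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ShortCircuitWritePolicy policy =
                new ShortCircuitWritePolicy(true, List.of("/warehouse/fast"));
        System.out.println(policy.appliesTo("/warehouse/fast/part-0000"));   // true
        System.out.println(policy.appliesTo("/warehouse/faster/part-0000")); // false
    }
}
```

A client could consult such a policy before the locality check, so that files outside the configured directories always take the native write flow.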

The embodiments of the present invention are given for the purposes of illustration and description and are not intended to be exhaustive or to limit the invention to the disclosed form. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in those embodiments or make equivalent substitutions for some of their technical features.

Claims

1. An acceleration method for HDFS short-circuit writing, involving an HDFS client, characterized in that the acceleration method comprises the following steps:

S1: creating a file and applying for a Block, wherein the HDFS client initiates an RPC to the NameNode, creates the file, and applies for a first Block;

S2: creating Block information according to the Block allocation strategy, wherein the NameNode creates the Block information according to the Block allocation strategy and returns the Block information to the client;

S3: creating the RBW files of the Block and Meta, wherein the client constructs a short-circuit write output stream (ShortCircuitWriteOutputStream) and initiates a request to the local DataNode1 to create the RBW files of the Block and Meta, and DataNode1 returns descriptor information of the Block and Meta files to the client;

S4: creating a local Java stream to write the Block and Meta files, wherein the client creates the local Java stream to write directly to the Block and Meta files on the local disk, and submits the Block information to the local DataNode1 after the client finishes writing the Block and Meta;

S5: archiving and recording the Block information, wherein the client submits the Block information to the NameNode, the NameNode accepts the Block submission request, and the archiving record of the Block information is completed.

2. The acceleration method for HDFS short-circuit writing according to claim 1, wherein: in step S1, the Block information includes the BlockID and BlockLocation information.

3. The acceleration method for HDFS short-circuit writing according to claim 2, wherein: in step S3, after the client acquires the Block information from the NameNode, it determines according to the Block location information (BlockLocation) whether the locations include a local node; if so, the interface in ShortCircuitWriteOutputStream is called to carry out the data writing flow; if not, the client falls back to the DFSOutputStream interface and uses the native data writing flow.

4. The acceleration method for HDFS short-circuit writing according to claim 1, wherein: in step S4, the logic by which the DataNode writes the file is stripped out, and a local output stream is used to write directly to disk, so as to shorten the logical link for writing the data.