CN106302662A

CN106302662A - HBase-based MR operation method that saves network traffic

Info

Publication number: CN106302662A
Application number: CN201610628407.8A
Authority: CN
Inventors: 赵明超; 牛硕; 臧勇真
Original assignee: Inspur Electronic Information Industry Co Ltd
Current assignee: IEIT Systems Co Ltd
Priority date: 2016-08-03
Filing date: 2016-08-03
Publication date: 2017-01-04

Abstract

The invention discloses an HBase-based MR (MapReduce) operation method that saves network traffic. It addresses the high network overhead of running MapReduce on HBase and the resulting risk of network paralysis in the cluster. The technical scheme comprises the following steps: (1) implement the InputFormat method of MapReduce; (2) obtain the information of all large data blocks (Regions) of an HBase table; (3) for each data block, obtain its underlying files (HFiles); (4) use the underlying files of all the data blocks as the input of MapReduce, and execute MapReduce with each underlying file as a computing unit; (5) execute reduce and end MapReduce.

Description

An HBase-based MR operation method that saves network traffic

Technical field

The present invention relates to an MR operation method, and in particular to an HBase-based MR operation method that saves network traffic.

Background

Today, the daily operations of companies often generate terabytes of data. The data sources include any type of data that internet-connected devices can capture: websites, social media, transactional business data, and data created in other business environments. Given the volume of data being generated, real-time processing has become a primary challenge for many organizations.

MR is the abbreviation of MapReduce. MapReduce is a programming model for parallel computation over large data sets (larger than 1 TB). The concepts "Map" and "Reduce", and their main ideas, are borrowed from functional programming languages, along with features borrowed from vector programming languages. The model makes it much easier for programmers who are not familiar with distributed parallel programming to run their programs on a distributed system. In current software implementations, the user specifies a Map function that maps a set of key-value pairs to a new set of key-value pairs, and a concurrent Reduce function that ensures that all of the mapped key-value pairs that share the same key are grouped together.
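The key-value model described above can be illustrated with a minimal single-process sketch. The word-count task and the names `map_fn`, `reduce_fn`, and `run_mapreduce` are assumptions chosen for illustration; they are not part of the disclosed method or of any MapReduce library.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: turn one (line_number, text) pair into (word, 1) pairs."""
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Reduce: combine all values that share the same key."""
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, group the mapped pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


records = [(0, "HBase stores HFiles"), (1, "MapReduce reads HFiles")]
counts = run_mapreduce(records, map_fn, reduce_fn)
print(counts)
```

In a real cluster the grouping step is performed by the framework between the map and reduce phases; here it is a dictionary in one process.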

When MR is run on HBase, the underlying data files of HBase are not all located on the nodes where MR runs. Therefore, when MR executes, the MR execution nodes read data files on other nodes across the network, which causes a great deal of additional network overhead. When the cluster holds terabytes or petabytes of data, the traditional MapReduce of HBase can easily generate very large network overhead and put the cluster at risk of network paralysis.

Summary of the invention

The technical task of the present invention is to provide an HBase-based MR operation method that saves network traffic, in order to solve the problems of high network overhead and the risk of network paralysis in the cluster.

The technical task of the present invention is achieved in the following manner:

An HBase-based MR operation method that saves network traffic, with the following steps:

(1) Implement the InputFormat method of MapReduce;

(2) Obtain the information of all large data blocks (Regions) of an HBase table;

(3) For each data block, obtain its underlying files (HFiles);

(4) Use the underlying files of all the data blocks as the input of MapReduce, and execute MapReduce with each underlying file as a computing unit;

(5) Execute reduce and end MapReduce.
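Steps (2) to (4) can be sketched as follows. This is a simplified in-memory model, not the actual implementation: the classes `HFile`, `Region`, and `InputSplit`, the functions `get_splits` and `assign_tasks`, and the file paths and node names are all hypothetical. In Hadoop itself, the corresponding extension point is a subclass of `org.apache.hadoop.mapreduce.InputFormat` that overrides `getSplits` and `createRecordReader`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HFile:
    path: str
    size_bytes: int
    hosts: tuple  # nodes that hold a copy of this file (assumed known)


@dataclass(frozen=True)
class Region:
    name: str
    hfiles: tuple


@dataclass(frozen=True)
class InputSplit:
    path: str
    length: int
    locations: tuple


def get_splits(regions):
    """Steps (2)-(4): emit one split per underlying file, with host hints."""
    splits = []
    for region in regions:
        for hfile in region.hfiles:
            splits.append(InputSplit(hfile.path, hfile.size_bytes, hfile.hosts))
    return splits


def assign_tasks(splits, nodes):
    """Prefer a node that holds the file; otherwise fall back to a remote read."""
    assignment = {}
    for split in splits:
        local = [n for n in nodes if n in split.locations]
        assignment[split.path] = local[0] if local else nodes[0]
    return assignment


# Hypothetical table with two Regions and three HFiles.
regions = [
    Region("region-1", (
        HFile("/t/region-1/cf/a", 128, ("node-a", "node-b")),
        HFile("/t/region-1/cf/b", 64, ("node-b",)),
    )),
    Region("region-2", (
        HFile("/t/region-2/cf/c", 256, ("node-c", "node-a")),
    )),
]
nodes = ["node-a", "node-b", "node-c"]
splits = get_splits(regions)
assignment = assign_tasks(splits, nodes)
print(assignment)
```

Because each split names the nodes that already hold its file, a map task can be scheduled where the data resides, which is the source of the network saving the method aims at.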

In step (4), MapReduce is executed. MapReduce achieves reliability by distributing large-scale operations on the data set to each computing unit on the network; each computing unit periodically reports back the work it has completed and its latest status.

If a computing unit remains silent for longer than a preset time interval, the master computing unit (similar to the master server in the Google File System) records the status of that computing unit as dead and sends the data assigned to it to other computing units.
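The timeout rule described above can be sketched as follows. The class `Master`, its method names, the round-robin reassignment, and the explicit `now` timestamps are assumptions made for illustration; the source does not specify how the master computing unit tracks status or chooses the replacement unit.

```python
class Master:
    """Tracks worker heartbeats and reassigns work from silent workers."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s  # preset silence interval, in seconds
        self.last_seen = {}         # worker -> time of last heartbeat
        self.assigned = {}          # live worker -> list of work items
        self.dead = set()
        self.pending = []           # work waiting for a live worker

    def register(self, worker, now):
        self.last_seen[worker] = now
        self.assigned[worker] = []

    def heartbeat(self, worker, now):
        if worker not in self.dead:
            self.last_seen[worker] = now

    def assign(self, worker, item):
        self.assigned[worker].append(item)

    def check(self, now):
        """Mark workers silent for longer than the timeout as dead and
        hand their work to the remaining live workers (round-robin)."""
        newly_dead = [w for w, seen in self.last_seen.items()
                      if w not in self.dead and now - seen > self.timeout_s]
        self.dead.update(newly_dead)
        live = [w for w in self.last_seen if w not in self.dead]
        for worker in newly_dead:
            for i, item in enumerate(self.assigned.pop(worker, [])):
                if live:
                    self.assigned[live[i % len(live)]].append(item)
                else:
                    self.pending.append(item)
        return newly_dead


master = Master(timeout_s=10)
master.register("node-a", now=0)
master.register("node-b", now=0)
master.assign("node-a", "hfile-1")
master.assign("node-b", "hfile-2")
master.heartbeat("node-a", now=12)  # node-b stays silent
dead = master.check(now=15)
print(dead, master.assigned)
```

Passing timestamps explicitly keeps the sketch deterministic; a running system would read a clock instead.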

The HBase-based MR operation method that saves network traffic of the present invention has the following advantages: by combining the execution characteristics of MR with the storage characteristics of HBase data, MR is executed directly on each data file, which fundamentally solves the problem of fetching data across nodes in the initial stage of a MapReduce run. This saves considerable network overhead, and the method is well suited to wider adoption.

Description of the drawings

The present invention is further described below with reference to the accompanying drawing.

Figure 1 is a flow chart of the HBase-based MR operation method that saves network traffic.

Detailed description

The HBase-based MR operation method that saves network traffic of the present invention is described in detail below with reference to the accompanying drawing and a specific embodiment.

Embodiment:

The HBase-based MR operation method that saves network traffic of the present invention proceeds as follows:

1. Map and reduce

Put simply, a map function performs a specified operation on every element of a conceptual list of independent elements (for example, a list of test scores). For instance, if someone finds that every student's score was overstated by one point, they can define a "minus one" map function to correct the error. Each element is operated on independently, and the original list is not modified, because a new list is created to hold the new results. This means that the Map operation is highly parallelizable, which is very useful for applications with high performance requirements and for the field of parallel computing.
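The "minus one" correction can be written as a short sketch. The scores are made-up values and the thread pool is only one possible way to run the independent per-element operations concurrently.

```python
from concurrent.futures import ThreadPoolExecutor


def minus_one(score):
    """The map function: correct a score that was overstated by one point."""
    return score - 1


scores = [91, 76, 88, 64]

# Each element is processed independently, so the calls may run in parallel.
# pool.map returns results in input order and leaves the input list untouched.
with ThreadPoolExecutor(max_workers=4) as pool:
    corrected = list(pool.map(minus_one, scores))

print(corrected)
```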

A reduce operation, in turn, combines the elements of a list in an appropriate way. Continuing the example above, suppose someone wants to know the class average. They can define a reduce function that halves the list by adding each element to its neighbor, and repeats this recursively until only one element remains; dividing that element by the number of students gives the average. Although reduce is not as parallel as map, a reduction always yields a single simple answer and the large-scale operations are relatively independent, so reduce functions are also useful in highly parallel environments.
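The halving reduction for the class average can be sketched as below. The function name `pairwise_sum` and the scores are assumptions for illustration; an odd leftover element is simply carried into the next round.

```python
def pairwise_sum(values):
    """Halve the list repeatedly by adding neighboring elements."""
    values = list(values)
    if not values:
        raise ValueError("cannot reduce an empty list")
    while len(values) > 1:
        # The additions within one round are independent of each other.
        paired = [values[i] + values[i + 1]
                  for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            paired.append(values[-1])  # carry the unpaired last element
        values = paired
    return values[0]


scores = [90, 75, 87, 63]
average = pairwise_sum(scores) / len(scores)
print(average)
```

Each round needs the results of the previous round, which is why the reduction has fewer opportunities for parallelism than the map step.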

2. Reliable distribution

MapReduce achieves reliability by distributing large-scale operations on the data set to each node on the network; each node periodically reports back the work it has completed and its latest status. If a node remains silent for longer than a preset time interval, the master node (similar to the master server in the Google File System) records the status of that node as dead and sends the data assigned to it to other nodes. Each operation uses atomic operations on named files to ensure that parallel threads do not conflict; when files are renamed, the system may also copy them to a name other than the task name (to avoid side effects).
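The role of atomic operations on named files can be illustrated with a write-then-rename sketch. The function `commit_output`, the temporary-file naming scheme, and the `part-00000` file name are assumptions for illustration; the source does not describe the exact commit protocol. The sketch relies on `os.replace`, which renames a file atomically when source and destination are on the same file system.

```python
import os
import tempfile


def commit_output(final_path, data, attempt_id):
    """Write to a per-attempt temporary file, then rename it into place."""
    directory = os.path.dirname(final_path) or "."
    tmp_name = ".{}.{}.tmp".format(os.path.basename(final_path), attempt_id)
    tmp_path = os.path.join(directory, tmp_name)
    with open(tmp_path, "w", encoding="utf-8") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    # Readers see either the old file or the complete new one, never a
    # partially written file.
    os.replace(tmp_path, final_path)
    return final_path


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "part-00000")
    commit_output(target, "alice\t90\n", attempt_id="attempt_0")
    # A duplicate attempt of the same task overwrites with identical content.
    commit_output(target, "alice\t90\n", attempt_id="attempt_1")
    with open(target, encoding="utf-8") as f:
        committed = f.read()
    leftovers = sorted(os.listdir(workdir))

print(committed, leftovers)
```

Because each attempt writes under its own temporary name, two attempts of the same task never write to the same file at the same time.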

The reduce operation works in a similar way. However, because reduce operations parallelize relatively poorly, the master node tries to schedule a reduce operation on a single node, or on a node as close as possible to the data it operates on. This property meets Google's needs, because they have sufficient bandwidth and their internal network does not have that many machines.
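One simple reading of "as close as possible to the data" is to run the reduce on the node that already holds the most map output, so that the fewest bytes cross the network. The sketch below illustrates only that heuristic; the function `pick_reduce_node`, the node names, and the byte counts are hypothetical, and real schedulers take many more factors into account.

```python
def pick_reduce_node(map_output_bytes, candidates):
    """Choose the candidate node that already holds the most map output."""
    return max(candidates, key=lambda node: map_output_bytes.get(node, 0))


# Hypothetical sizes of the map output stored on each node.
map_output_bytes = {"node-a": 400, "node-b": 1200, "node-c": 300}
candidates = ["node-a", "node-b", "node-c"]

chosen = pick_reduce_node(map_output_bytes, candidates)
# Everything not already on the chosen node must be fetched over the network.
remote_bytes = sum(map_output_bytes.values()) - map_output_bytes[chosen]
print(chosen, remote_bytes)
```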

Based on the specific embodiment above, those skilled in the art can readily implement the present invention. It should be understood, however, that the present invention is not limited to the specific embodiment described above. On the basis of the disclosed embodiment, those skilled in the art may combine different technical features as they see fit to obtain different technical solutions.

Apart from the technical features described in this specification, everything is known to those skilled in the art.

Claims

1. An HBase-based MR operation method that saves network traffic, characterized in that the steps are as follows:

(1) Implement the InputFormat method of MapReduce;

(2) Obtain the information of all large data blocks of an HBase table;

(3) For each data block, obtain its underlying files;

(4) Use the underlying files of all the data blocks as the input of MapReduce, and execute MapReduce with each underlying file as a computing unit;

(5) Execute reduce and end MapReduce.

2. The HBase-based MR operation method that saves network traffic according to claim 1, characterized in that in step (4), MapReduce is executed, and MapReduce achieves reliability by distributing large-scale operations on the data set to each computing unit on the network; each computing unit periodically reports back the work it has completed and its latest status.

3. The HBase-based MR operation method that saves network traffic according to claim 2, characterized in that if a computing unit remains silent for longer than a preset time interval, the master computing unit records the status of that computing unit as dead and sends the data assigned to it to other computing units.