CN106302662A - Hbase-based MR operation method capable of saving network flow - Google Patents

Hbase-based MR operation method capable of saving network flow

Info

Publication number
CN106302662A
Authority
CN
China
Prior art keywords
mapreduce
hbase
computing unit
operation method
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610628407.8A
Other languages
Chinese (zh)
Inventor
赵明超
牛硕
臧勇真
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
IEIT Systems Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd
Priority to CN201610628407.8A
Publication of CN106302662A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/15 Flow control; Congestion control in relation to multipoint traffic

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Multi Processors (AREA)

Abstract

The invention discloses an HBase-based MR operation method capable of saving network traffic, which solves the problems of high network overhead and the cluster's risk of network paralysis. The technical scheme comprises the following steps: (1) implementing the InputFormat of MapReduce; (2) obtaining information on all the large data blocks of an HBase table; (3) obtaining the underlying files of each data block; (4) using the underlying files of all the obtained data blocks as the input of MapReduce, and executing MapReduce with each underlying file as a computing unit; (5) executing reduce to finish the MapReduce job.

Description

An HBase-based MR operation method that saves network traffic

Technical Field

The present invention relates, specifically, to an HBase-based MR operation method that saves network traffic.

Background

In today's world, the daily operations of companies routinely generate terabytes of data. Data sources include any type of data that internet-connected devices can capture: websites, social media, transactional business data, and data created in other business environments. Given the volume of data being generated, real-time processing has become a primary challenge for many organizations.

MR is short for MapReduce, a programming model for parallel computation over large data sets (larger than 1 TB). The concepts "Map" and "Reduce", and the main ideas behind them, are borrowed from functional programming languages, along with features borrowed from vector programming languages. The model makes it much easier for programmers with no experience in distributed parallel programming to run their programs on a distributed system. In current software implementations, the user specifies a Map function, which maps a set of key-value pairs to a new set of key-value pairs, and a concurrent Reduce function, which ensures that all mapped key-value pairs sharing the same key are processed together.
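The Map and Reduce roles described above can be illustrated with a minimal single-process sketch. The word-count task and the names `map_fn`, `reduce_fn`, and `run_mapreduce` are assumptions chosen for illustration; they are not taken from the patent or from any Hadoop API.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map one input pair (line number, line text) to new (word, 1) pairs."""
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Combine all values that share one key into a single result pair."""
    return key, sum(values)


def run_mapreduce(records):
    # Shuffle step: group mapped pairs so each Reduce call sees exactly one key.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return dict(reduce_fn(k, v) for k, v in groups.items())


counts = run_mapreduce([
    (0, "hbase stores data"),
    (1, "mapreduce reads hbase data"),
])
```

In a real cluster the Map calls and the per-key Reduce calls would run on different machines; the grouping step in the middle is what guarantees that pairs with the same key end up in the same Reduce call.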

When MR is run on HBase, the underlying data files of HBase are not all located on the nodes that run the MR tasks. As a result, during execution the MR nodes read data files from other nodes across the network, which incurs considerable extra network overhead. When the cluster holds terabytes or petabytes of data, conventional HBase MapReduce can easily generate very large network overhead and expose the cluster to the risk of network paralysis.

Summary of the Invention

The technical task of the present invention is to provide an HBase-based MR operation method that saves network traffic, in order to solve the problems of high network overhead and the cluster's risk of network paralysis.

The technical task of the present invention is accomplished as follows.

An HBase-based MR operation method that saves network traffic comprises the following steps:

(1) Implement the InputFormat of MapReduce;

(2) Obtain information on all the large data blocks (Regions) of a given HBase table;

(3) For each data block, obtain its underlying files (HFiles);

(4) Use the underlying files of all the obtained data blocks as the input of MapReduce, and execute MapReduce with each underlying file as a computing unit;

(5) Execute reduce and finish the MapReduce job.
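Steps (1) to (4) can be sketched as a small in-memory simulation. Everything here is hypothetical: `TABLE_LAYOUT`, the file paths, the host names, and the helpers `list_regions`, `list_hfiles`, `get_splits`, and `assign_local` are illustrative stand-ins, not HBase or Hadoop calls. The placement rule in `assign_local` (run each map task on a host that already stores the file) is an assumption about how the method avoids cross-node reads, since the text does not spell out the scheduling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HFileSplit:
    region: str
    path: str
    hosts: tuple  # nodes assumed to hold a local copy of this file


# Hypothetical metadata: table -> region -> [(HFile path, hosts)].
TABLE_LAYOUT = {
    "orders": {
        "region-a": [("/data/orders/region-a/f1", ("node1", "node2"))],
        "region-b": [
            ("/data/orders/region-b/f2", ("node2", "node3")),
            ("/data/orders/region-b/f3", ("node3", "node1")),
        ],
    }
}


def list_regions(table):
    """Step (2): enumerate the Regions of one table."""
    return sorted(TABLE_LAYOUT[table])


def list_hfiles(table, region):
    """Step (3): enumerate the underlying files of one Region."""
    return TABLE_LAYOUT[table][region]


def get_splits(table):
    """Steps (1) and (4): one input split, i.e. one computing unit, per file."""
    return [
        HFileSplit(region, path, hosts)
        for region in list_regions(table)
        for path, hosts in list_hfiles(table, region)
    ]


def assign_local(splits):
    """Place each map task on the least-loaded host that stores the file."""
    load, plan = {}, {}
    for split in splits:
        host = min(split.hosts, key=lambda h: load.get(h, 0))
        load[host] = load.get(host, 0) + 1
        plan[split.path] = host
    return plan


splits = get_splits("orders")
plan = assign_local(splits)
```

Because every task in `plan` lands on a host listed in its own split, no map task in this sketch needs to pull its input file over the network.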

In step (4), when MapReduce is executed, it achieves reliability by distributing the large-scale operations on the data set to every computing unit on the network; each computing unit periodically reports back the work it has completed and its latest status.

If a computing unit stays silent for longer than a preset time interval, the master computing unit (analogous to the master server in the Google File System) marks that computing unit as dead and sends the data assigned to it to other computing units.
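The heartbeat and reassignment behaviour described above might look like the following sketch. The `Master` class, its method names, the round-robin reassignment, and the use of plain numbers as timestamps are all assumptions made for illustration; the patent does not specify these details.

```python
class Master:
    """Tracks worker heartbeats and reassigns work from silent workers."""

    def __init__(self, timeout):
        self.timeout = timeout   # preset silence interval
        self.last_seen = {}      # worker -> time of last report
        self.assigned = {}       # worker -> list of work units
        self.dead = set()
        self.pending = []        # units with no live worker to take them

    def register(self, worker, now):
        self.last_seen[worker] = now
        self.assigned.setdefault(worker, [])

    def assign(self, worker, unit):
        self.assigned[worker].append(unit)

    def heartbeat(self, worker, now):
        # A periodic status report from a worker that is still considered alive.
        if worker not in self.dead:
            self.last_seen[worker] = now

    def check(self, now):
        """Mark workers silent for longer than the timeout as dead and
        hand their units to the remaining live workers (round robin)."""
        newly_dead = [
            w for w, seen in self.last_seen.items()
            if w not in self.dead and now - seen > self.timeout
        ]
        self.dead.update(newly_dead)
        live = [w for w in self.last_seen if w not in self.dead]
        for worker in newly_dead:
            orphaned, self.assigned[worker] = self.assigned[worker], []
            for i, unit in enumerate(orphaned):
                if live:
                    self.assigned[live[i % len(live)]].append(unit)
                else:
                    self.pending.append(unit)
        return newly_dead


master = Master(timeout=10)
master.register("w1", now=0)
master.register("w2", now=0)
master.assign("w1", "hfile-1")
master.assign("w2", "hfile-2")
master.heartbeat("w1", now=8)   # w1 reports in; w2 stays silent
failed = master.check(now=15)
```

Here `w2` has been silent for 15 time units against a timeout of 10, so its unit moves to `w1`, which reported 7 time units ago.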

The HBase-based MR operation method of the present invention has the following advantages: by combining the execution characteristics of MR with the storage characteristics of HBase data, MR is executed directly on each data file, which fundamentally solves the problem of fetching data across nodes at the start of a MapReduce run. This saves network overhead, and the method is well suited to wide adoption.

Brief Description of the Drawings

The present invention is further described below with reference to the accompanying drawing.

Figure 1 is a flowchart of an HBase-based MR operation method that saves network traffic.

Detailed Description

The HBase-based MR operation method of the present invention is described in detail below with reference to the accompanying drawing and a specific embodiment.

Embodiment:

The HBase-based MR operation method of the present invention that saves network traffic comprises the following steps:

(1) Implement the InputFormat of MapReduce;

(2) Obtain information on all the large data blocks (Regions) of a given HBase table;

(3) For each data block, obtain its underlying files (HFiles);

(4) Use the underlying files of all the obtained data blocks as the input of MapReduce, and execute MapReduce with each underlying file as a computing unit;

(5) Execute reduce and finish the MapReduce job.

In step (4), when MapReduce is executed, it achieves reliability by distributing the large-scale operations on the data set to every computing unit on the network; each computing unit periodically reports back the work it has completed and its latest status.

If a computing unit stays silent for longer than a preset time interval, the master computing unit (analogous to the master server in the Google File System) marks that computing unit as dead and sends the data assigned to it to other computing units.

1. Map and Reduce

Put simply, a map function applies a specified operation to every element of a conceptual list of independent elements (for example, a list of test scores). For instance, if someone finds that every student's score was overstated by one point, they can define a "subtract one" map function to correct the error. Each element is in fact operated on independently, and the original list is not modified, because a new list is created to hold the new results. This means that the Map operation is highly parallelizable, which is very useful for applications with high performance requirements and for the field of parallel computing.

A reduce operation, in turn, appropriately combines the elements of a list. Continuing the example above, what if someone wants to know the class average? They can define a reduce function that halves the list by adding each element to its neighbor, applies this recursively until only one element remains, and then divides that element by the number of students to obtain the average. Although reduce is less parallel than map, a reduction always yields a single simple answer and the large-scale computations are relatively independent, so reduce functions are also useful in highly parallel environments.
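The test-score example can be written out as a short sketch. The function names and the sample scores are made up for illustration, and the halving is written as a loop rather than as literal recursion.

```python
def subtract_one(score):
    """Map function: correct a score that was overstated by one point."""
    return score - 1


def pairwise_sum(values):
    """Reduce by repeatedly adding neighbors, halving the list each round."""
    values = list(values)
    if not values:
        raise ValueError("cannot reduce an empty list")
    while len(values) > 1:
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            paired.append(values[-1])  # odd element carries over to the next round
        values = paired
    return values[0]


scores = [91, 76, 88, 64, 81]
corrected = list(map(subtract_one, scores))  # a new list; scores is unchanged
average = pairwise_sum(corrected) / len(corrected)
```

Within one round every neighbor pair can be added independently, which is where the parallelism of a reduction comes from, while the number of rounds grows only with the logarithm of the list length.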

2. Reliable Distribution

MapReduce achieves reliability by distributing the large-scale operations on the data set to every node on the network; each node periodically reports back the work it has completed and its latest status. If a node stays silent for longer than a preset time interval, the master node (analogous to the master server in the Google File System) marks the node as dead and sends the data assigned to it to other nodes. Each operation uses atomic operations on named files to ensure that parallel threads do not conflict; when files are renamed, the system may also copy them to another name besides the task name (to avoid side effects).
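One common way to obtain the atomic named-file behaviour mentioned above is to write to a task-private temporary file and then rename it into place. The sketch below shows that general technique under stated assumptions: the function `commit_output`, the file name `part-00000`, and the attempt identifiers are hypothetical, and this is not claimed to be how the patent or any specific MapReduce implementation commits output.

```python
import os
import tempfile


def commit_output(directory, final_name, data, task_id):
    """Write to a task-private temp file, then rename it into place.

    os.replace swaps the name in one step on the same filesystem, so a
    reader sees either the old file or the complete new one, never a
    partially written file.
    """
    tmp_path = os.path.join(directory, f".{final_name}.{task_id}.tmp")
    with open(tmp_path, "w", encoding="utf-8") as handle:
        handle.write(data)
    final_path = os.path.join(directory, final_name)
    os.replace(tmp_path, final_path)
    return final_path


with tempfile.TemporaryDirectory() as workdir:
    # Two attempts of the same task produce the same output file.
    commit_output(workdir, "part-00000", "a\t1\n", task_id="attempt-0")
    commit_output(workdir, "part-00000", "a\t1\n", task_id="attempt-1")
    files = sorted(os.listdir(workdir))
```

Because each attempt writes under its own temporary name, a re-executed task (for example after a worker was declared dead) cannot corrupt the output of another attempt.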

Reduce operations work in a similar way, but because they parallelize comparatively poorly, the master node tries to schedule a reduce operation on a single node, or on a node as close as possible to the data it needs to operate on. This property meets Google's needs, because they have sufficient bandwidth and their internal network does not have that many machines.

From the specific embodiment above, a person skilled in the art can readily implement the present invention. It should be understood, however, that the present invention is not limited to the specific embodiment described above. On the basis of the disclosed embodiment, a person skilled in the art may combine different technical features in any way to realize different technical solutions.

Everything other than the technical features described in this specification is known to those skilled in the art.

Claims (3)

1. An HBase-based MR operation method that saves network traffic, characterized by the following steps: (1) implementing the InputFormat of MapReduce; (2) obtaining information on all the large data blocks of a given HBase table; (3) for each data block, obtaining its underlying files; (4) using the underlying files of all the obtained data blocks as the input of MapReduce, and executing MapReduce with each underlying file as a computing unit; (5) executing reduce and finishing the MapReduce job.

2. The HBase-based MR operation method that saves network traffic according to claim 1, characterized in that, in step (4), when MapReduce is executed, it achieves reliability by distributing the large-scale operations on the data set to every computing unit on the network; each computing unit periodically reports back the work it has completed and its latest status.

3. The HBase-based MR operation method that saves network traffic according to claim 2, characterized in that, if a computing unit stays silent for longer than a preset time interval, the master computing unit marks that computing unit as dead and sends the data assigned to it to other computing units.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610628407.8A CN106302662A (en) 2016-08-03 2016-08-03 Hbase-based MR operation method capable of saving network flow

Publications (1)

Publication Number Publication Date
CN106302662A true CN106302662A (en) 2017-01-04

Family

ID=57664543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610628407.8A Pending CN106302662A (en) 2016-08-03 2016-08-03 Hbase-based MR operation method capable of saving network flow

Country Status (1)

Country Link
CN (1) CN106302662A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110066649A1 (en) * 2009-09-14 2011-03-17 Myspace, Inc. Double map reduce distributed computing framework
CN103645952A (en) * 2013-08-08 2014-03-19 中国人民解放军国防科学技术大学 Non-accurate task parallel processing method based on MapReduce
CN103984926A (en) * 2014-05-15 2014-08-13 江苏科大汇峰科技有限公司 Distributed moving object detection method based on MapReduce calculation model

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170104