CN106302662A - Hbase-based MR operation method capable of saving network flow - Google Patents
- Publication number
- CN106302662A (application CN201610628407.8A)
- Authority
- CN
- China
- Prior art keywords
- mapreduce
- hbase
- computing unit
- operation method
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L47/00—Traffic control in data switching networks
- H04L47/10—Flow control; Congestion control
- H04L47/15—Flow control; Congestion control in relation to multipoint traffic
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Multi Processors (AREA)
Abstract
Description
Technical Field
The present invention relates specifically to an HBase-based MR operation method that saves network traffic.
Background
In today's world, a company's daily operations often generate terabytes of data. Data sources include any type of data that Internet-connected devices can capture: websites, social media, transactional business data, and data created in other business environments. Given the volume of data generated, real-time processing has become a primary challenge for many organizations.
MR is the abbreviation of MapReduce, a programming model for parallel computation over large data sets (larger than 1 TB). The concepts "Map" and "Reduce", and their main ideas, are borrowed from functional programming languages, along with features borrowed from vector programming languages. The model makes it much easier for programmers with no experience in distributed parallel programming to run their programs on a distributed system. In current software implementations, the programmer specifies a Map function, which maps a set of key-value pairs to a new set of key-value pairs, and a concurrent Reduce function, which processes together all mapped key-value pairs that share the same key.
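As a minimal illustration of the Map and Reduce roles described above, the following single-process Python sketch groups mapped key-value pairs by key and reduces each group. The names `run_mapreduce`, `word_map`, and `word_reduce`, and the sample documents, are hypothetical and do not come from the patent; a real framework would distribute the same phases across many nodes.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map_fn to each input pair, group the output by key, reduce each group."""
    groups = defaultdict(list)
    for key, value in records:
        # Map phase: one input pair may produce any number of output pairs.
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Reduce phase: all values that share a key are handled together.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def word_map(_doc_id, text):
    """Hypothetical map function: emit (word, 1) for every word."""
    for word in text.split():
        yield word.lower(), 1


def word_reduce(_word, counts):
    """Hypothetical reduce function: add up the counts for one word."""
    return sum(counts)


docs = [("d1", "HBase stores data"), ("d2", "MapReduce reads HBase data")]
counts = run_mapreduce(docs, word_map, word_reduce)
```

The grouping step is what guarantees that each call to the reduce function sees every value for its key.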
When MR runs on HBase, the underlying data files of HBase are not all located on the nodes where MR runs. As a result, during execution an MR node reads data files on other nodes across the network, which causes considerable extra network overhead. When the cluster holds terabytes or petabytes of data, conventional MapReduce on HBase can easily generate very heavy network traffic and puts the cluster at risk of network paralysis.
Summary of the Invention
The technical task of the present invention is to provide an HBase-based MR operation method that saves network traffic, in order to solve the problems of high network overhead and the risk of network paralysis in the cluster.
The technical task of the present invention is achieved as follows.
An HBase-based MR operation method that saves network traffic comprises the following steps:
(1) Implement the InputFormat method of MapReduce;
(2) Obtain the information of all the large data blocks (Regions) of a given HBase table;
(3) For each data block, obtain its underlying files (HFiles);
(4) Use the underlying files of all the data blocks as the input of MapReduce, and execute MapReduce with each underlying file as a computing unit;
(5) Execute reduce and end the MapReduce job.
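The steps above can be sketched as a small in-memory simulation. This is not the HBase or Hadoop API: the classes `HFile`, `Region`, and `Split`, the functions `get_splits` and `run_job`, the file paths, and the node names are all hypothetical stand-ins, and the "reduce" is assumed to be a per-key sum purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HFile:
    path: str      # hypothetical file path
    hosts: tuple   # nodes assumed to hold a copy of this file
    rows: tuple    # (key, value) pairs standing in for the file contents


@dataclass(frozen=True)
class Region:
    name: str
    hfiles: tuple


@dataclass(frozen=True)
class Split:
    path: str
    hosts: tuple   # locality hint: where the map task should run


def get_splits(regions):
    """Steps (2)-(4): walk every region, collect its HFiles, one split per file."""
    return [Split(f.path, f.hosts) for r in regions for f in r.hfiles]


def run_job(regions):
    files = {f.path: f for r in regions for f in r.hfiles}
    placement, intermediate = {}, []
    for split in get_splits(regions):
        # Schedule the map task on a node that already stores the file.
        placement[split.path] = split.hosts[0]
        # Map phase: each HFile is processed as its own computing unit.
        intermediate.extend(files[split.path].rows)
    totals = {}
    for key, value in intermediate:
        # Step (5): reduce, here an assumed sum of values per key.
        totals[key] = totals.get(key, 0) + value
    return placement, totals


TABLE = (
    Region("r1", (
        HFile("/hbase/t/r1/cf/a", ("node1", "node2"), (("k1", 1), ("k2", 2))),
        HFile("/hbase/t/r1/cf/b", ("node3",), (("k1", 10),)),
    )),
    Region("r2", (
        HFile("/hbase/t/r2/cf/c", ("node2",), (("k3", 5),)),
    )),
)
placement, totals = run_job(TABLE)
```

In this sketch every map task is placed on a node listed in its split's `hosts`, so no file has to be read from another node during the map phase.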
In step (4), when MapReduce is executed, it achieves reliability by distributing the large-scale operations on the data set to each computing unit on the network; each computing unit periodically reports the work it has completed and its latest status.
If a computing unit stays silent for longer than a preset time interval, the master computing unit (analogous to the master server in the Google File System) records the status of that computing unit as dead and sends the data assigned to it to other computing units.
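The timeout rule described above might be sketched as follows. The `Master` class, its method names, the round-robin reassignment policy, and the explicit `now` timestamps are assumptions made for illustration; the patent does not specify how work is redistributed.

```python
class Master:
    """Toy master that tracks worker reports and reassigns work on timeout."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}   # worker -> time of its last report
        self.assigned = {}    # worker -> list of work items
        self.dead = set()

    def assign(self, worker, item, now):
        self.last_seen.setdefault(worker, now)
        self.assigned.setdefault(worker, []).append(item)

    def heartbeat(self, worker, now):
        # A periodic report refreshes the worker's last-seen time.
        if worker not in self.dead:
            self.last_seen[worker] = now

    def check(self, now):
        """Mark workers silent for longer than the timeout as dead, move their work."""
        newly_dead = [w for w, seen in self.last_seen.items()
                      if w not in self.dead and now - seen > self.timeout]
        self.dead.update(newly_dead)
        live = sorted(w for w in self.last_seen if w not in self.dead)
        for worker in newly_dead:
            for i, item in enumerate(self.assigned.pop(worker, [])):
                if not live:
                    raise RuntimeError("no live workers left to take over")
                # Assumed policy: hand orphaned items to live workers in turn.
                self.assigned.setdefault(live[i % len(live)], []).append(item)
        return newly_dead


master = Master(timeout=10)
master.assign("w1", "hfile-a", now=0)
master.assign("w2", "hfile-b", now=0)
master.heartbeat("w1", now=8)      # w1 reports in; w2 stays silent
lost = master.check(now=12)        # w2 has been silent for 12 > 10 time units
```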
The HBase-based MR operation method for saving network traffic of the present invention has the following advantages: by combining the execution characteristics of MR with the storage characteristics of HBase data, it runs MR directly on each data file, which fundamentally solves the problem of fetching data across nodes in the initial phase of a MapReduce run. It thereby saves considerable network overhead and is well suited to wide adoption.
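The claimed saving can be illustrated with a toy model that counts the bytes a task must read from other nodes. The file sizes, node names, and task placements below are invented for the example and are not measurements; the function `remote_bytes` is hypothetical.

```python
def remote_bytes(tasks):
    """Sum the sizes of inputs that are not stored on the node running the task.

    tasks: list of (node, inputs), where inputs is a list of (size, hosts).
    """
    return sum(size
               for node, inputs in tasks
               for size, hosts in inputs
               if node not in hosts)


# Assumed layout: one region with two files stored on different nodes.
file_a = (100, ("node1",))
file_b = (200, ("node2",))

# One task per region, running on node1, has to pull file_b over the network.
per_region = [("node1", [file_a, file_b])]

# One task per file, each placed on the node that stores its file.
per_file = [("node1", [file_a]), ("node2", [file_b])]

region_traffic = remote_bytes(per_region)
file_traffic = remote_bytes(per_file)
```

Under these assumed placements the per-region plan moves 200 bytes across the network and the per-file plan moves none.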
Brief Description of the Drawings
The present invention is further described below with reference to the accompanying drawing.
Figure 1 is a flowchart of the HBase-based MR operation method for saving network traffic.
Detailed Description
The HBase-based MR operation method for saving network traffic of the present invention is described in detail below with reference to the accompanying drawing and a specific embodiment.
Embodiment:
The HBase-based MR operation method for saving network traffic of the present invention comprises the following steps:
(1) Implement the InputFormat method of MapReduce;
(2) Obtain the information of all the large data blocks (Regions) of a given HBase table;
(3) For each data block, obtain its underlying files (HFiles);
(4) Use the underlying files of all the data blocks as the input of MapReduce, and execute MapReduce with each underlying file as a computing unit;
(5) Execute reduce and end the MapReduce job.
In step (4), when MapReduce is executed, it achieves reliability by distributing the large-scale operations on the data set to each computing unit on the network; each computing unit periodically reports the work it has completed and its latest status.
If a computing unit stays silent for longer than a preset time interval, the master computing unit (analogous to the master server in the Google File System) records the status of that computing unit as dead and sends the data assigned to it to other computing units.
1. Map and Reduce
Put simply, a map function performs a specified operation on every element of a conceptual list of independent elements (for example, a list of test scores). Suppose someone finds that every student's score was overstated by one point; they can define a "minus one" map function to correct the error. Each element is operated on independently, and the original list is not modified, because a new list is created to hold the new results. This means that the Map operation is highly parallelizable, which is very useful for applications with high performance requirements and for parallel computing in general.
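The "minus one" example can be written directly with Python's built-in `map`. The score values are invented for illustration.

```python
def minus_one(score):
    """Correct a score that was overstated by one point."""
    return score - 1


scores = [91, 78, 85, 100]
# map applies the function to each element independently and yields new
# values; the original list is left untouched.
corrected = list(map(minus_one, scores))
```

Because no element depends on any other, the calls to `minus_one` could run in any order or on different machines.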
A reduce operation, in contrast, combines the elements of a list in an appropriate way. Continuing the example above, suppose someone wants to know the class average. They can define a reduce function that halves the list by adding each element to its neighbor, apply it recursively until only one element remains, and then divide that element by the number of students to obtain the average. Although reduce is not as parallel as map, it is still useful in a highly parallel environment, because a reduction always yields a single simple result and the large-scale computations are relatively independent.
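The halving procedure described above might look like the following sketch. The function name `pairwise_sum` and the score values are assumptions; an odd element left over in a round is simply carried into the next round.

```python
def pairwise_sum(values):
    """Repeatedly add neighbouring elements until one value remains."""
    items = list(values)
    if not items:
        raise ValueError("cannot reduce an empty list")
    while len(items) > 1:
        # Each addition in a round is independent, so a round could run in parallel.
        paired = [items[i] + items[i + 1] for i in range(0, len(items) - 1, 2)]
        if len(items) % 2:
            paired.append(items[-1])  # carry the unpaired last element forward
        items = paired
    return items[0]


scores = [90, 77, 84, 99]
average = pairwise_sum(scores) / len(scores)
```

Each round roughly halves the list, so a list of n elements needs about log2(n) rounds.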
2. Reliable Distribution
MapReduce achieves reliability by distributing the large-scale operations on the data set to each node on the network; each node periodically reports the work it has completed and its latest status. If a node stays silent for longer than a preset time interval, the master node (analogous to the master server in the Google File System) records the status of that node as dead and sends the data assigned to it to other nodes. Each operation uses atomic operations on named files to ensure that parallel threads do not conflict; when files are renamed, the system may copy them to another name in addition to the task name (to avoid side effects).
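One common way to realize atomic operations on named files is to write to a temporary name and then rename. The sketch below uses Python's standard `os.replace`, which is atomic on POSIX systems when source and destination are on the same file system. The function `commit_output`, the file name `part-0`, and the attempt identifiers are hypothetical and are not taken from the patent.

```python
import os
import tempfile


def commit_output(directory, task_name, attempt_id, data):
    """Write a task's result under a temporary name, then rename it into place."""
    tmp_path = os.path.join(directory, f"{task_name}.{attempt_id}.tmp")
    final_path = os.path.join(directory, task_name)
    with open(tmp_path, "w", encoding="utf-8") as handle:
        handle.write(data)
    # Readers see either the old complete file or the new complete file,
    # never a partially written one.
    os.replace(tmp_path, final_path)
    return final_path


def demo():
    with tempfile.TemporaryDirectory() as directory:
        # Two attempts of the same task commit one after the other.
        commit_output(directory, "part-0", "attempt1", "result from attempt 1")
        path = commit_output(directory, "part-0", "attempt2", "result from attempt 2")
        with open(path, encoding="utf-8") as handle:
            content = handle.read()
        return content, sorted(os.listdir(directory))


content, names = demo()
```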
The reduce operation works in a similar way, but because reduce operations parallelize relatively poorly, the master node tries to assign a reduce operation to a single node, or to a node as close as possible to the data it operates on. This property meets Google's needs, because they have sufficient bandwidth and their internal network does not have that many machines.
From the specific embodiment above, those skilled in the art can readily implement the present invention. It should be understood, however, that the present invention is not limited to the specific embodiment described above. On the basis of the disclosed embodiment, those skilled in the art may combine different technical features as they see fit to realize different technical solutions.
Everything other than the technical features described in this specification is known to those skilled in the art.
Claims (3)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201610628407.8A CN106302662A (en) | 2016-08-03 | 2016-08-03 | Hbase-based MR operation method capable of saving network flow |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN106302662A (en) | 2017-01-04 |
Family
ID=57664543
Family Applications (1)
| Application Number | Status | Priority Date | Filing Date | Title |
|---|---|---|---|---|
| CN201610628407.8A (CN106302662A) | Pending | 2016-08-03 | 2016-08-03 | Hbase-based MR operation method capable of saving network flow |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN106302662A (en) |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20110066649A1 (en) * | 2009-09-14 | 2011-03-17 | Myspace, Inc. | Double map reduce distributed computing framework |
| CN103645952A (en) * | 2013-08-08 | 2014-03-19 | 中国人民解放军国防科学技术大学 | Non-accurate task parallel processing method based on MapReduce |
| CN103984926A (en) * | 2014-05-15 | 2014-08-13 | 江苏科大汇峰科技有限公司 | Distributed moving object detection method based on MapReduce calculation model |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| US11455290B1 (en) | Streaming database change data from distributed storage | |
| US12210419B2 (en) | Continuous data protection | |
| US11042503B1 (en) | Continuous data protection and restoration | |
| CN102307206B (en) | Caching method of caching system for quickly accessing virtual machine mirror image based on cloud storage | |
| CN110799960A (en) | System and method for database tenant migration | |
| CN110569252B (en) | Data processing system and method | |
| CN107045422A (en) | Distributed storage method and equipment | |
| CN103139300A (en) | Virtual machine image management optimization method based on data de-duplication | |
| JPWO2011108695A1 (en) | Parallel data processing system, parallel data processing method and program | |
| CN109508326B (en) | Method, device and system for processing data | |
| CN108573029B (en) | Method, device and storage medium for acquiring network access relation data | |
| CN104885054A (en) | System and method for performing a transaction in a massively parallel processing database | |
| CN103714123A (en) | Methods for deleting duplicated data and controlling reassembly versions of cloud storage segmented objects of enterprise | |
| CN107526645A (en) | A kind of communication optimization method and system | |
| CN104363222A (en) | Hadoop-based network security event analysis method | |
| US9110820B1 (en) | Hybrid data storage system in an HPC exascale environment | |
| CN103514298A (en) | Method for achieving file lock and metadata server | |
| CN105138679A (en) | Data processing system and method based on distributed caching | |
| CN103365987B (en) | Clustered database system and data processing method based on shared-disk framework | |
| CN103294799B (en) | A kind of data parallel batch imports the method and system of read-only inquiry system | |
| CN103106261A (en) | Distributed query method based on narrow-band cloud data service | |
| CN112395308A (en) | Data query method based on HDFS database | |
| CN109388651B (en) | A data processing method and device | |
| CN114860762A (en) | Distributed data collection platform development and research method based on data lake storage | |
| CN105045571A (en) | Novel WebGIS architecture |
Legal Events
| Code | Title | Description |
|---|---|---|
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20170104 |