
CN103699656A - GPU-based mass-multimedia-data-oriented MapReduce platform - Google Patents

GPU-based mass-multimedia-data-oriented MapReduce platform

Info

Publication number
CN103699656A
Authority
CN
China
Prior art keywords
platform
data
gpu
hdfs
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310738761.2A
Other languages
Chinese (zh)
Inventor
王瀚漓
肖波
王雷
朱冯贶天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2013-12-27
Filing date
2013-12-27
Publication date
2014-04-02
Application filed by Tongji University
Priority to CN201310738761.2A
Publication of CN103699656A
Current legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a GPU-based MapReduce platform for massive multimedia data, comprising a platform driver and work submodules. The platform driver uses the MapReduce computing model to split an image/video retrieval processing task into several Map tasks, whose data is stored in HDFS. When each Map task starts, it obtains its task data from the file list passed in by the platform driver and assigns the concrete computation to the work submodule. The task scheduler in the work submodule dispatches the task to a GPU or a CPU for processing. During computation, the required data is read through the HDFS native library libhdfs.so, and the processed data is then written directly to HDFS. Compared with the prior art, the invention enables high-performance processing of massive multimedia data, greatly increasing computation speed while also maintaining computational accuracy.

Figure 201310738761

Description

A GPU-based MapReduce platform for massive multimedia data
Technical field
The present invention relates to the fields of massive data processing and high-performance computing, and in particular to a GPU-based MapReduce platform for massive multimedia data.
Background art
Since the information age entered the Web 2.0 era, the emergence of interactive systems for user-created multimedia, the popularity of new media such as network multimedia and mobile multimedia, and the spread of portable smart devices (e.g., iPhone, iPad, notebook computers) have caused the quantity of multimedia on the Internet (video, images, etc.) to grow explosively, to a massive scale. Massive numbers of pictures and videos are transmitted over the Internet, and searching for and viewing this rich picture and video content has become an important way for Internet users to obtain information. Faced with multimedia data at this scale, organizing, managing, and searching it effectively has become an urgent task and a research focus in fields such as multimedia, search engines, and data mining. This requires not only advanced algorithms for content-based analysis and understanding of video data; the enormous amount of computation involved also calls for support from cloud computing platforms, GPUs (Graphics Processing Units), and the like. Cloud computing is an emerging Internet-based computing model that aims to provide individual and enterprise users with on-demand computing through heterogeneous, autonomous services on the Internet. MapReduce is a distributed computing framework proposed by Google for realizing cloud computing. Cloud computing distributes computing tasks over a resource pool formed by a large number of computers, so that application systems can obtain computing power, storage space, and software services as needed.
In recent years, with the rapid growth of the integrated circuit and semiconductor industries, GPU computing performance has advanced rapidly. Meanwhile, the emergence of GPGPU (general-purpose computing on GPUs) means that the GPU is no longer confined to traditional graphics and image processing and display, but can also serve as a high-performance general-purpose computing device. CUDA is one such software architecture, introduced by NVIDIA for parallel computing on GPUs. Because GPU hardware has been advancing much faster than CPU hardware, GPU performance has multiplied, and GPUs are attracting growing attention from researchers and application engineers.
Video differs from traditional documents: its complex data has to be characterized by extracting a massive number of features, especially local feature points, which demands a large amount of computation. Analyzing and processing video data therefore places a huge burden on ordinary computer systems. Given the exponential growth of video information, especially Internet video, traditional computing and storage models struggle to analyze and process this volume of data. Cloud computing, with its technical advantages of large scale, scalability, and unstructured data processing, is an excellent platform and solution for this problem.
Summary of the invention
The object of the present invention is to overcome the defects of the prior art described above by providing a GPU-based MapReduce platform for massive multimedia data.
The object of the present invention can be achieved by the following technical solution: a GPU-based MapReduce platform for massive multimedia data, which uses computer clusters to process image/video retrieval tasks, each computer cluster being provided with multiple CPUs and GPUs, characterized in that the platform is built on CUDA and HDFS and comprises a platform driver and work submodules. The platform driver adopts the MapReduce computing model and schedules the master program on the cluster nodes to split the image/video retrieval processing task into several Map tasks, whose data is stored in HDFS. When each Map task starts, it obtains its task data from the file list passed in by the platform driver and assigns the concrete computation to a work submodule. The task scheduler in the work submodule dispatches the task to a GPU or a CPU for processing. During computation, the required data is obtained through the HDFS native library libhdfs.so, and the processed data is then written directly to HDFS.
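The splitting step can be pictured with a minimal sketch. It assumes, purely for illustration, that the driver partitions a flat list of HDFS paths into fixed-size file lists, one per Map task; the function name `split_into_map_tasks`, the `files_per_task` parameter, and the paths are hypothetical and do not come from the patent.

```python
from typing import List


def split_into_map_tasks(files: List[str], files_per_task: int) -> List[List[str]]:
    """Partition a list of file paths into per-task file lists (assumed policy)."""
    if files_per_task <= 0:
        raise ValueError("files_per_task must be positive")
    # Each slice becomes the file list handed to one Map task.
    return [files[i:i + files_per_task] for i in range(0, len(files), files_per_task)]


# Hypothetical HDFS paths, for illustration only.
files = [f"/data/images/img_{i:04d}.jpg" for i in range(10)]
tasks = split_into_map_tasks(files, 4)
for task_id, file_list in enumerate(tasks):
    print(task_id, len(file_list))
```

With ten files and four files per task, the last task receives the two remaining files. A real driver would more likely size tasks by bytes or by HDFS block locality than by file count.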
Protocol Buffer serialization is used as the transport protocol between the platform driver and the work submodules, to simplify data exchange between the two; at the same time, the two interact through JNI, to keep that interaction efficient.
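The patent does not give the message schema exchanged between the driver and a work submodule. The sketch below only illustrates the general idea of serializing a task description into bytes and decoding it on the other side. It uses Python's standard `struct` module as a stand-in for Protocol Buffers, and the fields (`task_id`, `use_gpu`, `paths`) are assumptions.

```python
import struct
from typing import List, Tuple

HEADER = "<IBI"  # little-endian: uint32 task id, uint8 GPU flag, uint32 path count


def encode_task(task_id: int, use_gpu: bool, paths: List[str]) -> bytes:
    """Serialize a hypothetical task message with length-prefixed UTF-8 paths."""
    parts = [struct.pack(HEADER, task_id, int(use_gpu), len(paths))]
    for path in paths:
        raw = path.encode("utf-8")
        parts.append(struct.pack("<I", len(raw)))
        parts.append(raw)
    return b"".join(parts)


def decode_task(buf: bytes) -> Tuple[int, bool, List[str]]:
    """Inverse of encode_task."""
    task_id, gpu_flag, count = struct.unpack_from(HEADER, buf, 0)
    offset = struct.calcsize(HEADER)
    paths = []
    for _ in range(count):
        (length,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        paths.append(buf[offset:offset + length].decode("utf-8"))
        offset += length
    return task_id, bool(gpu_flag), paths


message = encode_task(7, True, ["/a/b.jpg", "/c/d.jpg"])
decoded = decode_task(message)
```

In the platform itself, a schema-defined Protocol Buffer message would play the role of this hand-rolled format, with the resulting byte array passed across the Java/native boundary via JNI.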
The platform driver is written in Java and is an implementation and extension of the Hadoop framework for this specific application.
The work submodules are built on CUDA and written in C/C++ and CUDA C.
During computation, the work submodules use distributed caching to reduce HDFS network transfers when implementing image/video retrieval processing algorithms whose data is invariant, thereby improving the performance of the whole cluster.
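A rough sketch of why caching helps when the data is invariant: if a file never changes during a job, a node needs to fetch it over the network only once. The `LocalDataCache` class and the `fake_hdfs_read` function below are hypothetical stand-ins and do not reflect Hadoop's actual distributed cache API.

```python
from typing import Callable, Dict


class LocalDataCache:
    """Keep immutable data on the local node so it is fetched remotely at most once."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch
        self._store: Dict[str, bytes] = {}
        self.remote_reads = 0  # counts simulated network transfers

    def get(self, path: str) -> bytes:
        if path not in self._store:
            self._store[path] = self._fetch(path)
            self.remote_reads += 1
        return self._store[path]


def fake_hdfs_read(path: str) -> bytes:
    # Stand-in for a real HDFS read; returns dummy bytes.
    return ("contents of " + path).encode("utf-8")


cache = LocalDataCache(fake_hdfs_read)
for _ in range(3):
    centers = cache.get("/model/cluster_centers.bin")
print(cache.remote_reads)
```

Three lookups trigger a single simulated remote read. This is only safe because the cached data is assumed not to change for the lifetime of the job.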
The platform driver manages the platform's software and hardware resources and controls its workflow; its main work includes starting, splitting, and scheduling tasks and handling faults. The work submodules mainly implement image and video retrieval processing algorithms such as feature point extraction and clustering, and bear the heaviest computation in the platform. Under the management of the platform driver, different work submodules cooperate to complete a given task while remaining independent of one another, which eases maintenance and extension of the platform.
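The patent does not specify which clustering algorithm the work submodules run. As an illustration of the kind of computation involved, the sketch below shows the assignment step of a k-means-style clustering, in which each feature point is assigned to its nearest center. All names and the toy data are assumptions.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_to_centers(points: List[Sequence[float]],
                      centers: List[Sequence[float]]) -> List[int]:
    """Return, for each point, the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
        for p in points
    ]


# Toy 2-D data; real local feature descriptors have many more dimensions.
centers = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 2.0), (9.0, 8.0), (4.0, 4.0)]
labels = assign_to_centers(points, centers)
print(labels)
```

Each point's assignment is independent of the others, which is what makes this step a natural fit for both the Map phase and GPU execution.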
Compared with the prior art, the present invention is a complete theoretical and technical system for massive multimedia data analysis. The platform enables high-performance processing of massive multimedia data to meet service needs such as video content analysis, video retrieval, image retrieval, and event detection; it greatly increases computation speed while also maintaining computational accuracy.
Brief description of the drawings
Fig. 1 is a schematic diagram of the architecture of the present invention;
Fig. 2 is a functional block diagram of the platform driver of the present invention;
Fig. 3 is a functional block diagram of the work submodule of the present invention.
Detailed description of the embodiments
The present invention is described in detail below with reference to the drawings and a specific embodiment.
As shown in Figs. 1-3, a GPU-based MapReduce platform for massive multimedia data uses computer clusters to process image/video retrieval tasks, each computer cluster being provided with multiple CPUs and GPUs. The platform is built on CUDA and HDFS and comprises a platform driver 1 and work submodules 2. The platform driver 1 manages the platform's software and hardware resources and controls its workflow; its main work includes starting, splitting, and scheduling tasks and handling faults. The work submodules 2 mainly implement image and video retrieval processing algorithms such as feature point extraction and clustering, and bear the heaviest computation in the platform. Under the management of the platform driver 1, different work submodules 2 cooperate to complete a given task while remaining independent of one another, which eases maintenance and extension of the platform. Protocol Buffer serialization is used as the transport protocol between the platform driver 1 and the work submodules 2, to simplify data exchange between the two; at the same time, the two interact through JNI, to keep that interaction efficient. The platform driver 1 is written in Java and is an implementation and extension of the Hadoop framework for this specific application. The work submodules 2 are built on CUDA and written in C/C++ and CUDA C.
The platform driver 1 adopts the MapReduce computing model and schedules the master program on the cluster nodes to split the image/video retrieval processing task into several Map tasks, whose data is stored in HDFS. When each Map task starts, it obtains its task data from the file list passed in by the platform driver 1 and assigns the concrete computation to a work submodule 2. The task scheduler in the work submodule 2 dispatches the task to a GPU or a CPU for processing. During computation, the required data is obtained through the HDFS native library libhdfs.so, and the processed data is then written directly to HDFS. During computation, the work submodules 2 use distributed caching to reduce HDFS network transfers when implementing image/video retrieval processing algorithms whose data is invariant, thereby improving the performance of the whole cluster.
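The patent does not state the scheduler's dispatch policy. One plausible policy, sketched below as an assumption, sends a GPU-capable task to an idle GPU when one exists and otherwise falls back to an idle CPU. The `Device` type and `pick_device` function are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Device:
    name: str
    kind: str          # "gpu" or "cpu"
    busy: bool = False


def pick_device(devices: List[Device], gpu_capable: bool) -> Optional[Device]:
    """Assumed policy: prefer an idle GPU for GPU-capable tasks, else an idle CPU."""
    if gpu_capable:
        for device in devices:
            if device.kind == "gpu" and not device.busy:
                return device
    for device in devices:
        if device.kind == "cpu" and not device.busy:
            return device
    return None  # nothing idle; the caller would queue the task


# One CPU and two GPUs per node, as in the embodiment's cluster.
node = [Device("cpu0", "cpu"), Device("gpu0", "gpu"), Device("gpu1", "gpu")]
chosen = []
for _ in range(4):
    device = pick_device(node, gpu_capable=True)
    chosen.append(device.name if device else None)
    if device:
        device.busy = True
print(chosen)
```

Four GPU-capable tasks arriving in a row occupy both GPUs, then the CPU, and the fourth finds no idle device. A production scheduler would also release devices on completion and account for device memory.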
Embodiment: a large number of image/video retrieval processing experiments were carried out on a computer cluster consisting of 12 host nodes, each node containing one CPU and two GPUs. The experiments show that the platform not only greatly speeds up processing (by up to nearly 1500 times) but also greatly improves algorithm accuracy. The cluster configuration is shown in Table 1:
Table 1: Computer cluster configuration
Figure BDA0000447976230000041
As the table above shows, the platform can run on a cluster of ordinary, inexpensive PCs, needs no dedicated expensive server cluster, and performs no worse than the latter. In this embodiment, different data sets were tested on the platform, including MSR-Bing, Flickr100k, CCVideo, and Oxford; the number of images reached the million level and the number of feature points exceeded the hundred-million level. The speed-up ratios obtained when running the clustering algorithm on the Flickr100k image set are shown in Table 2:
Table 2: Speed-up ratios when running the clustering algorithm on the Flickr100k image set
Figure BDA0000447976230000051
Where:
S: stand-alone single-threaded program
C: the platform of the present invention with GPU acceleration disabled
C+G: the platform of the present invention with GPU acceleration enabled
Throughout the experiments the platform ran smoothly and needed essentially no human intervention or supervision. As Table 2 shows, with the GPU disabled the platform's speed-up ratio is proportional to the number of hosts; with the GPU enabled, the speed-up ratio of the whole cluster rises sharply, mainly thanks to the GPU's superior acceleration.
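For readers unfamiliar with the metric, the sketch below assumes the usual definition of speed-up ratio: the running time of the single-threaded baseline (S) divided by the running time on the platform. Dividing by the host count gives parallel efficiency, which stays near 1 when speed-up is proportional to the number of hosts. The timings are made up for illustration and are not taken from Table 2.

```python
def speedup(baseline_seconds: float, platform_seconds: float) -> float:
    """Speed-up ratio relative to the single-threaded baseline (assumed definition)."""
    if platform_seconds <= 0:
        raise ValueError("platform_seconds must be positive")
    return baseline_seconds / platform_seconds


def parallel_efficiency(speedup_value: float, hosts: int) -> float:
    """Speed-up per host; 1.0 means speed-up scales linearly with host count."""
    return speedup_value / hosts


# Hypothetical timings, not measurements from the patent.
s = speedup(baseline_seconds=1200.0, platform_seconds=100.0)
print(s, parallel_efficiency(s, hosts=12))
```

A speed-up above the host count, as reported for the GPU-enabled configuration, corresponds to an efficiency greater than 1 under this per-host definition, because each host contributes GPU throughput on top of its CPU.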
The accuracy of different image retrieval algorithms is shown in Table 3:
Table 3: Accuracy of different image retrieval algorithms implemented on the platform
Figure 20131073876121000022
In Table 3, the values 20K, 200K, and 1M in the first row are the numbers of cluster centers, and the values 0 and 1M in the second row are the numbers of distractor images added to the reference set, namely 0 and 1 million. Baseline (Inv), HE, and WGC denote three common image retrieval methods. As Table 3 shows, the accuracy of algorithms implemented on the platform also improves considerably, mainly because the platform can handle a larger volume of data at once, including large data volumes that other algorithms cannot process.

Claims (5)

1. A GPU-based MapReduce platform for massive multimedia data, which uses computer clusters to process image/video retrieval tasks, each computer cluster being provided with multiple CPUs and GPUs, characterized in that the platform is built on CUDA and HDFS and comprises a platform driver and work submodules; the platform driver adopts the MapReduce computing model and schedules the master program on the cluster nodes to split the image/video retrieval processing task into several Map tasks; the data of the Map tasks is stored in HDFS; when each Map task starts, it obtains its task data from the file list passed in by the platform driver and assigns the concrete computation to a work submodule; the task scheduler in the work submodule dispatches the task to a GPU or a CPU for processing; during computation, the required data is obtained through the HDFS native library libhdfs.so, and the processed data is then written directly to HDFS.
2. The GPU-based MapReduce platform for massive multimedia data according to claim 1, characterized in that Protocol Buffer serialization is used as the transport protocol between the platform driver and the work submodules, to simplify data exchange between the two, and that the two interact through JNI, to keep that interaction efficient.
3. The GPU-based MapReduce platform for massive multimedia data according to claim 1, characterized in that the platform driver is written in Java and is an implementation and extension of the Hadoop framework for this specific application.
4. The GPU-based MapReduce platform for massive multimedia data according to claim 1, characterized in that the work submodules are built on CUDA and written in C/C++ and CUDA C.
5. The GPU-based MapReduce platform for massive multimedia data according to claim 1, characterized in that, during computation, the work submodules use distributed caching to reduce HDFS network transfers when implementing image/video retrieval processing algorithms whose data is invariant, thereby improving the performance of the whole cluster.
CN201310738761.2A 2013-12-27 2013-12-27 GPU-based mass-multimedia-data-oriented MapReduce platform Pending CN103699656A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310738761.2A CN103699656A (en) 2013-12-27 2013-12-27 GPU-based mass-multimedia-data-oriented MapReduce platform

Publications (1)

Publication Number Publication Date
CN103699656A true CN103699656A (en) 2014-04-02

Family

ID=50361184

Country Status (1)

Country Link
CN (1) CN103699656A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110047172A1 (en) * 2009-08-20 2011-02-24 Qiming Chen Map-reduce and parallel processing in databases
CN102662639A (en) * 2012-04-10 2012-09-12 南京航空航天大学 Mapreduce-based multi-GPU (Graphic Processing Unit) cooperative computing method
CN102708088A (en) * 2012-05-08 2012-10-03 北京理工大学 CPU/GPU (Central Processing Unit/ Graphic Processing Unit) cooperative processing method oriented to mass data high-performance computation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HANLI WANG ET AL.: "Large-Scale Multimedia Data Mining Using MapReduce Framework", 《2012 IEEE 4TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING TECHNOLOGY AND SCIENCE》 *
HE B ET AL.: "Mars: A MapReduce framework on graphics", 《IN: PROC. PACT’08》 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105094981A (en) * 2014-05-23 2015-11-25 华为技术有限公司 Method and device for processing data
CN105094981B (en) * 2014-05-23 2019-02-12 华为技术有限公司 A method and device for data processing
CN105049485B (en) * 2015-06-09 2018-10-16 中国石油大学(华东) A kind of Load-aware cloud computing system towards real time video processing
CN105049485A (en) * 2015-06-09 2015-11-11 中国石油大学(华东) Real-time video processing oriented load-aware cloud calculation system
CN105243160A (en) * 2015-10-28 2016-01-13 西安美林数据技术股份有限公司 Mass data-based distributed video processing system
CN105263050A (en) * 2015-11-04 2016-01-20 山东大学 Mobile terminal real-time rendering system and method based on cloud platform
CN105263050B (en) * 2015-11-04 2018-01-12 山东大学 Mobile terminal real-time rendering system and method based on cloud platform
CN106604063A (en) * 2016-12-28 2017-04-26 北京恒华伟业科技股份有限公司 Video retrieving method and apparatus
CN107038482A (en) * 2017-04-21 2017-08-11 上海极链网络科技有限公司 Applied to AI algorithm engineerings, the Distributed Architecture of systematization
CN107273435A (en) * 2017-05-23 2017-10-20 北京环境特性研究所 Video personnel's fuzzy search parallel method based on MapReduce
CN107861723A (en) * 2017-10-25 2018-03-30 深圳市华成峰科技有限公司 Mass data processing method and its system
CN108762915A (en) * 2018-04-19 2018-11-06 上海交通大学 A method of caching RDF data in GPU memories
CN108762915B (en) * 2018-04-19 2020-11-06 上海交通大学 Method for caching RDF data in GPU memory
CN111507466A (en) * 2019-01-30 2020-08-07 北京沃东天骏信息技术有限公司 Data processing method and device, electronic equipment and readable medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140402
