
CN113127548A - File merging method, device, equipment and storage medium - Google Patents

File merging method, device, equipment and storage medium

Info

Publication number
CN113127548A
CN113127548A (application number CN201911418125.5A; granted as CN113127548B)
Authority
CN
China
Prior art keywords
merging
files
partition
file
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911418125.5A
Other languages
Chinese (zh)
Other versions
CN113127548B (en)
Inventor
李营
张超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Secworld Information Technology Beijing Co Ltd
Qax Technology Group Inc
Original Assignee
Secworld Information Technology Beijing Co Ltd
Qax Technology Group Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Secworld Information Technology Beijing Co Ltd and Qax Technology Group Inc
Priority to CN201911418125.5A
Publication of CN113127548A
Application granted
Publication of CN113127548B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 - Integrating or interfacing systems involving database management systems
    • G06F16/254 - Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a file merging method, device, equipment and storage medium. The method includes: acquiring the files of each partition of HIVE from HDFS, and determining, from the files of each partition, the target files that need to be merged; calculating the merging effect priority of each partition according to a preset merging effect model based on that partition's target files, where the merging effect priority is positively correlated with the degree to which merging the target files relieves the memory of the HDFS NameNode; and starting a corresponding merging task for each partition, then executing the merging tasks one after another in order of each partition's merging effect priority, so as to merge the target files on the partitions. The invention can merge the small files generated by HIVE more efficiently and relieve the memory pressure on the HDFS NameNode as soon as possible.

Figure 201911418125

Description

File merging method, device, equipment and storage medium
Technical Field
The present invention relates to the field of big data technologies, and in particular, to a method, an apparatus, a device, and a storage medium for merging files.
Background
HIVE is a data warehouse tool based on Hadoop and is used for data extraction, transformation and loading. HIVE is generally used together with HDFS (Hadoop Distributed File System), so that all the files generated by HIVE are stored in HDFS. When a service makes heavy use of HIVE, a large number of small files are generated in each HIVE partition; because these small files are stored in HDFS, the memory pressure on the HDFS NameNode increases with the number of small files, and the read-write performance of the whole HDFS cluster is also affected. Therefore, how to quickly and effectively merge the small files in each partition so as to promptly relieve the NameNode's memory pressure has become a technical problem that needs to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to provide a method, a device, equipment and a storage medium for merging files, which can more efficiently merge small files generated by HIVE and relieve the memory pressure of NameNode of HDFS as soon as possible.
According to an aspect of the present invention, there is provided a method of file merging, the method comprising:
acquiring files of all partitions of the HIVE from the HDFS, and determining target files needing to be combined from the files of all the partitions;
calculating the merging effect priority of each partition according to a preset merging effect model based on the target file of each partition; the merging effect priority is positively correlated with the remission degree of the merged target file on the NameNode memory of the HDFS;
and respectively starting a corresponding merging task for each partition, and executing each merging task in sequence according to the merging effect priority of each partition so as to merge the files of the target files on the partitions.
Optionally, the step of determining a target file to be merged from the files of each partition specifically includes:
for all the files of a partition, setting the files whose size is smaller than a preset threshold as the target files of that partition.
Optionally, the step of calculating the merging effect priority of each partition according to a preset merging effect model based on the target file of each partition specifically includes:
counting the total number of files and/or the total size of the files of all target files of one partition;
inputting the total number of the files and/or the total size of the files as characteristic parameters into the preset merging effect model, and obtaining the merging effect priority of the partitions by operating the preset merging effect model; and the preset merging effect model is obtained by training through a machine learning algorithm.
Optionally, the step of respectively starting a corresponding merged task for each partition, and executing each merged task in sequence according to the merging effect priority of each partition specifically includes:
acquiring the total amount of resources used for processing file combination on the Yarn;
determining the resource amount required by each merging task;
sequencing all the merged tasks according to the merging effect priority of each partition to obtain a task sequencing result;
dividing the task sequencing result into a plurality of merging batches according to the total resource amount and the resource amount required by each merging task; wherein one merged batch comprises a plurality of merged tasks;
and successively deploying the plurality of merging tasks included in each merging batch to the Yarn in a distributed manner, so that the plurality of merging tasks are executed simultaneously through the Yarn.
Optionally, the method further includes:
after all target files of one partition are merged into one or more merged files, storing all the merged files under a temporary directory;
counting the total number of first data contained in all target files of the partition, and counting the total number of second data contained in all merged files of the partition;
judging whether the total number of the first data is consistent with the total number of the second data, and if so, replacing the target file on the partition with the merged file under the temporary directory; and if not, deleting the merged file under the temporary directory.
Optionally, after the step of replacing the target file on the partition with the merged file under the temporary directory, the method further includes:
and storing all the target files on the partition under a backup directory, and deleting all the target files under the backup directory after a preset time period.
According to another aspect of the present invention, there is provided an apparatus for file merging, the apparatus including:
the determining module is used for acquiring files of all partitions of the HIVE from the HDFS and determining target files needing to be combined from the files of all the partitions;
the calculation module is used for calculating the merging effect priority of each partition according to a preset merging effect model based on the target file of each partition; the merging effect priority is positively correlated with the remission degree of the merged target file on the NameNode memory of the HDFS;
and the execution module is used for respectively starting the corresponding merging tasks for each partition, and executing each merging task in sequence according to the merging effect priority of each partition so as to merge the target files on the partitions.
Optionally, the apparatus further comprises:
the verification module is used for storing all the merged files into a temporary directory after merging all the target files of one partition into one or more merged files; counting the total number of first data contained in all target files of the partition, and counting the total number of second data contained in all merged files of the partition; judging whether the total number of the first data is consistent with the total number of the second data, and if so, replacing the target file on the partition with the merged file under the temporary directory; and if not, deleting the merged file under the temporary directory.
According to another aspect of the present invention, there is provided a computer device, specifically including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above-described steps of the file merging method when executing the computer program.
According to another aspect of the present invention, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the above-mentioned steps of the method for file merging.
According to the file merging method, device, equipment and storage medium provided by the invention, the merging effect priority of each HIVE partition is calculated so that the partitions can be sorted, and the small HIVE files of the partitions with higher merging effect priority are preferentially merged according to the sorting result. Because the merging effect priority is positively correlated with the degree to which merging the small files relieves the HDFS NameNode memory, preferentially merging the files of the partitions with higher merging effect priority relieves the NameNode memory pressure as soon as possible. In addition, since the physical resources available for file merging are limited, the invention can merge small HIVE files intelligently under limited physical resources, rapidly alleviating the HIVE small-file problem and improving cluster performance.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is an alternative flowchart of a file merging method according to an embodiment;
fig. 2 is a schematic diagram of an alternative composition structure of the apparatus for merging files provided in the second embodiment;
fig. 3 is a schematic diagram of an alternative hardware architecture of the computer device according to the third embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
The embodiment of the invention provides a method for merging files, which specifically comprises the following steps as shown in fig. 1:
step S101: and acquiring files of all partitions of the HIVE from the HDFS, and determining a target file to be merged from the files of each partition.
HDFS (Hadoop Distributed File System) is the storage foundation for distributed computing over streaming data and was developed to meet the requirement of accessing and processing very large files in a streaming manner; HDFS is used to store massive amounts of data and is characterized by high fault tolerance, high reliability, high scalability, high availability and high throughput.
HIVE is a data warehouse tool based on Hadoop that is used for data extraction, transformation and loading, and provides a mechanism for storing, querying and analyzing large-scale data stored in Hadoop; the HIVE data warehouse tool can map structured data files to database tables, provides an SQL query capability, and can convert SQL statements into MapReduce tasks for execution.
In practical applications, the files generated by HIVE are stored in HDFS by partition. Because the HIVE mechanism generates a large number of small files, storing them all in HDFS occupies memory on the HDFS manager node, the NameNode, which in turn limits the horizontal scalability of HDFS.
Specifically, the step of determining a target file to be merged from the files of each partition includes:
and aiming at all files of one partition, setting the file with the file size smaller than a preset threshold value as a target file of the partition.
Preferably, the preset threshold ranges from 128 MB to 256 MB; this embodiment is mainly intended to merge small HIVE files so as to relieve the memory pressure of the HDFS NameNode.
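As an illustration of the selection rule above, the following minimal Python sketch filters a partition's file listing by the preset threshold. The file names, data layout, and helper are hypothetical; a real implementation would list partition files through the HDFS API.

```python
# Hypothetical sketch of the target-file selection rule described above:
# any file in a partition smaller than the preset threshold (128 MB here,
# the low end of the preferred 128-256 MB range) is a small file to merge.
SMALL_FILE_THRESHOLD = 128 * 1024 * 1024  # bytes

def select_target_files(partition_files):
    """partition_files: list of (path, size_in_bytes) tuples for one partition."""
    return [path for path, size in partition_files if size < SMALL_FILE_THRESHOLD]

files = [("/hive/t/p=1/a.orc", 5 * 1024 * 1024),    # 5 MB: below threshold
         ("/hive/t/p=1/b.orc", 300 * 1024 * 1024)]  # 300 MB: already large
print(select_target_files(files))  # only the 5 MB file is selected
```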
Step S102: calculating the merging effect priority of each partition according to a preset merging effect model based on the target file of each partition; and the merging effect priority is positively correlated with the remission degree of the merged target file on the NameNode memory of the HDFS.
In this embodiment, the merging effect of each partition is evaluated through a preset merging effect model so as to calculate each partition's merging effect priority; the higher the merging effect priority, the better the merging effect. Concretely, a good merging effect means merging the most target files within the least merging time, thereby relieving the NameNode memory to the greatest extent. In this embodiment, the merging effect priority of a partition is determined along three dimensions: merging time, the number of merged files, and the amount of NameNode memory released.
Specifically, step S102 includes:
step A1: counting the total number of files and/or the total size of the files of all target files of one partition;
step A2: inputting the total number of the files and/or the total size of the files as characteristic parameters into the preset merging effect model, and obtaining the merging effect priority of the partitions by operating the preset merging effect model; and the preset merging effect model is obtained by training through a machine learning algorithm.
In this embodiment, a merging effect model may be trained in advance through a machine learning algorithm, so that the merging effect priority of each partition can be calculated with the trained model. When the merging effect model is used, the total number of files and/or the total size of files of each partition is input as characteristic parameters, and the model's output parameter is the merging effect priority. Of course, in practical applications, other characteristic parameters may also be used as inputs to the merging effect model, which is not limited here.
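The text leaves the trained model unspecified, so the sketch below only shows how the two characteristic parameters could be computed and fed to a simple hand-written scoring function. The scoring formula is an illustrative assumption, not the machine-learned model described above.

```python
# Illustrative stand-in for the trained merging effect model; the real model
# in the text is produced by machine learning and its form is not disclosed.
MB_128 = 128 * 1024 * 1024

def partition_features(target_files):
    """Compute the two characteristic parameters named in the text:
    total number of files and total size of files, for one partition."""
    total_count = len(target_files)
    total_size = sum(size for _, size in target_files)
    return total_count, total_size

def merging_effect_priority(total_count, total_size):
    # Heuristic assumption: merging many files frees many NameNode entries
    # (good), while a larger total size costs more merge time (bad), so the
    # score rises with the file count and falls with the total size.
    return total_count / (1 + total_size / MB_128)
```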
Step S103: and respectively starting a corresponding merging task for each partition, and executing each merging task in sequence according to the merging effect priority of each partition so as to merge the files of the target files on the partitions.
One partition corresponds to one merging task, and executing the merging task merges the target files on that partition. Since the HIVE mechanism requires each table to specify a uniform data format, the data format under each partition is uniform, for example: orc, lzo, text, avro or parquet. Preferably, target files in the orc, lzo, text and avro formats are merged by starting a MapReduce task, while target files in the parquet format are merged by starting a SparkSQL task.
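The format-based choice of merging engine just described can be sketched as a simple dispatch; the returned strings are illustrative labels, not real job-submission APIs.

```python
# Sketch of the per-format engine choice stated above.
def merge_engine_for(data_format):
    if data_format in ("orc", "lzo", "text", "avro"):
        return "mapreduce"  # merged by starting a MapReduce task
    if data_format == "parquet":
        return "sparksql"   # merged by starting a SparkSQL task
    raise ValueError(f"unsupported partition data format: {data_format}")
```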
Because the resources used for merging files must not occupy too many of the queue resources of online services, the merging tasks of the partitions with higher merging effect priority need to be executed first, so that the NameNode memory pressure is relieved as soon as possible.
In addition, in practical applications, all the partitions can be ranked through a machine learning algorithm according to each partition's target files to obtain a partition ranking result; the earlier a partition appears in the ranking, the better its merging effect. The merging tasks of all the partitions are then executed in the order given by the partition ranking result.
Specifically, step S103 includes:
step B1: acquiring the total amount of resources used for processing file combination on the Yarn;
step B2: determining the resource amount required by each merging task;
step B3: sequencing all the merged tasks according to the merging effect priority of each partition to obtain a task sequencing result;
step B4: dividing the task sequencing result into a plurality of merging batches according to the total resource amount and the resource amount required by each merging task; wherein one merged batch comprises a plurality of merged tasks;
step B5: and successively deploying the plurality of merging tasks included in each merging batch to the Yarn in a distributed manner, so that the plurality of merging tasks are executed simultaneously through the Yarn.
Yarn is Hadoop's newer resource manager, a general-purpose resource management system that provides unified resource management and scheduling for upper-layer applications. If the merging tasks were executed on only a single node, the number of target files merged would fall far behind the number of small files HIVE generates; therefore, multiple merging tasks need to be deployed on the Yarn queue in a distributed manner, and the number of merging tasks to execute concurrently in each batch is determined according to the queue resources on Yarn.
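Steps B1 to B5 amount to a priority-ordered greedy batching of merging tasks under a resource budget. A minimal sketch follows; the task tuples and resource units are hypothetical abstractions of Yarn queue resources, not a real scheduler.

```python
def plan_merge_batches(tasks, total_resources):
    """tasks: list of (priority, required_resources) pairs, one per partition.
    Returns batches of tasks; each batch fits within total_resources and is
    meant to run concurrently on Yarn (resource units are abstract here)."""
    ordered = sorted(tasks, key=lambda t: t[0], reverse=True)  # step B3
    batches, current, used = [], [], 0
    for task in ordered:                                       # step B4
        _, need = task
        if used + need > total_resources and current:
            batches.append(current)   # close the batch once the budget is full
            current, used = [], 0
        current.append(task)          # an oversized task still runs, alone
        used += need
    if current:
        batches.append(current)
    return batches                    # step B5 would submit batch by batch

print(plan_merge_batches([(9, 3), (8, 3), (5, 2), (1, 1)], total_resources=5))
```

Higher-priority tasks are placed first, so the partitions whose merging best relieves the NameNode run in the earliest batches.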
Further, the method further comprises:
step C1: after all target files of one partition are merged into one or more merged files, storing all the merged files under a temporary directory;
step C2: counting the total number of first data contained in all target files of the partition, and counting the total number of second data contained in all merged files of the partition;
step C3: judging whether the total number of the first data is consistent with the total number of the second data, if so, replacing the target file on the partition by using the merged file under the temporary target; and if not, deleting the merged file under the temporary directory.
In this embodiment, in order to prevent data from being damaged or lost during file merging and to improve data security, a verification comparison is performed before and after the merging operation: it is judged whether the total number of first data records contained in the partition's target files before merging is consistent with the total number of second data records contained in the merged files. If so, no data was lost during merging and the file replacement may proceed; if not, data was lost during merging, so the original target files are retained and the merged files are deleted. Preferably, the total number of first data records contained in all target files before merging and the total number of second data records contained in all merged files after merging are both obtained through SparkSQL.
It should be noted that a maximum size for the merged files is configured in advance; during merging, whenever the merged file being written reaches this maximum size, a new merged file is created under the temporary directory, so a set of small target files may end up merged into several large merged files.
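The rollover behavior just noted (starting a new merged file whenever the configured maximum size would be exceeded) can be sketched as a simple packing routine; sizes here are abstract byte counts, whereas the real implementation writes records into files.

```python
def pack_into_merged_files(target_sizes, max_merged_size):
    """Simulate the rollover: accumulate small-file sizes into one merged
    file, starting a new merged file when the configured maximum would be
    exceeded. Returns the resulting merged-file sizes."""
    merged, current = [], 0
    for size in target_sizes:
        if current and current + size > max_merged_size:
            merged.append(current)
            current = 0
        current += size
    if current:
        merged.append(current)
    return merged

# Four small files packed under a 256-unit maximum become two merged files.
print(pack_into_merged_files([100, 100, 100, 50], max_merged_size=256))
```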
Further, after the step of replacing the target file on the partition with the merged file under the temporary directory, the method further comprises:
and storing all the target files on the partition under a backup directory, and deleting all the target files under the backup directory after a preset time period.
In this embodiment, in order to ensure data security, the target files are not deleted immediately when the merged files replace them; instead, they are deleted only after a preset period of time has elapsed and the result has been manually confirmed to be correct.
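The delayed-deletion policy can be sketched as a retention check over backup timestamps. The directory layout and timestamps are hypothetical, and per the text, actual deletion should additionally wait for manual confirmation.

```python
def purge_expired_backups(backups, retention_seconds, now):
    """backups: list of (path, backup_timestamp) pairs for the backed-up
    target files. Returns the paths whose retention period has elapsed and
    which are therefore candidates for deletion."""
    return [path for path, ts in backups if now - ts >= retention_seconds]
```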
Example two
The embodiment of the invention provides a file merging device, and as shown in fig. 2, the device specifically comprises the following components:
a determining module 201, configured to obtain files of each partition of the HIVE from the HDFS, and determine a target file to be merged from the files of each partition;
a calculating module 202, configured to calculate, based on the target file of each partition, a merging effect priority of each partition according to a preset merging effect model; the merging effect priority is positively correlated with the remission degree of the merged target file on the NameNode memory of the HDFS;
the execution module 203 is configured to start a corresponding merging task for each partition, and execute each merging task in sequence according to the merging effect priority of each partition, so as to perform file merging on the target files on the partitions.
Specifically, the determining module 201 is configured to:
and aiming at all files of one partition, setting the file with the file size smaller than a preset threshold value as a target file of the partition.
Preferably, the preset threshold ranges from 128 MB to 256 MB; this embodiment is mainly intended to merge small HIVE files so as to relieve the memory pressure of the HDFS NameNode.
A calculation module 202 configured to:
counting the total number of files and/or the total size of the files of all target files of one partition; inputting the total number of the files and/or the total size of the files as characteristic parameters into the preset merging effect model, and obtaining the merging effect priority of the partitions by operating the preset merging effect model; and the preset merging effect model is obtained by training through a machine learning algorithm.
In this embodiment, the merging effect of each partition is evaluated through a preset merging effect model so as to calculate each partition's merging effect priority; the higher the merging effect priority, the better the merging effect. Concretely, a good merging effect means merging the most target files within the least merging time, thereby relieving the NameNode memory to the greatest extent.
An execution module 203 for:
acquiring the total amount of resources used for processing file combination on the Yarn; determining the resource amount required by each merging task; sequencing all the merged tasks according to the merging effect priority of each partition to obtain a task sequencing result; dividing the task sequencing result into a plurality of merging batches according to the total resource amount and the resource amount required by each merging task; wherein one merged batch comprises a plurality of merged tasks; and successively deploying the plurality of merging tasks included in each merging batch to the Yarn in a distributed manner, so that the plurality of merging tasks are executed simultaneously through the Yarn.
One partition corresponds to one merging task, and executing the merging task merges the target files on that partition. Since the HIVE mechanism requires each table to specify a uniform data format, the data format under each partition is uniform, for example: orc, lzo, text, avro or parquet. Preferably, target files in the orc, lzo, text and avro formats are merged by starting a MapReduce task, while target files in the parquet format are merged by starting a SparkSQL task.
Because the resources used for merging files must not occupy too many of the queue resources of online services, the merging tasks of the partitions with higher merging effect priority need to be executed first, so that the NameNode memory pressure is relieved as soon as possible.
In addition, in practical applications, all the partitions can be ranked through a machine learning algorithm according to each partition's target files to obtain a partition ranking result; the earlier a partition appears in the ranking, the better its merging effect. The merging tasks of all the partitions are then executed in the order given by the partition ranking result.
Further, the apparatus further comprises:
the verification module is used for storing all the merged files into a temporary directory after merging all the target files of one partition into one or more merged files; counting the total number of first data contained in all target files of the partition, and counting the total number of second data contained in all merged files of the partition; judging whether the total number of the first data is consistent with the total number of the second data, and if so, replacing the target file on the partition with the merged file under the temporary directory; and if not, deleting the merged file under the temporary directory.
In this embodiment, in order to prevent data from being damaged or lost during file merging and to improve data security, a verification comparison is performed before and after the merging operation: it is judged whether the total number of first data records contained in the partition's target files before merging is consistent with the total number of second data records contained in the merged files. If so, no data was lost during merging and the file replacement may proceed; if not, data was lost during merging, so the original target files are retained and the merged files are deleted. Preferably, the total number of first data records contained in all target files before merging and the total number of second data records contained in all merged files after merging are both obtained through SparkSQL.
It should be noted that a maximum size for the merged files is configured in advance; during merging, whenever the merged file being written reaches this maximum size, a new merged file is created under the temporary directory, so a set of small target files may end up merged into several large merged files.
Still further, the apparatus further comprises:
and the backup module is used for storing all the target files on the partition into a backup directory after the target files on the partition have been replaced with the merged files under the temporary directory, and deleting all the target files under the backup directory after a preset time period.
In this embodiment, in order to ensure data security, the target files are not deleted immediately when the merged files replace them; instead, they are deleted only after a preset period of time has elapsed and the result has been manually confirmed to be correct.
EXAMPLE III
This embodiment also provides a computer device, such as a smartphone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server or a rack-mounted server (including an independent server or a server cluster composed of multiple servers) capable of executing programs. As shown in fig. 3, the computer device 30 of this embodiment includes at least, but is not limited to: a memory 301 and a processor 302 that are communicatively coupled to each other via a system bus. It is noted that fig. 3 only shows the computer device 30 with components 301 and 302, but it should be understood that not all of the shown components are required, and more or fewer components may be implemented instead.
In this embodiment, the memory 301 (i.e., the readable storage medium) includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the storage 301 may be an internal storage unit of the computer device 30, such as a hard disk or a memory of the computer device 30. In other embodiments, the memory 301 may also be an external storage device of the computer device 30, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like, provided on the computer device 30. Of course, the memory 301 may also include both internal and external storage devices for the computer device 30. In the present embodiment, the memory 301 is generally used for storing an operating system and various types of application software installed in the computer device 30. In addition, the memory 301 may also be used to temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor 302 may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 302 generally controls the overall operation of the computer device 30.
Specifically, in this embodiment, the processor 302 is configured to execute a program of the file merging method stored in the memory 301, and when executed, the program of the file merging method implements the following steps:
acquiring the files of each partition of the HIVE from the HDFS, and determining, from the files of each partition, the target files that need to be merged;
calculating a merging effect priority for each partition according to a preset merging effect model based on the target files of the partition; the merging effect priority is positively correlated with the degree to which merging the target files relieves the NameNode memory of the HDFS;
and starting a corresponding merging task for each partition, and executing the merging tasks in sequence according to the merging effect priorities of the partitions, so as to merge the target files on the partitions.
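The three steps above can be sketched with plain in-memory stand-ins for the HDFS file listing, the merging effect model, and Yarn; the 64 MB small-file threshold and the count-based scoring below are invented for illustration and are not taken from the patent.

```python
# Sketch of the three claimed steps. All names and thresholds here are
# hypothetical stand-ins, not the patent's actual implementation.

SMALL_FILE_THRESHOLD = 64 * 1024 * 1024  # assumed 64 MB small-file cutoff

def select_targets(files, threshold=SMALL_FILE_THRESHOLD):
    # Step 1: files below the threshold become the partition's merge targets.
    return [f for f in files if f["size"] < threshold]

def merge_priority(targets):
    # Step 2 (stand-in model): the more small files a partition holds, the
    # more NameNode metadata a merge frees, so the higher the priority.
    return len(targets)

def order_merge_tasks(partitions):
    # Step 3: one merge task per partition, executed in priority order.
    return sorted(partitions,
                  key=lambda p: merge_priority(select_targets(partitions[p])),
                  reverse=True)

partitions = {
    "dt=2019-12-30": [{"size": 1 << 20}] * 5,  # five 1 MB files
    "dt=2019-12-31": [{"size": 1 << 20}] * 2,  # two 1 MB files
}
print(order_merge_tasks(partitions))  # the partition with more small files first
```

A real deployment would list files through an HDFS client and submit the per-partition merge jobs to Yarn; the ordering logic itself is unchanged.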
For the specific implementation of the above method steps, reference may be made to the first embodiment; the details are not repeated here.
Example four
The present embodiment also provides a computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Programmable Read-Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, or an App application mall, having stored thereon a computer program that, when executed by a processor, implements the following method steps:
acquiring the files of each partition of the HIVE from the HDFS, and determining, from the files of each partition, the target files that need to be merged;
calculating a merging effect priority for each partition according to a preset merging effect model based on the target files of the partition; the merging effect priority is positively correlated with the degree to which merging the target files relieves the NameNode memory of the HDFS;
and starting a corresponding merging task for each partition, and executing the merging tasks in sequence according to the merging effect priorities of the partitions, so as to merge the target files on the partitions.
For the specific implementation of the above method steps, reference may be made to the first embodiment; the details are not repeated here.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A method of file merging, the method comprising:
acquiring the files of each partition of the HIVE from the HDFS, and determining, from the files of each partition, the target files that need to be merged;
calculating a merging effect priority for each partition according to a preset merging effect model based on the target files of the partition; the merging effect priority is positively correlated with the degree to which merging the target files relieves the NameNode memory of the HDFS;
and starting a corresponding merging task for each partition, and executing the merging tasks in sequence according to the merging effect priorities of the partitions, so as to merge the target files on the partitions.
2. The method for merging files according to claim 1, wherein the step of determining the target file to be merged from the files of each partition specifically comprises:
for all files of one partition, setting each file whose file size is smaller than a preset threshold as a target file of the partition.
3. The method for merging files according to claim 1, wherein the step of calculating the merging effect priority of each partition according to a preset merging effect model based on the target file of each partition specifically comprises:
counting the total number of files and/or the total size of the files of all target files of one partition;
inputting the total number of files and/or the total size of files as characteristic parameters into the preset merging effect model, and obtaining the merging effect priority of the partition by running the preset merging effect model; the preset merging effect model is obtained through training with a machine learning algorithm.
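As an illustration of the interface described in claim 3 (and not of the trained model itself, which the patent obtains by machine learning), a hand-written scoring function over the two characteristic parameters might look like the following; the linear coefficients are invented purely to show the feature-in, priority-out shape:

```python
# Hypothetical stand-in for the "preset merging effect model". Any
# callable mapping (total file count, total file size) to a priority
# score fits the claimed interface; the coefficients below are invented.

MB = 1024 * 1024

def merging_effect_model(total_count, total_size):
    # Many small files tie up more NameNode memory than a few large ones,
    # so the count dominates the score and the size slightly discounts it.
    return 0.8 * total_count - 0.2 * (total_size / (64 * MB))

def partition_priority(target_files):
    total_count = len(target_files)                     # feature 1
    total_size = sum(f["size"] for f in target_files)   # feature 2
    return merging_effect_model(total_count, total_size)
```

Under this stand-in, a partition with ten 1 MB target files scores higher than a partition with two 60 MB target files, matching the claimed positive correlation between priority and the NameNode memory relieved by merging.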
4. The method for merging files according to claim 1, wherein the step of respectively starting a corresponding merging task for each partition and executing each merging task in sequence according to the merging effect priority of each partition specifically comprises:
acquiring the total amount of resources used for processing file combination on the Yarn;
determining the resource amount required by each merging task;
sequencing all the merging tasks according to the merging effect priority of each partition to obtain a task sequencing result;
dividing the task sequencing result into a plurality of merging batches according to the total resource amount and the resource amount required by each merging task; wherein one merging batch comprises a plurality of merging tasks;
and successively deploying the plurality of merging tasks included in each merging batch onto the Yarn in a distributed manner, so that the plurality of merging tasks are executed simultaneously through the Yarn.
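A minimal sketch of the claimed batching, assuming each merging task's resource demand is measured in vcores and the Yarn queue exposes a fixed total (both assumptions for illustration; the patent does not fix the resource unit):

```python
def split_into_batches(ordered_tasks, total_vcores):
    # Greedily pack priority-ordered merging tasks into batches whose
    # combined demand fits the available Yarn resources; each batch would
    # then be submitted to Yarn and its tasks executed concurrently.
    batches, current, used = [], [], 0
    for task in ordered_tasks:
        if current and used + task["vcores"] > total_vcores:
            batches.append(current)   # current batch is full; start the next
            current, used = [], 0
        current.append(task)
        used += task["vcores"]
    if current:
        batches.append(current)
    return batches

tasks = [{"name": "p1", "vcores": 4}, {"name": "p2", "vcores": 4},
         {"name": "p3", "vcores": 3}, {"name": "p4", "vcores": 2}]
print(split_into_batches(tasks, total_vcores=8))
```

With 8 vcores available, the four tasks above fall into two batches (p1+p2, then p3+p4), so the highest-priority merges run first without oversubscribing the queue.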
5. The method of file merging according to claim 1, further comprising:
after all target files of one partition are merged into one or more merged files, storing all the merged files under a temporary directory;
counting the total number of first data contained in all target files of the partition, and counting the total number of second data contained in all merged files of the partition;
judging whether the total number of the first data is consistent with the total number of the second data; if so, replacing the target files on the partition with the merged files under the temporary directory; and if not, deleting the merged files under the temporary directory.
6. The method of file merging according to claim 5, wherein after the step of replacing the target file on the partition with the merged file under the temporary directory, the method further comprises:
and storing all the target files on the partition under a backup directory, and deleting all the target files under the backup directory after a preset time period.
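The count-based consistency check of claims 5 and 6 reduces to comparing record totals before the merged output leaves its temporary directory. A sketch under the simplifying assumption that a "directory" is a dict mapping file name to its record count (real code would count rows in HDFS files):

```python
# Hypothetical sketch of the claim-5/6 verification path; directory
# contents are modelled as {file_name: record_count} dicts, which is
# enough to show the decision logic.

def verify_merge(target_dir, temp_dir):
    # Swap only when the merged files hold exactly as many records as
    # the original target files; otherwise discard the merge output.
    return sum(target_dir.values()) == sum(temp_dir.values())

def finalize(target_dir, temp_dir, backup_dir):
    if verify_merge(target_dir, temp_dir):
        backup_dir.update(target_dir)   # claim 6: keep targets as a backup
        target_dir.clear()
        target_dir.update(temp_dir)     # replace targets with merged files
        return "replaced"
    temp_dir.clear()                    # claim 5: drop inconsistent output
    return "discarded"
```

The backup directory would additionally be purged after the preset time period of claim 6; that timer is omitted here to keep the decision logic visible.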
7. An apparatus for merging files, the apparatus comprising:
the determining module is used for acquiring the files of each partition of the HIVE from the HDFS and determining, from the files of each partition, the target files that need to be merged;
the calculation module is used for calculating a merging effect priority for each partition according to a preset merging effect model based on the target files of the partition; the merging effect priority is positively correlated with the degree to which merging the target files relieves the NameNode memory of the HDFS;
and the execution module is used for starting a corresponding merging task for each partition and executing the merging tasks in sequence according to the merging effect priorities of the partitions, so as to merge the target files on the partitions.
8. The apparatus for file merging according to claim 7, wherein the apparatus further comprises:
the verification module is used for storing all the merged files into a temporary directory after merging all the target files of one partition into one or more merged files; counting the total number of first data contained in all target files of the partition, and counting the total number of second data contained in all merged files of the partition; judging whether the total number of the first data is consistent with the total number of the second data; if so, replacing the target files on the partition with the merged files under the temporary directory; and if not, deleting the merged files under the temporary directory.
9. A computer device, the computer device comprising: memory, processor and computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN201911418125.5A 2019-12-31 2019-12-31 File merging method, device, equipment and storage medium Active CN113127548B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911418125.5A CN113127548B (en) 2019-12-31 2019-12-31 File merging method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113127548A true CN113127548A (en) 2021-07-16
CN113127548B CN113127548B (en) 2023-10-31

Family

ID=76770698

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911418125.5A Active CN113127548B (en) 2019-12-31 2019-12-31 File merging method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113127548B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116089391A (en) * 2022-12-29 2023-05-09 南京苏宁软件技术有限公司 File data processing method, device, equipment and medium
CN116795790A (en) * 2022-03-18 2023-09-22 腾讯科技(深圳)有限公司 Method, device, electronic equipment and storage medium for merging small files
US12399868B1 (en) * 2023-03-20 2025-08-26 Amazon Technologies, Inc. Managed file compaction for distributed storage systems

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101245994B1 (en) * 2012-08-31 2013-03-20 케이씨씨시큐리티주식회사 Parallel distributed processing system and method
CN106843763A (en) * 2017-01-19 2017-06-13 北京神州绿盟信息安全科技股份有限公司 A kind of Piece file mergence method and device based on HDFS systems
CN106855861A (en) * 2015-12-09 2017-06-16 北京金山安全软件有限公司 File merging method and device and electronic equipment
CN107168802A (en) * 2017-05-18 2017-09-15 郑州云海信息技术有限公司 The merging method and device of a kind of cloud storage small file
CN110515920A (en) * 2019-08-30 2019-11-29 北京浪潮数据技术有限公司 A kind of mass small documents access method and system based on Hadoop

Also Published As

Publication number Publication date
CN113127548B (en) 2023-10-31

Similar Documents

Publication Publication Date Title
CN108536745B (en) Shell-based data table extraction method, terminal, equipment and storage medium
CN108470045B (en) Electronic device, data chain archiving method and storage medium
CN112416972A (en) Real-time data stream processing method, device, equipment and readable storage medium
CN112783436B (en) Synchronous object placement for information lifecycle management
CN110633211A (en) Test method, device, server and medium for multi-interface
CN113127548A (en) File merging method, device, equipment and storage medium
CN111988419A (en) File uploading method, file downloading method, file uploading device, file downloading device, computer equipment and storage medium
CN114169309A (en) Method and device for modifying behavior data table, computer equipment and storage medium
WO2019095667A1 (en) Database data collection method, application server, and computer readable storage medium
CN111258774A (en) Process processing method, device, computer equipment and storage medium
US20210357201A1 (en) Upgrades based on analytics from multiple sources
CN110162344B (en) Isolation current limiting method and device, computer equipment and readable storage medium
CN109445800B (en) Version automatic deployment method and system based on distributed system
CN112732367A (en) Event flow processing method, device and equipment and readable storage medium
CN116339908A (en) Virtual machine starting method, device, computer equipment and storage medium
CN118939472A (en) A data management method and related equipment
CN113886590A (en) Data summarizing method and device, computer equipment and storage medium
CN113886419A (en) SQL statement processing method and device, computer equipment and storage medium
CN113127359A (en) Method and device for obtaining test data
CN119356711A (en) Data updating method, device, system, computer equipment and readable storage medium
US20170168867A1 (en) Information processing system and control method
CN111159985A (en) Data export method, data export device, computer equipment and computer-readable storage medium
CN112583761A (en) Management method and device of security entity, computer equipment and storage medium
CN118860587A (en) Task processing method, device, electronic device, storage medium and program product
CN110457273A (en) A nuclear power plant document management method, system and terminal equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 332, 3 / F, Building 102, 28 xinjiekouwei street, Xicheng District, Beijing 100088

Applicant after: QAX Technology Group Inc.

Applicant after: Qianxin Wangshen information technology (Beijing) Co.,Ltd.

Address before: Room 332, 3 / F, Building 102, 28 xinjiekouwei street, Xicheng District, Beijing 100088

Applicant before: QAX Technology Group Inc.

Applicant before: LEGENDSEC INFORMATION TECHNOLOGY (BEIJING) Inc.

GR01 Patent grant