Disclosure of Invention
The invention aims to provide a method, an apparatus, a device and a storage medium for file merging, which can merge the small files generated by HIVE more efficiently and relieve the memory pressure on the NameNode of the HDFS as quickly as possible.
According to an aspect of the present invention, there is provided a method of file merging, the method comprising:
acquiring the files of each partition of the HIVE from the HDFS, and determining, from the files of each partition, the target files that need to be merged;
calculating a merging effect priority of each partition according to a preset merging effect model based on the target files of the partition; the merging effect priority is positively correlated with the degree to which merging the target files relieves the NameNode memory of the HDFS;
and starting a corresponding merging task for each partition, and executing the merging tasks in sequence according to the merging effect priority of each partition, so as to merge the target files on each partition.
Optionally, the step of determining a target file to be merged from the files of each partition specifically includes:
for all the files of a partition, setting any file whose size is smaller than a preset threshold as a target file of the partition.
Optionally, the step of calculating the merging effect priority of each partition according to a preset merging effect model based on the target file of each partition specifically includes:
counting the total number of files and/or the total size of the files of all the target files of a partition;
inputting the total number of files and/or the total size of files into the preset merging effect model as characteristic parameters, and obtaining the merging effect priority of the partition by running the preset merging effect model; the preset merging effect model is obtained by training through a machine learning algorithm.
Optionally, the step of starting a corresponding merging task for each partition and executing the merging tasks in sequence according to the merging effect priority of each partition specifically includes:
acquiring the total amount of resources available on Yarn for processing file merging;
determining the amount of resources required by each merging task;
sorting all the merging tasks according to the merging effect priority of each partition to obtain a task ordering result;
dividing the task ordering result into a plurality of merging batches according to the total amount of resources and the amount of resources required by each merging task, wherein one merging batch comprises a plurality of merging tasks;
and deploying, batch by batch, the merging tasks included in each merging batch onto Yarn in a distributed manner, so that the merging tasks of a batch are executed concurrently through Yarn.
Optionally, the method further includes:
after all target files of one partition are merged into one or more merged files, storing all the merged files under a temporary directory;
counting a first total number of data records contained in all the target files of the partition, and counting a second total number of data records contained in all the merged files of the partition;
judging whether the first total number is consistent with the second total number; if so, replacing the target files on the partition with the merged files under the temporary directory; and if not, deleting the merged files under the temporary directory.
Optionally, after the step of replacing the target files on the partition with the merged files under the temporary directory, the method further includes:
and storing all the target files on the partition under a backup directory, and deleting all the target files under the backup directory after a preset time period.
According to another aspect of the present invention, there is provided an apparatus for file merging, the apparatus comprising:
the determining module is used for acquiring the files of each partition of the HIVE from the HDFS and determining, from the files of each partition, the target files that need to be merged;
the calculation module is used for calculating the merging effect priority of each partition according to a preset merging effect model based on the target files of the partition; the merging effect priority is positively correlated with the degree to which merging the target files relieves the NameNode memory of the HDFS;
and the execution module is used for starting a corresponding merging task for each partition and executing the merging tasks in sequence according to the merging effect priority of each partition, so as to merge the target files on the partitions.
Optionally, the apparatus further comprises:
the verification module is used for storing all the merged files under a temporary directory after all the target files of a partition are merged into one or more merged files; counting a first total number of data records contained in all the target files of the partition, and counting a second total number of data records contained in all the merged files of the partition; judging whether the first total number is consistent with the second total number; if so, replacing the target files on the partition with the merged files under the temporary directory; and if not, deleting the merged files under the temporary directory.
According to another aspect of the present invention, there is provided a computer device, specifically including: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the above-described file merging method when executing the computer program.
According to another aspect of the present invention, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the above-described file merging method.
According to the file merging method, apparatus, device and storage medium described above, the merging effect priority of each HIVE partition is calculated so that the partitions can be sorted, and the small HIVE files of the partitions with a higher merging effect priority are merged preferentially according to the sorting result. Because the merging effect priority is positively correlated with the degree to which the NameNode memory of the HDFS is relieved after the small files are merged, preferentially merging the files of the partitions with a higher merging effect priority relieves the memory pressure on the NameNode of the HDFS as quickly as possible. In addition, because the physical resources available for file merging are limited, the method can merge the small HIVE files intelligently under those limited physical resources, quickly alleviating the HIVE small-file problem and improving cluster performance.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
The embodiment of the invention provides a file merging method which, as shown in Fig. 1, specifically comprises the following steps:
Step S101: acquiring the files of each partition of the HIVE from the HDFS, and determining the target files to be merged from the files of each partition.
The HDFS (Hadoop Distributed File System) is the storage management foundation for distributed computing over streaming data and was developed to meet the requirement of accessing and processing very large files in a streaming manner; the HDFS is used for storing massive data and is characterized by high fault tolerance, high reliability, high scalability, high availability and high throughput.
HIVE is a data warehouse tool based on Hadoop, used for data extraction, transformation and loading; it provides a mechanism for storing, querying and analyzing large-scale data stored in Hadoop. The HIVE data warehouse tool can map a structured data file onto a database table, provides an SQL query capability, and can convert an SQL statement into a MapReduce task for execution.
In practical applications, the files generated by the HIVE are stored in the HDFS by partition. Because the HIVE mechanism generates a large number of small files, storing these small files in the HDFS occupies the memory of the NameNode, the manager node of the HDFS, which limits the horizontal scaling capability of the HDFS.
Specifically, the step of determining a target file to be merged from the files of each partition includes:
for all the files of a partition, setting any file whose size is smaller than a preset threshold as a target file of the partition.
Preferably, the preset threshold ranges from 128 MB to 256 MB; this embodiment is mainly used for merging small HIVE files so as to relieve the memory pressure on the NameNode of the HDFS.
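As an illustrative sketch only (not part of the claimed implementation), the following Python code shows one way the target files of a partition could be selected by size, assuming a preset threshold of 128 MB and that file sizes are read with the standard `hdfs dfs -du` command; the function name and partition path are hypothetical.

```python
import subprocess

SMALL_FILE_THRESHOLD = 128 * 1024 * 1024  # assumed preset threshold: 128 MB


def list_target_files(partition_path):
    """Return (path, size) pairs for files under a HIVE partition whose size
    is below the preset threshold; these are the target files to be merged."""
    # `hdfs dfs -du <path>` prints one line per file; the first column is the
    # file size in bytes and the last column is the file path.
    output = subprocess.check_output(
        ["hdfs", "dfs", "-du", partition_path], text=True
    )
    targets = []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        size, path = int(parts[0]), parts[-1]
        if size < SMALL_FILE_THRESHOLD:
            targets.append((path, size))
    return targets
```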
Step S102: calculating the merging effect priority of each partition according to a preset merging effect model based on the target files of the partition; the merging effect priority is positively correlated with the degree to which merging the target files relieves the NameNode memory of the HDFS.
In this embodiment, the merging effect of each partition is evaluated through a preset merging effect model so as to calculate the merging effect priority of each partition; a higher merging effect priority indicates a better merging effect. Concretely, a better merging effect means that more target files are merged in less merging time, so that the NameNode memory is relieved to a greater extent. In this embodiment, the merging effect priority of a partition is determined along three dimensions: the merging time, the number of files merged, and the amount of NameNode memory released.
Specifically, step S102 includes:
Step A1: counting the total number of files and/or the total size of the files of all the target files of a partition;
Step A2: inputting the total number of files and/or the total size of files into the preset merging effect model as characteristic parameters, and obtaining the merging effect priority of the partition by running the preset merging effect model; the preset merging effect model is obtained by training through a machine learning algorithm.
In this embodiment, a merging effect model may be trained in advance through a machine learning algorithm, so that the merging effect priority of each partition can be calculated with the trained model. When the merging effect model is used, the total number of files and/or the total size of files of each partition are input into the model as characteristic parameters, and the output of the model is the merging effect priority. Of course, in practical applications, other characteristic parameters may also be used as inputs to the merging effect model, which is not limited here.
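For illustration only, the following Python sketch shows how such a merging effect model could be trained and queried; it assumes a simple linear regression from scikit-learn and entirely hypothetical training data, whereas the concrete machine learning algorithm, features and training set are not fixed by this embodiment.

```python
from sklearn.linear_model import LinearRegression

# Hypothetical training data: each sample is (total number of target files,
# total size of target files in bytes) for a historical merge, and the label
# is the observed merging benefit (e.g. NameNode memory freed per unit of
# merging time). Real training data would be collected from past merge jobs.
X_train = [[12000, 3.2e9], [800, 4.1e8], [45000, 9.7e9]]
y_train = [0.62, 0.07, 0.95]

effect_model = LinearRegression().fit(X_train, y_train)


def merging_effect_priority(total_files, total_size):
    """Score a partition; a higher score means a higher merging effect priority."""
    return float(effect_model.predict([[total_files, total_size]])[0])
```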
Step S103: starting a corresponding merging task for each partition, and executing the merging tasks in sequence according to the merging effect priority of each partition, so as to merge the target files on each partition.
One partition corresponds to one merging task, and executing the merging task merges the target files on that partition. Because the HIVE mechanism requires each table to specify a uniform data format, the data format under each partition is uniform, for example: orc, lzo, text, avro or parquet. Preferably, target files in the orc, lzo, text and avro formats are merged by starting a MapReduce task, and target files in the parquet format are merged by starting a SparkSQL task.
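A minimal sketch of the format-based dispatch described above, in Python; the mapping table and function name are hypothetical, and the actual commands for launching the MapReduce or SparkSQL merging task are omitted.

```python
# Assumed mapping from the partition's data format to the engine used for the
# merging task, following the preference stated above.
MERGE_ENGINE_BY_FORMAT = {
    "orc": "mapreduce",
    "lzo": "mapreduce",
    "text": "mapreduce",
    "avro": "mapreduce",
    "parquet": "sparksql",
}


def choose_merge_engine(data_format):
    """Return which engine should run the merging task for a partition."""
    return MERGE_ENGINE_BY_FORMAT[data_format.lower()]
```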
Because the resources used for file merging must not occupy too many queue resources of online services, the merging tasks of the partitions with higher merging effect priority need to be executed preferentially, so that the memory pressure on the NameNode is relieved as quickly as possible.
In addition, in practical applications, all the partitions may be sorted through a machine learning algorithm according to the target files of each partition to obtain a partition ordering result; the earlier a partition appears in the ordering, the better its merging effect. The merging tasks of the partitions are then executed in sequence according to the partition ordering result.
Specifically, step S103 includes:
Step B1: acquiring the total amount of resources available on Yarn for processing file merging;
Step B2: determining the amount of resources required by each merging task;
Step B3: sorting all the merging tasks according to the merging effect priority of each partition to obtain a task ordering result;
Step B4: dividing the task ordering result into a plurality of merging batches according to the total amount of resources and the amount of resources required by each merging task, wherein one merging batch comprises a plurality of merging tasks;
Step B5: deploying, batch by batch, the merging tasks included in each merging batch onto Yarn in a distributed manner, so that the merging tasks of a batch are executed concurrently through Yarn.
Yarn is the Hadoop resource manager, a general-purpose resource management system that provides unified resource management and scheduling for upper-layer applications. If the merging tasks were executed on only a single node, the number of target files merged would fall far behind the number of files newly generated by HIVE; therefore, multiple merging tasks need to be deployed on the Yarn queue in a distributed manner, and the number of merging tasks executed concurrently in each batch is determined according to the queue resources on Yarn.
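The following Python sketch illustrates steps B3 and B4, sorting the merging tasks by priority and greedily grouping them into batches that fit within the Yarn resources reserved for merging. It assumes the resource requirement of a task can be expressed as a single scalar (for example, container memory units), whereas a real Yarn queue tracks memory and vcores separately; the function and field names are hypothetical.

```python
def split_into_batches(tasks, total_resources):
    """Group merging tasks into batches that fit within the Yarn queue
    resources reserved for file merging.

    `tasks` is a list of (partition, priority, required_resources) tuples.
    """
    # Step B3: sort by merging effect priority, highest first.
    ordered = sorted(tasks, key=lambda t: t[1], reverse=True)

    batches, current, used = [], [], 0
    for task in ordered:
        needed = task[2]
        # Step B4: close the current batch once adding this task would
        # exceed the total resources available for merging.
        if current and used + needed > total_resources:
            batches.append(current)
            current, used = [], 0
        current.append(task)
        used += needed
    if current:
        batches.append(current)
    return batches
```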
Further, the method further comprises:
step C1: after all target files of one partition are merged into one or more merged files, storing all the merged files under a temporary directory;
Step C2: counting a first total number of data records contained in all the target files of the partition, and counting a second total number of data records contained in all the merged files of the partition;
Step C3: judging whether the first total number is consistent with the second total number; if so, replacing the target files on the partition with the merged files under the temporary directory; and if not, deleting the merged files under the temporary directory.
In this embodiment, in order to prevent data from being damaged or lost during file merging and to improve data security, a verification comparison is performed before and after the merging operation: it is judged whether the total number of data records contained in the target files of a partition before merging is consistent with the total number of data records contained in the merged files. If so, no data was lost during merging and the file replacement can be performed; if not, data was lost during merging, so the original target files are kept and the merged files are deleted. Preferably, the total number of records contained in all the target files before merging and the total number of records contained in all the merged files after merging are both obtained through SparkSQL.
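As an illustrative sketch of this verification step, assuming PySpark and parquet-format files, the following code compares the record counts before and after merging; the function name and parameters are hypothetical.

```python
from pyspark.sql import SparkSession


def verify_merge(target_paths, temp_dir, data_format="parquet"):
    """Return True if the merged files hold the same number of records as the
    original target files, i.e. no data was lost during merging."""
    spark = SparkSession.builder.appName("merge-verification").getOrCreate()
    # First total: records contained in the original small target files.
    first_total = spark.read.format(data_format).load(target_paths).count()
    # Second total: records contained in the merged files under the temporary directory.
    second_total = spark.read.format(data_format).load(temp_dir).count()
    # The caller replaces the target files when True, otherwise deletes temp_dir.
    return first_total == second_total
```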
It should be noted that the maximum size of a merged file is configured in advance; during merging, whenever the current merged file reaches this maximum size, a new merged file is created under the temporary directory, so a plurality of small target files may be merged into a plurality of larger merged files.
Further, after the step of replacing the target files on the partition with the merged files under the temporary directory, the method further comprises:
and storing all the target files on the partition under a backup directory, and deleting all the target files under the backup directory after a preset time period.
In this embodiment, in order to ensure data security, the target files are not deleted immediately when they are replaced with the merged files; instead, they are deleted after a period of time has elapsed and after manual confirmation that the data is correct.
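A possible sketch of this backup-and-delayed-delete behaviour, using the standard `hdfs dfs` commands from Python; the seven-day retention period is only an assumed example of the preset time period, and the helper names are hypothetical.

```python
import subprocess
import time

BACKUP_RETENTION_SECONDS = 7 * 24 * 3600  # assumed preset period: 7 days


def back_up_targets(target_paths, backup_dir):
    """Move the original target files of a partition into a backup directory."""
    subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", backup_dir])
    for path in target_paths:
        subprocess.check_call(["hdfs", "dfs", "-mv", path, backup_dir])


def purge_backup(backup_dir, backed_up_at):
    """Delete the backup directory once the preset retention period has elapsed
    and the merged data has been manually confirmed to be correct."""
    if time.time() - backed_up_at >= BACKUP_RETENTION_SECONDS:
        subprocess.check_call(["hdfs", "dfs", "-rm", "-r", backup_dir])
```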
Example two
The embodiment of the invention provides a file merging apparatus which, as shown in Fig. 2, specifically comprises the following components:
a determining module 201, configured to acquire the files of each partition of the HIVE from the HDFS and determine the target files to be merged from the files of each partition;
a calculating module 202, configured to calculate the merging effect priority of each partition according to a preset merging effect model based on the target files of the partition; the merging effect priority is positively correlated with the degree to which merging the target files relieves the NameNode memory of the HDFS;
an execution module 203, configured to start a corresponding merging task for each partition and execute the merging tasks in sequence according to the merging effect priority of each partition, so as to merge the target files on the partitions.
Specifically, the determining module 201 is configured to:
for all the files of a partition, setting any file whose size is smaller than a preset threshold as a target file of the partition.
Preferably, the preset threshold ranges from 128 MB to 256 MB; this embodiment is mainly used for merging small HIVE files so as to relieve the memory pressure on the NameNode of the HDFS.
A calculation module 202 configured to:
counting the total number of files and/or the total size of the files of all the target files of a partition; inputting the total number of files and/or the total size of files into the preset merging effect model as characteristic parameters, and obtaining the merging effect priority of the partition by running the preset merging effect model; the preset merging effect model is obtained by training through a machine learning algorithm.
In this embodiment, the merging effect of each partition is evaluated through a preset merging effect model so as to calculate the merging effect priority of each partition; a higher merging effect priority indicates a better merging effect. Concretely, a better merging effect means that more target files are merged in less merging time, so that the NameNode memory is relieved to a greater extent.
An execution module 203 for:
acquiring the total amount of resources available on Yarn for processing file merging; determining the amount of resources required by each merging task; sorting all the merging tasks according to the merging effect priority of each partition to obtain a task ordering result; dividing the task ordering result into a plurality of merging batches according to the total amount of resources and the amount of resources required by each merging task, wherein one merging batch comprises a plurality of merging tasks; and deploying, batch by batch, the merging tasks included in each merging batch onto Yarn in a distributed manner, so that the merging tasks of a batch are executed concurrently through Yarn.
One partition corresponds to one merging task, and executing the merging task merges the target files on that partition. Because the HIVE mechanism requires each table to specify a uniform data format, the data format under each partition is uniform, for example: orc, lzo, text, avro or parquet. Preferably, target files in the orc, lzo, text and avro formats are merged by starting a MapReduce task, and target files in the parquet format are merged by starting a SparkSQL task.
Because the resources used for file merging must not occupy too many queue resources of online services, the merging tasks of the partitions with higher merging effect priority need to be executed preferentially, so that the memory pressure on the NameNode is relieved as quickly as possible.
In addition, in practical applications, all the partitions may be sorted through a machine learning algorithm according to the target files of each partition to obtain a partition ordering result; the earlier a partition appears in the ordering, the better its merging effect. The merging tasks of the partitions are then executed in sequence according to the partition ordering result.
Further, the apparatus further comprises:
the verification module is used for storing all the merged files under a temporary directory after all the target files of a partition are merged into one or more merged files; counting a first total number of data records contained in all the target files of the partition, and counting a second total number of data records contained in all the merged files of the partition; judging whether the first total number is consistent with the second total number; if so, replacing the target files on the partition with the merged files under the temporary directory; and if not, deleting the merged files under the temporary directory.
In this embodiment, in order to prevent data from being damaged or lost during file merging and to improve data security, a verification comparison is performed before and after the merging operation: it is judged whether the total number of data records contained in the target files of a partition before merging is consistent with the total number of data records contained in the merged files. If so, no data was lost during merging and the file replacement can be performed; if not, data was lost during merging, so the original target files are kept and the merged files are deleted. Preferably, the total number of records contained in all the target files before merging and the total number of records contained in all the merged files after merging are both obtained through SparkSQL.
It should be noted that the maximum size of a merged file is configured in advance; during merging, whenever the current merged file reaches this maximum size, a new merged file is created under the temporary directory, so a plurality of small target files may be merged into a plurality of larger merged files.
Still further, the apparatus further comprises:
and the backup module is used for storing all the target files of the partition under a backup directory after the target files on the partition are replaced with the merged files under the temporary directory, and deleting all the target files under the backup directory after a preset time period.
In this embodiment, in order to ensure data security, the target files are not deleted immediately when they are replaced with the merged files; instead, they are deleted after a period of time has elapsed and after manual confirmation that the data is correct.
EXAMPLE III
This embodiment also provides a computer device capable of executing programs, such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server or a cabinet server (including an independent server or a server cluster composed of a plurality of servers). As shown in Fig. 3, the computer device 30 of this embodiment includes at least, but is not limited to: a memory 301 and a processor 302 communicatively coupled to each other via a system bus. It should be noted that Fig. 3 only shows the computer device 30 with components 301 and 302, but it should be understood that not all of the shown components are required, and more or fewer components may be implemented instead.
In this embodiment, the memory 301 (i.e., the readable storage medium) includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 301 may be an internal storage unit of the computer device 30, such as a hard disk or a memory of the computer device 30. In other embodiments, the memory 301 may also be an external storage device of the computer device 30, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card provided on the computer device 30. Of course, the memory 301 may also include both the internal storage unit and the external storage device of the computer device 30. In this embodiment, the memory 301 is generally used for storing the operating system and the various types of application software installed in the computer device 30. In addition, the memory 301 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 302 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip in some embodiments. The processor 302 generally controls the overall operation of the computer device 30.
Specifically, in this embodiment, the processor 302 is configured to execute a program of the file merging method stored in the memory 301, and when executed, the program of the file merging method implements the following steps:
acquiring the files of each partition of the HIVE from the HDFS, and determining, from the files of each partition, the target files that need to be merged;
calculating a merging effect priority of each partition according to a preset merging effect model based on the target files of the partition; the merging effect priority is positively correlated with the degree to which merging the target files relieves the NameNode memory of the HDFS;
and starting a corresponding merging task for each partition, and executing the merging tasks in sequence according to the merging effect priority of each partition, so as to merge the target files on each partition.
The specific embodiment process of the above method steps can be referred to in the first embodiment, and the detailed description of this embodiment is not repeated here.
Example four
This embodiment also provides a computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server or an App application mall, on which a computer program is stored which, when executed by a processor, implements the following method steps:
acquiring the files of each partition of the HIVE from the HDFS, and determining, from the files of each partition, the target files that need to be merged;
calculating a merging effect priority of each partition according to a preset merging effect model based on the target files of the partition; the merging effect priority is positively correlated with the degree to which merging the target files relieves the NameNode memory of the HDFS;
and starting a corresponding merging task for each partition, and executing the merging tasks in sequence according to the merging effect priority of each partition, so as to merge the target files on each partition.
The specific embodiment process of the above method steps can be referred to in the first embodiment, and the detailed description of this embodiment is not repeated here.
It should be noted that, in this document, the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.