CN111475336A - Backup data analysis method and device based on file information and computer equipment

Info

Publication number: CN111475336A
Application number: CN202010158694.7A
Authority: CN
Inventors: 梁思
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2020-03-09
Filing date: 2020-03-09
Publication date: 2020-07-31
Anticipated expiration: 2040-03-09
Also published as: CN111475336B; WO2021179579A1

Abstract

The present application provides a backup data analysis method, device, computer equipment and computer-readable storage medium based on file information, and relates to the technical field of data processing. The file information of the backup data is classified to obtain file meta information and file time information. A pre-built KNN algorithm model processes the file meta information to obtain the data track corresponding to the file meta information, and the file time information is processed to obtain the data track corresponding to the file time information. Finally, the data tracks are integrated and sorted to obtain a data track report of the backup data. By analyzing the file information of the backup data, the application obtains a data track report that records how the data state evolves, so that changes in the data can be traced, helping users better understand the characteristics of the data and evaluate its future trends and application directions.

Description

Backup data analysis method, device and computer equipment based on file information

Technical field

The present application relates to the technical field of data processing, and in particular to a backup data analysis method, device and computer equipment based on file information.

Background

Existing backup products and cloud backup products on the market back up data under the backup policy set by the user. After a backup job, the retained backup data sits idle on disk or tape and is purged once the retention period ends. During this period the backup data serves no purpose and generates no added value; instead it occupies a large amount of storage space and wastes enterprise resources.

Summary of the invention

The main purpose of the present application is to provide a backup data analysis method, device and computer equipment based on file information, aiming to overcome the drawback that existing backup data generates no added value after backup and wastes resources.

To achieve the above purpose, the present application provides a backup data analysis method based on file information, including:

obtaining file information of backup data from a data lake, wherein the backup data is a copy of production data and the data lake is isolated from the production environment;

classifying the file information by data type to obtain several pieces of file meta information and several pieces of file time information;

dividing the file meta information by backup time to obtain a plurality of pieces of sub-file meta information;

inputting the pieces of sub-file meta information, in order of backup time, into a pre-built KNN algorithm model, and predicting the inode count corresponding to each piece of file meta information;

retrieving the preset rules corresponding respectively to the file meta information and the file time information, and processing the inode counts and the file time information to obtain a first data track corresponding to the file meta information and a second data track corresponding to the file time information;

segmenting and comparing the first data track and the second data track respectively to obtain a data track report of the backup data.

Further, the step of retrieving the preset rule corresponding to the file meta information and processing the inode counts to obtain the first data track corresponding to the file meta information includes:

looking up the backup time corresponding to each inode count according to the correspondence between the inode counts and the pieces of sub-file meta information;

arranging the inode counts in order of their corresponding backup times to obtain the first data track corresponding to the file meta information.

Further, the step of retrieving the preset rule corresponding to the file time information and processing the file time information to obtain the second data track corresponding to the file time information includes:

classifying the file time information by time type to obtain several pieces of sub-file time information;

plotting the pieces of sub-file time information as a scatter plot against time to obtain the second data track corresponding to the file time information.

Further, the step of segmenting and comparing the first data track and the second data track respectively to obtain the data track report of the backup data includes:

dividing the first data track and the second data track respectively by a first preset time period to obtain several segmented time tracks;

comparing the segmented time tracks that belong to the same data type to generate the data track report.

Further, after the step of retrieving the preset rules corresponding respectively to the file meta information and the file time information, and processing the inode counts and the file time information to obtain the first data track corresponding to the file meta information and the second data track corresponding to the file time information, the method includes:

judging, according to the change amplitudes of the first data track and the second data track respectively, whether an abnormal feedback point exists in the first data track and/or the second data track, wherein an abnormal feedback point is file information in the first data track and/or the second data track whose current change amplitude is greater than the normal amplitude;

if an abnormal feedback point exists, marking the abnormal feedback point in a preset format in the first data track and/or the second data track, and outputting preset information to alert the user to the existence of the abnormal feedback point.

Further, the step of judging, according to the change amplitude of the first data track, whether an abnormal feedback point exists in the first data track includes:

calculating the difference between every two adjacent inode counts in the first data track to obtain several change values;

calculating the difference between every two adjacent change values to obtain a plurality of change differences;

judging whether at least one change difference greater than a preset difference can be found among the change differences;

if at least one change difference greater than the preset difference can be found, determining that an abnormal feedback point exists in the first data track;

if no change difference greater than the preset difference can be found, determining that no abnormal feedback point exists in the first data track.
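The second-difference check described in these steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `find_abnormal_points`, the sample inode counts and the threshold of 30 are hypothetical, and the comparison is the one-sided "greater than the preset difference" test exactly as worded above.

```python
def find_abnormal_points(inode_counts, preset_difference):
    """Return indices in `inode_counts` flagged as abnormal feedback points."""
    # Change values: differences between adjacent inode counts.
    changes = [b - a for a, b in zip(inode_counts, inode_counts[1:])]
    # Change differences: differences between adjacent change values.
    change_diffs = [b - a for a, b in zip(changes, changes[1:])]
    # change_diffs[i] involves counts i, i+1 and i+2; flag the newest one.
    return [i + 2 for i, d in enumerate(change_diffs) if d > preset_difference]


# Hypothetical first data track: steady growth with one sudden jump.
track = [100, 110, 120, 130, 200, 210]
abnormal = find_abnormal_points(track, preset_difference=30)
```

Here the change values are 10, 10, 10, 70, 10 and the change differences are 0, 0, 60, -60, so only the jump to 200 (index 4) is flagged. A steadily growing track yields an empty list, which corresponds to the "no abnormal feedback point" branch.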

Further, the step of judging, according to the change amplitude of the second data track, whether an abnormal feedback point exists in the second data track further includes:

dividing the second data track by a second preset time period to obtain several scatter plot regions;

comparing the densities of the scatter points in every two adjacent scatter plot regions, and judging whether the degree of difference between the two adjacent densities is within a preset range;

if the degree of difference between the two adjacent densities is within the preset range, determining that no abnormal feedback point exists in the second data track;

if the degree of difference between the two adjacent densities is not within the preset range, determining that an abnormal feedback point exists in the second data track.
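A minimal Python sketch of this density comparison follows. The text does not define how density or the "degree of difference" is measured, so the sketch assumes density is the number of points per window and the degree of difference is the absolute difference between adjacent counts; `window_counts`, `has_density_anomaly`, `max_diff` and the sample timestamps are all hypothetical.

```python
from datetime import datetime, timedelta


def window_counts(timestamps, start, window, n_windows):
    """Count scatter points in each consecutive window of length `window`."""
    counts = [0] * n_windows
    for ts in timestamps:
        index = (ts - start) // window  # timedelta // timedelta gives an int
        if 0 <= index < n_windows:
            counts[index] += 1
    return counts


def has_density_anomaly(counts, max_diff):
    """True if any two adjacent windows differ by more than `max_diff` points."""
    return any(abs(b - a) > max_diff for a, b in zip(counts, counts[1:]))


# Hypothetical file times: 2 events on day 1, 3 on day 2, 12 on day 3.
start = datetime(2020, 3, 1)
day = timedelta(days=1)
timestamps = (
    [start + timedelta(hours=h) for h in (1, 5)]
    + [start + day + timedelta(hours=h) for h in (2, 3, 4)]
    + [start + 2 * day + timedelta(minutes=m) for m in range(12)]
)
counts = window_counts(timestamps, start, day, 3)
anomaly = has_density_anomaly(counts, max_diff=5)
```

With one-day windows the counts are 2, 3 and 12; the jump from 3 to 12 exceeds the assumed range of 5, so an abnormal feedback point would be reported for the third window.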

The present application also provides a backup data analysis device based on file information, including:

an acquisition module, configured to obtain file information of backup data from a data lake, wherein the backup data is a copy of production data and the data lake is isolated from the production environment;

a classification module, configured to classify the file information by data type to obtain several pieces of file meta information and several pieces of file time information;

a splitting module, configured to divide the file meta information by backup time to obtain a plurality of pieces of sub-file meta information;

a prediction module, configured to input the pieces of sub-file meta information, in order of backup time, into a pre-built KNN algorithm model, and predict the inode count corresponding to each piece of sub-file meta information;

a processing module, configured to retrieve the preset rules corresponding respectively to the file meta information and the file time information, and process the inode counts and the file time information to obtain a first data track corresponding to the file meta information and a second data track corresponding to the file time information;

a generation module, configured to segment and compare the first data track and the second data track respectively to obtain a data track report of the backup data.

Further, the processing module includes:

a search unit, configured to look up the backup time corresponding to each inode count according to the correspondence between the inode counts and the pieces of sub-file meta information;

an arranging unit, configured to arrange the inode counts in order of their corresponding backup times to obtain the first data track corresponding to the file meta information;

a classification unit, configured to classify the file time information by time type to obtain several pieces of sub-file time information;

a drawing unit, configured to plot the pieces of sub-file time information as a scatter plot against time to obtain the second data track corresponding to the file time information.

Further, the generation module includes:

a first dividing unit, configured to divide the first data track and the second data track respectively by a first preset time period to obtain several segmented time tracks;

a generating unit, configured to compare the segmented time tracks that belong to the same data type to generate the data track report.

Further, the analysis device includes:

a judgment module, configured to judge, according to the change amplitudes of the first data track and the second data track respectively, whether an abnormal feedback point exists in the first data track and/or the second data track, wherein an abnormal feedback point is file information in the first data track and/or the second data track whose current change amplitude is greater than the normal amplitude;

an identification module, configured to, if an abnormal feedback point exists, mark the abnormal feedback point in a preset format in the first data track and/or the second data track, and output preset information to alert the user to the existence of the abnormal feedback point.

Further, the judgment module includes:

a first calculation unit, configured to calculate the difference between every two adjacent inode counts in the first data track to obtain several change values;

a second calculation unit, configured to calculate the difference between every two adjacent change values to obtain a plurality of change differences;

a first judging unit, configured to judge whether at least one change difference greater than a preset difference can be found among the change differences;

a first determination unit, configured to determine that an abnormal feedback point exists in the first data track if at least one change difference greater than the preset difference can be found;

a second determination unit, configured to determine that no abnormal feedback point exists in the first data track if no change difference greater than the preset difference can be found.

Further, the judgment module also includes:

a second dividing unit, configured to divide the second data track by a second preset time period to obtain several scatter plot regions;

a second judging unit, configured to compare the densities of the scatter points in every two adjacent scatter plot regions, and judge whether the degree of difference between the two adjacent densities is within a preset range;

a third determination unit, configured to determine that no abnormal feedback point exists in the second data track if the degree of difference between the two adjacent densities is within the preset range;

a fourth determination unit, configured to determine that an abnormal feedback point exists in the second data track if the degree of difference between the two adjacent densities is not within the preset range.

The present application further provides computer equipment, including a memory and a processor, wherein a computer program is stored in the memory and the processor implements the steps of any one of the above methods when executing the computer program.

The present application also provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of any one of the above methods.

The backup data analysis method, device and computer equipment based on file information provided in the present application classify the file information of the backup data to obtain the file meta information and file time information of the backup data. A pre-built KNN algorithm model then processes the file meta information to obtain the corresponding inode counts, which are arranged in order of backup time to obtain the data track corresponding to the file meta information; the file time information is plotted as a scatter plot against time to obtain the data track corresponding to the file time information. Finally, the data tracks are segmented and compared to obtain a data track report of the backup data. The data track report records how the data state evolves, so that changes in the data can be traced, helping users better understand the characteristics of the data and evaluate its future trends and application directions.

Description of drawings

FIG. 1 is a schematic diagram of the steps of a backup data analysis method based on file information in an embodiment of the present application;

FIG. 2 is a block diagram of the overall structure of a backup data analysis device based on file information in an embodiment of the present application;

FIG. 3 is a schematic structural block diagram of computer equipment according to an embodiment of the present application.

The realization of the purpose, the functional characteristics and the advantages of the present application will be further described in conjunction with the embodiments and with reference to the accompanying drawings.

Detailed description

In order to make the purpose, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application and are not intended to limit it.

Referring to FIG. 1, an embodiment of the present application provides a backup data analysis method based on file information, including:

S1: obtaining file information of backup data from a data lake, wherein the backup data is a copy of production data and the data lake is isolated from the production environment;

S2: classifying the file information by data type to obtain several pieces of file meta information and several pieces of file time information;

S3: dividing the file meta information by backup time to obtain a plurality of pieces of sub-file meta information;

S4: inputting the pieces of sub-file meta information, in order of backup time, into a pre-built KNN algorithm model, and predicting the inode count corresponding to each piece of file meta information;

S5: retrieving the preset rules corresponding respectively to the file meta information and the file time information, and processing the inode counts and the file time information to obtain a first data track corresponding to the file meta information and a second data track corresponding to the file time information;

S6: segmenting and comparing the first data track and the second data track respectively to obtain a data track report of the backup data.
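Steps S2 and S3 above can be illustrated with a short Python sketch. It is not the applicant's implementation: the record layout, the field names in `META_FIELDS` and `TIME_FIELDS`, and the helper names are all hypothetical, chosen only to show classification by data type followed by grouping by backup time.

```python
from collections import defaultdict

# Hypothetical field names; the text lists the kinds of information
# involved but does not define a record schema.
META_FIELDS = {"capacity", "storage_locations", "directory_count", "tree_depth"}
TIME_FIELDS = {"created", "modified"}


def classify(record):
    """Split one file-information record into (meta, time) parts (step S2)."""
    meta = {k: v for k, v in record.items() if k in META_FIELDS}
    times = {k: v for k, v in record.items() if k in TIME_FIELDS}
    return meta, times


def split_meta_by_backup_time(records):
    """Group the meta part of each record by its backup time (step S3)."""
    groups = defaultdict(list)
    for record in records:
        meta, _ = classify(record)
        groups[record["backup_time"]].append(meta)
    # Sort by backup time so later steps can consume the groups in order.
    return dict(sorted(groups.items()))


records = [
    {"backup_time": "2020-03-02", "capacity": 2048, "tree_depth": 4,
     "created": "2020-03-01T09:00", "modified": "2020-03-02T08:30"},
    {"backup_time": "2020-03-01", "capacity": 1024, "tree_depth": 3,
     "created": "2020-03-01T09:00", "modified": "2020-03-01T10:15"},
]
sub_file_meta = split_meta_by_backup_time(records)
```

ISO-formatted date strings sort chronologically, which is why plain string keys are enough for this sketch; a real system would more likely use proper timestamp types.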

In this embodiment, when production data is backed up, the system creates corresponding files or folders for the backup data and stores them on disk or tape. The backup data is a copy of the production data; it is stored in the data lake and isolated from the production environment. The production environment is the system environment that formally provides external services; it generally has error reporting turned off and error logging turned on. It can be understood as the environment containing all functions: the environment used by any project is based on it and is then adjusted or modified according to the customer's individual needs. The analysis system obtains the backed-up production data from the data lake for data track analysis. On the one hand, because the backup data is isolated from the production environment, the analysis does not affect the load of production jobs; on the other hand, existing backups of production data are mere data records that guard against data loss, and the backup data is not developed or used any further, which greatly wastes resources.

When analyzing the data, the analysis system obtains all file information of the backup data from its creation to the present. The file information includes file capacity, storage location, directory tree, directory tree depth, creation time, modification time and other information. The analysis system applies different processing rules to different types of file information, so the file information is first classified by data type into several pieces of file sub-information, namely file meta information and file time information. For example, the file capacity, the number of storage locations, the number of directory files in the directory tree and the number of levels between the directory files in the directory tree are related to the inode count of the backup data files and characterize the specific information of the files; they belong to the file meta information. Specifically, an inode records the locations of the file's data blocks, so the inode count can reflect the number of file storage locations. In Unix/Linux systems a directory is also a kind of file, and the structure of a directory file is very simple: it is a list of directory entries. Each directory entry consists of two parts, the file name of a contained file and the inode number corresponding to that file name, so the inode count can reflect the number of directory files in the directory tree and the number of levels between them. Information such as the creation time, modification time and life cycle of the backup data files characterizes the time attributes of the files and belongs to the file time information.

After classifying the file information, the analysis system retrieves the processing rule corresponding to the data type of each piece of file sub-information and applies the rules in parallel to the corresponding file sub-information, obtaining the data track corresponding to each piece of file sub-information. For example, if a piece of file sub-information is of the file meta information type, the analysis system calls the pre-trained KNN algorithm model and inputs the file sub-information into it to compute the predicted inode count. As noted, the file meta information is the information that characterizes the files themselves (capacity, number of storage locations, number of directory files in the directory tree, number of levels between them) and is related to the inode count of the backup data files. When the system backs up production data to create the backup data, it stores the backup files on disk and also needs a place to store the files' meta information, such as the creator, creation date and size of each file. The area that stores this file meta information is called an inode (index node), and the inode count is the number of inodes.

The user sets a backup frequency, and the system backs up the production data at that frequency. For example, if the backup frequency is once a day, the system backs up all of the production data once every day, and the point in time of each backup is its backup time. The system divides the file meta information by backup time to obtain several pieces of sub-file meta information. The pieces of sub-file meta information are then input into the pre-built KNN algorithm model in order of their backup times, and the inode count corresponding to each piece of sub-file meta information is predicted.

The KNN algorithm model is trained as follows. File-related metadata from historical records, such as storage location, number of files and number of soft and hard links, is preprocessed into a combination of feature values, and the inode count of the file system is taken as the result of that combination; the KNN algorithm then performs a preliminary classification of the different inode counts. A growth trend can be calculated for each metadata item over time. KNN calculation is performed on the combined trend data of all metadata items to predict inode consumption, the model is continuously trained and regression-verified against data that actually occurred, and the algorithm is optimized to reduce error by increasing or decreasing the weights of the feature values. With this model, inode consumption can be predicted from changes in historical data, and feedback can be given to users on how to optimize their programs or file storage, so that storage capacity and inode capacity are used optimally.

The inode counts, arranged in order of backup time, form the first data track. If a piece of file sub-information belongs to the file time information, a scatter plot is drawn with time on the horizontal axis and the file times as points; the scatter plot is the second data track corresponding to the file time information. The analysis system divides and compares the first data track and the second data track by a preset time period. For example, with one month as the preset time period, the first data track is divided at one-month intervals and the segments are compared, so that the current month's data track for the file meta information can be compared with the previous month's, yielding a complete data track report of the backup data.
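To make the prediction step more concrete, the following Python sketch shows one way a k-nearest-neighbours predictor with per-feature weights could map file metadata to an inode count. It is an assumption-laden illustration, not the model described in the application: the function `knn_predict_inode_count`, the three features (storage locations, file count, link count), the choice of averaging the neighbours' inode counts, and the sample history are all hypothetical.

```python
import math


def knn_predict_inode_count(history, query, k=3, weights=None):
    """Average the inode counts of the k historical records nearest to `query`.

    `history` is a list of (features, inode_count) pairs. `weights` scales
    each feature's contribution to the distance, loosely mirroring the idea
    of tuning feature weights to reduce prediction error.
    """
    if not history:
        raise ValueError("history must not be empty")
    w = list(weights) if weights is not None else [1.0] * len(query)

    def distance(features):
        # Weighted Euclidean distance between a historical record and the query.
        return math.sqrt(
            sum(wi * (a - b) ** 2 for wi, a, b in zip(w, features, query))
        )

    nearest = sorted(history, key=lambda item: distance(item[0]))[:k]
    return sum(count for _, count in nearest) / len(nearest)


# Hypothetical history: (storage locations, file count, link count) -> inode count
history = [
    ((10, 100, 5), 120),
    ((12, 110, 6), 133),
    ((20, 200, 10), 240),
    ((22, 210, 12), 255),
]
prediction = knn_predict_inode_count(history, (11, 105, 5), k=2)
```

For the query (11, 105, 5) the two nearest records are the first two, so the sketch returns the mean of 120 and 133. In practice the features would need scaling so that no single metadata item dominates the distance, which is one role the adjustable weights can play.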

S501: looking up the backup time corresponding to each inode count according to the correspondence between the inode counts and the pieces of sub-file meta information;

S502: arranging the inode counts in order of their corresponding backup times to obtain the first data track corresponding to the file meta information.

In this embodiment, following the correspondence between the inode counts and the pieces of sub-file meta information, the analysis system takes the backup time of each piece of sub-file meta information as the backup time of the corresponding inode count. The analysis system then arranges the inode counts in order of their backup times to obtain the first data track of the inode count, thereby simulating and predicting the inode count.
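Steps S501 and S502 amount to pairing each predicted inode count with its backup time and sorting by that time. A minimal Python sketch, assuming the predictions arrive as (backup time, inode count) pairs in arbitrary order; the function name `build_first_data_track` and the sample values are hypothetical:

```python
from datetime import date


def build_first_data_track(predictions):
    """Return (backup_time, inode_count) pairs sorted by backup time.

    `predictions` holds one pair per piece of sub-file meta information,
    in arbitrary order.
    """
    return sorted(predictions, key=lambda pair: pair[0])


predictions = [
    (date(2020, 3, 3), 133),
    (date(2020, 3, 1), 120),
    (date(2020, 3, 2), 126),
]
track = build_first_data_track(predictions)
```

The resulting list is the time-ordered series that the later segmentation and anomaly checks operate on.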

S503: classifying the file time information by time type to obtain several pieces of sub-file time information;

S504: plotting the pieces of sub-file time information as a scatter plot against time to obtain the data track corresponding to the file sub-information.

In this embodiment, the file time information includes the creation time, every modification time, the deletion time and the life cycle of the backup data files, together with other information that characterizes file times. The analysis system first classifies the file time information by time type, for example creation times as one class and modification times as another, to obtain multiple pieces of sub-file time information. The analysis system then forms a rectangular coordinate system with the date on the horizontal axis and the time of day on the vertical axis, and marks each piece of sub-file time information at its corresponding position, forming a scatter plot; this scatter plot is the second data track of the file time information. Specifically, different types of sub-file time information are marked in different colors, for example creation times with red points and modification times with blue points. From the spacing or density of the points, the analysis system can judge whether actions such as file creation and modification are frequent, and use this as a baseline to predict the later modification frequency at the same points in time and whether files will be created in large numbers, placing a heavy load on the system.
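The coordinate mapping described here (date on the horizontal axis, time of day on the vertical axis, one color per time type) can be sketched without a plotting library. The event layout, the `COLORS` table and the function `scatter_points` are hypothetical; the red and blue assignments follow the example in the text, and the gray fallback for other time types is an assumption.

```python
from datetime import date, datetime

# Colors per time type; red/blue follow the example in the text,
# the gray fallback is an assumption.
COLORS = {"created": "red", "modified": "blue"}


def scatter_points(events):
    """Map (time_type, timestamp) events to (date, hour_of_day, color) points."""
    points = []
    for time_type, ts in events:
        hour_of_day = ts.hour + ts.minute / 60 + ts.second / 3600
        points.append((ts.date(), hour_of_day, COLORS.get(time_type, "gray")))
    # Sort by date, then by time of day, so the points read left to right.
    return sorted(points, key=lambda p: (p[0], p[1]))


events = [
    ("modified", datetime(2020, 3, 2, 14, 30)),
    ("created", datetime(2020, 3, 1, 9, 0)),
]
points = scatter_points(events)
```

Each resulting tuple can be passed to any plotting tool as an x value, a y value and a marker color.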

S601: Divide the first data track and the second data track respectively according to a first preset time period to obtain several segmented time tracks;

S602: Compare the segmented time tracks belonging to the same data type to generate the data track report.

In this embodiment, the analysis system divides the data track of each data type, that is, the first data track and the second data track respectively, according to the first preset time period. For example, if the first preset time period is one month, each data track is divided into records of one month each, yielding the corresponding segmented time tracks. Taking the data type as the basis, the analysis system compares the segmented time tracks belonging to the same data type, for example comparing the segmented data tracks of file modification times one by one. Specifically, the analysis system can arrange the segmented time tracks of one data type vertically in order of backup time, so that the differences between the segments are shown more intuitively and the current data track can be compared with historical data tracks; the data track report is formed from this comparison. The data track report gives users a record of how the data state has evolved, so that data changes can be traced, problem analysis rests on evidence, and the root cause can be found quickly when the data behaves abnormally. In addition, the system can carry out deeper learning and analysis in the directions the user is interested in, providing trend analysis and marking of abnormal points. This helps users better understand the characteristics of the data, evaluate future trends and application directions, and identify risk points as early as possible so that contingency plans or avoidance measures can be drawn up and the data becomes more controllable.
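The segmentation and comparison described above can be sketched as follows, assuming a data track is a list of (date, value) pairs and the first preset time period is one calendar month. The helper names `split_by_month` and `segment_summary`, and the use of net change as the quantity compared, are assumptions made for illustration; the application does not fix a comparison metric.

```python
from collections import defaultdict
from datetime import date


def split_by_month(track):
    """Split a track of (date, value) pairs into per-month segments, oldest first."""
    segments = defaultdict(list)
    for day, value in sorted(track):
        segments[(day.year, day.month)].append((day, value))
    return [segments[key] for key in sorted(segments)]


def segment_summary(segment):
    """Net change of the value across one segment (an assumed comparison metric)."""
    return segment[-1][1] - segment[0][1]


# Made-up sample track, not taken from the application.
track = [
    (date(2020, 1, 1), 100), (date(2020, 1, 31), 190),
    (date(2020, 2, 1), 193), (date(2020, 2, 29), 280),
    (date(2020, 3, 1), 283), (date(2020, 3, 31), 600),
]
segments = split_by_month(track)
changes = [segment_summary(s) for s in segments]
```

Listing `changes` in segment order corresponds to arranging the segmented time tracks by backup time, so that a month that departs from its predecessors stands out.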

S7: Judge, from the variation amplitudes of the first data track and the second data track respectively, whether an abnormal feedback point exists in the first data track and/or the second data track, where an abnormal feedback point is file information in the first data track and/or the second data track whose current variation amplitude is greater than the normal amplitude;

S8: If an abnormal feedback point exists, mark the abnormal feedback point in a preset format in the first data track and/or the second data track, and output preset information to remind the user that the abnormal feedback point exists.

In this embodiment, the analysis system judges, from the variation amplitudes of the first data track and the second data track respectively, whether an abnormal feedback point exists in the first data track and/or the second data track. An abnormal feedback point is file information in a data track whose current variation amplitude is greater than the normal amplitude. The normal amplitude may be a specific value preset by the developer, or may be determined by the variation amplitude between the two adjacent pieces of file information that precede the current file information in the data track. Specifically, data tracks of different data types are judged in different ways. When judging the first data track of the file meta information, the analysis system calculates the difference between every two adjacent inode numbers in the track to obtain multiple change values; a change value indicates how much two adjacent inode numbers differ. The analysis system then calculates the difference between every two adjacent change values to obtain multiple change differences; a change difference expresses the magnitude of variation between two adjacent change values. The analysis system judges whether at least one change difference greater than a preset difference can be screened out from the change differences, where the preset difference is set in advance by the developer and represents the normal variation amplitude of the inode number. If at least one such change difference is found, the analysis system determines that an abnormal feedback point exists in the first data track of the file meta information; if none is found, it determines that no abnormal feedback point exists in the first data track. When judging the second data track of the file time information, the analysis system first divides the scatter diagram of the file time information according to a second preset time period to obtain several scatter diagram areas. It then compares the density of points in each pair of adjacent scatter diagram areas and judges whether the degree of difference between the two adjacent densities is within a preset range. If the difference is within the preset range, the analysis system determines that no abnormal feedback point exists in the second data track; if it is not, the analysis system determines that an abnormal feedback point exists in the second data track. After screening out an abnormal feedback point, the analysis system marks it in the corresponding data track in a preset format, for example in a specific color or at an enlarged size, so that it stands out, and outputs preset information to remind the user that an abnormal feedback point exists in the data track.

S701: Calculate the difference between every two adjacent inode numbers in the first data track to obtain several change values;

S702: Calculate the difference between every two adjacent change values to obtain multiple change differences;

S703: Judge whether at least one change difference greater than a preset difference can be screened out from the change differences;

S704: If at least one change difference greater than the preset difference can be screened out, determine that an abnormal feedback point exists in the first data track;

S705: If no change difference greater than the preset difference can be screened out, determine that no abnormal feedback point exists in the first data track.

In this embodiment, when the analysis system judges abnormal feedback points in the first data track of the file meta information, the inode numbers (that is, the numbers of inodes) in the first data track are arranged by backup time and are therefore ordered. The analysis system calculates, in order, the difference between each two adjacent inode numbers. Once all adjacent pairs have been processed it has multiple change values, which it arranges in the order of calculation; a change value indicates how much two adjacent inode numbers differ. The analysis system then calculates, in the order of the change values, the difference between each two adjacent change values. Once all adjacent change values have been processed it has multiple change differences; a change difference expresses the magnitude of variation between two change values. The analysis system screens the change differences and judges whether one or more of them is greater than the preset difference. If at least one change difference greater than the preset difference is found, then of the two adjacent inode numbers associated with that change difference, the later one has varied by more than the normal amplitude. For example, if the inode number of the backup data normally grows by 2 to 4 per day and on one day the increment reaches 10, which is clearly larger than the earlier daily variation, the analysis system can determine that an abnormal feedback point exists in the data track of the file meta information. The abnormal feedback point is the later of the two adjacent inode numbers associated with that change difference.
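Steps S701 to S705 amount to thresholding the second differences of the ordered inode counts. The following is a minimal sketch under that reading. The function name `find_abnormal_points`, the sample counts, and the threshold of 5 are hypothetical. The comparison is signed ("greater than"), as the text states; comparing absolute values instead would be a variation not described in the application.

```python
def find_abnormal_points(inode_counts, preset_difference):
    """Return indices into inode_counts that are flagged as abnormal feedback points.

    inode_counts must already be ordered by backup time.
    """
    # S701: change values between adjacent inode counts.
    changes = [b - a for a, b in zip(inode_counts, inode_counts[1:])]
    # S702: change differences between adjacent change values.
    change_diffs = [b - a for a, b in zip(changes, changes[1:])]
    # S703-S705: change_diffs[i] compares changes[i] with changes[i + 1], and
    # changes[i + 1] ends at inode_counts[i + 2], the later inode count.
    return [i + 2 for i, diff in enumerate(change_diffs) if diff > preset_difference]


# Made-up counts: daily growth of 2 to 4, then a jump of 10 on the fifth day.
counts = [100, 103, 105, 109, 119, 122]
abnormal = find_abnormal_points(counts, preset_difference=5)
```

With these sample values the change values are 3, 2, 4, 10, 3 and the change differences are -1, 2, 6, -7, so only the entry with the jump of 10 (index 4, value 119) is flagged. An empty result corresponds to the determination in S705.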

S706: Divide the second data track according to a second preset time period to obtain several scatter diagram areas;

S707: Compare the density of points in each pair of adjacent scatter diagram areas, and judge whether the degree of difference between the two adjacent densities is within a preset range;

S708: If the degree of difference between the two adjacent densities is within the preset range, determine that no abnormal feedback point exists in the second data track;

S709: If the degree of difference between the two adjacent densities is not within the preset range, determine that an abnormal feedback point exists in the second data track.

In this embodiment, when the analysis system judges abnormal feedback points in the second data track of the file time information, the second data track is a scatter diagram, so the analysis can be based on how densely the points are distributed in it. Specifically, the analysis system divides the second data track of the file time information, that is, the scatter diagram, according to the second preset time period. For example, if the second preset time period is 24 hours, the scatter diagram is divided into 24-hour sections, forming multiple scatter diagram areas, each of which shows the file time information within one 24-hour period. The analysis system compares the density of points in each pair of adjacent scatter diagram areas to analyze how frequently the file time information changes across adjacent 24-hour periods, and thereby judges whether the degree of difference between the point densities of the two adjacent areas is within the preset range. The preset range is set by the developer and serves as the criterion for judging the degree of difference in point density. If the degree of difference between two adjacent densities is within the preset range, the analysis system can determine that no abnormal feedback point exists in the second data track. If it is not within the preset range, the analysis system can determine that an abnormal feedback point exists in the second data track; the abnormal feedback points are the points inside the scatter diagram area whose density difference exceeds the preset range.
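A minimal sketch of steps S706 to S709 follows. It assumes that "density" means the number of points per 24-hour calendar day and that the "degree of difference" is the absolute difference between adjacent counts. Neither definition is given in the application, and the names `daily_densities`, `abnormal_days`, and `preset_range` are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def daily_densities(timestamps):
    """Count points per calendar day, including empty days between first and last."""
    if not timestamps:
        return []
    counts = Counter(ts.date() for ts in timestamps)
    first, last = min(counts), max(counts)
    days = [first + timedelta(days=i) for i in range((last - first).days + 1)]
    return [(day, counts.get(day, 0)) for day in days]


def abnormal_days(densities, preset_range):
    """Days whose point count differs from the previous day's by more than preset_range."""
    return [
        day
        for (_, prev), (day, cur) in zip(densities, densities[1:])
        if abs(cur - prev) > preset_range
    ]


# Made-up sample: 3 points, then 4 points, then 12 points on consecutive days.
stamps = (
    [datetime(2020, 3, 1, h) for h in (9, 12, 15)]
    + [datetime(2020, 3, 2, h) for h in (9, 11, 13, 15)]
    + [datetime(2020, 3, 3, h) for h in range(8, 20)]
)
densities = daily_densities(stamps)
flagged = abnormal_days(densities, preset_range=5)
```

Here the third day is flagged because its count differs from the previous day's by 8, which is outside the assumed range of 5. The points falling on a flagged day would then be the abnormal feedback points to mark.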

The backup data analysis method based on file information provided by this embodiment classifies the file information of the backup data to obtain the file meta information and the file time information of the backup data, then processes the file meta information with a pre-built KNN algorithm model to obtain the data track corresponding to the file meta information, and plots the file time information as a scatter diagram with time as the reference to obtain the data track corresponding to the file time information. Finally, the data tracks are combined and organized to obtain the data track report of the backup data. The data track report provides a record of how the data state has evolved, so that changes in the data can be traced, which helps users better understand the characteristics of the data and evaluate its future trends and application directions.

Referring to FIG. 2, an embodiment of the present application further provides a backup data analysis apparatus based on file information, including:

an acquisition module 1, configured to acquire file information of backup data from a data lake, where the backup data is a copy of production data and the data lake is isolated from the production environment;

a classification module 2, configured to classify the file information by data type to obtain several pieces of file meta information and several pieces of file time information;

a segmentation module 3, configured to divide the file meta information by backup time to obtain multiple pieces of sub-file meta information;

a prediction module 4, configured to input the pieces of sub-file meta information, in order of backup time, into a pre-built KNN algorithm model and predict the inode number corresponding to each piece of sub-file meta information;

a processing module 5, configured to retrieve the preset rules corresponding to the file meta information and the file time information respectively, and process the inode numbers and the file time information to obtain a first data track corresponding to the file meta information and a second data track corresponding to the file time information;

a generation module 6, configured to segment and compare the first data track and the second data track respectively to obtain a data track report of the backup data.

In this embodiment, for the implementation of the functions and roles of the acquisition module 1, classification module 2, segmentation module 3, prediction module 4, processing module 5 and generation module 6 of the above backup data analysis apparatus, see the implementation of the corresponding steps S1 to S6 in the above backup data analysis method based on file information; details are not repeated here.

Further, the processing module 5 includes:

In this embodiment, for the implementation of the functions and roles of the search unit and the arrangement unit of the above backup data analysis apparatus, see the implementation of the corresponding steps S501 to S502 in the above backup data analysis method based on file information; details are not repeated here.

In this embodiment, for the implementation of the functions and roles of the classification unit and the drawing unit of the above backup data analysis apparatus, see the implementation of the corresponding steps S503 to S504 in the above backup data analysis method based on file information; details are not repeated here.

Further, the generation module 6 includes:

a generation unit, configured to compare the segmented time tracks belonging to the same data type and generate the data track report.

In this embodiment, for the implementation of the functions and roles of the second dividing unit and the generation unit of the above backup data analysis apparatus, see the implementation of the corresponding steps S601 to S602 in the above backup data analysis method based on file information; details are not repeated here.

a judging module 7, configured to judge, from the variation amplitudes of the first data track and the second data track respectively, whether an abnormal feedback point exists in the first data track and/or the second data track, where an abnormal feedback point is file information in the first data track and/or the second data track whose current variation amplitude is greater than the normal amplitude;

an identification module 8, configured to, if an abnormal feedback point exists, mark the abnormal feedback point in a preset format in the first data track and/or the second data track, and output preset information to remind the user that the abnormal feedback point exists.

In this embodiment, for the implementation of the functions and roles of the judging module 7 and the identification module 8 of the above backup data analysis apparatus, see the implementation of the corresponding steps S7 to S8 in the above backup data analysis method based on file information; details are not repeated here.

Further, the judging module 7 includes:

In this embodiment, for the implementation of the functions and roles of the third dividing unit, the second judging unit, the third determining unit and the fourth determining unit of the above backup data analysis apparatus, see the implementation of the corresponding steps S701 to S709 in the above backup data analysis method based on file information; details are not repeated here.

The backup data analysis apparatus based on file information provided by this embodiment classifies the file information of the backup data to obtain the file meta information and the file time information of the backup data, then processes the file meta information with a pre-built KNN algorithm model to obtain the data track corresponding to the file meta information, and plots the file time information as a scatter diagram with time as the reference to obtain the data track corresponding to the file time information. Finally, the data tracks are combined and organized to obtain the data track report of the backup data. The data track report provides a record of how the data state has evolved, so that changes in the data can be traced, which helps users better understand the characteristics of the data and evaluate its future trends and application directions.

Referring to FIG. 3, an embodiment of the present application further provides a computer device. The computer device may be a server, and its internal structure may be as shown in FIG. 3. The computer device includes a processor, a memory, a network interface and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device stores data such as the backup data. The network interface of the computer device is used to communicate with external terminals over a network connection. When executed by the processor, the computer program implements a backup data analysis method based on file information.

The processor executes the steps of the above backup data analysis method based on file information:

An embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements a backup data analysis method based on file information, the method specifically being:

Those of ordinary skill in the art will understand that all or part of the processes of the above method embodiments can be carried out by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database or other media provided in this application and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

It should be noted that, in this document, the terms "comprising", "including" and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, apparatus, article or method comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, apparatus, article or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, apparatus, article or method that includes the element.

The above are only preferred embodiments of the present application and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of the present application.

Claims

1. A backup data analysis method based on file information is characterized by comprising the following steps:

acquiring file information of backup data from a data lake, wherein the backup data is a copy of production data, and the data lake is isolated from a production environment;

classifying the file information according to data types to obtain a plurality of file meta information and a plurality of file time information;

dividing the file meta-information according to backup time to obtain a plurality of sub-file meta-information;

sequentially inputting the meta-information of each subfile into a pre-constructed KNN algorithm model according to the backup time, and predicting to obtain the inode number corresponding to the meta-information of each subfile;

calling preset rules respectively corresponding to the file meta-information and the file time information, and processing the inode number and the file time information to obtain a first data track corresponding to the file meta-information and a second data track corresponding to the file time information;

and respectively segmenting and comparing the first data track and the second data track to obtain a data track report of the backup data.

2. The method according to claim 1, wherein the step of retrieving a preset rule corresponding to the file meta-information, processing each inode number, and obtaining a first data track corresponding to the file meta-information includes:

searching and obtaining backup time corresponding to each inode number according to the corresponding relation between each inode number and each subfile meta-information;

and sequentially arranging the inode numbers according to the backup time corresponding to each inode number to obtain the first data track corresponding to the file meta-information.

3. The method for analyzing backup data based on file information according to claim 1, wherein the step of retrieving a preset rule corresponding to the file time information, processing each file time information, and obtaining a second data track corresponding to the file time information comprises:

classifying the file time information according to time types to obtain a plurality of sub-file time information;

and drawing a scatter diagram by taking time as a reference according to the time information of each subfile to obtain the second data track corresponding to the time information of the file.

4. The method for analyzing backup data based on file information according to claim 1, wherein the step of obtaining the data track report of the backup data by dividing and comparing the first data track and the second data track respectively comprises:

respectively dividing the first data track and the second data track according to a first preset time period to obtain a plurality of segmented time tracks;

and comparing the segmented time tracks belonging to the same data type to generate the data track report.

5. The method according to claim 2 or 3, wherein the step of retrieving the preset rules corresponding to the file meta-information and the file time information, respectively, and processing the inode numbers and the file time information to obtain the first data track corresponding to the file meta-information and the second data track corresponding to the file time information is followed by the step of:

judging whether an abnormal feedback point exists in the first data track and/or the second data track according to the variation amplitude of the first data track and the second data track respectively, wherein the abnormal feedback point is file information of which the current variation amplitude is larger than the normal amplitude in the first data track and/or the second data track;

if the abnormal feedback points exist, marking the abnormal feedback points in a preset format in the first data track and/or the second data track, and outputting preset information to remind a user of the existence of the abnormal feedback points.

6. The method for analyzing backup data based on file information according to claim 5, wherein said step of determining whether an abnormal feedback point exists in the first data track according to the variation amplitude of the first data track comprises:

calculating the difference between all adjacent two inode numbers in the first data track to obtain a plurality of change values;

calculating the difference between all two adjacent change values to obtain a plurality of change differences;

judging whether at least one change difference value larger than a preset difference value can be obtained by screening from all the change difference values;

if at least one change difference value larger than a preset difference value can be obtained through screening, judging that an abnormal feedback point exists in the first data track;

and if at least one change difference value larger than a preset difference value cannot be screened, judging that an abnormal feedback point does not exist in the first data track.

7. The method for analyzing backup data based on file information according to claim 5, wherein the step of determining whether an abnormal feedback point exists in the second data track according to the variation amplitude of the second data track further comprises:

dividing the second data track according to a second preset time period to obtain a plurality of scatter diagram areas;

respectively comparing the densities of scatter points in two adjacent scatter diagram areas, and judging whether the difference degree of the two adjacent densities is within a preset range;

if the difference degree of the two adjacent densities is within a preset range, judging that no abnormal feedback point exists in the second data track;

and if the difference degree of the two adjacent densities is not within a preset range, judging that an abnormal feedback point exists in the second data track.
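The judgment in claim 7 can be sketched in the same way. The sketch assumes that the second data track is a list of file timestamps in seconds, that the scatter diagram areas are equal-length windows starting at the earliest timestamp, that density is the number of points per window, and that the difference degree is the absolute difference between two adjacent densities. None of these definitions is fixed by the claim, and all names (`scatter_densities`, `period`, `max_difference`) are hypothetical.

```python
from typing import List, Sequence


def scatter_densities(timestamps: Sequence[float], period: float) -> List[int]:
    """Count scatter points in each consecutive window of length `period`."""
    if period <= 0:
        raise ValueError("period must be positive")
    if not timestamps:
        return []
    start = min(timestamps)
    n_areas = int((max(timestamps) - start) // period) + 1
    counts = [0] * n_areas
    for t in timestamps:
        counts[int((t - start) // period)] += 1
    return counts


def second_track_has_abnormal_point(
    timestamps: Sequence[float], period: float, max_difference: int
) -> bool:
    """Return True if two adjacent densities differ by more than allowed.

    The preset range is assumed to be [0, max_difference].
    """
    densities = scatter_densities(timestamps, period)
    return any(
        abs(b - a) > max_difference for a, b in zip(densities, densities[1:])
    )
```

Because the windows have equal length, comparing point counts is equivalent to comparing densities; a ratio-based difference degree would be an equally plausible reading of the claim.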

8. A backup data analysis apparatus based on file information, comprising:

the acquisition module is used for acquiring file information of backup data from a data lake, wherein the backup data is a copy of production data, and the data lake is isolated from a production environment;

the classification module is used for classifying the file information according to data types to obtain a plurality of file meta information and a plurality of file time information;

the dividing module is used for dividing the file meta-information according to the backup time to obtain a plurality of sub-file meta-information;

the prediction module is used for sequentially inputting the subfile meta-information into a pre-constructed KNN algorithm model according to the backup time, and predicting to obtain the inode number corresponding to the subfile meta-information;

the processing module is used for calling preset rules corresponding to the file meta-information and the file time information respectively, and processing the inode numbers and the file time information to obtain a first data track corresponding to the file meta-information and a second data track corresponding to the file time information;

and the generating module is used for respectively segmenting and comparing the first data track and the second data track to obtain a data track report of the backup data.

9. A computer device comprising a memory and a processor, the memory having stored therein a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.