CN118535441A - Disk fault prediction method, medium and device based on meta learning

Info

Publication number: CN118535441A
Application number: CN202410730788.5A
Authority: CN
Inventors: 丁建立; 梁烨文; 李静
Original assignee: Civil Aviation University of China
Current assignee: Civil Aviation University of China
Priority date: 2024-06-06
Filing date: 2024-06-06
Publication date: 2024-08-23

Abstract

The present invention relates to the field of disk failure prediction, and in particular to a disk failure prediction method, medium, and device based on meta-learning. The method comprises the following steps: acquiring a feature matrix of a target disk; and inputting the feature matrix into a target neural network model to generate a prediction result for the target disk, where the target neural network model comprises an LSTM model trained with a model-agnostic meta-learning (MAML) algorithm. MAML guides the gradient descent and parameter updates of the LSTM so that the model adapts quickly to new or changing tasks, and in particular to the prediction problem for small-sample disks. Meta-learning can use existing disk data to train a model that adapts quickly to small amounts of new data, thereby reducing reliance on large-scale failure data. The meta-learning framework can also handle the heterogeneity among disks of different models, improving prediction performance on disk models not seen during training.

Description

A disk failure prediction method, medium and device based on meta-learning

Technical Field

The present invention relates to the field of disk failure prediction, and in particular to a disk failure prediction method, medium and device based on meta-learning.

Background Art

Disk failure prediction is a key part of computer hardware maintenance, and it directly affects the operating efficiency and data security of data centers. Traditional prediction methods build prediction models by analyzing large amounts of historical data and performing complex feature engineering, and they usually focus on failure prediction for homogeneous disks. In a running data center, however, new disk models continually replace damaged disks, so the data center ends up containing many heterogeneous disks from different manufacturers or of different models. This situation is particularly common in large data centers. In addition, as the storage system is continuously updated, new disk models are added, and some disk models end up far less numerous than others. The disks arising from these two situations are small-sample disks, which usually refers to a disk model with a population of about 1,500 disks.

Therefore, in an actual data center environment, failure prediction for small-sample disks faces two challenges. First, because multiple disk models from different manufacturers are deployed, the disks differ considerably in operating characteristics and failure modes, which makes it difficult for a single general model to apply effectively to all disk models. Second, for disk models newly launched on the market, it is usually difficult to accumulate enough failure data in a short period of time to train a traditional machine learning model, so failure prediction for these new models performs poorly in the initial stage.

Summary of the Invention

In view of the above technical problems, the technical solution adopted by the present invention is as follows.

According to one aspect of the present invention, a disk failure prediction method based on meta-learning is provided, and the method comprises the following steps:

Obtain a feature matrix of the target disk, the feature matrix including the time series features and SMART attribute features of the disk;

Input the feature matrix into the target neural network model to generate a prediction result corresponding to the target disk; the target neural network model includes an LSTM model trained with a model-agnostic meta-learning algorithm;

The meta-tasks used in meta-learning training are constructed as follows:

According to the SMART attribute values in the disk log, generate the disk sample sets corresponding to the different disk models, where $D_m = \{(x^m_1, y^m_1), (x^m_2, y^m_2), (x^m_3, y^m_3), \ldots, (x^m_n, y^m_n)\}$; $D_m$ is the disk sample set corresponding to the m-th disk model; $x^m_n$ is the feature matrix of the n-th disk sample in $D_m$; $y^m_n$ is the label corresponding to $x^m_n$ in $D_m$; and n is the total number of disk samples in $D_m$;

Obtain, from $D_m$, the support sets and query sets of the training tasks among the multiple meta-tasks;

Obtain, from $D_s$, the support sets and query sets of the test tasks among the multiple meta-tasks; $D_s$ is the disk sample set corresponding to the s-th disk model, and $D_m$ and $D_s$ correspond to different disk models;

Use the support sets and query sets of the training task and the test task in each meta-task to perform inner-loop training on the LSTM model;

The support set and query set of the i-th meta-task Task_i are obtained as follows:

Randomly select one failed disk sample from the corresponding disk sample set, and calculate the KLD value between it and each of the other failed disk samples in the disk sample set;

Sort the KLD values in ascending order to form a KLD sequence; the KLD sequence contains 2K-1 KLD values, where K is a positive integer;

Use the failed disk samples corresponding to the first K-1 KLD values in the KLD sequence, together with the randomly selected failed disk sample, as the failure sample set of the support set;

Randomly select K healthy disk samples from the corresponding disk sample set as the healthy sample set of the support set.

According to a second aspect of the present invention, a non-transitory computer-readable storage medium is provided. The storage medium stores a computer program which, when executed by a processor, implements the above disk failure prediction method based on meta-learning.

According to a third aspect of the present invention, an electronic device is provided, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the above disk failure prediction method based on meta-learning when executing the computer program.

The present invention has at least the following beneficial effects:

With the meta-learning method, a small number of disk samples is enough to adapt fairly quickly to a new disk model. Specifically, Model-Agnostic Meta-Learning (MAML) guides the gradient descent and parameter updates of the LSTM so that the model adapts quickly to new or changing tasks; that is, it adapts more quickly to drift in disk data and to the prediction problem for new (small-sample) disk models. Meta-learning can use existing disk data to train a model that adapts quickly to a small amount of new data, thereby reducing dependence on large-scale failure data. The meta-learning framework can also handle the heterogeneity between disks of different models, improving the prediction performance of the model on unseen disk models.

In addition, this scheme uses KLD (Kullback-Leibler Divergence, relative entropy) to select the disk samples for each meta-task, assigning samples to tasks according to their similarity as measured by KLD. Specifically, disk samples with smaller KLD values are selected for the support set, because their feature distributions are closer to that of the reference sample and they are easier for the model to learn. Samples with slightly larger KLD values are selected for the query set, to test how well the model adapts to samples with slightly different feature distributions. This helps the model better capture the differences and similarities between disk models, thereby improving its generalization ability and adaptability.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention. Those of ordinary skill in the art can obtain other drawings from these drawings without creative work.

FIG. 1 is a flow chart of a disk failure prediction method based on meta-learning provided in an embodiment of the present invention.

DETAILED DESCRIPTION

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative work fall within the scope of protection of the present invention.

As a possible embodiment of the present invention, as shown in FIG. 1, a disk failure prediction method based on meta-learning is provided, and the method includes the following steps:

S100: Obtain a feature matrix of a target disk, where the feature matrix includes time series features and SMART attribute features of the disk.

Specifically, the feature matrix has the same form as in S231; that is, it is built from the target SMART attributes in the most recent 14 days of historical records.

Specifically, each collected disk SMART sample contains more than 70 SMART attributes. However, many of these attributes have a large number of missing values, and some are only weakly related to disk failure prediction because they do not change noticeably when a disk fails. Therefore, before modeling, only attributes closely related to disk health are selected. In addition, the data set used in this embodiment contains multiple disk models, so the selected SMART attributes must be common to the different disk models. The target SMART attributes finally selected are shown in Table 1.

Table 1

The meanings of the attribute ID, attribute name, and attribute type in Table 1 can be found in existing SMART attribute tables. Each SMART attribute contains a tuple of five elements: ID, Normalized, Raw, Threshold, and Worst. ID is the identification number of the SMART attribute; Normalized is the normalized value of the attribute, calculated from the raw value by the disk manufacturer according to a specific algorithm; Raw is the raw value of the attribute statistic; Threshold is the alarm threshold of the attribute; and Worst is the lowest (worst) value recorded for the attribute. In this disk failure prediction embodiment, given the disk data set provided by the data center, the focus is on three elements: ID, Normalized, and Raw.

Since different SMART attributes have different value ranges, the SMART attribute values are normalized so that they can be compared fairly. The Min-Max normalization used in this embodiment is:

$$x' = \frac{x - x_{min}}{x_{max} - x_{min}}$$

Here, $x$ is the attribute value of the SMART feature, $x'$ is its normalized value, and $x_{min}$ and $x_{max}$ are the minimum and maximum values of that attribute, respectively.
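As a minimal sketch of the Min-Max normalization described above, the following Python function scales the values of one SMART attribute into the range 0 to 1 using x' = (x - x_min) / (x_max - x_min). The function name `min_max_normalize`, the sample readings, and the handling of a constant attribute are illustrative assumptions, not part of the embodiment.

```python
def min_max_normalize(values):
    """Scale a list of numbers to [0, 1] with x' = (x - x_min) / (x_max - x_min)."""
    x_min, x_max = min(values), max(values)
    span = x_max - x_min
    if span == 0:
        # Assumption: a constant attribute carries no information, so map it to 0.0.
        return [0.0 for _ in values]
    return [(x - x_min) / span for x in values]


# Hypothetical raw readings of one SMART attribute across several samples.
raw_values = [10, 20, 15, 30]
normalized = min_max_normalize(raw_values)
print(normalized)
```

In practice the minimum and maximum would presumably be computed per attribute over the training data and then reused for new disks, but the source does not specify this.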

S200: Input the feature matrix into the target neural network model to generate the prediction result corresponding to the target disk. The target neural network model includes an LSTM model trained with a model-agnostic meta-learning algorithm. LSTM (Long Short-Term Memory) is a special type of recurrent neural network (RNN) designed to solve the long-term dependency problem; that is, when processing time series and other sequence data, it can retain information over long spans without forgetting it.

Specifically, the target neural network model also includes a dropout layer and a fully connected layer with a Sigmoid activation function, and the LSTM model includes four stacked LSTM layers. The LSTM model is followed by the dropout layer, and the dropout layer is followed by the fully connected layer with the Sigmoid activation function. Dropout is a regularization technique in deep learning networks, used mainly to reduce overfitting and improve the generalization ability of the model.

The last layer uses the Sigmoid function because its output lies in the range 0 to 1, which makes this activation function well suited to binary classification models. Unlike traditional feed-forward neural networks, an LSTM operates on time series vectors, so it can capture the history of a disk's health status, making it a good fit for disk failure prediction tasks.

In meta-learning, the Model-Agnostic Meta-Learning (MAML) algorithm is one of the effective methods for solving the small-sample problem and is an optimization-based meta-learning method. MAML provides a meta-learner to train a base learner, where the meta-learner is the main learning component in MAML and the base learner is the learner that is trained on a data set and used on test tasks. The key to MAML is to train a set of initialization parameters and then adapt quickly to a new task with a limited amount of data by applying one or more gradient update steps on top of those initial parameters. The model is meta-trained on N-way K-shot tasks (training tasks) so that it acquires "prior knowledge" (the initialization parameters), and this "prior knowledge" improves performance on new N-way K-shot tasks. During MAML training, optimization is divided into an inner loop and an outer loop: the inner loop is the training process that develops the basic ability to perform a given task, while the outer loop is the meta-training process that strengthens generalization across tasks.

N-way K-shot is a common experimental setting for meta-learning. In this embodiment, the settings are as follows. When constructing a classification task, N is fixed to 2, representing the two categories failed and healthy. For each category, K is set to 5, 10, or 15 samples per task. The inner-loop learning rate α and the outer-loop learning rate β are 0.001 and 0.0001, respectively. The number of gradient updates in the inner loop is set to 10. The maximum number of training iterations is set to 1000.

The goal of MAML is to find initial parameters θ that suit most tasks. The initial parameters θ are updated and improved over multiple training iterations, so that when the model encounters a new task it performs well after very few learning steps. The main training procedure is as follows:

1. Inner loop. For each task Task_i, perform a limited number of gradient descent steps on the support set of Task_i to minimize the loss function $\mathcal{L}_{Task_i}(f_\theta)$ on that task and obtain the task-specific parameters θ′_i, as shown in formula (1).

$$\theta'_i = \theta - \alpha \nabla_\theta \mathcal{L}_{Task_i}(f_\theta) \tag{1}$$

Here, α is the inner-loop learning rate and $\nabla_\theta \mathcal{L}_{Task_i}(f_\theta)$ is the gradient of the loss function with respect to the parameters θ. The updated parameters θ′_i are then used to evaluate performance on the corresponding query set.

2. Outer loop. For all tasks, use the adapted parameters θ′_i obtained from the inner loop to calculate the query-set loss $\mathcal{L}_{Task_i}(f_{\theta'_i})$ of each task, and use the losses of all tasks to update the original parameters θ, as shown in formula (2).

$$\theta \leftarrow \theta - \beta \nabla_\theta \sum_{i=1}^{R} \mathcal{L}_{Task_i}(f_{\theta'_i}) \tag{2}$$

Here, β is the meta-learning rate and R is the number of tasks.

3. Repeat the inner and outer loops until the initial parameters of the model show good generalization ability in a multi-task learning environment.
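The inner and outer loops above can be sketched in plain Python. This is an illustration under loud assumptions, not the embodiment's implementation: a two-parameter logistic model stands in for the LSTM, the outer update uses the first-order approximation of MAML (the query-set gradient is taken at θ′_i and applied to θ, ignoring second-order terms), and the toy tasks and learning rates are invented for readability. The embodiment itself states α = 0.001, β = 0.0001, 10 inner steps, and at most 1000 iterations. All function names are hypothetical.

```python
import math


def predict(theta, x):
    """Logistic model standing in for the LSTM (assumption for this sketch)."""
    z = sum(w * xi for w, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))


def loss_grad(theta, data):
    """Gradient of the mean binary cross-entropy loss with respect to theta."""
    grad = [0.0] * len(theta)
    for x, y in data:
        err = predict(theta, x) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj / len(data)
    return grad


def inner_loop(theta, support, alpha, steps):
    """Formula (1), repeated `steps` times on one task's support set."""
    adapted = list(theta)
    for _ in range(steps):
        g = loss_grad(adapted, support)
        adapted = [w - alpha * gj for w, gj in zip(adapted, g)]
    return adapted


def outer_step(theta, tasks, alpha, beta, steps):
    """Formula (2) with a first-order approximation: sum the query-set
    gradients evaluated at each task's adapted parameters theta'_i."""
    meta_grad = [0.0] * len(theta)
    for support, query in tasks:
        adapted = inner_loop(theta, support, alpha, steps)
        g = loss_grad(adapted, query)
        meta_grad = [m + gj for m, gj in zip(meta_grad, g)]
    return [w - beta * m for w, m in zip(theta, meta_grad)]


# Toy tasks: each sample is ([bias, feature], label); label 1 means failed.
tasks = [
    ([([1.0, 0.9], 1), ([1.0, 0.1], 0)], [([1.0, 0.8], 1), ([1.0, 0.2], 0)]),
    ([([1.0, 0.7], 1), ([1.0, 0.3], 0)], [([1.0, 0.6], 1), ([1.0, 0.4], 0)]),
]

theta = [0.0, 0.0]
for _ in range(200):  # illustrative iteration count
    theta = outer_step(theta, tasks, alpha=0.1, beta=0.05, steps=10)

# Adapt the meta-learned initialization to one task with a few gradient steps.
adapted = inner_loop(theta, tasks[0][0], alpha=0.1, steps=10)
```

Full MAML differentiates through the inner-loop updates, which in practice calls for an automatic differentiation framework; the first-order variant is used here only to keep the sketch dependency-free.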

The model must not only handle the complexity of the time series data in the disk data set but also adapt quickly, through meta-learning, to new or changing tasks, i.e., new disk models. In this embodiment, an LSTM is therefore used to parse the SMART time series data and capture the temporal information in the disk SMART data, and the fast adaptability of the meta-learning method MAML is used to train the target neural network model.

S201: Generate the disk sample sets corresponding to the different disk models according to the SMART attribute values in the disk log, where $D_m = \{(x^m_1, y^m_1), (x^m_2, y^m_2), (x^m_3, y^m_3), \ldots, (x^m_n, y^m_n)\}$. $D_m$ is the disk sample set corresponding to the m-th disk model, $x^m_n$ is the feature matrix of the n-th disk sample in $D_m$, $y^m_n$ is the label corresponding to $x^m_n$ in $D_m$, and n is the total number of disk samples in $D_m$.

Before the disk failure prediction meta-tasks are constructed, the statistical features and time series features of each disk's SMART attributes must be extracted. Some earlier approaches modeled a single snapshot of the SMART attributes as training data, ignoring how the SMART attribute values change over time, which led to low disk failure prediction accuracy. In addition, an LSTM requires input in the form of a 3D array or tensor. Therefore, this embodiment extracts the time series features of the disk SMART attributes.

Specifically, S201 includes:

S211: If the disk is healthy and has complete log records for a whole year, add its log records to the healthy data pool of the corresponding disk model.

S221: Sample the monthly log records of a subset of disks from the healthy data pool, evenly across months.

S231: From the monthly log records, obtain the target SMART attribute values of each disk for 14 consecutive days to generate an N×14 feature matrix for each disk, where N is the number of target SMART attribute values collected for each disk each day.

S241: If the disk is healthy but has only partial log records within the year, obtain the target SMART attribute values for the first 14 days of its log records.

S251: If the disk is a failed disk, obtain the target SMART attribute values for the period from T-13 to T, where T is the date on which the disk failed.

S261: Set the labels of the samples in the period from T-13 to T-7 to the healthy label.

S271: Set the labels of the samples in the period from T-6 to T to the failure label.

To obtain the time series features of the disk SMART attributes, a subset of each disk's records equal in length to the sequence length required by the LSTM input is extracted: 14 consecutive days of SMART data are collected for each disk, keyed by disk serial number. For a failed disk that failed on day T, the SMART data for the continuous period from T-13 to T is collected. For healthy disks, disks are sampled evenly across months according to the monthly distribution of healthy disks, and 14 consecutive days of SMART data are collected for each sampled disk. Existing research indicates that most SMART attributes change significantly about 7 days before a disk failure is discovered. Therefore, for a failed disk, the samples within the 7 days before the failure are labeled as failed (label y=1) and the other samples of that disk are labeled as healthy (y=0); for a healthy disk, all consecutively collected samples are labeled as healthy (y=0). Finally, each disk yields an N×14 feature matrix, where N is the number of SMART attribute values collected for each disk each day.
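The labeling rule for a failed disk can be sketched as follows: the 14-day window runs from T-13 to T, days T-13 through T-7 are labeled healthy (0), and days T-6 through T are labeled failed (1). The function name `label_failed_disk_window` and the example failure date are hypothetical.

```python
from datetime import date, timedelta

WINDOW_DAYS = 14  # sequence length fed to the LSTM, per the text
FAULT_DAYS = 7    # days T-6 .. T carry the failure label


def label_failed_disk_window(failure_date):
    """Return (day, label) pairs for days T-13 .. T of a failed disk."""
    labeled = []
    for offset in range(WINDOW_DAYS - 1, -1, -1):  # 13, 12, ..., 0
        day = failure_date - timedelta(days=offset)
        label = 1 if offset < FAULT_DAYS else 0
        labeled.append((day, label))
    return labeled


# Hypothetical failure date T.
window = label_failed_disk_window(date(2024, 3, 20))
labels = [label for _, label in window]
print(labels)
```

For a healthy disk, the same 14-day window would simply carry label 0 on every day.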

S202: Obtain, from $D_m$, the support sets and query sets of the training tasks among the multiple meta-tasks.

S203: Obtain, from $D_s$, the support sets and query sets of the test tasks among the multiple meta-tasks. $D_s$ is the disk sample set corresponding to the s-th disk model, and $D_m$ and $D_s$ correspond to different disk models.

In this embodiment, meta-training first requires constructing the training tasks and the test tasks separately. The support sets and query sets of both the training tasks and the test tasks are 2-way K-shot.

S204: Use the support sets and query sets of the training task and the test task in each meta-task to perform inner-loop training on the LSTM model.

The support set and query set of the i-th meta-task Task_i are obtained as follows:

S214: Randomly select one failed disk sample from the corresponding disk sample set, and calculate the KLD value between it and each of the other failed disk samples in the disk sample set. Specifically, the KLD of any two disk samples is obtained with the relative entropy algorithm.

KLD stands for Kullback-Leibler Divergence, also known as relative entropy, and measures how closely two probability distributions match. In general, the larger the KLD value, the greater the difference between the two distributions. Therefore, this embodiment uses KLD to select the disk samples for each task, which helps the model better capture the differences and similarities between disk models and thereby improves its generalization ability and adaptability.

Specifically, probability distributions can be computed from the SMART attribute values for disks of the same model and for disks of different models, and the KLD values between models can be calculated to measure how similar the distributions are.
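A minimal sketch of a KLD computation between two sets of normalized SMART attribute values is shown below. The source does not say how the probability distributions are estimated, so the fixed-width histogram with add-one smoothing is an assumption, as are the function names and the sample values. The divergence itself follows the standard definition KL(P || Q) = Σ p_i log(p_i / q_i).

```python
import math


def histogram(values, bins, lo, hi):
    """Estimate a discrete distribution over [lo, hi] with add-one smoothing.

    The smoothing (an assumption of this sketch) keeps every bin non-zero,
    which avoids division by zero and log(0) in the divergence.
    """
    counts = [1.0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1.0
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical normalized values of one SMART attribute for two disk samples.
p = histogram([0.1, 0.2, 0.2, 0.9], bins=4, lo=0.0, hi=1.0)
q = histogram([0.1, 0.8, 0.9, 0.9], bins=4, lo=0.0, hi=1.0)
print(kl_divergence(p, q))
```

Note that KLD is not symmetric: `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ, so the direction (reference sample first or second) has to be fixed consistently when ranking samples.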

S224: Sort the KLD values in ascending order to form a KLD sequence. The KLD sequence contains 2K-1 KLD values, where K is a positive integer.

S234: Use the failed disk samples corresponding to the first K-1 KLD values in the KLD sequence, together with the randomly selected failed disk sample, as the failure sample set of the support set.

S244: Use the failed disk samples corresponding to the KLD values after the (K-1)-th value in the KLD sequence as the failure sample set of the query set.

In existing optimization-based meta-learning, the samples for each task are usually assigned by random selection, without considering the order of the tasks. Such random task arrangement can introduce a high degree of randomness, leaving some tasks directly correlated while others are almost uncorrelated. To learn representative parameters effectively and achieve better generalization when predicting failures of new disk models, the learning order of the tasks needs to be arranged. Moreover, analysis of the disk data set shows that even failed disks of the same model differ somewhat in their SMART attribute distributions. Therefore, this embodiment constructs the disk failure prediction meta-tasks based on KLD, so that the model starts learning from similar tasks and gradually moves on to disk sample tasks with larger distribution differences (i.e., larger KLD values). This progressive learning approach helps the model build and strengthen its generalization ability step by step, and also helps prevent early overfitting.

In this embodiment, constructing the meta-tasks begins with defining the support set and query set of each task within a single disk model. Disk samples are selected according to their KLD values, so that samples are assigned to tasks by similarity. Specifically, disk samples with smaller KLD values are selected for the support set, because their feature distributions are closer to that of the reference sample and they are easier for the model to learn. Samples with slightly larger KLD values are selected for the query set, to test how well the model adapts to samples with slightly different feature distributions.

S254: Randomly select K healthy disk samples from the corresponding disk sample set as the healthy sample set of the support set.
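Steps S214 to S254 can be sketched as follows, assuming the KLD values between the randomly chosen failed sample (the anchor) and the other failed samples have already been computed. The function names, sample identifiers, and KLD values are hypothetical. The sketch keeps the 2K-1 smallest KLD values, puts the anchor plus the K-1 most similar samples in the support set, and puts the remaining K samples in the query set.

```python
import random


def build_fault_split(fault_klds, anchor_id, k):
    """Split failed samples into support and query sets by ascending KLD.

    fault_klds maps a sample id to its KLD relative to the anchor sample
    (the anchor itself is not included in the mapping).
    """
    ranked = sorted(fault_klds, key=fault_klds.get)[: 2 * k - 1]
    support = [anchor_id] + ranked[: k - 1]  # anchor + K-1 most similar samples
    query = ranked[k - 1:]                   # the next K samples, less similar
    return support, query


def sample_healthy(healthy_ids, k, seed=0):
    """Randomly pick K healthy samples for the support set (S254)."""
    return random.Random(seed).sample(healthy_ids, k)


# Hypothetical KLD values relative to the anchor sample "f0".
klds = {"f1": 0.30, "f2": 0.05, "f3": 0.12, "f4": 0.50, "f5": 0.20, "f6": 0.90}
support, query = build_fault_split(klds, anchor_id="f0", k=3)
healthy = sample_healthy(["h1", "h2", "h3", "h4", "h5"], k=3)
print(support, query)
```

With K = 3, both the support set and the query set end up with three failed samples, which matches the 2-way K-shot setting. The source does not state how the healthy samples of the query set are chosen, so the sketch covers only the support set's healthy samples.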

Correspondingly, in this embodiment the disk data can be divided into multiple tasks according to disk model, where each task includes a support set and a query set. The LSTM serves as the base learner, and its parameters are updated continually. During training, the parameters θ are randomly initialized and the support-set disk data of task Task_i is fed to the LSTM; several gradient descent updates are performed, the loss is calculated, and the parameters are updated to θ′_i according to formula (1). The updated parameters θ′_i are then used to evaluate the query set of each task, the total loss of all tasks on their query sets is calculated, and θ is updated according to formula (2). Finally, during testing, a support set consisting of a small amount of disk data from another disk model is used to fine-tune the model parameters, and predictions are made for the disk data in the query set.

The following experimental data evaluate the prediction performance of the meta-learning-based disk failure prediction method provided in this embodiment (hereinafter MLDF).

1. Failure prediction for two heterogeneous small-sample disk models

ST12000NM0007 is used as the training task and ST14000NM0138 as the test task, so that the data in the training set and the test set come from different disk models; K = 15. The experimental results are shown in Table 2. They indicate that, for heterogeneous small-sample disk failure prediction across disk models, MLDF maintains good prediction performance and adapts quickly to a new disk model.

Table 2

2. Failure prediction for multiple heterogeneous small-sample disk models

ST12000NM0007 and ST10000NM0086 are used as training tasks and ST14000NM0138 as the test task, so that the data in the training set and the test set come from different disk models; K = 15. The experimental results are shown in Table 3. They indicate that MLDF also achieves good prediction performance on a data set containing multiple disk models.

Table 3

In addition, although the steps of the method in the present disclosure are described in a specific order in the drawings, this does not require or imply that the steps must be performed in that order, or that all of the steps shown must be performed to achieve the desired result. Additionally or alternatively, some steps may be omitted, multiple steps may be combined into one step, and/or one step may be split into multiple steps.

From the description of the above embodiments, those skilled in the art will readily understand that the example embodiments described here can be implemented in software, or in software combined with the necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a mobile hard disk) or on a network, and which includes several instructions that cause a computing device (such as a personal computer, a server, a mobile terminal, or a network device) to execute the method according to the embodiments of the present disclosure.

In an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.

Those skilled in the art will appreciate that various aspects of the present invention may be implemented as a system, a method, or a program product. Accordingly, various aspects of the present invention may take the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may be collectively referred to herein as a "circuit", "module", or "system".

The electronic device according to this embodiment of the present invention is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.

The electronic device takes the form of a general-purpose computing device. Its components may include, but are not limited to: the at least one processor mentioned above, the at least one memory mentioned above, and a bus connecting the different system components (including the memory and the processor).

The memory stores program code that can be executed by the processor, so that the processor performs the steps according to the various exemplary embodiments of the present invention described in the "Exemplary Method" section of this specification.

The memory may include readable media in the form of volatile memory, such as random access memory (RAM) and/or cache memory, and may further include read-only memory (ROM).

The memory may also include a program/utility having a set of (at least one) program modules. Such program modules include, but are not limited to: an operating system, one or more application programs, other program modules, and program data. Each of these examples, or some combination of them, may include an implementation of a network environment.

The bus may represent one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures.

The electronic device may also communicate with one or more external devices (e.g., keyboards, pointing devices, Bluetooth devices), with one or more devices that enable a user to interact with the electronic device, and/or with any device (e.g., a router or modem) that enables the electronic device to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface. The electronic device may also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter. The network adapter communicates with the other modules of the electronic device through the bus. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.

From the description of the above embodiments, those skilled in the art will readily understand that the example embodiments described here can be implemented in software, or in software combined with the necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a mobile hard disk) or on a network, and which includes several instructions that cause a computing device (such as a personal computer, a server, a terminal device, or a network device) to execute the method according to the embodiments of the present disclosure.

In an exemplary embodiment of the present disclosure, a computer-readable storage medium is also provided, on which a program product capable of implementing the method described in this specification is stored. In some possible embodiments, various aspects of the present invention may also be implemented in the form of a program product that includes program code; when the program product runs on a terminal device, the program code causes the terminal device to perform the steps according to the various exemplary embodiments of the present invention described in the "Exemplary Method" section of this specification.

The program product may use any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of readable storage media (a non-exhaustive list) include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.

A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries readable program code. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A readable signal medium may also be any readable medium other than a readable storage medium that can send, propagate, or transmit a program for use by, or in conjunction with, an instruction execution system, apparatus, or device.

The program code contained on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wired, optical cable, RF, or any suitable combination of the above.

Program code for performing the operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user computing device, partly on the user device, as a stand-alone software package, partly on the user computing device and partly on a remote computing device, or entirely on a remote computing device or server. Where a remote computing device is involved, it may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computing device (for example, through the Internet using an Internet service provider).

In addition, the above drawings are only schematic illustrations of the processing included in the method according to the exemplary embodiments of the present invention and are not intended to be limiting. It is easy to understand that the processing shown in the drawings does not indicate or limit the chronological order of these processes. It is also easy to understand that these processes may be performed synchronously or asynchronously, for example in multiple modules.

It should be noted that, although several modules or units of the device for performing actions are mentioned in the detailed description above, this division is not mandatory. In fact, according to the embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided among multiple modules or units.

The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited to them. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be determined by the protection scope of the claims.

Claims

1. A disk failure prediction method based on meta-learning, characterized in that the method comprises the following steps:

Obtaining a feature matrix of a target disk, wherein the feature matrix includes time series features and SMART attribute features of the disk;

Inputting the feature matrix into a target neural network model to generate a prediction result corresponding to the target disk; the target neural network model includes an LSTM model obtained after training with a model-agnostic meta-learning (MAML) algorithm;

The meta-task in meta-learning training is constructed as follows:

According to the SMART attribute values in the disk log, generate disk sample sets corresponding to different disk models; where D_m = {(x^m_1, y^m_1), (x^m_2, y^m_2), (x^m_3, y^m_3), …, (x^m_n, y^m_n)}; D_m is the disk sample set corresponding to the m-th disk model; x^m_n is the feature matrix corresponding to the n-th disk sample in D_m; y^m_n is the label corresponding to x^m_n in D_m; n is the total number of disk samples in D_m;

Obtain the support set and query set corresponding to the training tasks in multiple meta-tasks from D_m;

Obtain the support set and query set corresponding to the test task in multiple meta-tasks from D_s; D_s is the disk sample set corresponding to the s-th disk model, and the disk models of D_m and D_s are different;

Use the support set and query set corresponding to the training task and test task in each meta-task to perform inner loop training on the LSTM model;

The support set and query set in the ith meta-task Task _i are obtained as follows:

Randomly select a faulty disk sample from the corresponding disk sample set, and calculate its KLD value with other faulty disk samples in the disk sample set;

Sort by KLD values from small to large to form a KLD sequence; there are 2K-1 KLD values in the KLD sequence; K is a positive integer;

The faulty disk samples corresponding to the first K-1 KLD values in the KLD sequence, together with the randomly selected faulty disk sample, are used as the faulty sample set in the support set;

K healthy disk samples are randomly selected from the corresponding disk sample set as the healthy sample set in the support set.

2. The method according to claim 1, characterized in that after sorting the KLD values from small to large to form a KLD sequence, the method further comprises:

The faulty disk samples corresponding to the KLD values after the (K-1)th value in the KLD sequence are taken as the faulty sample set in the query set.

3. The method according to claim 1, characterized in that the KLD of any two disk samples is obtained by a relative entropy algorithm.

4. The method according to claim 1, characterized in that the target neural network model further includes a dropout layer and a fully connected layer with a Sigmoid activation function, and the LSTM model includes four stacked LSTM layers; the LSTM model is connected to the dropout layer, and the dropout layer is connected to the fully connected layer with the Sigmoid activation function.

5. The method according to claim 1, characterized in that generating disk sample sets corresponding to different disk models according to the SMART attribute values in the disk logs comprises:

If the disk is healthy and has complete log records for a whole year, the log records are added to the healthy data pool corresponding to the disk model;

Obtain, evenly across the months, the monthly log records of a portion of the disks from the healthy data pool;

According to the monthly log records, the target SMART attribute values of each disk for 14 consecutive days are obtained to generate an N×14 feature matrix corresponding to each disk; where N is the number of target SMART attribute values collected for each disk each day.

6. The method according to claim 5, characterized in that generating disk sample sets corresponding to different disk models according to the SMART attribute values in the disk logs comprises:

If the disk is healthy and has only partial log records for a whole year, the target SMART attribute values corresponding to the first 14 days in the log records are obtained.

7. The method according to claim 5, characterized in that generating disk sample sets corresponding to different disk models according to the SMART attribute values in the disk logs comprises:

If the disk is a faulty disk, the target SMART attribute values corresponding to the time period from T-13 to T are obtained; T is the date on which the disk fails.

8. The method according to claim 7, characterized in that the labels of the samples corresponding to the time period from T-13 to T-7 are set as healthy labels, and the labels of the samples corresponding to the time period from T-6 to T are set as fault labels.

9. A non-transitory computer-readable storage medium storing a computer program, wherein when the computer program is executed by a processor, the meta-learning-based disk failure prediction method according to any one of claims 1 to 8 is implemented.

10. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the computer program, a meta-learning-based disk failure prediction method as described in any one of claims 1 to 8 is implemented.