WO2024124497A1

WO2024124497A1 - Machine-learning-based method for recognizing state of nanopore sequencing signal, and training method and apparatus for machine learning model

Info

Publication number: WO2024124497A1
Application number: PCT/CN2022/139405
Authority: WO
Inventors: 蔡志强; 颜旭; 曾涛; 黎宇翔; 董宇亮; 曹杰; 吴蔚; 郭斐; 郑荣荣; 季州翔; 章文蔚; 徐讯
Original assignee: BGI Shenzhen Co Ltd
Current assignee: BGI Shenzhen Co Ltd
Priority date: 2022-12-15
Filing date: 2022-12-15
Publication date: 2024-06-20
Anticipated expiration: 2025-06-15
Also published as: CN120188158A

Abstract

A machine-learning-based method for recognizing the state of a nanopore sequencing signal. The method comprises: acquiring nanopore sequencing signal data X₁, X₂, …, X_N, and performing feature extraction on the nanopore sequencing signal data to generate a feature value vector; and by using the generated feature value vector and a trained machine learning model, recognizing the state of the nanopore sequencing signal data. The feature extraction comprises: dividing the nanopore sequencing signal data into G groups according to value magnitude, wherein G is a natural number greater than 1 and less than N, and a feature value comprises at least one of a quantity proportion, a standard deviation and a variation coefficient; calculating the quantity proportion of each of the G groups, so as to generate G quantity proportion feature values; calculating the standard deviation of data in each of the G groups, so as to generate G standard deviation feature values; calculating the variation coefficient of the data in each of the G groups, so as to generate G variation coefficient feature values; and generating a feature value vector on the basis of the plurality of generated feature values.

Description

Method for identifying nanopore sequencing signal state based on machine learning, method and device for training machine learning model

Technical Field

The present invention relates to the field of bioinformatics, and more specifically to a machine-learning-based method for identifying the state of a nanopore sequencing signal, a method for training a machine learning model, an apparatus, an electronic device and a storage medium.

Background Art

A nanopore sequencing signal is the change in through-pore current recorded as a DNA molecule passes through a nanopore. The change arises mainly because different bases produce different currents in the nanopore. An important step in analyzing nanopore sequencing signals is state detection, whose core task is to identify which state the signal is in, so as to provide a basis for the subsequent analysis strategy.

A nanopore sequencing signal is a one-dimensional time-series signal, and there are two existing approaches to detecting the state of such a signal. One is the traditional time-series analysis method, which converts the time-series data from the time domain to the frequency domain and, using manually extracted sequence-specific features and similar means, performs similarity calculation and threshold filtering on the frequency-domain signal to identify its state. The other is the end-to-end deep learning method, which requires no manual feature extraction: a deep learning model is trained directly on the raw time-series data and is then used to identify the signal state.

However, the traditional time-series analysis method can in some cases identify only target signals with obvious, easily recognized features, and in other cases has poor robustness. Hand-crafted features must be designed separately for each type of target signal, which is inefficient. The end-to-end deep learning method, on the other hand, is robust but computationally expensive: it runs efficiently only on a GPU, is very slow on a CPU, and takes a long time to train. There is therefore a need for a method, an apparatus and a storage medium that can identify nanopore sequencing signals with lower complexity while improving robustness.

Summary of the Invention

To solve the technical problems in the prior art, a first aspect of the present disclosure provides a machine-learning-based method for identifying the state of a nanopore sequencing signal, comprising the following steps:

acquiring nanopore sequencing signal data X₁, X₂, …, X_N, and performing feature extraction on the nanopore sequencing signal data to generate a feature value vector, wherein N is a natural number greater than 1; and

identifying the state of the acquired nanopore sequencing signal data by using the generated feature value vector and a trained machine learning model,

wherein the feature extraction comprises the following steps:

dividing the nanopore sequencing signal data into G groups according to value magnitude, wherein G is a natural number greater than 1 and less than N, and wherein a feature value comprises at least one of a quantity proportion, a standard deviation and a coefficient of variation;

calculating the quantity proportion of each of the G groups to generate G quantity proportion feature values V₁₁, V₁₂, …, V_1G, wherein the quantity proportion of a group is the ratio of the number of data points in the group to the total number of data points N;

calculating the standard deviation of the data in each of the G groups to generate G standard deviation feature values V₂₁, V₂₂, …, V_2G;

calculating the coefficient of variation of the data in each of the G groups to generate G coefficient-of-variation feature values V₃₁, V₃₂, …, V_3G, wherein the coefficient of variation of a group is the ratio of the standard deviation of the data in the group to the mean of the data in the group; and

generating the feature value vector based on the generated plurality of feature values.

According to an embodiment of the present invention, before the nanopore sequencing signal data is divided into G groups according to value magnitude, the feature extraction further comprises a standardization step, and the division is performed on the nanopore sequencing signal data standardized by the standardization step.

The standardization step comprises:

for each data point in the nanopore sequencing signal data, dividing the difference between the value of the data point and the signal data minimum by the difference between the signal data maximum and the signal data minimum, and taking the result as the standardized value of the data point, wherein the signal data minimum is the minimum value in the acquired nanopore sequencing signal data and the signal data maximum is the maximum value in the acquired nanopore sequencing signal data.
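The standardization step described above amounts to min-max scaling of the acquired segment. The following Python sketch illustrates it under stated assumptions: the function name `min_max_normalize` is hypothetical, and the handling of a constant segment (maximum equal to minimum) is an assumption, since the text does not address that case.

```python
def min_max_normalize(signal):
    """Scale every sample to [0, 1] using the min and max of the whole segment."""
    lo, hi = min(signal), max(signal)
    span = hi - lo
    if span == 0:
        # Assumption: a constant segment maps to all zeros; the source text
        # does not say how this degenerate case should be handled.
        return [0.0 for _ in signal]
    # (value - minimum) / (maximum - minimum), as in the standardization step.
    return [(x - lo) / span for x in signal]
```

For instance, the segment `[10, 20, 30]` is mapped to `[0.0, 0.5, 1.0]`.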

According to an embodiment of the present invention, the division comprises:

dividing the standardized nanopore sequencing signal data into 10 groups from 0 to 1 at intervals of 0.1, so that the signal data fall into 10 groups corresponding to the intervals [0, 0.1), [0.1, 0.2), …, [0.9, 1].
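The equal-width division can be sketched in Python as follows. The function names `bin_index` and `split_into_groups` are hypothetical, and the input is assumed to be already standardized to [0, 1]. Note that the last interval is closed on the right, so the value 1 belongs to the tenth group.

```python
def bin_index(value, n_bins=10):
    """Return the index of the equal-width bin on [0, 1] that holds value."""
    # int() truncates, so 0.25 lands in bin 2; min() folds the value 1.0
    # into the last bin, which is closed on the right ([0.9, 1]).
    return min(int(value * n_bins), n_bins - 1)


def split_into_groups(normalized, n_bins=10):
    """Distribute standardized samples over n_bins groups by value."""
    groups = [[] for _ in range(n_bins)]
    for x in normalized:
        groups[bin_index(x, n_bins)].append(x)
    return groups
```

Values that lie very close to an interval boundary are subject to ordinary floating-point rounding, so a production implementation may want an explicit boundary policy.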

According to an embodiment of the present invention, the feature extraction further comprises the following steps:

calculating the maximum, the minimum, the mean, the quotient of the maximum and the minimum, and the quotient of the maximum and the mean of the G standard deviation feature values V₂₁, V₂₂, …, V_2G to generate five feature values V₄₁, V₄₂, …, V₄₅.

calculating the maximum, the minimum, the mean, the quotient of the maximum and the minimum, and the quotient of the maximum and the mean of the G coefficient-of-variation feature values V₃₁, V₃₂, …, V_3G to generate five feature values V₅₁, V₅₂, …, V₅₅.

According to an embodiment of the present invention, the machine learning model comprises a decision tree model, a Bayesian model or a neural network model.

In a second aspect of the present disclosure, a method for training a machine learning model for identifying the state of a nanopore sequencing signal is provided, comprising the following steps:

acquiring training data for the machine learning model, the acquiring comprising: collecting nanopore sequencing signal data, performing state identification on the collected nanopore sequencing signal data, extracting the nanopore sequencing signal data X₁, X₂, …, X_N corresponding to the identified state, and performing feature extraction on the extracted nanopore sequencing signal data X₁, X₂, …, X_N to generate a feature value vector, wherein N is a natural number greater than 1, and the training data comprises the result of the state identification and the feature value vector; and

training the machine learning model with the acquired training data to obtain a trained machine learning model,

dividing the nanopore sequencing signal data into G groups according to value magnitude, wherein G is a natural number greater than 1 and less than N, and wherein a feature value comprises at least one of a quantity proportion, a standard deviation and a coefficient of variation;

generating the feature value vector based on the generated plurality of feature values.

In a third aspect of the present disclosure, a machine-learning-based apparatus for identifying the state of a nanopore sequencing signal is provided, comprising the following modules:

an acquisition module, configured to acquire nanopore sequencing signal data X₁, X₂, …, X_N and to perform feature extraction on the nanopore sequencing signal data to generate a feature value vector, wherein N is a natural number greater than 1; and

an identification module, configured to identify the state of the acquired nanopore sequencing signal data by using the generated feature value vector and a trained machine learning model,

wherein the acquisition module further comprises the following units:

a grouping unit, configured to divide the nanopore sequencing signal data into G groups according to value magnitude, wherein G is a natural number greater than 1 and less than N, and wherein a feature value comprises at least one of a quantity proportion, a standard deviation and a coefficient of variation;

a first feature value generating unit, configured to calculate the quantity proportion of each of the G groups to generate G quantity proportion feature values V₁₁, V₁₂, …, V_1G, wherein the quantity proportion of a group is the ratio of the number of data points in the group to the total number of data points N;

a second feature value generating unit, configured to calculate the standard deviation of the data in each of the G groups to generate G standard deviation feature values V₂₁, V₂₂, …, V_2G;

a third feature value generating unit, configured to calculate the coefficient of variation of the data in each of the G groups to generate G coefficient-of-variation feature values V₃₁, V₃₂, …, V_3G, wherein the coefficient of variation of a group is the ratio of the standard deviation of the data in the group to the mean of the data in the group; and

a feature value vector generating unit, configured to generate the feature value vector based on the generated plurality of feature values.

In a fourth aspect of the present disclosure, an apparatus for training a machine learning model for identifying the state of a nanopore sequencing signal is provided, comprising the following modules:

an acquisition module, configured to acquire training data for the machine learning model, wherein the acquisition module collects nanopore sequencing signal data, performs state identification on the collected nanopore sequencing signal data, extracts the nanopore sequencing signal data X₁, X₂, …, X_N corresponding to the identified state, and performs feature extraction on the extracted nanopore sequencing signal data X₁, X₂, …, X_N to generate a feature value vector, wherein N is a natural number greater than 1, and the training data comprises the result of the state identification and the feature value vector; and

a training module, configured to train the machine learning model with the acquired training data to obtain a trained machine learning model,

a third feature value generating unit, configured to calculate the coefficient of variation of the data in each of the G groups to generate G coefficient-of-variation feature values V₃₁, V₃₂, …, V_3G, wherein the coefficient of variation of a group is the ratio of the standard deviation of the data in the group to the mean of the data in the group; and

In a fourth aspect of the present disclosure, an electronic device is provided, comprising:

at least one processor; and

a memory communicatively connected to the at least one processor, wherein

the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform any of the methods described above.

In a fifth aspect of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions is provided, wherein the computer instructions are used to cause a computer to perform any of the methods described above.

Based on the aspects provided by the present disclosure, there are provided a machine-learning-based method for identifying the state of a nanopore sequencing signal, a method for training a machine learning model for identifying the state of a nanopore sequencing signal, apparatuses, an electronic device and a non-transitory computer-readable storage medium. These aspects make it possible to identify nanopore sequencing signals with lower complexity while improving robustness.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flowchart of a method for training a machine learning model for identifying the state of a nanopore sequencing signal according to an embodiment of the present disclosure.

FIG. 2 is a schematic diagram of an exemplary image generated from collected nanopore sequencing signal data.

FIG. 3 is a schematic diagram of three images obtained by classifying the exemplary image of FIG. 2 through state identification.

FIG. 4 is a flowchart of a machine-learning-based method for identifying the state of a nanopore sequencing signal according to an embodiment of the present disclosure.

FIG. 5 shows an example of different states of a nanopore sequencing signal.

FIG. 6 shows another example of different states of a nanopore sequencing signal.

FIG. 7 shows yet another example of different states of a nanopore sequencing signal.

FIG. 8 is a schematic diagram of a machine-learning-based apparatus for identifying the state of a nanopore sequencing signal according to an embodiment of the present disclosure.

FIG. 9 is a schematic diagram of an apparatus for training a machine learning model for identifying the state of a nanopore sequencing signal according to an embodiment of the present disclosure.

FIG. 10 shows an example of a decision tree model.

Detailed Description

Specific embodiments of the present invention are described in detail below. It should be noted that the embodiments described here are for illustration only and are not intended to limit the present invention. In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention. However, it will be apparent to those of ordinary skill in the art that the present invention can be practiced without these specific details. In other instances, well-known circuits, materials or methods are not described in detail to avoid obscuring the present invention.

Throughout the specification, references to "one embodiment," "an embodiment," "one example," or "an example" mean that a particular feature, structure, or characteristic described in connection with the embodiment or example is included in at least one embodiment of the present invention. Thus, the phrases "in one embodiment," "in an embodiment," "one example," or "an example" appearing in various places throughout the specification do not necessarily all refer to the same embodiment or example. Furthermore, particular features, structures, or characteristics may be combined in one or more embodiments or examples in any suitable combination and/or subcombination.

It should be understood that when an element is referred to as being "coupled to" or "connected to" another element, it may be directly coupled or connected to the other element, or intervening elements may be present. In contrast, when an element is referred to as being "directly coupled to" or "directly connected to" another element, no intervening elements are present.

Furthermore, the term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.

It will be understood that a noun in the singular form may include one or more of the things it denotes, unless the context clearly indicates otherwise. As used herein, each of the phrases "A or B," "at least one of A and B," "at least one of A or B," "A, B or C," "at least one of A, B and C," and "at least one of A, B or C" may include all possible combinations of the items listed together in the corresponding phrase. As used herein, terms such as "1st" and "2nd" or "first" and "second" may be used simply to distinguish one component from another and do not limit the components in other respects (e.g., importance or order).

As used herein, the term "module" may include a unit implemented in hardware, software, or firmware, and may be used interchangeably with other terms (e.g., "logic," "logic block," "part," or "circuit"). A module may be a single integrated component adapted to perform one or more functions, or a minimum unit or part of such a component. For example, according to an embodiment, a module may be implemented in the form of an application-specific integrated circuit (ASIC).

It should be understood that the various embodiments of the present disclosure and the terms used therein are not intended to limit the technical features set forth herein to specific embodiments, but include various changes, equivalents or alternatives to the corresponding embodiments. Unless expressly defined otherwise herein, all terms are to be given their broadest possible interpretation, including meanings implied by the specification as well as meanings understood by those skilled in the art and/or defined in dictionaries, papers and the like.

In addition, those of ordinary skill in the art should understand that the drawings provided herein are for illustrative purposes and are not necessarily drawn to scale. In the description of the drawings, like reference numerals may be used to refer to like or related elements. The present disclosure is described below by way of example with reference to the drawings.

To address the poor robustness of traditional time-series analysis methods and the high computational complexity and cost of deep learning methods, the present disclosure proposes a new feature extraction method, combines it with machine learning to train a machine learning model, and uses the resulting model to identify the state of a nanopore sequencing signal. The machine-learning-based method for identifying the state of a nanopore sequencing signal, the method for training a machine learning model for identifying the state of a nanopore sequencing signal, the apparatuses, the electronic device and the non-transitory computer-readable storage medium according to embodiments of the present invention are described in detail below with reference to the drawings.

FIG. 1 is a flowchart of a method for training a machine learning model for identifying the state of a nanopore sequencing signal according to an embodiment of the present disclosure. The training method 100 according to an embodiment of the present disclosure is described below with reference to FIG. 1.

As shown in FIG. 1, the training method 100 for a machine learning model for identifying the state of a nanopore sequencing signal according to an embodiment of the present disclosure includes step S110 and step S120. In step S110, training data for the machine learning model is acquired. In step S120, the machine learning model is trained with the acquired training data to obtain a trained machine learning model for identifying the state of a nanopore sequencing signal.

Specifically, step S110 further includes step S1101 and step S1102.

In step S1101, nanopore sequencing signal data is collected.

For example, nanopore sequencing signal data can be collected with a nanopore sequencing device. A single-channel nanopore sequencing device and a computer can be used for this purpose. The nanopore sequencing device collects the target sequencing signal data, and the collected sequencing signal data and other information are saved in an h5 file structure and stored in real time in a storage device of the computer.

A nanopore sequencing signal contains state signals. A state signal is a signal segment that runs from the point where the current drops from the open-pore current value until it rises back to the open-pore current value. The open-pore current value is known before the test data is collected.
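The definition of a state signal can be illustrated with a short Python sketch that scans a current trace for segments lying below the open-pore level. This is only an illustration and not the procedure used in the disclosure, which relies on manual classification of plotted images. The function name `find_state_segments` and the tolerance parameter `tol` are assumptions: the source text does not specify how far the current must fall before it counts as a drop.

```python
def find_state_segments(signal, open_pore, tol):
    """Return (start, end) index pairs of candidate state signals.

    A segment starts at the first sample below open_pore - tol and ends at
    the first later sample that is back at or above that level (end is
    exclusive). tol is a hypothetical noise tolerance, not from the source.
    """
    threshold = open_pore - tol
    segments = []
    start = None
    for i, x in enumerate(signal):
        if start is None and x < threshold:
            start = i                      # current dropped below open-pore level
        elif start is not None and x >= threshold:
            segments.append((start, i))    # current rose back to open-pore level
            start = None
    # Assumption: a drop that never recovers before the trace ends is ignored,
    # because the definition requires a return to the open-pore current.
    return segments
```

With `open_pore=100` and `tol=10`, the trace `[100, 100, 40, 45, 100, 100, 30, 100]` yields the segments `(2, 4)` and `(6, 7)`.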

The nanopore sequencing signal data may also be existing signal data that was collected in advance and stored in a computer system, rather than signal data collected on the spot. The present disclosure places no particular limitation on this.

In step S1102, training data is prepared from the collected nanopore sequencing signal data.

When preparing the training data, an image is first generated from the collected nanopore sequencing signal data. For example, with the current value on the vertical axis and the data acquisition index (the position in the data sequence) on the horizontal axis, the exemplary image shown in FIG. 2 can be generated. These images are generated so that the state signals can be classified manually; the images themselves are not used as model input. In the exemplary image of FIG. 2, the current drops from the initial open-pore current value at position 100 in the data sequence and recovers at position 500 (reference numeral "1" in FIG. 2), drops from the open-pore current value again at position 600 and recovers at position 2000 (reference numeral "2" in FIG. 2), and drops a third time at position 3000 and recovers at position 3500 (reference numeral "3" in FIG. 2). The exemplary image of FIG. 2 thus contains three state signals.

Next, the generated image is classified manually. For example, manually classifying the image shown in FIG. 2 yields three classes, named "0", "1" and "2". A separate image is generated for each class, as shown in FIG. 3. The start and end positions in the data sequence of the state signal of each class are recorded, and the image of each state signal is named using its manual classification name, its start position in the data sequence and its end position in the data sequence, followed by ".png". From the name of an image, one can therefore tell the manual classification name of the state signal and where the state signal lies in the data sequence, so that the corresponding data can be extracted from the data sequence.

Next, according to the start and end positions in the data sequence given in the image name, the corresponding target data is extracted from the data sequence, and feature extraction is performed on the target data.
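The lookup from image name to target data can be sketched as follows. Several details here are assumptions rather than statements of the source: the three name fields are taken to be separated by underscores (for example a hypothetical `1_600_2000.png`), the end position is treated as inclusive, and the function names `parse_label_name` and `cut_segment` are illustrative.

```python
from pathlib import Path


def parse_label_name(filename):
    """Split '<class>_<start>_<end>.png' into (class, start, end).

    The underscore separator is an assumption; the source only states which
    three pieces of information the file name carries.
    """
    stem = Path(filename).stem               # drop the '.png' extension
    label, start, end = stem.rsplit("_", 2)  # class name may itself contain '_'
    return label, int(start), int(end)


def cut_segment(sequence, filename):
    """Return the class label and the slice of the sequence named by the file."""
    label, start, end = parse_label_name(filename)
    # Assumption: the recorded end position is inclusive.
    return label, sequence[start:end + 1]
```

Each labelled segment obtained this way would then go through feature extraction, with the class label serving as the training target.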

The feature extraction performed on the target data may include the following steps.

First, the target data is grouped. Assuming that the target data is nanopore sequencing signal data X₁, X₂, …, X_N, where N is a natural number greater than 1, the data is divided into G groups according to value magnitude, where G is a natural number greater than 1 and less than N. A feature value includes at least one of a quantity proportion, a standard deviation and a coefficient of variation.

Next, the quantity proportion of each of the G groups is calculated to generate G quantity proportion feature values V₁₁, V₁₂, …, V_1G, where the quantity proportion of a group is the ratio of the number of data points in the group to the total number of data points N.

Next, the standard deviation of the data in each of the G groups is calculated to generate G standard deviation feature values V₂₁, V₂₂, …, V_2G.

Next, the coefficient of variation of the data in each of the G groups is calculated to generate G coefficient-of-variation feature values V₃₁, V₃₂, …, V_3G, where the coefficient of variation of a group is the ratio of the standard deviation of the data in the group to the mean of the data in the group.

Next, the feature value vector is generated based on the generated plurality of feature values. For example, the feature value vector V(V₁₁, V₁₂, …, V_1G, V₂₁, V₂₂, …, V_2G, V₃₁, V₃₂, …, V_3G) may be formed from the generated feature values.

In addition, before the grouping, the feature extraction method may further include a normalization step that normalizes the acquired nanopore sequencing signal data to values in the interval [0, 1].

After the coefficient-of-variation feature values are calculated, the feature extraction method may further include: calculating the maximum, the minimum, the mean, the quotient of the maximum and the minimum, and the quotient of the maximum and the mean of the G standard-deviation feature values V₂₁, V₂₂, ..., V_2G, generating five feature values V₄₁, V₄₂, ..., V₄₅; and calculating the maximum, the minimum, the mean, the quotient of the maximum and the minimum, and the quotient of the maximum and the mean of the G coefficient-of-variation feature values V₃₁, V₃₂, ..., V_3G, generating five feature values V₅₁, V₅₂, ..., V₅₅.

Accordingly, when the feature value vector is generated, the vector V may include these additional feature values, for example V(V₁₁, V₁₂, ..., V_1G, V₂₁, V₂₂, ..., V_2G, V₃₁, V₃₂, ..., V_3G, V₄₁, V₄₂, ..., V₄₅, V₅₁, V₅₂, ..., V₅₅).
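The feature definitions above can be summarized compactly. The following is a sketch only: the symbols n_g, \bar{x}_g and S_g (the number of data, the mean and the standard deviation of group g) and M_j, m_j, A_j are notation introduced here for illustration, and assigning V₄₁ to V₄₅ and V₅₁ to V₅₅ in the order in which the five statistics are listed is an assumption.

```latex
\begin{aligned}
V_{1g} &= \frac{n_g}{N}, \qquad V_{2g} = S_g, \qquad V_{3g} = \frac{S_g}{\bar{x}_g}, \qquad g = 1, \dots, G, \\
M_j &= \max_{g} V_{jg}, \qquad m_j = \min_{g} V_{jg}, \qquad A_j = \frac{1}{G}\sum_{g=1}^{G} V_{jg}, \qquad j \in \{2, 3\}, \\
\left(V_{j+2,1}, \dots, V_{j+2,5}\right) &= \left(M_j,\; m_j,\; A_j,\; \frac{M_j}{m_j},\; \frac{M_j}{A_j}\right), \\
\dim V &= 3G + 10 .
\end{aligned}
```

With G = 10 this gives 3 × 10 + 10 = 40 components. Because every datum falls into exactly one group, the G quantity proportions sum to 1.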

A specific example of the above feature extraction method is described in detail below.

(a) Normalize the target data (i.e., the state signal data). For each datum in the state signal data, the difference between its value and the minimum of the state signal data is divided by the difference between the maximum and the minimum of the state signal data, and the result is used as the normalized value of that datum.

(b) Group the normalized state signal data. The state signal data obtained in step (a) are divided into 10 groups from 0 to 1 at intervals of 0.1, so that the data fall into the 10 intervals [0, 0.1), [0.1, 0.2), ..., [0.9, 1].

(c) Calculate the quantity proportion of each of the 10 groups obtained in step (b), generating 10 quantity-proportion feature values. The quantity proportion of a group is the ratio of the number of data in that group to the total number of data.

(d) Calculate the standard deviation of each of the 10 groups obtained in step (b), generating 10 standard-deviation feature values. The standard deviation is calculated as follows. Suppose there are n data x₁, x₂, ..., x_n, whose mean is

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Then the standard deviation S of these n data is:

$$S = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$

(e) Calculate the coefficient of variation of each of the 10 groups obtained in step (b), generating 10 coefficient-of-variation feature values. The coefficient of variation is the ratio of the standard deviation to the mean, that is, coefficient of variation = standard deviation / mean.

(f) Calculate the maximum, the minimum, the mean, the quotient of the maximum and the minimum, and the quotient of the maximum and the mean of the 10 feature values generated in step (d) (i.e., the standard-deviation feature values), generating 5 feature values.

(g) Calculate the maximum, the minimum, the mean, the quotient of the maximum and the minimum, and the quotient of the maximum and the mean of the 10 feature values generated in step (e) (i.e., the coefficient-of-variation feature values), generating 5 feature values.

(h) Steps (c), (d), (e), (f) and (g) yield 40 feature values in total, which form the final feature value vector used as training data. For example, the 40 feature values can be taken in order as the components of a one-dimensional feature value vector with 40 components. The way the feature value vector is formed is not limited to this; other methods commonly used by those skilled in the art may also be used.
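Steps (a) to (h) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used by the applicant: the function names are hypothetical, the population standard deviation (division by n) is assumed, and the handling of cases the text leaves open (an empty group, a zero mean, a zero denominator in a quotient) is an assumption that simply reports 0.0.

```python
import math
from typing import List, Sequence


def safe_div(a: float, b: float) -> float:
    """Assumption: a quotient with a zero denominator is reported as 0.0."""
    return a / b if b != 0 else 0.0


def mean(xs: Sequence[float]) -> float:
    """Arithmetic mean; 0.0 for an empty group (assumption)."""
    return sum(xs) / len(xs) if xs else 0.0


def std(xs: Sequence[float]) -> float:
    """Population standard deviation (divide by n); 0.0 for an empty group (assumption)."""
    if not xs:
        return 0.0
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


def summary(vals: Sequence[float]) -> List[float]:
    """max, min, mean, max/min, max/mean, as in steps (f) and (g)."""
    hi, lo, avg = max(vals), min(vals), mean(vals)
    return [hi, lo, avg, safe_div(hi, lo), safe_div(hi, avg)]


def extract_features(data: Sequence[float], n_groups: int = 10) -> List[float]:
    lo, hi = min(data), max(data)
    # (a) min-max normalization into [0, 1]
    norm = [safe_div(x - lo, hi - lo) for x in data]
    # (b) equal-width groups; the value 1.0 is placed in the last group [0.9, 1]
    groups: List[List[float]] = [[] for _ in range(n_groups)]
    for x in norm:
        groups[min(int(x * n_groups), n_groups - 1)].append(x)
    # (c) quantity proportion, (d) standard deviation, (e) coefficient of variation
    proportions = [len(g) / len(data) for g in groups]
    stds = [std(g) for g in groups]
    cvs = [safe_div(s, mean(g)) for s, g in zip(stds, groups)]
    # (f), (g) five summary values each; (h) concatenate into one vector
    return proportions + stds + cvs + summary(stds) + summary(cvs)


features = extract_features([float(v) for v in range(101)])
print(len(features))   # prints: 40
```

The first 10 components are the quantity proportions and sum to 1; the remaining 30 follow the order of steps (d) to (g).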

Finally, each generated feature value vector is paired with the corresponding manual classification result to form one training sample. For example, a feature value vector V1 is generated for the image 0_100_500.png in Figure 3, and V1 together with the corresponding manual classification result (the category named "0") forms the first training sample. Similarly, a feature value vector V2 is generated for the image 1_600_2000.png in Figure 3, and V2 together with the category named "1" forms the second training sample. A feature value vector V3 is generated for the image 2_3000_3500.png in Figure 3, and V3 together with the category named "2" forms the third training sample.

In step S120, the xgboost machine learning model is trained using the training data obtained in step S110, and the model file is saved.

For training, the data and the corresponding manual classification results are divided into a training set and a test set. The data set used for machine learning training can be produced by random sampling with a user-defined split ratio. The model training parameters, such as the learning rate and the number of training iterations, are then set according to the application scenario, and the model is trained and tested. If the test results are unsatisfactory, the training can be adjusted, for example by enlarging the data set or modifying the model training parameters, and the process is repeated until the training results meet the required metrics.
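The random split into a training set and a test set can be sketched as follows. The function name, the default ratio of 0.2 and the fixed seed are illustrative assumptions; the text only says that the ratio is set by the user. Fitting the model itself (for example with the xgboost library) is not shown here.

```python
import random
from typing import List, Sequence, Tuple

Sample = Tuple[List[float], str]   # (feature value vector, manual label)


def split_train_test(samples: Sequence[Sample], test_ratio: float = 0.2,
                     seed: int = 0) -> Tuple[List[Sample], List[Sample]]:
    """Randomly split labelled samples into a training set and a test set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)      # seeded so the split is reproducible
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


# Toy stand-ins for (feature value vector, label) pairs such as (V1, "0").
samples = [([float(i)] * 40, str(i % 3)) for i in range(10)]
train_set, test_set = split_train_test(samples, test_ratio=0.2)
print(len(train_set), len(test_set))           # prints: 8 2
```

Keeping the test set separate from the data used for fitting is what makes the "train, test, adjust, repeat" loop described above meaningful.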

Figure 4 is a flowchart of a method for identifying a nanopore sequencing signal state based on machine learning according to an embodiment of the present disclosure. The method 400 is described below with reference to Figure 4.

In step S410, nanopore sequencing signal data are acquired, and feature extraction is performed on the acquired data to generate a feature value vector. The acquired nanopore sequencing signal data are the data whose state is to be identified.

To acquire the nanopore sequencing signal data, sequencing signal data are first collected using a single-channel nanopore sequencing device and a computer. A signal segment that runs from the point where the current drops from the open-pore current value until it recovers to the open-pore current value is then extracted from the sequencing signal data and used as the acquired nanopore sequencing signal data X₁, X₂, ..., X_N, where N is a natural number greater than 1. The start position and end position of the signal segment within the entire nanopore sequencing signal data are recorded. The sequencing signal data may contain one or more signal segments, and state identification is performed separately for each extracted segment.
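One simple way to locate such segments is to scan for runs of samples that lie below the open-pore level. The sketch below is an illustration only: the function name, the `tolerance` parameter and the index convention are assumptions, and the text does not say how the drop and the recovery are detected in practice.

```python
from typing import List, Optional, Sequence, Tuple


def find_segments(signal: Sequence[float], open_pore_current: float,
                  tolerance: float = 0.05) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of runs that lie below the open-pore level.

    A sample counts as 'dropped' when it is below open_pore_current * (1 - tolerance).
    start is inclusive and end is exclusive; both conventions are assumptions.
    """
    threshold = open_pore_current * (1.0 - tolerance)
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, value in enumerate(signal):
        if value < threshold and start is None:
            start = i                           # current has dropped
        elif value >= threshold and start is not None:
            segments.append((start, i))         # current has recovered
            start = None
    # A drop that has not recovered before the data ends is ignored here.
    return segments


signal = [100, 100, 60, 55, 58, 100, 100, 40, 42, 100]
print(find_segments(signal, open_pore_current=100))   # prints: [(2, 5), (7, 9)]
```

Each returned pair gives the recorded start and end positions of one signal segment, which is then classified on its own.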

The feature extraction method may include the following steps.

In the grouping step S4101, the acquired nanopore sequencing signal data X₁, X₂, ..., X_N are divided into G groups according to value magnitude, where G is a natural number greater than 1 and less than N. The feature values include at least one of a quantity proportion, a standard deviation and a coefficient of variation.

In the quantity-proportion feature value calculation step S4102, the quantity proportion of each of the G groups is calculated, generating G quantity-proportion feature values V₁₁, V₁₂, ..., V_1G. The quantity proportion of a group is the ratio of the number of data in that group to the total number of data N.

In the standard-deviation feature value calculation step S4103, the standard deviation of the data in each of the G groups is calculated, generating G standard-deviation feature values V₂₁, V₂₂, ..., V_2G.

In the coefficient-of-variation feature value calculation step S4104, the coefficient of variation of the data in each of the G groups is calculated, generating G coefficient-of-variation feature values V₃₁, V₃₂, ..., V_3G. The coefficient of variation of a group is the ratio of the standard deviation of the data in that group to the mean of the data in that group.

In the feature value vector generation step S4105, a feature value vector is generated from the generated feature values. For example, the feature value vector V(V₁₁, V₁₂, ..., V_1G, V₂₁, V₂₂, ..., V_2G, V₃₁, V₃₂, ..., V_3G) may be constructed from the generated feature values.

In addition, before the grouping step S4101, the feature extraction method may further include a normalization step that normalizes the acquired nanopore sequencing signal data to values in the interval [0, 1].

After the coefficient-of-variation feature value calculation step S4104, the feature extraction method may further include: calculating the maximum, the minimum, the mean, the quotient of the maximum and the minimum, and the quotient of the maximum and the mean of the G standard-deviation feature values V₂₁, V₂₂, ..., V_2G, generating five feature values V₄₁, V₄₂, ..., V₄₅; and calculating the maximum, the minimum, the mean, the quotient of the maximum and the minimum, and the quotient of the maximum and the mean of the G coefficient-of-variation feature values V₃₁, V₃₂, ..., V_3G, generating five feature values V₅₁, V₅₂, ..., V₅₅.

Accordingly, in the feature value vector generation step S4105, the vector V may include these additional feature values, for example V(V₁₁, V₁₂, ..., V_1G, V₂₁, V₂₂, ..., V_2G, V₃₁, V₃₂, ..., V_3G, V₄₁, V₄₂, ..., V₄₅, V₅₁, V₅₂, ..., V₅₅).

A specific example of the feature extraction method is the same as steps (a) to (h) described above and is not repeated here.

In step S420, the generated feature value vector and a trained machine learning model are used to identify the state of the acquired nanopore sequencing signal data. For example, the trained machine learning model may be a model obtained by the training method according to an embodiment of the present disclosure shown in Figure 1.

Specifically, the feature value vector extracted in step S410 is input into the trained machine learning model to obtain a prediction result. For example, the prediction result may be one of the three values 0, 1 and 2. In addition, an image of each identified state signal can be output according to the prediction result. For example, the image can be named "prediction-result-name_start-position-in-nanopore-sequencing-signal-data_end-position-in-nanopore-sequencing-signal-data.png". The image names and formats described here are only examples, and the present disclosure is not limited thereto.
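The identification flow of steps S410 and S420 can be sketched as follows. All names are hypothetical, and the feature extractor and the predictor are passed in as plain callables so that the sketch does not depend on any particular library; the toy stand-ins below are not a trained model.

```python
from typing import Callable, List, Sequence, Tuple


def name_state_image(label: int, start: int, end: int) -> str:
    """Build 'prediction_start_end.png' from a predicted state and its position."""
    return f"{label}_{start}_{end}.png"


def label_segments(signal: Sequence[float],
                   segments: Sequence[Tuple[int, int]],
                   extract: Callable[[Sequence[float]], List[float]],
                   predict: Callable[[List[float]], int]) -> List[str]:
    """Run feature extraction and prediction on each segment and name its image."""
    names = []
    for start, end in segments:
        vector = extract(signal[start:end])                            # step S410
        names.append(name_state_image(predict(vector), start, end))   # step S420
    return names


def toy_extract(segment: Sequence[float]) -> List[float]:
    """Stand-in feature extractor: a one-component vector holding the mean."""
    return [sum(segment) / len(segment)]


def toy_predict(vector: List[float]) -> int:
    """Stand-in for a trained model: a fixed threshold rule."""
    return 0 if vector[0] > 50 else 1


signal = [100, 100, 60, 55, 58, 100, 100, 40, 42, 100]
print(label_segments(signal, [(2, 5), (7, 9)], toy_extract, toy_predict))
# prints: ['0_2_5.png', '1_7_9.png']
```

In practice `extract` would be the 40-component feature extraction and `predict` would wrap the trained model.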

Figures 5 to 7 show examples of different states of nanopore sequencing signals. Some states contain a considerable amount of data, so judging the states of all signals manually takes a great deal of time and effort, and the accuracy of the results is not high. A machine learning model trained with the training method of the present disclosure can identify the states of a large number of signals with high accuracy, improving efficiency and saving time.

Figure 8 is a schematic diagram of a device for identifying a nanopore sequencing signal state based on machine learning according to an embodiment of the present disclosure. The identification device 800 is described below with reference to Figure 8. The identification device 800 includes an acquisition module 810 and an identification module 820.

The acquisition module 810 is configured to acquire nanopore sequencing signal data X₁, X₂, ..., X_N and to perform feature extraction on the nanopore sequencing signal data to generate a feature value vector, where N is a natural number greater than 1.

The identification module 820 is configured to identify the state of the acquired nanopore sequencing signal data using the generated feature value vector and a trained machine learning model.

The acquisition module 810 further includes the following units: a grouping unit 8101, configured to divide the nanopore sequencing signal data into G groups according to value magnitude, where G is a natural number greater than 1 and less than N, and where the feature values include at least one of a quantity proportion, a standard deviation and a coefficient of variation; a first feature value generating unit 8102, configured to calculate the quantity proportion of each of the G groups and generate G quantity-proportion feature values V₁₁, V₁₂, ..., V_1G, where the quantity proportion of each group is the ratio of the number of data in the group to the total number of data N; a second feature value generating unit 8103, configured to calculate the standard deviation of the data in each of the G groups and generate G standard-deviation feature values V₂₁, V₂₂, ..., V_2G; a third feature value generating unit 8104, configured to calculate the coefficient of variation of the data in each of the G groups and generate G coefficient-of-variation feature values V₃₁, V₃₂, ..., V_3G, where the coefficient of variation of each group is the ratio of the standard deviation of the data in the group to the mean of the data in the group; and a feature value vector generating unit 8105, configured to generate the feature value vector based on the generated feature values.

Figure 9 is a schematic diagram of a training device for a machine learning model for identifying a nanopore sequencing signal state according to an embodiment of the present disclosure. The training device 900 is described below with reference to Figure 9. The training device 900 includes an acquisition module 910 and a training module 920.

The acquisition module 910 is configured to acquire training data for the machine learning model. The acquisition module 910 collects nanopore sequencing signal data, performs state recognition on the collected data, extracts the nanopore sequencing signal data X₁, X₂, ..., X_N corresponding to a recognized state, and performs feature extraction on the extracted data X₁, X₂, ..., X_N to generate a feature value vector, where N is a natural number greater than 1. The training data include the result of the state recognition and the feature value vector.

The training module 920 is configured to train the machine learning model using the acquired training data, thereby obtaining a trained machine learning model.

The acquisition module 910 further includes the following units: a grouping unit 9101, configured to divide the nanopore sequencing signal data into G groups according to value magnitude, where G is a natural number greater than 1 and less than N, and where the feature values include at least one of a quantity proportion, a standard deviation and a coefficient of variation; a first feature value generating unit 9102, configured to calculate the quantity proportion of each of the G groups and generate G quantity-proportion feature values V₁₁, V₁₂, ..., V_1G, where the quantity proportion of each group is the ratio of the number of data in the group to the total number of data N; a second feature value generating unit 9103, configured to calculate the standard deviation of the data in each of the G groups and generate G standard-deviation feature values V₂₁, V₂₂, ..., V_2G; a third feature value generating unit 9104, configured to calculate the coefficient of variation of the data in each of the G groups and generate G coefficient-of-variation feature values V₃₁, V₃₂, ..., V_3G, where the coefficient of variation of each group is the ratio of the standard deviation of the data in the group to the mean of the data in the group; and a feature value vector generating unit 9105, configured to generate the feature value vector based on the generated feature values.

The machine learning model in the present disclosure includes models such as a decision tree model, a Bayesian model or a neural network model.

A specific example of the machine learning model in the present disclosure is the open-source xgboost model. For details, refer to the open-source code; a brief description of the xgboost model follows.

xgboost is an additive model composed of k base models:

$$\hat{y}_i = \sum_{t=1}^{k} f_t(x_i)$$

where f_t is the t-th base model and \hat{y}_i is the predicted value for the i-th sample x_i.

Each base model is a decision tree model. Figure 10 shows an example of a decision tree model. As shown in Figure 10, people can be classified by sex and age according to whether they like computer games. For example, when the age is greater than or equal to 15, the score is -1; when the age is less than 15, the score is +2 for a male and +0.1 for a female. A higher score indicates a stronger liking for computer games.
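The tree described for Figure 10 and the additive combination of base models can be sketched as follows. The function names and the dictionary representation of a person are illustrative assumptions; only the split rules and scores come from the description above, and a real xgboost model learns its trees from data rather than having them written by hand.

```python
from typing import Callable, Dict, Sequence

Person = Dict[str, object]


def likes_games_tree(person: Person) -> float:
    """Score from the single tree described for Figure 10."""
    if person["age"] >= 15:
        return -1.0
    return 2.0 if person["sex"] == "male" else 0.1


def additive_predict(base_models: Sequence[Callable[[Person], float]],
                     person: Person) -> float:
    """Sum the outputs of all base models, as in the additive formula."""
    return sum(model(person) for model in base_models)


boy = {"age": 10, "sex": "male"}
adult = {"age": 40, "sex": "female"}
print(additive_predict([likes_games_tree], boy))     # prints: 2.0
print(additive_predict([likes_games_tree], adult))   # prints: -1.0
```

With more than one tree in `base_models`, the prediction is simply the sum of the individual tree scores.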

The feature extraction method described in the present disclosure is not limited to the specific embodiments above; those skilled in the art can make various improvements, modifications, deletions and substitutions on the basis of it. For example, feature values can be added to, removed from or replaced in the generated feature value vector. As one example, one or more of the maximum, the minimum, the mean, the quotient of the maximum and the minimum, and the quotient of the maximum and the mean of the 10 quantity-proportion feature values can be calculated and added to the generated feature value vector.

The machine learning model described in the present disclosure is not limited to the specific embodiments above. For example, other common models such as linear regression, support vector machines, random forests, clustering algorithms and hidden Markov models can also be used.

The machine-learning-based method for identifying a nanopore sequencing signal state and the method for training a machine learning model described in the present disclosure are not limited to state identification of nanopore sequencing signals; they can also be extended to the classification and identification of other time-domain and frequency-domain signals.

Although multiple components are shown in the block diagrams above, those skilled in the art will understand that embodiments of the present invention may be implemented with one or more components omitted or with certain components combined.

Although the steps are described above in the order shown in the drawings, those skilled in the art will understand that the steps may be performed in a different order, and that embodiments of the present invention may be implemented without one or more of the steps.

It can be understood from the foregoing that the electronic components of one or more systems or devices may include, but are not limited to, at least one processing unit, a memory, and a communication bus or communication means that couples the various components, including the memory, to the processing unit. A system or device may include or have access to various device-readable media. The system memory may include device-readable storage media in the form of volatile and/or non-volatile memory, such as read-only memory (ROM) and/or random access memory (RAM). By way of example and not limitation, the system memory may also include an operating system, application programs, other program modules and program data.

Embodiments may be implemented as a system, a method or a program product. Accordingly, an embodiment may take the form of an entirely hardware embodiment or an embodiment including software (including firmware, resident software, microcode, etc.), all of which may be referred to herein as a "circuit", "module" or "system". Furthermore, an embodiment may take the form of a program product embodied in at least one device-readable medium having device-readable program code embodied thereon.

A combination of device-readable storage media may be used. In the context of this document, a device-readable storage medium ("storage medium") may be any tangible, non-signal medium that can contain or store a program consisting of program code configured for use by or in connection with an instruction execution system, apparatus or device. For the purposes of the present disclosure, a storage medium or device is to be interpreted as non-transitory, i.e., as not including signals or propagation media.

The present disclosure has been presented for purposes of illustration and description and is not intended to be exhaustive or limiting. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were chosen and described in order to explain the principles and practical application, and to enable those of ordinary skill in the art to understand the various embodiments of the present disclosure with various modifications suited to the particular use contemplated.

Claims

A method for identifying a nanopore sequencing signal state based on machine learning comprises the following steps:

Acquiring nanopore sequencing signal data X₁, X₂, ..., X_N, and performing feature extraction on the nanopore sequencing signal data to generate a feature value vector, wherein N is a natural number greater than 1; and

Using the generated feature value vector and the trained machine learning model to identify the state of the acquired nanopore sequencing signal data,

The feature extraction comprises the following steps:

Dividing the nanopore sequencing signal data into G groups according to value magnitude, wherein G is a natural number greater than 1 and less than N, and wherein the feature values include at least one of a quantity proportion, a standard deviation and a coefficient of variation;

Calculating the quantity proportion of each of the G groups, and generating G quantity-proportion feature values V₁₁, V₁₂, ..., V_1G, wherein the quantity proportion of each group is the ratio of the number of data in the group to the total number of data N;

Calculating the standard deviation of the data in each of the G groups, and generating G standard-deviation feature values V₂₁, V₂₂, ..., V_2G;

Calculating the coefficient of variation of the data in each of the G groups, and generating G coefficient-of-variation feature values V₃₁, V₃₂, ..., V_3G, wherein the coefficient of variation of each group is the ratio of the standard deviation of the data in the group to the mean of the data in the group; and

Generating the feature value vector based on the generated feature values.

The identification method according to claim 1, wherein, before dividing the nanopore sequencing signal data into G groups according to the size of the value, the feature extraction further includes a standardization step, and the division is performed on the nanopore sequencing signal data standardized by the standardization step,

The standardization steps include:

For each data in the nanopore sequencing signal data, the difference between the value of the data and the minimum value of the signal data is divided by the difference between the maximum value of the signal data and the minimum value of the signal data, and the obtained value is used as the standardized value of the data, wherein the minimum value of the signal data is the minimum value in the acquired nanopore sequencing signal data, and the maximum value of the signal data is the maximum value in the acquired nanopore sequencing signal data.

The identification method according to claim 2, wherein the division comprises:

The normalized nanopore sequencing signal data is divided into 10 groups from 0 to 1 at intervals of 0.1, thereby dividing the signal data into 10 groups falling into the intervals [0, 0.1), [0.1, 0.2), ..., [0.9, 1] respectively.

The recognition method according to claim 1, wherein the feature extraction further comprises the following steps:

Calculating the maximum, the minimum, the mean, the quotient of the maximum and the minimum, and the quotient of the maximum and the mean of the G standard-deviation feature values V₂₁, V₂₂, ..., V_2G, to generate five feature values V₄₁, V₄₂, ..., V₄₅; and

Calculating the maximum, the minimum, the mean, the quotient of the maximum and the minimum, and the quotient of the maximum and the mean of the G coefficient-of-variation feature values V₃₁, V₃₂, ..., V_3G, to generate five feature values V₅₁, V₅₂, ..., V₅₅.

The recognition method according to claim 1, wherein the machine learning model comprises a decision tree model, a Bayesian model or a neural network model.

A method for training a machine learning model for identifying nanopore sequencing signal states, comprising the following steps:

Acquiring training data for the machine learning model, the acquiring comprising: collecting nanopore sequencing signal data, performing state recognition on the collected nanopore sequencing signal data, intercepting the nanopore sequencing signal data X₁, X₂, ..., X_N corresponding to the recognized state, and performing feature extraction on the intercepted nanopore sequencing signal data X₁, X₂, ..., X_N to generate a feature value vector, wherein N is a natural number greater than 1, and the training data comprises a result of the state recognition and the feature value vector; and

Using the acquired training data to train the machine learning model, thereby acquiring a trained machine learning model,

The feature extraction comprises the following steps:

Dividing the nanopore sequencing signal data X₁, X₂, ..., X_N into G groups according to the magnitude of the values, wherein G is a natural number greater than 1 and less than N, and wherein the feature value includes at least one of the quantity proportion, the standard deviation and the coefficient of variation;

Calculating the coefficient of variation of the data of each of the G groups respectively, generating G coefficient of variation feature values V₃₁, V₃₂, ..., V_{3G}, wherein the coefficient of variation of the data of each group is the ratio of the standard deviation of the data of the group to the mean value of the data of the group; and

The training method according to claim 7, wherein, before dividing the nanopore sequencing signal data into G groups according to the size of the value, the feature extraction further includes a standardization step, and the division is performed on the nanopore sequencing signal data standardized by the standardization step,

The standardization step comprises:

The training method according to claim 8, wherein the dividing comprises:

The training method according to claim 7, wherein the feature extraction further comprises the following steps:

The training method according to claim 7, wherein the machine learning model comprises a decision tree model, a Bayesian model or a neural network model.

A device for identifying the state of a nanopore sequencing signal based on machine learning, comprising the following modules:

an acquisition module, configured to acquire nanopore sequencing signal data X₁, X₂, ..., X_N, and perform feature extraction on the nanopore sequencing signal data to generate a feature value vector, wherein N is a natural number greater than 1; and

an identification module, configured to identify the state of the acquired nanopore sequencing signal data using the generated feature value vector and the trained machine learning model,

wherein the acquisition module further comprises the following units:

a grouping unit, configured to divide the nanopore sequencing signal data into G groups according to the magnitude of the values, wherein G is a natural number greater than 1 and less than N, and wherein the feature value includes at least one of a quantity proportion, a standard deviation, and a coefficient of variation;

a first feature value generating unit, configured to calculate the quantity proportion of each of the G groups and generate G quantity proportion feature values V₁₁, V₁₂, ..., V_{1G}, wherein the quantity proportion of each group is the ratio of the number of data points in the group to the total number of data points N;

a second feature value generating unit, configured to calculate the standard deviation of the data of each of the G groups and generate G standard deviation feature values V₂₁, V₂₂, ..., V_{2G};

a third feature value generating unit, configured to calculate the coefficient of variation of the data of each of the G groups and generate G coefficient of variation feature values V₃₁, V₃₂, ..., V_{3G}, wherein the coefficient of variation of the data of each group is the ratio of the standard deviation of the data of the group to the mean value of the data of the group; and

a feature value vector generating unit, configured to generate the feature value vector based on the plurality of generated feature values.
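As an illustration only, the grouping unit and the feature value generating units above might be combined as in the following sketch. The function name `extract_features` is hypothetical. The sketch assumes min-max standardization followed by G equal-width intervals as one possible way of grouping by magnitude, uses the population standard deviation (the claim does not say whether the population or sample form is meant), and assigns 0.0 to the standard deviation and coefficient of variation of an empty group or a group with zero mean.

```python
import statistics
from typing import List, Sequence


def extract_features(signal: Sequence[float], g: int = 10) -> List[float]:
    """Return [V11..V1G, V21..V2G, V31..V3G] for one signal segment."""
    if not signal:
        raise ValueError("signal must contain at least one data point")
    lo, hi = min(signal), max(signal)
    span = hi - lo
    # Assumed grouping: min-max scale to [0, 1], then g equal-width intervals.
    scaled = [(x - lo) / span if span else 0.0 for x in signal]
    groups: List[List[float]] = [[] for _ in range(g)]
    for x in scaled:
        groups[min(int(x * g), g - 1)].append(x)

    n = len(signal)
    # Quantity proportion: group size divided by the total count N.
    proportions = [len(grp) / n for grp in groups]
    # Population standard deviation per group (assumption); 0.0 if empty.
    stds = [statistics.pstdev(grp) if grp else 0.0 for grp in groups]
    # Coefficient of variation: standard deviation / mean; 0.0 if undefined.
    cvs = []
    for grp, sd in zip(groups, stds):
        mean = statistics.mean(grp) if grp else 0.0
        cvs.append(sd / mean if mean else 0.0)
    return proportions + stds + cvs
```

With the default `g=10` the resulting vector has 30 entries, and it could then be passed to whichever trained model is used by the identification module.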

A training device for a machine learning model for identifying nanopore sequencing signal states, comprising the following modules:

an acquisition module, configured to acquire training data for the machine learning model, wherein the acquisition module acquires nanopore sequencing signal data, performs state recognition on the acquired nanopore sequencing signal data, intercepts the nanopore sequencing signal data X₁, X₂, ..., X_N corresponding to the recognized state, and performs feature extraction on the intercepted nanopore sequencing signal data X₁, X₂, ..., X_N to generate a feature value vector, wherein N is a natural number greater than 1, and the training data includes the state recognition result and the feature value vector; and

a training module, configured to train the machine learning model using the acquired training data, thereby acquiring a trained machine learning model,

wherein the acquisition module further comprises the following units:

a grouping unit, configured to divide the nanopore sequencing signal data X₁, X₂, ..., X_N into G groups according to the magnitude of the values, wherein G is a natural number greater than 1 and less than N, and wherein the feature value includes at least one of the quantity proportion, the standard deviation and the coefficient of variation;

An electronic device, comprising:

at least one processor; and

a memory communicatively connected to the at least one processor; wherein,

The memory stores instructions that can be executed by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1 to 12.

A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to execute the method according to any one of claims 1 to 12.