WO2024124466A1

WO2024124466A1 - Method for clustering time series data and device

Info

Publication number: WO2024124466A1
Application number: PCT/CN2022/139189
Authority: WO
Inventors: 曹耀方; 章文蔚; 颜旭; 钟沛彬; 季州翔; 黎宇翔; 董宇亮; 徐讯
Original assignee: BGI Shenzhen Co Ltd
Current assignee: BGI Shenzhen Co Ltd
Priority date: 2022-12-15
Filing date: 2022-12-15
Publication date: 2024-06-20
Anticipated expiration: 2025-06-15
Also published as: CN120239855A

Abstract

Provided is a method for clustering time series data, comprising: pre-clustering sample points in the time series data (S310), wherein the pre-clustering comprises traversing all sample points in the sample points, and on the basis of a distance between a current sample point and a previous sample point of the current sample point, clustering the sample points into corresponding clusters (S311), and calculating a mean value for all sample points in each cluster in all the generated clusters, and determining, on the basis of a mean value difference between a current cluster and adjacent clusters thereof, whether to merge the current cluster and the clusters adjacent thereto (S312); and on the basis of the mean value, clustering all the pre-clustered clusters by using the DBSCAN algorithm (S320).

Description

Method and device for clustering time series data

Technical Field

本公开涉及聚类技术，更具体地，涉及用于对时序数据进行聚类的方法及其设备。The present disclosure relates to clustering technology, and more particularly, to a method and device for clustering time series data.

Background technique

随着互联网计算机技术的发展，实时数据流成为数据信息中的一种重要的数据形式，且已被广泛应用于网络流量控制、数据监测系统、互联网金融等领域。如何快速有效的从高速、大量的实时数据流中提取信息则成为数据流挖掘领域中的一大挑战。聚类分析是数据挖掘过程中一项重要技术，核心目标是在相似的基础上将数据进行分类，它是一种无监督的机器学习任务，而基于密度的聚类算法最能体现高效性。机器学习是一种数据驱动的方法，能够学习研究和构建一种特殊算法，让计算机自己在数据中学习从而进行预测，从而快速、有效地解决这个问题。With the development of Internet computer technology, real-time data streams have become an important form of data information and have been widely used in network traffic control, data monitoring systems, Internet finance and other fields. How to quickly and effectively extract information from high-speed, large-scale real-time data streams has become a major challenge in the field of data stream mining. Cluster analysis is an important technology in the data mining process. The core goal is to classify data based on similarity. It is an unsupervised machine learning task, and density-based clustering algorithms can best reflect efficiency. Machine learning is a data-driven method that can study, research and build a special algorithm to allow computers to learn from data and make predictions, thereby quickly and effectively solving this problem.

然而，现有技术对于实际产生的时序信号，无法准确聚类，对噪声鲁棒性差，效率低。However, the existing technology cannot accurately cluster the actually generated time series signals, has poor robustness to noise, and is inefficient.

因此，需要能够对时序信号进行准确聚类的技术。Therefore, a technique is needed that can accurately cluster time series signals.

发明内容Summary of the invention

本公开的实施例的目的在于提供对时序信号进行准确聚类的有效解决方案。具体地，本公开的实施例提供了一种对时序数据进行聚类的方法及其设备。The purpose of the embodiments of the present disclosure is to provide an effective solution for accurately clustering time series signals. Specifically, the embodiments of the present disclosure provide a method and device for clustering time series data.

根据本公开的第一方面，提出了一种用于对时序数据进行聚类的方法，包括：对所述时序数据中的样本点进行预聚类，其中，所述预聚类包括遍历所述样本点中的所有样本点，基于当前样本点与所述当前样本点的前一样本点之间的距离，将所述样本点聚类为相应的簇，以及对所生成的所有簇中的每一个簇中的所有样本点计算均值，基于当前簇与其相邻的簇的均值差，确定是否合并所述当前簇和与其相邻的簇；以及基于均值对所述预聚类后的所有簇进行聚类。According to a first aspect of the present disclosure, a method for clustering time series data is proposed, comprising: pre-clustering sample points in the time series data, wherein the pre-clustering includes traversing all sample points in the sample points, clustering the sample points into corresponding clusters based on the distance between the current sample point and the previous sample point of the current sample point, and calculating the mean of all sample points in each of all the generated clusters, determining whether to merge the current cluster and its adjacent clusters based on the mean difference between the current cluster and its adjacent clusters; and clustering all the clusters after the pre-clustering based on the mean.

在一些实施例中，所述时序数据包括纳米孔序列数据，并以浮点型保存在存储器中，所述纳米孔序列数据包括测序信号和开孔电流信号。In some embodiments, the time series data includes nanopore sequence data and is stored in a memory in floating point format, wherein the nanopore sequence data includes a sequencing signal and an open pore current signal.

在一些实施例中，将保存在所述存储器中的所述时序数据转换为数组。In some embodiments, the time series data stored in the memory is converted into an array.

在一些实施例中，通过计算当前样本点与所述当前样本点的前一样本点的L1范数的值，获得所述距离。In some embodiments, the distance is obtained by calculating the L1 norm value of the current sample point and the previous sample point of the current sample point.

在一些实施例中，基于当前样本点与所述当前样本点的前一样本点之间的距离，将所述样本点聚类为相应的簇包括：当前样本点与所述当前样本点的前一样本点的L1范数小于初始最小距离ε时，将所述当前样本点与所述前一样本点标记为相同的簇；当前样本点与所述当前样本点的前一样本点的L1范数大于或等于所述ε时，如果当前簇中的总样本点数小于最小样本点数MinPts，则将所述当前样本点与所述前一样本点标记为相同的簇，否则将所述当前样本点设为新的簇。In some embodiments, clustering the sample points into corresponding clusters based on the distance between the current sample point and the previous sample point of the current sample point includes: when the L1 norm of the current sample point and the previous sample point of the current sample point is less than the initial minimum distance ε, marking the current sample point and the previous sample point as the same cluster; when the L1 norm of the current sample point and the previous sample point of the current sample point is greater than or equal to the ε, if the total number of sample points in the current cluster is less than the minimum number of sample points MinPts, marking the current sample point and the previous sample point as the same cluster, otherwise setting the current sample point to a new cluster.

在一些实施例中，基于当前簇与其相邻的簇的均值差，确定是否合并所述当前簇和与其相邻的簇包括：如果当前簇与其相邻的簇的均值差小于阈值eps_c，则合并所述当前簇和与其相邻的簇；否则保持所述当前簇不变。In some embodiments, based on the mean difference between the current cluster and its adjacent clusters, determining whether to merge the current cluster and its adjacent clusters includes: if the mean difference between the current cluster and its adjacent clusters is less than a threshold eps_c, merging the current cluster and its adjacent clusters; otherwise keeping the current cluster unchanged.

在一些实施例中，利用最邻近搜索确定所述初始最小距离ε、所述最小样本点数MinPts和所述阈值eps_c。In some embodiments, the initial minimum distance ε, the minimum number of sample points MinPts and the threshold eps_c are determined using nearest neighbor search.

在一些实施例中，所述初始最小距离ε、所述最小样本点数MinPts和所述阈值eps_c是可调整参数。In some embodiments, the initial minimum distance ε, the minimum number of sample points MinPts and the threshold eps_c are adjustable parameters.

在一些实施例中，通过建立KD树或球树来实现所述最邻近搜索。In some embodiments, the nearest neighbor search is implemented by building a KD tree or a ball tree.

在一些实施例中，基于均值，利用DBSCAN算法对所述预聚类后的所有簇进行聚类还基于标准差。In some embodiments, clustering all the pre-clustered clusters using the DBSCAN algorithm is based on the mean and also based on the standard deviation.

根据本公开的第二方面，还提供了一种用于对时序数据进行聚类的设备，包括：一个或多个处理器；以及一个或多个存储器，所述一个或多个存储器存储计算机可执行指令，所述计算机可执行指令当由所述一个或多个处理器执行时，使所述一个或多个处理器执行上述任一方法。According to the second aspect of the present disclosure, a device for clustering time series data is also provided, comprising: one or more processors; and one or more memories, wherein the one or more memories store computer-executable instructions, and when the computer-executable instructions are executed by the one or more processors, the one or more processors execute any of the above methods.

根据本公开的第三方面，还提供了一种计算机存储介质，其上存储有计算机可执行指令，所述计算机可执行指令当由处理器执行时，使所述处理器执行上述任一方法。According to a third aspect of the present disclosure, a computer storage medium is further provided, on which computer executable instructions are stored. When the computer executable instructions are executed by a processor, the processor is caused to execute any of the above methods.

本公开创造性地将一维时序信号进行预聚类，优化了搜索过程，并且可以充分应用在时序数据聚类中。具体来说：The present disclosure creatively pre-clusters one-dimensional time series signals, optimizes the search process, and can be fully applied in time series data clustering. Specifically:

1)针对现有技术鲁棒性问题1) Regarding the robustness issues of existing technologies

基于预聚类设计的方法有效地将噪声点准确分类，理论上可以解决各种形态的噪声点的聚类，可以有效地应用于对于不需要将噪声点聚类为其他类的应用中。The method based on pre-clustering design can effectively and accurately classify noise points. In theory, it can solve the clustering of noise points of various forms and can be effectively applied to applications that do not need to cluster noise points into other categories.

2)针对现有技术的效率低的问题2) Addressing the low efficiency of existing technologies

本发明是基于预聚类的DBSCAN，最邻近搜索过程可以通过建立KD树或者球树进行规模限制优化。与DBSCAN算法相比，改进后的算法的时间复杂度明显下降，并且在样本集较大时，聚类的收敛时间大大降低，有效地提高了时序数据分类的准确率和效率，并且对特征不断变化的场景具有较好的适应性。The present invention is based on pre-clustering DBSCAN, and the nearest neighbor search process can be optimized by establishing a KD tree or a ball tree for scale restriction. Compared with the DBSCAN algorithm, the time complexity of the improved algorithm is significantly reduced, and when the sample set is large, the clustering convergence time is greatly reduced, which effectively improves the accuracy and efficiency of time series data classification, and has good adaptability to scenes with constantly changing features.

BRIEF DESCRIPTION OF THE DRAWINGS

为了更完整地理解本公开及其优势，现在将参考结合附图的以下描述，其中：For a more complete understanding of the present disclosure and its advantages, reference will now be made to the following description taken in conjunction with the accompanying drawings, in which:

图1是示出了作为时序数据的示例的纳米孔序列数据的示意图；FIG1 is a schematic diagram showing nanopore sequence data as an example of time series data;

图2是示出了DBSCAN算法的示意图；FIG2 is a schematic diagram showing a DBSCAN algorithm;

图3是示出了根据本公开的实施例的用于对时序数据进行聚类的方法的示意流程图；FIG3 is a schematic flow chart showing a method for clustering time series data according to an embodiment of the present disclosure;

图4是示出了根据本公开的实施例的用于对时序数据进行聚类的方法的具体实现的流程图；FIG4 is a flowchart showing a specific implementation of a method for clustering time series data according to an embodiment of the present disclosure;

图5-图7是示出了根据本公开的实施例的用于对时序数据进行聚类的方法的预聚类结果的图；5-7 are diagrams showing pre-clustering results of a method for clustering time series data according to an embodiment of the present disclosure;

图8是示出了根据本公开的实施例的用于对时序数据进行聚类的方法的聚类结果的图；FIG8 is a diagram showing a clustering result of a method for clustering time series data according to an embodiment of the present disclosure;

图9是示出了利用DBSCAN算法对模拟数据的聚类结果的图；FIG9 is a diagram showing the clustering results of simulated data using the DBSCAN algorithm;

图10是示出了根据本公开的实施例的用于对时序数据进行聚类的方法对模拟数据的聚类结果的图；FIG10 is a diagram showing clustering results of simulation data according to a method for clustering time series data according to an embodiment of the present disclosure;

图11是示出了利用DBSCAN算法对实验数据的聚类结果的图；FIG11 is a diagram showing the clustering results of the experimental data using the DBSCAN algorithm;

图12是示出了根据本公开的实施例的用于对时序数据进行聚类的方法对实验数据的聚类结果的图；FIG12 is a diagram showing clustering results of experimental data by a method for clustering time series data according to an embodiment of the present disclosure;

图13是示出了利用DBSCAN算法与根据本公开的实施例的用于对时序数据进行聚类的方法的处理时间比较的表；13 is a table showing a comparison of processing time using the DBSCAN algorithm and a method for clustering time series data according to an embodiment of the present disclosure;

图14示意性地示出了根据本公开的实施例的用于对时序数据进行聚类的方法的设备1400的框图。FIG. 14 schematically shows a block diagram of a device 1400 for a method of clustering time series data according to an embodiment of the present disclosure.

在附图中，相同或相似的结构均以相同或相似的附图标记进行标识。In the drawings, the same or similar structures are marked with the same or similar reference numerals.

Detailed ways

根据结合附图对本公开示例性实施例的以下详细描述，本公开的其它方面、优势和突出特征对于本领域技术人员将变得显而易见。Other aspects, advantages and salient features of the disclosure will become apparent to those skilled in the art from the following detailed description of exemplary embodiments of the disclosure taken in conjunction with the accompanying drawings.

在本公开中，术语“包括”和“含有”及其派生词意为包括而非限制；术语“或”是包含性的，意为和/或。In this disclosure, the terms "include" and "including" and their derivatives mean inclusion rather than limitation; the term "or" is inclusive, meaning and/or.

在本说明书中，下述用于描述本公开原理的各种实施例只是说明性的，不应该以任何方式解释为限制公开的范围。参照附图的下述描述用于帮助全面理解由权利要求及其等同物限定的本公开的示例性实施例。下述描述包括多种具体细节来帮助理解，但这些细节应认为仅仅是示例性的。因此，本领域普通技术人员应认识到，在不背离本公开的范围和精神的情况下，可以对本文中描述的实施例进行多种改变和修改。此外，为了清楚和简洁起见，省略了公知功能和结构的描述。此外，贯穿附图，相同附图标记用于相似功能和操作。In this specification, the various embodiments described below for describing the principles of the present disclosure are merely illustrative and should not be interpreted in any way as limiting the scope of the disclosure. The following description with reference to the accompanying drawings is used to help fully understand the exemplary embodiments of the present disclosure as defined by the claims and their equivalents. The following description includes a variety of specific details to aid understanding, but these details should be considered merely exemplary. Therefore, it should be recognized by those of ordinary skill in the art that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. In addition, for the sake of clarity and brevity, descriptions of well-known functions and structures are omitted. In addition, throughout the accompanying drawings, the same figure numerals are used for similar functions and operations.

图1是示出了作为时序数据的示例的纳米孔序列数据的示意图。FIG. 1 is a schematic diagram showing nanopore sequence data as an example of time-series data.

本公开涉及纳米孔测序信号(即，纳米孔序列数据)的聚类方法。纳米孔测序是将人工合成的一种多聚合物的膜浸在离子溶液中，多聚合物膜上布满了经改造的跨膜通道蛋白(纳米孔)，在膜两侧施加不同的电压产生电压差，DNA链在马达蛋白的牵引下，解螺旋通过纳米孔蛋白，不同的碱基会形成特性性离子电流变化信号。纳米孔电信号是一维时序信号。纳米孔测序信号指的是DNA分子穿过纳米孔时记录的穿孔电流变化，这种变化主要是由不同的碱基在纳米孔中电流不同导致的。本公开提供的聚类方法主要是解决诸如纳米孔测序信号之类的时序数据的聚类。如图1所示，纳米孔序列数据可以包括测序信号和开孔电流信号。任意形态的纳米孔序列数据可以被采集并以浮点型保存在存储器中作为用于聚类方法的样本点。The present disclosure relates to a clustering method for nanopore sequencing signals (i.e., nanopore sequence data). Nanopore sequencing is to immerse a synthetic polymer membrane in an ionic solution. The polymer membrane is covered with modified transmembrane channel proteins (nanopores). Different voltages are applied on both sides of the membrane to generate a voltage difference. The DNA chain is unwound through the nanopore protein under the traction of the motor protein, and different bases will form characteristic ion current change signals. Nanopore electrical signals are one-dimensional time series signals. Nanopore sequencing signals refer to the changes in perforation current recorded when DNA molecules pass through nanopores. This change is mainly caused by different currents of different bases in nanopores. The clustering method provided in the present disclosure is mainly to solve the clustering of time series data such as nanopore sequencing signals. As shown in Figure 1, nanopore sequence data may include sequencing signals and pore current signals. Nanopore sequence data of any form can be collected and stored in a memory in floating point type as sample points for clustering methods.

图2是示出了DBSCAN算法的示意图。FIG2 is a schematic diagram showing the DBSCAN algorithm.

在对时序数据进行聚类的现有技术中，为了检测实时时序数据(例如，纳米孔电信号)并对其进行聚类，基于密度的聚类分析算法从样本密度的角度来考察样本之间的可连续性，并基于可连续样本不断扩展聚类簇以获得最终的聚类结果。DBSCAN算法是基于密度空间的聚类算法，在机器学习和数据挖掘领域有广泛的应用，其聚类原理通俗点讲是每个簇类的密度高于该簇类周围的密度，噪声的密度小于任一簇类的密度。如图2所示，DBSCAN算法的基本思想为：设定一个最小距离ε，对数据集中的每一个点，以最小距离ε为半径画圆，圆内的点称为临近点(neighbors)。如果临近点的个数小于阈值MinPts，将当前点设为边界点(Boundary molecule)；如果邻近点的个数大于或等于MinPts，则该点称为核心点(Core molecule)，将该点和它的临近点标记为一个簇(cluster)；而既不是核心点又不是边界点的点(即，邻近点的个数为零的点)称为噪声点(Outlier)。同时循环所有的临近点，将每一个临近点的临近点都设为同一个簇，当所有满足条件的点遍历结束后，cluster+1，并开始下一个循环。In the prior art of clustering time series data, in order to detect real-time time series data (for example, nanopore electrical signals) and cluster them, the density-based clustering analysis algorithm examines the continuity between samples from the perspective of sample density, and continuously expands clustering clusters based on continuous samples to obtain the final clustering results. The DBSCAN algorithm is a clustering algorithm based on density space, which is widely used in the fields of machine learning and data mining. Its clustering principle is generally that the density of each cluster is higher than the density around the cluster, and the density of noise is less than the density of any cluster. As shown in Figure 2, the basic idea of the DBSCAN algorithm is: set a minimum distance ε, and for each point in the data set, draw a circle with the minimum distance ε as the radius. The points in the circle are called neighbors. If the number of neighboring points is less than the threshold MinPts, the current point is set as a boundary molecule; if the number of neighboring points is greater than or equal to MinPts, the point is called a core molecule, and the point and its neighboring points are marked as a cluster; points that are neither core points nor boundary points (i.e., points with zero neighboring points) are called outliers. At the same time, loop all neighboring points, set the neighboring points of each neighboring point to the same cluster, and when all points that meet the conditions are traversed, cluster+1, and start the next loop.

虽然现有的DBSCAN算法能够实现对时序数据的聚类，但是对于实际纳米孔产生的时序信号无法准确聚类，对噪声的鲁棒性差，并且在样本集较大的情况下，聚类效率较低。Although the existing DBSCAN algorithm can cluster time series data, it cannot accurately cluster the time series signals generated by actual nanopores, has poor robustness to noise, and has low clustering efficiency when the sample set is large.

因此，为了解决上述问题，本公开提出了一种改进的用于对时序数据进行聚类的方法。该方法是根据上述现有的DBSCAN算法改进的基于预聚类的聚类方法。Therefore, in order to solve the above problems, the present disclosure proposes an improved method for clustering time series data. The method is a pre-clustering-based clustering method improved according to the above existing DBSCAN algorithm.

图3是示出了根据本公开的实施例的用于对时序数据进行聚类的方法300的示意流程图。FIG. 3 is a schematic flowchart showing a method 300 for clustering time series data according to an embodiment of the present disclosure.

如图3所示，根据本公开实施例的用于对时序数据进行聚类的方法300可以包括：As shown in FIG. 3 , a method 300 for clustering time series data according to an embodiment of the present disclosure may include:

步骤S310：对时序数据中的样本点进行预聚类；以及Step S310: pre-clustering sample points in the time series data; and

步骤S320：基于聚类簇中所有样本点的均值，利用DBSCAN算法对预聚类后的所有簇进行聚类。Step S320: clustering all pre-clustered clusters using the DBSCAN algorithm based on the mean of all sample points in the cluster.

根据实施例，步骤S310中的预聚类可以包括：According to an embodiment, the pre-clustering in step S310 may include:

步骤S311：遍历所有样本点，基于当前样本点与当前样本点的前一样本点之间的距离，将样本点聚类为相应的簇；以及Step S311: traverse all sample points, and cluster the sample points into corresponding clusters based on the distance between the current sample point and the previous sample point of the current sample point; and

步骤S312：对所生成的所有簇中的每一个簇中的所有样本点计算均值，基于当前簇与其相邻的簇的均值差，确定是否合并当前簇和与其相邻的簇。Step S312: Calculate the mean of all sample points in each of all generated clusters, and determine whether to merge the current cluster with its adjacent clusters based on the mean difference between the current cluster and its adjacent clusters.

参照图4，示出了进行预聚类的一个具体示例。4 , a specific example of performing pre-clustering is shown.

可以设定一个初始最小距离eps_p(S410)，对输入的一维时序数据，从开始到结尾依次遍历每一个样本点pt(S420)，如果当前样本点pt与前一个样本点的L1范数小于eps_p(S430：是)，则可以将当前样本点pt标记为与前一个样本点相同的簇(S431)，否则(S430：否)设为新的簇，继续下一个样本点。其中，对于数值来说，L1范数是当前样本点与前一个样本点的差的绝对值。在此，以L1范数作为距离，本领域技术人员也可以设想其他形式的距离。上述基于距离的预聚类操作的结果在图5中示出。An initial minimum distance eps_p (S410) can be set, and for the input one-dimensional time series data, each sample point pt is traversed from the beginning to the end (S420). If the L1 norm of the current sample point pt and the previous sample point is less than eps_p (S430: yes), the current sample point pt can be marked as the same cluster as the previous sample point (S431), otherwise (S430: no), it is set as a new cluster and continues to the next sample point. Among them, for numerical values, the L1 norm is the absolute value of the difference between the current sample point and the previous sample point. Here, the L1 norm is used as the distance, and those skilled in the art can also conceive of other forms of distance. The result of the above distance-based pre-clustering operation is shown in Figure 5.

在遍历过程中，即便当前样本点pt与前一个样本点的L1范数大于eps_p(S430：否)，但如果当前簇的总样本点数小于最小样本点数MinPts(S432：是)，则也可以将当前样本点pt标记为与前一个样本点相同的簇(S431)，否则(S432：否)设为新的簇(S433)，继续下一个样本点。上述基于总样本点数的预聚类操作的结果在图6中示出。During the traversal process, even if the L1 norm of the current sample point pt and the previous sample point is greater than eps_p (S430: No), if the total number of sample points in the current cluster is less than the minimum number of sample points MinPts (S432: Yes), the current sample point pt can also be marked as the same cluster as the previous sample point (S431), otherwise (S432: No) it is set as a new cluster (S433) and continues to the next sample point. The result of the above pre-clustering operation based on the total number of sample points is shown in FIG6.

可以将上述操作的聚类结果存储在数据库中(S440)，该聚类结果可以包含聚类簇开始索引、聚类簇结束索引、聚类簇、聚类簇中所有样本点的均值和聚类簇中所有点的标准差。The clustering result of the above operation may be stored in a database (S440), and the clustering result may include a cluster start index, a cluster end index, a cluster, a mean of all sample points in the cluster, and a standard deviation of all points in the cluster.

可以对通过上述操作获得的簇中的每一个簇中的所有样本点计算均值，若其与相邻的簇的均值差小于阈值eps_c(S450：是)，则可以将两个簇合并(S451)，否则(S450：否)保持簇不变(S452)。上述基于均值差的预聚类操作的结果在图7中示出。The mean value of all sample points in each cluster obtained by the above operation can be calculated. If the mean difference between the cluster and the adjacent cluster is less than the threshold value eps_c (S450: yes), the two clusters can be merged (S451), otherwise (S450: no) the cluster remains unchanged (S452). The result of the above pre-clustering operation based on mean difference is shown in FIG7.

可以根据簇合并操作的结果，更新上述预聚类操作的聚类结果。The clustering result of the above pre-clustering operation may be updated according to the result of the cluster merging operation.

对于预聚类操作的结果，可以基于每个簇中的所有样本点的均值，利用DBSCAN算法对所有簇进行聚类(S460)。根据实施例，还可以进一步基于每个簇中的所有样本点的标准差来进行聚类。可以更新上述聚类操作的聚类结果。上述聚类操作的结果在图8中示出。For the result of the pre-clustering operation, all clusters can be clustered using the DBSCAN algorithm based on the mean of all sample points in each cluster (S460). According to an embodiment, clustering can be further performed based on the standard deviation of all sample points in each cluster. The clustering result of the above clustering operation can be updated. The result of the above clustering operation is shown in FIG8.

在以上示例中，eps_p控制相邻样本点之间的差异，MinPts是簇内的最少点数目，并且eps_c在预聚类中控制相邻聚类平均值之间的差异。eps_p、MinPts和eps_c是可调整参数。例如，可以将保存在存储器中的浮点型数据转换为数组，利用最邻近搜索确定最佳参数。最邻近搜索是在给定集合中找到与给定点最接近(或最相似)的点的优化问题。最邻近搜索可以通过建立KD树或球树等来实现。本公开基于所确定的最佳参数对实时时序数据应用上述方法进行聚类并得到聚类结果。In the above example, eps_p controls the difference between adjacent sample points, MinPts is the minimum number of points in a cluster, and eps_c controls the difference between adjacent cluster averages in pre-clustering. eps_p, MinPts, and eps_c are adjustable parameters. For example, the floating-point data stored in the memory can be converted into an array, and the optimal parameters can be determined using the nearest neighbor search. The nearest neighbor search is an optimization problem of finding the point closest to (or most similar to) a given point in a given set. The nearest neighbor search can be implemented by establishing a KD tree or a ball tree, etc. The present disclosure applies the above method to cluster the real-time time series data based on the determined optimal parameters and obtains the clustering results.

图9是示出了利用DBSCAN算法对模拟数据的聚类结果的图。FIG. 9 is a diagram showing the clustering results of the simulation data using the DBSCAN algorithm.

图10是示出了根据本公开的实施例的用于对时序数据进行聚类的方法对模拟数据的聚类结果的图。FIG. 10 is a diagram showing clustering results of simulation data according to a method for clustering time series data according to an embodiment of the present disclosure.

通过对图9和图10进行对比，可以看出，本公开的聚类方法对模拟数据样本集中的样本点的聚类效果优于利用DBSCAN算法的聚类效果，并对模拟数据样本集中的噪声有很好的鲁棒性。By comparing FIG. 9 and FIG. 10 , it can be seen that the clustering method of the present invention has a better clustering effect on the sample points in the simulated data sample set than the clustering effect using the DBSCAN algorithm, and has good robustness to the noise in the simulated data sample set.

图11是示出了利用DBSCAN算法对实验数据的聚类结果的图。FIG. 11 is a diagram showing the clustering results of the experimental data using the DBSCAN algorithm.

图12是示出了根据本公开的实施例的用于对时序数据进行聚类的方法对实验数据的聚类结果的图。FIG. 12 is a diagram showing clustering results of experimental data by a method for clustering time series data according to an embodiment of the present disclosure.

同样，通过对图11和图12进行对比，可以看出，本公开的聚类方法对实验数据样本集中的样本点的聚类效果优于利用DBSCAN算法的聚类效果，并对实验数据样本集中的噪声有很好的鲁棒性。Similarly, by comparing FIG. 11 and FIG. 12 , it can be seen that the clustering effect of the clustering method disclosed in the present invention on the sample points in the experimental data sample set is better than the clustering effect using the DBSCAN algorithm, and has good robustness to the noise in the experimental data sample set.

图13是示出了利用DBSCAN算法与根据本公开的实施例的用于对时序数据进行聚类的方法的处理时间比较的表。Figure 13 is a table showing a comparison of the processing time of the method for clustering time series data using the DBSCAN algorithm and according to an embodiment of the present disclosure.

从图13中的表可以看出，在样本集中的样本点数为1000、30000和50000时，利用DBSCAN算法的聚类时间分别为0.01s、6.238s和29.395s，而利用本公开的聚类方法的聚类时间分别为0.0038s、0.0409s和0.0719s。利用本公开的聚类方法的聚类速度比利用DBSCAN算法的聚类速度分别提升了2.63倍、152.53倍和408.80倍。因此，利用本公开的聚类方法的聚类时间比利用DBSCAN算法的聚类时间显著降低，效率明显提高。As can be seen from the table in Figure 13, when the number of sample points in the sample set is 1000, 30000 and 50000, the clustering time using the DBSCAN algorithm is 0.01s, 6.238s and 29.395s respectively, while the clustering time using the clustering method disclosed in the present invention is 0.0038s, 0.0409s and 0.0719s respectively. The clustering speed using the clustering method disclosed in the present invention is 2.63 times, 152.53 times and 408.80 times higher than the clustering speed using the DBSCAN algorithm. Therefore, the clustering time using the clustering method disclosed in the present invention is significantly lower than the clustering time using the DBSCAN algorithm, and the efficiency is significantly improved.

综上所述，本发明创造性地将一维时序数据进行预聚类，该预聚类有利于提高后续基于DBSCAN聚类的准确性和效率，可以解决当前时序数据聚类对噪声敏感且效率低的问题。In summary, the present invention creatively pre-clusters one-dimensional time series data, which is beneficial to improving the accuracy and efficiency of subsequent DBSCAN-based clustering and can solve the problem that current time series data clustering is sensitive to noise and inefficient.

本发明可以基于如表1所示的开发环境来实现。虽然本发明在一般的CPU配置的机器上可以稳定运行，但是算法模块中牵涉到大量数据的读写和储存以及模型的训练和测试，需要大量的数据运算和操作，在高性能的CPU或带GPU配置的软硬件环境中运行能够显著地提升效率和稳定性。采集数据使用的是单通道纳米孔测序装置和PC机。利用纳米孔测序装置采集目标文库数据，采集后的信号数据以及其它信息保存为dat文件结构，并实时存入PC机硬盘。The present invention can be implemented based on the development environment shown in Table 1. Although the present invention can run stably on a machine with a general CPU configuration, the algorithm module involves the reading, writing and storage of a large amount of data as well as the training and testing of the model, which requires a large amount of data calculation and operation. Running in a high-performance CPU or a hardware and software environment with a GPU configuration can significantly improve efficiency and stability. A single-channel nanopore sequencing device and a PC are used to collect data. The target library data is collected using the nanopore sequencing device, and the collected signal data and other information are saved as a dat file structure and stored in real time in the PC hard disk.

表1.开发环境Table 1. Development environment

图14示意性地示出了根据本公开的实施例的用于对时序数据进行聚类的方法的设备1400的框图。图14所示的设备可以是任何具有处理能力的设备。需要注意的是，图14示出的设备仅仅是一个示例，不应对本申请实施例的功能和使用范围带来任何限制。FIG14 schematically shows a block diagram of a device 1400 for clustering time series data according to an embodiment of the present disclosure. The device shown in FIG14 can be any device with processing capabilities. It should be noted that the device shown in FIG14 is only an example and should not bring any limitation to the functions and scope of use of the embodiments of the present application.

如图14所示，根据该实施例的设备1400包括中央处理单元(CPU)1401(例如，表1中的CPU：

@3.60GHz)，其可以根据存储在只读存储器(ROM)1402中的程序或者从存储部分1408加载到随机访问存储器(RAM)1403中的程序而执行各种适当的动作和处理。在RAM 1403中，还存储有设备1400操作所需的各种程序和数据。CPU或GPU 1401、ROM 1402以及RAM 1403通过总线1404彼此相连。输入/输出(I/O)接口1405也连接至总线1404。 As shown in FIG. 14 , the device 1400 according to this embodiment includes a central processing unit (CPU) 1401 (e.g., the CPU in Table 1:

@3.60GHz), which can perform various appropriate actions and processes according to the program stored in the read-only memory (ROM) 1402 or the program loaded from the storage part 1408 to the random access memory (RAM) 1403. In the RAM 1403, various programs and data required for the operation of the device 1400 are also stored. The CPU or GPU 1401, the ROM 1402, and the RAM 1403 are connected to each other through the bus 1404. An input/output (I/O) interface 1405 is also connected to the bus 1404.

设备1400还可以包括连接至I/O接口1405的以下部件中的一项或多项：包括键盘或鼠标等的输入部分1406；包括诸如阴极射线管(CRT)或液晶显示器(LCD)等以及扬声器等的输出部分1407；包括硬盘等的存储部分1408；以及包括诸如LAN卡或调制解调器等的网络接口卡的通信部分1409。通信部分1409经由诸如因特网的网络执行通信处理。驱动器1410也根据需要连接至I/O接口1405。可拆卸介质1411，诸如磁盘、光盘、磁光盘或半导体存储器等等，根据需要安装在驱动器1410上，以便于从其上读出的计算机程序根据需要被安装入存储部分1408。The device 1400 may also include one or more of the following components connected to the I/O interface 1405: an input section 1406 including a keyboard or a mouse, etc.; an output section 1407 including a cathode ray tube (CRT) or a liquid crystal display (LCD), etc. and a speaker, etc.; a storage section 1408 including a hard disk, etc.; and a communication section 1409 including a network interface card such as a LAN card or a modem. The communication section 1409 performs communication processing via a network such as the Internet. A drive 1410 is also connected to the I/O interface 1405 as needed. A removable medium 1411, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, etc., is installed on the drive 1410 as needed, so that a computer program read therefrom is installed into the storage section 1408 as needed.

特别地，根据本公开的实施例，上文参考流程图描述的过程可以被实现为计算机软件程序。例如，本公开的实施例包括一种计算机程序产品，其包括承载在计算机可读介质上的计算机程序，该计算机程序包含用于执行流程图所示的方法的程序代码。在这样的实施例中，该计算机程序可以通过通信部分1409从网络上被下载和安装，和/或从可拆卸介质1411被安装。在该计算机程序被中央处理单元(CPU)1401执行时，执行本申请实施例的设备中限定的上述功能。In particular, according to an embodiment of the present disclosure, the process described above with reference to the flowchart can be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product, which includes a computer program carried on a computer-readable medium, and the computer program includes a program code for executing the method shown in the flowchart. In such an embodiment, the computer program can be downloaded and installed from a network through a communication part 1409, and/or installed from a removable medium 1411. When the computer program is executed by a central processing unit (CPU) 1401, the above-mentioned functions defined in the device of the embodiment of the present application are executed.

需要说明的是，本公开所示的计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质或者是上述两者的任意组合。计算机可读存储介质例如可以是——但不限于——电、磁、光、电磁、红外线或半导体的系统、装置或器件，或者任意以上的组合。计算机可读存储介质的更具体的例子可以包括但不限于：具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、随机访问存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑磁盘只读存储器(CD-ROM)、光存储器件、磁存储器件或者上述的任意合适的组合。在本公开中，计算机可读存储介质可以是任何包含或存储程序的有形介质，该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。而在本公开中，计算机可读的信号介质可以包括在基带中或者作为载波一部分传播的数据信号，其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式，包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读的信号介质还可以是计算机可读存储介质以外的任何计算机可读介质，该计算机可读介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。计算机可读介质上包含的程序代码可以用任何适当的介质传输，包括但不限于：无线、电线、光缆或RF等等，或者上述的任意合适的组合。It should be noted that the computer-readable medium shown in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium or any combination of the above two. The computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, device or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in combination with an instruction execution system, device or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, in which a computer-readable program code is carried. This propagated data signal may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above. Computer-readable signal media may also be any computer-readable medium other than computer-readable storage media, which may send, propagate or transmit a program for use by or in conjunction with an instruction execution system, apparatus or device. The program code contained on the computer-readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, optical cable or RF, etc., or any suitable combination of the above.

上文已经结合实施例对本公开的方法和涉及的设备进行了描述。本领域技术人员可以理解，上面示出的方法仅是示例性的。本公开的方法并不局限于上面示出的步骤和顺序。上面示出的设备可以包括更多的模块，例如还可以包括可以开发的或者将来开发的可用于该设备的模块等等。上文中示出的各种标识仅是示例性的而不是限制性的，本公开并不局限于作为这些标识的示例的具体信元。本领域技术人员根据所示实施例的教导可以进行许多改进、变形和替代，并可以扩展到其他需要将时间序列聚类的领域。The method of the present invention and the devices involved have been described above in conjunction with the embodiments. It will be appreciated by those skilled in the art that the method shown above is exemplary only. The method of the present invention is not limited to the steps and sequence shown above. The device shown above may include more modules, for example, it may also include modules that can be developed or developed in the future and can be used for the device, etc. The various identifiers shown above are exemplary rather than restrictive, and the present invention is not limited to the specific cells that serve as examples of these identifiers. Those skilled in the art may make many improvements, modifications and substitutions based on the teachings of the illustrated embodiments, and may extend to other fields where time series need to be clustered.

运行在根据本公开的设备上的程序可以是通过控制中央处理单元(CPU)来使计算机实现本公开的实施例功能的程序。该程序或由该程序处理的信息可以临时存储在易失性存储器(如随机存取存储器RAM)、硬盘驱动器(HDD)、非易失性存储器(如闪速存储器)、或其他存储器系统中。The program running on the device according to the present disclosure may be a program that enables a computer to implement the functions of the embodiments of the present disclosure by controlling a central processing unit (CPU). The program or information processed by the program may be temporarily stored in a volatile memory (such as a random access memory RAM), a hard disk drive (HDD), a non-volatile memory (such as a flash memory), or other memory systems.

用于实现本公开各实施例功能的程序可以记录在计算机可读记录介质上。可以通过使计算机系统读取记录在所述记录介质上的程序并执行这些程序来实现相应的功能。此处的所谓“计算机系统”可以是嵌入在该设备中的计算机系统，可以包括操作系统或硬件(如外围设备)。“计算机可读记录介质”可以是半导体记录介质、光学记录介质、磁性记录介质、短时动态存储程序的记录介质、或计算机可读的任何其他记录介质。The program for realizing the functions of each embodiment of the present disclosure can be recorded on a computer-readable recording medium. The corresponding functions can be realized by making a computer system read the program recorded on the recording medium and executing these programs. The so-called "computer system" herein can be a computer system embedded in the device, which can include an operating system or hardware (such as a peripheral device). "Computer-readable recording medium" can be a semiconductor recording medium, an optical recording medium, a magnetic recording medium, a recording medium of a short-term dynamic storage program, or any other recording medium readable by a computer.

用在上述实施例中的设备的各种特征或功能模块可以通过电路(例如，单片或多片集成电路)来实现或执行。设计用于执行本说明书所描述的功能的电路可以包括通用处理器、数字信号处理器(DSP)、专用集成电路(ASIC)、现场可编程门阵列(FPGA)、或其他可编程逻辑器件、分立的门或晶体管逻辑、分立的硬件组件、或上述器件的任意组合。通用处理器可以是微处理器，也可以是任何现有的处理器、控制器、微控制器、或状态机。上述电路可以是数字电路，也可以是模拟电路。因半导体技术的进步而出现了替代现有集成电路的新的集成电路技术的情况下，本公开的一个或多个实施例也可以使用这些新的集成电路技术来实现。The various features or functional modules of the devices used in the above embodiments can be implemented or executed by circuits (e.g., monolithic or multi-chip integrated circuits). Circuits designed to perform the functions described in this specification may include general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic, discrete hardware components, or any combination of the above devices. The general-purpose processor may be a microprocessor, or any existing processor, controller, microcontroller, or state machine. The above circuits may be digital circuits or analog circuits. In the case where new integrated circuit technologies have emerged to replace existing integrated circuits due to advances in semiconductor technology, one or more embodiments of the present disclosure may also be implemented using these new integrated circuit technologies.

如上所述，已经参考附图对本公开的实施例进行了详细描述。但是，具体的结构并不局限于上述实施例，本公开也包括不偏离本公开主旨的任何设计改动。另外，可以在权利要求的范围内对本公开进行多种改动，通过适当地组合不同实施例所公开的技术手段所得到的实施例也包含在本公开的技术范围内。此外，上述实施例中所描述的具有相同效果的组件可以相互替代。As described above, the embodiments of the present disclosure have been described in detail with reference to the accompanying drawings. However, the specific structure is not limited to the above embodiments, and the present disclosure also includes any design changes that do not deviate from the main purpose of the present disclosure. In addition, various changes can be made to the present disclosure within the scope of the claims, and the embodiments obtained by appropriately combining the technical means disclosed in different embodiments are also included in the technical scope of the present disclosure. In addition, the components with the same effect described in the above embodiments can be replaced by each other.

Claims

A method for clustering time series data, comprising:

Pre-clustering the sample points in the time series data, wherein the pre-clustering includes:

Traversing all sample points among the sample points, clustering the sample points into corresponding clusters based on the distance between a current sample point and a previous sample point of the current sample point; and

Calculating the mean of all sample points in each of all generated clusters, and determining whether to merge the current cluster with the adjacent clusters based on the mean difference between the current cluster and the adjacent clusters; and

All the pre-clustered clusters are clustered based on the mean.

The method according to claim 1, wherein the timing data includes nanopore sequence data and is stored in a memory in a floating point format, wherein the nanopore sequence data includes a sequencing signal and an opening current signal.

The method according to claim 2 further includes: converting the time series data stored in the memory into an array.

The method according to claim 1, wherein the distance is obtained by calculating the value of the L1 norm of the current sample point and the previous sample point of the current sample point.

The method according to claim 4, wherein clustering the sample points into corresponding clusters based on the distance between the current sample point and a previous sample point of the current sample point comprises:

When the L1 norm of the current sample point and the previous sample point of the current sample point is less than the initial minimum distance ε, the current sample point and the previous sample point are marked as the same cluster;

When the L1 norm of the current sample point and the previous sample point of the current sample point is greater than or equal to the ε, if the total number of sample points in the current cluster is less than the minimum number of sample points MinPts, the current sample point and the previous sample point are marked as the same cluster, otherwise the current sample point is set to a new cluster.

The method according to claim 5, wherein determining whether to merge the current cluster with the clusters adjacent to it based on the mean difference between the current cluster and the clusters adjacent to it comprises:

If the mean difference between the current cluster and its adjacent clusters is less than the threshold eps_c, the current cluster and its adjacent clusters are merged;

Otherwise the current cluster is kept unchanged.

The method according to claim 6 further comprises: using nearest neighbor search to determine the initial minimum distance ε, the minimum number of sample points MinPts and the threshold eps_c.

The method according to claim 7, wherein the initial minimum distance ε, the minimum number of sample points MinPts and the threshold eps_c are adjustable parameters.

The method according to claim 8, further comprising: implementing the nearest neighbor search by establishing a KD tree or a ball tree.

The method according to claim 1, wherein clustering all the pre-clustered clusters using the DBSCAN algorithm is based on the mean and is also based on the standard deviation.

A device for clustering time series data, comprising:

one or more processors; and

One or more memories storing computer executable instructions which, when executed by the one or more processors, cause the one or more processors to perform the method according to any one of claims 1 to 10.

A computer storage medium having computer executable instructions stored thereon, wherein when the computer executable instructions are executed by a processor, the processor is caused to perform the method according to any one of claims 1 to 10.