WO2024260345A1

WO2024260345A1 - High-entropy knn clustering method and device based on random traversal, and medium

Info

Publication number: WO2024260345A1
Application number: PCT/CN2024/099875
Authority: WO
Inventors: 徐同明; 鹿海洋; 魏代森; 张梅; 祝静; 孙帅; 林卉; 马娉婷; 蔺永建
Original assignee: Inspur General Software Co Ltd
Current assignee: Inspur General Software Co Ltd
Priority date: 2023-06-19
Filing date: 2024-06-18
Publication date: 2024-12-26
Anticipated expiration: 2025-12-19
Also published as: CN116451099A; CN116451099B

Abstract

The present application relates to the field of electric digital data processing, and discloses a high-entropy KNN clustering method and device based on random traversal, and a medium. The method comprises: acquiring a sample set needing to be clustered; on the basis of a random traversal mode and category labels of other specified samples that have been classified before, classifying the specified samples; for the remaining samples to be classified other than prior samples in the sample set, selecting K prior samples closest to the samples to be classified as comparison samples; and on the basis of the mode of similarity difference and determined category labels of the comparison samples, obtaining category labels of the samples to be classified. The high-entropy effect of the prior samples is ensured. On the basis of the mode of similarity difference, the requirements of inter-category homogeneity and intra-category dissimilarity are effectively met, the high-entropy clustering process of all samples is implemented, and the requirements for high-entropy clustering are met.

Description

A high entropy KNN clustering method, device and medium based on random traversal

本申请要求于2023年06月19日提交中国专利局、申请号为202310720618.4、发明名称为“一种基于随机遍历的高熵KNN聚类方法、设备及介质”的中国专利申请的优先权，其全部内容通过引用结合在本申请中。This application claims the priority of the Chinese patent application filed with the China Patent Office on June 19, 2023, with application number 202310720618.4 and invention name “A high entropy KNN clustering method, device and medium based on random traversal”, the entire contents of which are incorporated by reference in this application.

Technical Field

本申请涉及电数字数据处理领域，具体涉及一种基于随机遍历的高熵KNN聚类方法、设备及介质。The present application relates to the field of electronic digital data processing, and in particular to a high entropy KNN clustering method, device and medium based on random traversal.

Background Art

K最邻近分类算法(K-Nearest Neighbor，KNN)是一种监督学习算法，其能够根据K个最近的邻居的状态来决定样本的状态，常用于样本分类。通常来说，KNN算法能够呈现类间迥异、类内同质的特点，也就是能起到类间高熵、类内低熵的效果。K-Nearest Neighbor (KNN) is a supervised learning algorithm that can determine the state of a sample based on the states of its K nearest neighbors and is often used for sample classification. Generally speaking, the KNN algorithm can present the characteristics of being very different between classes and homogeneous within classes, that is, it can achieve the effect of high entropy between classes and low entropy within classes.

但是，随着技术的发展，出现一些类间同质、类内迥异的应用需求，比如，在对多类型产品或者多类型的数据进行分类时，只需要保证每个类别中，各类型的产品或者数据是符合一定比例的即可。此时在分类过程中，需要保证实现类间低熵、类内高熵的效果，通过传统的KNN算法是难以实现的。However, with the development of technology, some application requirements have emerged, such as homogeneity between classes and heterogeneity within classes. For example, when classifying multiple types of products or multiple types of data, it is only necessary to ensure that each type of product or data in each category meets a certain proportion. In this case, during the classification process, it is necessary to ensure the effect of low entropy between classes and high entropy within classes, which is difficult to achieve through the traditional KNN algorithm.

发明内容Summary of the invention

为了解决上述问题，本申请提出了一种基于随机遍历的高熵KNN聚类方法，包括：In order to solve the above problems, this application proposes a high entropy KNN clustering method based on random traversal, including:

获取需要进行聚类的样本集合，并在所述样本集合中，选取若干个指定样本；Obtain a sample set that needs to be clustered, and select a number of specified samples from the sample set;

基于随机遍历的方式，在所述若干个指定样本中依次选取每个指定样本，针对该指定样本，根据在先已经分类完成的其他指定样本的类别标签，对该指定样本进行分类，并将完成分类的该指定样本作为先验样本；Based on the random traversal method, each designated sample is selected in turn from the plurality of designated samples, and for the designated sample, the designated sample is classified according to the category labels of other designated samples that have been classified previously. The designated sample is classified and the classified designated sample is used as the prior sample;

针对所述样本集合中，除所述先验样本以外剩余的待分类样本，选取与所述待分类样本距离最近的K个先验样本，作为对比样本；所述K为预先设置的正整数值；For the remaining samples to be classified except the priori samples in the sample set, K priori samples closest to the samples to be classified are selected as comparison samples; K is a preset positive integer value;

基于相似度相异的方式，以及所述对比样本已确定的类别标签，得到所述待分类样本的类别标签，直至对所有待分类样本完成分类。Based on the different similarities and the determined category labels of the comparison samples, the category labels of the samples to be classified are obtained until all the samples to be classified are classified.

另一方面，本申请还提出了一种基于随机遍历的高熵KNN聚类设备，包括：On the other hand, the present application also proposes a high entropy KNN clustering device based on random traversal, including:

至少一个处理器；以及，at least one processor; and,

与所述至少一个处理器通信连接的存储器；a memory communicatively coupled to the at least one processor;

其中，所述存储器存储有可被所述至少一个处理器执行的指令，所述指令被所述至少一个处理器执行，以使所述至少一个处理器能够执行如：上述基于随机遍历的高熵KNN聚类方法。The memory stores instructions that can be executed by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute, for example, the above-mentioned high entropy KNN clustering method based on random traversal.

另一方面，本申请还提出了一种非易失性计算机存储介质，存储有计算机可执行指令，所述计算机可执行指令设置为：上述基于随机遍历的高熵KNN聚类方法。On the other hand, the present application also proposes a non-volatile computer storage medium storing computer executable instructions, wherein the computer executable instructions are configured as: the above-mentioned high entropy KNN clustering method based on random traversal.

通过本申请提出的基于随机遍历的高熵KNN聚类方法，能够带来如下有益效果：The high entropy KNN clustering method based on random traversal proposed in this application can bring the following beneficial effects:

通过随机遍历过程中得到的先验样本，并且可以在随机遍历过程中以就远原则，保证了先验样本的高熵效果。基于相似度相异的方式，有效实现类间同质、类内迥异的需求，实现对所有样本的高熵聚类过程，满足了对于高熵聚类的需求。The prior samples obtained in the random traversal process can be used to ensure the high entropy effect of the prior samples based on the principle of distance in the random traversal process. Based on the method of different similarities, the requirements of homogeneity between classes and differences within classes are effectively met, and the high entropy clustering process of all samples is realized, which meets the needs of high entropy clustering.

BRIEF DESCRIPTION OF THE DRAWINGS

此处所说明的附图用来提供对本申请的进一步理解，构成本申请的一部分，本申请的示意性实施例及其说明用于解释本申请，并不构成对本申请的不当限定。在附图中：The drawings described herein are used to provide a further understanding of the present application and constitute a part of the present application. The illustrative embodiments of the present application and their descriptions are used to explain the present application and do not constitute an improper limitation on the present application. In the drawings:

图1为本申请实施例中基于随机遍历的高熵KNN聚类方法的流程示意图； FIG1 is a schematic diagram of a high entropy KNN clustering method based on random traversal in an embodiment of the present application;

图2为本申请实施例中，一种场景下对先验样本的分类示意图；FIG2 is a schematic diagram of classification of prior samples in a scenario in an embodiment of the present application;

图3为本申请实施例中传统KNN聚类算法的结果示意图；FIG3 is a schematic diagram of the results of a traditional KNN clustering algorithm in an embodiment of the present application;

图4为本申请实施例中，相似度相异的方式进行分类的示意图；FIG4 is a schematic diagram of classification in different similarity modes in an embodiment of the present application;

图5为本申请实施例中相似度相异的方式的分类结果示意图；FIG5 is a schematic diagram of classification results in different similarities according to an embodiment of the present application;

图6为本申请实施例中基于随机遍历的高熵KNN聚类设备的示意图。FIG6 is a schematic diagram of a high entropy KNN clustering device based on random traversal in an embodiment of the present application.

DETAILED DESCRIPTION

为使本申请的目的、技术方案和优点更加清楚，下面将结合本申请具体实施例及相应的附图对本申请技术方案进行清楚、完整地描述。显然，所描述的实施例仅是本申请一部分实施例，而不是全部的实施例。基于本申请中的实施例，本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例，都属于本申请保护的范围。In order to make the purpose, technical solution and advantages of the present application clearer, the technical solution of the present application will be clearly and completely described below in combination with the specific embodiments of the present application and the corresponding drawings. Obviously, the described embodiments are only part of the embodiments of the present application, not all of the embodiments. Based on the embodiments in the present application, all other embodiments obtained by ordinary technicians in this field without making creative work are within the scope of protection of the present application.

以下结合附图，详细说明本申请各实施例提供的技术方案。The technical solutions provided by various embodiments of the present application are described in detail below in conjunction with the accompanying drawings.

如图1所示，本申请实施例提供基于随机遍历的高熵KNN聚类方法，包括：As shown in FIG1 , the embodiment of the present application provides a high entropy KNN clustering method based on random traversal, including:

S101：获取需要进行聚类的样本集合，并在所述样本集合中，选取若干个指定样本。S101: Obtain a sample set that needs to be clustered, and select a number of specified samples from the sample set.

与传统的KNN聚类不同的是，在本文中的高熵KNN聚类所要实现的目的不同。在预先获取的数据集合中，选取若干个数据，该数据里可以是产品数据、图像数据、音频数据等。Different from the traditional KNN clustering, the purpose of high entropy KNN clustering in this paper is different. In the pre-acquired data set, several data are selected, which can be product data, image data, audio data, etc.

将若干个数据作为样本集合，以对样本集合进行聚类，此时，聚类的目的不再是将相同或相似类别的数据汇集在一个类簇中，而是在聚类结果的类簇中，不同类别的数据符合预设比例。比如，以产品数据为例，最终得到的每个类簇中，产品质量的比例符合预设比例，优品、良品、差品的比例符合5:3:2的比例，即可达到预先的目的。Several data are used as sample sets to cluster the sample sets. At this time, the purpose of clustering is no longer to gather data of the same or similar categories into one cluster, but to make the data of different categories meet the preset ratio in the clusters of the clustering results. For example, taking product data as an example, in each cluster finally obtained, the ratio of product quality meets the preset ratio, and the ratio of excellent products, good products, and poor products meets the ratio of 5:3:2, which can achieve the preset purpose.

在样本集合中，选取若干个指定样本，指定样本为具有可识别特点(也可以称作显著特点)的样本，比如，以产品数据为例，某些产品的质量非常优秀，或者具有非常明显的残次，则可以认为其具有可识别特点。或者，对图像数据进行识别时，图像中明显存在指定物品，或者明显不存在指定物品的，认为其具有可识别特点。通常来说选取的指定样本数量相比于样本集合为少量。In the sample set, a number of designated samples are selected. The designated samples are samples with identifiable characteristics (also called significant characteristics). For example, taking product data as an example, if the quality of some products is very good, or they have very obvious defects, they can be considered to have identifiable characteristics. Or, when recognizing image data, if the designated object is obviously present in the image, or if the designated object is obviously absent, it is considered to have identifiable characteristics. Have identifiable characteristics. Usually the number of samples selected is small compared to the sample set.

S102：基于随机遍历的方式，在所述若干个指定样本中依次选取每个指定样本，针对该指定样本，根据在先已经分类完成的其他指定样本的类别标签，对该指定样本进行分类，并将完成分类的该指定样本作为先验样本。S102: Based on a random traversal method, each designated sample is selected in turn from the plurality of designated samples, and for the designated sample, the designated sample is classified according to the category labels of other designated samples that have been classified before, and the designated sample that has been classified is used as a priori sample.

随机遍历指的是，在所有指定样本中，每次通过随机选取的方式，选取一个指定样本，在确定了该指定样本的类别标签后，再通过随机选取的方式选取下一个指定样本，进行分类，直至将所有指定样本都遍历，完成分类。Random traversal means that among all the specified samples, one specified sample is selected each time by random selection. After the category label of the specified sample is determined, the next specified sample is selected by random selection for classification until all the specified samples are traversed and the classification is completed.

具体地，针对选取出的该指定样本，确定在先已经分类完成的其他指定样本的样本数量。当然，此时若样本数量为0，则说明该指定样本为第一个样本，则在所有类别标签，随机选取一个类别标签，作为该指定样本的类别标签，在此假设，将该指定样本的类别标签定义为A，将其所属的类别称作A类。Specifically, for the selected designated sample, determine the sample number of other designated samples that have been classified previously. Of course, if the sample number is 0 at this time, it means that the designated sample is the first sample, then a category label is randomly selected from all category labels as the category label of the designated sample. Here, it is assumed that the category label of the designated sample is defined as A, and the category to which it belongs is called category A.

若样本数量为所需划分的类别数量(所需划分的类别数量，也就是类别标签的类别数量，在本文中为方便解释，以两类为例)的整数倍，则确定该指定样本与每个类别标签下对应的其他指定样本之间的距离和，并将距离和最高对应的类别标签，作为该指定样本的类别标签。以两类的类别数量为例，当样本数量为偶数时，则为整数倍，当样本数量为奇数倍时，则为非整数倍。If the number of samples is an integer multiple of the number of categories to be divided (the number of categories to be divided is the number of categories of the category label. For the convenience of explanation in this article, two categories are taken as an example), then the sum of the distances between the specified sample and the other specified samples corresponding to each category label is determined, and the category label corresponding to the highest sum of distances is used as the category label of the specified sample. Taking the number of categories of two categories as an example, when the number of samples is an even number, it is an integer multiple, and when the number of samples is an odd number, it is a non-integer multiple.

当样本数量并非所需划分的类别数量的整数倍时，则在其他指定样本对应的所有类别标签中，选取数量最少的类别标签，作为该指定样本的类别标签。When the number of samples is not an integer multiple of the number of categories to be divided, the category label with the least number is selected from all the category labels corresponding to other specified samples as the category label of the specified sample.

仍以两类的类别数量为例，当分到第2个指定样本(如图2所示，在图2中，以数字1～6分别对应于第1个指定样本～第6个指定样本，将空心方框的图标代表A类，将方框内包含叉的图标代表B类，以这6个指定样本来举例进行解释说明)时，此时只有第1个指定样本被分到了A类，而类别总共包含A类和B类两类，此时A类对应的数量为1，B类对应的数量为0，则第2个指定样本被分到B类，其类别标签为B。Still taking the number of categories of two categories as an example, when it is classified into the second designated sample (as shown in Figure 2, in Figure 2, numbers 1 to 6 correspond to the first to sixth designated samples respectively, the icon of the hollow square represents category A, and the icon containing a cross in the square represents category B, and these six designated samples are used as examples for explanation), at this time only the first designated sample is classified into category A, and the category includes two categories, A and B. At this time, the number corresponding to category A is 1, and the number corresponding to category B is 0, then the second designated sample is classified into category B, and its category label is B.

当分类到第3个指定样本时，计算与前述第1个、第2个指定样本分别对应的距离，若距离第2个指定样本的距离大，根据就远原则，则分到B类，反之，分到A类。假设第3个指定样本分到B类，则此时A类和B类分别对应的样本数量为1和2。When the third designated sample is classified, the distances to the first and second designated samples are calculated. If the distance to the second designated sample is large, it is classified into Class B according to the principle of distance. Otherwise, it is classified into Class A. Assuming that the third designated sample is classified into Class B, the number of samples corresponding to Class A and Class B is 1 and 2 respectively.

对于第4个指定样本，根据前3个指定样本的分类情况，分到样本数量最少的类别，此时分到A类。For the fourth designated sample, the sample with the largest number of samples is assigned according to the classification of the first three designated samples. The category with fewer items is classified as Class A.

对于第5个指定样本，由于其之前，每个类别中均包含了多个指定样本，此时，需要计算其与前述4个指定样本之间的距离，与A类第1个指定样本和第4个指定样本之间的距离和为D_A＝d₁+d₄，与B类第2个指定样本和第3个指定样本之间的距离和为D_B＝d₂+d₃，其中，D_A和D_B分别表示第5个指定样本与A类和B类对应的其他指定样本之间的距离和，d₁～d₄分别为第5个指定样本与第1个指定样本～第4个指定样本之间的距离，然后比较D_A和D_B，在此假设D_A＞D_B，则仍按照就远原则，将第5个指定样本分到A类。For the fifth designated sample, since each category contains multiple designated samples before it, it is necessary to calculate the distance between it and the aforementioned four designated samples. The sum of the distances between it and the first and fourth designated samples of category A is D _A = d ₁ + d ₄ , and the sum of the distances between it and the second and third designated samples of category B is D _B = d ₂ + d ₃ , where D _A and D _B represent the sum of the distances between the fifth designated sample and other designated samples corresponding to categories A and B, respectively, and d ₁ to d ₄ are the distances between the fifth designated sample and the first to fourth designated samples, respectively. Then, D _A and D _B are compared. Assuming that D _A > D _B , the fifth designated sample is still classified into category A according to the principle of distance.

对于第6个指定样本，根据前5个指定样本的分类情况，分到样本数量最少的类别，此时分到B类。For the sixth designated sample, according to the classification of the first five designated samples, it is classified into the category with the least number of samples, which is Class B.

另外，在一些情况下，当类别数量达到三类或更多类时，当样本数量并非所需划分的类别数量的整数倍时，选取数量最少的类别标签中可能会包含多个类别标签，他们数量相等且均为最少，此时可以在其中进行随机选取，作为本次数量最少的类别标签，或者，计算当前指定样本与这些类别标签下的指定样本的距离和，按照就远原则，选择距离和更高的类别标签作为当前指定样本的类别标签。In addition, in some cases, when the number of categories reaches three or more, when the number of samples is not an integer multiple of the number of categories to be divided, the category label with the least number may contain multiple category labels, which are equal in number and the least in number. At this time, you can randomly select them as the category label with the least number this time, or calculate the distance and distance between the current specified sample and the specified samples under these category labels, and according to the principle of distance, select the category label with the higher distance and distance as the category label of the current specified sample.

S103：针对所述样本集合中，除所述先验样本以外剩余的待分类样本，选取与所述待分类样本距离最近的K个先验样本，作为对比样本；所述K为预先设置的正整数值。S103: For the remaining samples to be classified except the priori samples in the sample set, select K priori samples closest to the samples to be classified as comparison samples; K is a preset positive integer value.

K值的选取不宜过大或过小，通常来说，其与样本集合的样本容量相关。此时，确定样本集合对应的样本容量，根据样本容量确定分类过程中对应的K值以及指定样本的选取数量，其中，K值与样本容量的比值范围为[0.03,0.09]之内的正整数，且K值为奇数，当样本容量为100时，K值可以是3、5、7、9。The K value should not be too large or too small. Generally speaking, it is related to the sample size of the sample set. At this time, determine the sample size corresponding to the sample set, and determine the corresponding K value and the number of selected samples in the classification process according to the sample size. The ratio of the K value to the sample size is a positive integer within [0.03, 0.09], and the K value is an odd number. When the sample size is 100, the K value can be 3, 5, 7, or 9.

而先验样本的数量至少为K+1个，且每个类别中所包含的先验样本的数量相同。比如，当类别数量为两类时，则M＝2，N＝1，此时先验样本的数量最少为K+1个。而K往往是奇数，则此时，K+1为偶数，满足每个类别中数量相同的要求。当类别数量更多时，可能K+1是无法满足数量相同的要求，则此时可以提高先验样本的数量，直至满足要求即可。 The number of prior samples is at least K+1, and the number of prior samples in each category is the same. For example, when there are two categories, M=2, N=1, and the number of prior samples is at least K+1. K is often an odd number, so K+1 is an even number, which meets the requirement of the same number in each category. When there are more categories, K+1 may not be able to meet the requirement of the same number. In this case, the number of prior samples can be increased until the requirement is met.

S104：基于相似度相异的方式，以及所述对比样本在初始化分类中确定的类别标签，得到所述待分类样本的类别标签，直至对所有待分类样本完成分类。S104: Based on the different similarities and the category labels of the comparison samples determined in the initial classification, the category labels of the samples to be classified are obtained until all the samples to be classified are classified.

如图3所示，按照传统的KNN聚类算法，仍以相似度相同的方式进行聚类时，则最终得到的结果仍是类间迥异，类内同质的效果，此时类内仍处于低熵的状态，不符合本文中的需求。As shown in Figure 3, according to the traditional KNN clustering algorithm, when clustering is still performed in the same similarity manner, the final result is still very different between classes and homogeneous within classes. At this time, the class is still in a low entropy state, which does not meet the requirements of this article.

基于此，采用相似度相异的方式，针对每个待分类样本，确定其对应的对比样本中出现的类别标签，以及出现的各类别标签分别对应的出现次数。在所有类别标签中，选取出现次数最少的类别标签，作为待分类样本的类别标签。Based on this, the method of similarity difference is adopted to determine the category labels that appear in the corresponding comparison samples for each sample to be classified, as well as the number of occurrences of each category label. Among all the category labels, the category label with the least number of occurrences is selected as the category label of the sample to be classified.

如图4所示，其中除了空心方框的图标代表的A类、方框内包含叉的图标代表的B类之外，还包括以实心方框的图标代表的未确定的类别。在图4中，以K＝5为例，则第7样本(也就是当前正在确认的待分类样本)对应的5个对比样本分别是第1样本～第3样本、第5样本～第6样本。经确认，其中属于A类的是第1样本和第5样本，数量对应于2，属于B类的是第2样本、第3样本和第6样本，数量对应于3，A类对应的对比样本更少，则根据相似度相异的原则，第7样本被分至A类。As shown in Figure 4, in addition to the A category represented by the hollow square icon and the B category represented by the cross icon in the square, there are also undetermined categories represented by the solid square icon. In Figure 4, taking K=5 as an example, the 5 comparison samples corresponding to the 7th sample (that is, the sample to be classified currently being confirmed) are the 1st to 3rd samples, and the 5th to 6th samples. After confirmation, the 1st and 5th samples belong to the A category, corresponding to 2 in number, and the 2nd, 3rd and 6th samples belong to the B category, corresponding to 3 in number. The A category has fewer comparison samples, so according to the principle of different similarities, the 7th sample is classified into the A category.

此时，最终实现的效果可以如图5所示，达到类间同质、类内迥异的效果，此时类内处于高熵的状态，符合需求。At this point, the final effect can be shown in Figure 5, achieving the effect of homogeneity between classes and great differences within classes. At this time, the class is in a high entropy state, which meets the requirements.

在一个实施例中，上文中描述过需求选取距离最近的样本作为对比样本，在计算样本间距离时，确定样本集合中，各样本所包含的维度数量，并根据维度数量计算待分类样本与其他所有先验样本之间的距离，从而选取距离最近的K个先验样本，作为对比样本。In one embodiment, it is described above that the closest sample needs to be selected as a comparison sample. When calculating the distance between samples, the number of dimensions contained in each sample in the sample set is determined, and the distance between the sample to be classified and all other prior samples is calculated based on the number of dimensions, so as to select the K prior samples with the closest distance as comparison samples.

维度数量通常包括一维至三维数据，比如，文本数据中包含文字的一维数据，2D平面图像中包含像素在x轴和y轴二维数据，产品数据中包含外观、功能、价格的三维数据。The number of dimensions usually includes one-dimensional to three-dimensional data. For example, text data contains one-dimensional data of text, 2D plane images contain two-dimensional data of pixels on the x-axis and y-axis, and product data contains three-dimensional data of appearance, function, and price.

当样本集合中样本为一维时，通过d_i＝|x_i-x₁|得到待分类样本与先验样本之间的距离，其中，d_i为待分类样本与第i个先验样本之间的距离，x_i为第i个先验样本的坐标，x₁为待分类样本的坐标。When the samples in the sample set are one-dimensional, the samples to be classified and the prior samples are obtained by d _i = | _xi - _x1 | Here, d _i is the distance between the sample to be classified and the i-th prior sample, _xi is the coordinate of the i-th prior sample, and _x1 is the coordinate of the sample to be classified.

当样本集合中样本为二维时，通过得到待分类样本与先验样本之间的距离，其中，d_i为待分类样本与第i个先验样本之间的距离，(x_i，y_i)为第i个先验样本的坐标，(x₁，y₁)为待分类样本的坐标；When the samples in the sample set are two-dimensional, The distance between the sample to be classified and the prior sample is obtained, where d _i is the distance between the sample to be classified and the i-th prior sample, (x _i , y _i ) is the coordinate of the i-th prior sample, and (x ₁ , y ₁ ) is the coordinate of the sample to be classified;

当样本集合中样本为三维，通过得到待分类样本与先验样本之间的距离，其中，d_i为待分类样本与第i个先验样本之间的距离，(x_i，y_i，z_i)为第i个先验样本的坐标，(x₁，y₁，z₁)为待分类样本的坐标。When the samples in the sample set are three-dimensional, The distance between the sample to be classified and the prior sample is obtained, where _di is the distance between the sample to be classified and the i-th prior sample, ( _xi , _yi , _zi ) is the coordinate of the i-th prior sample, and ( _x1 , _y1 , _z1 ) is the coordinate of the sample to be classified.

当然，通过该距离计算方式，也可以得到各指定样本之间的距离，并且维度数量还可以包括更多维度，此时推导类似地公式即可计算样本之间的距离。Of course, the distance between each specified sample can also be obtained through this distance calculation method, and the number of dimensions can also include more dimensions. In this case, a similar formula can be derived to calculate the distance between samples.

如图6所示，本申请还提出了一种基于随机遍历的高熵KNN聚类设备，包括：As shown in FIG6 , the present application also proposes a high entropy KNN clustering device based on random traversal, including:

至少一个处理器；以及，at least one processor; and,

其中，所述存储器存储有可被所述至少一个处理器执行的指令，所述指令被所述至少一个处理器执行，以使所述至少一个处理器能够执行如：上述任一实施例中所述的基于随机遍历的高熵KNN聚类方法。The memory stores instructions that can be executed by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the high entropy KNN clustering method based on random traversal as described in any of the above embodiments.

本申请还提出了一种非易失性计算机存储介质，存储有计算机可执行指令，所述计算机可执行指令设置为：上述任一实施例中所述的基于随机遍历的高熵KNN聚类方法。The present application also proposes a non-volatile computer storage medium storing computer executable instructions, wherein the computer executable instructions are configured to be: the high entropy KNN clustering method based on random traversal described in any of the above embodiments.

本申请中的各个实施例均采用递进的方式描述，各个实施例之间相同相似的部分互相参见即可，每个实施例重点说明的都是与其他实施例的不同之处。尤其，对于设备和介质实施例而言，由于其基本相似于方法实施例，所以描述的比较简单，相关之处参见方法实施例的部分说明即可。Each embodiment in this application is described in a progressive manner, and the same or similar parts between the embodiments can be referred to each other, and each embodiment focuses on the differences from other embodiments. In particular, for the device and medium embodiments, since they are basically similar to the method embodiments, the description is relatively simple, and the relevant parts can be referred to the partial description of the method embodiments.

本申请实施例提供的设备和介质与方法是一一对应的，因此，设备和介质也具有与其对应的方法类似的有益技术效果，由于上面已经对方法的有益技术效果进行了详细说明，因此，这里不再赘述设备和介质的有益技术效果。The devices and media provided in the embodiments of the present application correspond one-to-one to the methods. Therefore, the devices and media also have similar beneficial technical effects as the corresponding methods. Since the beneficial technical effects of the methods have been described in detail above, the beneficial technical effects of the devices and media will not be repeated here.

本领域内的技术人员应明白，本申请的实施例可提供为方法、系统、或计算机程序产品。因此，本申请可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且，本申请可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。It should be understood by those skilled in the art that the embodiments of the present application may be provided as methods, systems, or computers. Computer program product. Therefore, the present application may take the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

本申请是参照根据本申请实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器，使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。The present application is described with reference to the flowchart and/or block diagram of the method, device (system) and computer program product according to the embodiment of the present application. It should be understood that each process and/or box in the flowchart and/or block diagram, and the combination of the process and/or box in the flowchart and/or block diagram can be realized by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for realizing the function specified in one process or multiple processes in the flowchart and/or one box or multiple boxes in the block diagram.

这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中，使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品，该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce a manufactured product including an instruction device that implements the functions specified in one or more processes in the flowchart and/or one or more boxes in the block diagram.

这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上，使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理，从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。These computer program instructions may also be loaded onto a computer or other programmable data processing device so that a series of operational steps are executed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes in the flowchart and/or one or more boxes in the block diagram.

在一个典型的配置中，计算设备包括一个或多个处理器(CPU)、输入/输出接口、网络接口和内存。In a typical configuration, a computing device includes one or more processors (CPU), input/output interfaces, network interfaces, and memory.

内存可能包括计算机可读介质中的非永久性存储器，随机存取存储器(RAM)和/或非易失性内存等形式，如只读存储器(ROM)或闪存(flash RAM)。内存是计算机可读介质的示例。Memory may include non-permanent storage in a computer-readable medium, in the form of random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

计算机可读介质包括永久性和非永久性、可移动和非可移动媒体可以由任何方法或技术来实现信息存储。信息可以是计算机可读指令、数据结构、程序的模块或其他数据。计算机的存储介质的例子包括，但不限于相变内存(PRAM)、静态随机存取存储器(SRAM)、动态随机存取存储器(DRAM)、其他类型的随机存取存储器(RAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、快闪记忆体或其他内存技术、只读光盘只读存储器(CD-ROM)、数字多功能光盘(DVD)或其他光学存储、磁盒式磁带，磁带磁磁盘存储或其他磁性存储设备或任何其他非传输介质，可用于存储可以被计算设备访问的信息。按照本文中的界定，计算机可读介质不包括暂存电脑可读媒体(transitory media)，如调制的数据信号和载波。Computer-readable media include permanent and non-permanent, removable and non-removable media that can be used to store information by any method or technology. The information can be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable EEPROM, flash memory or other memory technology, CD-ROM, DVD or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices or any other non-transmission media that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media does not include transitory media such as modulated data signals and carrier waves.

还需要说明的是，术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、商品或者设备不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、商品或者设备所固有的要素。在没有更多限制的情况下，由语句“包括一个……”限定的要素，并不排除在包括所述要素的过程、方法、商品或者设备中还存在另外的相同要素。It should also be noted that the terms "include", "comprises" or any other variations thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity or device including a series of elements includes not only those elements, but also other elements not explicitly listed, or also includes elements inherent to such process, method, commodity or device. In the absence of more restrictions, the elements defined by the sentence "comprises a ..." do not exclude the existence of other identical elements in the process, method, commodity or device including the elements.

以上所述仅为本申请的实施例而已，并不用于限制本申请。对于本领域技术人员来说，本申请可以有各种更改和变化。凡在本申请的精神和原理之内所作的任何修改、等同替换、改进等，均应包含在本申请的权利要求范围之内。 The above is only an embodiment of the present application and is not intended to limit the present application. For those skilled in the art, the present application may have various changes and variations. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

A high entropy KNN clustering method based on random traversal, including:

A sample set to be clustered is obtained, and several designated samples are selected from the sample set; wherein several data are selected from the pre-acquired data set and the several data are used as the sample set; the several data are product data; the designated samples are samples with identifiable characteristics, and the identifiable characteristics include excellent product quality and defective products;

Based on the random traversal method, each designated sample is selected in turn from the plurality of designated samples, and for the designated sample, the designated sample is classified according to the category labels of other designated samples that have been classified before, and the designated sample that has been classified is used as a priori sample; wherein, for the designated sample, the sample number of other designated samples that have been classified before is determined, and if the sample number is 0, a category label is randomly selected from all category labels as the category label of the designated sample;

For the remaining samples to be classified except the priori samples in the sample set, K priori samples closest to the samples to be classified are selected as comparison samples; K is a preset positive integer value;

Based on the different similarities and the determined category labels of the comparison samples, the category labels of the samples to be classified are obtained until all samples to be classified are classified so that in the clusters of the clustering results, data of different categories meet the preset proportions.

The method according to claim 1, wherein obtaining the category label of the sample to be classified based on the manner in which the similarities are different and the category label determined for the comparison sample comprises:

Determine the category labels that appear in the comparison sample, and the number of occurrences of each category label that appears;

Among all the category labels, the category label with the least number of occurrences is selected as the category label of the sample to be classified.

The method according to claim 1, wherein the number of the prior samples is at least K+1, and the number of prior samples included in each category is the same.

The method according to claim 3, wherein, for the designated sample, classifying the designated sample according to the category labels of other designated samples that have been previously classified comprises:

For the designated sample, determine the sample quantity of other designated samples that have been classified previously;

If the number of samples is an integer multiple of the number of categories to be divided, the sum of distances between the designated sample and other designated samples corresponding to each category label is determined, and the category label corresponding to the highest distance sum is used as the category label of the designated sample;

Otherwise, among all the category labels corresponding to the other specified samples, the category label with the least number is selected as the category label of the specified sample.

The method according to claim 1, wherein selecting K prior samples closest to the sample to be classified as comparison samples comprises:

Calculate the distance between the sample to be classified and all other prior samples according to the number of dimensions contained in each sample in the sample set;

Select the K prior samples closest to each other as comparison samples.

The method according to claim 5, wherein calculating the distance between the sample to be classified and all other prior samples comprises:

When the samples in the sample set are one-dimensional, the distance between the sample to be classified and the priori sample is obtained by d _i =| _xi - _x1 |, wherein d _i is the distance between the sample to be classified and the i-th priori sample, _xi is the coordinate of the i-th priori sample, and _x1 is the coordinate of the sample to be classified;

When the samples in the sample set are two-dimensional, Obtaining the distance between the sample to be classified and the priori sample, wherein d _i is the distance between the sample to be classified and the i-th priori sample, (x _i , y _i ) is the coordinate of the i-th priori sample, and (x ₁ , y ₁ ) is the coordinate of the sample to be classified;

When the samples in the sample set are three-dimensional, The distance between the sample to be classified and the priori sample is obtained, wherein d _i is the distance between the sample to be classified and the i-th priori sample, (xi _, _yi , z _i ) is the coordinate of the i-th priori sample, and ( _x1 , _y1 , _z1 ) is the coordinate of the sample to be classified.

A high entropy KNN clustering device based on random traversal, comprising:

at least one processor; and,

a memory communicatively coupled to the at least one processor;

The memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the following steps: The method according to any one of claims 1 to 6.

A non-volatile computer storage medium storing computer executable instructions, wherein the computer executable instructions are configured as: the method described in any one of claims 1 to 6.