
CN116701622B - A three-branch text clustering method based on improved game theory rough set - Google Patents

A three-branch text clustering method based on improved game theory rough set

Info

Publication number
CN116701622B
Authority
CN
China
Prior art keywords
clustering
text
rough
text data
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310582217.7A
Other languages
Chinese (zh)
Other versions
CN116701622A (en)
Inventor
徐森
陆湘文
徐秀芳
花小朋
朱锦新
许贺洋
郭乃瑄
嵇宏伟
姜陈雨
陈思博
蔡娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yancheng Institute of Technology
Original Assignee
Yancheng Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yancheng Institute of Technology
Priority to CN202310582217.7A
Publication of CN116701622A
Application granted
Publication of CN116701622B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Creation or modification of classes or clusters
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a text three-branch clustering method based on an improved game theory rough set. The method comprises: collecting text data; preprocessing the text data to establish a text data set; establishing a corresponding rough set according to the text data set; clustering the rough set to obtain clustering information; and performing text clustering on the text data set according to the clustering information to obtain a plurality of data clusters. By clustering the text data set in this way, the method handles missing values in the data set, significantly reduces the computation time of text three-branch clustering, improves the universality of the model, and improves the accuracy of text three-branch clustering.

Description

Text three-branch clustering method based on improved game theory rough set
Technical Field
The invention relates to the technical field of data mining, in particular to a text three-branch clustering method based on an improved game theory rough set.
Background
Clustering techniques are widely used in machine learning, data analysis, artificial intelligence, and related fields. They can be used to solve many practical problems, such as market segmentation, recommendation systems, image analysis, and text classification.
The core idea of clustering is to find natural groups in a data set, where each group consists of data points with similar properties. Clustering algorithms typically represent data points as n-dimensional vectors and use a distance metric to calculate the similarity between data points. Once similarity has been measured, the algorithm can divide the data points into different clusters.
In modern data science, clustering algorithms have become very important because they can process large-scale data sets and find structure in them. Many clustering algorithms are available, including hierarchical clustering, k-means clustering, and DBSCAN. Each has its own advantages and disadvantages and can be selected according to the requirements of the problem at hand.
When data is incomplete, missing, or damaged, general clustering methods perform poorly. Three-branch clustering is a good way to handle the uncertainty that missing values introduce into clustering. Its core idea is that, when it cannot be determined whether an object belongs to a cluster, the object is placed in an undetermined class and the decision on it is postponed. A critical problem with this method, however, is that the thresholds of three-branch clustering are generally set to fixed values.
Therefore, the invention provides a text three-branch clustering method based on an improved game theory rough set.
Disclosure of Invention
The text three-branch clustering method based on the improved game theory rough set provided by the invention clusters the text data set using text three-branch clustering based on the improved game theory rough set. It handles missing values in the data set, significantly reduces the computation time of text three-branch clustering, improves the universality of the model, and improves the accuracy of text three-branch clustering.
The invention provides a text three-branch clustering method based on an improved game theory rough set, which comprises the following steps:
Step 1, collecting text data;
Step 2, preprocessing the text data to establish a text data set;
Step 3, establishing a corresponding rough set according to the text data set;
Step 4, clustering the rough set to obtain clustering information, and performing text clustering on the text data set according to the clustering information to obtain a plurality of data clusters.
In one possible implementation, step 2 includes:
Step 21, eliminating first non-important words with occurrence frequency smaller than a preset frequency in the text data to obtain target text data;
Step 22, marking the target text data with preset quantity as one type, and establishing a standard data set;
Step 23, eliminating second non-important words belonging to the preset part of speech in the standard data set, and establishing a text data set.
In one possible implementation, step 3 includes:
Step 31, respectively acquiring a first set of features corresponding to each text data set;
Step 32, dividing the text data set into a complete text data set and a defect text data set according to the first set characteristics;
Step 33, converting the complete text data set and the defect text data set into a complete rough set and a defect rough set;
Step 34, collectively recording the complete rough set and the defect rough set as rough sets.
In one possible implementation, step 4 includes:
Step 41, obtaining a second set of features corresponding to each rough set;
Step 42, marking each rough set whose second set features are defect features as a defect rough set, and marking each rough set whose second set features are complete features as a complete rough set;
Step 43, obtaining the defect rough set and the complete rough set, and establishing clustering information;
Step 44, acquiring a corresponding clustering algorithm according to the clustering information, and carrying out text clustering according to the clustering algorithm to obtain a plurality of data clusters.
In one possible implementation, step 44 includes:
Step 441, performing dimension reduction on each defect rough set by using Uniform Manifold Approximation and Projection (UMAP) to obtain a corresponding compensation rough set, and generating a compensation text data set according to the compensation rough set;
Step 442, extracting a target clustering algorithm from a preset clustering algorithm library according to the clustering information;
Step 443, clustering the compensation text data set and the complete text data set by using the target clustering algorithm to obtain a plurality of data clusters.
In one possible implementation, the method further comprises:
Obtaining a defect rough set;
Traversing the defect rough set by using a preset sample data segment to obtain a defect data segment on the defect rough set;
Recording a data segment except a defect data segment in the defect rough set as a non-defect data segment;
Acquiring the data proportion between the defect data segment and the non-defect data segment, and establishing an initial threshold group according to the data proportion;
And correcting the target clustering algorithm by using the initial threshold group, and clustering the compensated text data set and the complete text data set by using the corrected target clustering algorithm to obtain a plurality of data clusters.
In one possible implementation, the method further comprises:
The initial threshold group comprises a first initial threshold and a second initial threshold, each with a value range of (0, 1);
Combining the initial threshold group with the target clustering algorithm, adjusting the first initial threshold and the second initial threshold within the range (0, 1), and establishing a clustering result trend graph during the adjustment;
Determining, in the clustering result trend graph, the target trend point corresponding to the target clustering result;
Acquiring a first value and a second value corresponding to the target trend point, and correcting the target clustering algorithm by using the first value and the second value.
In one possible implementation, the method further comprises:
Naming each data cluster and transmitting the data clusters to a designated terminal for display.
The beneficial effects of the invention are as follows: text data is collected and preprocessed to establish a text data set, a corresponding rough set is then established, and the text data is clustered by clustering the rough set, yielding a plurality of data clusters. In this way, missing values in the data set can be handled before clustering, the computation time of text three-branch clustering is significantly reduced, the universality of the model is improved, and the accuracy of text three-branch clustering is improved.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
FIG. 1 is a schematic workflow diagram of a text three-branch clustering method based on an improved game theory rough set in an embodiment of the invention;
Fig. 2 is a schematic workflow diagram of step 3 of a text three-branch clustering method based on an improved game theory rough set in an embodiment of the invention.
Detailed Description
The preferred embodiments of the present invention will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present invention only, and are not intended to limit the present invention.
Example 1
The embodiment provides a text three-branch clustering method based on an improved game theory rough set, which is shown in fig. 1 and comprises the following steps:
Step 1, collecting text data;
Step 2, preprocessing the text data to establish a text data set;
Step 3, establishing a corresponding rough set according to the text data set;
Step 4, clustering the rough set to obtain clustering information, and performing text clustering on the text data set according to the clustering information to obtain a plurality of data clusters.
The working principle and beneficial effects of this technical scheme are as follows: text data is collected and preprocessed to establish a text data set, a corresponding rough set is then established, and the text data is clustered by clustering the rough set, yielding a plurality of data clusters. In this way, missing values in the data set can be handled before clustering, the computation time of text three-branch clustering is significantly reduced, the universality of the model is improved, and the accuracy of text three-branch clustering is improved.
Example 2
On the basis of Example 1, in the text three-branch clustering method based on the improved game theory rough set, step 2 includes:
Step 21, eliminating first non-important words with occurrence frequency smaller than a preset frequency in the text data to obtain target text data;
Step 22, marking the target text data with preset quantity as one type, and establishing a standard data set;
Step 23, eliminating second non-important words belonging to the preset part of speech in the standard data set, and establishing a text data set.
In this example, the preset frequency may be 3 times;
In this example, the first non-important word represents a word whose number of occurrences is less than a preset frequency;
In this example, the preset part of speech may be conjunctive adverbs;
In this example, the second non-important word represents a conjunctive adverb;
In this example, the preset number may be 50.
The working principle and beneficial effects of this technical scheme are as follows: to speed up clustering and keep unnecessary factors from interfering with the clustering result, non-important words are removed from the text data before cluster analysis and a standard data set is established, which reduces the clustering workload and improves clustering efficiency.
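Steps 21 to 23 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: documents are assumed to be pre-tokenized word lists, word frequency is assumed to be counted over the whole corpus, and the hypothetical word set `CONJUNCTIVE_ADVERBS` stands in for a real part-of-speech tagger. All function names are illustrative.

```python
from collections import Counter

# Hypothetical stand-in for a part-of-speech tagger: words treated as
# belonging to the preset part of speech (conjunctive adverbs).
CONJUNCTIVE_ADVERBS = {"however", "therefore", "moreover", "thus"}


def drop_rare_words(docs, min_count=3):
    """Step 21: remove words that occur fewer than min_count times overall."""
    counts = Counter(word for doc in docs for word in doc)
    return [[w for w in doc if counts[w] >= min_count] for doc in docs]


def group_documents(docs, group_size=50):
    """Step 22: record every group_size documents as one class."""
    return [docs[i:i + group_size] for i in range(0, len(docs), group_size)]


def drop_part_of_speech(groups, banned=CONJUNCTIVE_ADVERBS):
    """Step 23: remove words of the preset part of speech from every group."""
    return [[[w for w in doc if w not in banned] for doc in group]
            for group in groups]


def preprocess(docs, min_count=3, group_size=50):
    """Run steps 21 to 23 in order and return the grouped text data set."""
    target = drop_rare_words(docs, min_count)
    standard = group_documents(target, group_size)
    return drop_part_of_speech(standard)
```

The defaults `min_count=3` and `group_size=50` mirror the preset frequency and preset number given in this example; both are parameters so that other presets can be tried.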
Example 3
On the basis of Example 1, in the text three-branch clustering method based on the improved game theory rough set, step 3, as shown in FIG. 2, includes:
Step 31, respectively acquiring a first set of features corresponding to each text data set;
Step 32, dividing the text data set into a complete text data set and a defect text data set according to the first set characteristics;
Step 33, converting the complete text data set and the defect text data set into a complete rough set and a defect rough set;
Step 34, collectively recording the complete rough set and the defect rough set as rough sets.
In this example, the first set of features represents the features of a text data set and is used to distinguish between different text data sets; "first" only distinguishes it from the set features of the rough sets introduced later, and does not imply any comparison or ordering;
In this example, the complete text data set indicates that all data bits in the text data set hold text data and no data is missing;
In this example, the defect text data set indicates that the text data set contains blank data bits, that is, data bits on which data is missing;
In this example, the complete rough set represents a rough set obtained by classifying all complete text data sets into one type;
In this example, the defect rough set represents a rough set obtained by classifying all defect text data sets into one type.
The advantages of this technical scheme are as follows: the set features of the text data sets are acquired and used to classify the text data sets, and the classified data sets are converted into rough sets. Establishing the rough sets gives a coarse classification of the text data, which facilitates the subsequent cluster analysis.
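The split in steps 31 to 34 can be illustrated with a short Python sketch. It assumes each text data set is a list of rows and that a blank data bit is represented by `None` or an empty string. The text does not detail the rough-set conversion itself, so the sketch only groups the data sets into the two categories that steps 33 and 34 then treat as the complete and defect rough sets; the function names are hypothetical.

```python
def is_complete(dataset):
    """First set feature: True when no data bit in the data set is blank."""
    return all(value not in (None, "") for row in dataset for value in row)


def split_datasets(datasets):
    """Step 32: divide data sets into complete and defect groups.

    Assumption: grouping stands in for the rough-set conversion of
    steps 33 and 34, which the text does not specify in detail.
    """
    complete = [d for d in datasets if is_complete(d)]
    defect = [d for d in datasets if not is_complete(d)]
    return {"complete": complete, "defect": defect}
```

Keeping the two groups in one dictionary reflects step 34, where both are handled together under the single name "rough set".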
Example 4
On the basis of Example 1, in the text three-branch clustering method based on the improved game theory rough set, step 4 includes:
Step 41, obtaining a second set of features corresponding to each rough set;
Step 42, marking each rough set whose second set features are defect features as a defect rough set, and marking each rough set whose second set features are complete features as a complete rough set;
Step 43, obtaining the defect rough set and the complete rough set, and establishing clustering information;
Step 44, acquiring a corresponding clustering algorithm according to the clustering information, and carrying out text clustering according to the clustering algorithm to obtain a plurality of data clusters.
In this example, the second set of features represents the features of a rough set; "second" only distinguishes it from the set features of the text data sets mentioned earlier, and does not imply any ordering or comparison.
The working principle and beneficial effects of this technical scheme are as follows: the rough sets are classified by their set features and clustering information is established; the clustering information is then used to select a matching clustering algorithm, and that algorithm is used to cluster the text data, yielding a plurality of data clusters and completing the clustering work.
Example 5
On the basis of Example 4, in the text three-branch clustering method based on the improved game theory rough set, step 44 includes:
Step 441, performing dimension reduction on each defect rough set by using Uniform Manifold Approximation and Projection (UMAP) to obtain a corresponding compensation rough set, and generating a compensation text data set according to the compensation rough set;
Step 442, extracting a target clustering algorithm from a preset clustering algorithm library according to the clustering information;
Step 443, clustering the compensation text data set and the complete text data set by using the target clustering algorithm to obtain a plurality of data clusters.
The working principle and beneficial effects of this technical scheme are as follows: the defect rough set is reduced in dimension using UMAP, and a corresponding compensation rough set is generated from the dimension reduction result, from which a compensation text data set is established. Clustering is then performed with the target clustering algorithm extracted from the preset clustering algorithm library to obtain the data clusters. Because the defect rough set is reduced in dimension and compensated before clustering, missing values in the data set are handled and the computation time of text three-branch clustering is significantly reduced.
Example 6
On the basis of Example 5, the text three-branch clustering method based on the improved game theory rough set further comprises:
Obtaining a defect rough set;
Traversing the defect rough set by using a preset sample data segment to obtain a defect data segment on the defect rough set;
Recording a data segment except a defect data segment in the defect rough set as a non-defect data segment;
Acquiring the data proportion between the defect data segment and the non-defect data segment, and establishing an initial threshold group according to the data proportion;
And correcting the target clustering algorithm by using the initial threshold group, and clustering the compensated text data set and the complete text data set by using the corrected target clustering algorithm to obtain a plurality of data clusters.
The preset sample data segment may be, for example, a blank data segment of length 1.
The working principle and beneficial effects of this technical scheme are as follows: to improve clustering accuracy when the target clustering algorithm is used, the defect rough set is traversed with the preset sample data segment to separate defect data segments from non-defect data segments. The numbers of defect and non-defect data segments are then compared to establish an initial threshold group, the target clustering algorithm is corrected using the initial threshold group, and cluster analysis is finally performed. This improves the accuracy of the cluster analysis and broadens its applicability, so that different types of text data can be clustered.
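The traversal described in this example can be sketched as follows. The sketch assumes the defect rough set is a list of rows, that the preset sample data segment is a blank segment of length 1 (so every cell is one data segment), and that a blank cell is `None`. The text does not state how the data proportion is mapped to the initial threshold group, so the sketch stops at the proportion; the function names are hypothetical.

```python
def count_segments(defect_set, is_blank=lambda cell: cell is None):
    """Slide a length-1 blank sample segment over every cell.

    Returns (number of defect segments, number of non-defect segments).
    """
    defect = sum(1 for row in defect_set for cell in row if is_blank(cell))
    total = sum(len(row) for row in defect_set)
    return defect, total - defect


def data_proportion(defect_set):
    """Proportion between defect and non-defect data segments."""
    defect, non_defect = count_segments(defect_set)
    if non_defect == 0:
        raise ValueError("no non-defect data segments to compare against")
    return defect / non_defect
```

Passing `is_blank` as a parameter keeps the definition of a blank data segment replaceable, for example to treat empty strings as blank as well.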
Example 7
On the basis of Example 6, the text three-branch clustering method based on the improved game theory rough set further comprises:
The initial threshold value group comprises a first initial threshold value and a second initial threshold value, and the value range of the first initial threshold value and the second initial threshold value is (0, 1);
Combining the initial threshold group with a target clustering algorithm, adjusting a first initial threshold and a second initial threshold in the range of (0, 1), and establishing a clustering result trend graph in the adjustment process;
Determining, in the clustering result trend graph, the target trend point corresponding to the target clustering result;
And acquiring a first value and a second value corresponding to the target trend point, and correcting the target clustering algorithm by using the first value and the second value.
In this example, the initial thresholds are set to (α, β) = (1, 0). Different thresholds affect both accuracy and generality; within a limited range, the larger the values of α and β, the better the three-branch clustering effect. There are three ways to adjust the thresholds: ① decreasing the threshold α, denoted α↓; ② increasing the threshold β, denoted β↑; ③ decreasing α and increasing β simultaneously, denoted α↓β↑;
The accuracy of clustering the objects with missing values is defined as accuracy, and the formula is as follows:
$$\mathrm{Accuracy}(\alpha,\beta)=\frac{\text{Correctly clustered objects}}{\text{Total clustered objects}}$$
where Accuracy(α, β) represents the accuracy, "Correctly clustered objects" is the number of correctly clustered objects, and "Total clustered objects" is the total number of clustered objects;
The proportion of objects that are actually clustered is defined as generality (commonality), and the formula is:
$$\mathrm{Generality}(\alpha,\beta)=\frac{\text{Total clustered objects}}{\text{Total objects in } U}$$
where Generality(α, β) represents the generality, "Total clustered objects" is the total number of clustered objects, and "Total objects in U" is the total number of objects in U;
Taking the median values of α and β in (1, 0) and (0, 1) respectively, the values of Accuracy(α, β) and Generality(α, β) are calculated so that both are simultaneously as large as possible. The new values of α and β are denoted a and b respectively; in the next round, the median values of α and β are taken in (1, a) and (0, b) respectively. These steps are iterated until the set number of iterations Q is reached, or until Accuracy(α, β) ≤ Generality(α, β);
Suitable thresholds (α, β) are calculated and determined according to the procedure described above. These thresholds divide the objects into three parts, namely Inside(C_k), Partial(C_k) and Outside(C_k):
$$\mathrm{Inside}(C_k)=\{o_i\in U \mid e(c_k,o_i)\ge\alpha\},$$
$$\mathrm{Partial}(C_k)=\{o_i\in U \mid \beta<e(c_k,o_i)<\alpha\},$$
$$\mathrm{Outside}(C_k)=\{o_i\in U \mid e(c_k,o_i)\le\beta\}.$$
A distance formula (in which i denotes the i-th object, a denotes an attribute of the object, and the formula operates on the value of attribute a of the i-th object) is used to determine the distance from all objects in the data set to each object with missing values, from which the nearest neighbors of object o_i can be calculated. An evaluation function is then applied:
$$e(c_k,o_i)=\frac{\text{Number of } o_i \text{ neighbors belonging to } c_k}{\text{Total neighbors of } o_i}$$
where "Number of o_i neighbors belonging to c_k" is the number of neighbors of object o_i that belong to cluster c_k, and "Total neighbors of o_i" is the total number of neighbors of o_i;
The evaluation function quantifies the relationship between object o_i and cluster c_k;
Object o_i is then classified into the corresponding region using the following formulas:
$$\mathrm{Inside}(C_k)=\{o_i\in U \mid e(c_k,o_i)\ge\alpha\},$$
$$\mathrm{Partial}(C_k)=\{o_i\in U \mid \beta<e(c_k,o_i)<\alpha\},$$
$$\mathrm{Outside}(C_k)=\{o_i\in U \mid e(c_k,o_i)\le\beta\}.$$
where Inside represents the core domain, Partial represents the boundary domain, and Outside represents the outer domain.
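The evaluation function and the three-region assignment can be sketched in Python as below. The distance formula is not reproduced in the text, so the sketch assumes a Euclidean distance computed only over attributes present in both objects, with `None` marking a missing value; the neighborhood size `k` and all function names are likewise hypothetical.

```python
import math


def distance(x, y):
    """Assumed distance: Euclidean over attributes present in both objects."""
    shared = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    return math.sqrt(sum((a - b) ** 2 for a, b in shared))


def evaluate(target, labeled, cluster, k=3):
    """e(c_k, o_i): share of the k nearest neighbors of target in cluster.

    labeled is a non-empty list of (attribute_vector, cluster_label) pairs.
    """
    ranked = sorted(labeled, key=lambda item: distance(target, item[0]))
    neighbors = ranked[:k]
    in_cluster = sum(1 for _, label in neighbors if label == cluster)
    return in_cluster / len(neighbors)


def region(score, alpha, beta):
    """Map an evaluation score to the core, boundary or outer domain."""
    if score >= alpha:
        return "inside"   # core domain
    if score <= beta:
        return "outside"  # outer domain
    return "partial"      # boundary domain: decision is postponed
```

An object with a missing value is scored against each cluster in turn; objects that land in the boundary domain are the ones whose assignment is deferred.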
The working principle and beneficial effects of this technical scheme are as follows: to improve the clustering effect, the two initial thresholds are adjusted within their value range, and a trend graph of the clustering results obtained during the adjustment is established. Specific values of the first initial threshold and the second initial threshold are selected according to the trend graph, and clustering is then performed with those values, which improves the effectiveness of the clustering algorithm.
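The trade-off between accuracy and generality during threshold adjustment can be sketched as follows. This is not the median-based iteration described above: it substitutes a plain grid scan over (α, β) to produce the trend points, and it assumes the target trend point is the one maximizing the sum of accuracy and generality. It also assumes a decision counts as "clustered" when a score reaches α (inside) or falls to β or below (outside). Each input pair holds an evaluation score and a ground-truth membership flag; all names are hypothetical.

```python
from itertools import product


def measures(pairs, alpha, beta):
    """Return (accuracy, generality) for one threshold pair.

    pairs: list of (score, belongs) where score is e(c_k, o_i) and
    belongs is True when the object really is in the cluster.
    """
    decided = [(s, b) for s, b in pairs if s >= alpha or s <= beta]
    # An inside decision is correct when belongs is True; an outside
    # decision is correct when belongs is False.
    correct = sum(1 for s, b in decided if (s >= alpha) == b)
    accuracy = correct / len(decided) if decided else 0.0
    generality = len(decided) / len(pairs) if pairs else 0.0
    return accuracy, generality


def threshold_trend(pairs, steps=10):
    """Scan thresholds in (0, 1) with beta < alpha and record the trend."""
    grid = [i / steps for i in range(1, steps)]
    trend = []
    for alpha, beta in product(grid, grid):
        if beta < alpha:
            accuracy, generality = measures(pairs, alpha, beta)
            trend.append((alpha, beta, accuracy, generality))
    return trend


def pick_thresholds(trend):
    """Assumed selection rule: maximize accuracy + generality."""
    alpha, beta, _, _ = max(trend, key=lambda point: point[2] + point[3])
    return alpha, beta
```

The list returned by `threshold_trend` plays the role of the clustering result trend graph: plotting accuracy and generality against (α, β) shows how tightening the thresholds trades coverage for correctness.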
Example 8
On the basis of Example 1, the text three-branch clustering method based on the improved game theory rough set further comprises:
Naming each data cluster and transmitting the data clusters to a designated terminal for display.
The advantage of this technical scheme is that, to distinguish different data clusters, each data cluster is named and then transmitted to the designated terminal for display, so that the user can conveniently inspect the data clusters.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (5)

1.一种基于改进的博弈论粗糙集的文本三支聚类方法,其特征在于,包括:1. A text three-branch clustering method based on improved game theory rough set, characterized by comprising: 步骤1:采集文本数据;Step 1: Collect text data; 步骤2:对文本数据进行预处理,建立文本数据集;Step 2: Preprocess the text data and establish a text dataset; 步骤3:根据文本数据集建立对应的粗糙集;Step 3: Establish the corresponding rough set based on the text data set; 步骤4:对粗糙集进行聚类得到聚类信息,根据聚类信息对文本数据集进行文本聚类,得到若干个数据簇;Step 4: Cluster the rough set to obtain clustering information, and perform text clustering on the text data set based on the clustering information to obtain several data clusters; 所述步骤4,包括:The step 4 comprises: 步骤41:获取每一粗糙集对应的第二集特征;Step 41: Obtain the second set features corresponding to each rough set; 步骤42:将第二集特征为缺陷特征的粗糙集记作缺陷粗糙集,将第二集特征为完整特征的粗糙集记作完整粗糙集;Step 42: record the rough set whose second set of features is defective features as defective rough set, and record the rough set whose second set of features is complete features as complete rough set; 步骤43:获取缺陷粗糙集和完整粗糙集,建立聚类信息;Step 43: Obtain defect rough sets and complete rough sets, and establish clustering information; 步骤44:根据聚类信息获取对应的聚类算法,根据聚类算法进行文本聚类得到若干个数据簇;Step 44: Obtain a corresponding clustering algorithm according to the clustering information, and perform text clustering according to the clustering algorithm to obtain a number of data clusters; 所述步骤44,包括:The step 44 comprises: 步骤441:利用一致流形逼近与投影分别对每一缺陷粗糙集进行降维,得到对应的补偿粗糙集,根据补偿粗糙集生成补偿文本数据集;Step 441: using consistent manifold approximation and projection to reduce the dimension of each defect rough set, obtain a corresponding compensation rough set, and generate a compensation text data set according to the compensation rough set; 步骤442:根据聚类信息在预设聚类算法库中提取目标聚类算法;Step 442: extracting a target clustering algorithm from a preset clustering algorithm library according to the clustering information; 步骤443:利用目标聚类算法对补偿文本数据集和完整文本数据集进行聚类,得到若干个数据簇;Step 443: clustering the compensated text data set and the complete text data set using a target clustering algorithm 
to obtain a number of data clusters;
further comprising:
obtaining the defect rough set;
traversing the defect rough set with preset sample data segments to obtain the defective data segments on the defect rough set;
recording the data segments in the defect rough set other than the defective data segments as non-defective data segments;
obtaining the data ratio between the defective data segments and the non-defective data segments, and establishing an initial threshold group according to the data ratio;
modifying the target clustering algorithm using the initial threshold group, and clustering the compensated text data set and the complete text data set using the modified target clustering algorithm to obtain several data clusters;
wherein the initial threshold group comprises a first initial threshold and a second initial threshold, each with a value range of (0, 1);
combining the initial threshold group with the target clustering algorithm, and adjusting the first initial threshold and the second initial threshold within the range (0, 1);
determining an appropriate threshold, the threshold dividing the objects into three parts: a core domain, a boundary domain and an external domain.

2. The three-branch text clustering method based on an improved game theory rough set as claimed in claim 1, characterized in that step 2 comprises:
Step 21: eliminating first non-important words whose occurrence frequency in the text data is less than a preset frequency, to obtain target text data;
Step 22: recording a preset quantity of target text data as one category, and establishing a standard data set;
Step 23: eliminating second non-important words belonging to a preset part of speech from the standard data set, to establish a text data set.

3. The three-branch text clustering method based on an improved game theory rough set as claimed in claim 1, characterized in that step 3 comprises:
Step 31: obtaining the first set of features corresponding to each text data set;
Step 32: dividing the text data sets into complete text data sets and defective text data sets according to the first set of features;
Step 33: converting the complete text data sets and the defective text data sets into complete rough sets and defect rough sets;
Step 34: recording the complete rough sets and the defect rough sets collectively as rough sets.

4. The three-branch text clustering method based on an improved game theory rough set as claimed in claim 1, characterized in that it further comprises:
the initial threshold group comprises a first initial threshold and a second initial threshold, each with a value range of (0, 1);
combining the initial threshold group with the target clustering algorithm, adjusting the first initial threshold and the second initial threshold within the range (0, 1), and establishing a clustering result trend graph during the adjustment;
identifying, in the clustering result trend graph, the target trend point corresponding to the target clustering result;
obtaining the first value and the second value corresponding to the target trend point, and correcting the target clustering algorithm using the first value and the second value.

5. The three-branch text clustering method based on an improved game theory rough set as claimed in claim 1, characterized in that it further comprises:
naming each data cluster separately, and transmitting the named data clusters to a designated terminal for display.
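The threshold-based division into core, boundary and external domains recited in the claims can be illustrated with a minimal Python sketch. The claims do not state how an object is scored against a cluster or how the two thresholds are compared with that score, so everything below is an assumption made for illustration: the names `three_way_partition`, `sweep_thresholds`, `alpha` and `beta` are hypothetical, each object is assumed to carry a membership degree in [0, 1], and the usual three-way decision convention (core if the degree is at least `alpha`, external if at most `beta`, boundary otherwise, with `beta < alpha`) is swapped in for the unspecified rule.

```python
from dataclasses import dataclass, field


@dataclass
class Regions:
    """Indices of objects falling in each of the three regions of one cluster."""
    core: list = field(default_factory=list)
    boundary: list = field(default_factory=list)
    external: list = field(default_factory=list)


def three_way_partition(memberships, alpha, beta):
    """Split objects into core / boundary / external regions.

    memberships: hypothetical membership degrees in [0, 1], one per object.
    alpha, beta: the two thresholds, both in (0, 1); beta < alpha is an
    assumption of this sketch, not something the claims state.
    """
    if not (0.0 < beta < alpha < 1.0):
        raise ValueError("expected 0 < beta < alpha < 1")
    regions = Regions()
    for idx, degree in enumerate(memberships):
        if degree >= alpha:
            regions.core.append(idx)       # definitely belongs to the cluster
        elif degree <= beta:
            regions.external.append(idx)   # definitely does not belong
        else:
            regions.boundary.append(idx)   # decision deferred
    return regions


def sweep_thresholds(memberships, step=0.1):
    """Yield (alpha, beta, regions) for every grid pair with beta < alpha.

    This mimics adjusting both thresholds inside (0, 1); how the patent
    scores each candidate pair is not specified, so no scoring is done here.
    """
    n = round(1 / step)
    grid = [round(i * step, 10) for i in range(1, n)]
    for alpha in grid:
        for beta in grid:
            if beta < alpha:
                yield alpha, beta, three_way_partition(memberships, alpha, beta)


if __name__ == "__main__":
    degrees = [0.9, 0.5, 0.1, 0.7, 0.3]
    print(three_way_partition(degrees, alpha=0.7, beta=0.3))
```

Recording the region sizes produced by `sweep_thresholds` for each threshold pair would give the raw data for a trend plot of clustering results against the thresholds. The quality measure used to pick the final pair is left open here because the claims do not define it.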
CN202310582217.7A 2023-05-23 2023-05-23 A three-branch text clustering method based on improved game theory rough set Active CN116701622B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310582217.7A CN116701622B (en) 2023-05-23 2023-05-23 A three-branch text clustering method based on improved game theory rough set

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310582217.7A CN116701622B (en) 2023-05-23 2023-05-23 A three-branch text clustering method based on improved game theory rough set

Publications (2)

Publication Number Publication Date
CN116701622A CN116701622A (en) 2023-09-05
CN116701622B CN116701622B (en) 2025-01-10

Family

ID=87844307

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310582217.7A Active CN116701622B (en) 2023-05-23 2023-05-23 A three-branch text clustering method based on improved game theory rough set

Country Status (1)

Country Link
CN (1) CN116701622B (en)

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8099453B2 (en) * 2009-01-22 2012-01-17 Hewlett-Packard Development Company, L.P. System and method for data clustering
US8346689B2 (en) * 2010-01-21 2013-01-01 National Cheng Kung University Recommendation system using rough-set and multiple features mining integrally and method thereof
US8995771B2 (en) * 2012-04-30 2015-03-31 Microsoft Technology Licensing, Llc Identification of duplicates within an image space
KR101866522B1 (en) * 2016-12-16 2018-06-12 인천대학교 산학협력단 Object clustering method for image segmentation
CN109635849A (en) * 2018-11-22 2019-04-16 华中师范大学 A kind of target clustering method and system based on three c-means decisions
CN114328920A (en) * 2021-12-27 2022-04-12 盐城工学院 Text clustering method and system based on consistent manifold approximation and projection
CN115358885A (en) * 2022-07-31 2022-11-18 云南电网有限责任公司 Platform area load identification and load response evaluation method based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A Three-way Clustering Approach for Handling Missing Data using GTRS; Mohammad Khan Afridi et al.; International Journal of Approximate Reasoning; 2018-04-04; main text *

Also Published As

Publication number Publication date
CN116701622A (en) 2023-09-05

Similar Documents

Publication Publication Date Title
WO2023201772A1 (en) Cross-domain remote sensing image semantic segmentation method based on adaptation and self-training in iteration domain
CN112819299A (en) Differential K-means load clustering method based on center optimization
CN109858544B (en) Steel quality detection method based on interval shadow set and density peak clustering
CN111338950A (en) Software defect feature selection method based on spectral clustering
CN112101765A (en) Abnormal data processing method and system for operation index data of power distribution network
CN115169617B (en) Mold maintenance prediction model training method, mold maintenance prediction method and system
CN114168782B (en) Deep hash image retrieval method based on triplet network
CN116523320A (en) Intellectual property risk intelligent analysis method based on Internet big data
CN112884570A (en) Method, device and equipment for determining model security
CN117131449A (en) Data management-oriented anomaly identification method and system with propagation learning capability
CN114093445B (en) Patient screening marking method based on partial multi-marking learning
CN116821832A (en) Abnormal data identification and correction method for high-voltage industrial and commercial user power load
CN117392450A (en) A steel material quality analysis method based on evolutionary multi-scale feature learning
CN116340776A (en) Power use behavior pattern recognition method and device based on noisy learning
CN117541562A (en) Semi-supervised non-reference image quality evaluation method based on uncertainty estimation
CN114841262B (en) A rolling bearing fault diagnosis method based on DS evidence theory
CN114677550B (en) Rapid image pixel screening method based on sparse discrimination K-means
CN116701622B (en) A three-branch text clustering method based on improved game theory rough set
CN118657786A (en) A device defect detection method under few sample conditions
CN110348005B (en) Distribution network equipment state data processing method and device, computer equipment and medium
CN116152194A (en) Object defect detection method, system, equipment and medium
CN111459838B (en) Software defect prediction method and system based on manifold alignment
CN115687899A (en) Hybrid feature selection method based on high-dimensional spinning data
CN109800384B (en) A Basic Probability Assignment Calculation Method Based on Rough Set Information Decision Table
CN109036390B (en) Broadcast keyword identification method based on integrated gradient elevator

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant