CN116701622B - A three-branch text clustering method based on improved game theory rough set

Info

Publication number: CN116701622B
Application number: CN202310582217.7A
Authority: CN
Inventors: 徐森; 陆湘文; 徐秀芳; 花小朋; 朱锦新; 许贺洋; 郭乃瑄; 嵇宏伟; 姜陈雨; 陈思博; 蔡娜
Original assignee: Yancheng Institute of Technology
Current assignee: Yancheng Institute of Technology
Priority date: 2023-05-23
Filing date: 2023-05-23
Publication date: 2025-01-10
Anticipated expiration: 2043-05-23
Also published as: CN116701622A

Abstract

The invention provides a text three-branch clustering method based on an improved game theory rough set. The method comprises: collecting text data; preprocessing the text data to establish a text data set; establishing a corresponding rough set according to the text data set; clustering the rough set to obtain clustering information; and performing text clustering on the text data set according to the clustering information to obtain a plurality of data clusters. By clustering the text data set in this way, the method handles missing values in the data set, significantly reduces the computation time of text three-branch clustering, improves the generality of the model, and improves the accuracy of text three-branch clustering.

Description

Text three-branch clustering method based on improved game theory rough set

Technical Field

The invention relates to the technical field of data mining, in particular to a text three-branch clustering method based on an improved game theory rough set.

Background

Clustering techniques are widely used in machine learning, data analysis, artificial intelligence, and related fields. They can be used to solve a number of practical problems, such as market segmentation, recommendation systems, image analysis, and text classification.

The core idea of clustering techniques is to find natural groups in a dataset, which groups consist of data points with similar properties. Clustering algorithms typically represent data points as n-dimensional vectors and use distance metrics to calculate similarity between the data points. Once the similarity is measured, the algorithm may divide the data points into different clusters.

In modern data science, clustering algorithms have become a very important technique because they can process large-scale data sets and find structure in them. Many clustering algorithms are now available, including hierarchical clustering, k-means clustering, and DBSCAN clustering. Each of these algorithms has its own advantages and disadvantages, and one can be selected according to the requirements of the problem at hand.

When data are incomplete, missing, or damaged, conventional clustering performs poorly. Three-branch clustering is a good way to handle the uncertainty that missing values introduce into clustering. Its core idea is that, when it cannot be determined whether an object belongs to a cluster, the object is placed in an undetermined class and the decision on it is postponed. A key problem with this approach, however, is that the thresholds used for three-branch clustering are generally fixed values.

Therefore, the invention provides a text three-branch clustering method based on an improved game theory rough set.

Disclosure of Invention

The text three-branch clustering method based on an improved game theory rough set provided by the invention clusters the text data set, handles missing values in the data set, significantly reduces the computation time of text three-branch clustering, improves the generality of the model, and improves the accuracy of text three-branch clustering.

The invention provides a text three-branch clustering method based on an improved game theory rough set, which comprises the following steps:

Step 1, collecting text data;

Step 2, preprocessing the text data to establish a text data set;

Step 3, establishing a corresponding rough set according to the text data set;

Step 4, clustering the rough set to obtain clustering information, and performing text clustering on the text data set according to the clustering information to obtain a plurality of data clusters.

In one possible implementation, step 2 includes:

Step 21, eliminating first non-important words with occurrence frequency smaller than a preset frequency in the text data to obtain target text data;

Step 22, marking the target text data with preset quantity as one type, and establishing a standard data set;

Step 23, eliminating second non-important words belonging to the preset part of speech in the standard data set, and establishing a text data set.
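Steps 21 and 23 can be illustrated with a minimal sketch. It assumes that each text is already tokenized into a list of words, that the occurrence frequency is counted over the whole corpus, and that parts of speech come from a hypothetical lookup table `pos_of` standing in for a real tagger; none of these details are specified by the method itself. The names `remove_rare_words`, `remove_words_by_pos`, and the tag `conj_adv` are illustrative only, and step 22 (grouping a preset quantity of texts as one class) is left out.

```python
from collections import Counter


def remove_rare_words(docs, min_count=3):
    """Step 21 sketch: drop words that occur fewer than min_count times.

    Assumption: frequency is counted across all documents.
    """
    counts = Counter(word for doc in docs for word in doc)
    return [[w for w in doc if counts[w] >= min_count] for doc in docs]


def remove_words_by_pos(docs, pos_of, banned_pos):
    """Step 23 sketch: drop words whose part of speech is in banned_pos.

    pos_of is a hypothetical word -> tag mapping, not a real tagger.
    """
    return [[w for w in doc if pos_of.get(w) not in banned_pos] for doc in docs]


docs = [
    ["however", "clustering", "groups", "text"],
    ["clustering", "text", "however", "typo"],
    ["text", "clustering", "however"],
]
pos_of = {"however": "conj_adv", "clustering": "noun", "text": "noun"}

# "groups" and "typo" occur once each, so step 21 removes them.
target = remove_rare_words(docs, min_count=3)
# "however" carries the banned tag, so step 23 removes it.
dataset = remove_words_by_pos(target, pos_of, banned_pos={"conj_adv"})
```

The two filters are kept separate so that the frequency threshold and the banned parts of speech can be tuned independently.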

In one possible implementation, step 3 includes:

Step 31, respectively acquiring a first set of features corresponding to each text data set;

Step 32, dividing the text data set into a complete text data set and a defect text data set according to the first set features;

Step 33, converting the complete text data set and the defect text data set into a complete rough set and a defect rough set;

Step 34, collectively recording the complete rough set and the defect rough set as the rough set.
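The split in step 32 can be sketched as follows. The sketch assumes that each text is represented as a tuple of attribute values and that a missing value is encoded as `None`; both the representation and the names `is_complete` and `split_by_completeness` are assumptions made for illustration, not details given by the method.

```python
def is_complete(record):
    """A record is complete when none of its attribute values is missing."""
    return all(value is not None for value in record)


def split_by_completeness(records):
    """Partition records into a complete part and a defect part."""
    complete = [r for r in records if is_complete(r)]
    defect = [r for r in records if not is_complete(r)]
    return complete, defect


records = [
    (0.2, 0.7, 0.1),
    (0.5, None, 0.3),   # one missing attribute value
    (0.9, 0.4, 0.8),
    (None, None, 0.6),  # two missing attribute values
]
complete_set, defect_set = split_by_completeness(records)
```

Keeping the two parts separate lets later steps treat the defect part differently from the complete part.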

In one possible implementation, step 4 includes:

Step 41, obtaining the second set features corresponding to each rough set;

Step 42, marking each rough set whose second set features are defect features as a defect rough set, and marking each rough set whose second set features are complete features as a complete rough set;

Step 43, obtaining a defect rough set and a complete rough set, and establishing clustering information;

Step 44, acquiring a corresponding clustering algorithm according to the clustering information, and carrying out text clustering according to the clustering algorithm to obtain a plurality of data clusters.

In one possible implementation, step 44 includes:

Step 441, performing dimension reduction on each defect rough set by using uniform manifold approximation and projection (UMAP) to obtain a corresponding compensation rough set, and generating a compensation text data set according to the compensation rough set;

Step 442, extracting a target clustering algorithm from a preset clustering algorithm library according to the clustering information;

Step 443, clustering the compensation text data set and the complete text data set by using the target clustering algorithm to obtain a plurality of data clusters.

In one possible implementation, the method further comprises:

Obtaining a defect rough set;

Traversing the defect rough set by using a preset sample data segment to obtain the defect data segments in the defect rough set;

Recording the data segments in the defect rough set other than the defect data segments as non-defect data segments;

Acquiring the data proportion between the defect data segment and the non-defect data segment, and establishing an initial threshold group according to the data proportion;

And correcting the target clustering algorithm by using the initial threshold group, and clustering the compensated text data set and the complete text data set by using the corrected target clustering algorithm to obtain a plurality of data clusters.

In one possible implementation, the method further comprises:

The initial threshold value group comprises a first initial threshold value and a second initial threshold value, and the value range of the first initial threshold value and the second initial threshold value is (0, 1);

Combining the initial threshold group with a target clustering algorithm, adjusting a first initial threshold and a second initial threshold in the range of (0, 1), and establishing a clustering result trend graph in the adjustment process;

Determining the target trend point corresponding to the target clustering result in the clustering result trend graph;

And acquiring a first value and a second value corresponding to the target trend point, and correcting the target clustering algorithm by using the first value and the second value.

In one possible implementation, the method further comprises:

Naming each data cluster respectively, and transmitting the data clusters to a designated terminal for display.

The beneficial effects of the invention are as follows: text data is collected and preprocessed to establish a text data set, a corresponding rough set is then established, and the text data is clustered by clustering the rough set to obtain a plurality of data clusters. In this way, missing values in the data set can be handled before clustering, the computation time of text three-branch clustering is significantly reduced, the generality of the model is improved, and the accuracy of text three-branch clustering is improved.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.

The technical scheme of the invention is further described in detail through the drawings and the embodiments.

Drawings

The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:

FIG. 1 is a schematic workflow diagram of a text three-branch clustering method based on an improved game theory rough set in an embodiment of the invention;

FIG. 2 is a schematic workflow diagram of step 3 of a text three-branch clustering method based on an improved game theory rough set in an embodiment of the invention.

Detailed Description

The preferred embodiments of the present invention will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present invention only, and are not intended to limit the present invention.

Example 1

This embodiment provides a text three-branch clustering method based on an improved game theory rough set which, as shown in FIG. 1, comprises the following steps:

Step 1, collecting text data;

Step 2, preprocessing the text data to establish a text data set;

Step 3, establishing a corresponding rough set according to the text data set;

Step 4, clustering the rough set to obtain clustering information, and performing text clustering on the text data set according to the clustering information to obtain a plurality of data clusters.

The working principle and beneficial effects of this technical scheme are as follows: text data is collected and preprocessed to establish a text data set, a corresponding rough set is then established, and the text data is clustered by clustering the rough set to obtain a plurality of data clusters. In this way, missing values in the data set can be handled before clustering, the computation time of text three-branch clustering is significantly shortened, the generality of the model is improved, and the accuracy of text three-branch clustering is improved.

Example 2

On the basis of Example 1, step 2 of the text three-branch clustering method based on an improved game theory rough set includes steps 21 to 23 described above, where:

In this example, the preset frequency may be 3 occurrences;

In this example, a first non-important word is a word whose number of occurrences is less than the preset frequency;

In this example, the preset part of speech may be conjunctive adverbs;

In this example, a second non-important word is a conjunctive adverb;

In this example, the preset quantity may be 50.

The working principle and beneficial effects of this technical scheme are as follows: to speed up clustering and to keep unnecessary factors from interfering with the clustering result, non-important words are removed from the text data before cluster analysis and a standard data set is established, which reduces the clustering workload and improves clustering efficiency.

Example 3

On the basis of Example 1, step 3 of the text three-branch clustering method based on an improved game theory rough set, as shown in FIG. 2, includes steps 31 to 34 described above, where:

In this example, the first set features are the features of a text data set and are used to distinguish different text data sets; "first" only distinguishes them from the set features of the rough sets introduced later and implies no comparison or ordering;

In this example, a complete text data set is a text data set in which every data position holds text data and no data is missing;

In this example, a defect text data set is a text data set that contains blank data positions, that is, positions at which data is missing;

In this example, the complete rough set is the rough set obtained by grouping all complete text data sets into one class;

In this example, the defect rough set is the rough set obtained by grouping all defect text data sets into one class.

The beneficial effects of this technical scheme are as follows: the set features of the text data sets are acquired and used to classify the text data sets, and the classified data sets are converted into rough sets. Establishing the rough sets gives a coarse classification of the text data, which facilitates the subsequent cluster analysis.

Example 4

On the basis of Example 1, step 4 of the text three-branch clustering method based on an improved game theory rough set includes steps 41 to 44 described above, beginning with:

Step 41, obtaining the second set features corresponding to each rough set;

In this example, the second set features are the features of a rough set; "second" only distinguishes them from the features of the text data sets mentioned earlier and implies no ordering or comparison.

The working principle and beneficial effects of this technical scheme are as follows: the rough sets are classified by their set features and clustering information is established; a clustering algorithm is then matched using the clustering information, and the text data is clustered with that algorithm to obtain a plurality of data clusters, completing the clustering work.

Example 5

On the basis of Example 4, step 44 of the text three-branch clustering method based on an improved game theory rough set includes steps 441 to 443 described above.

The working principle and beneficial effects of this technical scheme are as follows: the defect rough set is reduced in dimension by using uniform manifold approximation and projection (UMAP), and a corresponding compensation rough set is generated from the dimension reduction result, so that a compensation text data set can be established. Clustering is then performed with the target clustering algorithm extracted from the preset clustering algorithm library to obtain the data clusters. Because the defect rough set is reduced in dimension and compensated before clustering, missing values in the data set are handled and the computation time of text three-branch clustering is significantly reduced.

Example 6

On the basis of Example 5, the text three-branch clustering method based on an improved game theory rough set further comprises the steps described above, beginning with:

Obtaining a defect rough set;

In this example, the preset sample data segment may be, for example, a blank data segment of length 1.

The working principle and beneficial effects of this technical scheme are as follows: to improve clustering accuracy when the target clustering algorithm is used, the defect rough set is divided into a plurality of data segments by using the preset sample data segment, the number of defect data segments is compared with the number of non-defect data segments, and an initial threshold group is established. The target clustering algorithm is corrected by using the initial threshold group before the cluster analysis is carried out, which improves the accuracy of the cluster analysis and widens its applicability, so that different types of text data can be clustered.
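The segment counting described above can be sketched as follows. The sketch assumes that the defect rough set is held as tuples of attribute values with `None` marking a blank position, and that the preset sample data segment has length 1, so each cell is one segment. How the resulting proportion is mapped to the initial threshold group is not specified in the text, so the sketch stops at the proportion; the name `count_segments` is hypothetical.

```python
def count_segments(records):
    """Count defect (missing) and non-defect cells.

    Assumption: the preset sample data segment has length 1, so every
    cell of every record is treated as one data segment.
    """
    cells = [value for record in records for value in record]
    defect = sum(1 for value in cells if value is None)
    non_defect = len(cells) - defect
    return defect, non_defect


defect_rough_set = [(0.5, None, 0.3), (None, None, 0.6)]
n_defect, n_non_defect = count_segments(defect_rough_set)

# Data proportion between defect and non-defect segments; guard against
# a rough set that contains no non-defect segments at all.
ratio = n_defect / n_non_defect if n_non_defect else float("inf")
```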

Example 7

On the basis of Example 6, the text three-branch clustering method based on an improved game theory rough set further comprises:

In this example, the initial thresholds are set to (α, β) = (1, 0). Different thresholds affect accuracy and generality; within a limited range, the larger the values of α and β, the better the three-branch clustering effect. There are three threshold adjustment strategies: ① decreasing the threshold α, denoted α↓; ② increasing the threshold β, denoted β↑; ③ decreasing α and increasing β simultaneously, denoted α↓β↑;

The accuracy of clustering the objects with missing values is defined as accuracy, given by:

$$\mathrm{Accuracy}(\alpha, \beta) = \frac{\text{Correctly clustered objects}}{\text{Total clustered objects}}$$

where Accuracy(α, β) denotes the accuracy, Correctly clustered objects denotes the number of correctly clustered objects, and Total clustered objects denotes the total number of clustered objects;

The proportion of objects that are actually clustered is defined as generality, given by:

$$\mathrm{Generality}(\alpha, \beta) = \frac{\text{Total clustered objects}}{\text{Total objects in } U}$$

where Generality(α, β) denotes the generality, Total clustered objects denotes the total number of clustered objects, and Total objects in U denotes the total number of objects in U;
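The two measures defined above are plain ratios and can be computed as in the following sketch. The function names and the sample counts are illustrative assumptions, and returning 0.0 for an empty denominator is a convention chosen here, not something stated in the text.

```python
def accuracy(correctly_clustered, total_clustered):
    """Accuracy(alpha, beta): correctly clustered / total clustered objects."""
    if total_clustered == 0:
        return 0.0  # assumed convention when nothing was clustered
    return correctly_clustered / total_clustered


def generality(total_clustered, total_objects):
    """Generality(alpha, beta): total clustered / total objects in U."""
    if total_objects == 0:
        return 0.0  # assumed convention for an empty universe
    return total_clustered / total_objects


# Illustrative counts: 100 objects in U, 80 of them clustered, 72 correctly.
acc = accuracy(72, 80)
gen = generality(80, 100)
```

With these counts, tightening the thresholds would typically cluster fewer objects, trading generality for accuracy, which is why both values are tracked together.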

Take the median value of α and β within (1, 0) and (0, 1) respectively, and calculate Accuracy(α, β) and Generality(α, β) so that both values are simultaneously as large as possible. Denote the new values of α and β as a and b respectively; in the next iteration, take the median value of α and β within (1, a) and (0, b) respectively. Repeat these steps until the preset number of iterations Q is reached, or until Accuracy(α, β) ≤ Generality(α, β);

Suitable thresholds (α, β) are calculated and determined according to the above procedure. These thresholds divide the objects into three parts, namely Inside(C_k), Partial(C_k), and Outside(C_k):

Inside(C_k) = {o_i ∈ U | e(c_k, o_i) ≥ α},

Partial(C_k) = {o_i ∈ U | β < e(c_k, o_i) < α},

Outside(C_k) = {o_i ∈ U | e(c_k, o_i) ≤ β}.

A distance formula (in which i denotes the i-th object, a denotes an attribute of the object, and the value of attribute a of the i-th object is used) is applied to determine the distance from all objects in the data set to each object with missing values. From these distances the nearest neighbors of the object o_i can be calculated, and the evaluation function is then:

$$e(c_k, o_i) = \frac{\text{Number of } o_i \text{ neighbors belonging to } c_k}{\text{Total neighbors of } o_i}$$

where Number of o_i neighbors belonging to c_k denotes the number of neighbors of object o_i that belong to cluster c_k, and Total neighbors of o_i denotes the total number of neighbors of object o_i;

The evaluation function quantifies the relationship between object o_i and cluster c_k;

Object o_i is then classified into the corresponding region using the following formulas:

Inside(C_k) = {o_i ∈ U | e(c_k, o_i) ≥ α},

Partial(C_k) = {o_i ∈ U | β < e(c_k, o_i) < α},

Outside(C_k) = {o_i ∈ U | e(c_k, o_i) ≤ β}.

where Inside denotes the core domain, Partial denotes the boundary domain, and Outside denotes the outer domain.
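The evaluation function and the three-region rule can be sketched together as follows. The sketch assumes that the nearest neighbors of an object have already been found and are given as a list of their cluster labels, and that α > β; the function names `evaluate` and `region`, the labels "A" and "B", and the threshold values 0.75 and 0.25 are illustrative assumptions.

```python
def evaluate(neighbor_labels, cluster):
    """e(c_k, o_i): fraction of o_i's neighbors that belong to cluster c_k."""
    if not neighbor_labels:
        return 0.0  # assumed convention for an object without neighbors
    hits = sum(1 for label in neighbor_labels if label == cluster)
    return hits / len(neighbor_labels)


def region(e_value, alpha, beta):
    """Assign an object to the core, boundary, or outer domain of a cluster."""
    if e_value >= alpha:
        return "inside"   # core domain
    if e_value <= beta:
        return "outside"  # outer domain
    return "partial"      # boundary domain: beta < e < alpha


alpha, beta = 0.75, 0.25  # illustrative thresholds with alpha > beta

core = region(evaluate(["A", "A", "A", "A"], "A"), alpha, beta)
boundary = region(evaluate(["A", "B", "A", "B"], "A"), alpha, beta)
outer = region(evaluate(["B", "B", "B", "A"], "A"), alpha, beta)
```

An object whose neighbors are split between clusters lands in the boundary domain, which is where the decision on it is postponed rather than forced.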

The working principle and beneficial effects of this technical scheme are as follows: to improve the clustering effect, the two initial thresholds are adjusted within their value range, and a trend graph of the clustering result during the adjustment is obtained. Specific values for the first initial threshold and the second initial threshold are selected according to the trend graph before clustering is carried out, which improves the effectiveness of the clustering algorithm.

Example 8

On the basis of Example 1, the text three-branch clustering method based on an improved game theory rough set further comprises naming each data cluster respectively and transmitting the data clusters to a designated terminal for display.

The beneficial effects of this technical scheme are as follows: to distinguish different data clusters, each data cluster is named and then transmitted to the designated terminal for display, so that the user can conveniently view the data clusters.

It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. A text three-branch clustering method based on improved game theory rough set, characterized by comprising:

Step 1: Collect text data;

Step 2: Preprocess the text data and establish a text dataset;

Step 3: Establish the corresponding rough set based on the text data set;

Step 4: Cluster the rough set to obtain clustering information, and perform text clustering on the text data set based on the clustering information to obtain several data clusters;

The step 4 comprises:

Step 41: Obtain the second set features corresponding to each rough set;

Step 42: Record each rough set whose second set features are defect features as a defect rough set, and record each rough set whose second set features are complete features as a complete rough set;

Step 43: Obtain defect rough sets and complete rough sets, and establish clustering information;

Step 44: Obtain a corresponding clustering algorithm according to the clustering information, and perform text clustering according to the clustering algorithm to obtain a number of data clusters;

The step 44 comprises:

Step 441: Use uniform manifold approximation and projection (UMAP) to reduce the dimension of each defect rough set, obtain a corresponding compensation rough set, and generate a compensation text data set according to the compensation rough set;

Step 442: extracting a target clustering algorithm from a preset clustering algorithm library according to the clustering information;

Step 443: clustering the compensated text data set and the complete text data set using a target clustering algorithm to obtain a number of data clusters;

The method further comprises:

Obtaining a defect rough set;

Using the preset sample data segments to traverse the defect rough set, the defect data segments on the defect rough set are obtained;

The data segments except the defective data segments in the defect rough set are recorded as non-defective data segments;

Obtaining a data ratio between defective data segments and non-defective data segments, and establishing an initial threshold value group according to the data ratio;

The target clustering algorithm is modified by using the initial threshold group, and the compensated text data set and the complete text data set are clustered by using the modified target clustering algorithm to obtain several data clusters;

The initial threshold group includes a first initial threshold and a second initial threshold, and the value range of the first initial threshold and the second initial threshold is (0, 1);

The initial threshold group is combined with the target clustering algorithm, and the first initial threshold and the second initial threshold are adjusted in the range of (0, 1);

Determining appropriate thresholds, wherein the thresholds divide the objects into three parts: a core domain, a boundary domain, and an external domain.

2. A text three-branch clustering method based on improved game theory rough sets as claimed in claim 1, characterized in that the step 2 comprises:

Step 21: Eliminate the first non-important words whose occurrence frequency in the text data is less than a preset frequency to obtain target text data;

Step 22: Record a preset amount of target text data as one category and establish a standard data set;

Step 23: Eliminate the second non-important words belonging to the preset part of speech in the standard data set to establish a text data set.

3. A text three-branch clustering method based on improved game theory rough sets as claimed in claim 1, characterized in that said step 3 comprises:

Step 31: Obtain the first set of features corresponding to each text data set respectively;

Step 32: Divide the text data set into a complete text data set and a defective text data set according to the first set of features;

Step 33: Convert the complete text data set and the defect text data set into a complete rough set and a defect rough set;

Step 34: The complete rough set and the defective rough set are uniformly recorded as rough sets.

4. The text three-branch clustering method based on improved game theory rough set as claimed in claim 1, characterized in that it also includes:

The initial threshold group is combined with the target clustering algorithm, the first initial threshold and the second initial threshold are adjusted within the range of (0, 1), and a clustering result trend graph is established during the adjustment process;

The target trend point corresponding to the target clustering result is determined in the clustering result trend graph;

A first value and a second value corresponding to the target trend point are obtained, and the target clustering algorithm is corrected using the first value and the second value.

5. The text three-branch clustering method based on improved game theory rough set as claimed in claim 1, characterized in that it also includes:

Each data cluster is named separately and transmitted to the designated terminal for display.