Disclosure of Invention
The invention aims to solve at least the technical problems existing in the prior art, and in particular creatively provides a ReliefF feature selection method that applies random multi-subspaces to big data.
In order to achieve the above object, the present invention provides a random multi-subspace ReliefF feature selection method for big data, comprising:
S1, dividing an original feature space to generate a plurality of feature partitions, each of which comprises a plurality of equally sized, mutually disjoint random subspaces;
S2, using the ReliefF or Relief algorithm in each random subspace to obtain the local weights of its features, wherein the local weight vector reflects the ability of the random subspace to preserve local structure and the contribution of the samples to the features, and each random subspace contains a plurality of features;
combining the local weight vectors of the random subspaces in each feature partition to obtain a full weight vector;
And S3, integrating the full weight vectors of the feature partitions into the final weight vector of the features, namely, obtaining the final weight vector of all the features by averaging the weight scores over the feature partitions.
Further, the step S1 includes:
representing the i-th feature partition as:

P^(i) = {P^(i,1), P^(i,2), ..., P^(i,s)}

wherein P^(i) represents the i-th feature partition;
P^(i,j) represents the j-th random subspace of P^(i);
s represents the number of random subspaces in P^(i);
when each feature partition is generated, [d/s] features are randomly sampled each time without repetition until all features have been sampled, and when fewer than [d/s] features remain, the remaining features form a random subspace, wherein [d/s] represents the value of d/s truncated to an integer and d represents the total number of features;
the j-th random subspace of P^(i) may be written as:

P^(i,j) = {f_1^(i,j), f_2^(i,j), ..., f_{d_ij}^(i,j)}

wherein f_1^(i,j) represents the 1st feature in the random subspace P^(i,j);
f_2^(i,j) represents the 2nd feature in the random subspace P^(i,j);
f_{d_ij}^(i,j) represents the d_ij-th feature in the random subspace P^(i,j);
d_ij represents the number of features in the random subspace P^(i,j);
repeatedly and randomly generating feature partitions, so that the resulting set of feature partitions can be expressed as:

{P^(1), P^(2), ..., P^(M)}

wherein P^(1) represents the 1st generated feature partition;
P^(2) represents the 2nd generated feature partition;
P^(M) represents the M-th generated feature partition;
M represents the number of feature partitions.
Further, each feature partition comprises s random subspaces, each random subspace has the same number [d/s] of features, and if d is not divisible by s, the remaining features form an additional random subspace by themselves.
Further, obtaining the local weights of the features in S2 using the ReliefF or Relief algorithm in each random subspace includes:
calculating the local weight of each feature f_l^(i,j) (l = 1, 2, ..., d_ij) in the random subspace P^(i,j) using the ReliefF or Relief algorithm;
the local weight vector of all features in the random subspace P^(i,j) can then be expressed as:

w^(i,j) = (w(f_1^(i,j)), w(f_2^(i,j)), ..., w(f_{d_ij}^(i,j)))^T

wherein w(f_1^(i,j)) represents the local weight of the 1st feature in the random subspace P^(i,j);
w(f_2^(i,j)) represents the local weight of the 2nd feature in the random subspace P^(i,j);
w(f_{d_ij}^(i,j)) represents the local weight of the d_ij-th feature in the random subspace P^(i,j);
d_ij represents the number of features in the random subspace P^(i,j);
(·)^T represents the transpose operation.
Further, the ReliefF algorithm obtains the weight of the feature f_l^(i,j) (l = 1, 2, ..., d_ij) in the random subspace P^(i,j) according to the following formula:

w_m(f_l^(i,j)) = w_{m-1}(f_l^(i,j)) - (1/(m·k)) · Σ_{h=1..k} diff(f_l^(i,j), R, H_h) + (1/(m·k)) · Σ_{Q ≠ class(R)} [p(Q) / (1 - p(class(R)))] · Σ_{h=1..k} diff(f_l^(i,j), R, M_h)

wherein w_m(f_l^(i,j)) represents the local weight of the l-th feature in the random subspace P^(i,j) after m recursive updates (m being the number of sampled instances), and w_{m-1}(f_l^(i,j)) represents the local weight of the feature f_l^(i,j) in the random subspace P^(i,j) after m-1 recursive updates;
f_l^(i,j) denotes the l-th feature in the random subspace P^(i,j);
diff(f_l^(i,j), R, H_h) represents the difference between samples R and H_h on feature f_l^(i,j) in the random subspace P^(i,j);
diff(f_l^(i,j), R, M_h) represents the difference between samples R and M_h on feature f_l^(i,j) in the random subspace P^(i,j);
H_h denotes the h-th nearest neighbor of sample R among the samples having the same class as sample R;
M_h denotes the h-th nearest neighbor of sample R among the samples having a class Q different from that of sample R;
m represents the number of sampling iterations;
k represents the total number of nearest neighbors;
Q ≠ class(R) indicates that Q ranges over the classes different from the class of sample R;
Q represents a class in that set;
class(R) represents the class of sample R;
p(·) represents the prior probability of a class among all classes;
a nearest neighbor is a sample closest to sample R.
Further, obtaining the full weight vector in S2 includes:
each feature in the original feature space appears only once in the s random subspaces of a feature partition P^(i), so the full weight vector of P^(i) can be obtained by concatenating the local weight vectors of its s subspaces and is expressed as:

w^(i) = (w(f_1^(i)), w(f_2^(i)), ..., w(f_d^(i)))^T

wherein w^(i) represents the full weight vector of the feature partition P^(i);
w(f_1^(i)) represents the local weight of f_1^(i);
f_1^(i) denotes the 1st feature in the original feature space;
w(f_2^(i)) represents the local weight of f_2^(i);
f_2^(i) denotes the 2nd feature in the original feature space;
w(f_d^(i)) represents the local weight of f_d^(i);
f_d^(i) denotes the d-th feature in the original feature space;
(·)^T represents the transpose operation.
Further, the step S3 includes:
expressing the final weight vector as:

w = (1/M) · Σ_{i=1..M} w^(i)

wherein w represents the final weight vector;
w^(i) represents the full weight vector of the feature partition P^(i);
P^(i) represents the i-th feature partition;
M represents the number of feature partitions;
then the average w_avg of the final weight vector is calculated, and if the weight w(f_i) of a feature in the final weight vector is greater than w_avg, the feature is selected into the selected feature subset.
Further, the method further comprises taking the accuracy, precision and recall of a KNN classifier and/or a DT classifier as evaluation indexes of the method, wherein the classification accuracy of the KNN and DT classifiers is expressed as:

E(f; D) = (1/n) · Σ_{i=1..n} I(f(x_i) = c_i)

wherein E(f; D) represents the classification accuracy of the KNN or DT classifier;
D represents the training dataset;
f represents the KNN classifier or the DT classifier;
n represents the number of samples;
I(·) is the indicator function;
f(x_i) represents the prediction result of the classifier on the i-th sample;
c_i represents the true class label of the i-th sample.
In summary, by adopting the above technical scheme, the invention fully considers the diversity of subspaces and the contribution information of the samples to the features during feature selection, and has the capability of exploring each subspace of a high-dimensional space.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
1. Introduction to the invention
Here we use the framework of filter methods for feature selection, as they are superior to embedded and wrapper methods in terms of simplicity, scalability and independence.
In view of the two problems described in the background art, a new feature selection method, RBEFF, is presented herein. The main idea is to use a large number of random subspaces divided from the original feature space to select a feature subset that contains a large amount of information. During the learning process, it is important that each feature competes fairly in a large number of random subspaces. RBEFF consists of three steps. First, the original feature space is divided into a number of disjoint feature subspaces of equal size, which together are called a feature partition; the number of feature partitions is M, and the subspace size in each feature partition is d/s. This division is repeated M times, so in this step we obtain M feature partitions containing a large number of subspaces. Second, the local weight of each feature in each random subspace is calculated using the ReliefF algorithm, and the local weights of a subspace form its local weight vector. The local weight vector reflects the ability of the random subspace to preserve local structure and the contribution of the samples to the features: because the features of each random subspace differ, the k nearest neighbors of a sample within the same class and within the other classes also differ, so each feature is given a different weight in different subspaces. Since the ReliefF algorithm randomly samples m times, the information of different samples is used to measure the weight of a feature. The local weight vectors in each feature partition constitute a full weight vector, and the full weight vectors of the M feature partitions are integrated to obtain the final weight vector of the feature space. Finally, feature selection is performed adaptively according to the final weight vector, i.e., an optimal feature subset is selected to represent all features without setting the number of selected features in advance. A large number of experiments were performed on 26 standard datasets, and the results indicate the validity and effectiveness of the algorithm.
2. Background
2.1 Related work for feature selection
The problem of feature estimation has attracted a great deal of attention in the literature. There are several ways in which the quality of a feature can be assessed. If the target variable is discrete, the measures include information gain, the Gini index, distance measures, the J-measure, Relief, ReliefF, MDL, χ2 and the G statistic. When the target variable is a real-valued function (numerical class and regression problems), the mean squared error, the mean absolute error and RReliefF are used. Most measures assume that the features are conditionally independent given the target variable and are therefore unsuitable when there are many feature interactions. The Relief, ReliefF and RReliefF algorithms do not make this assumption and can correctly estimate the quality of a feature even in the presence of strong dependencies between attributes. Therefore, the Relief family of algorithms is among the best preprocessing algorithms to date.
The Relief family of algorithms is an efficient class of filter feature selection algorithms. Kira and Rendell first proposed the feature selection algorithm Relief for binary classification problems in 1992. The Relief algorithm is widely used because it is relatively simple yet highly efficient and gives satisfactory results. However, it is limited in that it can only process two-class data. Therefore, Kononenko analyzed and extended it in 1994 into the ReliefF algorithm, which can deal with noisy, incomplete and multi-class problems. To apply ReliefF to continuous-class (regression) problems, the RReliefF algorithm was introduced in 1997. After this, a number of feature selection algorithms based on the Relief family were proposed. Owing to its excellent performance, the Relief family of algorithms (Relief, ReliefF and RReliefF) has been widely used to solve practical problems such as signal recognition, transmission line fault localization, image processing and image classification, gene classification and pattern recognition.
2.2 Relief feature selection algorithm
The Relief algorithm is a feature weighting algorithm: it gives each feature a different weight according to the relevance between the feature and the class, and features whose weights fall below a certain threshold are removed. In the Relief algorithm, the correlation between a feature and the class is based on the ability of the feature to distinguish samples that are close to each other. D is divided into D+ = {positive samples} and D- = {negative samples}. The algorithm randomly selects one sample R = (f_1, f_2, ..., f_d) from the training dataset D, then searches for the nearest positive sample Z+ of R in D+ and the nearest negative sample Z- of R in D-. The weight of each feature is then updated according to the following rule: if R is a positive sample, then near_hit = Z+ and near_miss = Z-; otherwise near_hit = Z- and near_miss = Z+, where the nearest neighbors are found using the Euclidean distance. We then update each weight using equation (1), where diff is calculated by equation (2) when the two values x_k and y_k are nominal and by equation (3) when x_k and y_k are numerical. The above process is repeated m times to finally obtain the average weight of each feature. The pseudocode of the Relief algorithm is shown in Algorithm 1. The larger the weight of a feature, the stronger its classification capability; conversely, the smaller the weight, the weaker its classification capability. The running time of the Relief algorithm increases linearly with the number of samples m and the number of original features d, so its running efficiency is high.
W_i = W_i - diff(R_i, near_hit_i)^2 + diff(R_i, near_miss_i)^2, (i = 1, 2, ..., d),   (1)

wherein W_i represents the weight of the i-th feature, whose value is accumulated recursively over the m sampled instances and finally divided by m to obtain the average weight of the feature;
diff(R_i, near_hit_i) represents the difference, on feature i, between a randomly drawn sample R and its nearest neighbor near_hit of the same class;
diff(R_i, near_miss_i) represents the difference, on feature i, between a randomly drawn sample R and its nearest neighbor near_miss of a different class;
R_i represents the value of the randomly drawn sample R on feature i;
near_hit_i represents the value on feature i of near_hit, the nearest same-class neighbor of the randomly drawn sample R;
near_miss_i represents the value on feature i of near_miss, the nearest different-class neighbor of the randomly drawn sample R.

For nominal values, diff is computed as

diff(x_k, y_k) = 0 if x_k = y_k, and 1 otherwise,   (2)

and for numerical values as

diff(x_k, y_k) = |x_k - y_k| / v_k,   (3)

where v_k is a normalization unit that scales the value of diff into the [0, 1] interval, analogous to the min-max normalization (x - min) / (max - min) that maps x to [0, 1]; because x_k and v_k are vectors, max refers to the largest value in the vector and min refers to the smallest value in the vector.
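To make equations (1)-(3) concrete, the following Python sketch implements the binary-class Relief update described above. The function name, the NumPy-based implementation, and the min-max normalization choice are our own illustrative assumptions, not the authors' code.

```python
import numpy as np

def relief(X, y, m=10, rng=None):
    """Minimal Relief sketch for a binary problem (labels 0/1).

    X: (n, d) numeric data matrix, y: (n,) labels. Assumes each class has
    at least two samples. Returns the d feature weights averaged over
    m random draws, following equation (1).
    """
    rng = np.random.default_rng(rng)
    n, d = X.shape
    span = X.max(axis=0) - X.min(axis=0)       # v_k in equation (3)
    span[span == 0] = 1.0                      # avoid division by zero
    W = np.zeros(d)
    for _ in range(m):
        r = rng.integers(n)
        R = X[r]
        same = np.where(y == y[r])[0]
        same = same[same != r]
        other = np.where(y != y[r])[0]
        # nearest hit / nearest miss by Euclidean distance
        near_hit = X[same[np.argmin(np.linalg.norm(X[same] - R, axis=1))]]
        near_miss = X[other[np.argmin(np.linalg.norm(X[other] - R, axis=1))]]
        d_hit = np.abs(R - near_hit) / span    # equation (3), per feature
        d_miss = np.abs(R - near_miss) / span
        W += -d_hit ** 2 + d_miss ** 2         # equation (1)
    return W / m                               # average over the m draws
```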
2.3 ReliefF feature selection algorithm
When dealing with multi-class problems, the ReliefF algorithm randomly extracts one sample R = (f_1, f_2, ..., f_d) at a time from the training dataset D. It selects the k nearest neighbors H_h (h = 1, 2, ..., k) of R from the samples of the same class as R, and finds the k nearest neighbors M_h (h = 1, 2, ..., k) of R from the samples of each class different from that of R, where the nearest neighbors are found using the Euclidean distance. The above process is repeated m times. We then update the weight of each feature using equation (4), where diff is calculated by equation (5). Finally, we perform feature selection based on the weight of each feature and a given threshold. The pseudocode of the ReliefF algorithm is described in Algorithm 2.
w_m(f_l^(i,j)) = w_{m-1}(f_l^(i,j)) - (1/(m·k)) · Σ_{h=1..k} diff(f_l^(i,j), R, H_h) + (1/(m·k)) · Σ_{Q ≠ class(R)} [p(Q) / (1 - p(class(R)))] · Σ_{h=1..k} diff(f_l^(i,j), R, M_h),  (l = 1, 2, ..., d_ij),   (4)

wherein w_m(f_l^(i,j)) represents the local weight of the l-th feature in the random subspace P^(i,j) after m recursive updates (m being the number of sampled instances);
w_{m-1}(f_l^(i,j)) represents the local weight of the feature f_l^(i,j) in the random subspace P^(i,j) after m-1 recursive updates;
f_l^(i,j) denotes the l-th feature in the random subspace P^(i,j);
diff(f_l^(i,j), R, H_h) represents the difference between samples R and H_h on feature f_l^(i,j) in the random subspace P^(i,j);
diff(f_l^(i,j), R, M_h) represents the difference between samples R and M_h on feature f_l^(i,j) in the random subspace P^(i,j);
H_h denotes the h-th nearest neighbor of sample R among the samples of the same class as sample R;
M_h denotes the h-th nearest neighbor of sample R among the samples of a class Q different from that of sample R;
m represents the number of sampling iterations;
k represents the total number of nearest neighbors;
Q ≠ class(R) indicates that Q ranges over the classes different from the class of sample R;
class(R) represents the class of sample R;
p(·) represents the prior probability of a class among all classes;
a nearest neighbor is a sample closest to sample R.
diff(A, R_1, R_2) = |R_1[A] - R_2[A]| / (max_A - min_A),   (5)

where diff(A, R_1, R_2) represents the difference between samples R_1 and R_2 on feature A, R_1[A] and R_2[A] represent the values of samples R_1 and R_2 on feature A, respectively, and max_A and min_A represent the maximum and minimum values of all samples on feature A.
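A minimal sketch of the ReliefF update of equations (4) and (5) might look as follows; the function relieff and its parameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def relieff(X, y, m=10, k=5, rng=None):
    """Minimal ReliefF sketch following equations (4) and (5).

    X: (n, d) numeric matrix, y: (n,) class labels. Returns d weights.
    """
    rng = np.random.default_rng(rng)
    n, d = X.shape
    span = X.max(axis=0) - X.min(axis=0)       # max_A - min_A in equation (5)
    span[span == 0] = 1.0
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes, counts / n))     # p(.) in equation (4)
    W = np.zeros(d)
    for _ in range(m):
        r = rng.integers(n)
        R, cr = X[r], y[r]
        dist = np.linalg.norm(X - R, axis=1)
        dist[r] = np.inf                       # exclude the sample itself
        # k nearest hits H_h (same class as R)
        hits = np.where(y == cr)[0]
        hits = hits[np.argsort(dist[hits])][:k]
        W -= (np.abs(X[hits] - R) / span).sum(axis=0) / (m * k)
        # k nearest misses M_h for every other class Q
        for q in classes:
            if q == cr:
                continue
            miss = np.where(y == q)[0]
            miss = miss[np.argsort(dist[miss])][:k]
            coef = prior[q] / (1.0 - prior[cr])
            W += coef * (np.abs(X[miss] - R) / span).sum(axis=0) / (m * k)
    return W
```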
3. Proposed algorithm
3.1 Symbol description
Given a classification dataset D = (X, C), X ∈ R^{n×d} represents a data matrix of n samples and d features, where x_i ∈ R^d represents the i-th sample (the i-th row of the matrix X), i.e., x_i is d-dimensional (has d features), f_j ∈ R^n represents the j-th feature (the j-th column of the matrix X), and C ∈ R^n represents the class vector. Feature selection refers to selecting some of the most effective subsets of features from the original features by sufficiently estimating the underlying structure of the data, thereby preserving the structural information of the data. We express the matrix of this subset as T' = (f_1', f_2', ..., f_{d'}'), where T' is a submatrix of X^T representing the data matrix formed by the selected features, X^T represents the transpose of X, d' represents the number of selected features, and d represents the total number of features in the dataset.
3.2 Generation of random subspaces
The data structure is often hidden in different subspaces of the original feature space. However, it is difficult to estimate the correct feature subspace and the true structural information. Many existing feature selection methods tend to describe the hidden structure of data by taking into account global and local information or the minimum redundancy of samples. However, in a high-dimensional feature space, the diversity of random subspaces tends to be ignored and a large amount of structural information tends to remain hidden. Although some efforts have been made to learn structural information in subspaces, they implicitly assume that there is a single optimal subspace in the high-dimensional space and cannot go beyond a single subspace to explore multiple, or even a large number of, subspaces. Thus, the diversity of subspaces is ignored. To address this problem, the goal herein is to mine the information between features and class attributes from subspaces. Since it is difficult to find the small portion of appropriate subspaces, we employ a random subspace technique and verify that a large number of subspaces can significantly improve the accuracy of feature selection.
By using a large number of random subspaces, we hope that each feature competes fairly in the random subspaces; for this, each feature should appear the same number of times over all subspaces, and the generated subspaces should be of the same size. Therefore, in this section we propose a scheme to generate random subspaces. The main idea is to generate a plurality of feature partitions from the original feature space. Each feature partition is a completely random division of all the features and consists of a certain number of equally sized random subspaces, so that all features appear in the subspaces the same number of times. Thus, each feature participates fairly in the competition among the generated subspaces.
Let P^(i) represent the i-th feature partition and P^(i,j) represent the j-th random subspace of P^(i). The feature partition can then be expressed as

P^(i) = {P^(i,1), P^(i,2), ..., P^(i,s)},

wherein s represents the number of random subspaces in P^(i). Because the random subspaces in P^(i) should all be of the same size, we let the size of each random subspace be d/s herein. However, in most real cases the value of d/s is not an integer, so we truncate d/s to an integer. Thus, when each feature partition is generated, we randomly sample [d/s] features each time without repetition until all features have been sampled, and when fewer than [d/s] features remain, the remaining features constitute a random subspace, where [d/s] represents the value of d/s truncated to an integer. It should be noted that when d is not divisible by s, we actually generate s+1 random subspaces; for simplicity, the number of subspaces in each feature partition is still denoted s herein. Let f_k^(i,j) represent the k-th feature in the random subspace P^(i,j), where d_ij represents the number of features in P^(i,j). Thus, the j-th random subspace of P^(i) may be written as

P^(i,j) = {f_1^(i,j), f_2^(i,j), ..., f_{d_ij}^(i,j)}.
Thus, we can clearly conclude that, within one feature partition, each feature appears in exactly one of the random subspaces; that is, the different random subspaces of a feature partition do not intersect each other. In other words, for any two subspaces P^(i,j) and P^(i,j') with j ≠ j', we have P^(i,j) ∩ P^(i,j') = ∅.
We repeatedly and randomly generate feature partitions, so that the resulting set of feature partitions can be represented as

{P^(1), P^(2), ..., P^(M)},

wherein P^(i) represents the i-th generated feature partition and M represents the number of feature partitions. For one dataset, each feature partition contains s random subspaces, and each random subspace has the same number of features. Thus, there are M·s random subspaces of the same size in total. According to the generation mechanism of the random subspaces, each feature of the original dataset appears M times across the M·s random subspaces.
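The partition-generation scheme described above could be implemented, for example, as follows; the function name generate_feature_partitions and its interface are hypothetical.

```python
import numpy as np

def generate_feature_partitions(d, s, M, rng=None):
    """Generate M random feature partitions of d features.

    Each partition splits the feature indices 0..d-1 into s disjoint
    random subspaces of size [d/s]; if d is not divisible by s, the
    leftover features form one extra (smaller) subspace.
    """
    rng = np.random.default_rng(rng)
    size = d // s                              # [d/s], truncated rounding
    partitions = []
    for _ in range(M):
        order = rng.permutation(d)             # random sampling without repetition
        subspaces = [order[i * size:(i + 1) * size] for i in range(s)]
        if d % s:                              # remaining features form a subspace
            subspaces.append(order[s * size:])
        partitions.append(subspaces)
    return partitions

# e.g. d=34 features and s=6 give six subspaces of 5 features plus one of 4
parts = generate_feature_partitions(d=34, s=6, M=5, rng=0)
```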
3.3 Weight learning and feature selection for each feature
In this section we use the random subspaces to describe the underlying information structure of the data and choose the informative features that preserve the spatial structure well. This process is feature selection.
For each of the M·s random subspaces in total, the local weight of each feature is obtained according to the ReliefF algorithm (introduced in Section 2.3). Because the features contained in each random subspace differ, the k nearest neighbors of a randomly selected sample are not exactly the same across subspaces. In the sample space, a sample is randomly drawn under a given random subspace, and its k nearest neighbors are found to update the weights. In this way, the underlying structure of the data can be described as fully as possible by a large number of random subspaces.
Given a random subspace P^(i,j), for each feature f_l^(i,j) (l = 1, 2, ..., d_ij) in P^(i,j), its local weight can be calculated by equation (4) introduced in Section 2.3; that is, equation (4) gives the local weight of the l-th feature of the j-th random subspace of the i-th feature partition. The local weight vector of all d_ij features in the random subspace P^(i,j) can then be expressed as

w^(i,j) = (w(f_1^(i,j)), w(f_2^(i,j)), ..., w(f_{d_ij}^(i,j)))^T,

where w(f_l^(i,j)) represents the local weight of the l-th feature in P^(i,j). Each feature of the original feature space appears only once in the s random subspaces of a feature partition P^(i), so the full weight vector of P^(i) (d being the number of all features) can be obtained by concatenating the local weight vectors of its s subspaces and expressed as

w^(i) = (w(f_1^(i)), w(f_2^(i)), ..., w(f_d^(i)))^T,

where f_j^(i) (j = 1, 2, ..., d) is a feature of the original feature space.
The final weight vector is obtained by averaging the full weight vectors of the M feature partitions and can be expressed as

w = (1/M) · Σ_{i=1..M} w^(i).

Thus, the final weight vector of all features can be obtained simply by adjusting the number of feature partitions and random subspaces. The average of the final weight vector is then calculated and denoted w_avg. For each feature weight in the final weight vector, if w(f_i) is greater than w_avg, the feature is selected into the selected feature subset.
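A sketch of the aggregation and thresholding described in this section, assuming the hypothetical relieff and generate_feature_partitions helpers from the earlier sketches:

```python
import numpy as np

def rbeff_weights(X, y, partitions, m=10, k=5):
    """Full weight vector per partition, then the final (averaged) weights."""
    n, d = X.shape
    full_vectors = []
    for subspaces in partitions:
        w_full = np.zeros(d)
        for idx in subspaces:                  # idx: feature indices of one subspace
            # local weights of the subspace, placed back at the original positions
            w_full[idx] = relieff(X[:, idx], y, m=m, k=k)
        full_vectors.append(w_full)
    w_final = np.mean(full_vectors, axis=0)    # average over the M partitions
    selected = np.where(w_final > w_final.mean())[0]   # threshold w_avg
    return w_final, selected
```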
3.4 Overall algorithm
In this section we will summarize the whole procedure of the proposed algorithm, the detailed steps of which are shown in algorithm 3.
The main idea of the algorithm is to generate a large number of random subspaces and to obtain the weight of each feature in every subspace; the information of the multiple subspaces is then combined to obtain the weight scores of all features. To ensure that every feature of the original dataset competes and cooperates fairly in the generated subspaces while its weight is being learned, a number of equally sized, mutually disjoint random subspaces are generated in each feature partition. When generating the random subspaces, two parameters need to be set, namely the number of feature partitions M and the number of random subspaces s contained in each feature partition; these two parameters control how many feature partitions and random subspaces are generated. We calculate the weights of the features and select features as follows. After a large number of random subspaces have been generated, the class attribute is added to each random subspace. Then, in each random subspace, the ReliefF algorithm is used to obtain the local weight of each feature of the current subspace, yielding the local weight vector of the subspace, and the local weight vectors of the different random subspaces in a feature partition are concatenated to obtain the full weight vector of all features (i.e., the vector obtained from each feature partition). Finally, the final weight vector of all features is obtained by averaging the weight scores over the feature partitions (i.e., each feature is given a different weight in the different feature partitions, and the average is taken as its final weight). Here we use the average of the final weight vector for feature selection, i.e., if the final score of a feature is greater than the average, the feature is selected. The selected feature subset is ultimately returned (as shown in Algorithm 3). A sketch of this overall procedure is given below.
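Putting the previous sketches together, an end-to-end run of the procedure summarized in Algorithm 3 might look roughly like this; the toy data and all helper names are our own illustrative assumptions.

```python
import numpy as np

# generate_feature_partitions and rbeff_weights are the hypothetical
# sketches from Sections 3.2 and 3.3 above.

# toy data: 200 samples, 20 features, 3 classes (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = rng.integers(0, 3, size=200)

d = X.shape[1]
s = max(1, d // 6)                 # subspaces of roughly 6 features (Table 3 setting)
M = 5                              # number of feature partitions

partitions = generate_feature_partitions(d, s, M, rng=0)
w_final, selected = rbeff_weights(X, y, partitions, m=10, k=5)
print("selected features:", selected)
X_reduced = X[:, selected]         # data matrix restricted to the selected subset
```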
4. Experiment
We have performed experiments on 26 standard datasets with different feature numbers to demonstrate that the proposed method is indeed superior to several existing feature selection methods.
4.1 Data sets
The 26 datasets were obtained from the UCI machine learning repository to verify the validity of the proposed method. It should be noted that the "Ecoli" and "processed.cleveland" datasets are applied to the algorithm after the small-class samples are deleted. Table 1 gives a brief description of the datasets used, including the number of features, the number of samples, and the number of classes.
Table 1 dataset
4.2 Contrast Algorithm
We compare our approach with feature selection methods that use all the features, or that use different metrics over the whole feature space or over subspaces. Details of the comparison algorithms are given below.
Baseline: classification using all features.
Laplacian score based feature selection (LS): feature selection using the Laplacian score over the whole feature space.
Feature selection based on multi-subspace randomization and collaboration (SRCFS): selects features by computing Laplacian scores over multiple subspaces.
Mutual information based feature selection (MI): feature selection using mutual information over the whole feature space.
ReliefF based feature selection (ReliefF): feature selection using the ReliefF algorithm over the whole feature space.
Fuzzy joint mutual information based feature selection (FJMI): selects features by computing fuzzy joint mutual information.
4.3 Evaluation index
The main purpose of the feature selection method is to select fewer features to achieve higher or similar accuracy. To verify the superiority of the algorithm, two well-known classifiers, K-nearest neighbor (KNN) and Decision Tree (DT), were used in the experiment. Three criteria of KNN and DT were used to measure the quality of the algorithm, namely accuracy, precision and recall.
4.3.1 KNN classifier
The KNN algorithm was originally proposed by Cover and Hart in 1968 and is one of the simplest and most commonly used classification algorithms; it is therefore widely used in machine learning. For a test sample, KNN finds its k nearest neighbors among the training samples and then assigns the test sample to the majority class among these k neighbors. In the KNN algorithm, the number k of neighbors needs to be specified, and the value of the parameter k has a great influence on the classification accuracy of the KNN classifier: depending on its value, the accuracy may vary considerably. The Euclidean distance is typically used in the KNN algorithm and is defined in equation (12).
ED(X_1, X_2) = sqrt( Σ_{i=1..d} (X_{1,i} - X_{2,i})^2 ),   (12)

wherein ED represents the Euclidean distance, X_1 and X_2 represent two d-dimensional samples (each sample having d features), and X_{1,i} and X_{2,i} represent the i-th dimension of samples X_1 and X_2, respectively.
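For illustration, equation (12) in code together with a KNN classifier configured as in the experiments (k = 5, Euclidean metric); the scikit-learn calls are standard, but the snippet is a sketch rather than the authors' setup.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def euclidean(x1, x2):
    """Equation (12): Euclidean distance between two d-dimensional samples."""
    return np.sqrt(np.sum((np.asarray(x1) - np.asarray(x2)) ** 2))

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
```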
4.3.2 DT classifier
A Decision Tree (DT) is a simple and common classifier in data mining. The purpose of a decision tree is to learn a model from the given training data and to predict the class of test samples. A decision tree is a tree structure in which each internal node represents a test on a feature, each branch represents a test outcome, and each leaf node represents a class. In the DT algorithm, the key to building the tree is which feature to select as the basis for splitting in the current state. According to the objective function used, there are mainly three tree-building algorithms, ID3, C4.5 and CART; their main difference is the objective function: ID3 uses the information gain, C4.5 uses the information gain ratio, and CART uses the Gini index. The choice of objective function has a great impact on the classification accuracy. Here we use the ID3 algorithm.
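scikit-learn does not provide a literal ID3 implementation; a common stand-in, assumed here for illustration, is a decision tree with the entropy (information-gain) splitting criterion.

```python
from sklearn.tree import DecisionTreeClassifier

# the entropy criterion approximates the information-gain splitting rule of ID3
dt = DecisionTreeClassifier(criterion="entropy", random_state=0)
```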
4.3.3 Classification performance
(1) Classification accuracy
The classification accuracy of the KNN and DT classifiers is expressed as:

E(f; D) = (1/n) · Σ_{i=1..n} I(f(x_i) = c_i),   (13)

wherein n represents the number of samples, I(·) is the indicator function, and f(x_i) and c_i (i = 1, 2, ..., n) represent the prediction result of the classifier and the true class label of the i-th sample, respectively. E(f; D) represents the classification accuracy of the KNN or DT classifier, D in E(f; D) represents the training dataset, and f represents the KNN classifier or the DT classifier.
(2) Precision and recall
For a classification problem, the samples can be divided, according to the combination of their true class and the class predicted by the learner, into four cases: true positive (TP), false positive (FP), true negative (TN) and false negative (FN). The binary case can be generalized to the multi-class problem: one of the classes is regarded as the positive class and the other classes as negative, and TP, FP, TN and FN denote the corresponding sample counts, so that obviously TP + FP + TN + FN = total number of samples. The "confusion matrix" of the classification results is shown in Table 2, and the precision and recall are calculated from the confusion matrix. Precision and recall are defined in formulas (14) and (15):

P = TP / (TP + FP),   (14)

R = TP / (TP + FN),   (15)

The precision of a multi-class problem can be divided into macro precision and micro precision, and the recall of a multi-class problem into macro recall and micro recall. In this paper we use the macro precision and macro recall of the multi-class datasets.
Table 2 Confusion matrix of the classification results

|  | Predicted positive | Predicted negative |
| True positive class | TP (true positive) | FN (false negative) |
| True negative class | FP (false positive) | TN (true negative) |
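As an illustration, equation (13) and the macro-averaged forms of equations (14) and (15) can be computed with scikit-learn as sketched below; this is not the authors' evaluation script.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

def evaluate(y_true, y_pred):
    """Accuracy (eq. 13) plus macro precision (eq. 14) and macro recall (eq. 15)."""
    acc = accuracy_score(y_true, y_pred)   # mean of the indicator I(f(x_i) = c_i)
    prec = precision_score(y_true, y_pred, average="macro", zero_division=0)
    rec = recall_score(y_true, y_pred, average="macro", zero_division=0)
    return acc, prec, rec
```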
4.3.4 Parameter settings
Here we use 10-fold cross-validation for our proposed method and the other comparison methods; the distance measure of KNN is the Euclidean distance, and the KNN classifier uses 5 neighbors. Each algorithm is independently repeated 10 times. The number of random subspaces s is set so that each random subspace has size 6; note that if the total number of features is not divisible by 6, the remainder of the division by 6 is the size of the last subspace. For a fair comparison, the parameters of the compared algorithms are kept consistent with those of the proposed algorithm. The specific parameter values are shown in Table 3.
Table 3 parameter settings for experiments
| Parameter | Value |
| r, the number of runs | 10 |
| K for cross-validation | 10 |
| k for KNN | 5 |
| the distance metric | Euclidean |
| k for RBEFF | 5 |
| m, the sample size for RBEFF | 10 |
| M, the number of feature partitions | 5 |
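Under the settings of Table 3 (10 runs of 10-fold cross-validation, KNN with k = 5), the evaluation protocol could be reproduced roughly as follows; the helpers generate_feature_partitions and rbeff_weights are the hypothetical sketches given earlier, not the authors' code.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

def evaluate_rbeff(X, y, runs=10, folds=10, s=None, M=5):
    """Average 10-fold CV accuracy of KNN on the RBEFF-selected features."""
    d = X.shape[1]
    s = s or max(1, d // 6)                    # subspaces of roughly 6 features
    scores = []
    for run in range(runs):
        cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=run)
        for train, test in cv.split(X, y):
            partitions = generate_feature_partitions(d, s, M, rng=run)
            _, selected = rbeff_weights(X[train], y[train], partitions)
            clf = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
            clf.fit(X[train][:, selected], y[train])
            scores.append(clf.score(X[test][:, selected], y[test]))
    return float(np.mean(scores))
```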
5. Experimental results
Experiments were performed here on 26 standard datasets with different feature numbers to compare our proposed method with several feature selection methods.
5.1 Classification precision
Tables 4 and 5 show the average classification accuracy of each feature selection method using the KNN and DT classifiers. As can be seen from Table 4, compared with the other methods, the proposed method RBEFF achieves the highest classification accuracy on 15 datasets under the KNN classifier, accounting for more than half of all datasets. As can be seen from Table 5, the RBEFF method achieves the highest classification accuracy on 12 datasets under the DT classifier. The average classification accuracy over the 26 datasets using the KNN and DT classifiers for the different algorithms is shown in the last row of Tables 4 and 5. It can be seen that our algorithm achieves the highest average accuracy whether the KNN or the DT classifier is used. It can be concluded that our algorithm RBEFF achieves good results in terms of classification accuracy.
Table 4 Average classification accuracy of the datasets under the KNN classifier (values in bold indicate the best result among the methods)
Table 5 Average classification accuracy of the datasets under the DT classifier (values in bold indicate the best result among the methods)
| Dataset | Baseline | ReliefF | MI | LS | SRCFS | FJMI | RBEFF |
| Data.user | 0.9016 | 0.8401 | 0.8381 | 0.9282 | 0.9256 | 0.9152 | 0.8435 |
| Cryotherapy | 0.8867 | 0.8256 | 0.8733 | 0.8667 | 0.8744 | 0.8567 | 0.8389 |
| Ecoli | 0.9786 | 0.9705 | 0.9745 | 0.9395 | 0.9745 | 0.9723 | 0.9866 |
| Seed1 | 0.9129 | 0.8971 | 0.8595 | 0.9219 | 0.9195 | 0.8314 | 0.9445 |
| Pima | 0.7238 | 0.5921 | 0.6780 | 0.6134 | 0.6916 | 0.6864 | 0.6674 |
| Biopsy | 0.9476 | 0.9558 | 0.9447 | 0.9376 | 0.9407 | 0.9307 | 0.9505 |
| Glass | 0.7830 | 0.7000 | 0.8381 | 0.8074 | 0.7888 | 0.8352 | 0.7961 |
| Breast-cancer | 0.9486 | 0.9490 | 0.9453 | 0.9404 | 0.9474 | 0.8486 | 0.9500 |
| Heart-failure | 0.7727 | 0.7708 | 0.7602 | 0.7707 | 0.7677 | 0.7910 | 0.7760 |
| Accent-mfcc | 0.7622 | 0.6972 | 0.759 | 0.732 | 0.7413 | 0.6222 | 0.7235 |
| Wine | 0.9323 | 0.926 | 0.9365 | 0.9381 | 0.9006 | 0.7124 | 0.9300 |
| Processed.cleveland | 0.5152 | 0.5664 | 0.4445 | 0.5003 | 0.4700 | 0.4248 | 0.5983 |
| Diabetes-data-upload | 0.9442 | 0.9115 | 0.9375 | 0.9537 | 0.9440 | 0.9494 | 0.9446 |
| Trial | 1 | 1 | 1 | 1 | 1 | 0.8258 | 1 |
| Hepatitis | 0.5225 | 0.6515 | 0.5425 | 0.5475 | 0.5977 | 0.4586 | 0.7075 |
| Brands | 0.632 | 0.6911 | 0.6433 | 0.6331 | 0.6315 | 0.6117 | 0.6873 |
| ForestTypes | 0.848 | 0.7463 | 0.8396 | 0.8384 | 0.8451 | 0.6111 | 0.7542 |
| Wdbc | 0.7075 | 0.7292 | 0.7317 | 0.7608 | 0.7492 | 0.7633 | 0.7633 |
| Lung-cancer | 0.9143 | 0.9086 | 0.9218 | 0.5763 | 0.9264 | 0.8868 | 0.9118 |
| Wpbc | 0.8761 | 0.882 | 0.8526 | 0.8484 | 0.8763 | 0.8654 | 0.8912 |
| Ionosphere | 0.6984 | 0.6909 | 0.6455 | 0.6761 | 0.6584 | 0.5941 | 0.7471 |
| Qsak | 0.8161 | 0.7767 | 0.7968 | 0.8181 | 0.8049 | 0.744 | 0.7679 |
| Sonar | 0.7515 | 0.7471 | 0.7722 | 0.781 | 0.7322 | 0.71 | 0.731 |
| Urban land cover | 0.7645 | 0.7584 | 0.7461 | 0.742 | 0.7175 | 0.5481 | 0.786 |
| Musk1 | 0.823 | 0.7784 | 0.7979 | 0.814 | 0.7707 | 0.81 | 0.7655 |
| Clean 1 | 0.7922 | 0.7807 | 0.8059 | 0.8191 | 0.7056 | 0.7788 | 0.7731 |
| Average | 0.73678 | 0.7897 | 0.7547 | 0.7022 | 0.7588 | 0.7171 | 0.8059 |
5.2 Precision
Tables 6 and 7 show the average precision of each feature selection method under the KNN and DT classifiers. As can be seen from Table 6, the proposed method RBEFF achieves the highest precision on 11 datasets under the KNN classifier compared with the other algorithms. It can also be seen from Table 7 that RBEFF achieves the highest precision on 10 datasets under the DT classifier. Thus, it can be concluded that the proposed method performs better on about half of the datasets. The last row of Tables 6 and 7 gives the overall average precision over the 26 datasets using the KNN and DT classifiers for the different algorithms. It can be seen that our algorithm RBEFF achieves the highest average precision under both the KNN and DT classifiers, indicating that the RBEFF method performs well in terms of precision.
Table 6 Average precision of the datasets under the KNN classifier (values in bold indicate the best result among the methods)
Table 7 Average precision of the datasets under the DT classifier (values in bold indicate the best result among the methods)
| Dataset | Baseline | ReliefF | MI | LS | SRCFS | FJMI | RBEFF |
| Data.user | 0.9154 | 0.8411 | 0.8412 | 0.9233 | 0.9232 | 0.9162 | 0.9387 |
| Cryotherapy | 0.8949 | 0.8515 | 0.8870 | 0.8664 | 0.8725 | 0.8523 | 0.8612 |
| Ecoli | 0.9606 | 0.9948 | 0.9672 | 0.9662 | 0.9889 | 0.9785 | 0.9989 |
| Seed1 | 0.9145 | 0.8979 | 0.8503 | 0.9213 | 0.9147 | 0.8343 | 0.9150 |
| Pima | 0.7943 | 0.6997 | 0.7613 | 0.7112 | 0.6538 | 0.6484 | 0.7273 |
| Biopsy | 0.9363 | 0.9452 | 0.9305 | 0.9149 | 0.9374 | 0.9293 | 0.9275 |
| Glass | 0.7718 | 0.6875 | 0.8310 | 0.8017 | 0.7903 | 0.8384 | 0.7801 |
| Breast-cancer | 0.9382 | 0.9470 | 0.9401 | 0.9373 | 0.9448 | 0.8347 | 0.9474 |
| Heart-failure | 0.6636 | 0.6579 | 0.6309 | 0.6582 | 0.7307 | 0.7625 | 0.6297 |
| Accent-mfcc | 0.7763 | 0.6975 | 0.7567 | 0.7283 | 0.7434 | 0.6264 | 0.7221 |
| Wine | 0.9400 | 0.9262 | 0.9438 | 0.9362 | 0.9015 | 0.7060 | 0.8600 |
| Processed.cleveland | 0.3635 | 0.3467 | 0.3168 | 0.3373 | 0.3191 | 0.2390 | 0.4421 |
| Diabetes-data-upload | 0.9771 | 0.9540 | 0.9674 | 0.9643 | 0.9469 | 0.9452 | 0.9774 |
| Trial | 1 | 1 | 1 | 1 | 1 | 0.8118 | 1 |
| Hepatitis | 0.5930 | 0.6588 | 0.6238 | 0.6158 | 0.5417 | 0.4433 | 0.7006 |
| Brands | 0.7099 | 0.7514 | 0.7144 | 0.7111 | 0.6026 | 0.5715 | 0.7400 |
| ForestTypes | 0.8426 | 0.7483 | 0.8333 | 0.8331 | 0.8371 | 0.5913 | 0.7500 |
| Wdbc | 0.6387 | 0.6704 | 0.6525 | 0.7192 | 0.6871 | 0.7033 | 0.7072 |
| Lung-cancer | 0.9351 | 0.9264 | 0.9371 | 0.6458 | 0.9215 | 0.8801 | 0.9375 |
| Wpbc | 0.8679 | 0.8762 | 0.8484 | 0.8417 | 0.8679 | 0.8525 | 0.8853 |
| Ionosphere | 0.3738 | 0.3733 | 0.2617 | 0.3375 | 0.5428 | 0.4534 | 0.4738 |
| Qsak | 0.8627 | 0.8366 | 0.8486 | 0.8640 | 0.7822 | 0.7173 | 0.8202 |
| Sonar | 0.7160 | 0.7332 | 0.7714 | 0.7653 | 0.7337 | 0.7104 | 0.7370 |
| Urban land cover | 0.7523 | 0.7618 | 0.7219 | 0.7238 | 0.688 | 0.5086 | 0.7860 |
| Musk1 | 0.803 | 0.7401 | 0.7666 | 0.7864 | 0.7691 | 0.8076 | 0.7319 |
| Clean 1 | 0.7867 | 0.7383 | 0.7782 | 0.7959 | 0.7921 | 0.7746 | 0.7352 |
| Average | 0.7972 | 0.7793 | 0.7839 | 0.7810 | 0.7859 | 0.7283 | 0.7974 |
5.3 Recall
Tables 8 and 9 show the average recall of each feature selection method under the KNN and DT classifiers. As can be seen from Table 8, the proposed method RBEFF achieves the highest recall on 10 datasets under the KNN classifier compared with the other algorithms. It can also be seen from Table 9 that RBEFF achieves the highest recall on 9 datasets under the DT classifier. Thus, it can be concluded that the proposed method performs better on about half of the datasets. The last row of Tables 8 and 9 gives the overall average recall over the 26 datasets using the KNN and DT classifiers for the different algorithms. It can be seen that our algorithm reaches the highest average recall whether the KNN or the DT classifier is used. Thus, it can be concluded that our algorithm RBEFF also achieves excellent performance in terms of recall.
Table 8 Average recall of the datasets under the KNN classifier (values in bold indicate the best result among the methods)
| Dataset | Baseline | ReliefF | MI | LS | SRCFS | FJMI | RBEFF |
| Data.user | 0.8480 | 0.8405 | 0.8426 | 0.9238 | 0.9200 | 0.8579 | 0.8448 |
| Cryotherapy | 0.7679 | 0.8118 | 0.7475 | 0.7593 | 0.8244 | 0.7281 | 0.8116 |
| Ecoli | 0.9703 | 0.9563 | 0.9490 | 0.8734 | 0.9773 | 0.9729 | 0.9726 |
| Seed1 | 0.8817 | 0.8958 | 0.8849 | 0.878 | 0.8806 | 0.8251 | 0.8958 |
| Pima | 0.8344 | 0.8357 | 0.8435 | 0.7743 | 0.6756 | 0.6783 | 0.8238 |
| Biopsy | 0.9650 | 0.9586 | 0.9595 | 0.9223 | 0.9400 | 0.9252 | 0.9494 |
| Glass | 0.8298 | 0.8277 | 0.8609 | 0.8254 | 0.7831 | 0.7682 | 0.8328 |
| Breast-cancer | 0.5474 | 0.9616 | 0.9628 | 0.5478 | 0.5386 | 0.5437 | 0.9775 |
| Heart-failure | 0.2060 | 0.6631 | 0.2000 | 0.2018 | 0.7033 | 0.7962 | 0.6567 |
| Accent-mfcc | 0.8657 | 0.642 | 0.8277 | 0.7489 | 0.5386 | 0.6263 | 0.7541 |
| Wine | 0.6762 | 0.907 | 0.6726 | 0.6784 | 0.686 | 0.6371 | 0.8700 |
| Processed.cleveland | 0.2600 | 0.3551 | 0.2524 | 0.2508 | 0.2483 | 0.2879 | 0.4041 |
| Diabetes-data-upload | 0.8472 | 0.8924 | 0.8624 | 0.8497 | 0.8851 | 0.9371 | 0.9297 |
| Trial | 0.8900 | 1 | 0.9556 | 0.8905 | 0.9012 | 0.8366 | 1 |
| Hepatitis | 0.7337 | 0.9167 | 0.7382 | 0.7504 | 0.5915 | 0.5430 | 0.9416 |
| Brands | 0.7985 | 0.8186 | 0.7961 | 0.7977 | 0.5732 | 0.5820 | 0.7612 |
| ForestTypes | 0.8753 | 0.8227 | 0.8748 | 0.8759 | 0.8713 | 0.6502 | 0.7727 |
| Wdbc | 0.6833 | 0.8021 | 0.7083 | 0.7683 | 0.7071 | 0.7022 | 0.8124 |
| Lung-cancer | 0.9625 | 0.8856 | 0.9563 | 0.7339 | 0.9228 | 0.8945 | 0.9355 |
| Wpbc | 0.7887 | 0.8028 | 0.7649 | 0.798 | 0.7905 | 0.8281 | 0.8312 |
| Ionosphere | 0.8424 | 0.8764 | 0.8584 | 0.8433 | 0.8015 | 0.6742 | 0.8315 |
| Sonar | 0.7359 | 0.7941 | 0.7445 | 0.683 | 0.7619 | 0.7497 | 0.7000 |
| Urban land cover | 0.3481 | 0.7569 | 0.3419 | 0.3484 | 0.4465 | 0.3089 | 0.7751 |
| Musk1 | 0.9326 | 0.8754 | 0.902 | 0.8567 | 0.8481 | 0.8167 | 0.9222 |
| Clean 1 | 0.9259 | 0.8738 | 0.8801 | 0.853 | 0.8477 | 0.8228 | 0.9115 |
| Average | 0.76066 | 0.8309 | 0.7755 | 0.7373 | 0.7466 | 0.7197 | 0.8367 |
Table 9 Average recall of the datasets under the DT classifier (values in bold indicate the best result among the methods)
From the above results we drew a radar chart of the average accuracy, recall and precision of the 26 datasets under the DT and KNN classifiers. As can be seen from Fig. 1, RBEFF performs better than the comparison methods on the three indices of the different classifiers. Thus, we conclude that our algorithm is competitive and advantageous in most cases across the different indices of the different classifiers.
Here we evaluate the stability of the proposed algorithm using box plots of the average accuracy on the datasets on which the RBEFF method performs well. The classifier uses 10-fold cross-validation and is run 10 times for the proposed and comparison algorithms to draw each box plot, from which the stability of the algorithms can be seen. The minimum, maximum, median, first quartile and third quartile of the 10 classification accuracies are shown in each box plot, and the line inside a box represents the median. Figs. 2 and 3 show the box plots of the average accuracy of the KNN and DT classifiers on different datasets, from which the average performance of the different methods can be compared intuitively. As shown in Figs. 2 and 3, the box plots of the proposed RBEFF method are higher than those of the other comparison methods on most datasets, and the proposed method is also more accurate than the other methods. It can also be seen that the boxes of the proposed algorithm are short on most datasets. Thus, we can conclude that the proposed algorithm achieves better stability.
5.5 Selection of the number of features
Tables 10 and 11 show the total average number of features selected on the 26 datasets by the different methods and the reduction ratio, which refers to the ratio of the number of removed features to the total number of features after the feature selection method is applied to a dataset. It can be seen that RBEFF selects the smallest number of features on 14 datasets compared with the other algorithms, and also obtains good results on the other datasets (ranking second or third). It can also be seen from Table 11 that our algorithm removes half or more of the features on 24 datasets. It can be concluded that RBEFF is an effective way to solve the feature selection problem.
We drew bar and line charts of the total average number of selected features and the average reduction ratio of the different methods on the 26 datasets, as shown in Figs. 4 and 5. As can be seen from Fig. 4, RBEFF achieves a total average of 13.0769 selected features, slightly above FJMI; however, our method has a higher average classification accuracy than FJMI. As is clear from Fig. 5, RBEFF achieves the highest average reduction rate in the feature selection process compared with the other methods.
Table 10 number of features selected by different methods on different data sets
Table 11 reduction rates for different method features on different data sets
5.6 Wilcoxon signed rank test
Finally, the significance of the difference between the algorithm RBEFF presented herein and the comparison algorithms was verified by the Wilcoxon signed rank test. Table 12 shows the results of the significance test between the proposed algorithm and the other comparison algorithms on the KNN classifier with the significance level α set to 0.05. From the statistics (Table 12) it can be seen that all p-values are less than 0.05, which indicates that there is a significant difference between the proposed algorithm and each comparison algorithm, i.e., the proposed algorithm RBEFF is significantly better than the other comparison algorithms. The Wilcoxon signed rank test results demonstrate the effectiveness and superiority of RBEFF.
Table 12 Wilcoxon signed rank test of the RBEFF method and the comparison methods on the KNN classifier over the 26 datasets ("yes" means that there is a significant difference between the performance of the two methods)
| Pairwise comparison | p-value | Significant difference |
| RBEFF vs. Baseline | 0.0092333 | yes |
| RBEFF vs. FJMI | 0.0000457 | yes |
| RBEFF vs. SRCFS | 0.0048130 | yes |
| RBEFF vs. LS | 0.0007648 | yes |
| RBEFF vs. MI | 0.0115009 | yes |
| RBEFF vs. ReliefF | 0.0004456 | yes |
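The paired test reported in Table 12 corresponds to SciPy's wilcoxon function; the per-dataset accuracy arrays below are placeholders, not the paper's numbers.

```python
from scipy.stats import wilcoxon

# per-dataset average accuracies of RBEFF and one comparison method (placeholder values)
rbeff_acc = [0.94, 0.83, 0.98, 0.91]
relieff_acc = [0.90, 0.80, 0.97, 0.88]

stat, p_value = wilcoxon(rbeff_acc, relieff_acc)
significant = p_value < 0.05          # alpha = 0.05 as in the paper
```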
6. Conclusion and future work
The proposed method mines the underlying information of a dataset by randomly sampling instances and searching for the k nearest neighbors of a sample in different subspaces, and obtains the feature weights from the sampled instances, so that the information contained in the samples is fully utilized. We use the K-nearest-neighbor and decision tree classifiers to verify the performance of our algorithm through 10-fold cross-validation. To investigate the reliability and quality of the results, we compared our algorithm with commonly used methods, namely the Baseline, ReliefF, LS, MI, FJMI and SRCFS algorithms, on 26 real datasets. The results show that, compared with the other methods, our algorithm achieves the highest classification accuracy with the smallest or an equal number of selected features. The stability analysis shows that our algorithm RBEFF also has an advantage in terms of stability. Finally, the Wilcoxon signed rank test results demonstrate the effectiveness and superiority of RBEFF.
We can apply our algorithm to ultra-high-dimensional or noisy datasets and analyze the impact of different parameters (such as the number of feature partitions M and the number of random subspaces s) on the experimental results, which can be a further extension of this work.
From this, it follows that the main contributions herein are summarized as follows:
(1) A RBEFF method is presented herein that fully considers the diversity of subspaces and the contribution information of samples to features during feature selection.
(2) The feature selection framework of the RBEFF algorithm is able to adaptively select valid features.
(3) A number of comparisons and statistical analyses were performed, including classification accuracy, recall, precision, bar charts, box plots, and the Wilcoxon signed rank test.
(4) In the feature selection process, RBEFF selects the fewest features on most of the 26 datasets, and its average reduction rate is higher than that of the other comparison methods.
(5) The RBEFF algorithm is compared with other algorithms on 26 datasets of different scales, and the experimental results show that RBEFF simultaneously achieves the highest average accuracy, recall and precision on both the KNN and DT classifiers. It can be concluded that RBEFF has advantages and effectiveness in solving the feature selection problem.
The specific embodiment is as follows:
In the first embodiment, for signal recognition, some features of the signal are severely polluted by noise. The ReliefF algorithm is adopted to evaluate the classification capability of the features and extract the features of the signal with stronger classification capability, and then a feature similarity algorithm is adopted to remove the redundant features with similar classification capability. The remaining features with strong classification capability form the feature vector used for classification. Simulation results show that this method can obtain a higher correct recognition rate using fewer features. In transmission line fault location, a set of candidate features is extracted from single voltage measurement data using the RReliefF algorithm, and a regression elimination algorithm is used to select useful and highly relevant candidate features from them in order to accurately obtain an estimate of the fault location; this approach is not affected by current signal measurement errors during fault location and does not face the problems and costs related to transmitting and synchronizing data from two line ends. As one of the most successful feature filtering algorithms in the field of machine learning, the ReliefF algorithm is widely used for feature selection of microarray gene expression data and for processing image data (image classification and management, etc.). The signals may be sound, communication signals, radar signals, etc.
In the second embodiment, for image processing and classification, image data are characterized by few samples and high dimensionality, and feature extraction and selection are the most effective ways to obtain a feature subspace that better reflects the essential structure of the data and yields a higher recognition rate. The RBEFF algorithm mines the underlying structural information of a large number of different random subspaces of the high-dimensional image data and extracts the most effective informative features describing the image, so that the image can be classified accurately. The RBEFF algorithm can thus effectively extract the useful features of images and improve the classification accuracy. Similarly, in gene classification, when faced with high-dimensional gene expression data with small sample sizes, RBEFF can automatically select a small number of disease-related features (genes) and make effective decisions (classification) on the gene data.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the spirit and scope of the invention as defined by the appended claims and their equivalents.