
CN111476100A - Data processing method, device and storage medium based on principal component analysis - Google Patents

Data processing method, device and storage medium based on principal component analysis

Info

Publication number
CN111476100A
Authority
CN
China
Prior art keywords
data
features
sample data
matrix
correlation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010155934.8A
Other languages
Chinese (zh)
Other versions
CN111476100B (en)
Inventor
奚晓钰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and MIGU Culture Technology Co Ltd
Priority to CN202010155934.8A
Publication of CN111476100A
Application granted
Publication of CN111476100B
Legal status: Active (granted)

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172: Classification, e.g. identification
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213: Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F 18/2135: Feature extraction based on approximation criteria, e.g. principal component analysis
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present invention relate to the field of software defect prediction and disclose a data processing method, device and computer-readable storage medium based on principal component analysis. The method includes: performing dimension reduction on initial sample data to obtain sample data of a preset dimension; obtaining multiple features of the sample data and calculating the correlation between each feature and a preset category, where the preset category is one of the multiple categories of the sample data; and removing, from the multiple features, the features whose correlation is less than a preset correlation, and using the remaining features as the discriminative features of the sample data. The data processing method, device and computer-readable storage medium based on principal component analysis provided by the present invention can remove redundant features from sample data and obtain highly discriminative sample data, thereby improving prediction efficiency.

Description

Data processing method, device and storage medium based on principal component analysis

Technical Field

The embodiments of the present invention relate to the field of data processing, and in particular to a data processing method, device and computer-readable storage medium based on principal component analysis.

Background

Information entropy is a measure of the amount of information required to eliminate uncertainty, that is, of the amount of information an unknown event may contain. An event or a system, more precisely a random variable, carries a certain amount of uncertainty. Some random variables are highly uncertain; eliminating that uncertainty requires introducing a great deal of information, and the measure of that information is expressed by "information entropy". The more information that must be introduced to eliminate the uncertainty, the higher the information entropy, and vice versa. A situation that is already highly certain requires almost no additional information, so its information entropy is very low. According to the information entropy formula given by Shannon, for any random variable X, its information entropy, in bits, is defined as H(X) = -∑x∈X P(x)·logP(x). The more uniform the probabilities of the various outcomes in the system, the larger the information entropy, and vice versa.
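As a concrete illustration of Shannon's formula (not part of the original text), a small Python sketch; the sample sequences are hypothetical.

```python
import math
from collections import Counter

def entropy(values, base=2):
    """Shannon entropy H(X) = -sum over x of P(x)*log P(x); bits when base=2."""
    n = len(values)
    return -sum((c / n) * math.log(c / n, base) for c in Counter(values).values())

print(entropy([0, 1, 0, 1]))  # uniform outcomes: maximal entropy, 1.0 bit
print(entropy([0, 0, 0, 1]))  # skewed outcomes: lower entropy, about 0.811 bits
```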

The inventors found that the prior art has at least the following problem: when the features of sample data are analyzed according to the above formula, many redundant features are obtained, so that a model trained on such sample data has low prediction efficiency.

Summary of the Invention

The purpose of the embodiments of the present invention is to provide a data processing method, device and computer-readable storage medium based on principal component analysis that can remove redundant features from sample data and obtain highly discriminative sample data, thereby improving prediction efficiency.

To solve the above technical problem, an embodiment of the present invention provides a data processing method based on principal component analysis, including:

performing dimension reduction on initial sample data to obtain sample data of a preset dimension; obtaining multiple features of the sample data and calculating the correlation between each feature and a preset category, where the preset category is one of the multiple categories of the sample data; and removing, from the multiple features, the features whose correlation is less than a preset correlation, and using the remaining features as the discriminative features of the sample data.

An embodiment of the present invention also provides a data processing device based on principal component analysis, including: at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the data processing method based on principal component analysis described above.

An embodiment of the present invention also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the data processing method based on principal component analysis described above.

Compared with the prior art, the embodiments of the present invention perform dimension reduction on the initial sample data to obtain sample data of a preset dimension, which facilitates the calculations of the subsequent steps and reduces their computational load, thereby improving the efficiency of the data processing method. Multiple features of the sample data are obtained and the correlation between each feature and a preset category is calculated; since the preset category is one of the multiple categories of the sample data, it can be determined from the similarity which of the multiple features are redundant. By removing the features whose correlation is less than the preset correlation and using the remaining features as the discriminative features of the sample data, highly discriminative sample data can be obtained, so that a model trained on this sample data computes faster, thereby improving prediction efficiency.

In addition, after removing the features whose correlation is less than the preset correlation, the method further includes: sorting the remaining features in descending order of correlation; dividing the sorted remaining features into N feature segments, where each feature segment includes M features and N and M are both integers greater than 1; and judging whether there is a feature segment in which all M features are greater than a preset threshold, and, if so, removing the feature with the smallest similarity from that segment.

In addition, performing dimension reduction on the initial sample data specifically includes: converting the initial sample data into a data matrix; calculating the covariance matrix of the data matrix and performing eigendecomposition on it to obtain the eigenvalues of the covariance matrix and the eigenvectors corresponding to the eigenvalues; and obtaining a projection matrix from the eigenvalues and eigenvectors, and reducing the dimension of the initial sample data to the dimension corresponding to the projection matrix.

In addition, obtaining the projection matrix from the eigenvalues and eigenvectors specifically includes: arranging the eigenvectors as rows of a matrix from top to bottom, where the larger the eigenvalue corresponding to an eigenvector, the closer to the top of the matrix the eigenvector is placed; and taking the first k rows to form the projection matrix, where k is an integer greater than 1.

In addition, before calculating the covariance matrix of the data matrix, the method further includes: zero-meaning each row of the data matrix; calculating the covariance matrix of the data matrix then specifically means calculating the covariance matrix of the zero-meaned data matrix.

In addition, the correlation between a feature and the preset category is calculated by the following formula: Si = [X^T×Y + X′^T×Y + X^T×Y′ + X′^T×Y′] + [2×(IG(X|L)) − (H(X)+H(L))] + [2×(IG(Y|L)) − (H(Y)+H(L))], where Si is the similarity; X and Y are two different features of the sample data; X′ is the representation of X in a different dimension and Y′ is the representation of Y in a different dimension; L is the preset category; [X^T×Y + X′^T×Y + X^T×Y′ + X′^T×Y′] represents the discriminative correlation between the representations of X and Y in different dimensions; and [2×(IG(X|L)) − (H(X)+H(L))] + [2×(IG(Y|L)) − (H(Y)+H(L))] represents the correlation of X and Y, respectively, with the preset category.

In addition, the correlation between a feature and the preset category may also be calculated by the following formula: Si = [X^T×Y + X′^T×Y + X^T×Y′ + X′^T×Y′] + λ×[2×(IG(X|L)) − (H(X)+H(L))] + [2×(IG(Y|L)) − (H(Y)+H(L))], where Si is the similarity; X and Y are two different features of the sample data; X′ is the representation of X in a different dimension and Y′ is the representation of Y in a different dimension; L is the preset category; [X^T×Y + X′^T×Y + X^T×Y′ + X′^T×Y′] represents the discriminative correlation between the representations of X and Y in different dimensions; [2×(IG(X|L)) − (H(X)+H(L))] + [2×(IG(Y|L)) − (H(Y)+H(L))] represents the correlation of X and Y, respectively, with the preset category; and λ is a balance constant.

In addition, the initial sample data is image sample data.

Brief Description of the Drawings

One or more embodiments are illustrated by the figures in the corresponding drawings. These illustrations do not limit the embodiments; elements with the same reference numerals in the drawings denote similar elements, and unless otherwise stated, the figures are not drawn to scale.

Fig. 1 is a flowchart of the data processing method based on principal component analysis according to the first embodiment of the present invention;

Fig. 2 is a flowchart of the data processing method based on principal component analysis according to the second embodiment of the present invention;

Fig. 3 is a schematic structural diagram of the data processing device based on principal component analysis according to the third embodiment of the present invention.

Detailed Description

To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the embodiments of the present invention are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate, however, that many technical details are set forth in the embodiments so that the reader may better understand the present invention; the technical solutions claimed in the present invention can nevertheless be realized without these technical details and with various changes and modifications based on the following embodiments.

The first embodiment of the present invention relates to a data processing method based on principal component analysis. The specific flow is shown in Fig. 1 and includes:

S101: Perform dimension reduction on the initial sample data to obtain sample data of a preset dimension.

Specifically, before the dimension reduction is performed, the initial sample data to be processed is acquired in advance. In this embodiment, performing dimension reduction on the initial sample data specifically includes: converting the initial sample data into a data matrix; calculating the covariance matrix of the data matrix and performing eigendecomposition on it to obtain the eigenvalues of the covariance matrix and the eigenvectors corresponding to the eigenvalues; and obtaining a projection matrix from the eigenvalues and eigenvectors, and reducing the dimension of the initial sample data to the dimension corresponding to the projection matrix. Obtaining the projection matrix from the eigenvalues and eigenvectors specifically includes: arranging the eigenvectors as rows of a matrix from top to bottom, where the larger the eigenvalue corresponding to an eigenvector, the closer to the top of the matrix the eigenvector is placed; and taking the first k rows to form the projection matrix, where k is an integer greater than 1. For example, if the matrix of row-arranged eigenvectors has 8 rows, there are 8 eigenvectors in total; if the eigenvalue corresponding to a certain eigenvector is the largest among the eigenvalues of the 8 eigenvectors, that eigenvector is placed in the first row of the matrix, and so on.

It is worth mentioning that, in order to reduce the error of the data matrix and prevent noise in the data matrix from affecting the final analysis result, the method further includes, before calculating the covariance matrix of the data matrix: zero-meaning each row of the data matrix; calculating the covariance matrix of the data matrix then specifically means calculating the covariance matrix of the zero-meaned data matrix.
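As an illustration (not part of the original disclosure), a minimal numpy sketch of the procedure just described: the data matrix holds one attribute per row and one sample per column, each row is zero-meaned, the covariance matrix is eigendecomposed, and the eigenvectors with the largest eigenvalues become the rows of the projection matrix P, so that X′ = PX. The function name pca_reduce and the toy data are assumptions.

```python
import numpy as np

def pca_reduce(X, k):
    """Dimension reduction as described above: rows of X are attributes,
    columns are samples. Zero-mean each row, eigendecompose the covariance
    matrix, and project onto the k eigenvectors with the largest eigenvalues."""
    Xc = X - X.mean(axis=1, keepdims=True)   # zero-mean each row (attribute field)
    C = Xc @ Xc.T / Xc.shape[1]              # covariance matrix of the attributes
    eigvals, eigvecs = np.linalg.eigh(C)     # eigh, since C is symmetric
    order = np.argsort(eigvals)[::-1]        # eigenvalues from largest to smallest
    P = eigvecs[:, order[:k]].T              # top-k eigenvectors, arranged as rows
    return P @ Xc, P                         # X' = PX, plus the projection matrix

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 100))                # toy data: 8 attributes, 100 samples
X_reduced, P = pca_reduce(X, k=3)
print(X_reduced.shape)                       # (3, 100)
```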

It can be understood that this embodiment reduces the dimension of the initial sample data by the PCA method. It should be noted that the prior art usually uses multidimensional scaling (MDS) to reduce the dimension of data samples. MDS is a dimension-reduction method that mines hidden structural information in the data by analyzing similar data; the similarity measure is usually the Euclidean distance. The purpose of the MDS algorithm is therefore to map the data samples to a low-dimensional space while preserving the distances between them as much as possible, thereby reducing the dimension of the samples. MDS is a classic method for preserving Euclidean distances in theory and was first used mainly for data visualization. Since the low-dimensional representation obtained by MDS is centered at the origin, it can also be said to preserve inner products; that is, inner products in the low-dimensional space approximate distances in the high-dimensional space. In classic MDS, the distance in the high-dimensional space is generally the Euclidean distance. Both multidimensional scaling (MDS) and principal component analysis (PCA) are dimension-reduction techniques, but they optimize in different directions. The input of PCA is the original vectors of an n-dimensional space, and the data is projected onto the directions of maximum covariance, so the characteristics of the data are largely preserved during dimension reduction. The input of MDS is the pairwise distances between points, and its output is a projection of the distance-preserved points into two or three dimensions.

In short, PCA minimizes the sample dimension while preserving the covariance of the data, whereas MDS minimizes the sample dimension while preserving the distances between data points. When the data covariance and the Euclidean distances between the high-dimensional data points are consistent, the two methods coincide; when the distance measure differs, the two methods differ. MDS clearly has its limitations, which PCA, as an alternative, can compensate for: PCA has a wider range of applications, and since its input is the original vectors of the n-dimensional space, it simplifies the algorithm on the input side relative to MDS and reduces algorithmic complexity. Most importantly, the PCA method is very widely used for dimension reduction and preprocessing of data in software defect work, and its effect is also better than that of MDS.
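For contrast with the PCA sketch above, here is a sketch of the classical MDS construction characterized in the preceding paragraphs (pairwise distances in, low-dimensional coordinates out, via double-centering and eigendecomposition). This is the textbook algorithm on toy data, not code from the patent.

```python
import numpy as np

def classical_mds(D, dims=2):
    """Classical MDS: from a pairwise Euclidean distance matrix D, recover a
    dims-dimensional embedding whose inner products approximate the original
    geometry (double-centering followed by eigendecomposition)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # Gram matrix from squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:dims]     # keep the largest components
    scale = np.sqrt(np.maximum(eigvals[order], 0.0))
    return eigvecs[:, order] * scale             # point coordinates, one row each

pts = np.random.default_rng(1).normal(size=(10, 5))
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
print(classical_mds(D, dims=2).shape)            # (10, 2)
```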

For ease of understanding, the algorithmic procedure of the PCA method is explained in detail below:

Suppose there are N image training samples, written xk ∈ X (k = 1, ..., N), where X is the training sample set; the training samples fall into c classes, with Ni training samples in class i, and the column vector obtained by unfolding the image matrix of each sample has dimension n. The average sample over all image training samples is expressed as:

m = (1/N)∑ xk (sum over k = 1, ..., N)

The average sample of the i-th class (i = 1, ..., c) of training samples is expressed as:

mi = (1/Ni)∑ xk (sum over the xk belonging to class i)

The specific procedure of the principal component analysis method is as follows. First, the database is read in, and each two-dimensional data image read in is unfolded into a one-dimensional vector; for each class of image samples, a certain number of images can be selected according to a generated random matrix to form the training sample set, and the rest form the test sample set. Next, the generating matrix of the K-L orthogonal transform is computed. This generating matrix can be represented by the total scatter matrix ST of the training samples or by the between-class scatter matrix SB of the training samples; the scatter matrix is generated from the training set and is represented here by the total scatter matrix ST, defined as:

ST = ∑ (xk − m)(xk − m)^T (sum over k = 1, ..., N)

The generating matrix Σ can be expressed as: Σ = ST·ST^T.

Eigenvalue decomposition is then performed: the eigenvalues and eigenvectors of the generating matrix Σ are computed, the eigenvalues are sorted from largest to smallest, and the first m largest eigenvalues, together with their corresponding eigenvectors, are retained. This yields the projection matrix from the high-dimensional space to the low-dimensional space and constructs the feature subspace. In other words, the PCA method based on the K-L transform aims to find a set of optimal projection vectors satisfying the criterion function:

J(w) = w^T·ST·w

The next step is to find the optimal projection vector, i.e., the unit vector w that maximizes the above criterion function. Its physical meaning is that, in the direction represented by the projection vector w, the overall dispersion of the feature vectors obtained by projecting the image vectors is maximal; that is, the distance between each sample of the image data and the average sample of all training samples is maximal. This is because the optimal projection vector computed above is the unit eigenvector corresponding to the largest eigenvalue of the total scatter matrix ST. When the number of sample classes is large, a single optimal projection direction is not sufficient to fully represent the features of all image samples. It is therefore necessary to find a set of optimal projection vectors w1, w2, ..., wm that both maximize the criterion function J(w) = w^T·ST·w and satisfy the orthonormality condition. The optimal projection matrix is then expressed by this set of optimal projection vectors, i.e., P = [w1, w2, ..., wm].

Next, the training samples and the test samples are each projected into the feature subspace obtained above. After each data image is projected into this feature subspace, it corresponds to a point in the subspace; conversely, any point in the feature subspace also corresponds to some data image. The points obtained by projecting data images into these feature subspaces are called "eigenfaces". As the name suggests, the "eigenface" method denotes a method of data recognition via the K-L orthogonal transform.

Finally, all test image samples that have been transformed into the feature subspace by the above vector projection are compared with the training image samples to determine the class to which the data image sample to be recognized belongs. This is the classification of the test samples, which requires choosing a suitable classifier and dissimilarity test formula.

S102: Obtain multiple features of the sample data and calculate the correlation between each feature and a preset category.

Specifically, in this embodiment, the correlation between a feature and the preset category can be calculated by the following formula: Si = [X^T×Y + X′^T×Y + X^T×Y′ + X′^T×Y′] + [2×(IG(X|L)) − (H(X)+H(L))] + [2×(IG(Y|L)) − (H(Y)+H(L))], where Si is the similarity; X and Y are two different features of the sample data; X′ is the representation of X in a different dimension and Y′ is the representation of Y in a different dimension; L is the preset category; [X^T×Y + X′^T×Y + X^T×Y′ + X′^T×Y′] represents the discriminative correlation between the representations of X and Y in different dimensions; and [2×(IG(X|L)) − (H(X)+H(L))] + [2×(IG(Y|L)) − (H(Y)+H(L))] represents the correlation of X and Y, respectively, with the preset category.

It is worth mentioning that, in order to balance the terms of the first part and the second part of the calculation, a balance parameter λ needs to be added. Therefore, this embodiment can also calculate the correlation between a feature and the preset category by the following formula:

Si = [X^T×Y + X′^T×Y + X^T×Y′ + X′^T×Y′] + λ×[2×(IG(X|L)) − (H(X)+H(L))] + [2×(IG(Y|L)) − (H(Y)+H(L))], where Si is the similarity; X and Y are two different features of the sample data; X′ is the representation of X in a different dimension and Y′ is the representation of Y in a different dimension; L is the preset category; [X^T×Y + X′^T×Y + X^T×Y′ + X′^T×Y′] represents the discriminative correlation between the representations of X and Y in different dimensions; [2×(IG(X|L)) − (H(X)+H(L))] + [2×(IG(Y|L)) − (H(Y)+H(L))] represents the correlation of X and Y, respectively, with the preset category; and λ is a balance constant.

It can be understood that, compared with the information entropy formula of the prior art, the formula in this embodiment extends the original method, which computed over the features of a sample alone, so that discriminative feature selection between any two different features of the same sample is more strongly tied to the category. The first bracketed term expresses the discriminative correlation between any two different features of the same sample and their representations in different dimensions; the constraint imposed by this term makes highly correlated features directly and intuitively obtainable, whereas the original method could not: it considered only the correlation of sample features and ignored the more important correlations that exist between sample features. This term makes full use of the relationships between different features of the same sample, including related features and redundant features. Clearly, the larger the value of the first bracketed term, the higher the correlation between the sample features, and the related features can be obtained; conversely, the smaller the value, the higher the redundancy between the sample features, and the redundant features can be effectively removed, making the sample features more discriminative. The last two bracketed terms express the similarity between each of the two different features of the same sample and the category variable; likewise, a larger value indicates a higher similarity, i.e., a higher correlation, between the feature and the category, and a smaller value indicates a lower similarity, i.e., a lower correlation.
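Purely as an illustration, the sketch below evaluates the Si formula on discretized feature vectors, taking IG(X|L) = H(X) − H(X|L). The helper names (H, IG, discriminative_score) and the toy data are assumptions, not the patent's implementation, and the alternative representations X′ and Y′ must be supplied by the caller (here the features themselves are passed as stand-ins).

```python
import math
from collections import Counter
import numpy as np

def H(v):
    """Shannon entropy of a discrete-valued sequence, in bits."""
    n = len(v)
    return -sum((c / n) * math.log2(c / n) for c in Counter(v).values())

def IG(x, label):
    """Information gain IG(x|label) = H(x) - H(x | label)."""
    n = len(x)
    h_cond = sum(
        len(sub) / n * H(sub)
        for lv in set(label)
        for sub in [[xi for xi, li in zip(x, label) if li == lv]]
    )
    return H(x) - h_cond

def discriminative_score(X, Xp, Y, Yp, L, lam=1.0):
    """Si per the formula above: a dot-product part over the two features and
    their alternative representations Xp, Yp, plus entropy-based parts tying
    X and Y to the class vector L; lam is the balance constant (lambda)."""
    part1 = float(X @ Y + Xp @ Y + X @ Yp + Xp @ Yp)
    part2 = 2 * IG(list(X), list(L)) - (H(list(X)) + H(list(L)))
    part3 = 2 * IG(list(Y), list(L)) - (H(list(Y)) + H(list(L)))
    return part1 + lam * part2 + part3

rng = np.random.default_rng(2)
X, Y = rng.integers(0, 3, 20), rng.integers(0, 3, 20)   # discretized toy features
L = rng.integers(0, 2, 20)                              # toy class labels
print(discriminative_score(X, X, Y, Y, L, lam=0.5))     # Xp=X, Yp=Y as stand-ins
```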

S103: Remove the features whose correlation is less than the preset correlation and use the remaining features as the discriminative features of the sample data.

Specifically, the preset correlation can be set according to actual needs; this embodiment does not specifically limit it. This embodiment analyzes the sample data based on principal components. The basic idea of principal component analysis is to extract the main features of a high-dimensional data space while retaining the great majority of the information of the original high-dimensional data, so that the high-dimensional data can be processed in a lower-dimensional feature space. The K-L transform is the basis of principal component analysis. It is an optimal orthogonal transform based on the statistical characteristics of the target. Its purpose is to find a linear projection transformation that makes the new feature components orthogonal or uncorrelated and, in order to concentrate the energy of the data, requires that the error between the feature components reconstructed after projection and the original input samples be minimal in the least-mean-square sense. A low-dimensional approximate representation of the original samples is thereby obtained, which compresses the original data better. Applying the K-L transform to data recognition produced the classic eigenface method (Eigenfaces), which forms the basis of subspace learning methods. In short, from the input training images, a set of eigenface images is obtained by principal component analysis; then, for any given data image, each data image can be represented linearly by this set of eigenface images, i.e., by a weighted linear combination of the eigenface images obtained by principal component analysis.

The essence of principal component analysis is to compute the covariance matrix and diagonalize it. One can assume that all data images lie in a linear low-dimensional space in which all data images are linearly separable, and then apply the principal component analysis method to data feature recognition. Concretely, a K-L transform is performed: the high-dimensional image input space is transformed into a new set of orthogonal bases, the transformed orthogonal bases are screened under certain conditions, some redundant vectors are eliminated, and the vectors with strong feature-discriminating ability are retained to generate the low-dimensional data subspace, i.e., the eigenface subspace of the data. The key to using principal component analysis to reduce the dimension of the input data space is to find the projection that best represents the original data, so as to "denoise" and eliminate "redundant" dimensions while ensuring that the most important features of the original input data are not lost. From the covariance matrix, only the dimensions with relatively large energy (eigenvalues) are kept and the relatively low ones are discarded, so that the important feature information in the input image data is retained while the other parts that do not help data recognition are discarded.

For ease of understanding, a concrete example of how this embodiment processes sample data is given below; a code sketch of the whole pipeline follows the output line at the end of this list:

Input: training sample set X = [X1, X2, ..., Xc], where Xi = (F1, F2, ..., Fm, L), k < m, i = 1...m.

Dimension of the PCA dimension reduction: k

Correlation threshold (preset correlation): β

1) Arrange the original data into a data matrix by columns, unfolding each two-dimensional data image read in into a one-dimensional vector.

2) Zero-mean each row of the data matrix (each row represents an attribute field), i.e., subtract the mean of that row.

3) Compute the covariance matrix.

4) Perform eigenvalue decomposition to obtain the eigenvalues of the covariance matrix and the corresponding eigenvectors.

5) Arrange the eigenvectors as rows of a matrix from top to bottom in descending order of their corresponding eigenvalues, and take the first k rows to form the sample projection matrix P.

6) Reduce the data to the dimension corresponding to the projection matrix, i.e., k: X' = PX is the data reduced to k dimensions. The reduced sample set is X' = [X'1, X'2, ..., X'c], where X'i = (F1, F2, ..., Fk, L), k < m, i = 1...m.

7) For i = 1 to k, j = 1 to k (i ≠ j), compute Si = ISU(Fi, Fi', Fj, Fj', L).

8) Sort the Si in descending order.

9) Take the first g features of the sequence as the features of the new samples, giving the sample set X'' = [X''1, X''2, ..., X''c], where X''i = (F1, F2, ..., Fg, L), g < k, i = 1...m.

10) Perform correlation analysis on each pair of features from back to front, removing the designated features whose correlation exceeds β, to obtain the final sample Y.

Output: sample set Y.
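A compact Python sketch of steps 1) to 10) follows. It is illustrative only and rests on assumptions not in the patent: projected features are discretized by quartiles before the information measures are computed, the pairwise measure ISU of step 7) is replaced by a simpler per-feature relevance score, and the back-to-front pruning of step 10) uses symmetric uncertainty as the correlation; the names pipeline, H and IG are hypothetical.

```python
import math
from collections import Counter
import numpy as np

def H(v):
    n = len(v)
    return -sum((c / n) * math.log2(c / n) for c in Counter(v).values())

def IG(x, l):
    n = len(x)
    return H(x) - sum(
        len(sub) / n * H(sub)
        for lv in set(l)
        for sub in [[xi for xi, li in zip(x, l) if li == lv]]
    )

def pipeline(X, labels, k, g, beta):
    """Steps 1) to 10) in miniature: PCA to k dimensions, score each feature
    against the class labels, keep the g best, then prune from the back any
    feature too correlated (above beta) with an earlier, better one."""
    Xc = X - X.mean(axis=1, keepdims=True)              # steps 1) and 2)
    C = Xc @ Xc.T / Xc.shape[1]                         # step 3)
    w, V = np.linalg.eigh(C)                            # step 4)
    P = V[:, np.argsort(w)[::-1][:k]].T                 # step 5)
    Xk = P @ Xc                                         # step 6): X' = PX
    # Discretize each projected feature by quartiles before entropy measures.
    feats = [list(np.digitize(f, np.quantile(f, [0.25, 0.5, 0.75]))) for f in Xk]
    L = list(labels)
    scores = [2 * IG(f, L) - (H(f) + H(L)) for f in feats]   # step 7), simplified
    kept = list(np.argsort(scores)[::-1][:g])                # steps 8) and 9)
    i = len(kept) - 1                                        # step 10): back to front
    while i > 0:
        a = feats[kept[i]]
        for j in range(i):
            b = feats[kept[j]]
            su = 2 * IG(a, b) / (H(a) + H(b) + 1e-12)   # symmetric uncertainty
            if su > beta:                               # redundant: drop the later one
                del kept[i]
                break
        i -= 1
    return Xk[kept]

rng = np.random.default_rng(3)
X = rng.normal(size=(12, 60))                           # 12 attributes, 60 samples
labels = rng.integers(0, 2, 60)
print(pipeline(X, labels, k=6, g=4, beta=0.9).shape)
```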

Compared with the prior art, the embodiments of the present invention perform dimension reduction on the initial sample data to obtain sample data of a preset dimension, which facilitates the calculations of the subsequent steps and reduces their computational load, thereby improving the efficiency of the data processing method. Multiple features of the sample data are obtained and the correlation between each feature and a preset category is calculated; since the preset category is one of the multiple categories of the sample data, it can be determined from the similarity which of the multiple features are redundant. By removing the features whose correlation is less than the preset correlation and using the remaining features as the discriminative features of the sample data, highly discriminative sample data can be obtained, so that a model trained on this sample data computes faster, thereby improving prediction efficiency.

The second embodiment of the present invention relates to a data processing method based on principal component analysis. The second embodiment is a further improvement on the first embodiment. The specific improvement is that, in the second embodiment, after removing the features whose correlation is less than the preset correlation, the method further includes: sorting the remaining features in descending order of correlation; dividing the sorted remaining features into N feature segments, where each feature segment includes M features and N and M are both integers greater than 1; and judging whether there is a feature segment in which all M features are greater than a preset threshold, and, if so, removing the feature with the smallest similarity from that segment. In this way, redundant features in the sample data can be further reduced, so that the prediction efficiency is further improved.

The specific flow of this embodiment is shown in Fig. 2 and includes:

S201: Perform dimension reduction on the initial sample data to obtain sample data of a preset dimension.

S202: Obtain multiple features of the sample data and calculate the correlation between each feature and a preset category.

S203: Remove the features whose correlation is less than the preset correlation.

S204: Sort the features remaining after the removal in descending order of correlation, and divide the sorted remaining features into N feature segments.

S205: Judge whether there is a feature segment in which all M features are greater than a preset threshold; if so, remove the feature with the smallest similarity from that segment.

For steps S204 to S205, specifically, a threshold-correlation method is used to remove redundant features from the sample data. The threshold-correlation method identifies redundant features by the correlation between features. In actual software metrics, nonlinear relationships exist, so ISU is still chosen here to compute the correlation between a pair of features. The threshold-correlation method uses the preset β (i.e., the preset threshold) as the critical value of correlation: after the features whose correlation is less than the preset correlation have been removed, correlation analysis is performed on the remaining features from back to front, and for every pair of features whose correlation exceeds the critical value, the later feature of the pair is removed from the sample set, and so on. The reason the correlation analysis proceeds from back to front is that, once the low-correlation features have been removed, the remaining features become more and more discriminative from back to front; therefore, when two features with a correlation greater than β are encountered during the back-to-front analysis, the less discriminative feature can be removed first, so that the more discriminative feature is retained.
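To complement the back-to-front pruning sketched earlier, here is a sketch of the segment rule of steps S204 to S205 under stated assumptions: scores holds each remaining feature's correlation, already computed, and segment_prune with its toy values is hypothetical.

```python
import numpy as np

def segment_prune(scores, M, threshold):
    """Second-embodiment rule: sort feature indices by correlation (descending),
    split into segments of M, and in any segment whose M scores all exceed the
    threshold, drop the feature with the smallest score."""
    order = list(np.argsort(scores)[::-1])        # high-to-low by correlation
    kept = []
    for start in range(0, len(order), M):
        seg = order[start:start + M]
        if len(seg) == M and all(scores[i] > threshold for i in seg):
            worst = min(seg, key=lambda idx: scores[idx])
            seg = [i for i in seg if i != worst]  # remove the least similar feature
        kept.extend(seg)
    return kept

scores = np.array([0.95, 0.91, 0.88, 0.62, 0.57, 0.31])   # toy correlation values
print(segment_prune(scores, M=2, threshold=0.5))           # [0, 2, 4, 5]
```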

S206: Use the remaining features as the discriminative features of the sample data.

Steps S201 to S203 and step S206 of this embodiment are similar to steps S101 to S103 of the first embodiment and, to avoid repetition, are not described again here.

Compared with the prior art, the embodiments of the present invention perform dimension reduction on the initial sample data to obtain sample data of a preset dimension, which facilitates the calculations of the subsequent steps and reduces their computational load, thereby improving the efficiency of the data processing method. Multiple features of the sample data are obtained and the correlation between each feature and a preset category is calculated; since the preset category is one of the multiple categories of the sample data, it can be determined from the similarity which of the multiple features are redundant. By removing the features whose correlation is less than the preset correlation and using the remaining features as the discriminative features of the sample data, highly discriminative sample data can be obtained, so that a model trained on this sample data computes faster, thereby improving prediction efficiency.

The third embodiment of the present invention relates to a data processing device based on principal component analysis, as shown in Fig. 3, including:

at least one processor 301; and

a memory 302 communicatively connected to the at least one processor 301, where

the memory 302 stores instructions executable by the at least one processor 301, and the instructions are executed by the at least one processor 301 so that the at least one processor 301 can execute the data processing method based on principal component analysis described above.

The memory 302 and the processor 301 are connected by a bus. The bus may include any number of interconnected buses and bridges and links together the various circuits of the one or more processors 301 and the memory 302. The bus may also link together various other circuits, such as peripherals, voltage regulators and power-management circuits, which are well known in the art and therefore are not described further here. A bus interface provides an interface between the bus and a transceiver. The transceiver may be one element or multiple elements, such as multiple receivers and transmitters, providing a unit for communicating with various other devices over a transmission medium. Data processed by the processor 301 is transmitted over a wireless medium via an antenna; the antenna also receives data and transfers it to the processor 301.

The processor 301 is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interfacing, voltage regulation, power management and other control functions. The memory 302 may be used to store data used by the processor 301 when performing operations.

The fourth embodiment of the present invention relates to a computer-readable storage medium storing a computer program. The computer program, when executed by a processor, implements the above method embodiments.

That is, those skilled in the art will understand that all or some of the steps of the methods in the above embodiments can be implemented by instructing relevant hardware through a program. The program is stored in a storage medium and includes several instructions for causing a device (which may be a microcontroller, a chip, etc.) or a processor to execute all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

Those of ordinary skill in the art will understand that the above embodiments are specific examples of realizing the present invention, and that in practical applications various changes may be made to them in form and detail without departing from the spirit and scope of the present invention.

Claims (10)

1. A data processing method based on principal component analysis is characterized by comprising the following steps:
performing dimensionality reduction on the initial sample data to obtain sample data with preset dimensionality;
obtaining a plurality of characteristics of the sample data, and calculating the correlation degree of each characteristic and a preset category, wherein the preset category is one of a plurality of categories of the sample data;
and removing, from the plurality of features, the features of which the correlation degree is smaller than a preset correlation degree, and taking the remaining features as the identification features of the sample data.
2. The principal component analysis-based data processing method according to claim 1, further comprising, after removing features of the plurality of features for which the degree of correlation is less than a preset degree of correlation:
sorting the remaining features in descending order of the degree of correlation;
dividing the sorted remaining features into N feature segments, wherein each feature segment comprises M features, and N and M are both integers greater than 1;
and judging whether a feature segment exists in which all M features are greater than a preset threshold, and, when it is judged to exist, removing the feature with the minimum similarity from the feature segment.
3. The principal component analysis-based data processing method according to claim 1, wherein the performing dimensionality reduction processing on the initial sample data specifically includes:
converting the initial sample data into a data matrix;
calculating a covariance matrix of the data matrix, and performing characteristic decomposition on the covariance matrix to obtain an eigenvalue of the covariance matrix and an eigenvector corresponding to the eigenvalue;
and obtaining a projection matrix according to the eigenvalue and the eigenvector, and reducing the dimensionality of the initial sample data to the dimensionality corresponding to the projection matrix.
4. The principal component analysis-based data processing method according to claim 3, wherein obtaining a projection matrix from the eigenvalues and the eigenvectors specifically comprises:
arranging the eigenvectors into a matrix in rows from top to bottom, wherein the larger the eigenvalue corresponding to an eigenvector is, the closer to the top of the matrix the eigenvector is located;
and taking the first k rows to form the projection matrix, wherein k is an integer larger than 1.
5. The principal component analysis-based data processing method according to claim 3 or 4, further comprising, before calculating the covariance matrix of the data matrix:
carrying out zero-mean processing on each row of the data matrix;
wherein calculating the covariance matrix of the data matrix specifically includes:
calculating the covariance matrix of the data matrix after the zero-mean processing.
6. The principal component analysis-based data processing method according to claim 1, wherein the degree of correlation of a feature with the preset category is calculated by the following formula:
Si = [X^T×Y + X′^T×Y + X^T×Y′ + X′^T×Y′] + [2×(IG(X|L)) − (H(X)+H(L))] + [2×(IG(Y|L)) − (H(Y)+H(L))];
wherein Si is the similarity; X and Y are two different features of the sample data; X′ is the representation of X in a different dimension and Y′ is the representation of Y in a different dimension; L is the preset category; [X^T×Y + X′^T×Y + X^T×Y′ + X′^T×Y′] represents the discriminative correlation between the representations of X and Y in different dimensions; and [2×(IG(X|L)) − (H(X)+H(L))] + [2×(IG(Y|L)) − (H(Y)+H(L))] represents the correlation between X and Y, respectively, and the preset category.
7. The principal component analysis-based data processing method according to claim 1, wherein the degree of correlation of a feature with the preset category is calculated by the following formula:
Si = [X^T×Y + X′^T×Y + X^T×Y′ + X′^T×Y′] + λ×[2×(IG(X|L)) − (H(X)+H(L))] + [2×(IG(Y|L)) − (H(Y)+H(L))];
wherein Si is the similarity; X and Y are two different features of the sample data; X′ is the representation of X in a different dimension and Y′ is the representation of Y in a different dimension; L is the preset category; [X^T×Y + X′^T×Y + X^T×Y′ + X′^T×Y′] represents the discriminative correlation between the representations of X and Y in different dimensions; [2×(IG(X|L)) − (H(X)+H(L))] + [2×(IG(Y|L)) − (H(Y)+H(L))] represents the correlation between X and Y, respectively, and the preset category; and λ is a balance constant.
8. The principal component analysis-based data processing method according to claim 1, wherein the initial sample data is image sample data.
9. A data processing apparatus based on principal component analysis, comprising: at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a principal component analysis-based data processing method according to any one of claims 1 to 8.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the principal component analysis-based data processing method according to any one of claims 1 to 8.
CN202010155934.8A 2020-03-09 2020-03-09 Data processing method, device and storage medium based on principal component analysis Active CN111476100B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010155934.8A CN111476100B (en) 2020-03-09 2020-03-09 Data processing method, device and storage medium based on principal component analysis

Publications (2)

Publication Number Publication Date
CN111476100A true CN111476100A (en) 2020-07-31
CN111476100B CN111476100B (en) 2023-11-14

Family

ID=71748104

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010155934.8A Active CN111476100B (en) 2020-03-09 2020-03-09 Data processing method, device and storage medium based on principal component analysis

Country Status (1)

Country Link
CN (1) CN111476100B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111914954A (en) * 2020-09-14 2020-11-10 中移(杭州)信息技术有限公司 Data analysis method, device and storage medium
CN112528893A (en) * 2020-12-15 2021-03-19 南京中兴力维软件有限公司 Abnormal state identification method and device and computer readable storage medium
CN113177879A (en) * 2021-04-30 2021-07-27 北京百度网讯科技有限公司 Image processing method, image processing apparatus, electronic device, and storage medium
CN114897261A (en) * 2022-06-02 2022-08-12 广东电网有限责任公司 Building comprehensive energy power prediction model construction method, prediction method and gateway
CN115730592A (en) * 2022-11-30 2023-03-03 贵州电网有限责任公司信息中心 Power grid redundant data elimination method, device, equipment and storage medium

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080080745A1 (en) * 2005-05-09 2008-04-03 Vincent Vanhoucke Computer-Implemented Method for Performing Similarity Searches
CN101021897A (en) * 2006-12-27 2007-08-22 中山大学 Two-dimensional linear discrimination human face analysis identificating method based on interblock correlation
US20120121142A1 (en) * 2009-06-09 2012-05-17 Pradeep Nagesh Ultra-low dimensional representation for face recognition under varying expressions
US20150310308A1 (en) * 2012-11-27 2015-10-29 Tencent Technology (Shenzhen) Company Limited Method and apparatus for recognizing client feature, and storage medium
CN103020640A (en) * 2012-11-28 2013-04-03 金陵科技学院 Facial image dimensionality reduction classification method based on two-dimensional principal component analysis
CN103942572A (en) * 2014-05-07 2014-07-23 中国标准化研究院 Method and device for extracting facial expression features based on bidirectional compressed data space dimension reduction
CN105138972A (en) * 2015-08-11 2015-12-09 北京天诚盛业科技有限公司 Face authentication method and device
CN106845397A (en) * 2017-01-18 2017-06-13 湘潭大学 A kind of confirming face method based on measuring similarity
CN109784668A (en) * 2018-12-21 2019-05-21 国网江苏省电力有限公司南京供电分公司 A kind of sample characteristics dimension-reduction treatment method for electric power monitoring system unusual checking
CN109981335A (en) * 2019-01-28 2019-07-05 重庆邮电大学 The feature selection approach of combined class uneven traffic classification
CN109978023A (en) * 2019-03-11 2019-07-05 南京邮电大学 Feature selection approach and computer storage medium towards higher-dimension big data analysis

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YAN XU et al.: "A new algorithm of face detection based on differential images and PCA in color image", 2009 2nd IEEE International Conference on Computer Science and Information Technology, pages 172-176 *
YU DALONG: "Research on data dimensionality reduction algorithms based on feature selection", China Master's Theses Full-text Database, Information Science and Technology, no. 08, pages 138-317 *
QI YINGCHUN et al.: "Image feature dimensionality reduction based on FPCA and ReliefF algorithms", Journal of Jilin University (Science Edition), no. 05, pages 153-158 *

Also Published As

Publication number Publication date
CN111476100B (en) 2023-11-14

Similar Documents

Publication Publication Date Title
CN111476100B (en) Data processing method, device and storage medium based on principal component analysis
US11294624B2 (en) System and method for clustering data
US8209269B2 (en) Kernels for identifying patterns in datasets containing noise or transformation invariances
Chen et al. Feature-aware label space dimension reduction for multi-label classification
Wan et al. Adaptive similarity embedding for unsupervised multi-view feature selection
Husain et al. Improving large-scale image retrieval through robust aggregation of local descriptors
Sumithra et al. A review of various linear and non linear dimensionality reduction techniques
US11301509B2 (en) Image search system, image search method, and program
Li et al. Discriminative multi-view interactive image re-ranking
Blanco Valencia et al. Kernel-based framework for spectral dimensionality reduction and clustering formulation: A theoretical study
Kim et al. Sequential spectral learning to hash with multiple representations
CN112163114B (en) Image retrieval method based on feature fusion
Ramachandran et al. Evaluation of dimensionality reduction techniques for big data
CN105469117B (en) A kind of image-recognizing method and device extracted based on robust features
US8412757B2 (en) Non-negative matrix factorization as a feature selection tool for maximum margin classifiers
KR101435010B1 (en) Method for learning of sequential binary code using features and apparatus for the same
CN110083731B (en) Image retrieval method, device, computer equipment and storage medium
Huo et al. Ensemble of sparse cross-modal metrics for heterogeneous face recognition
Damasceno et al. Independent vector analysis with sparse inverse covariance estimation: An application to misinformation detection
Húsek et al. Comparison of neural network boolean factor analysis method with some other dimension reduction methods on bars problem
Li et al. A fast incremental spectral clustering algorithm with cosine similarity
Ruffini et al. Hierarchical methods of moments
CN111783816A (en) Feature selection method and device, multimedia and network data dimension reduction method and equipment
ul Haq et al. Hyperspectral data classification via sparse representation in homotopy
Sofuoglu et al. Tensor-train discriminant analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant