Disclosure of Invention
The present invention aims to solve at least the technical problems existing in the prior art, and in particular creatively provides a parallel deep convolutional neural network optimization method based on Im2col.
In order to achieve the above object of the present invention, the present invention provides a parallel deep convolutional neural network optimization method based on Im2col, comprising the steps of:
S1, feature parallel extraction, namely extracting target features in the data to serve as the input of the convolutional neural network, thereby alleviating the problem of redundant data features;
S2, model parallel training, namely completing distributed convolution kernel pruning and multi-node convolution calculation through an IM-PMTS strategy during the convolution process of the parallel DCNN model training stage, and training the model in parallel by combining the MapReduce and Im2col methods, so that the operation speed of the convolution layer is improved;
S3, parameter parallel updating, namely adopting an IM-BGDS strategy to update the parameters for the batch data in the back propagation stage, wherein the strategy applies a gradient descent method that excludes abnormal data points from the batch data, thereby avoiding the influence of abnormal data points on the batch gradient;
S4, inputting the data to be tested into the DCNN model with the parameters updated in parallel, and outputting a classification result.
Further, step S1 adopts an MHO-PFES strategy to carry out feature parallel extraction, and the MHO-PFES strategy comprises the following steps:
S1-1, feature extraction, namely filtering the input data by adopting an improved non-local mean filter, calculating the Laplacian h(x, y) of the filtered data, and searching for zero crossings of the Laplacian to extract the data features;
S1-2, feature screening, namely proposing a feature correlation index FCI(u, v) to compare the similarity between any two data blocks so as to further screen the target features, setting a correlation threshold ε, and reducing the redundant features in the data by removing the data blocks with FCI(u, v) < ε.
Further, the improved non-local mean filter FT(a, b) comprises:
wherein a represents a target window matrix;
b represents a neighborhood window matrix;
θ(·) is the feature transformation function;
g_i is the current data;
vec(a) and vec(b) are the vectorized representations of the matrices a and b, respectively;
|·| represents the modulus of a vector.
Further, the feature correlation index FCI(u, v) comprises:
wherein μ_u and μ_v represent the expectations of u and v, respectively;
σ_u and σ_v represent the variances of u and v, respectively;
u and v represent two feature vectors, respectively.
Further, the IM-PMTS strategy in S2 comprises the following steps:
S2-1, convolution kernel pruning, namely designing a Mahalanobis distance center value MDCV, searching for the vectors linearly related to the convolution kernels in the network model by solving the MDCV value, calculating the distance dist between these vectors and each convolution kernel, and reducing the redundant parameters in the network model by setting a threshold α and pruning the convolution kernels with dist < α;
S2-2, parallel Im2col convolution, namely mapping the feature map into a matrix by using the Im2col algorithm, storing the matrix and its corresponding convolution kernels as key-value pairs, distributing them to the computing nodes to perform the matrix operations so as to accelerate the operation of the convolution layer, obtaining the operation result of the convolution layer, and storing the result in the HDFS.
Further, the Mahalanobis distance center value MDCV comprises:
wherein μ represents the mean of all convolution kernels;
S represents the covariance matrix of all convolution kernels;
R_n is the set of convolution kernels in the same layer of the model, R_n = {X_1, X_2, ..., X_n}; x ∈ R_n, that is, x takes any element of the convolution kernel set {X_1, X_2, ..., X_n}, where X_1, X_2, ..., X_n represent the convolution kernels in the network model;
T represents the transpose.
Further, the IM-BGDS strategy comprises the following steps:
S3-1, gradient construction, namely proposing a loss average weight LAW(g_i) to eliminate the influence of abnormal data on the batch gradient, and designing a loss sum gradient LSG(T) to construct the average gradient of the batch data, thereby solving the problem of poor convergence of the loss function;
S3-2, parameter parallel updating, namely, after obtaining the average gradient of the batch data, calculating the errors in parallel by combining the MapReduce computing framework with the back-propagation error conduction formula, thereby realizing parallel updating of the parameters.
Further, the loss average weight LAW(g_i) comprises:
wherein:
LAD(g_i) is the absolute value of the difference between the loss function value of data g_i and the mean of the loss function values;
g_i represents one piece of data in the batch;
τ is the threshold for LAD(g_i);
batch_size represents the batch data size;
J(ω, b)_i represents the loss function value of data g_i;
ω and b are the convolution kernel parameters and the bias of the convolution layer, respectively.
Further, the loss sum gradient LSG(T) comprises:
where batch_size represents the batch data size;
∂J(ω, b)_i/∂x represents the gradient of the loss function of data g_i with respect to the parameter x;
T represents all the data in the batch;
LAW(g_i) is the weight indicator of the loss function value of data g_i.
In summary, owing to the adoption of the above technical scheme, the MHO-PFES strategy alleviates the problem of redundant data features, the IM-PMTS strategy improves the operation speed of the convolution layer, and the IM-BGDS strategy eliminates the influence of abnormal data on the batch gradient and solves the problem of poor convergence of the loss function.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
The invention provides a parallel deep convolutional neural network optimization method based on Im2col, which comprises the following steps:
S1, extracting target features in medical image data in parallel as input of a convolutional neural network;
S2, model parallel training, namely completing distributed convolution kernel pruning and multi-node convolution calculation through an IM-PMTS strategy in the convolution process of a parallel DCNN model training stage, and combining a MapReduce method and an Im2col method to train the model in parallel;
S3, updating parameters in parallel, namely adopting an IM-BGDS strategy to update the parameters for the batch medical image data in the back propagation stage;
S4, inputting the medical image data to be tested into the DCNN model with the parameters updated in parallel, and outputting the classification result of the medical image.
Based on the advantages of the MapReduce programming model, the invention provides a parallel deep convolutional neural network optimization algorithm IA-PDCNNOA based on the Im2col algorithm. Firstly, a parallel feature extraction strategy MHO-PFES (Parallel Feature Extraction Strategy based on the Marr-Hildreth operator) is proposed, which extracts the target features in the data to serve as the input of the convolutional neural network and effectively alleviates the problem of redundant data features. Secondly, a parallel model training strategy IM-PMTS (Parallel Model Training Strategy based on the Im2col method) is designed, which removes redundant convolution kernels by designing a Mahalanobis distance center value and trains the model in parallel by combining the MapReduce and Im2col methods, improving the operation speed of the convolution layer. Finally, an improved mini-batch gradient descent strategy IM-BGDS (Improved Mini-Batch Gradient Descent Strategy) is proposed, which eliminates the influence of abnormal data on the batch gradient and solves the problem of poor convergence of the loss function. The algorithm provided by the invention brings a remarkable improvement in operation efficiency and model accuracy; in addition, the knowledge mined by the method can be of great help in biology, medicine and astrophysics.
1. Feature parallel extraction
At present, parallel DCNN algorithms in the big data environment suffer from the problem of redundant data features during model training. In order to solve this problem, the MHO-PFES strategy based on the Marr-Hildreth operator is proposed, which mainly comprises two steps: (1) feature extraction, namely filtering the input data with the improved non-local mean filter FT(a, b) (filter transformation), calculating the Laplacian h(x, y) of the filtered data, and searching for zero crossings of the Laplacian to extract the data features; and (2) feature screening, namely, in order to further screen the target features, proposing the feature correlation index FCI(u, v) to compare the similarity between any two data blocks, setting a correlation threshold ε, and reducing the redundant features in the data by removing the data blocks with FCI(u, v) < ε.
1.1 Feature extraction
In order to acquire high-precision data features, noise removal is first carried out on the initial data set: a non-local mean filter FT(a, b) based on cosine similarity is proposed, which removes data noise through the self-similarity of different regions of the data. A Laplacian operation is then performed on the convolution kernel f(x, y) and the data g(x, y), and the zero crossings of the resulting Laplacian are located to extract the data features. The specific process is as follows. Firstly, a target window matrix a and a neighborhood window matrix b are set; the neighborhood window slides over the current data, the weight of each neighborhood window is obtained from the cosine similarity of the matrices a and b, and the data are denoised according to these weights and the gray value of each point to obtain the denoised image g(x, y). Then a convolution kernel f(x, y) of size 3*3 is set, and the Laplacian h(x, y) = ∂²[f(x, y)*g(x, y)]/∂x² + ∂²[f(x, y)*g(x, y)]/∂y² is obtained through the Laplacian operation, where g(x, y) denotes the pixel value of the image at (x, y). Finally, it is judged whether the current node is a zero crossing of the second derivative; if this condition is met and the first derivative at the node shows a sufficiently large peak, the node is retained, otherwise the pixel is set to zero. The retained data nodes are merged to obtain the data after feature extraction. Generally, for the non-local mean denoising algorithm, the data refers to image data.
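The Laplacian filtering and zero-crossing detection of this step can be illustrated with a short Python/NumPy sketch; the 3*3 Laplacian kernel, the peak threshold and the function name are illustrative assumptions, and the non-local mean denoising by FT(a, b) is assumed to have been applied to img beforehand.

import numpy as np
from scipy.ndimage import convolve

def marr_hildreth_features(img, peak_thresh=0.1):
    """Illustrative feature extraction: Laplacian filtering followed by
    zero-crossing detection, in the spirit of the Marr-Hildreth operator."""
    # 3x3 Laplacian kernel f(x, y); an assumed choice, not the patented one.
    f = np.array([[0, 1, 0],
                  [1, -4, 1],
                  [0, 1, 0]], dtype=float)
    h = convolve(img.astype(float), f, mode="nearest")   # h(x, y)

    # Zero crossings of h: sign changes between vertical/horizontal neighbours.
    sign = np.sign(h)
    zc = np.zeros_like(h, dtype=bool)
    zc[:-1, :] |= sign[:-1, :] * sign[1:, :] < 0
    zc[:, :-1] |= sign[:, :-1] * sign[:, 1:] < 0

    # Keep only zero crossings where the local gradient magnitude (first
    # derivative) shows a sufficiently large peak; other pixels are set to zero.
    gy, gx = np.gradient(img.astype(float))
    grad_mag = np.hypot(gx, gy)
    return np.where(zc & (grad_mag > peak_thresh), img, 0)

The retained pixels play the role of the merged data nodes described above.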
Theorem 1 (cosine-similarity-based non-local mean filter FT(a, b)). It is known that a represents the target window matrix and b represents the neighborhood window matrix, with a and b taken from the current data. The calculation formula of the transformation function FT(a, b) is as follows:
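One plausible explicit form, consistent with the symbol definitions that follow and with the cosine-similarity weighting described in the procedure above, is the following reconstruction (the exact composition with θ(·) is an assumption):

FT(a, b) = \theta\left( \frac{\mathrm{vec}(a) \cdot \mathrm{vec}(b)}{\left|\mathrm{vec}(a)\right|\,\left|\mathrm{vec}(b)\right|} \right), \qquad a, b \subset g_i

Under this reading, FT(a, b) is the weight assigned to the neighborhood window b, and the denoised value of the target window a in g_i is the FT-weighted average of the gray values of its neighborhood windows, as described in the procedure above.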
wherein θ(·) is a feature transformation function, which may be, for example, a linear kernel function or a Gaussian kernel function; g_i is the current data; vec(a) and vec(b) are the vectorized representations of the matrices a and b, respectively; and |·| represents the modulus of a vector.
Proof. The non-local mean filtering principle exploits the non-correlation property of the noise. Let the value of a noise-free pixel block be ω(p, q) and the noise be ψ(p, q), so that the value of a pixel block fused with noise is ρ(p, q) = ω(p, q) + ψ(p, q). Averaging the similar pixel blocks after they are overlapped gives ρ̄(p, q) = (1/k) Σ_{i=1}^{k} ρ_i(p, q), where ρ_i(p, q) represents the pixel value of the i-th pixel block fused with noise and k is the total number of pixel blocks. The expectation of ρ̄(p, q) is E[ρ̄(p, q)] = (1/k) Σ_{i=1}^{k} E[ω_i(p, q)] + E[ψ(p, q)]. Due to the similarity of the pixel blocks, E[ω_i(p, q)] can be reduced to ω(p, q), and when the noise has zero mean, E[ψ(p, q)] = 0, so E[ρ̄(p, q)] = ω(p, q). Furthermore, due to the uncorrelation of the noise, the variance of ρ̄(p, q) is D[ρ̄(p, q)] = D[ω(p, q)] + D[ψ(p, q)]/k; since ω(p, q) is noiseless, its variance is 0, so D[ρ̄(p, q)] = D[ψ(p, q)]/k. This shows that the residual noise ψ(p, q) is governed by this variance term, and FT(a, b) reduces the data noise by reducing ψ(p, q). This completes the proof.
1.2 Feature screening
After feature extraction is completed, the strategy cuts the data in a batch into blocks, proposes the feature correlation index FCI(u, v) to calculate the feature similarity between any two data blocks, and then removes the data blocks with FCI(u, v) < ε so as to remove redundant features in the data. The specific process is as follows. Firstly, data of the same class are divided into a batch, the data in the batch are cut into data blocks of the same size, and each data block is numbered sequentially; the feature correlation index FCI(u, v) between any two data blocks is calculated, and the key-value pairs <(u, v), FCI(u, v)> are stored in the HDFS. Then the correlation threshold ε is set, the key-value pairs <(u, v), FCI(u, v)> are traversed sequentially, and the entries with FCI(u, v) < ε are removed. Finally, the remaining key-value pairs are traversed again, the data blocks referenced by their keys are collected as the target feature blocks, and the screened data are used as the input of the convolutional neural network, which completes the feature screening.
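The screening step can be illustrated with the following Python/NumPy sketch; the block size, the threshold eps, the Pearson-style correlation used as FCI and the keep-or-drop rule applied to each block are assumptions made for illustration.

import itertools
import numpy as np

def fci(u, v):
    """Assumed Pearson-style feature correlation index between two blocks."""
    u, v = u.ravel().astype(float), v.ravel().astype(float)
    su, sv = u.std(), v.std()
    if su == 0 or sv == 0:          # degenerate block: treat as uncorrelated
        return 0.0
    return float(np.mean((u - u.mean()) * (v - v.mean())) / (su * sv))

def screen_blocks(data, block=8, eps=0.2):
    """Cut a 2-D array into block x block tiles and drop tiles whose best
    correlation with any other tile falls below eps (illustrative rule)."""
    h, w = data.shape
    tiles = [data[i:i + block, j:j + block]
             for i in range(0, h - block + 1, block)
             for j in range(0, w - block + 1, block)]
    scores = {k: 0.0 for k in range(len(tiles))}
    for i, j in itertools.combinations(range(len(tiles)), 2):
        c = fci(tiles[i], tiles[j])
        scores[i] = max(scores[i], c)
        scores[j] = max(scores[j], c)
    return [tiles[k] for k, s in scores.items() if s >= eps]

In the distributed setting, the pairwise FCI values would instead be emitted as <(u, v), FCI(u, v)> key-value pairs and filtered on the HDFS, as described above.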
Theorem 2 (feature correlation index FCI(u, v)). It is known that u and v represent two feature vectors, μ_u and μ_v represent the expectations of u and v, and σ_u and σ_v represent the variances of u and v, respectively. The calculation formula of the feature correlation index FCI(u, v) is as follows:
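A plausible explicit form, consistent with the symbol definitions and with the behaviour described in the proof below, is the following reconstruction (it treats σ_u and σ_v as the standard deviations of u and v, which is an assumption):

FCI(u, v) = \frac{E\left[(u - \mu_u)(v - \mu_v)\right]}{\sigma_u \, \sigma_v}, \qquad FCI(u, v) := 0 \ \text{when} \ \sigma_u \sigma_v = 0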
Proof. FCI(u, v) is an index measuring the feature similarity between u and v. Let μ_u and μ_v represent the expectations of u and v, and σ_u and σ_v the variances of u and v. When σ_u = 0 for the feature vector u, the operation of the convolution process on u amounts to linear superposition and no features can be extracted, so FCI(u, v) = 0; when σ_u ≠ 0, σ_v ≠ 0 and the features of the feature vectors u and v are similar, FCI(u, v) → 1, where → denotes approach. This completes the proof.
2. Model parallel training
In current DCNN algorithms in the big data environment, parallel training of the model requires distributing the feature maps and convolution kernels to different computing nodes for operation. However, while constructing the parallel convolution operation, such algorithms find it difficult to screen out the redundant convolution kernels scattered over the nodes, so the problem of the low operation speed of the convolution layer cannot be solved in the big data environment. In order to solve this problem, the IM-PMTS strategy is proposed, which mainly comprises: (1) convolution kernel pruning, namely designing a Mahalanobis distance center value (MDCV), searching for the vectors linearly related to the convolution kernels in the network model by solving the MDCV value, calculating the distance dist between these vectors and each convolution kernel, and reducing the redundant parameters in the network model by setting a threshold α and pruning the convolution kernels with dist < α; and (2) parallel Im2col convolution, namely mapping the feature maps into matrices with the Im2col algorithm, storing each matrix and its corresponding convolution kernels as key-value pairs, distributing them to the computing nodes to accelerate the operation of the convolution layer, obtaining the operation results of the convolution layer, and storing the results in the HDFS (Hadoop distributed file system).
2.1 Convolution kernel pruning
In order to reduce the invalid calculation produced by redundant convolution kernels in the convolutional neural network, the Mahalanobis distance center value MDCV is designed to screen out the redundant convolution kernels in the current convolution layer and thereby further accelerate the operation of the convolution layer. The specific process is as follows. Firstly, the covariance matrix S and the mean μ of all convolution kernels X_1, X_2, ..., X_n of the convolution layer are calculated to construct the objective function f(x) of the MDCV. Then the second-order Taylor expansion of f(x) at the point x_k is calculated, f(x) ≈ f(x_k) + ∇f(x_k)^T (x − x_k) + ½ (x − x_k)^T ∇²f(x_k)(x − x_k), where ∇ denotes the nabla operator and (·)^T denotes the transpose. If the current second-derivative matrix is not singular, the next iteration point is x_{k+1} = x_k − [∇²f(x_k)]^{-1} ∇f(x_k); if it is singular, the linear system ∇²f(x_k) d_k = −∇f(x_k) is first solved to obtain the update direction. Finally, the distances dist from all convolution kernels in the convolution layer to the MDCV value are calculated, the threshold α is set, and the convolution kernels with dist < α are pruned to complete the convolution kernel pruning process. Here k is the number of search iterations.
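A simplified Python/NumPy sketch of this pruning rule is given below; it scores each flattened kernel by its Mahalanobis distance to the kernel mean, which is an illustrative stand-in for the iteratively solved MDCV, and the threshold alpha and the regularization term are assumptions.

import numpy as np

def prune_kernels(kernels, alpha=0.5):
    """kernels: array of shape (n, k, k, c). Returns the kernels kept after
    Mahalanobis-distance-based pruning (simplified illustration)."""
    flat = kernels.reshape(len(kernels), -1)            # one row per kernel
    mu = flat.mean(axis=0)
    S = np.cov(flat, rowvar=False) + 1e-6 * np.eye(flat.shape[1])  # regularized covariance
    S_inv = np.linalg.inv(S)
    diffs = flat - mu
    dist = np.sqrt(np.einsum('ij,jk,ik->i', diffs, S_inv, diffs))  # Mahalanobis distances
    keep = dist >= alpha        # kernels too close to the center are treated as redundant
    return kernels[keep]

Replacing the closed-form distance to the mean by the Newton iteration described above only changes how the center is obtained; the pruning rule dist < α stays the same.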
Theorem 3 (Mahalanobis distance center value MDCV). It is known that X_1, X_2, ..., X_n represent the convolution kernels in the network model, S represents the covariance matrix of all convolution kernels, and μ represents the mean of all convolution kernels. The calculation formula of the Mahalanobis distance center value MDCV is as follows:
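A plausible form, consistent with the symbols above and with the proof below, is the following reconstruction (reading the MDCV as the smallest Mahalanobis distance, with respect to S and μ, attained over the kernel set R_n is an assumption):

MDCV = \min_{x \in R_n} \sqrt{(x - \mu)^{T} S^{-1} (x - \mu)}

Under this reading, the kernel attaining the minimum acts as the center, and dist in step S2-1 is the Mahalanobis distance from each remaining kernel to that center.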
where R_n is the set of convolution kernels of the same layer of the model and T represents the transpose.
Proof. The MDCV is the minimum distance from a feature vector x to the feature vector set X_1, X_2, ..., X_n, where S is the covariance matrix of the vector set X_1, X_2, ..., X_n and μ is the mean of the vector set. The covariance matrix S is introduced to exclude the interference of the correlation between the variables. The closer the feature vector x is to the MDCV value, the more easily x can be replaced by the feature vector set; when x = MDCV, x and X_1, X_2, ..., X_n are linearly related, and the MDCV value represents the minimum distance from the feature vector x* to the feature vector set X_1, X_2, ..., X_n. This completes the proof.
2.2 Parallel Im2col convolutions
After convolution kernel pruning is completed, the Im2col parallel convolution operation can be realized by combining the MapReduce computing framework. The specific process is as follows. Firstly, each input feature map M_i is mapped into a convolution calculation matrix I_i by the Im2col method, and each mapping matrix I_i and its corresponding convolution kernels are stored as a key-value pair <I_i, K_z>, where K_z represents a convolution kernel corresponding to the convolution calculation matrix I_i, the correspondence being many-to-many. Then the Map() function is called, and matrix multiplication is carried out between the matrix I_i in the key-value pair and the one-dimensional vector of the corresponding convolution kernel to obtain an intermediate convolution result. Finally, the Reduce() function is called to combine the feature maps belonging to the same piece of data to obtain the final output feature map NM_i.
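The Im2col mapping and the resulting matrix-multiplication convolution can be sketched in Python/NumPy as follows (stride 1, no padding, a single input feature map; the function names are illustrative):

import numpy as np

def im2col(feature_map, kh, kw):
    """Unfold a (C, H, W) feature map into a matrix whose columns are the
    flattened receptive fields, so convolution becomes a matrix product."""
    C, H, W = feature_map.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((C * kh * kw, out_h * out_w), dtype=feature_map.dtype)
    idx = 0
    for i in range(out_h):
        for j in range(out_w):
            patch = feature_map[:, i:i + kh, j:j + kw]
            cols[:, idx] = patch.ravel()
            idx += 1
    return cols, out_h, out_w

def conv_by_im2col(feature_map, kernels):
    """kernels: (N, C, kh, kw). Returns the (N, out_h, out_w) output maps."""
    N, C, kh, kw = kernels.shape
    cols, out_h, out_w = im2col(feature_map, kh, kw)
    K = kernels.reshape(N, -1)          # each kernel as a one-dimensional vector
    out = K @ cols                      # the Map()-side matrix multiplication
    return out.reshape(N, out_h, out_w)

For example, conv_by_im2col(np.random.rand(3, 32, 32), np.random.rand(8, 3, 3, 3)) returns eight 30*30 output maps; in the parallel setting, each key-value pair <I_i, K_z> shipped to a computing node corresponds to exactly this matrix product, and the Reduce() step merges the per-node results.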
3. Parameter parallel update
Current parallel DCNN algorithms under big data adopt the stochastic gradient descent method or the batch gradient descent method to update the parameters during back propagation. However, while realizing gradient descent, training the DCNN model on abnormal data (mislabeled data, noisy data, etc.) may cause the loss function to oscillate as it converges, resulting in poor convergence of the loss function. In order to solve this problem, the IM-BGDS strategy is proposed, which mainly comprises two steps: (1) gradient construction, namely proposing the loss average weight LAW(g_i) to eliminate the influence of abnormal data on the batch gradient and designing the loss sum gradient LSG(T) to construct the average gradient of the batch data, thereby solving the problem of poor convergence of the loss function; and (2) parameter parallel updating, namely, after obtaining the average gradient of the batch data, calculating the errors in parallel by combining the MapReduce computing framework with the back-propagation error conduction formula, thereby realizing parallel updating of the parameters.
(1) Gradient construction
In order to eliminate the influence of abnormal data on the batch gradient, the loss average weight LAW(g_i) and the loss sum gradient LSG(T) are designed to solve the problem of poor convergence of the loss function. The specific process is as follows. Firstly, when the parameters are updated, the mean of the loss function over the whole batch is calculated and differenced with the loss function value of each piece of data g_i to construct the loss average weight LAW(g_i), and the key-value pair <g_i, LAW(g_i)> is stored in the HDFS. Then the partial derivative ∂J(ω, b)_i/∂δ_z of the loss function of each piece of data g_i with respect to the current parameter δ_z is calculated, and the key-value pair <g_i, ∂J(ω, b)_i/∂δ_z> is stored in the HDFS, with batch_size set to 1 in LAW(g_i). Finally, the key-value pairs <g_i, LAW(g_i)> and, with g_i as the index, <g_i, ∂J(ω, b)_i/∂δ_z> are traversed, and the average gradient LSG(T) of the batch data is constructed to obtain the batch gradient with respect to the current parameter.
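The weight construction can be illustrated by a short Python sketch; the per-sample loss values are assumed to be given, and tau is an assumed threshold.

import numpy as np

def loss_average_weights(losses, tau=1.0):
    """losses: per-sample loss values J(w, b)_i for one batch.
    Returns 0/1 weights LAW(g_i): 1 for ordinary samples, 0 for outliers."""
    losses = np.asarray(losses, dtype=float)
    lad = np.abs(losses - losses.mean())   # LAD(g_i): deviation from the mean loss
    return (lad < tau).astype(float)       # LAW(g_i) = 1 if LAD < tau else 0

For instance, loss_average_weights([0.30, 0.35, 5.20, 0.32], tau=2.0) returns [1., 1., 0., 1.], flagging the third sample as abnormal so that it is excluded from the batch gradient.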
Theorem 4 (loss average weight LAW(g_i)). It is known that g_i represents one piece of data in the batch, J(ω, b)_i represents the loss function value of data g_i, ω and b are the convolution kernel parameters and the bias of the convolution layer, respectively, batch_size represents the batch data size, and LAD(g_i) is the absolute value of the difference between the loss function value of data g_i and the mean of the loss function values. The calculation formulas of the loss average weight LAW(g_i) and of LAD(g_i) are as follows:
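Consistent with the symbol definitions above and with the proof below, the two formulas can be reconstructed as (the indicator form of LAW is inferred from the proof):

LAW(g_i) = \begin{cases} 1, & LAD(g_i) < \tau \\ 0, & LAD(g_i) \ge \tau \end{cases}

LAD(g_i) = \left| J(\omega, b)_i - \frac{1}{\mathrm{batch\_size}} \sum_{j=1}^{\mathrm{batch\_size}} J(\omega, b)_j \right|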
Proof. LAW(g_i) is the weight indicator of the loss function value of data g_i. Let batch_size be the batch data size and τ the threshold for measuring LAD(g_i). When LAD(g_i) < τ, the loss function value of the current data g_i is an ordinary value, so the data is retained with LAW(g_i) = 1; when LAD(g_i) ≥ τ, the loss function value of the current data g_i is an abnormal value, so LAW(g_i) = 0. This completes the proof.
Theorem 5 (loss sum gradient LSG(T)). It is known that T represents all the data in the batch, ∂J(ω, b)_i/∂x represents the gradient of the loss function of data g_i with respect to the parameter x, and batch_size represents the batch data size. The calculation formula of the loss sum gradient LSG(T) is as follows:
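Consistent with the symbols above and with the proof below, the formula can be reconstructed as:

LSG(T) = \frac{1}{\mathrm{batch\_size}} \sum_{g_i \in T} LAW(g_i) \, \frac{\partial J(\omega, b)_i}{\partial x}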
Proof. LSG(T) is the average gradient of the data batch. Let ∂J(ω, b)_i/∂x be the gradient of the loss function of data g_i with respect to the parameter x, and batch_size the batch data size. When LAW(g_i) = 1, the gradient of data g_i descends toward the optimal direction; when LAW(g_i) = 0, the gradient of data g_i deviates greatly from the optimal direction and is not counted in the LSG(T) gradient. This completes the proof.
(2) Parameter parallel update
After the average gradient of the batch data is obtained, the error-term parameters are updated in parallel by the error back-propagation algorithm combined with the MapReduce computing framework, yielding a network model whose parameters are updated in parallel. The parameter parallel update process is as follows. Firstly, the gradients of all parameters of the r-th convolution kernel of the (l−1)-th layer are calculated according to the back-propagation error conduction formula, and the results are mapped to key-value pairs of the form <convolution kernel, gradient> and stored in the HDFS. Then the variation of the parameters of each convolution kernel in the network model is calculated, and the network parameters of the (l−1)-th layer convolution kernels are updated, where r is the convolution kernel index and each network parameter corresponds to its respective gradient. Finally, the updated parameters are synchronized to all computing nodes through the HDFS, and the next update is carried out until all parameters in the network model have been updated. The range of values of l depends on the number of convolution layers of the network model employed.
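The batch update itself can be sketched in plain Python; the map/reduce split is only simulated with ordinary functions, and the learning rate eta is an assumed hyperparameter.

def map_sample_gradient(weight, per_sample_grad):
    """Map step: scale one sample's gradient by its LAW weight (0 or 1)."""
    return weight * per_sample_grad

def reduce_lsg(weighted_grads, batch_size):
    """Reduce step: loss sum gradient LSG(T), the averaged weighted gradients."""
    return sum(weighted_grads) / batch_size

def update_parameter(theta, weights, per_sample_grads, eta=0.01):
    """One IM-BGDS-style descent step on a parameter tensor theta."""
    weighted = [map_sample_gradient(w, g) for w, g in zip(weights, per_sample_grads)]
    lsg = reduce_lsg(weighted, len(per_sample_grads))
    return theta - eta * lsg   # synchronized to all nodes via the HDFS in the cluster setting

Together with loss_average_weights above, this reproduces the weighted batch update described in this section, up to the actual distribution of the Map() calls across the cluster nodes.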
4. Effectiveness of parallel deep convolutional neural network optimization algorithm (IA-PDCNNOA) based on Im2col
To verify the performance of the IA-PDCNNOA algorithm, the IA-PDCNNOA method was applied to both the ImageNet 1K and CIFAR10 datasets, whose details are shown in Table 1. It is compared with the MR-FPDCNN, SSOCNN and FCNN algorithms in terms of parallelization, classification accuracy, etc.
Table 1 Dataset details

Item                        CIFAR10    ImageNet 1K
Number of pictures/sheets   60 000     1 281 167
Picture size/pixels         32*32      224*224
Number of categories        10         1000
4.1 IA-PDCNNOA algorithm speedup ratio experimental analysis
To verify the parallelization performance of the IA-PDCNNOA algorithm in the big data environment, the speedup ratio is used as the measurement index on the CIFAR10 and ImageNet 1K datasets, and the algorithm is compared with the MR-FPDCNN, SSOCNN and FCNN algorithms, respectively. Meanwhile, to ensure the accuracy of the experimental results, each algorithm is run 10 times and the average running time is used to calculate the speedup ratio as the final experimental result. The experimental results are shown in fig. 1:
As can be seen from fig. 1(a), when processing a relatively small-scale dataset such as CIFAR10, the speedup ratio of each algorithm increases slowly as the number of nodes increases; when the number of cluster nodes is 4, the speedup ratio of IA-PDCNNOA is lower than those of the FCNN and SSOCNN algorithms, which have a low degree of parallelization, by 0.3 and 0.5, respectively. In fig. 1(b), however, when the algorithms process the relatively large ImageNet 1K dataset, the speedup ratio of the IA-PDCNNOA algorithm increases more markedly, reaching 9.8 when the number of cluster nodes is 8, which is 1.1, 4.1 and 4.6 higher than those of the MR-FPDCNN, FCNN and SSOCNN algorithms, respectively. The reason for these results is that when the IA-PDCNNOA algorithm processes a relatively small dataset, distributing the data to the computing nodes causes a rapid increase in the communication overhead among the nodes, so the speed gained by parallel operation is very limited; when the IA-PDCNNOA algorithm processes a relatively large dataset, the designed IM-PMTS strategy reduces the overhead of the convolution layer parameters in network communication by proposing the Mahalanobis distance center value MDCV to prune same-layer convolution kernels, and then accelerates the convolution operation by combining the MapReduce and Im2col methods for parallel training, which improves the operation speed of the convolution layer and hence the speedup ratio of the algorithm. The experiments show that the parallelization capability of the IA-PDCNNOA algorithm improves remarkably as the number of cluster nodes increases, and that the method is suitable for the parallelization of large datasets and has better performance.
4.2 IA-PDCNNOA algorithm accuracy experimental analysis
In order to further verify the training effect of the IA-PDCNNOA algorithm, the Top-1 accuracy is used as the measurement index; IA-PDCNNOA, MR-FPDCNN, SSOCNN and FCNN are run on the CIFAR10 and ImageNet 1K datasets, respectively, and the Top-1 accuracy is calculated as the experimental result, as shown in fig. 2:
As can be seen from fig. 2(a), when processing a relatively small dataset such as CIFAR10, the Top-1 accuracy of every algorithm stabilizes at a relatively high value. Among them, the Top-1 accuracy of the IA-PDCNNOA algorithm is the highest and it converges earliest, reaching 89.72%, which is 2.87%, 4.62% and 6.48% higher than those of the MR-FPDCNN, SSOCNN and FCNN algorithms, respectively. When the algorithms process a relatively large dataset such as ImageNet 1K, however, fig. 2(b) shows large differences in both the Top-1 accuracy and the convergence of the algorithms. The Top-1 accuracy of the IA-PDCNNOA algorithm is the highest among the four parallelized algorithms, reaching 72.41%, which is 2.31%, 7.98% and 2.85% higher than those of the MR-FPDCNN, SSOCNN and FCNN algorithms, respectively, while the other three algorithms have difficulty converging to different degrees. These results arise because the IA-PDCNNOA algorithm proposes the IM-BGDS strategy, which designs the loss sum gradient LSG(T) to construct the small-batch data gradient and updates the parameters in parallel through the error back-propagation algorithm, eliminating the influence of abnormal data on the batch gradient and enhancing the convergence of the IA-PDCNNOA algorithm. The experimental data show that, compared with the other three parallelized algorithms, IA-PDCNNOA converges faster and is more accurate, and is suitable for parallel training of deep convolutional neural network models on large datasets.
4.3 IA-PDCNNOA algorithm running time and FLOPs experimental analysis
To verify the execution speed and model optimization effect of the IA-PDCNNOA algorithm in the big data environment, the running times and FLOPs of Baseline, IA-PDCNNOA, MR-FPDCNN, SSOCNN and FCNN were calculated on the CIFAR10 and ImageNet 1K datasets, respectively, where Baseline is the baseline data of the ResNet model under a 1/8 data load. The experimental results are shown in Table 2:
Table 2 Running time and FLOPs of each algorithm on the two datasets
As can be seen from Table 2, when processing a relatively small-scale dataset such as CIFAR10, there is no large gap in the running times of the algorithms, but their floating-point operations are reduced to different degrees; the floating-point operations of IA-PDCNNOA are 5%, 21% and 16% lower than those of the MR-FPDCNN, SSOCNN and FCNN algorithms, respectively. When processing a large dataset such as ImageNet 1K, however, both the running time and the floating-point operations of IA-PDCNNOA are better than those of the other three algorithms: the running time of IA-PDCNNOA is 1.32×10^4 s, 3.85×10^4 s and 5.29×10^4 s shorter, and its floating-point operations are 3%, 13% and 8% lower, than those of the MR-FPDCNN, SSOCNN and FCNN algorithms, respectively. These results arise because the MHO-PFES strategy proposed by the IA-PDCNNOA algorithm removes redundant features in the data by proposing the feature correlation index FCI(u, v) and screens the target features of the data as the input of the convolutional neural network, thereby reducing the floating-point operations of the model and speeding up the algorithm. In general, comparing the running time and floating-point operation trends of the four algorithms on CIFAR10 and ImageNet 1K, the running time and the reduction in floating-point operations of the IA-PDCNNOA algorithm pull further ahead of the other algorithms as the training dataset grows, so it can be concluded that IA-PDCNNOA outperforms MR-FPDCNN, SSOCNN and FCNN and is suitable for the parallel training of DCNN models on large datasets.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the spirit and scope of the invention as defined by the appended claims and their equivalents.