Disclosure of Invention
The present invention aims to solve at least the technical problems existing in the prior art, and in particular creatively provides a parallel deep convolutional neural network optimization method based on Im2col.
In order to achieve the above object of the present invention, the present invention provides a parallel deep convolutional neural network optimization method based on Im2col, comprising the steps of:
S1, feature parallel extraction, namely extracting target features in the data to serve as the input of the convolutional neural network, thereby alleviating the problem of redundant data features;
S2, model parallel training, namely completing distributed convolution kernel pruning and multi-node convolution calculation through an IM-PMTS strategy during the convolution process of the parallel DCNN model training stage, and training the model in parallel by combining the MapReduce and Im2col methods, so that the operation speed of the convolution layer is improved;
S3, parameter parallel updating, namely adopting an IM-BGDS strategy to update the parameters for the batch data in the back propagation stage, wherein the strategy applies a gradient descent method that excludes abnormal data points from the batch data, thereby avoiding the influence of abnormal data points on the batch gradient;
S4, inputting the data to be tested into the DCNN model with the parameters updated in parallel, and outputting a classification result.
Further, step S1 adopts an MHO-PFES strategy to carry out feature parallel extraction, and the MHO-PFES strategy comprises the following steps:
S1-1, feature extraction, namely filtering the input data by adopting an improved non-local mean filter, calculating the Laplacian h(x, y) of the filtered data, and searching for zero crossings of the Laplacian to extract the data features;
S1-2, feature screening, namely proposing a feature correlation index FCI(u, v) to compare the similarity between any two data blocks so as to further screen the target features, setting a correlation threshold ε, and reducing the redundant features in the data by removing the data blocks with FCI(u, v) < ε.
Further, the improved non-local mean filter FT(a, b) comprises:
wherein a represents a target window matrix;
b represents a neighborhood window matrix;
θ(·) is the feature transformation function;
g_i is the current data;
vec(a) and vec(b) are the vectorized representations of the matrices a and b, respectively;
|·| represents the modulus of a vector.
Further, the feature correlation index FCI(u, v) comprises:
wherein μ_u and μ_v represent the expectations of u and v, respectively;
σ_u and σ_v represent the variances of u and v, respectively;
u and v represent two feature vectors, respectively.
Further, the IM-PMTS strategy in S2 comprises the following steps:
S2-1, convolution kernel pruning, namely designing a Mahalanobis distance center value MDCV, searching for the vectors linearly related to the convolution kernels in the network model by solving the MDCV value, calculating the distance dist between these vectors and each convolution kernel, and reducing the redundant parameters in the network model by setting a threshold α and pruning the convolution kernels with dist < α;
S2-2, parallel Im2col convolution, namely mapping the feature map into a matrix by using the Im2col algorithm, storing the matrix and its corresponding convolution kernels as key-value pairs, distributing them to the computing nodes to perform the matrix operations so as to accelerate the operation of the convolution layer, obtaining the operation result of the convolution layer, and storing the result in the HDFS.
Further, the Mahalanobis distance center value MDCV comprises:
wherein μ represents the mean of all convolution kernels;
S represents the covariance matrix of all convolution kernels;
R_n is the set of convolution kernels in the same layer of the model, R_n = {X_1, X_2, ..., X_n}; x ∈ R_n, that is, x takes any element of the convolution kernel set {X_1, X_2, ..., X_n}, where X_1, X_2, ..., X_n represent the convolution kernels in the network model;
T represents the transpose.
Further, the IM-BGDS strategy comprises the following steps:
S3-1, gradient construction, namely proposing a loss average weight LAW(g_i) to eliminate the influence of abnormal data on the batch gradient, and designing a loss sum gradient LSG(T) to construct the average gradient of the batch data, thereby solving the problem of poor convergence of the loss function;
S3-2, parameter parallel updating, namely, after obtaining the average gradient of the batch data, calculating the errors in parallel by combining the MapReduce computing framework with the back-propagation error conduction formula, thereby realizing parallel updating of the parameters.
Further, the loss average weight LAW(g_i) comprises:
wherein:
LAD(g_i) is the absolute value of the difference between the loss function value of data g_i and the mean of the loss function values;
g_i represents one piece of data in the batch;
τ is the threshold for LAD(g_i);
batch_size represents the batch data size;
J(ω, b)_i represents the loss function value of data g_i;
ω and b are the convolution kernel parameters and the bias of the convolution layer, respectively.
Further, the loss sum gradient LSG(T) comprises:
where batch_size represents the batch data size;
∂J(ω, b)_i/∂x represents the gradient of the loss function of data g_i with respect to the parameter x;
T represents all the data in the batch;
LAW(g_i) is the weight indicator of the loss function value of data g_i.
In summary, owing to the adoption of the above technical scheme, the MHO-PFES strategy alleviates the problem of redundant data features, the IM-PMTS strategy improves the operation speed of the convolution layer, and the IM-BGDS strategy eliminates the influence of abnormal data on the batch gradient and solves the problem of poor convergence of the loss function.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
The invention provides a parallel deep convolutional neural network optimization method based on Im2col, which comprises the following steps:
S1, extracting target features in medical image data in parallel as input of a convolutional neural network;
S2, model parallel training, namely completing distributed convolution kernel pruning and multi-node convolution calculation through an IM-PMTS strategy in the convolution process of a parallel DCNN model training stage, and combining a MapReduce method and an Im2col method to train the model in parallel;
S3, updating parameters in parallel, namely adopting an IM-BGDS strategy to update the parameters for the batch medical image data in the back propagation stage;
S4, inputting the medical image data to be tested into the DCNN model with the parameters updated in parallel, and outputting the classification result of the medical image.
Based on the advantages of the MapReduce programming model, the invention provides a parallel deep convolutional neural network optimization algorithm IA-PDCNNOA based on the Im2col algorithm. Firstly, a parallel feature extraction strategy MHO-PFES (Parallel Feature Extraction Strategy based on the Marr-Hildreth operator) is proposed, which extracts the target features in the data to serve as the input of the convolutional neural network and effectively alleviates the problem of redundant data features. Secondly, a parallel model training strategy IM-PMTS (Parallel Model Training Strategy based on the Im2col method) is designed, which removes redundant convolution kernels by designing a Mahalanobis distance center value and trains the model in parallel by combining the MapReduce and Im2col methods, improving the operation speed of the convolution layer. Finally, an improved mini-batch gradient descent strategy IM-BGDS (Improved Mini-Batch Gradient Descent Strategy) is proposed, which eliminates the influence of abnormal data on the batch gradient and solves the problem of poor convergence of the loss function. The algorithm provided by the invention brings a remarkable improvement in operation efficiency and model accuracy; in addition, the knowledge mined by the method can be of great help in biology, medicine and astrophysics.
1. Feature parallel extraction
At present, parallel DCNN algorithms in the big data environment suffer from the problem of redundant data features during model training. In order to solve this problem, the MHO-PFES strategy based on the Marr-Hildreth operator is proposed, which mainly comprises two steps: (1) feature extraction, namely filtering the input data with the improved non-local mean filter FT(a, b) (filter transformation), calculating the Laplacian h(x, y) of the filtered data, and searching for zero crossings of the Laplacian to extract the data features; and (2) feature screening, namely, in order to further screen the target features, proposing the feature correlation index FCI(u, v) to compare the similarity between any two data blocks, setting a correlation threshold ε, and reducing the redundant features in the data by removing the data blocks with FCI(u, v) < ε.
1.1 Feature extraction
In order to acquire high-precision data features, noise removal is first carried out on the initial data set: a non-local mean filter FT(a, b) based on cosine similarity is proposed, which removes data noise through the self-similarity of different regions of the data. A Laplacian operation is then performed on the convolution kernel f(x, y) and the data g(x, y), and the zero crossings of the resulting Laplacian are located to extract the data features. The specific process is as follows. Firstly, a target window matrix a and a neighborhood window matrix b are set; the neighborhood window slides over the current data, the weight of each neighborhood window is obtained from the cosine similarity of the matrices a and b, and the data are denoised according to these weights and the gray value of each point to obtain the denoised image g(x, y). Then a convolution kernel f(x, y) of size 3*3 is set, and the Laplacian h(x, y) = ∂²[f(x, y)*g(x, y)]/∂x² + ∂²[f(x, y)*g(x, y)]/∂y² is obtained through the Laplacian operation, where g(x, y) denotes the pixel value of the image at (x, y). Finally, it is judged whether the current node is a zero crossing of the second derivative; if this condition is met and the first derivative at the node shows a sufficiently large peak, the node is retained, otherwise the pixel is set to zero. The retained data nodes are merged to obtain the data after feature extraction. Generally, for the non-local mean denoising algorithm, the data refers to image data.
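The Laplacian filtering and zero-crossing detection of this step can be illustrated with a short Python/NumPy sketch; the 3*3 Laplacian kernel, the peak threshold and the function name are illustrative assumptions, and the non-local mean denoising by FT(a, b) is assumed to have been applied to img beforehand.

import numpy as np
from scipy.ndimage import convolve

def marr_hildreth_features(img, peak_thresh=0.1):
    """Illustrative feature extraction: Laplacian filtering followed by
    zero-crossing detection, in the spirit of the Marr-Hildreth operator."""
    # 3x3 Laplacian kernel f(x, y); an assumed choice, not the patented one.
    f = np.array([[0, 1, 0],
                  [1, -4, 1],
                  [0, 1, 0]], dtype=float)
    h = convolve(img.astype(float), f, mode="nearest")   # h(x, y)

    # Zero crossings of h: sign changes between vertical/horizontal neighbours.
    sign = np.sign(h)
    zc = np.zeros_like(h, dtype=bool)
    zc[:-1, :] |= sign[:-1, :] * sign[1:, :] < 0
    zc[:, :-1] |= sign[:, :-1] * sign[:, 1:] < 0

    # Keep only zero crossings where the local gradient magnitude (first
    # derivative) shows a sufficiently large peak; other pixels are set to zero.
    gy, gx = np.gradient(img.astype(float))
    grad_mag = np.hypot(gx, gy)
    return np.where(zc & (grad_mag > peak_thresh), img, 0)

The retained pixels play the role of the merged data nodes described above.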
Theorem 1 (cosine-similarity-based non-local mean filter FT(a, b)). It is known that a represents the target window matrix and b represents the neighborhood window matrix, with a and b taken from the current data. The calculation formula of the transformation function FT(a, b) is as follows:
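One plausible explicit form, consistent with the symbol definitions that follow and with the cosine-similarity weighting described in the procedure above, is the following reconstruction (the exact composition with θ(·) is an assumption):

FT(a, b) = \theta\left( \frac{\mathrm{vec}(a) \cdot \mathrm{vec}(b)}{\left|\mathrm{vec}(a)\right|\,\left|\mathrm{vec}(b)\right|} \right), \qquad a, b \subset g_i

Under this reading, FT(a, b) is the weight assigned to the neighborhood window b, and the denoised value of the target window a in g_i is the FT-weighted average of the gray values of its neighborhood windows, as described in the procedure above.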
wherein θ(·) is a feature transformation function, which may be, for example, a linear kernel function or a Gaussian kernel function; g_i is the current data; vec(a) and vec(b) are the vectorized representations of the matrices a and b, respectively; and |·| represents the modulus of a vector.
Proof. The non-local mean filtering principle exploits the non-correlation property of the noise. Let the value of a noise-free pixel block be ω(p, q) and the noise be ψ(p, q), so that the value of a pixel block fused with noise is ρ(p, q) = ω(p, q) + ψ(p, q). Averaging the similar pixel blocks after they are overlapped gives ρ̄(p, q) = (1/k) Σ_{i=1}^{k} ρ_i(p, q), where ρ_i(p, q) represents the pixel value of the i-th pixel block fused with noise and k is the total number of pixel blocks. The expectation of ρ̄(p, q) is E[ρ̄(p, q)] = (1/k) Σ_{i=1}^{k} E[ω_i(p, q)] + E[ψ(p, q)]. Due to the similarity of the pixel blocks, E[ω_i(p, q)] can be reduced to ω(p, q), and when the noise has zero mean, E[ψ(p, q)] = 0, so E[ρ̄(p, q)] = ω(p, q). Furthermore, due to the uncorrelation of the noise, the variance of ρ̄(p, q) is D[ρ̄(p, q)] = D[ω(p, q)] + D[ψ(p, q)]/k; since ω(p, q) is noiseless, its variance is 0, so D[ρ̄(p, q)] = D[ψ(p, q)]/k. This shows that the residual noise ψ(p, q) is governed by this variance term, and FT(a, b) reduces the data noise by reducing ψ(p, q). This completes the proof.
1.2 Feature screening
After feature extraction is completed, the strategy cuts the data in a batch into blocks, proposes the feature correlation index FCI(u, v) to calculate the feature similarity between any two data blocks, and then removes the data blocks with FCI(u, v) < ε so as to remove redundant features in the data. The specific process is as follows. Firstly, data of the same class are divided into a batch, the data in the batch are cut into data blocks of the same size, and each data block is numbered sequentially; the feature correlation index FCI(u, v) between any two data blocks is calculated, and the key-value pairs <(u, v), FCI(u, v)> are stored in the HDFS. Then the correlation threshold ε is set, the key-value pairs <(u, v), FCI(u, v)> are traversed sequentially, and the entries with FCI(u, v) < ε are removed. Finally, the remaining key-value pairs are traversed again, the data blocks referenced by their keys are collected as the target feature blocks, and the screened data are used as the input of the convolutional neural network, which completes the feature screening.
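The screening step can be illustrated with the following Python/NumPy sketch; the block size, the threshold eps, the Pearson-style correlation used as FCI and the keep-or-drop rule applied to each block are assumptions made for illustration.

import itertools
import numpy as np

def fci(u, v):
    """Assumed Pearson-style feature correlation index between two blocks."""
    u, v = u.ravel().astype(float), v.ravel().astype(float)
    su, sv = u.std(), v.std()
    if su == 0 or sv == 0:          # degenerate block: treat as uncorrelated
        return 0.0
    return float(np.mean((u - u.mean()) * (v - v.mean())) / (su * sv))

def screen_blocks(data, block=8, eps=0.2):
    """Cut a 2-D array into block x block tiles and drop tiles whose best
    correlation with any other tile falls below eps (illustrative rule)."""
    h, w = data.shape
    tiles = [data[i:i + block, j:j + block]
             for i in range(0, h - block + 1, block)
             for j in range(0, w - block + 1, block)]
    scores = {k: 0.0 for k in range(len(tiles))}
    for i, j in itertools.combinations(range(len(tiles)), 2):
        c = fci(tiles[i], tiles[j])
        scores[i] = max(scores[i], c)
        scores[j] = max(scores[j], c)
    return [tiles[k] for k, s in scores.items() if s >= eps]

In the distributed setting, the pairwise FCI values would instead be emitted as <(u, v), FCI(u, v)> key-value pairs and filtered on the HDFS, as described above.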
Theorem 2 (feature correlation index FCI(u, v)). It is known that u and v represent two feature vectors, μ_u and μ_v represent the expectations of u and v, and σ_u and σ_v represent the variances of u and v, respectively. The calculation formula of the feature correlation index FCI(u, v) is as follows:
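A plausible explicit form, consistent with the symbol definitions and with the behaviour described in the proof below, is the following reconstruction (it treats σ_u and σ_v as the standard deviations of u and v, which is an assumption):

FCI(u, v) = \frac{E\left[(u - \mu_u)(v - \mu_v)\right]}{\sigma_u \, \sigma_v}, \qquad FCI(u, v) := 0 \ \text{when} \ \sigma_u \sigma_v = 0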
Proof. FCI(u, v) is an index measuring the feature similarity between u and v. Let μ_u and μ_v represent the expectations of u and v, and σ_u and σ_v the variances of u and v. When σ_u = 0 for the feature vector u, the operation of the convolution process on u amounts to linear superposition and no features can be extracted, so FCI(u, v) = 0; when σ_u ≠ 0, σ_v ≠ 0 and the features of the feature vectors u and v are similar, FCI(u, v) → 1, where → denotes approach. This completes the proof.
2. Model parallel training
In current DCNN algorithms in the big data environment, parallel training of the model requires distributing the feature maps and convolution kernels to different computing nodes for operation. However, while constructing the parallel convolution operation, such algorithms find it difficult to screen out the redundant convolution kernels scattered over the nodes, so the problem of the low operation speed of the convolution layer cannot be solved in the big data environment. In order to solve this problem, the IM-PMTS strategy is proposed, which mainly comprises: (1) convolution kernel pruning, namely designing a Mahalanobis distance center value (MDCV), searching for the vectors linearly related to the convolution kernels in the network model by solving the MDCV value, calculating the distance dist between these vectors and each convolution kernel, and reducing the redundant parameters in the network model by setting a threshold α and pruning the convolution kernels with dist < α; and (2) parallel Im2col convolution, namely mapping the feature maps into matrices with the Im2col algorithm, storing each matrix and its corresponding convolution kernels as key-value pairs, distributing them to the computing nodes to accelerate the operation of the convolution layer, obtaining the operation results of the convolution layer, and storing the results in the HDFS (Hadoop distributed file system).
2.1 Convolution kernel pruning
In order to reduce the invalid calculation produced by redundant convolution kernels in the convolutional neural network, the Mahalanobis distance center value MDCV is designed to screen out the redundant convolution kernels in the current convolution layer and thereby further accelerate the operation of the convolution layer. The specific process is as follows. Firstly, the covariance matrix S and the mean μ of all convolution kernels X_1, X_2, ..., X_n of the convolution layer are calculated to construct the objective function f(x) of the MDCV. Then the second-order Taylor expansion of f(x) at the point x_k is calculated, f(x) ≈ f(x_k) + ∇f(x_k)^T (x − x_k) + ½ (x − x_k)^T ∇²f(x_k)(x − x_k), where ∇ denotes the nabla operator and (·)^T denotes the transpose. If the current second-derivative matrix is not singular, the next iteration point is x_{k+1} = x_k − [∇²f(x_k)]^{-1} ∇f(x_k); if it is singular, the linear system ∇²f(x_k) d_k = −∇f(x_k) is first solved to obtain the update direction. Finally, the distances dist from all convolution kernels in the convolution layer to the MDCV value are calculated, the threshold α is set, and the convolution kernels with dist < α are pruned to complete the convolution kernel pruning process. Here k is the number of search iterations.
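A simplified Python/NumPy sketch of this pruning rule is given below; it scores each flattened kernel by its Mahalanobis distance to the kernel mean, which is an illustrative stand-in for the iteratively solved MDCV, and the threshold alpha and the regularization term are assumptions.

import numpy as np

def prune_kernels(kernels, alpha=0.5):
    """kernels: array of shape (n, k, k, c). Returns the kernels kept after
    Mahalanobis-distance-based pruning (simplified illustration)."""
    flat = kernels.reshape(len(kernels), -1)            # one row per kernel
    mu = flat.mean(axis=0)
    S = np.cov(flat, rowvar=False) + 1e-6 * np.eye(flat.shape[1])  # regularized covariance
    S_inv = np.linalg.inv(S)
    diffs = flat - mu
    dist = np.sqrt(np.einsum('ij,jk,ik->i', diffs, S_inv, diffs))  # Mahalanobis distances
    keep = dist >= alpha        # kernels too close to the center are treated as redundant
    return kernels[keep]

Replacing the closed-form distance to the mean by the Newton iteration described above only changes how the center is obtained; the pruning rule dist < α stays the same.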
Theorem 3 (Mahalanobis distance center value MDCV). It is known that X_1, X_2, ..., X_n represent the convolution kernels in the network model, S represents the covariance matrix of all convolution kernels, and μ represents the mean of all convolution kernels. The calculation formula of the Mahalanobis distance center value MDCV is as follows:
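A plausible form, consistent with the symbols above and with the proof below, is the following reconstruction (reading the MDCV as the smallest Mahalanobis distance, with respect to S and μ, attained over the kernel set R_n is an assumption):

MDCV = \min_{x \in R_n} \sqrt{(x - \mu)^{T} S^{-1} (x - \mu)}

Under this reading, the kernel attaining the minimum acts as the center, and dist in step S2-1 is the Mahalanobis distance from each remaining kernel to that center.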
where R_n is the set of convolution kernels of the same layer of the model and T represents the transpose.
Proof. The MDCV is the minimum distance from a feature vector x to the feature vector set X_1, X_2, ..., X_n, where S is the covariance matrix of the vector set X_1, X_2, ..., X_n and μ is the mean of the vector set. The covariance matrix S is introduced to exclude the interference of the correlation between the variables. The closer the feature vector x is to the MDCV value, the more easily x can be replaced by the feature vector set; when x = MDCV, x and X_1, X_2, ..., X_n are linearly related, and the MDCV value represents the minimum distance from the feature vector x* to the feature vector set X_1, X_2, ..., X_n. This completes the proof.
2.2 Parallel Im2col convolutions
After convolution kernel pruning is completed, the Im2col parallel convolution operation can be realized by combining the MapReduce computing framework. The specific process is as follows. Firstly, each input feature map M_i is mapped into a convolution calculation matrix I_i by the Im2col method, and each mapping matrix I_i and its corresponding convolution kernels are stored as a key-value pair <I_i, K_z>, where K_z represents a convolution kernel corresponding to the convolution calculation matrix I_i, the correspondence being many-to-many. Then the Map() function is called, and matrix multiplication is carried out between the matrix I_i in the key-value pair and the one-dimensional vector of the corresponding convolution kernel to obtain an intermediate convolution result. Finally, the Reduce() function is called to combine the feature maps belonging to the same piece of data to obtain the final output feature map NM_i.
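The Im2col mapping and the resulting matrix-multiplication convolution can be sketched in Python/NumPy as follows (stride 1, no padding, a single input feature map; the function names are illustrative):

import numpy as np

def im2col(feature_map, kh, kw):
    """Unfold a (C, H, W) feature map into a matrix whose columns are the
    flattened receptive fields, so convolution becomes a matrix product."""
    C, H, W = feature_map.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((C * kh * kw, out_h * out_w), dtype=feature_map.dtype)
    idx = 0
    for i in range(out_h):
        for j in range(out_w):
            patch = feature_map[:, i:i + kh, j:j + kw]
            cols[:, idx] = patch.ravel()
            idx += 1
    return cols, out_h, out_w

def conv_by_im2col(feature_map, kernels):
    """kernels: (N, C, kh, kw). Returns the (N, out_h, out_w) output maps."""
    N, C, kh, kw = kernels.shape
    cols, out_h, out_w = im2col(feature_map, kh, kw)
    K = kernels.reshape(N, -1)          # each kernel as a one-dimensional vector
    out = K @ cols                      # the Map()-side matrix multiplication
    return out.reshape(N, out_h, out_w)

For example, conv_by_im2col(np.random.rand(3, 32, 32), np.random.rand(8, 3, 3, 3)) returns eight 30*30 output maps; in the parallel setting, each key-value pair <I_i, K_z> shipped to a computing node corresponds to exactly this matrix product, and the Reduce() step merges the per-node results.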
3. Parameter parallel update
Current parallel DCNN algorithms under big data adopt the stochastic gradient descent method or the batch gradient descent method to update the parameters during back propagation. However, while realizing gradient descent, training the DCNN model on abnormal data (mislabeled data, noisy data, etc.) may cause the loss function to oscillate as it converges, resulting in poor convergence of the loss function. In order to solve this problem, the IM-BGDS strategy is proposed, which mainly comprises two steps: (1) gradient construction, namely proposing the loss average weight LAW(g_i) to eliminate the influence of abnormal data on the batch gradient and designing the loss sum gradient LSG(T) to construct the average gradient of the batch data, thereby solving the problem of poor convergence of the loss function; and (2) parameter parallel updating, namely, after obtaining the average gradient of the batch data, calculating the errors in parallel by combining the MapReduce computing framework with the back-propagation error conduction formula, thereby realizing parallel updating of the parameters.
(1) Gradient construction
In order to eliminate the influence of abnormal data on the batch gradient, the loss average weight LAW(g_i) and the loss sum gradient LSG(T) are designed to solve the problem of poor convergence of the loss function. The specific process is as follows. Firstly, when the parameters are updated, the mean of the loss function over the whole batch is calculated and differenced with the loss function value of each piece of data g_i to construct the loss average weight LAW(g_i), and the key-value pair <g_i, LAW(g_i)> is stored in the HDFS. Then the partial derivative ∂J(ω, b)_i/∂δ_z of the loss function of each piece of data g_i with respect to the current parameter δ_z is calculated, and the key-value pair <g_i, ∂J(ω, b)_i/∂δ_z> is stored in the HDFS, with batch_size set to 1 in LAW(g_i). Finally, the key-value pairs <g_i, LAW(g_i)> and, with g_i as the index, <g_i, ∂J(ω, b)_i/∂δ_z> are traversed, and the average gradient LSG(T) of the batch data is constructed to obtain the batch gradient with respect to the current parameter.
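The weight construction can be illustrated by a short Python sketch; the per-sample loss values are assumed to be given, and tau is an assumed threshold.

import numpy as np

def loss_average_weights(losses, tau=1.0):
    """losses: per-sample loss values J(w, b)_i for one batch.
    Returns 0/1 weights LAW(g_i): 1 for ordinary samples, 0 for outliers."""
    losses = np.asarray(losses, dtype=float)
    lad = np.abs(losses - losses.mean())   # LAD(g_i): deviation from the mean loss
    return (lad < tau).astype(float)       # LAW(g_i) = 1 if LAD < tau else 0

For instance, loss_average_weights([0.30, 0.35, 5.20, 0.32], tau=2.0) returns [1., 1., 0., 1.], flagging the third sample as abnormal so that it is excluded from the batch gradient.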
Theorem 4 (loss average weight LAW(g_i)). It is known that g_i represents one piece of data in the batch, J(ω, b)_i represents the loss function value of data g_i, ω and b are the convolution kernel parameters and the bias of the convolution layer, respectively, batch_size represents the batch data size, and LAD(g_i) is the absolute value of the difference between the loss function value of data g_i and the mean of the loss function values. The calculation formulas of the loss average weight LAW(g_i) and of LAD(g_i) are as follows:
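Consistent with the symbol definitions above and with the proof below, the two formulas can be reconstructed as (the indicator form of LAW is inferred from the proof):

LAW(g_i) = \begin{cases} 1, & LAD(g_i) < \tau \\ 0, & LAD(g_i) \ge \tau \end{cases}

LAD(g_i) = \left| J(\omega, b)_i - \frac{1}{\mathrm{batch\_size}} \sum_{j=1}^{\mathrm{batch\_size}} J(\omega, b)_j \right|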
Proof. LAW(g_i) is the weight indicator of the loss function value of data g_i. Let batch_size be the batch data size and τ the threshold for measuring LAD(g_i). When LAD(g_i) < τ, the loss function value of the current data g_i is an ordinary value, so the data is retained with LAW(g_i) = 1; when LAD(g_i) ≥ τ, the loss function value of the current data g_i is an abnormal value, so LAW(g_i) = 0. This completes the proof.
Theorem 5 (loss sum gradient LSG(T)). It is known that T represents all the data in the batch, ∂J(ω, b)_i/∂x represents the gradient of the loss function of data g_i with respect to the parameter x, and batch_size represents the batch data size. The calculation formula of the loss sum gradient LSG(T) is as follows:
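Consistent with the symbols above and with the proof below, the formula can be reconstructed as:

LSG(T) = \frac{1}{\mathrm{batch\_size}} \sum_{g_i \in T} LAW(g_i) \, \frac{\partial J(\omega, b)_i}{\partial x}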
Proof. LSG(T) is the average gradient of the data batch. Let ∂J(ω, b)_i/∂x be the gradient of the loss function of data g_i with respect to the parameter x, and batch_size the batch data size. When LAW(g_i) = 1, the gradient of data g_i descends toward the optimal direction; when LAW(g_i) = 0, the gradient of data g_i deviates greatly from the optimal direction and is not counted in the LSG(T) gradient. This completes the proof.
(2) Parameter parallel update
After the average gradient of the batch data is obtained, the error-term parameters are updated in parallel by the error back-propagation algorithm combined with the MapReduce computing framework, yielding a network model whose parameters are updated in parallel. The parameter parallel update process is as follows. Firstly, the gradients of all parameters of the r-th convolution kernel of the (l−1)-th layer are calculated according to the back-propagation error conduction formula, and the results are mapped to key-value pairs of the form <convolution kernel, gradient> and stored in the HDFS. Then the variation of the parameters of each convolution kernel in the network model is calculated, and the network parameters of the (l−1)-th layer convolution kernels are updated, where r is the convolution kernel index and each network parameter corresponds to its respective gradient. Finally, the updated parameters are synchronized to all computing nodes through the HDFS, and the next update is carried out until all parameters in the network model have been updated. The range of values of l depends on the number of convolution layers of the network model employed.
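The batch update itself can be sketched in plain Python; the map/reduce split is only simulated with ordinary functions, and the learning rate eta is an assumed hyperparameter.

def map_sample_gradient(weight, per_sample_grad):
    """Map step: scale one sample's gradient by its LAW weight (0 or 1)."""
    return weight * per_sample_grad

def reduce_lsg(weighted_grads, batch_size):
    """Reduce step: loss sum gradient LSG(T), the averaged weighted gradients."""
    return sum(weighted_grads) / batch_size

def update_parameter(theta, weights, per_sample_grads, eta=0.01):
    """One IM-BGDS-style descent step on a parameter tensor theta."""
    weighted = [map_sample_gradient(w, g) for w, g in zip(weights, per_sample_grads)]
    lsg = reduce_lsg(weighted, len(per_sample_grads))
    return theta - eta * lsg   # synchronized to all nodes via the HDFS in the cluster setting

Together with loss_average_weights above, this reproduces the weighted batch update described in this section, up to the actual distribution of the Map() calls across the cluster nodes.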
4. Effectiveness of parallel deep convolutional neural network optimization algorithm (IA-PDCNNOA) based on Im2col
To verify the performance of the IA-PDCNNOA algorithm, the IA-PDCNNOA method was applied to both the ImageNet 1K and CIFAR10 datasets, whose details are shown in Table 1. It is compared with the MR-FPDCNN, SSOCNN and FCNN algorithms in terms of parallelization, classification accuracy, etc.
Table 1 Dataset details

Item                        CIFAR10    ImageNet 1K
Number of pictures/sheets   60 000     1 281 167
Picture size/pixels         32*32      224*224
Number of categories        10         1000
4.1 IA-PDCNNOA algorithm speedup ratio experimental analysis
To verify the parallelization performance of the IA-PDCNNOA algorithm in the big data environment, the speedup ratio is used as the measurement index on the CIFAR10 and ImageNet 1K datasets, and the algorithm is compared with the MR-FPDCNN, SSOCNN and FCNN algorithms, respectively. Meanwhile, to ensure the accuracy of the experimental results, each algorithm is run 10 times and the average running time is used to calculate the speedup ratio as the final experimental result. The experimental results are shown in fig. 1:
As can be seen from fig. 1(a), when processing a relatively small-scale dataset such as CIFAR10, the speedup ratio of each algorithm increases slowly as the number of nodes increases; when the number of cluster nodes is 4, the speedup ratio of IA-PDCNNOA is lower than those of the FCNN and SSOCNN algorithms, which have a low degree of parallelization, by 0.3 and 0.5, respectively. In fig. 1(b), however, when the algorithms process the relatively large ImageNet 1K dataset, the speedup ratio of the IA-PDCNNOA algorithm increases more markedly, reaching 9.8 when the number of cluster nodes is 8, which is 1.1, 4.1 and 4.6 higher than those of the MR-FPDCNN, FCNN and SSOCNN algorithms, respectively. The reason for these results is that when the IA-PDCNNOA algorithm processes a relatively small dataset, distributing the data to the computing nodes causes a rapid increase in the communication overhead among the nodes, so the speed gained by parallel operation is very limited; when the IA-PDCNNOA algorithm processes a relatively large dataset, the designed IM-PMTS strategy reduces the overhead of the convolution layer parameters in network communication by proposing the Mahalanobis distance center value MDCV to prune same-layer convolution kernels, and then accelerates the convolution operation by combining the MapReduce and Im2col methods for parallel training, which improves the operation speed of the convolution layer and hence the speedup ratio of the algorithm. The experiments show that the parallelization capability of the IA-PDCNNOA algorithm improves remarkably as the number of cluster nodes increases, and that the method is suitable for the parallelization of large datasets and has better performance.
4.2 IA-PDCNNOA algorithm accuracy experimental analysis
In order to further verify the training effect of the IA-PDCNNOA algorithm, the Top-1 accuracy is used as the measurement index; IA-PDCNNOA, MR-FPDCNN, SSOCNN and FCNN are run on the CIFAR10 and ImageNet 1K datasets, respectively, and the Top-1 accuracy is calculated as the experimental result, as shown in fig. 2:
As can be seen from fig. 2(a), when processing a relatively small dataset such as CIFAR10, the Top-1 accuracy of every algorithm stabilizes at a relatively high value. Among them, the Top-1 accuracy of the IA-PDCNNOA algorithm is the highest and it converges earliest, reaching 89.72%, which is 2.87%, 4.62% and 6.48% higher than those of the MR-FPDCNN, SSOCNN and FCNN algorithms, respectively. When the algorithms process a relatively large dataset such as ImageNet 1K, however, fig. 2(b) shows large differences in both the Top-1 accuracy and the convergence of the algorithms. The Top-1 accuracy of the IA-PDCNNOA algorithm is the highest among the four parallelized algorithms, reaching 72.41%, which is 2.31%, 7.98% and 2.85% higher than those of the MR-FPDCNN, SSOCNN and FCNN algorithms, respectively, while the other three algorithms have difficulty converging to different degrees. These results arise because the IA-PDCNNOA algorithm proposes the IM-BGDS strategy, which designs the loss sum gradient LSG(T) to construct the small-batch data gradient and updates the parameters in parallel through the error back-propagation algorithm, eliminating the influence of abnormal data on the batch gradient and enhancing the convergence of the IA-PDCNNOA algorithm. The experimental data show that, compared with the other three parallelized algorithms, IA-PDCNNOA converges faster and is more accurate, and is suitable for parallel training of deep convolutional neural network models on large datasets.
4.3 IA-PDCNNOA algorithm running time and FLOPs experimental analysis
To verify the execution speed and model optimization effect of the IA-PDCNNOA algorithm in the big data environment, the running times and FLOPs of Baseline, IA-PDCNNOA, MR-FPDCNN, SSOCNN and FCNN were calculated on the CIFAR10 and ImageNet 1K datasets, respectively, where Baseline is the baseline data of the ResNet model under a 1/8 data load. The experimental results are shown in Table 2:
Table 2 Running time and FLOPs of each algorithm on the two datasets
As can be seen from Table 2, when processing a relatively small-scale dataset such as CIFAR10, there is no large gap in the running times of the algorithms, but their floating-point operations are reduced to different degrees; the floating-point operations of IA-PDCNNOA are 5%, 21% and 16% lower than those of the MR-FPDCNN, SSOCNN and FCNN algorithms, respectively. When processing a large dataset such as ImageNet 1K, however, both the running time and the floating-point operations of IA-PDCNNOA are better than those of the other three algorithms: the running time of IA-PDCNNOA is 1.32×10^4 s, 3.85×10^4 s and 5.29×10^4 s shorter, and its floating-point operations are 3%, 13% and 8% lower, than those of the MR-FPDCNN, SSOCNN and FCNN algorithms, respectively. These results arise because the MHO-PFES strategy proposed by the IA-PDCNNOA algorithm removes redundant features in the data by proposing the feature correlation index FCI(u, v) and screens the target features of the data as the input of the convolutional neural network, thereby reducing the floating-point operations of the model and speeding up the algorithm. In general, comparing the running time and floating-point operation trends of the four algorithms on CIFAR10 and ImageNet 1K, the running time and the reduction in floating-point operations of the IA-PDCNNOA algorithm pull further ahead of the other algorithms as the training dataset grows, so it can be concluded that IA-PDCNNOA outperforms MR-FPDCNN, SSOCNN and FCNN and is suitable for the parallel training of DCNN models on large datasets.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the spirit and scope of the invention as defined by the appended claims and their equivalents.