Background
Neural network technology, especially lightweight neural networks, has been a hot topic of research and application. A depth separable convolution splits a convolution into two steps. The first step, called the Depthwise convolution, uses the idea of grouped convolution: the input channels are not combined with one another, and each channel is convolved on its own, which greatly reduces the amount of calculation needed to realize the convolution. The second step, called the Pointwise convolution, re-fuses the features learned by the Depthwise convolution, thereby remedying the defect that each Depthwise feature comes from only a single channel. Together, the two steps achieve an effect similar to the convolution of a traditional neural network. The Pointwise step is typically implemented as a convolution with a 1×1 convolution kernel.
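For illustration, the following is a minimal sketch, assuming the PyTorch framework; the channel sizes and input tensor are arbitrary illustrative choices, not taken from the patent. It shows a depth separable convolution built from a grouped Depthwise convolution followed by a 1×1 Pointwise convolution:

```python
import torch
import torch.nn as nn

in_channels, out_channels = 3, 8

# Depthwise step: grouped convolution, each input channel convolved on its own.
depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1,
                      groups=in_channels, bias=True)
# Pointwise step: 1x1 convolution that re-fuses the per-channel features.
pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=True)

x = torch.randn(1, in_channels, 32, 32)
y = pointwise(depthwise(x))   # plays the role of one standard convolution
print(y.shape)                # torch.Size([1, 8, 32, 32])
```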
The batch normalization layer re-normalizes the features learned by the intermediate layers of the neural network, so that gradients can be propagated effectively through many layers and training deep neural networks becomes possible. It has four parameters: two represent the mean and variance of the input and are used to re-normalize the features; the other two are learned by the neural network and are used for feature reconstruction, so that the features learned by the model are not damaged. Both batch normalization and the depth separable convolution are commonly used when constructing practical neural network models. Therefore, if the two can be fused together, the amount of calculation in practical applications can be effectively reduced.
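The following is a minimal sketch of the inference-time batch normalization computation using the four parameters described above; the array shapes and values are illustrative assumptions:

```python
import numpy as np

def batch_norm_inference(x, gamma, beta, mean, var, eps=1e-5):
    x_hat = (x - mean) / np.sqrt(var + eps)   # re-normalize with the stored mean/var
    return gamma * x_hat + beta               # reconstruct with the learned gamma/beta

features = np.array([0.5, 1.2, -0.3])
print(batch_norm_inference(features,
                           gamma=np.ones(3), beta=np.zeros(3),
                           mean=np.full(3, 0.1), var=np.full(3, 0.9)))
```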
Disclosure of Invention
The invention aims to provide a method for fusing depth separable convolution and batch normalization so as to effectively reduce the calculation amount.
The invention provides a method for fusing the depth separable convolution and batch normalization. The method comprises: training a neural network model containing depth separable convolution and batch normalization layers; deriving the parameters of the Pointwise convolution and the batch normalization parameters; recalculating, by a specially designed procedure, a group of new parameters that are assigned to the weight and bias of the Pointwise convolution, thereby modifying the weight and bias of the Pointwise convolution; and then deleting the batch normalization layer from the original network structure, its effect having been absorbed into the Pointwise convolution, so as to obtain a depth separable convolution layer equivalent to the original depth separable convolution plus batch normalization, thereby realizing the fusion of the convolution and batch normalization. The method comprises the following specific steps:
(1) for a trained neural network model containing depth separable convolution and batch normalization layers, in which no nonlinear activation function is allowed between the depth separable convolution and the batch normalization layer, first derive the weight $w_{pwConv}$ and bias term $b_{pwConv}$ of the Pointwise convolution of the depth separable convolution, and the parameters $\gamma$, $\beta$, mean and var of the batch normalization layer, wherein $\gamma$ and $\beta$ are the learned parameters of the batch normalization layer and mean and var are its calculation parameters (the running mean and variance); these are used for the subsequent calculation;
(2) calculate new Pointwise convolution parameters according to the following formulas:

$\hat{w}_{pwConv} = \dfrac{\gamma}{\sqrt{var + \epsilon}} \cdot w_{pwConv}$  (1)

$\hat{b}_{pwConv} = \dfrac{\gamma}{\sqrt{var + \epsilon}} \cdot (b_{pwConv} - mean) + \beta$  (2)

wherein $\epsilon$ is a hyperparameter that prevents division by zero;
(3) replace the weight $w_{pwConv}$ and bias term $b_{pwConv}$ of the original Pointwise convolution with $\hat{w}_{pwConv}$ and $\hat{b}_{pwConv}$, and delete the batch normalization layer from the original network, obtaining a new neural network structure and the corresponding weights; at this point the fusion of the depth separable convolution and batch normalization is completed; denoting the output of the Depthwise convolution by $y_{dwConv}$ and the output of the batch normalization by $y_{bn}$, the fused Pointwise convolution maps $y_{dwConv}$ directly to $y_{bn}$:

$y_{bn} = \hat{w}_{pwConv} \otimes y_{dwConv} + \hat{b}_{pwConv}$

wherein $\otimes$ represents the convolution operation;
(4) after the new network structure is obtained, it can be used in place of the original network structure, thereby reducing the amount of calculation. A computational sketch of the fusion in steps (2) and (3) is given below.
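The following is a minimal sketch of the calculation in formulas (1) and (2), assuming NumPy arrays with the Pointwise weight of shape [out_channels, in_channels, 1, 1] and per-output-channel batch normalization parameters; the function and variable names are illustrative, and the default $\epsilon$ follows the value given later in the detailed description:

```python
import numpy as np

def fuse_pointwise_bn(w_pwconv, b_pwconv, gamma, beta, mean, var, eps=1e-20):
    """Fold a batch normalization layer into the preceding Pointwise convolution."""
    scale = gamma / np.sqrt(var + eps)              # per output channel
    w_new = w_pwconv * scale[:, None, None, None]   # formula (1)
    b_new = scale * (b_pwconv - mean) + beta        # formula (2)
    return w_new, b_new

# Illustrative shapes: 8 output channels, 3 input channels, 1x1 kernel.
w = np.random.randn(8, 3, 1, 1)
b = np.random.randn(8)
gamma, beta = np.random.randn(8), np.random.randn(8)
mean, var = np.random.randn(8), np.abs(np.random.randn(8))

w_new, b_new = fuse_pointwise_bn(w, b, gamma, beta, mean, var)
```

Convolving with w_new and adding b_new then produces the same result as applying the original Pointwise convolution followed by the batch normalization layer, up to floating point precision.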
According to the invention, through this design, the batch normalization layer can be effectively fused into the depth separable convolution, so that the amount of calculation of the neural network model in the inference stage is reduced.
In the invention, after the model training is finished, all trained model parameters are derived, and the weight $w_{pwConv}$ and bias term $b_{pwConv}$ of the Pointwise convolution and the parameters $\gamma$, $\beta$, mean and var of the batch normalization layer are mathematically combined so that the new parameters $\hat{w}_{pwConv}$ and $\hat{b}_{pwConv}$ can be calculated and used to replace the weight $w_{pwConv}$ and bias term $b_{pwConv}$ of the original Pointwise convolution.
In the present invention, the batch normalization layer in the original network structure is deleted, and the weight and bias of the Pointwise convolution of the depth separable convolution in the original structure are replaced with the new weight and bias. The method can thus effectively reduce the amount of calculation.
Detailed Description
The invention will be further described with reference to the following schematic drawings.
The starting neural network layer structure is shown in the upper half of fig. 1. It contains a depth separable convolution and batch normalization, which appear as three parts in the schematic because the depth separable convolution itself contains the two parts Depthwise and Pointwise. The first part is the Depthwise convolution, a separate (grouped) convolution; in the schematic, convolution kernels of three different colors are convolved with the corresponding channels to represent the idea of convolving each channel separately. The separate convolution produces an output, which is fed to the Pointwise convolution as its input. The Pointwise convolution is a conventional convolution with a 1×1 convolution kernel; the convolution process is represented here by an interleaved 1×1 kernel, and this Pointwise convolution fuses the outputs of the different Depthwise convolutions. After the Pointwise convolution is finished, its output is further processed by a batch normalization layer. This normalizes the data effectively so that the back-propagated gradient is better preserved.
It is worth noting that the method of the present invention requires that there be no nonlinear activation function between the Pointwise convolution and the batch normalization. In practical designs, the activation function is typically added after the batch normalization layer, which also allows the batch normalization layer to perform well. After the model training is completed, the parameters of the Depthwise convolution, the Pointwise convolution and the batch normalization are all determined and saved in the model file.
The parameters are read from the model file, and $\hat{w}_{pwConv}$ and $\hat{b}_{pwConv}$ are calculated according to formulas (1) and (2), wherein the hyperparameter $\epsilon$ is selected as $10^{-20}$. A neural network model B is then redesigned as shown in the lower half of fig. 1. Its structure is nearly identical to the original model structure A, except that the batch normalization after each depth separable convolution is removed from the network structure, while all other network layer structures are preserved.
For the network layers other than the Pointwise convolution, the weights of the corresponding layers of the originally trained network structure A are assigned to model B. For the Pointwise convolution, the calculated $\hat{w}_{pwConv}$ and $\hat{b}_{pwConv}$ are assigned as its weight and bias, so that the assignment of all parameters of the newly constructed network is completed. This yields a completely new network structure model that can be used in place of the original model for inference.
It can easily be seen that the originally trained network structure A performs the batch normalization computation that the newly designed simplified model B omits, while the computation everywhere else is identical. In fact, the performance of the newly designed model is almost identical to that of the original model, so the invention saves part of the computation of the original model. Finally, A is replaced with the newly designed model B, which is used for inference.