CN113673667A - Design method of network structure in vehicle detection training - Google Patents
Design method of network structure in vehicle detection training
- Publication number
- CN113673667A (application CN202010400154.5A)
- Authority
- CN
- China
- Prior art keywords
- feature map
- output
- value
- layer
- input data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
Abstract
The invention provides a method for designing a network structure in vehicle detection training, which comprises the following steps: S1, designing and calculating the loss function: S1.1, training with a secondary (two-stage) loss function; the first-stage loss value is calculated through cross entropy, and the first-stage loss function performs four-class target classification and fine adjustment of the two-point, four-value coordinates; the second-stage loss function value is calculated through a log-likelihood function, and the second-stage loss function performs binary classification (target or not) and fine adjustment of the two-point, four-value coordinates; S1.2, the fine-adjustment loss value is calculated through a 2-norm; S1.3, when calculating the loss function of the whole secondary network, the first-stage loss value is weighted 0.65 and the second-stage loss function value is weighted 0.35; within each stage, the classification loss value is weighted 0.4 and the coordinate fine-adjustment loss value is weighted 0.6; S2, designing the network structure corresponding to the secondary loss function: S2.1, a first-level network; S2.2, a second-level network.
Description
Technical Field
The invention relates to the field of neural networks, in particular to a method for designing a network structure in vehicle detection training.
Background
In today's society, neural network technology in the field of artificial intelligence is developing rapidly, and MTCNN is one of the more popular techniques of recent years. MTCNN, the multi-task convolutional neural network, performs face region detection and face keypoint detection together and can generally be divided into a three-layer cascade of networks: P-Net, R-Net and O-Net. This multi-task neural network model for the face detection task mainly uses three cascaded networks, combining candidate boxes with classifiers to perform fast and efficient face detection. The three cascaded networks are P-Net, which quickly generates candidate windows, R-Net, which filters and selects high-precision candidate windows, and O-Net, which generates the final bounding boxes and face keypoints.
However, MTCNN cascade detection has the following drawbacks:
1. Certain false detections exist, and the recall rate and accuracy are relatively low.
2. In particular, the network structure corresponding to the single-stage loss function used in the prior art converges easily for targets whose aspect ratio is close to 1, but does not converge easily for vehicles whose length and width differ greatly, resulting in low accuracy and recall.
In addition, the following commonly used technical terms are also included in the prior art:
1. network structure cascading: the mode that several detectors detect in series is called cascade.
2. Convolution kernel: the parameter used to perform operations between a matrix and the original image during image processing. The convolution kernel is typically a small matrix with a fixed number of rows and columns (for example a 3 × 3 matrix), each cell of which carries a weight value for the region it covers. Typical kernel shapes are 1 × 1, 3 × 3, 5 × 5, 7 × 7, 1 × 3, 3 × 1, 2 × 2, 1 × 5, 5 × 1, and so on.
3. Convolution: the center of the convolution kernel is placed on the pixel to be calculated, the product of each kernel element with the image pixel value it covers is computed, and the products are summed; the result is the new pixel value at that location. This process is called convolution.
4. Excitation function: a function that processes the convolved results.
5. Feature map: the result of the convolution calculation on the input data is called a feature map; the result of a fully connected layer is also called a feature map. The feature map size is typically expressed as length × width × depth, or 1 × depth.
6. Step length (stride): the distance by which the center position of the convolution kernel shifts between successive positions.
7. Two-end non-alignment processing: when an image or data is processed with a convolution kernel of size 3 × 3, if the data remaining at the borders is not enough for one full kernel placement, the data on both sides (or on one side) is discarded; this is called two-end non-alignment processing (see the sketch following this list).
8. Loss calculation cascade: a loss value is calculated at a certain node of the network structure and weighted into the overall loss; this way of calculating loss values is called loss calculation cascading.
9. Loss function: the loss function is also called the cost function. It is the objective function of neural network optimization; the process of training or optimizing a neural network is the process of minimizing the loss function (the smaller the loss value, the closer the predicted result is to the true result).
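As an illustrative sketch only (not part of the original disclosure; the function name, shapes and the NumPy dependency are our assumptions), the following code shows how terms 3, 6 and 7 interact: the kernel slides over the input with a given step size, and border data that cannot be covered by a full kernel placement is simply discarded.

```python
import numpy as np

def conv2d_valid(image, kernel, stride=1):
    """Minimal 'valid' convolution: border positions that cannot be fully
    covered by the kernel are dropped (two-end non-alignment processing)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1   # output height after discarding the border
    ow = (iw - kw) // stride + 1   # output width
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)   # element-wise products, summed
    return out

# A 47 x 47 input with a 3 x 3 kernel and step size 2 yields a 23 x 23 feature map,
# matching the first layer of the networks described later in this document.
feature = conv2d_valid(np.random.rand(47, 47), np.random.rand(3, 3), stride=2)
print(feature.shape)  # (23, 23)
```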
Disclosure of Invention
In order to solve the problems in the prior art, the invention aims to design a secondary (two-level) network structure with the method, overcome the limitation of the single-stage network loss function used in the prior art, and improve accuracy and recall.
Specifically, the invention provides a method for designing a network structure in vehicle detection training, which comprises the following steps:
s1, designing a loss function, and calculating the loss function:
s1.1, training by adopting a secondary loss function: the first-stage loss value is calculated through cross entropy, and the first-stage loss function performs four-class target classification and fine adjustment of the two-point, four-value coordinates; the second-stage loss function value is calculated through a log-likelihood function, and the second-stage loss function performs binary classification (target or not) and fine adjustment of the two-point, four-value coordinates;
s1.2, the loss value of the fine adjustment is calculated through a 2-norm;
s1.3, when calculating the loss function of the whole secondary network, the first-stage loss value is weighted 0.65, and the second-stage loss function value is weighted 0.35; within each stage, the classification loss value is weighted 0.4, and the coordinate fine-adjustment loss value is weighted 0.6;
s2, designing a network structure corresponding to the secondary loss function:
s2.1, a first-level network;
s2.2, second-level network.
The cross-entropy calculation in S1.1 is performed by the cross-entropy cost function C = -(1/n) Σ_x [y·ln a + (1 - y)·ln(1 - a)], where n is the number of training data, the sum runs over all training inputs x, y is the desired output, and a is the actual output.
The log-likelihood calculation described in S1.1 is performed by the log-likelihood function C = -Σ_k y_k·ln a_k, where a_k represents the output value of the k-th neuron and y_k represents the true value corresponding to the k-th neuron, taking the value 0 or 1.
The 2-norm calculation described in S1.2 takes the square root of the sum of the squares of the absolute values of the vector elements: ||x||_2 = (Σ_i |x_i|^2)^(1/2).
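Putting S1.1 through S1.3 together (the symbols L_cls and L_reg below are ours, introduced only for readability), the overall loss of the secondary network can be written as:

L_total = 0.65 × (0.4 × L_cls1 + 0.6 × L_reg1) + 0.35 × (0.4 × L_cls2 + 0.6 × L_reg2),

where L_cls1 is the cross-entropy loss of the four-class target classification, L_cls2 is the log-likelihood loss of the binary (target / non-target) classification, and L_reg1 and L_reg2 are the 2-norm losses of the two-point, four-value coordinate fine adjustment in the respective stages.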
the first-stage network of S2.1 is:
the first-layer input data is 47 × 47 × 1 (a grayscale image); the convolution kernel size is 3 × 3, the step size is 2, the output depth is 16, and the output result is the feature map (1) of 23 × 23 × 16;
the second-layer input data is the feature map (1) of 23 × 23 × 16; the convolution kernel size is 3 × 3, the step size is 2, the output depth is 16, and the output result is the feature map (2) of 11 × 11 × 16;
the third-layer input data is the feature map (2) of 11 × 11 × 16; the convolution kernel size is 3 × 3, the step size is 2, the output depth is 16, and the output result is the feature map (3) of 5 × 5 × 16;
the fourth-layer input data is the feature map (3) of 5 × 5 × 16; the convolution kernel size is 3 × 3, the step size is 1, the output depth is 16, and the output result is the feature map (6) of 3 × 3 × 16;
the fifth-layer input data is the feature map (3) of 5 × 5 × 16; the values at the two ends of the feature map in the width direction are removed to obtain the feature map (4) of 5 × 3 × 16;
the sixth-layer input data is the feature map (3) of 5 × 5 × 16; the values at the two ends of the feature map in the height direction are removed to obtain the feature map (5) of 3 × 5 × 16;
the seventh-layer input data is the feature map (4) of 5 × 3 × 16; the convolution kernel size is 3 × 1, the step size is 1, the output depth is 16, and the output result is the feature map (7) of 3 × 3 × 16;
the eighth-layer input data is the feature map (5) of 3 × 5 × 16; the convolution kernel size is 1 × 3, the step size is 1, the output depth is 16, and the output result is the feature map (8) of 3 × 3 × 16;
the ninth-layer input data is the feature map (6) of 3 × 3 × 16; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4;
the tenth-layer input data is the feature map (7) of 3 × 3 × 16; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4;
the eleventh-layer input data is the feature map (8) of 3 × 3 × 16; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4;
the input of the twelfth layer is obtained by combining the results of the ninth, tenth and eleventh layers into the feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12;
the twelfth-layer input data are the feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12; the convolution kernel sizes are 1 × 1 and 1 × 1, the step size is 1, the output depths are 1 and 4, and the output results are the feature map (11) of 1 × 1 × 1 and the feature map (12) of 1 × 1 × 4;
all convolutions use two-end non-alignment processing.
Wherein the feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12 are used as the predicted values of the first-stage loss function, and the loss function value is calculated from these predicted values and the labeled true values; the feature map (11) of 1 × 1 × 1 and the feature map (12) of 1 × 1 × 4 are used as the predicted values of the second-stage loss function, and the loss function value is calculated from these predicted values and the labeled true values.
The S2.2 second-level network is:
the initial-layer input data is 49 × 49 × 1 (a grayscale image); the convolution kernel size is 3 × 3, the step size is 1, the output depth is 16, and the output result is the feature map (0) of 47 × 47 × 16;
the first-layer input data is the feature map (0) of 47 × 47 × 16; the convolution kernel size is 3 × 3, the step size is 2, the output depth is 32, and the output result is the feature map (1) of 23 × 23 × 32;
the second-layer input data is the feature map (1) of 23 × 23 × 32; the convolution kernel size is 3 × 3, the step size is 2, the output depth is 64, and the output result is the feature map (2) of 11 × 11 × 64;
the third-layer input data is the feature map (2) of 11 × 11 × 64; the convolution kernel size is 3 × 3, the step size is 2, the output depth is 64, and the output result is the feature map (3) of 5 × 5 × 64;
the fourth-layer input data is the feature map (3) of 5 × 5 × 64; the convolution kernel size is 3 × 3, the step size is 1, the output depth is 64, and the output result is the feature map (6) of 3 × 3 × 64;
the fifth-layer input data is the feature map (3) of 5 × 5 × 64; the values at the two ends of the feature map in the width direction are removed to obtain the feature map (4) of 5 × 3 × 64;
the sixth-layer input data is the feature map (3) of 5 × 5 × 64; the values at the two ends of the feature map in the height direction are removed to obtain the feature map (5) of 3 × 5 × 64;
the seventh-layer input data is the feature map (4) of 5 × 3 × 64; the convolution kernel size is 3 × 1, the step size is 1, the output depth is 64, and the output result is the feature map (7) of 3 × 3 × 64;
the eighth-layer input data is the feature map (5) of 3 × 5 × 64; the convolution kernel size is 1 × 3, the step size is 1, the output depth is 64, and the output result is the feature map (8) of 3 × 3 × 64;
the ninth-layer input data is the feature map (6) of 3 × 3 × 64; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4;
the tenth-layer input data is the feature map (7) of 3 × 3 × 64; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4;
the eleventh-layer input data is the feature map (8) of 3 × 3 × 64; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4;
the input of the twelfth layer is obtained by combining the results of the ninth, tenth and eleventh layers into the feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12;
the twelfth-layer input data are the feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12; the convolution kernel sizes are 1 × 1 and 1 × 1, the step size is 1, the output depths are 1 and 4, and the output results are the feature map (11) of 1 × 1 × 1 and the feature map (12) of 1 × 1 × 4;
all convolutions use two-end non-alignment processing.
Wherein the feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12 are used as the predicted values of the first-stage loss function, and the loss function value is calculated from these predicted values and the labeled true values; the feature map (11) of 1 × 1 × 1 and the feature map (12) of 1 × 1 × 4 are used as the predicted values of the second-stage loss function, and the loss function value is calculated from these predicted values and the labeled true values.
Thus, the present application has the advantages that: the method is simple, the accuracy in vehicle detection training is improved by designing the calculation method of the secondary loss function and the network structure corresponding to the secondary loss function, the structure is simple, the operation is convenient, and the cost is saved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention.
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a schematic diagram of a first level network structure in the method of the present invention.
Fig. 3 is a schematic diagram of a second level network architecture in the method of the present invention.
FIG. 4 is a schematic illustration of the first class of vehicle target used in sample preparation for the method of the present invention.
FIG. 5 is a schematic illustration of the second class of vehicle target used in sample preparation for the method of the present invention.
FIG. 6 is a schematic illustration of the third class of vehicle target used in sample preparation for the method of the present invention.
Detailed Description
In order that the technical contents and advantages of the present invention can be more clearly understood, the present invention will now be described in further detail with reference to the accompanying drawings.
As shown in fig. 1, the present invention relates to a method for designing a network structure in vehicle detection training, the method comprising the steps of:
s1, designing a loss function, and calculating the loss function:
s1.1, training by adopting a secondary loss function: the first-stage loss value is calculated through cross entropy, and the first-stage loss function performs four-class target classification and fine adjustment of the two-point, four-value coordinates; the second-stage loss function value is calculated through a log-likelihood function, and the second-stage loss function performs binary classification (target or not) and fine adjustment of the two-point, four-value coordinates;
s1.2, the loss value of the fine adjustment is calculated through a 2-norm;
s1.3, when calculating the loss function of the whole secondary network, the first-stage loss value is weighted 0.65, and the second-stage loss function value is weighted 0.35; within each stage, the classification loss value is weighted 0.4, and the coordinate fine-adjustment loss value is weighted 0.6;
s2, designing a network structure corresponding to the secondary loss function:
s2.1, a first-level network;
s2.2, second-level network.
Further, the method also comprises the following steps:
1. calculation of a loss function
The first-stage loss value is calculated by cross entropy, the loss function value of the second-stage classification is calculated by a log-likelihood function, and the loss value of the fine adjustment is calculated by a 2-norm. When the loss function of the whole network is calculated, the first-stage loss value is weighted 0.65 and the second-stage loss value 0.35. Within each stage, the classification loss value is weighted 0.4 and the coordinate fine-adjustment loss value 0.6.
The cross-entropy cost function is C = -(1/n) Σ_x [y·ln a + (1 - y)·ln(1 - a)], where n is the number of training data, the sum runs over all training inputs x, y is the desired output, and a is the actual output. The cross-entropy cost function is derived from the concept of entropy in information theory and is a cost function commonly used in current neural network classification problems (such as image classification). It has a good interpretation for classification: when the output for the correct class (the value after the output layer applies the softmax function) is close to 1, the label corresponding to the correct class is 1, i.e., y = 1; then the first term in C is close to 0 and the second term equals 0. For an incorrect class, a is close to 0 and y = 0, so the first term in C is 0 and the second term is close to 0. Hence C is ultimately close to 0. The larger the difference between the output for the correct class and 1, the larger the value of C.
Log-likelihood function: c ═ Σkyklog akWherein a iskRepresents the output value of the kth neuron, ykAnd the real value corresponding to the kth neuron is represented, and the value is 0 or 1. The log-likelihood function is similar to the cross-entropy cost function, but only accounts for the loss of the correct class, not the loss of the wrong class. Like the cross-entropy cost function, log-likelihood also has a good explanation for classification: when the output value a (the value after the output layer uses softmax only) of the correct class is close to 1, y is 1, and C is close to 0; the larger the output value a is from the distance a, the larger the value of C.
Softmax function: a_j^L = e^(z_j^L) / Σ_k e^(z_k^L), where z_j^L represents the input to the j-th neuron of layer L (usually the last layer), a_j^L represents the output of the j-th neuron of layer L, e represents the natural constant, and the denominator Σ_k e^(z_k^L) sums over the inputs of all neurons of layer L. The most notable characteristic of the softmax function is that it takes, for each neuron, the ratio of its (exponentiated) input to the sum over all neurons of the current layer as that neuron's output. This makes the output easy to interpret: the larger a neuron's output value, the higher the probability that the class corresponding to that neuron is the true class.
2-norm: a norm is a function with the notion of "length". The 2-norm is the Euclidean norm (commonly used to compute vector length), i.e., the square root of the sum of the squares of the absolute values of the vector elements: ||x||_2 = (Σ_i |x_i|^2)^(1/2).
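As an illustrative sketch only (the function and variable names, the use of NumPy, and the assumption that the classification outputs have already been passed through softmax are ours, not the patent's), the cost terms and the 0.65/0.35 and 0.4/0.6 weights described above could be combined as follows:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))                # subtract the max for numerical stability
    return e / np.sum(e)

def cross_entropy(y, a, eps=1e-12):
    # C = -(1/n) * sum_x [ y*ln(a) + (1 - y)*ln(1 - a) ]
    a = np.clip(a, eps, 1.0 - eps)
    return -np.mean(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))

def log_likelihood(y, a, eps=1e-12):
    # C = -sum_k y_k * ln(a_k): only the correct class contributes to the loss
    return -np.sum(y * np.log(np.clip(a, eps, 1.0)))

def l2_loss(t, p):
    # 2-norm of the coordinate fine-adjustment error (two points, four values)
    return np.linalg.norm(np.asarray(t) - np.asarray(p))

def total_loss(a_cls1, y_cls1, p_reg1, t_reg1,
               a_cls2, y_cls2, p_reg2, t_reg2):
    """a_* are softmax-activated classification outputs, y_* the labels,
    p_* / t_* the predicted and labeled two-point, four-value coordinates."""
    stage1 = 0.4 * cross_entropy(np.asarray(y_cls1), np.asarray(a_cls1)) \
           + 0.6 * l2_loss(t_reg1, p_reg1)      # four-class target classification
    stage2 = 0.4 * log_likelihood(np.asarray(y_cls2), np.asarray(a_cls2)) \
           + 0.6 * l2_loss(t_reg2, p_reg2)      # target / non-target classification
    return 0.65 * stage1 + 0.35 * stage2        # weighting from S1.3
```

The returned scalar is the quantity that training minimizes, in exactly the sense of term 9 of the glossary above.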
2. network architecture
1) First level network, as shown in fig. 2:
The first-layer input data is 47 × 47 × 1 (a grayscale image); the convolution kernel size is 3 × 3, the step size is 2, the output depth is 16, and the output result is the feature map (1) of 23 × 23 × 16.
The second-layer input data is the feature map (1) of 23 × 23 × 16; the convolution kernel size is 3 × 3, the step size is 2, the output depth is 16, and the output result is the feature map (2) of 11 × 11 × 16.
The third-layer input data is the feature map (2) of 11 × 11 × 16; the convolution kernel size is 3 × 3, the step size is 2, the output depth is 16, and the output result is the feature map (3) of 5 × 5 × 16.
The fourth-layer input data is the feature map (3) of 5 × 5 × 16; the convolution kernel size is 3 × 3, the step size is 1, the output depth is 16, and the output result is the feature map (6) of 3 × 3 × 16.
The fifth-layer input data is the feature map (3) of 5 × 5 × 16; the values at the two ends of the feature map in the width direction are removed to obtain the feature map (4) of 5 × 3 × 16.
The sixth-layer input data is the feature map (3) of 5 × 5 × 16; the values at the two ends of the feature map in the height direction are removed to obtain the feature map (5) of 3 × 5 × 16.
The seventh-layer input data is the feature map (4) of 5 × 3 × 16; the convolution kernel size is 3 × 1, the step size is 1, the output depth is 16, and the output result is the feature map (7) of 3 × 3 × 16.
The eighth-layer input data is the feature map (5) of 3 × 5 × 16; the convolution kernel size is 1 × 3, the step size is 1, the output depth is 16, and the output result is the feature map (8) of 3 × 3 × 16.
The ninth-layer input data is the feature map (6) of 3 × 3 × 16; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4.
The tenth-layer input data is the feature map (7) of 3 × 3 × 16; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4.
The eleventh-layer input data is the feature map (8) of 3 × 3 × 16; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4.
The input of the twelfth layer is obtained by combining the results of the ninth, tenth and eleventh layers into the feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12. The twelfth-layer input data are the feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12; the convolution kernel sizes are 1 × 1 and 1 × 1, the step size is 1, the output depths are 1 and 4, and the output results are the feature map (11) of 1 × 1 × 1 and the feature map (12) of 1 × 1 × 4. All convolutions use two-end non-alignment processing. The feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12 are used as the predicted values of the first-stage loss function of the first-level network, and the loss function value is calculated from these predicted values and the labeled true values. The feature map (11) of 1 × 1 × 1 and the feature map (12) of 1 × 1 × 4 are used as the predicted values of the second-stage loss function, and the loss function value is calculated from these predicted values and the labeled true values.
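For orientation only, the first-level structure just described (FIG. 2) can be sketched roughly as follows in PyTorch. The class and variable names, the NCHW dimension ordering, the choice of which spatial axis each crop removes, and the omission of excitation functions are all our assumptions, not the patent's; padding=0 corresponds to the two-end non-alignment processing.

```python
import torch
import torch.nn as nn

class FirstLevelNet(nn.Module):
    """Rough sketch of the twelve-layer first-level structure (depth 16 throughout);
    a hypothetical reconstruction, not reference code from the patent."""
    def __init__(self, ch=16):
        super().__init__()
        self.conv1 = nn.Conv2d(1, ch, 3, stride=2)    # 47x47 -> 23x23, feature map (1)
        self.conv2 = nn.Conv2d(ch, ch, 3, stride=2)   # 23x23 -> 11x11, feature map (2)
        self.conv3 = nn.Conv2d(ch, ch, 3, stride=2)   # 11x11 -> 5x5,   feature map (3)
        self.conv4 = nn.Conv2d(ch, ch, 3, stride=1)   # 5x5   -> 3x3,   feature map (6)
        self.conv7 = nn.Conv2d(ch, ch, (3, 1))        # feature map (4) 5x3 -> 3x3, feature map (7)
        self.conv8 = nn.Conv2d(ch, ch, (1, 3))        # feature map (5) 3x5 -> 3x3, feature map (8)
        # layers 9-11: each branch outputs 1 classification value and 4 coordinate values
        self.heads_cls = nn.ModuleList([nn.Conv2d(ch, 1, 3) for _ in range(3)])
        self.heads_reg = nn.ModuleList([nn.Conv2d(ch, 4, 3) for _ in range(3)])
        # layer 12: 1x1 convolutions over the concatenated branch outputs
        self.final_cls = nn.Conv2d(3, 1, 1)    # feature map (9)  1x1x3  -> feature map (11) 1x1x1
        self.final_reg = nn.Conv2d(12, 4, 1)   # feature map (10) 1x1x12 -> feature map (12) 1x1x4

    def forward(self, x):                      # x: N x 1 x 47 x 47 grayscale input
        f3 = self.conv3(self.conv2(self.conv1(x)))     # feature map (3): 5x5
        f6 = self.conv4(f3)                            # feature map (6): 3x3
        f4 = f3[:, :, :, 1:-1]                         # crop one value at each width end  -> 5x3
        f5 = f3[:, :, 1:-1, :]                         # crop one value at each height end -> 3x5
        f7, f8 = self.conv7(f4), self.conv8(f5)        # feature maps (7) and (8): 3x3
        branches = [f6, f7, f8]
        f9 = torch.cat([h(b) for h, b in zip(self.heads_cls, branches)], dim=1)   # 1x1x3
        f10 = torch.cat([h(b) for h, b in zip(self.heads_reg, branches)], dim=1)  # 1x1x12
        # feature maps (9)/(10) feed the first-stage loss; (11)/(12) feed the second stage
        return f9, f10, self.final_cls(f9), self.final_reg(f10)

outputs = FirstLevelNet()(torch.randn(1, 1, 47, 47))
print([tuple(t.shape) for t in outputs])  # [(1, 3, 1, 1), (1, 12, 1, 1), (1, 1, 1, 1), (1, 4, 1, 1)]
```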
2) Second level network, as shown in fig. 3:
the initial-layer input data is 49 × 49 × 1 (a grayscale image); the convolution kernel size is 3 × 3, the step size is 1, the output depth is 16, and the output result is the feature map (0) of 47 × 47 × 16;
the first-layer input data is the feature map (0) of 47 × 47 × 16; the convolution kernel size is 3 × 3, the step size is 2, the output depth is 32, and the output result is the feature map (1) of 23 × 23 × 32;
the second-layer input data is the feature map (1) of 23 × 23 × 32; the convolution kernel size is 3 × 3, the step size is 2, the output depth is 64, and the output result is the feature map (2) of 11 × 11 × 64;
the third-layer input data is the feature map (2) of 11 × 11 × 64; the convolution kernel size is 3 × 3, the step size is 2, the output depth is 64, and the output result is the feature map (3) of 5 × 5 × 64;
the fourth-layer input data is the feature map (3) of 5 × 5 × 64; the convolution kernel size is 3 × 3, the step size is 1, the output depth is 64, and the output result is the feature map (6) of 3 × 3 × 64;
the fifth-layer input data is the feature map (3) of 5 × 5 × 64; the values at the two ends of the feature map in the width direction are removed to obtain the feature map (4) of 5 × 3 × 64;
the sixth-layer input data is the feature map (3) of 5 × 5 × 64; the values at the two ends of the feature map in the height direction are removed to obtain the feature map (5) of 3 × 5 × 64;
the seventh-layer input data is the feature map (4) of 5 × 3 × 64; the convolution kernel size is 3 × 1, the step size is 1, the output depth is 64, and the output result is the feature map (7) of 3 × 3 × 64;
the eighth-layer input data is the feature map (5) of 3 × 5 × 64; the convolution kernel size is 1 × 3, the step size is 1, the output depth is 64, and the output result is the feature map (8) of 3 × 3 × 64;
the ninth-layer input data is the feature map (6) of 3 × 3 × 64; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4;
the tenth-layer input data is the feature map (7) of 3 × 3 × 64; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4;
the eleventh-layer input data is the feature map (8) of 3 × 3 × 64; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4; the input of the twelfth layer is obtained by combining the results of the ninth, tenth and eleventh layers into the feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12;
the twelfth-layer input data are the feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12; the convolution kernel sizes are 1 × 1 and 1 × 1, the step size is 1, the output depths are 1 and 4, and the output results are the feature map (11) of 1 × 1 × 1 and the feature map (12) of 1 × 1 × 4;
all convolutions use two-end non-alignment processing.
Wherein the feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12 are used as the predicted values of the first-stage loss function of the second-level network, and the loss function value is calculated from these predicted values and the labeled true values; the feature map (11) of 1 × 1 × 1 and the feature map (12) of 1 × 1 × 4 are used as the predicted values of the second-stage loss function, and the loss function value is calculated from these predicted values and the labeled true values.
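Under the same assumptions as the first-level sketch above, the second-level structure differs only in the 49 × 49 input, the added initial layer (feature map (0)) and the wider depths; a minimal, equally hypothetical variant is:

```python
class SecondLevelNet(FirstLevelNet):
    """Hypothetical second-level variant: 49x49 input, an extra stem convolution
    producing feature map (0), and depths widened to 32/64 as described above."""
    def __init__(self):
        super().__init__(ch=64)                       # branch and head depths of 64
        self.stem = nn.Conv2d(1, 16, 3, stride=1)     # 49x49 -> 47x47, feature map (0)
        self.conv1 = nn.Conv2d(16, 32, 3, stride=2)   # 47x47 -> 23x23, feature map (1)
        self.conv2 = nn.Conv2d(32, 64, 3, stride=2)   # 23x23 -> 11x11, feature map (2)

    def forward(self, x):                             # x: N x 1 x 49 x 49 grayscale input
        return super().forward(self.stem(x))

outputs = SecondLevelNet()(torch.randn(1, 1, 49, 49))
print([tuple(t.shape) for t in outputs])   # same four output shapes as the first-level network
```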
Because vehicle postures are diverse and the aspect ratio is arbitrary (the variation is large but falls within a certain range), a secondary loss function is adopted for training: the first-stage loss function performs four-class target classification and fine adjustment of the two-point, four-value coordinates, and the second-stage loss function performs binary classification (judging whether the candidate is a target) and fine adjustment of the two-point, four-value coordinates.
The method also relates to the preparation and training of the sample, which comprises the following steps:
1) Sample preparation and labeling: the minimum enclosing rectangle of the vehicle is taken as the labeling target, and all vehicles in each figure are labeled.
2) Labeling of training samples. Vehicle targets are divided into three classes according to aspect ratio: when the aspect ratio lies in the first specified range, the target is defined as a first-class vehicle target, the first-stage loss label is [1, 0, 0] and the second-stage loss label is 1; when the aspect ratio lies in the second specified range, the target is defined as a second-class vehicle target, the first-stage loss label is [0, 1, 0] and the second-stage loss label is 1; when the aspect ratio lies in the third specified range, the target is defined as a third-class vehicle target, the first-stage loss label is [0, 0, 1] and the second-stage loss label is 1. These three classes of vehicles are shown in FIGS. 4-6 and are all positive samples; negative samples contain no vehicle, their first-stage loss label is [0, 0, 0] and their second-stage loss label is 0. Three classes of vehicles plus one negative class give four classes in total.
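The aspect-ratio conditions that separate the three vehicle classes are given by the formulas referenced above and are not reproduced here. Purely to illustrate the labeling scheme (the function name and the thresholds t1 and t2 are hypothetical placeholders standing in for those formulas), the labels could be assigned as follows:

```python
def make_labels(width, height, is_vehicle, t1, t2):
    """Return (first-stage label, second-stage label) for one sample.
    t1 < t2 are placeholder aspect-ratio thresholds, not values from the patent."""
    if not is_vehicle:
        return [0, 0, 0], 0          # negative sample: no vehicle
    ratio = width / height           # aspect ratio of the labeled minimum enclosing rectangle
    if ratio < t1:
        return [1, 0, 0], 1          # first-class vehicle target
    elif ratio < t2:
        return [0, 1, 0], 1          # second-class vehicle target
    return [0, 0, 1], 1              # third-class vehicle target
```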
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes may be made to the embodiment of the present invention by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (8)
1. A method for designing a network structure in vehicle detection training, the method comprising the steps of:
s1, designing a loss function, and calculating the loss function:
s1.1, training by adopting a secondary loss function: the first-stage loss value is calculated through cross entropy, and the first-stage loss function performs four-class target classification and fine adjustment of the two-point, four-value coordinates; the second-stage loss function value is calculated through a log-likelihood function, and the second-stage loss function performs binary classification (target or not) and fine adjustment of the two-point, four-value coordinates;
s1.2, the loss value of the fine adjustment is calculated through a 2-norm;
s1.3, when calculating the loss function of the whole secondary network, the first-stage loss value is weighted 0.65, and the second-stage loss function value is weighted 0.35; within each stage, the classification loss value is weighted 0.4, and the coordinate fine-adjustment loss value is weighted 0.6;
s2, designing a network structure corresponding to the secondary loss function:
s2.1, a first-level network;
s2.2, second-level network.
3. The method as claimed in claim 1, wherein the log-likelihood calculation described in S1.1 is performed by the log-likelihood function C = -Σ_k y_k·ln a_k, where a_k represents the output value of the k-th neuron and y_k represents the true value corresponding to the k-th neuron, taking the value 0 or 1.
5. the method for designing the network structure in the vehicle detection training as claimed in claim 1, wherein the first-stage network of S2.1 is:
the first-layer input data is 47 × 47 × 1 (a grayscale image); the convolution kernel size is 3 × 3, the step size is 2, the output depth is 16, and the output result is the feature map (1) of 23 × 23 × 16;
the second-layer input data is the feature map (1) of 23 × 23 × 16; the convolution kernel size is 3 × 3, the step size is 2, the output depth is 16, and the output result is the feature map (2) of 11 × 11 × 16;
the third-layer input data is the feature map (2) of 11 × 11 × 16; the convolution kernel size is 3 × 3, the step size is 2, the output depth is 16, and the output result is the feature map (3) of 5 × 5 × 16;
the fourth-layer input data is the feature map (3) of 5 × 5 × 16; the convolution kernel size is 3 × 3, the step size is 1, the output depth is 16, and the output result is the feature map (6) of 3 × 3 × 16;
the fifth-layer input data is the feature map (3) of 5 × 5 × 16; the values at the two ends of the feature map in the width direction are removed to obtain the feature map (4) of 5 × 3 × 16;
the sixth-layer input data is the feature map (3) of 5 × 5 × 16; the values at the two ends of the feature map in the height direction are removed to obtain the feature map (5) of 3 × 5 × 16;
the seventh-layer input data is the feature map (4) of 5 × 3 × 16; the convolution kernel size is 3 × 1, the step size is 1, the output depth is 16, and the output result is the feature map (7) of 3 × 3 × 16;
the eighth-layer input data is the feature map (5) of 3 × 5 × 16; the convolution kernel size is 1 × 3, the step size is 1, the output depth is 16, and the output result is the feature map (8) of 3 × 3 × 16;
the ninth-layer input data is the feature map (6) of 3 × 3 × 16; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4;
the tenth-layer input data is the feature map (7) of 3 × 3 × 16; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4;
the eleventh-layer input data is the feature map (8) of 3 × 3 × 16; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4;
the input of the twelfth layer is obtained by combining the results of the ninth, tenth and eleventh layers into the feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12;
the twelfth-layer input data are the feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12; the convolution kernel sizes are 1 × 1 and 1 × 1, the step size is 1, the output depths are 1 and 4, and the output results are the feature map (11) of 1 × 1 × 1 and the feature map (12) of 1 × 1 × 4;
all convolutions use a two-end non-alignment process.
6. The method of claim 5, wherein the network structure is a network structure for vehicle detection training,
the feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12 are used as the predicted values of the first-stage loss function, and the loss function value is calculated from these predicted values and the labeled true values;
the feature map (11) of 1 × 1 × 1 and the feature map (12) of 1 × 1 × 4 are used as the predicted values of the second-stage loss function, and the loss function value is calculated from these predicted values and the labeled true values.
7. The method for designing the network structure in the vehicle detection training as claimed in claim 1, wherein the S2.2 second-level network is:
the initial-layer input data is 49 × 49 × 1 (a grayscale image); the convolution kernel size is 3 × 3, the step size is 1, the output depth is 16, and the output result is the feature map (0) of 47 × 47 × 16;
the first-layer input data is the feature map (0) of 47 × 47 × 16; the convolution kernel size is 3 × 3, the step size is 2, the output depth is 32, and the output result is the feature map (1) of 23 × 23 × 32;
the second-layer input data is the feature map (1) of 23 × 23 × 32; the convolution kernel size is 3 × 3, the step size is 2, the output depth is 64, and the output result is the feature map (2) of 11 × 11 × 64;
the third-layer input data is the feature map (2) of 11 × 11 × 64; the convolution kernel size is 3 × 3, the step size is 2, the output depth is 64, and the output result is the feature map (3) of 5 × 5 × 64;
the fourth-layer input data is the feature map (3) of 5 × 5 × 64; the convolution kernel size is 3 × 3, the step size is 1, the output depth is 64, and the output result is the feature map (6) of 3 × 3 × 64;
the fifth-layer input data is the feature map (3) of 5 × 5 × 64; the values at the two ends of the feature map in the width direction are removed to obtain the feature map (4) of 5 × 3 × 64;
the sixth-layer input data is the feature map (3) of 5 × 5 × 64; the values at the two ends of the feature map in the height direction are removed to obtain the feature map (5) of 3 × 5 × 64;
the seventh-layer input data is the feature map (4) of 5 × 3 × 64; the convolution kernel size is 3 × 1, the step size is 1, the output depth is 64, and the output result is the feature map (7) of 3 × 3 × 64;
the eighth-layer input data is the feature map (5) of 3 × 5 × 64; the convolution kernel size is 1 × 3, the step size is 1, the output depth is 64, and the output result is the feature map (8) of 3 × 3 × 64;
the ninth-layer input data is the feature map (6) of 3 × 3 × 64; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4;
the tenth-layer input data is the feature map (7) of 3 × 3 × 64; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4;
the eleventh-layer input data is the feature map (8) of 3 × 3 × 64; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4;
the input of the twelfth layer is obtained by combining the results of the ninth, tenth and eleventh layers into the feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12;
the twelfth-layer input data are the feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12; the convolution kernel sizes are 1 × 1 and 1 × 1, the step size is 1, the output depths are 1 and 4, and the output results are the feature map (11) of 1 × 1 × 1 and the feature map (12) of 1 × 1 × 4;
all convolutions use a two-end non-alignment process.
8. The method of claim 7, wherein the network structure is a network structure for vehicle detection training,
the feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12 are used as the predicted values of the first-stage loss function, and the loss function value is calculated from these predicted values and the labeled true values;
the feature map (11) of 1 × 1 × 1 and the feature map (12) of 1 × 1 × 4 are used as the predicted values of the second-stage loss function, and the loss function value is calculated from these predicted values and the labeled true values.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010400154.5A CN113673667B (en) | 2020-05-13 | 2020-05-13 | Design method of network structure in vehicle detection training |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010400154.5A CN113673667B (en) | 2020-05-13 | 2020-05-13 | Design method of network structure in vehicle detection training |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN113673667A true CN113673667A (en) | 2021-11-19 |
| CN113673667B CN113673667B (en) | 2024-08-02 |
Family
ID=78536936
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010400154.5A Active CN113673667B (en) | 2020-05-13 | 2020-05-13 | Design method of network structure in vehicle detection training |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN113673667B (en) |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20190095780A1 (en) * | 2017-08-18 | 2019-03-28 | Beijing Sensetime Technology Development Co., Ltd | Method and apparatus for generating neural network structure, electronic device, and storage medium |
| CN109558803A (en) * | 2018-11-01 | 2019-04-02 | 西安电子科技大学 | SAR target discrimination method based on convolutional neural networks Yu NP criterion |
| CN109961107A (en) * | 2019-04-18 | 2019-07-02 | 北京迈格威科技有限公司 | Training method, device, electronic device and storage medium for target detection model |
| WO2019136591A1 (en) * | 2018-01-09 | 2019-07-18 | 深圳大学 | Salient object detection method and system for weak supervision-based spatio-temporal cascade neural network |
-
2020
- 2020-05-13 CN CN202010400154.5A patent/CN113673667B/en active Active
Patent Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20190095780A1 (en) * | 2017-08-18 | 2019-03-28 | Beijing Sensetime Technology Development Co., Ltd | Method and apparatus for generating neural network structure, electronic device, and storage medium |
| WO2019136591A1 (en) * | 2018-01-09 | 2019-07-18 | 深圳大学 | Salient object detection method and system for weak supervision-based spatio-temporal cascade neural network |
| CN109558803A (en) * | 2018-11-01 | 2019-04-02 | 西安电子科技大学 | SAR target discrimination method based on convolutional neural networks Yu NP criterion |
| CN109961107A (en) * | 2019-04-18 | 2019-07-02 | 北京迈格威科技有限公司 | Training method, device, electronic device and storage medium for target detection model |
Also Published As
| Publication number | Publication date |
|---|---|
| CN113673667B (en) | 2024-08-02 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN111898621B (en) | A Contour Shape Recognition Method | |
| CN108960127B (en) | Re-identification of occluded pedestrians based on adaptive deep metric learning | |
| US5048100A (en) | Self organizing neural network method and system for general classification of patterns | |
| CN111476219A (en) | Image object detection method in smart home environment | |
| CN112232151B (en) | Iterative polymerization neural network high-resolution remote sensing scene classification method embedded with attention mechanism | |
| CN111191583A (en) | Spatial target recognition system and method based on convolutional neural network | |
| CN112819039A (en) | Texture recognition model establishing method based on multi-scale integrated feature coding and application | |
| Ying | Analytical analysis and feedback linearization tracking control of the general Takagi-Sugeno fuzzy dynamic systems | |
| CN108564097A (en) | A kind of multiscale target detection method based on depth convolutional neural networks | |
| CN115496928A (en) | Multi-modal image feature matching method based on multi-feature matching | |
| CN109446894B (en) | A Multispectral Image Change Detection Method Based on Probabilistic Segmentation and Gaussian Mixture Clustering | |
| Lin et al. | Determination of the varieties of rice kernels based on machine vision and deep learning technology | |
| CN119888729A (en) | YOLOv11 improved cell instance segmentation method and system | |
| CN115761502A (en) | SAR Image Change Detection Method Based on Hybrid Convolution | |
| CN111325259A (en) | Remote sensing image classification method based on deep learning and binary coding | |
| CN111523342A (en) | Two-dimensional code detection and correction method in complex scene | |
| CN115565009A (en) | Electronic component classification method based on deep denoising sparse self-encoder and ISSVM | |
| Cho | Ensemble of structure-adaptive self-organizing maps for high performance classification | |
| Chhabra et al. | High-order statistically derived combinations of geometric features for handprinted character recognition | |
| CN113673667B (en) | Design method of network structure in vehicle detection training | |
| CN113673543B (en) | Method for two-stage calculation of loss function in vehicle detection training | |
| CN120182974A (en) | An AI-assisted image annotation method based on semi-supervised learning | |
| CN113673271B (en) | Double-layer labeling calculation method for secondary loss based on pet detection | |
| CN111160372A (en) | A large target recognition method based on high-speed convolutional neural network | |
| CN113673668B (en) | A method for calculating the secondary loss function in vehicle detection training |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |