CN113673667A - Design method of network structure in vehicle detection training - Google Patents
Design method of network structure in vehicle detection training
- Publication number
- CN113673667A (application CN202010400154.5A)
- Authority
- CN
- China
- Prior art keywords
- feature map
- output
- value
- layer
- input data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
Abstract
The invention provides a method for designing a network structure in vehicle detection training, which comprises the following steps: S1, designing and calculating the loss function: S1.1, training with a secondary (two-stage) loss function; the first-stage loss value is calculated through cross entropy, and the first-stage loss function performs four-class target classification and fine adjustment of the two-point, four-value coordinates; the second-stage loss function value is calculated through a log-likelihood function, and the second-stage loss function performs binary classification (target or not) and fine adjustment of the two-point, four-value coordinates; S1.2, the fine-adjustment loss value is calculated through a 2-norm; S1.3, when calculating the loss function of the whole secondary network, the first-stage loss value is weighted 0.65 and the second-stage loss function value is weighted 0.35; within each stage, the classification loss value is weighted 0.4 and the coordinate fine-adjustment loss value is weighted 0.6; S2, designing the network structure corresponding to the secondary loss function: S2.1, a first-level network; S2.2, a second-level network.
Description
Technical Field
The invention relates to the field of neural networks, in particular to a method for designing a network structure in vehicle detection training.
Background
In today's society, neural network technology in the field of artificial intelligence is developing rapidly, and MTCNN is one of the more popular techniques of recent years. MTCNN, the multi-task convolutional neural network, performs face region detection and face keypoint detection together and can generally be divided into a three-layer cascade of networks: P-Net, R-Net and O-Net. This multi-task neural network model for the face detection task mainly uses three cascaded networks, combining candidate boxes with classifiers to perform fast and efficient face detection. The three cascaded networks are P-Net, which quickly generates candidate windows, R-Net, which filters and selects high-precision candidate windows, and O-Net, which generates the final bounding boxes and face keypoints.
However, MTCNN cascade detection has the following drawbacks:
1. Certain false detections exist, and the recall rate and accuracy are relatively low.
2. In particular, the network structure corresponding to the single-stage loss function used in the prior art converges easily for targets whose aspect ratio is close to 1, but does not converge easily for vehicles whose length and width differ greatly, resulting in low accuracy and recall.
In addition, the following commonly used technical terms are also included in the prior art:
1. network structure cascading: the mode that several detectors detect in series is called cascade.
2. Convolution kernel: the parameter used to perform operations between a matrix and the original image during image processing. The convolution kernel is typically a small matrix with a fixed number of rows and columns (for example a 3 × 3 matrix), each cell of which carries a weight value for the region it covers. Typical kernel shapes are 1 × 1, 3 × 3, 5 × 5, 7 × 7, 1 × 3, 3 × 1, 2 × 2, 1 × 5, 5 × 1, and so on.
3. Convolution: the center of the convolution kernel is placed on the pixel to be calculated, the product of each kernel element with the image pixel value it covers is computed, and the products are summed; the result is the new pixel value at that location. This process is called convolution.
4. Excitation function: a function that processes the convolved results.
5. Feature map: the result of the convolution calculation on the input data is called a feature map; the result of a fully connected layer is also called a feature map. The feature map size is typically expressed as length × width × depth, or 1 × depth.
6. Step length (stride): the distance by which the center position of the convolution kernel shifts between successive positions.
7. Two-end non-alignment processing: when an image or data is processed with a convolution kernel of size 3 × 3, if the data remaining at the borders is not enough for one full kernel placement, the data on both sides (or on one side) is discarded; this is called two-end non-alignment processing (see the sketch following this list).
8. Loss calculation cascade: a loss value is calculated at a certain node of the network structure and weighted into the overall loss; this way of calculating loss values is called loss calculation cascading.
9. Loss function: the loss function is also called the cost function. It is the objective function of neural network optimization; the process of training or optimizing a neural network is the process of minimizing the loss function (the smaller the loss value, the closer the predicted result is to the true result).
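As an illustrative sketch only (not part of the original disclosure; the function name, shapes and the NumPy dependency are our assumptions), the following code shows how terms 3, 6 and 7 interact: the kernel slides over the input with a given step size, and border data that cannot be covered by a full kernel placement is simply discarded.

```python
import numpy as np

def conv2d_valid(image, kernel, stride=1):
    """Minimal 'valid' convolution: border positions that cannot be fully
    covered by the kernel are dropped (two-end non-alignment processing)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1   # output height after discarding the border
    ow = (iw - kw) // stride + 1   # output width
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)   # element-wise products, summed
    return out

# A 47 x 47 input with a 3 x 3 kernel and step size 2 yields a 23 x 23 feature map,
# matching the first layer of the networks described later in this document.
feature = conv2d_valid(np.random.rand(47, 47), np.random.rand(3, 3), stride=2)
print(feature.shape)  # (23, 23)
```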
Disclosure of Invention
In order to solve the problems in the prior art, the invention aims to design a secondary (two-level) network structure with the method, overcome the limitation of the single-stage network loss function used in the prior art, and improve accuracy and recall.
Specifically, the invention provides a method for designing a network structure in vehicle detection training, which comprises the following steps:
s1, designing a loss function, and calculating the loss function:
s1.1, training by adopting a secondary loss function: the first-stage loss value is calculated through cross entropy, and the first-stage loss function performs four-class target classification and fine adjustment of the two-point, four-value coordinates; the second-stage loss function value is calculated through a log-likelihood function, and the second-stage loss function performs binary classification (target or not) and fine adjustment of the two-point, four-value coordinates;
s1.2, the loss value of the fine adjustment is calculated through a 2-norm;
s1.3, when calculating the loss function of the whole secondary network, the first-stage loss value is weighted 0.65, and the second-stage loss function value is weighted 0.35; within each stage, the classification loss value is weighted 0.4, and the coordinate fine-adjustment loss value is weighted 0.6;
s2, designing a network structure corresponding to the secondary loss function:
s2.1, a first-level network;
s2.2, second-level network.
The cross-entropy calculation in S1.1 is performed by the cross-entropy cost function C = -(1/n) Σ_x [y·ln a + (1 - y)·ln(1 - a)], where n is the number of training data, the sum runs over all training inputs x, y is the desired output, and a is the actual output.
The log-likelihood calculation described in S1.1 is performed by the log-likelihood function C = -Σ_k y_k·ln a_k, where a_k represents the output value of the k-th neuron and y_k represents the true value corresponding to the k-th neuron, taking the value 0 or 1.
The 2-norm calculation described in S1.2 takes the square root of the sum of the squares of the absolute values of the vector elements: ||x||_2 = (Σ_i |x_i|^2)^(1/2).
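Putting S1.1 through S1.3 together (the symbols L_cls and L_reg below are ours, introduced only for readability), the overall loss of the secondary network can be written as:

L_total = 0.65 × (0.4 × L_cls1 + 0.6 × L_reg1) + 0.35 × (0.4 × L_cls2 + 0.6 × L_reg2),

where L_cls1 is the cross-entropy loss of the four-class target classification, L_cls2 is the log-likelihood loss of the binary (target / non-target) classification, and L_reg1 and L_reg2 are the 2-norm losses of the two-point, four-value coordinate fine adjustment in the respective stages.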
the first-stage network of S2.1 is:
the first-layer input data is 47 × 47 × 1 (a grayscale image); the convolution kernel size is 3 × 3, the step size is 2, the output depth is 16, and the output result is the feature map (1) of 23 × 23 × 16;
the second-layer input data is the feature map (1) of 23 × 23 × 16; the convolution kernel size is 3 × 3, the step size is 2, the output depth is 16, and the output result is the feature map (2) of 11 × 11 × 16;
the third-layer input data is the feature map (2) of 11 × 11 × 16; the convolution kernel size is 3 × 3, the step size is 2, the output depth is 16, and the output result is the feature map (3) of 5 × 5 × 16;
the fourth-layer input data is the feature map (3) of 5 × 5 × 16; the convolution kernel size is 3 × 3, the step size is 1, the output depth is 16, and the output result is the feature map (6) of 3 × 3 × 16;
the fifth-layer input data is the feature map (3) of 5 × 5 × 16; the values at the two ends of the feature map in the width direction are removed to obtain the feature map (4) of 5 × 3 × 16;
the sixth-layer input data is the feature map (3) of 5 × 5 × 16; the values at the two ends of the feature map in the height direction are removed to obtain the feature map (5) of 3 × 5 × 16;
the seventh-layer input data is the feature map (4) of 5 × 3 × 16; the convolution kernel size is 3 × 1, the step size is 1, the output depth is 16, and the output result is the feature map (7) of 3 × 3 × 16;
the eighth-layer input data is the feature map (5) of 3 × 5 × 16; the convolution kernel size is 1 × 3, the step size is 1, the output depth is 16, and the output result is the feature map (8) of 3 × 3 × 16;
the ninth-layer input data is the feature map (6) of 3 × 3 × 16; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4;
the tenth-layer input data is the feature map (7) of 3 × 3 × 16; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4;
the eleventh-layer input data is the feature map (8) of 3 × 3 × 16; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4;
the input of the twelfth layer is obtained by combining the results of the ninth, tenth and eleventh layers into the feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12;
the twelfth-layer input data are the feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12; the convolution kernel sizes are 1 × 1 and 1 × 1, the step size is 1, the output depths are 1 and 4, and the output results are the feature map (11) of 1 × 1 × 1 and the feature map (12) of 1 × 1 × 4;
all convolutions use two-end non-alignment processing.
Wherein the feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12 are used as the predicted values of the first-stage loss function, and the loss function value is calculated from these predicted values and the labeled true values; the feature map (11) of 1 × 1 × 1 and the feature map (12) of 1 × 1 × 4 are used as the predicted values of the second-stage loss function, and the loss function value is calculated from these predicted values and the labeled true values.
The S2.2 second-level network is:
the initial-layer input data is 49 × 49 × 1 (a grayscale image); the convolution kernel size is 3 × 3, the step size is 1, the output depth is 16, and the output result is the feature map (0) of 47 × 47 × 16;
the first-layer input data is the feature map (0) of 47 × 47 × 16; the convolution kernel size is 3 × 3, the step size is 2, the output depth is 32, and the output result is the feature map (1) of 23 × 23 × 32;
the second-layer input data is the feature map (1) of 23 × 23 × 32; the convolution kernel size is 3 × 3, the step size is 2, the output depth is 64, and the output result is the feature map (2) of 11 × 11 × 64;
the third-layer input data is the feature map (2) of 11 × 11 × 64; the convolution kernel size is 3 × 3, the step size is 2, the output depth is 64, and the output result is the feature map (3) of 5 × 5 × 64;
the fourth-layer input data is the feature map (3) of 5 × 5 × 64; the convolution kernel size is 3 × 3, the step size is 1, the output depth is 64, and the output result is the feature map (6) of 3 × 3 × 64;
the fifth-layer input data is the feature map (3) of 5 × 5 × 64; the values at the two ends of the feature map in the width direction are removed to obtain the feature map (4) of 5 × 3 × 64;
the sixth-layer input data is the feature map (3) of 5 × 5 × 64; the values at the two ends of the feature map in the height direction are removed to obtain the feature map (5) of 3 × 5 × 64;
the seventh-layer input data is the feature map (4) of 5 × 3 × 64; the convolution kernel size is 3 × 1, the step size is 1, the output depth is 64, and the output result is the feature map (7) of 3 × 3 × 64;
the eighth-layer input data is the feature map (5) of 3 × 5 × 64; the convolution kernel size is 1 × 3, the step size is 1, the output depth is 64, and the output result is the feature map (8) of 3 × 3 × 64;
the ninth-layer input data is the feature map (6) of 3 × 3 × 64; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4;
the tenth-layer input data is the feature map (7) of 3 × 3 × 64; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4;
the eleventh-layer input data is the feature map (8) of 3 × 3 × 64; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4;
the input of the twelfth layer is obtained by combining the results of the ninth, tenth and eleventh layers into the feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12;
the twelfth-layer input data are the feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12; the convolution kernel sizes are 1 × 1 and 1 × 1, the step size is 1, the output depths are 1 and 4, and the output results are the feature map (11) of 1 × 1 × 1 and the feature map (12) of 1 × 1 × 4;
all convolutions use two-end non-alignment processing.
Wherein the feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12 are used as the predicted values of the first-stage loss function, and the loss function value is calculated from these predicted values and the labeled true values; the feature map (11) of 1 × 1 × 1 and the feature map (12) of 1 × 1 × 4 are used as the predicted values of the second-stage loss function, and the loss function value is calculated from these predicted values and the labeled true values.
Thus, the present application has the advantages that: the method is simple, the accuracy in vehicle detection training is improved by designing the calculation method of the secondary loss function and the network structure corresponding to the secondary loss function, the structure is simple, the operation is convenient, and the cost is saved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention.
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a schematic diagram of a first level network structure in the method of the present invention.
Fig. 3 is a schematic diagram of a second level network architecture in the method of the present invention.
FIG. 4 is a schematic illustration of the first class of vehicle target used in sample preparation for the method of the present invention.
FIG. 5 is a schematic illustration of the second class of vehicle target used in sample preparation for the method of the present invention.
FIG. 6 is a schematic illustration of the third class of vehicle target used in sample preparation for the method of the present invention.
Detailed Description
In order that the technical contents and advantages of the present invention can be more clearly understood, the present invention will now be described in further detail with reference to the accompanying drawings.
As shown in fig. 1, the present invention relates to a method for designing a network structure in vehicle detection training, the method comprising the steps of:
s1, designing a loss function, and calculating the loss function:
s1.1, training by adopting a secondary loss function: the first-stage loss value is calculated through cross entropy, and the first-stage loss function performs four-class target classification and fine adjustment of the two-point, four-value coordinates; the second-stage loss function value is calculated through a log-likelihood function, and the second-stage loss function performs binary classification (target or not) and fine adjustment of the two-point, four-value coordinates;
s1.2, the loss value of the fine adjustment is calculated through a 2-norm;
s1.3, when calculating the loss function of the whole secondary network, the first-stage loss value is weighted 0.65, and the second-stage loss function value is weighted 0.35; within each stage, the classification loss value is weighted 0.4, and the coordinate fine-adjustment loss value is weighted 0.6;
s2, designing a network structure corresponding to the secondary loss function:
s2.1, a first-level network;
s2.2, second-level network.
Further, the method also comprises the following steps:
1. calculation of a loss function
The first-stage loss value is calculated by cross entropy, the loss function value of the second-stage classification is calculated by a log-likelihood function, and the loss value of the fine adjustment is calculated by a 2-norm. When the loss function of the whole network is calculated, the first-stage loss value is weighted 0.65 and the second-stage loss value 0.35. Within each stage, the classification loss value is weighted 0.4 and the coordinate fine-adjustment loss value 0.6.
The cross-entropy cost function is C = -(1/n) Σ_x [y·ln a + (1 - y)·ln(1 - a)], where n is the number of training data, the sum runs over all training inputs x, y is the desired output, and a is the actual output. The cross-entropy cost function is derived from the concept of entropy in information theory and is a cost function commonly used in current neural network classification problems (such as image classification). It has a good interpretation for classification: when the output for the correct class (the value after the output layer applies the softmax function) is close to 1, the label corresponding to the correct class is 1, i.e., y = 1; then the first term in C is close to 0 and the second term equals 0. For an incorrect class, a is close to 0 and y = 0, so the first term in C is 0 and the second term is close to 0. Hence C is ultimately close to 0. The larger the difference between the output for the correct class and 1, the larger the value of C.
Log-likelihood function: c ═ Σkyklog akWherein a iskRepresents the output value of the kth neuron, ykAnd the real value corresponding to the kth neuron is represented, and the value is 0 or 1. The log-likelihood function is similar to the cross-entropy cost function, but only accounts for the loss of the correct class, not the loss of the wrong class. Like the cross-entropy cost function, log-likelihood also has a good explanation for classification: when the output value a (the value after the output layer uses softmax only) of the correct class is close to 1, y is 1, and C is close to 0; the larger the output value a is from the distance a, the larger the value of C.
Softmax function: a_j^L = e^(z_j^L) / Σ_k e^(z_k^L), where z_j^L represents the input to the j-th neuron of layer L (usually the last layer), a_j^L represents the output of the j-th neuron of layer L, e represents the natural constant, and the denominator Σ_k e^(z_k^L) sums over the inputs of all neurons of layer L. The most notable characteristic of the softmax function is that it takes, for each neuron, the ratio of its (exponentiated) input to the sum over all neurons of the current layer as that neuron's output. This makes the output easy to interpret: the larger a neuron's output value, the higher the probability that the class corresponding to that neuron is the true class.
2-norm: a norm is a function with the notion of "length". The 2-norm is the Euclidean norm (commonly used to compute vector length), i.e., the square root of the sum of the squares of the absolute values of the vector elements: ||x||_2 = (Σ_i |x_i|^2)^(1/2).
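As an illustrative sketch only (the function and variable names, the use of NumPy, and the assumption that the classification outputs have already been passed through softmax are ours, not the patent's), the cost terms and the 0.65/0.35 and 0.4/0.6 weights described above could be combined as follows:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))                # subtract the max for numerical stability
    return e / np.sum(e)

def cross_entropy(y, a, eps=1e-12):
    # C = -(1/n) * sum_x [ y*ln(a) + (1 - y)*ln(1 - a) ]
    a = np.clip(a, eps, 1.0 - eps)
    return -np.mean(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))

def log_likelihood(y, a, eps=1e-12):
    # C = -sum_k y_k * ln(a_k): only the correct class contributes to the loss
    return -np.sum(y * np.log(np.clip(a, eps, 1.0)))

def l2_loss(t, p):
    # 2-norm of the coordinate fine-adjustment error (two points, four values)
    return np.linalg.norm(np.asarray(t) - np.asarray(p))

def total_loss(a_cls1, y_cls1, p_reg1, t_reg1,
               a_cls2, y_cls2, p_reg2, t_reg2):
    """a_* are softmax-activated classification outputs, y_* the labels,
    p_* / t_* the predicted and labeled two-point, four-value coordinates."""
    stage1 = 0.4 * cross_entropy(np.asarray(y_cls1), np.asarray(a_cls1)) \
           + 0.6 * l2_loss(t_reg1, p_reg1)      # four-class target classification
    stage2 = 0.4 * log_likelihood(np.asarray(y_cls2), np.asarray(a_cls2)) \
           + 0.6 * l2_loss(t_reg2, p_reg2)      # target / non-target classification
    return 0.65 * stage1 + 0.35 * stage2        # weighting from S1.3
```

The returned scalar is the quantity that training minimizes, in exactly the sense of term 9 of the glossary above.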
2. network architecture
1) First level network, as shown in fig. 2:
The first-layer input data is 47 × 47 × 1 (a grayscale image); the convolution kernel size is 3 × 3, the step size is 2, the output depth is 16, and the output result is the feature map (1) of 23 × 23 × 16.
The second-layer input data is the feature map (1) of 23 × 23 × 16; the convolution kernel size is 3 × 3, the step size is 2, the output depth is 16, and the output result is the feature map (2) of 11 × 11 × 16.
The third-layer input data is the feature map (2) of 11 × 11 × 16; the convolution kernel size is 3 × 3, the step size is 2, the output depth is 16, and the output result is the feature map (3) of 5 × 5 × 16.
The fourth-layer input data is the feature map (3) of 5 × 5 × 16; the convolution kernel size is 3 × 3, the step size is 1, the output depth is 16, and the output result is the feature map (6) of 3 × 3 × 16.
The fifth-layer input data is the feature map (3) of 5 × 5 × 16; the values at the two ends of the feature map in the width direction are removed to obtain the feature map (4) of 5 × 3 × 16.
The sixth-layer input data is the feature map (3) of 5 × 5 × 16; the values at the two ends of the feature map in the height direction are removed to obtain the feature map (5) of 3 × 5 × 16.
The seventh-layer input data is the feature map (4) of 5 × 3 × 16; the convolution kernel size is 3 × 1, the step size is 1, the output depth is 16, and the output result is the feature map (7) of 3 × 3 × 16.
The eighth-layer input data is the feature map (5) of 3 × 5 × 16; the convolution kernel size is 1 × 3, the step size is 1, the output depth is 16, and the output result is the feature map (8) of 3 × 3 × 16.
The ninth-layer input data is the feature map (6) of 3 × 3 × 16; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4.
The tenth-layer input data is the feature map (7) of 3 × 3 × 16; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4.
The eleventh-layer input data is the feature map (8) of 3 × 3 × 16; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4.
The input of the twelfth layer is obtained by combining the results of the ninth, tenth and eleventh layers into the feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12. The twelfth-layer input data are the feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12; the convolution kernel sizes are 1 × 1 and 1 × 1, the step size is 1, the output depths are 1 and 4, and the output results are the feature map (11) of 1 × 1 × 1 and the feature map (12) of 1 × 1 × 4. All convolutions use two-end non-alignment processing. The feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12 are used as the predicted values of the first-stage loss function of the first-level network, and the loss function value is calculated from these predicted values and the labeled true values. The feature map (11) of 1 × 1 × 1 and the feature map (12) of 1 × 1 × 4 are used as the predicted values of the second-stage loss function, and the loss function value is calculated from these predicted values and the labeled true values.
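For orientation only, the first-level structure just described (FIG. 2) can be sketched roughly as follows in PyTorch. The class and variable names, the NCHW dimension ordering, the choice of which spatial axis each crop removes, and the omission of excitation functions are all our assumptions, not the patent's; padding=0 corresponds to the two-end non-alignment processing.

```python
import torch
import torch.nn as nn

class FirstLevelNet(nn.Module):
    """Rough sketch of the twelve-layer first-level structure (depth 16 throughout);
    a hypothetical reconstruction, not reference code from the patent."""
    def __init__(self, ch=16):
        super().__init__()
        self.conv1 = nn.Conv2d(1, ch, 3, stride=2)    # 47x47 -> 23x23, feature map (1)
        self.conv2 = nn.Conv2d(ch, ch, 3, stride=2)   # 23x23 -> 11x11, feature map (2)
        self.conv3 = nn.Conv2d(ch, ch, 3, stride=2)   # 11x11 -> 5x5,   feature map (3)
        self.conv4 = nn.Conv2d(ch, ch, 3, stride=1)   # 5x5   -> 3x3,   feature map (6)
        self.conv7 = nn.Conv2d(ch, ch, (3, 1))        # feature map (4) 5x3 -> 3x3, feature map (7)
        self.conv8 = nn.Conv2d(ch, ch, (1, 3))        # feature map (5) 3x5 -> 3x3, feature map (8)
        # layers 9-11: each branch outputs 1 classification value and 4 coordinate values
        self.heads_cls = nn.ModuleList([nn.Conv2d(ch, 1, 3) for _ in range(3)])
        self.heads_reg = nn.ModuleList([nn.Conv2d(ch, 4, 3) for _ in range(3)])
        # layer 12: 1x1 convolutions over the concatenated branch outputs
        self.final_cls = nn.Conv2d(3, 1, 1)    # feature map (9)  1x1x3  -> feature map (11) 1x1x1
        self.final_reg = nn.Conv2d(12, 4, 1)   # feature map (10) 1x1x12 -> feature map (12) 1x1x4

    def forward(self, x):                      # x: N x 1 x 47 x 47 grayscale input
        f3 = self.conv3(self.conv2(self.conv1(x)))     # feature map (3): 5x5
        f6 = self.conv4(f3)                            # feature map (6): 3x3
        f4 = f3[:, :, :, 1:-1]                         # crop one value at each width end  -> 5x3
        f5 = f3[:, :, 1:-1, :]                         # crop one value at each height end -> 3x5
        f7, f8 = self.conv7(f4), self.conv8(f5)        # feature maps (7) and (8): 3x3
        branches = [f6, f7, f8]
        f9 = torch.cat([h(b) for h, b in zip(self.heads_cls, branches)], dim=1)   # 1x1x3
        f10 = torch.cat([h(b) for h, b in zip(self.heads_reg, branches)], dim=1)  # 1x1x12
        # feature maps (9)/(10) feed the first-stage loss; (11)/(12) feed the second stage
        return f9, f10, self.final_cls(f9), self.final_reg(f10)

outputs = FirstLevelNet()(torch.randn(1, 1, 47, 47))
print([tuple(t.shape) for t in outputs])  # [(1, 3, 1, 1), (1, 12, 1, 1), (1, 1, 1, 1), (1, 4, 1, 1)]
```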
2) Second level network, as shown in fig. 3:
the initial-layer input data is 49 × 49 × 1 (a grayscale image); the convolution kernel size is 3 × 3, the step size is 1, the output depth is 16, and the output result is the feature map (0) of 47 × 47 × 16;
the first-layer input data is the feature map (0) of 47 × 47 × 16; the convolution kernel size is 3 × 3, the step size is 2, the output depth is 32, and the output result is the feature map (1) of 23 × 23 × 32;
the second-layer input data is the feature map (1) of 23 × 23 × 32; the convolution kernel size is 3 × 3, the step size is 2, the output depth is 64, and the output result is the feature map (2) of 11 × 11 × 64;
the third-layer input data is the feature map (2) of 11 × 11 × 64; the convolution kernel size is 3 × 3, the step size is 2, the output depth is 64, and the output result is the feature map (3) of 5 × 5 × 64;
the fourth-layer input data is the feature map (3) of 5 × 5 × 64; the convolution kernel size is 3 × 3, the step size is 1, the output depth is 64, and the output result is the feature map (6) of 3 × 3 × 64;
the fifth-layer input data is the feature map (3) of 5 × 5 × 64; the values at the two ends of the feature map in the width direction are removed to obtain the feature map (4) of 5 × 3 × 64;
the sixth-layer input data is the feature map (3) of 5 × 5 × 64; the values at the two ends of the feature map in the height direction are removed to obtain the feature map (5) of 3 × 5 × 64;
the seventh-layer input data is the feature map (4) of 5 × 3 × 64; the convolution kernel size is 3 × 1, the step size is 1, the output depth is 64, and the output result is the feature map (7) of 3 × 3 × 64;
the eighth-layer input data is the feature map (5) of 3 × 5 × 64; the convolution kernel size is 1 × 3, the step size is 1, the output depth is 64, and the output result is the feature map (8) of 3 × 3 × 64;
the ninth-layer input data is the feature map (6) of 3 × 3 × 64; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4;
the tenth-layer input data is the feature map (7) of 3 × 3 × 64; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4;
the eleventh-layer input data is the feature map (8) of 3 × 3 × 64; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4; the input of the twelfth layer is obtained by combining the results of the ninth, tenth and eleventh layers into the feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12;
the twelfth-layer input data are the feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12; the convolution kernel sizes are 1 × 1 and 1 × 1, the step size is 1, the output depths are 1 and 4, and the output results are the feature map (11) of 1 × 1 × 1 and the feature map (12) of 1 × 1 × 4;
all convolutions use two-end non-alignment processing.
Wherein the feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12 are used as the predicted values of the first-stage loss function of the second-level network, and the loss function value is calculated from these predicted values and the labeled true values; the feature map (11) of 1 × 1 × 1 and the feature map (12) of 1 × 1 × 4 are used as the predicted values of the second-stage loss function, and the loss function value is calculated from these predicted values and the labeled true values.
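Under the same assumptions as the first-level sketch above, the second-level structure differs only in the 49 × 49 input, the added initial layer (feature map (0)) and the wider depths; a minimal, equally hypothetical variant is:

```python
class SecondLevelNet(FirstLevelNet):
    """Hypothetical second-level variant: 49x49 input, an extra stem convolution
    producing feature map (0), and depths widened to 32/64 as described above."""
    def __init__(self):
        super().__init__(ch=64)                       # branch and head depths of 64
        self.stem = nn.Conv2d(1, 16, 3, stride=1)     # 49x49 -> 47x47, feature map (0)
        self.conv1 = nn.Conv2d(16, 32, 3, stride=2)   # 47x47 -> 23x23, feature map (1)
        self.conv2 = nn.Conv2d(32, 64, 3, stride=2)   # 23x23 -> 11x11, feature map (2)

    def forward(self, x):                             # x: N x 1 x 49 x 49 grayscale input
        return super().forward(self.stem(x))

outputs = SecondLevelNet()(torch.randn(1, 1, 49, 49))
print([tuple(t.shape) for t in outputs])   # same four output shapes as the first-level network
```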
Because vehicle postures are diverse and the aspect ratio is arbitrary (the variation is large but falls within a certain range), a secondary loss function is adopted for training: the first-stage loss function performs four-class target classification and fine adjustment of the two-point, four-value coordinates, and the second-stage loss function performs binary classification (judging whether the candidate is a target) and fine adjustment of the two-point, four-value coordinates.
The method also relates to the preparation and training of the sample, which comprises the following steps:
1) Sample preparation and labeling: the minimum enclosing rectangle of the vehicle is taken as the labeling target, and all vehicles in each figure are labeled.
2) Labeling of training samples. Vehicle targets are divided into three classes according to aspect ratio: when the aspect ratio lies in the first specified range, the target is defined as a first-class vehicle target, the first-stage loss label is [1, 0, 0] and the second-stage loss label is 1; when the aspect ratio lies in the second specified range, the target is defined as a second-class vehicle target, the first-stage loss label is [0, 1, 0] and the second-stage loss label is 1; when the aspect ratio lies in the third specified range, the target is defined as a third-class vehicle target, the first-stage loss label is [0, 0, 1] and the second-stage loss label is 1. These three classes of vehicles are shown in FIGS. 4-6 and are all positive samples; negative samples contain no vehicle, their first-stage loss label is [0, 0, 0] and their second-stage loss label is 0. Three classes of vehicles plus one negative class give four classes in total.
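The aspect-ratio conditions that separate the three vehicle classes are given by the formulas referenced above and are not reproduced here. Purely to illustrate the labeling scheme (the function name and the thresholds t1 and t2 are hypothetical placeholders standing in for those formulas), the labels could be assigned as follows:

```python
def make_labels(width, height, is_vehicle, t1, t2):
    """Return (first-stage label, second-stage label) for one sample.
    t1 < t2 are placeholder aspect-ratio thresholds, not values from the patent."""
    if not is_vehicle:
        return [0, 0, 0], 0          # negative sample: no vehicle
    ratio = width / height           # aspect ratio of the labeled minimum enclosing rectangle
    if ratio < t1:
        return [1, 0, 0], 1          # first-class vehicle target
    elif ratio < t2:
        return [0, 1, 0], 1          # second-class vehicle target
    return [0, 0, 1], 1              # third-class vehicle target
```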
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes may be made to the embodiment of the present invention by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (8)
1. A method for designing a network structure in vehicle detection training, the method comprising the steps of:
s1, designing a loss function, and calculating the loss function:
s1.1, training by adopting a secondary loss function: the first-stage loss value is calculated through cross entropy, and the first-stage loss function performs four-class target classification and fine adjustment of the two-point, four-value coordinates; the second-stage loss function value is calculated through a log-likelihood function, and the second-stage loss function performs binary classification (target or not) and fine adjustment of the two-point, four-value coordinates;
s1.2, the loss value of the fine adjustment is calculated through a 2-norm;
s1.3, when calculating the loss function of the whole secondary network, the first-stage loss value is weighted 0.65, and the second-stage loss function value is weighted 0.35; within each stage, the classification loss value is weighted 0.4, and the coordinate fine-adjustment loss value is weighted 0.6;
s2, designing a network structure corresponding to the secondary loss function:
s2.1, a first-level network;
s2.2, second-level network.
3. The method as claimed in claim 1, wherein the log-likelihood calculation described in S1.1 is performed by the log-likelihood function C = -Σ_k y_k·ln a_k, where a_k represents the output value of the k-th neuron and y_k represents the true value corresponding to the k-th neuron, taking the value 0 or 1.
5. the method for designing the network structure in the vehicle detection training as claimed in claim 1, wherein the first-stage network of S2.1 is:
the first-layer input data is 47 × 47 × 1 (a grayscale image); the convolution kernel size is 3 × 3, the step size is 2, the output depth is 16, and the output result is the feature map (1) of 23 × 23 × 16;
the second-layer input data is the feature map (1) of 23 × 23 × 16; the convolution kernel size is 3 × 3, the step size is 2, the output depth is 16, and the output result is the feature map (2) of 11 × 11 × 16;
the third-layer input data is the feature map (2) of 11 × 11 × 16; the convolution kernel size is 3 × 3, the step size is 2, the output depth is 16, and the output result is the feature map (3) of 5 × 5 × 16;
the fourth-layer input data is the feature map (3) of 5 × 5 × 16; the convolution kernel size is 3 × 3, the step size is 1, the output depth is 16, and the output result is the feature map (6) of 3 × 3 × 16;
the fifth-layer input data is the feature map (3) of 5 × 5 × 16; the values at the two ends of the feature map in the width direction are removed to obtain the feature map (4) of 5 × 3 × 16;
the sixth-layer input data is the feature map (3) of 5 × 5 × 16; the values at the two ends of the feature map in the height direction are removed to obtain the feature map (5) of 3 × 5 × 16;
the seventh-layer input data is the feature map (4) of 5 × 3 × 16; the convolution kernel size is 3 × 1, the step size is 1, the output depth is 16, and the output result is the feature map (7) of 3 × 3 × 16;
the eighth-layer input data is the feature map (5) of 3 × 5 × 16; the convolution kernel size is 1 × 3, the step size is 1, the output depth is 16, and the output result is the feature map (8) of 3 × 3 × 16;
the ninth-layer input data is the feature map (6) of 3 × 3 × 16; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4;
the tenth-layer input data is the feature map (7) of 3 × 3 × 16; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4;
the eleventh-layer input data is the feature map (8) of 3 × 3 × 16; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4;
the input of the twelfth layer is obtained by combining the results of the ninth, tenth and eleventh layers into the feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12;
the twelfth-layer input data are the feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12; the convolution kernel sizes are 1 × 1 and 1 × 1, the step size is 1, the output depths are 1 and 4, and the output results are the feature map (11) of 1 × 1 × 1 and the feature map (12) of 1 × 1 × 4;
all convolutions use a two-end non-alignment process.
6. The method of claim 5, wherein the network structure is a network structure for vehicle detection training,
the feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12 are used as the predicted values of the first-stage loss function, and the loss function value is calculated from these predicted values and the labeled true values;
the feature map (11) of 1 × 1 × 1 and the feature map (12) of 1 × 1 × 4 are used as the predicted values of the second-stage loss function, and the loss function value is calculated from these predicted values and the labeled true values.
7. The method for designing the network structure in the vehicle detection training as claimed in claim 1, wherein the S2.2 second-level network is:
the initial-layer input data is 49 × 49 × 1 (a grayscale image); the convolution kernel size is 3 × 3, the step size is 1, the output depth is 16, and the output result is the feature map (0) of 47 × 47 × 16;
the first-layer input data is the feature map (0) of 47 × 47 × 16; the convolution kernel size is 3 × 3, the step size is 2, the output depth is 32, and the output result is the feature map (1) of 23 × 23 × 32;
the second-layer input data is the feature map (1) of 23 × 23 × 32; the convolution kernel size is 3 × 3, the step size is 2, the output depth is 64, and the output result is the feature map (2) of 11 × 11 × 64;
the third-layer input data is the feature map (2) of 11 × 11 × 64; the convolution kernel size is 3 × 3, the step size is 2, the output depth is 64, and the output result is the feature map (3) of 5 × 5 × 64;
the fourth-layer input data is the feature map (3) of 5 × 5 × 64; the convolution kernel size is 3 × 3, the step size is 1, the output depth is 64, and the output result is the feature map (6) of 3 × 3 × 64;
the fifth-layer input data is the feature map (3) of 5 × 5 × 64; the values at the two ends of the feature map in the width direction are removed to obtain the feature map (4) of 5 × 3 × 64;
the sixth-layer input data is the feature map (3) of 5 × 5 × 64; the values at the two ends of the feature map in the height direction are removed to obtain the feature map (5) of 3 × 5 × 64;
the seventh-layer input data is the feature map (4) of 5 × 3 × 64; the convolution kernel size is 3 × 1, the step size is 1, the output depth is 64, and the output result is the feature map (7) of 3 × 3 × 64;
the eighth-layer input data is the feature map (5) of 3 × 5 × 64; the convolution kernel size is 1 × 3, the step size is 1, the output depth is 64, and the output result is the feature map (8) of 3 × 3 × 64;
the ninth-layer input data is the feature map (6) of 3 × 3 × 64; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4;
the tenth-layer input data is the feature map (7) of 3 × 3 × 64; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4;
the eleventh-layer input data is the feature map (8) of 3 × 3 × 64; the convolution kernel size is 3 × 3, the step size is 1, the output depths are 1 and 4, and the output results are feature maps of 1 × 1 × 1 and 1 × 1 × 4;
the input of the twelfth layer is obtained by combining the results of the ninth, tenth and eleventh layers into the feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12;
the twelfth-layer input data are the feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12; the convolution kernel sizes are 1 × 1 and 1 × 1, the step size is 1, the output depths are 1 and 4, and the output results are the feature map (11) of 1 × 1 × 1 and the feature map (12) of 1 × 1 × 4;
all convolutions use a two-end non-alignment process.
8. The method of claim 7, wherein the network structure is a network structure for vehicle detection training,
the feature map (9) of 1 × 1 × 3 and the feature map (10) of 1 × 1 × 12 are used as the predicted values of the first-stage loss function, and the loss function value is calculated from these predicted values and the labeled true values;
the feature map (11) of 1 × 1 × 1 and the feature map (12) of 1 × 1 × 4 are used as the predicted values of the second-stage loss function, and the loss function value is calculated from these predicted values and the labeled true values.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010400154.5A CN113673667B (en) | 2020-05-13 | 2020-05-13 | Design method of network structure in vehicle detection training |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010400154.5A CN113673667B (en) | 2020-05-13 | 2020-05-13 | Design method of network structure in vehicle detection training |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN113673667A true CN113673667A (en) | 2021-11-19 |
| CN113673667B CN113673667B (en) | 2024-08-02 |
Family
ID=78536936
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010400154.5A Active CN113673667B (en) | 2020-05-13 | 2020-05-13 | Design method of network structure in vehicle detection training |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN113673667B (en) |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20190095780A1 (en) * | 2017-08-18 | 2019-03-28 | Beijing Sensetime Technology Development Co., Ltd | Method and apparatus for generating neural network structure, electronic device, and storage medium |
| CN109558803A (en) * | 2018-11-01 | 2019-04-02 | 西安电子科技大学 | SAR target discrimination method based on convolutional neural networks Yu NP criterion |
| CN109961107A (en) * | 2019-04-18 | 2019-07-02 | 北京迈格威科技有限公司 | Training method, device, electronic device and storage medium for target detection model |
| WO2019136591A1 (en) * | 2018-01-09 | 2019-07-18 | 深圳大学 | Salient object detection method and system for weak supervision-based spatio-temporal cascade neural network |
-
2020
- 2020-05-13 CN CN202010400154.5A patent/CN113673667B/en active Active
Patent Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20190095780A1 (en) * | 2017-08-18 | 2019-03-28 | Beijing Sensetime Technology Development Co., Ltd | Method and apparatus for generating neural network structure, electronic device, and storage medium |
| WO2019136591A1 (en) * | 2018-01-09 | 2019-07-18 | 深圳大学 | Salient object detection method and system for weak supervision-based spatio-temporal cascade neural network |
| CN109558803A (en) * | 2018-11-01 | 2019-04-02 | 西安电子科技大学 | SAR target discrimination method based on convolutional neural networks Yu NP criterion |
| CN109961107A (en) * | 2019-04-18 | 2019-07-02 | 北京迈格威科技有限公司 | Training method, device, electronic device and storage medium for target detection model |
Also Published As
| Publication number | Publication date |
|---|---|
| CN113673667B (en) | 2024-08-02 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN111898621B (en) | A Contour Shape Recognition Method | |
| CN108960127B (en) | Re-identification of occluded pedestrians based on adaptive deep metric learning | |
| US5048100A (en) | Self organizing neural network method and system for general classification of patterns | |
| CN111476219A (en) | Image object detection method in smart home environment | |
| CN112232151B (en) | Iterative polymerization neural network high-resolution remote sensing scene classification method embedded with attention mechanism | |
| CN111191583A (en) | Spatial target recognition system and method based on convolutional neural network | |
| CN112819039A (en) | Texture recognition model establishing method based on multi-scale integrated feature coding and application | |
| Ying | Analytical analysis and feedback linearization tracking control of the general Takagi-Sugeno fuzzy dynamic systems | |
| CN108564097A (en) | A kind of multiscale target detection method based on depth convolutional neural networks | |
| CN115496928A (en) | Multi-modal image feature matching method based on multi-feature matching | |
| CN109446894B (en) | A Multispectral Image Change Detection Method Based on Probabilistic Segmentation and Gaussian Mixture Clustering | |
| Lin et al. | Determination of the varieties of rice kernels based on machine vision and deep learning technology | |
| CN119888729A (en) | YOLOv11 improved cell instance segmentation method and system | |
| CN115761502A (en) | SAR Image Change Detection Method Based on Hybrid Convolution | |
| CN111325259A (en) | Remote sensing image classification method based on deep learning and binary coding | |
| CN111523342A (en) | Two-dimensional code detection and correction method in complex scene | |
| CN115565009A (en) | Electronic component classification method based on deep denoising sparse self-encoder and ISSVM | |
| Cho | Ensemble of structure-adaptive self-organizing maps for high performance classification | |
| Chhabra et al. | High-order statistically derived combinations of geometric features for handprinted character recognition | |
| CN113673667B (en) | Design method of network structure in vehicle detection training | |
| CN113673543B (en) | Method for two-stage calculation of loss function in vehicle detection training | |
| CN120182974A (en) | An AI-assisted image annotation method based on semi-supervised learning | |
| CN113673271B (en) | Double-layer labeling calculation method for secondary loss based on pet detection | |
| CN111160372A (en) | A large target recognition method based on high-speed convolutional neural network | |
| CN113673668B (en) | A method for calculating the secondary loss function in vehicle detection training |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |