Disclosure of Invention
Aiming at overcoming the defects of the existing methods, the invention provides a high-definition image small target detection method based on an autoencoder and the YOLO algorithm, so as to improve the detection precision of small targets in high-definition images while ensuring that the detection speed for high-definition images is not reduced.
In order to achieve the purpose, the technical scheme of the invention comprises the following steps:
(1) collecting high-definition image data to form a data set, labeling the data set to obtain correct label data, and dividing the data set and the label data into a training set and a test set according to a ratio of 8: 2;
(2) carrying out data expansion on the marked training set;
(3) for each piece of high-definition image data, generating target Mask data of a corresponding image according to the size of the image and the labeling information;
(4) building a full convolution self-encoder model comprising an encoding network and a decoding network, wherein the encoding network is used for carrying out feature extraction and data compression on a high-definition image, and the decoding network is used for restoring a compressed feature map to an original size;
(5) sending high-definition image training set data into a full convolution self-encoder model for training to obtain a trained full convolution self-encoder model:
(5a) initializing the offsets of the network to 0, initializing the weight parameters of the network with the Kaiming Gaussian initialization method, and setting the number of iterations T1 of the self-encoder according to the size of the high-definition image training set;
(5b) The partition-based mean square error loss function is defined as follows:
Mask-MSE-Loss(y, y_) = (1/(W×H)) × Σ_{i=1}^{W} Σ_{j=1}^{H} [α×Mask(i,j) + β×(1 − Mask(i,j))] × (y(i,j) − y_(i,j))²
wherein Mask-MSE-Loss(y, y_) is the loss function to be calculated, y is the output image of the decoder, y_ is the input original high-definition image, α is the loss penalty weight of the target region and is set to 0.9, β is the penalty weight of the background region and is set to 0.1, W is the width of the self-encoder input image, H is the height of the self-encoder input image, and Mask(i, j) is the value of the Mask at position (i, j);
(5c) inputting high-definition image training set data into a full convolution self-coding network, carrying out forward propagation to obtain a coded feature map, and recovering the feature map through a decoder;
(5d) calculating loss values of the input image and the output image by using the partition area-based mean square error loss function defined in the step (5 b);
(5e) updating the weight and the offset of the full convolution self-encoder by using a back propagation algorithm to finish one iteration of training the full convolution self-encoder;
(5f) repeating (5c)-(5e) until all T1 iterations of the self-encoder are completed, obtaining a trained full convolution self-encoder;
(6) splicing the coding network of the trained full-convolution self-encoder with a YOLO-V3 detection network, and training the spliced network:
(6a) splicing the coding network of the trained full-convolution self-encoder to the front of a YOLO-V3 detection network to form a spliced mixed network;
(6b) training the spliced hybrid network:
(6b1) reading parameters of the trained full-convolution self-encoder, initializing the coding network by using the read parameter values, and setting the parameters of the coding network in a non-trainable state;
(6b2) setting the input image size of the YOLO-V3 network to be the same as the input size of the full-convolution self-encoder network;
(6b3) downloading the parameters pre-trained on the ImageNet data set from the YOLO official website, initializing the parameters of the YOLO-V3 network with these parameters, and setting the number of iterations T2 of the YOLO-V3 network according to the size of the data set collected in step (1);
(6b4) Sending the high-definition image training set data into the spliced hybrid network for forward propagation to obtain an output detection result;
(6b5) calculating a loss value between the output detection result and the correct label data marked in (1) by using a loss function in a YOLO-V3 algorithm;
(6b6) updating the weight and the offset of the hybrid network by using a back propagation algorithm according to the loss value, and completing one iteration of training the hybrid network;
(6b7) repeating (6b4)-(6b6) until all T2 iterations of the YOLO-V3 network are completed, obtaining a trained hybrid network;
(7) inputting the test set data of step (1) into the trained hybrid network to obtain the final detection result.
Compared with the prior art, the invention has the following advantages:
The invention combines the coding network of the self-encoder with the YOLO-V3 detection network: the coding network compresses the high-definition image with little loss of the target-region features, and the YOLO-V3 detection network detects small targets in the compressed image. Because the coding network compresses only the background feature information and retains the target feature information, the detection precision of small targets in high-definition images is improved while the detection speed is ensured.
Detailed Description
The following describes the embodiments and effects of the present invention in further detail with reference to the accompanying drawings. In this embodiment, the method is used for detecting small targets at a sewage discharge outlet in high-definition images captured by an unmanned aerial vehicle.
Referring to fig. 1, the implementation steps of this example include the following:
step 1, collecting high-definition images to obtain a training set and a test set.
Acquiring high-definition image data aerial photographed by an unmanned aerial vehicle, wherein the image width is 1920 pixels, and the image height is 1080 pixels;
performing target annotation on the acquired image data by using a common image annotation tool LabelImg to obtain correct label data, as shown in FIG. 2;
the data set and label data were divided into training and test sets in an 8:2 ratio.
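As a minimal illustration of the 8:2 split, the following Python sketch assumes the annotated samples are available as parallel lists of image and label file paths (hypothetical names image_paths and label_paths); the original does not specify how the split is implemented.

```python
import random

# Hypothetical 8:2 split of the annotated data set into training and test sets.
# image_paths / label_paths are assumed lists of corresponding file paths.
def split_dataset(image_paths, label_paths, train_ratio=0.8, seed=0):
    samples = list(zip(image_paths, label_paths))
    random.Random(seed).shuffle(samples)          # shuffle before splitting
    n_train = int(train_ratio * len(samples))
    return samples[:n_train], samples[n_train:]   # (training set, test set)
```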
Step 2, performing data expansion on the labeled training set.
2.1) performing left-right flipping, rotation, translation, noise addition, brightness adjustment, contrast adjustment and saturation adjustment on each high-definition image in the unmanned aerial vehicle aerial photography training set;
2.2) adding the processed image data into the original training data set to obtain an expanded training data set.
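A sketch of the expansion operations in 2.1) is given below, using torchvision transforms as one possible realisation (an assumption; the original only names the operations, and the parameter ranges shown are illustrative). For detection data the same geometric transforms would also have to be applied to the bounding-box labels, which is omitted here.

```python
import torch
import torchvision.transforms as T

# One possible realisation of the operations listed in 2.1); values are illustrative.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                                  # left-right flipping
    T.RandomAffine(degrees=10, translate=(0.05, 0.05)),             # rotation and translation
    T.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),    # brightness/contrast/saturation
    T.ToTensor(),
    T.Lambda(lambda x: (x + 0.02 * torch.randn_like(x)).clamp(0, 1)),  # noise addition
])
```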
Step 3, generating the target Mask data image for each image.
3.1) setting the Mask data image as binary image data according to the size and labeling information of the high-definition image acquired by unmanned aerial vehicle aerial photography, the width and height of the Mask data image being the same as those of the acquired high-definition image, namely a width of 1920 pixels, a height of 1080 pixels, and 1 channel;
3.2) reading the position information of the pixel points in the original image, and setting the values of the pixel points corresponding to the Mask data through the position information:
if the pixel point is in the background area, the value of the Mask data corresponding to the pixel position is set as 0,
if the pixel point is in the target area, the value of the corresponding pixel position of the Mask data is set as 1,
the formula is expressed as follows:
Mask(i, j) = 1 if the pixel at (i, j) is in the target region, and Mask(i, j) = 0 if the pixel at (i, j) is in the background region,
wherein (i, j) refers to the pixel in the ith row and jth column of the unmanned aerial vehicle aerial image data, and Mask(i, j) is the value of the Mask image data at position (i, j).
The Mask map generated from the image of FIG. 2 according to 3.2) is shown in FIG. 3.
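A minimal sketch of the Mask generation in 3.2) is shown below; it assumes the LabelImg annotations have already been parsed into axis-aligned boxes (x_min, y_min, x_max, y_max) in pixel coordinates, which is an assumption about the label format rather than something stated above.

```python
import numpy as np

# Hypothetical Mask generation for one 1920x1080 aerial image: background pixels are 0,
# pixels inside any annotated target box are 1 (single-channel binary image).
def make_mask(boxes, height=1080, width=1920):
    mask = np.zeros((height, width), dtype=np.uint8)
    for x_min, y_min, x_max, y_max in boxes:
        mask[y_min:y_max, x_min:x_max] = 1     # Mask(i, j) = 1 inside the target region
    return mask
```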
Step 4, building the full convolution self-encoder model.
The full convolution self-encoder model comprises an encoding network and a decoding network, wherein the encoding network is used for carrying out feature extraction and data compression on a high-definition image, the decoding network is used for restoring a compressed feature map to an original size, and the building process comprises the following steps:
4.1) building a coding network:
The coding network comprises 5 convolutional layers connected in series, and the parameters of each convolutional layer are set as follows:
the first layer: the convolution kernel size is 3 × 3, the number is 16, the convolution step size is 1, the activation function is ReLU, and the output feature map size is 1664 × 1664 × 16;
the second layer: the convolution kernel size is 3 × 3, the number is 32, the convolution step size is 2, the activation function is ReLU, and the output feature map size is 832 × 832 × 32;
the third layer: the convolution kernel size is 3 × 3, the number is 64, the convolution step size is 1, the activation function is ReLU, and the output feature map size is 832 × 832 × 64;
the fourth layer: the convolution kernel size is 3 × 3, the number is 128, the convolution step size is 2, the activation function is ReLU, and the output feature map size is 416 × 416 × 128;
the fifth layer: the convolution kernel size is 1 × 1, the number is 3, the convolution step size is 1, the activation function is Sigmoid, and the output feature map size is 416 × 416 × 3;
4.2) building a decoding network:
The decoding network comprises 5 deconvolution layers connected in series, and the parameters of each deconvolution layer are set as follows:
layer 1: the convolution kernel size is 1 × 1, the number is 128, the convolution step size is 1, the activation function is ReLU, and the output feature map size is 416 × 416 × 128;
layer 2: the convolution kernel size is 3 × 3, the number is 64, the convolution step size is 2, the activation function is ReLU, and the output feature map size is 832 × 832 × 64;
layer 3: the convolution kernel size is 3 × 3, the number is 32, the convolution step size is 1, the activation function is ReLU, and the output feature map size is 832 × 832 × 32;
layer 4: the convolution kernel size is 3 × 3, the number is 16, the convolution step size is 2, the activation function is ReLU, and the output feature map size is 1664 × 1664 × 16;
layer 5: the convolution kernel size is 3 × 3, the number is 3, the convolution step size is 1, the activation function is Sigmoid, and the output feature map size is 1664 × 1664 × 3;
the convolution kernel size is described in the form w × h, meaning that the width of the convolution kernel is w and the height is h;
the feature map size is described in the form w × h × c, meaning that the width of the feature map is w pixels, the height is h pixels, and the number of channels is c;
the constructed full convolutional network is shown in fig. 4.
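For reference, the layer configuration of 4.1) and 4.2) can be expressed as the following PyTorch-style sketch; the framework choice and the padding/output_padding values are assumptions chosen so that the feature-map sizes match those listed above, since they are not given in the original.

```python
import torch
import torch.nn as nn

# Sketch of the full convolution self-encoder described in step 4.
class FullConvAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=1, padding=1), nn.ReLU(),    # 1664x1664x16
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),   # 832x832x32
            nn.Conv2d(32, 64, 3, stride=1, padding=1), nn.ReLU(),   # 832x832x64
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),  # 416x416x128
            nn.Conv2d(128, 3, 1, stride=1), nn.Sigmoid(),           # 416x416x3 compressed image
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(3, 128, 1, stride=1), nn.ReLU(),                               # 416x416x128
            nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1), nn.ReLU(), # 832x832x64
            nn.ConvTranspose2d(64, 32, 3, stride=1, padding=1), nn.ReLU(),                    # 832x832x32
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),  # 1664x1664x16
            nn.ConvTranspose2d(16, 3, 3, stride=1, padding=1), nn.Sigmoid(),                  # 1664x1664x3
        )

    def forward(self, x):
        z = self.encoder(x)      # feature extraction and data compression
        return self.decoder(z)   # restore to the original 1664x1664 size
```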
Step 5, training the built full convolution self-encoder model.
5.1) initializing network parameters:
initializing the offsets of the network to 0, and initializing the weight parameters of the network with the Kaiming Gaussian initialization method, so that the weights obey the following distribution:
W_l ~ N(0, 2 / ((1 + a²) × n_l))
wherein: W_l is the weight of the l-th layer; N is the Gaussian distribution, namely the normal distribution; a is the negative half-axis slope of the ReLU or LeakyReLU activation function; n_l is the data dimension of each layer, n_l = (convolution kernel side length)² × number of channels, where the number of channels is the number of input channels of each convolution layer;
the number of iterations T1 of the self-encoder is set to 8000 according to the size of the high-definition image training set;
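A sketch of this initialisation in PyTorch terms is shown below; the use of torch.nn.init is an assumption, since the original does not name a framework.

```python
import torch.nn as nn

# Offsets (biases) are set to 0; weights are drawn from the Kaiming Gaussian
# distribution N(0, 2 / ((1 + a^2) * n_l)) with a = 0 for ReLU layers.
def init_weights(module):
    if isinstance(module, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.kaiming_normal_(module.weight, a=0.0, nonlinearity='relu')
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# usage: model.apply(init_weights) on the autoencoder built in step 4
```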
5.2) up-sampling the training set image data so that its size is the same as the input size of the full convolution network, namely a width of 1664 pixels, a height of 1664 pixels, and 3 channels;
5.3) up-sampling the Mask data so that its width and height are the same as those of the full convolution network input, namely a width of 1664 pixels, a height of 1664 pixels, and 1 channel;
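The resizing in 5.2) and 5.3) can be sketched as follows; bilinear interpolation for the image and nearest-neighbour for the binary Mask are assumptions, since the original only states that both are up-sampled to 1664 × 1664.

```python
import torch
import torch.nn.functional as F

# Placeholder tensors standing in for one 1920x1080 training image and its Mask.
image = torch.rand(1, 3, 1080, 1920)                      # [N, C, H, W]
mask = torch.randint(0, 2, (1, 1, 1080, 1920)).float()    # single-channel binary Mask

image_up = F.interpolate(image, size=(1664, 1664), mode='bilinear', align_corners=False)
mask_up = F.interpolate(mask, size=(1664, 1664), mode='nearest')   # values stay in {0, 1}
```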
5.4) inputting the up-sampled image into a full convolution self-coding network, carrying out forward propagation to obtain a coded feature map, and then restoring the feature map through a decoder;
5.5) constructing a mean square error loss function based on the subareas according to the following formula:
the method comprises the steps of calculating a Loss function of a decoder, calculating a Loss function of a Mask-MSE-Loss (y, y), outputting an image by the decoder, inputting an original high-definition image by y, setting α as a Loss penalty weight of a target area to be 0.9, setting β as a background area penalty weight to be 0.1, setting W as a width of input data of the encoder to be 1664, setting H as a height of the data of the encoder to be 1664, and setting Mask (i, j) as a value of the position (i, j) of the data of the Mask image subjected to upsampling;
5.6) calculating the loss value of the input image and the output image by using the loss function of 5.5);
5.7) updating the weight and the offset of the full convolution self-encoder by using a back propagation algorithm to finish one iteration of training the full convolution self-encoder:
5.7.1) updating the weights by using the back propagation algorithm, with the following formula:
W_{t+1} = W_t − μ × ∂Loss/∂W_t
wherein: W_{t+1} is the updated weight; W_t is the weight before the update; μ is the learning rate of the back propagation algorithm, set here to 0.001; ∂Loss/∂W_t is the partial derivative of the loss function of 5.5) with respect to the weight W;
5.7.2) updating the offsets by using the back propagation algorithm, with the following formula:
b_{t+1} = b_t − μ × ∂Loss/∂b_t
wherein: b_{t+1} is the updated offset; b_t is the offset before the update; μ is the learning rate of the back propagation algorithm, also 0.001; ∂Loss/∂b_t is the partial derivative of the loss function of 5.5) with respect to the offset b;
5.8) repeating the steps from 5.2) to 5.7) until the iteration times of the full convolution self-encoder are completed, and obtaining the trained full convolution self-encoder.
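One training iteration of 5.4) to 5.7) can be sketched as follows, reusing the FullConvAutoencoder, init_weights, mask_mse_loss, image_up and mask_up sketches above; using torch.optim.SGD with learning rate 0.001 to realise the update formulas of 5.7) is an assumption.

```python
import torch

model = FullConvAutoencoder()
model.apply(init_weights)                                  # 5.1) parameter initialisation
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)  # mu = 0.001

y = model(image_up)                         # 5.4) forward propagation and reconstruction
loss = mask_mse_loss(y, image_up, mask_up)  # 5.5)-5.6) partition-based MSE loss
optimizer.zero_grad()
loss.backward()                             # back propagation of dLoss/dW and dLoss/db
optimizer.step()                            # 5.7) W <- W - mu*dLoss/dW, b <- b - mu*dLoss/db
```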
Step 6, splicing the coding network of the full convolution self-encoder with the YOLO-V3 detection network, and training the spliced hybrid network:
6.1) splicing the coding network of the trained full-convolution self-encoder to the front of a YOLO-V3 detection network to form a spliced mixed network, as shown in FIG. 5;
6.2) training the spliced hybrid network:
6.2.1) reading the parameters of the trained full-convolution self-encoder, initializing the encoding network by using the read parameter values, and setting the parameters of the encoding network in a non-trainable state;
6.2.2) set the input image size of the YOLO-V3 network to be the same as the input size of the full-convolution self-encoder network;
6.2.3) downloading the parameters pre-trained on the ImageNet data set from the YOLO official website, initializing the parameters of the YOLO-V3 network with these parameters, and setting the number of iterations of the YOLO-V3 network to 5000 according to the size of the data set collected in step 1;
6.2.4) sending the high-definition image training set data of the unmanned aerial vehicle aerial photography into the spliced hybrid network for forward propagation to obtain an output detection result;
6.2.5) calculating the loss value between the output detection result and the correct label data labeled in step 1 by using the loss function in the YOLO-V3 algorithm,
the loss function in the YOLO-V3 algorithm is expressed as follows:
Loss = λ_coord Σ_{i=0}^{K²} Σ_{j=0}^{M} 1_{ij}^{obj} [(x_i − x̂_i)² + (y_i − ŷ_i)²] + λ_coord Σ_{i=0}^{K²} Σ_{j=0}^{M} 1_{ij}^{obj} [(√w_i − √ŵ_i)² + (√h_i − √ĥ_i)²] + Σ_{i=0}^{K²} Σ_{j=0}^{M} 1_{ij}^{obj} (C_i − Ĉ_i)² + λ_noobj Σ_{i=0}^{K²} Σ_{j=0}^{M} 1_{ij}^{noobj} (C_i − Ĉ_i)² + Σ_{i=0}^{K²} 1_{i}^{obj} Σ_{c∈classes} (p_i(c) − p̂_i(c))²
wherein:
λ_coord is the penalty weight of the predicted coordinate loss, set to 5;
λ_noobj is the penalty weight of the confidence loss when no target is detected, set to 0.5;
K is the scale size of the output feature map;
M is the number of bounding boxes;
1_{ij}^{obj} indicates whether the jth bounding box of the ith cell in the output feature map contains a target: the value is 1 if it does, and 0 otherwise;
1_{ij}^{noobj} is the opposite: the value is 0 if a target is contained, and 1 otherwise;
1_{i}^{obj} indicates whether a target appears in the ith cell;
x_i is the abscissa of the predicted bounding box center in the ith cell of the feature map output by the YOLO-V3 network;
x̂_i is the abscissa of the actual bounding box center in the ith cell;
y_i is the ordinate of the predicted bounding box center in the ith cell of the feature map output by the YOLO-V3 network;
ŷ_i is the ordinate of the actual bounding box center in the ith cell;
w_i is the width of the predicted bounding box in the ith cell of the feature map output by the YOLO-V3 network;
ŵ_i is the width of the actual bounding box in the ith cell;
h_i is the height of the predicted bounding box in the ith cell of the feature map output by the YOLO-V3 network;
ĥ_i is the height of the actual bounding box in the ith cell;
C_i is the confidence predicted by the YOLO-V3 network for the ith cell;
Ĉ_i is the true confidence of the ith cell;
p_i(c) is the probability that the category of the ith cell in the feature map output by the YOLO-V3 network is c;
p̂_i(c) is the true probability that the category of the ith cell is c.
6.2.6) updating the weight and the offset of the hybrid network by using a back propagation algorithm according to the loss value calculated by 6.2.5), wherein the updating method of the weight and the offset is the same as the updating formula of 5.7), and one iteration of training the hybrid network is completed;
6.2.7) repeating 6.2.4) to 6.2.6) until all 5000 iterations of the YOLO-V3 network are completed, obtaining the trained hybrid network.
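A minimal sketch of the spliced hybrid network described in 6.1) to 6.2.2) is given below; yolo_v3 stands for any YOLO-V3 detection network whose input size has been set to the 416 × 416 encoder output size, and is not a specific library call.

```python
import torch
import torch.nn as nn

# The trained (and frozen) coding network is placed in front of the YOLO-V3 detector,
# so detection runs on the 416x416x3 compressed image instead of the full-size input.
class HybridNetwork(nn.Module):
    def __init__(self, encoder: nn.Module, yolo_v3: nn.Module):
        super().__init__()
        self.encoder = encoder
        self.detector = yolo_v3
        for p in self.encoder.parameters():   # 6.2.1) encoder parameters are non-trainable
            p.requires_grad = False

    def forward(self, x):
        with torch.no_grad():
            z = self.encoder(x)               # 1664x1664x3 -> 416x416x3 compressed image
        return self.detector(z)               # YOLO-V3 detection on the compressed image
```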
and 7, using the trained network to detect the target.
Inputting the test set data of step 1 into the trained hybrid network to obtain the final detection result and detect the small targets in the images; the result is shown in FIG. 6.
In FIG. 6 and FIG. 7, a region drawn with a frame and annotated with text indicates that a target was successfully detected in that region. From the result of the conventional method in FIG. 7, it can be seen that two obvious small dark-tube targets in the lower left corner and one relatively obvious small dark-tube target in the lower right corner are not detected. In the detection result of the invention in FIG. 6, by contrast, the targets in the lower left corner and the lower right corner are successfully detected, because the spatial characteristics of the targets are preserved during image compression. Compared with the prior art, the invention therefore has an obvious advantage in small target detection for high-definition images.