Image detection method based on improved FCOS network
Technical Field
The invention belongs to the technical field of artificial intelligence and image detection, and particularly relates to an image detection method based on an improved FCOS network.
Background
Object detection is an important technology in the field of computer vision, the main goal of which is to accurately identify and locate target objects of interest in an image or video. With the rapid development of artificial intelligence and deep learning, target detection technology has also made great progress. Before the rise of deep learning, object detection relied primarily on traditional computer vision methods: features were extracted from the image by feature engineering, and objects were classified and located with conventional machine learning algorithms (e.g., SVM, decision tree, etc.). However, due to the diversity and complexity of targets, conventional approaches often fail to handle complex scenes and large-scale data efficiently. With the rise of deep learning technology, in particular Convolutional Neural Networks (CNNs), target detection has been revolutionized. The excellent performance of CNNs enables the computer to automatically learn high-level features from the image, greatly improving the accuracy and efficiency of target detection. Among them, R-CNN (Region-based Convolutional Neural Networks), a two-stage network proposed by Ross Girshick et al. in 2014, was a milestone in applying the CNN method to the target detection problem and laid the foundation for subsequent development. With the development of the R-CNN series of algorithms such as Faster R-CNN and Mask R-CNN, this family has been widely applied to image detection and segmentation. These two-stage target detection algorithms have achieved considerable success in image detection, but they have higher computational complexity, slower detection speed and larger computational resource consumption, and therefore require higher hardware configurations.
Disclosure of Invention
The invention aims to solve the problems of two-stage network models, such as low image detection speed, difficult feature extraction and high computational resource consumption, and provides an image detection method based on an improved FCOS network. In this method, SCConv convolution is adopted: through internal communication of features it enlarges the convolution receptive field and thereby increases the diversity of the output features, and through the self-calibration operation it adaptively builds long-range spatial and inter-channel dependencies around each spatial position, helping the CNN generate feature representations with more discriminative power and richer information. In addition, under limited computing capability, computing resources are allocated to the more important tasks, which alleviates the problem of information overload and lets the model focus on the information most critical to the current task.
The technical scheme for realizing the aim of the invention is as follows:
an image detection method based on an improved FCOS network, comprising the steps of:
1) First, construct a data set for training and testing; the data set is an MRI-T2 image data set of the human lumbar intervertebral disc, and is divided into a train set, a val set and a test set at a ratio of 8:1:1;
2) Resize the input images in the data set to a fixed size of 768x768 pixels, and annotate the images in the data set in the COCO data format;
3) Perform data enhancement on all input images, including flipping and scaling, and preprocess the enhanced images with the top-hat transformation and gray-level stretching preprocessing techniques;
4) Adopt the general-purpose target detection platform MMDetection for detection; MMDetection is a deep-learning-based target detection framework with which a target detection network can be built quickly. First, the COCO dataset code needs to be modified: the 80 categories in the COCO dataset code are replaced by the 2 categories of the data set, normal and diseased, and the names of these categories are then added to the initialization file;
5) Stochastic gradient descent (SGD) is adopted to optimize the training process, with an initial learning rate of 0.005 and a momentum of 0.9;
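A minimal configuration sketch covering steps 4) and 5), assuming MMDetection's Python config style; the file paths and field layout below are illustrative assumptions, not the exact configuration files of this embodiment:

```python
# Illustrative MMDetection-style config fragment (assumed layout):
# the 80 COCO classes are replaced by the 2 classes of this data set,
# and SGD is used with the learning rate and momentum stated in step 5).
classes = ('normal', 'diseased')

data = dict(
    train=dict(classes=classes, ann_file='annotations/train.json', img_prefix='train/'),
    val=dict(classes=classes, ann_file='annotations/val.json', img_prefix='val/'),
    test=dict(classes=classes, ann_file='annotations/test.json', img_prefix='test/'))

model = dict(bbox_head=dict(num_classes=2))   # 2 classes instead of 80

optimizer = dict(type='SGD', lr=0.005, momentum=0.9)
```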
6) The image preprocessed in step 3) is used as the input of the network model;
7) The backbone performs a convolution operation on the input image with a convolution kernel size of 7x7 and a stride of 2, followed by max pooling with a kernel size of 3x3 and a stride of 2, to obtain the output result C1;
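A minimal PyTorch sketch of the stem described in step 7); the 3-to-64 channel mapping and the BatchNorm/ReLU layers are assumptions for illustration:

```python
import torch
import torch.nn as nn

# Stem of the backbone: 7x7 conv with stride 2, then 3x3 max pooling with stride 2.
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

x = torch.randn(1, 3, 768, 768)   # fixed input size from step 2)
c1 = stem(x)                      # C1: (1, 64, 192, 192)
```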
8) C1 is sent to a first self-calibration convolution module SCConv_1 to obtain an output result C2;
9) C2 is sent to a second self-calibration convolution module SCConv_2, and an output result C3 is obtained;
10) C3 is sent to a third self-calibration convolution module SCConv_3 to obtain an output result C4;
11) C4 is sent to a fourth self-calibration convolution module SCConv_4 to obtain an output result C5;
12) C3, C4 and C5 are each processed by an SE Attention module: global average pooling serves as the squeeze operation, and two FC layers form a bottleneck structure that models the correlation among channels and outputs one weight per input feature channel. The feature dimension is first reduced to 1/r of the input, then restored to the original dimension by a second FC layer after ReLU activation; compared with directly using one FC layer, this fits the complex correlation among channels better while greatly reducing the number of parameters and the amount of computation. Normalized weights between 0 and 1 are then obtained through a Sigmoid gate, and finally these weights are applied to the features of each channel through a Scale operation. The size and number of channels before and after the SE module are unchanged, and the outputs S3, S4 and S5 are obtained respectively;
13) S3, S4 and S5 are sent to an FPN module; the FPN generates P3, P4 and P5 from S3, S4 and S5 output by SE Attention respectively, P6 is obtained from P5 through a convolution layer with a kernel size of 3x3 and a stride of 2, and P7 is obtained from P6 through another convolution layer with a kernel size of 3x3 and a stride of 2;
14) Before detection and classification, a loss function needs to be set. The network has three output branches, namely classification, regression and center-ness, so the loss consists of three parts: the classification loss Lcls, the localization loss Lreg and the center-ness loss Lctrness, calculated as in the following formula:
L(\{p_{x,y}\},\{t_{x,y}\},\{s_{x,y}\}) = \frac{1}{N_{pos}}\sum_{x,y} L_{cls}(p_{x,y}, c^{*}_{x,y}) + \frac{\lambda}{N_{pos}}\sum_{x,y} \mathbb{1}_{\{c^{*}_{x,y}>0\}} L_{reg}(t_{x,y}, t^{*}_{x,y}) + \frac{1}{N_{pos}}\sum_{x,y} \mathbb{1}_{\{c^{*}_{x,y}>0\}} L_{ctrness}(s_{x,y}, s^{*}_{x,y})

where N_{pos} is the number of positive samples and \lambda is a balance weight; p_{x,y} represents the score for each category predicted at the feature-map point (x, y); c^{*}_{x,y} represents the true class label corresponding to the feature-map point (x, y); the indicator \mathbb{1}_{\{c^{*}_{x,y}>0\}} takes the value 1 when the feature-map point (x, y) is matched as a positive sample and 0 otherwise; t_{x,y} represents the target bounding-box information predicted at the feature-map point (x, y); t^{*}_{x,y} represents the real target bounding-box information corresponding to the feature-map point (x, y); s_{x,y} represents the center-ness predicted at the feature-map point (x, y); and s^{*}_{x,y} represents the true center-ness corresponding to the feature-map point (x, y);
15) The P3-P7 obtained in step 13) are sent to the detection head; P3-P7 share one detection head. The detection head has three sub-branches in total: Classification, Regression and Center-ness, where Regression and Center-ness are two small branches on the same branch. The Classification branch and the Regression/Center-ness branch each first pass through a combination module of 4 Conv2d+GN+ReLU blocks, and then through a convolution layer with a kernel size of 3x3 and a stride of 1 to obtain the final prediction result.
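A minimal sketch of the shared detection head in step 15), assuming 256 feature channels and Group Normalization with 32 groups (values not specified above); each branch uses 4 Conv2d+GN+ReLU blocks followed by a 3x3, stride-1 convolution:

```python
import torch.nn as nn

def conv_gn_relu_stack(channels=256, num_convs=4, num_groups=32):
    # 4 x (Conv2d + GN + ReLU) combination module described in step 15).
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(channels, channels, 3, stride=1, padding=1),
                   nn.GroupNorm(num_groups, channels),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class DetectionHead(nn.Module):
    """Head shared by P3-P7 (channel width and group count are assumptions)."""
    def __init__(self, channels=256, num_classes=2):
        super().__init__()
        self.cls_tower = conv_gn_relu_stack(channels)
        self.reg_tower = conv_gn_relu_stack(channels)
        self.cls_pred = nn.Conv2d(channels, num_classes, 3, stride=1, padding=1)
        self.reg_pred = nn.Conv2d(channels, 4, 3, stride=1, padding=1)
        self.ctr_pred = nn.Conv2d(channels, 1, 3, stride=1, padding=1)  # Center-ness

    def forward(self, feat):
        cls_feat = self.cls_tower(feat)
        reg_feat = self.reg_tower(feat)   # Regression and Center-ness share this tower
        return self.cls_pred(cls_feat), self.reg_pred(reg_feat), self.ctr_pred(reg_feat)
```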
The technical scheme is realized by an anchor-free FCOS network model, SCConv self-calibration convolution modules and an SE Attention mechanism module. The SCConv convolution enlarges the convolution receptive field through the internal communication of features, thereby further enhancing the diversity of the output features. SE Attention can better exploit the dynamic relationships between feature channels.
The technical scheme has the advantages or beneficial effects that:
The target detection method provided by this technical scheme is built on the recent high-performing FCOS network and uses SCConv convolution. Unlike standard convolution, which fuses spatial-domain and channel-dimension information with small kernels (e.g., 3x3), SCConv adaptively establishes long-range spatial and inter-channel dependencies around each spatial position through the self-calibration operation, so it helps the CNN generate feature representations with more discriminative power and richer information. The SE Attention mechanism module used in this technical scheme allocates computing resources to the more important tasks under limited computing capability, alleviates the problem of information overload, and lets the model focus on the information most critical to the current task.
Drawings
FIG. 1 is a network block diagram of an embodiment;
FIG. 2 is a block diagram of SCConv in an embodiment;
FIG. 3 is an SC module configuration in an embodiment;
FIG. 4 is a diagram of a module of an attention mechanism in an embodiment;
FIG. 5 is a diagram of a test head structure in an embodiment;
FIG. 6 is a flow diagram of network reasoning of an embodiment;
FIG. 7 is an original image of a spinal MRI in an embodiment;
Fig. 8 is a graph of spinal MRI test results of an embodiment.
Detailed Description
The present invention will now be further illustrated, but not limited, by the following figures and examples.
Examples:
referring to fig. 6, an image detection method based on an improved FCOS network includes the steps of:
1) First, construct a data set for training and testing; the data set is a non-public human lumbar intervertebral disc MRI-T2 image data set collected from the Internet, as shown in FIG. 7, containing 470 images in total, which are divided into a train set (376 images), a val set (47 images) and a test set (47 images) at a ratio of 8:1:1;
2) Resize the input images in the data set to a fixed size of 768x768 pixels, and annotate the images in the data set in the COCO data format;
3) Perform data enhancement on all input images, including flipping and scaling, and preprocess the enhanced images with the top-hat transformation and gray-level stretching preprocessing techniques;
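A minimal sketch of the preprocessing in step 3), assuming OpenCV; the structuring-element size, the stretch range and the choice of adding the top-hat result back to the image are illustrative assumptions:

```python
import cv2
import numpy as np

def preprocess(gray):
    """Top-hat transform plus gray-level stretching (parameter values are assumptions)."""
    # Top-hat: original minus its morphological opening, highlighting bright details.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (15, 15))
    tophat = cv2.morphologyEx(gray, cv2.MORPH_TOPHAT, kernel)
    enhanced = cv2.add(gray, tophat)   # one common way of applying the top-hat result

    # Linear gray-level stretching to the full 0-255 range.
    lo, hi = enhanced.min(), enhanced.max()
    stretched = (enhanced.astype(np.float32) - lo) / max(float(hi - lo), 1.0) * 255.0
    return stretched.astype(np.uint8)

img = cv2.imread('lumbar_mri.png', cv2.IMREAD_GRAYSCALE)  # hypothetical file name
img = cv2.resize(img, (768, 768))                          # fixed size from step 2)
out = preprocess(img)
```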
4) Adopt the target detection platform MMDetection for detection: first modify the COCO dataset code, replacing the 80 categories in it with the 2 categories of the data set, normal and diseased, and then add the names of these categories to the initialization file;
5) Optimize the training process with stochastic gradient descent (SGD), with an initial learning rate of 0.005 and a momentum of 0.9;
6) The image preprocessed in step 3) is used as the input of the network model; the structure of the model is shown in FIG. 1;
7) The backbone performs a convolution operation on the input image with a convolution kernel size of 7x7 and a stride of 2, followed by max pooling with a kernel size of 3x3 and a stride of 2, to obtain the output result C1;
8) C1 is fed into the first self-calibration convolution module SCConv_1 to obtain the output result C2; the SCConv module is shown in FIG. 2, where FIG. 2(a) is the original structure and FIG. 2(b) is the modified structure of this embodiment, and the internal structure of the SC module is shown in FIG. 3;
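The exact wiring of the modified SCConv in FIG. 2(b) and FIG. 3 is only fully defined by the drawings; the sketch below follows the original self-calibrated convolution design (channel split, a down-sample/up-sample calibration path gated by a Sigmoid, and a plain convolution path) as an assumed reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SCConv(nn.Module):
    """Self-calibrated convolution sketch (original SCConv design; the modified
    structure of FIG. 2(b) may differ). All hyper-parameters are assumptions."""
    def __init__(self, channels, pooling_r=4):
        super().__init__()
        half = channels // 2
        self.k1 = nn.Sequential(nn.Conv2d(half, half, 3, padding=1, bias=False),
                                nn.BatchNorm2d(half))      # plain path
        self.k2 = nn.Sequential(nn.AvgPool2d(pooling_r, pooling_r),
                                nn.Conv2d(half, half, 3, padding=1, bias=False),
                                nn.BatchNorm2d(half))      # down-sampled calibration path
        self.k3 = nn.Sequential(nn.Conv2d(half, half, 3, padding=1, bias=False),
                                nn.BatchNorm2d(half))
        self.k4 = nn.Sequential(nn.Conv2d(half, half, 3, padding=1, bias=False),
                                nn.BatchNorm2d(half))

    def forward(self, x):
        x1, x2 = torch.chunk(x, 2, dim=1)
        # Self-calibration: gate K3(x1) with a Sigmoid of x1 plus the up-sampled K2 branch.
        gate = torch.sigmoid(x1 + F.interpolate(self.k2(x1), x1.shape[2:]))
        y1 = self.k4(self.k3(x1) * gate)
        y2 = self.k1(x2)
        return torch.cat([y1, y2], dim=1)

y = SCConv(64)(torch.randn(1, 64, 96, 96))   # shape preserved: (1, 64, 96, 96)
```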
9) C2 is sent to a second self-calibration convolution module SCConv _2, and an output result C3 is obtained;
10) C3 is sent to a third self-calibration convolution module SCConv_3 to obtain an output result C4;
11) C4 is sent to a fourth self-calibration convolution module SCConv_4 to obtain an output result C5;
12) The structure of the SE Attention module of this embodiment is shown in FIG. 4. Global average pooling is used as the squeeze operation, and then two FC layers form a bottleneck structure to model the correlation among channels and output one weight per input feature channel. The feature dimension is first reduced to 1/r of the input, and after ReLU activation it is restored to the original dimension by a second FC layer; the advantage of this design is that it has more nonlinearity than directly using one FC layer, so it can better fit the complex correlation among channels while greatly reducing the number of parameters and the amount of computation. Normalized weights between 0 and 1 are then obtained through a Sigmoid gate, and finally these weights are applied to the features of each channel through a Scale operation. The size and number of channels before and after the module are unchanged, and the outputs S3, S4 and S5 are obtained respectively after processing by the SE Attention modules;
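A minimal PyTorch sketch of the SE Attention module described in step 12); the reduction ratio r = 16 and the 2048-channel example are assumed values:

```python
import torch
import torch.nn as nn

class SEAttention(nn.Module):
    """Squeeze-and-Excitation block as described in step 12) (r = 16 assumed)."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)           # global average pooling
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // r),          # reduce to 1/r of the input
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),          # restore the original dimension
            nn.Sigmoid())                                # normalized weights in (0, 1)

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                     # Scale: size and channels unchanged

s5 = SEAttention(2048)(torch.randn(1, 2048, 24, 24))     # e.g. C5 -> S5 (channels assumed)
```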
13) S3, S4 and S5 are sent to an FPN module; the FPN generates P3, P4 and P5 from S3, S4 and S5 output by SE Attention respectively, P6 is obtained from P5 through a convolution layer with a kernel size of 3x3 and a stride of 2, and P7 is obtained from P6 through another convolution layer with a kernel size of 3x3 and a stride of 2;
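A minimal sketch of the extra pyramid levels in step 13): P6 is produced from P5, and P7 from P6, each by a 3x3 convolution with stride 2; the 256-channel width is an assumed value:

```python
import torch
import torch.nn as nn

# Extra pyramid levels of step 13). The 256-channel width is an assumption.
p6_conv = nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1)
p7_conv = nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1)

p5 = torch.randn(1, 256, 24, 24)   # e.g. P5 for a 768x768 input (stride 32)
p6 = p6_conv(p5)                   # (1, 256, 12, 12)
p7 = p7_conv(p6)                   # (1, 256, 6, 6)
```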
14) Before detection and classification, a loss function needs to be set. The network has three output branches, namely classification, regression and center-ness, so the loss consists of three parts: the classification loss Lcls, the localization loss Lreg and the center-ness loss Lctrness, calculated as in the following formula:
L(\{p_{x,y}\},\{t_{x,y}\},\{s_{x,y}\}) = \frac{1}{N_{pos}}\sum_{x,y} L_{cls}(p_{x,y}, c^{*}_{x,y}) + \frac{\lambda}{N_{pos}}\sum_{x,y} \mathbb{1}_{\{c^{*}_{x,y}>0\}} L_{reg}(t_{x,y}, t^{*}_{x,y}) + \frac{1}{N_{pos}}\sum_{x,y} \mathbb{1}_{\{c^{*}_{x,y}>0\}} L_{ctrness}(s_{x,y}, s^{*}_{x,y})

where N_{pos} is the number of positive samples and \lambda is a balance weight; p_{x,y} represents the score for each category predicted at the feature-map point (x, y); c^{*}_{x,y} represents the true class label corresponding to the feature-map point (x, y); the indicator \mathbb{1}_{\{c^{*}_{x,y}>0\}} takes the value 1 when the feature-map point (x, y) is matched as a positive sample and 0 otherwise; t_{x,y} represents the target bounding-box information predicted at the feature-map point (x, y); t^{*}_{x,y} represents the real target bounding-box information corresponding to the feature-map point (x, y); s_{x,y} represents the center-ness predicted at the feature-map point (x, y); and s^{*}_{x,y} represents the true center-ness corresponding to the feature-map point (x, y);
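For reference, in the original FCOS formulation the true center-ness at a positive location is computed from the distances l*, t*, r*, b* of that location to the four sides of its ground-truth box; it is assumed here that this embodiment follows the same definition:

s^{*}_{x,y} = \sqrt{\frac{\min(l^{*},\,r^{*})}{\max(l^{*},\,r^{*})} \times \frac{\min(t^{*},\,b^{*})}{\max(t^{*},\,b^{*})}}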
15) The P3-P7 obtained in step 13) are sent to the detection head; P3-P7 share one detection head, whose structure is shown in FIG. 5. The detection head has three sub-branches in total: Classification, Regression and Center-ness, where Regression and Center-ness are two small branches on the same branch. The Classification branch and the Regression/Center-ness branch each first pass through a combination module of 4 Conv2d+GN+ReLU blocks, and then through a convolution layer with a kernel size of 3x3 and a stride of 1 to obtain the final prediction result, as shown in FIG. 8;
16) The test set is tested with the trained network model, and the test results of the original FCOS network are compared with those of the present method. It can be seen that the method of this embodiment achieves a significantly higher detection accuracy than the original FCOS network.