Disclosure of Invention
The invention aims to provide a method and a system for detecting a salient object based on boundary enhancement, so as to overcome the defects of the prior art.
In order to achieve the purpose, the invention adopts the following technical scheme:
A salient object detection method based on boundary enhancement comprises the following steps:
S1, extracting abstract feature maps with different resolutions from a training set image, performing multi-level fusion on them to obtain multi-level fusion feature maps, and processing the multi-level fusion feature maps to obtain feature maps containing multi-scale information;
S2, performing information conversion on the obtained feature maps containing multi-scale information and concatenating and fusing the results to obtain features containing boundary information, meanwhile obtaining a boundary detection result from the converted features of each level, and further concatenating and fusing these results to obtain a fused boundary detection result;
S3, extracting multi-scale information from the feature maps containing multi-scale information, and concatenating them with the features containing boundary information to obtain a salient object detection result;
S4, training the salient object detection model with the salient object detection result, the boundary detection result of each level, the fused boundary detection result and the corresponding training set until a convergence condition is met, and performing object detection with the trained salient object detection model.
Further, in step S1, the convolutional neural network uses ResNet-50 pretrained on ImageNet as the network backbone and removes the final pooling layer and fully connected layer, obtaining five feature maps of different sizes.
Further, the network structure built on the ResNet-50 backbone pretrained on ImageNet comprises a multi-level feature aggregation module, a multi-scale information extraction module and a boundary information extraction module.
Further, feature sizes are made consistent by upsampling or pooling, information is supplemented by element-wise addition, and the supplemented features are aggregated to obtain five multi-level aggregation feature maps of different sizes, which improves the expressive power of the features.
Further, dilated (atrous) convolutions with different dilation rates are used for sampling; the different receptive fields capture information at different scales, improving the network model's ability to detect objects whose scale changes.
Further, during each step of the progressive feature fusion, multi-scale information is extracted repeatedly from features of different sizes, and the multi-scale information is further fused.
Further, with the progressively fused features as input, boundary detection is performed on each level of features; this extracts the boundary information of the object and further refines the network's salient object detection result.
Further, the training process uses a loss function and adjusts the parameters as the loss propagates backwards.
Further, the loss function includes a salient object detection loss and a salient boundary detection loss, where the salient object detection loss guides the correct classification of salient object pixels and the salient boundary detection loss guides the correct classification of pixels in the salient object's boundary region.
Further, the salient object detection loss includes a binary cross-entropy loss (BCE) over individual pixels and a consistency enhancement loss (CEL) over the entire image, which makes the detection result more uniformly highlighted.
Compared with the prior art, the invention has the following beneficial technical effects:
The invention relates to a salient object detection method based on boundary enhancement. Abstract feature maps with different resolutions are extracted from a training set image, fused across levels to obtain multi-level fusion feature maps, and processed to obtain feature maps containing multi-scale information. Taking visual saliency image data as input, a convolutional neural network predicts the salient object region, addressing the problems of scale change and blurred boundary-region pixels in the salient object detection task: feature information of different resolutions complements each other, further enhancing the expressive power of single-resolution features; multi-scale feature extraction draws information of different scales out of fixed-resolution features, better handling changes in object scale; and a mixed loss function supervises model training from different levels so that the salient object region is highlighted more uniformly and brightly.
Further, multi-level feature aggregation lets features of different scales aggregate with one another, enhancing the expressive power of fixed-scale features.
Further, multi-scale information extraction draws multi-scale information out of fixed-scale features, strengthening the network model's ability to detect scenes with large changes in object size.
Further, the extracted boundary information of the salient object supplements the salient object features, further improving the quality of the model's predictions.
Further, the salient object detection loss is composed of a binary cross-entropy loss and a consistency enhancement loss; the consistency enhancement loss in particular supervises at the whole-image level, which on one hand makes the loss function emphasize the foreground more and on the other hand keeps the loss from being disturbed by scale changes, improving the salient object detection effect.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention is described in further detail below with reference to the accompanying drawings:
As shown in fig. 1, a salient object detection method based on boundary enhancement includes the following steps:
S1, extracting abstract feature maps with different resolutions from a training set image, performing multi-level fusion on them to obtain multi-level fusion feature maps, and processing the multi-level fusion feature maps to obtain feature maps containing multi-scale information;
S2, performing information conversion on the obtained feature maps containing multi-scale information and concatenating and fusing the results to obtain features containing boundary information, meanwhile obtaining a boundary detection result from the converted features of each level, and further concatenating and fusing these results to obtain a fused boundary detection result;
S3, extracting multi-scale information from the feature maps containing multi-scale information, and concatenating them with the features containing boundary information to obtain a salient object detection result;
S4, training the salient object detection model with the salient object detection result, the boundary detection result of each level, the fused boundary detection result and the corresponding training set until a convergence condition is met, and performing object detection with the trained salient object detection model.
the application adopts public data as a data set, and the data set is divided into a training set and a testing set.
The network structure is designed as shown in fig. 2. In the feature extraction stage, the convolutional neural network adopts ResNet-50 pretrained on ImageNet as the network backbone, with the final pooling layer and fully connected layer removed. The input image is fed into the backbone network, and after five groups of convolution operations, five abstract feature maps F_1–F_5 of different levels are extracted. Their sizes are 1/2, 1/4, 1/8, 1/16 and 1/32 of the input, with 64, 256, 512, 1024 and 2048 channels respectively; from F_1 to F_5, low-level detail information steadily decreases while high-level semantic information gradually increases.
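For illustration, a minimal PyTorch-style sketch of this feature extraction stage follows; it is an assumed rendering (module and variable names here are invented), not the claimed implementation.

import torch
import torch.nn as nn
from torchvision.models import resnet50

class Backbone(nn.Module):
    """ResNet-50 trunk; the final pooling and fully connected layers are simply unused."""
    def __init__(self):
        super().__init__()
        net = resnet50(weights="IMAGENET1K_V1")  # pretrained on ImageNet
        self.stage1 = nn.Sequential(net.conv1, net.bn1, net.relu)  # 1/2, 64 ch -> F1
        self.pool = net.maxpool
        self.stage2 = net.layer1   # 1/4,  256 channels  -> F2
        self.stage3 = net.layer2   # 1/8,  512 channels  -> F3
        self.stage4 = net.layer3   # 1/16, 1024 channels -> F4
        self.stage5 = net.layer4   # 1/32, 2048 channels -> F5

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(self.pool(f1))
        f3 = self.stage3(f2)
        f4 = self.stage4(f3)
        f5 = self.stage5(f4)
        return f1, f2, f3, f4, f5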
The multi-level feature aggregation module, shown in fig. 3, is divided into a complementary stage and an aggregation stage. From F_{i-1} to F_{i+1}, the feature resolution gradually decreases and the number of channels gradually increases. In the complementary stage S_1, the three input features are first preprocessed by a 1×1 convolution to keep their channel numbers consistent, which on one hand reduces the amount of computation and on the other hand facilitates the subsequent element-wise fusion. The input feature F_i is then pooled and upsampled to supplement F_{i-1} and F_{i+1} respectively, while F_{i-1} and F_{i+1} are pooled and upsampled respectively to supplement F_i; the pooling and upsampling operations give the mutually supplementing features the same resolution. The feature complementation process is expressed as follows:

F'_j = ReLU(Conv(F_j))
L_1 = F'_i ⊕ AvgPool(F'_{i-1}),  L_2 = F'_i,  L_3 = F'_i ⊕ Up(F'_{i+1})

where F'_j denotes the feature F_j after its channel dimension is reduced; L_1, L_2 and L_3 denote the supplemented i-th level features of the complementary stage S_1; Conv(·) denotes the convolution responsible for changing the channel dimension; ReLU denotes the ReLU nonlinear activation function; Up(·) denotes the upsampling operation; AvgPool(·) denotes the average pooling operation; and ⊕ denotes element-wise addition. It is worth noting that the top-most and bottom-most features each have only one neighbor, so in the complementary stage they have only the two channels L_2 + L_3 and L_1 + L_2, respectively.
The second stage is the feature aggregation stage S_2, in which the complementary features from the different channels are aggregated into a feature containing multi-level information that is output to the decoder as a lateral feature, according to the following formula:

MF_i = Conv_{3×3}(L_1 ⊕ L_2 ⊕ L_3)

where the features are fused by element-wise addition ⊕. As in the complementary stage, MF_1 and MF_5 aggregate the features of only one neighbor. In addition, every element-wise fusion operation in the above two stages is followed by a group of 3×3 convolutions with a combination of normalization and the ReLU nonlinear activation to further abstract the features.
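As an illustrative sketch only, the module above can be condensed into the following PyTorch-style code, which folds the complementary stage S_1 and aggregation stage S_2 into one pass under the reconstruction given in the formulas; the channel width mid is an assumption.

import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(c_in, c_out, k):
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, padding=k // 2),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class MultiLevelAggregation(nn.Module):
    def __init__(self, in_channels=(64, 256, 512, 1024, 2048), mid=64):
        super().__init__()
        # 1x1 convolutions equalize the channel numbers of the five features.
        self.reduce = nn.ModuleList(conv_bn_relu(c, mid, 1) for c in in_channels)
        self.fuse = nn.ModuleList(conv_bn_relu(mid, mid, 3) for _ in in_channels)

    def forward(self, feats):
        f = [r(x) for r, x in zip(self.reduce, feats)]   # F'_1 ... F'_5
        out = []
        for i in range(len(f)):
            agg = f[i]                                   # channel L_2: the feature itself
            if i > 0:                                    # channel L_1: pooled higher-resolution neighbor
                agg = agg + F.adaptive_avg_pool2d(f[i - 1], f[i].shape[-2:])
            if i < len(f) - 1:                           # channel L_3: upsampled lower-resolution neighbor
                agg = agg + F.interpolate(f[i + 1], size=f[i].shape[-2:],
                                          mode="bilinear", align_corners=False)
            out.append(self.fuse[i](agg))                # 3x3 conv + normalization + ReLU after fusion
        return out                                       # MF_1 ... MF_5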
Reverse step-by-step feature fusion starts from the top-level feature MF_5: multi-scale information extraction is performed on MF_5, the feature containing multi-scale information is upsampled to twice its original size, and a 1×1 convolution reduces the channel dimension to match that of MF_4, so that the two features can be added element-wise; after the element-wise addition, a 3×3 convolution further fuses them. In this way the multi-level aggregation features M_1, M_2, M_3, M_4 and M_5 of different scales are finally fused step by step, yielding the four progressive fusion features h_1, h_2, h_3 and h_4.
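One reverse fusion step of this decoder can be sketched as follows (an illustrative assumption; FusionStep and its argument names are invented here):

import torch.nn as nn
import torch.nn.functional as F

class FusionStep(nn.Module):
    def __init__(self, deep_ch, lateral_ch):
        super().__init__()
        self.align = nn.Conv2d(deep_ch, lateral_ch, 1)        # 1x1 conv matches channel dims
        self.fuse = nn.Conv2d(lateral_ch, lateral_ch, 3, padding=1)

    def forward(self, deep, lateral):
        # Upsample the deeper feature by 2x, add element-wise, then fuse with a 3x3 conv.
        up = F.interpolate(deep, scale_factor=2, mode="bilinear", align_corners=False)
        return self.fuse(self.align(up) + lateral)            # progressive fusion feature h_i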
As shown in fig. 4, given an input feature h, the forward pass first samples h with dilated convolutions of dilation rates 2, 4 and 8 to obtain features sh_1, sh_2 and sh_3 containing information at different scales. Next, a residual operation adds the original feature and the sampled features element-wise, and a convolution operation with an activation function then further aggregates the features and improves the nonlinear capability, finally giving the feature M containing multi-scale information, as shown in the following formula:

M = ReLU(Conv_{3×3}(h ⊕ sh_1 ⊕ sh_2 ⊕ sh_3))

where Conv_{3×3}(·) denotes a 3×3 convolution operation and ⊕ denotes element-wise addition of the feature layers.
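A minimal sketch of this multi-scale information extraction, assuming the reconstruction above (three parallel dilated 3×3 convolutions, a residual element-wise sum, then a 3×3 convolution with ReLU):

import torch.nn as nn

class MultiScaleExtraction(nn.Module):
    def __init__(self, ch):
        super().__init__()
        # Dilated (atrous) convolutions with dilation rates 2, 4 and 8; padding preserves size.
        self.branches = nn.ModuleList(
            nn.Conv2d(ch, ch, 3, padding=r, dilation=r) for r in (2, 4, 8))
        self.fuse = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, h):
        sh = sum(b(h) for b in self.branches)   # sh_1 + sh_2 + sh_3
        return self.fuse(h + sh)                # residual element-wise sum -> feature M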
The boundary information extraction module is shown in fig. 5. The four progressive fusion features h_4, h_3, h_2 and h_1 generated in the decoder serve as input. First, an information conversion module extracts the features eh_4, eh_3, eh_2 and eh_1 containing boundary information; this module consists of a group of 1×1, 3×3 and 1×1 convolutions with a residual connection. Then each feature containing boundary information passes through a 1×1 convolution to reduce its channel number and is upsampled, giving the salient boundary prediction results e_4, e_3, e_2 and e_1. To pass the extracted salient object boundary information to the salient object prediction branch and make up for lost detail, the extracted multi-level boundary features eh_4, eh_3, eh_2 and eh_1 are upsampled, concatenated along the channel dimension and fed into a boundary feature aggregation module to obtain the final feature EF containing boundary information; the boundary aggregation module consists of four 3×3 convolutions with residual operations. The final boundary feature is:

EF = EdgeInfo(Concat(Up(eh_1), Up(eh_2), Up(eh_3), Up(eh_4)))

where EdgeInfo(·) denotes the boundary feature aggregation module, and EF denotes the salient boundary feature aggregating multi-level information, used in the next step for fusion with the salient object features.
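The boundary branch can be sketched in the same illustrative spirit; the channel sizes, the placement of residuals inside the aggregation block, and all names below are assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class InfoConversion(nn.Module):
    # 1x1 -> 3x3 -> 1x1 convolution group with a residual connection.
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 1))

    def forward(self, h):
        return F.relu(h + self.body(h))          # feature eh_i containing boundary information

class BoundaryBranch(nn.Module):
    def __init__(self, ch, levels=4):
        super().__init__()
        self.convert = nn.ModuleList(InfoConversion(ch) for _ in range(levels))
        self.heads = nn.ModuleList(nn.Conv2d(ch, 1, 1) for _ in range(levels))
        layers = []                              # boundary feature aggregation (EdgeInfo)
        for i in range(4):
            layers += [nn.Conv2d(ch * levels if i == 0 else ch, ch, 3, padding=1),
                       nn.ReLU(inplace=True)]
        self.edge_info = nn.Sequential(*layers)

    def forward(self, hs, out_size):
        ehs = [c(h) for c, h in zip(self.convert, hs)]
        # Per-level boundary predictions e_i, upsampled to the output size.
        es = [F.interpolate(hd(eh), size=out_size, mode="bilinear", align_corners=False)
              for hd, eh in zip(self.heads, ehs)]
        ups = [F.interpolate(eh, size=out_size, mode="bilinear", align_corners=False)
               for eh in ehs]
        ef = self.edge_info(torch.cat(ups, dim=1))   # aggregated boundary feature EF
        return ef, es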
During training, the network parameters are optimized with a back-propagation strategy assisted by loss functions. The losses used in training fall into two types according to task: the salient boundary detection loss and the salient object detection loss. The total loss function for training is expressed as follows:
Loss = L_sod + λ_1 · L_edge
where λ_1 is a hyperparameter balancing the losses of the two tasks; its value was set to 10 in the experiments.
Because boundary pixels are highly sparse and the numbers of boundary and non-boundary pixels are highly imbalanced, a balanced binary cross-entropy loss is used to supervise the salient boundary learning process and overcome the pixel imbalance. The balanced binary cross-entropy loss is expressed as follows:

L_edge = -Σ_p [ β · g_p · log(e_p) + (1 - β) · (1 - g_p) · log(1 - e_p) ]

where β is the proportion of non-boundary pixels among all pixels, e_p is the predicted boundary probability at pixel p, and g_p is the corresponding boundary ground-truth value.
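A sketch of this balanced loss, assuming sigmoid outputs e in (0, 1) and a binary boundary ground truth g:

import torch

def balanced_bce(e, g, eps=1e-6):
    beta = (1.0 - g).mean()                  # proportion of non-boundary pixels
    loss = -(beta * g * torch.log(e + eps)
             + (1.0 - beta) * (1.0 - g) * torch.log(1.0 - e + eps))
    return loss.mean()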
The salient object detection loss function combines two loss functions with different emphases: a binary cross-entropy loss over individual pixels and a consistency enhancement loss over the foreground region. The total loss is expressed as follows:
L_sod = L_bce + L_cel
the binary cross entropy penalty is the most used penalty function in the salient object detection task, which is a penalty at the pixel level that converges on all pixels, and the formula shows:
in the formula: p represents a saliency target prediction map; p represents a pixel point in P; g represents a true value map; g represents a pixel point in G; log (-) pixel level logarithm operation.
The consistency enhancement loss is an image-level loss that on one hand makes the loss function focus more on the foreground and on the other hand is free from interference by scale variation. The consistency enhancement loss function is formulated as follows:

L_cel = (|FP| + |FN|) / (|TP| + |FP| + |FN|),  with |TP| = Σ p·g,  |FP| = Σ p·(1 - g),  |FN| = Σ (1 - p)·g

where P denotes the saliency prediction map, p denotes a pixel in P, and g denotes the corresponding pixel in the ground-truth map G.
The invention relates to a salient object detection method based on boundary enhancement that addresses, on saliency data sets, the problems of large changes in object size in visual scenes, blurred prediction of the salient object's boundary region, and non-uniform pixels within the region.
A multi-level feature aggregation module is inserted into the network's transmission layers; by aggregating feature information of different resolutions from adjacent layers, it enhances the expressive power of fixed-resolution features.
A multi-scale information extraction module is inserted at each level of the network decoder; by extracting multi-scale information from each level's features, it strengthens the network in scenes with large changes in object size.
Based on the progressively fused features, a boundary extraction module detects the boundary of the salient object; fusing the boundary features with the salient object features enhances the network model's detection in boundary regions.
A mixed loss function is used for the salient object detection task, supervising at both the pixel level and the image level, promoting gradient back-propagation, strengthening model convergence and further improving the model training effect.
The application achieves competitive Fmax and MAE results on four public saliency detection data sets, outperforming several popular salient object detection methods.
Examples
A salient object detection method based on boundary enhancement comprises the following steps:
S1, four public saliency data sets are used as the experimental data sets. The specific working process is as follows:
(1.1) The training-set portion of the largest data set is adopted as the model's training set, while the test set of the largest data set and the other three data sets are used as test sets;
(1.2) Before being input to network training, the images are randomly flipped horizontally to expand the data, as sketched below.
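A minimal sketch of this flip-based data expansion (torchvision assumed; the image and its saliency and boundary ground-truth maps must be flipped together):

import random
import torchvision.transforms.functional as TF

def random_hflip(image, mask, edge, p=0.5):
    if random.random() < p:
        image, mask, edge = TF.hflip(image), TF.hflip(mask), TF.hflip(edge)
    return image, mask, edge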
S2, a feature extraction network extracts abstract feature maps with different resolutions and channel numbers. The specific working process is as follows:
(2.1) The final pooling layer and fully connected layer of the ResNet-50 network are removed, and only the remaining network structure is retained;
(2.2) The data processed in step (1.2) are input to the ResNet-50 feature extraction network with dimensions (N, C, H, W), yielding five groups of abstract feature maps with different resolutions and channel numbers.
S3, a multi-level feature aggregation module enhances the expressive power of the features extracted by the encoder, as shown in the figure. The specific working process is as follows:
(3.1) The abstract feature maps extracted in step (2.2) are convolved to change their channel numbers so that all features have the same number of channels;
(3.2) The features obtained in step (3.1) supplement one another: between adjacent layers, the lower-resolution feature is upsampled and fused element-wise into the higher-resolution feature, and the higher-resolution feature is pooled and fused element-wise into the lower-resolution feature, so that features of different resolutions supplement each other;
(3.3) The mutually supplemented features from step (3.2) are aggregated: for each level's feature, any higher-level (lower-resolution) feature is upsampled and added element-wise, and any lower-level (higher-resolution) feature is pooled and added element-wise, so each level's feature from step (2.2) yields a corresponding aggregated feature, which enhances the expressive power of the fixed-resolution features.
S4, a multi-scale information extraction module extracts multi-scale information from the fixed features, enhancing the network's ability to detect objects of different scales. The specific working process is as follows:
(4.1) The multi-level aggregation features extracted in step (3.3) are fed, from top to bottom, into three parallel branches sampled by dilated convolutions with dilation rates of 2, 4 and 8, and a residual operation adds the original features and the sampled features element-wise;
(4.2) The features containing multi-scale information extracted in step (4.1) are upsampled, added element-wise to the corresponding lateral features from step (3.3), and passed through a 3×3 convolution to obtain the progressively fused features.
S5, a boundary information extraction module extracts salient boundary information from the progressively fused features, further supplementing the salient object information and enhancing the network's detection effect. The specific working process is as follows:
(5.1) The features obtained in step (4.2) undergo information conversion to give feature maps containing boundary information;
(5.2) The features from step (5.1) pass through a 1×1 convolution that reduces the channel dimension to 1 and are upsampled to give the boundary outputs; the multiple boundary outputs are fused to give the fused boundary output;
(5.3) The boundary features from step (5.1) are upsampled, concatenated along the channel dimension, and further fused, with their channel number changed, by the boundary aggregation module to give the boundary features.
S6, the boundary information and the object information are fused to obtain the final saliency prediction. The specific working process is as follows:
(6.1) Multi-scale information extraction is performed once more on the last feature from step (4.2); since this feature is at the bottom level, no element-wise addition with a lateral feature is needed;
(6.2) The boundary features from step (5.3) and the features from step (6.1) are concatenated along the channel dimension, further fused by convolution, and passed through channel transformation and upsampling to give the final saliency prediction map.
S7, the salient object detection model is trained with the obtained boundary detection results, the object detection result and the corresponding training set images. During training, the binary cross-entropy loss, the consistency enhancement loss and the balanced binary cross-entropy loss together promote gradient back-propagation, strengthen model convergence and further improve the training effect.
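One training step can be sketched as follows, reusing the loss sketches above; the assumed model interface (saliency map, fused boundary output, per-level boundary outputs) follows steps S5 and S6:

def train_step(model, optimizer, image, gt_mask, gt_edge, lambda_1=10.0):
    optimizer.zero_grad()
    saliency, edge_fused, edges = model(image)
    l_edge = balanced_bce(edge_fused, gt_edge)
    l_edge = l_edge + sum(balanced_bce(e, gt_edge) for e in edges)
    loss = sod_loss(saliency, gt_mask) + lambda_1 * l_edge   # Loss = L_sod + lambda_1 * L_edge
    loss.backward()                                          # back-propagate the loss
    optimizer.step()                                         # adjust the parameters
    return loss.item()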
S8, with the trained salient object detection model, a test image is used as input to obtain the salient object detection result, as shown in fig. 6. The specific working process is as follows:
(8.1) With the salient object detection model of step S7, the test sets of step (1.1) serve as input to obtain the saliency detection results;
(8.2) Comparing the detection results of the model in step (8.1) with the actual salient object ground-truth maps shows that the model achieves an excellent detection effect, performing very well on the Fmax, MAE and Em metrics across the four data sets, as shown in the figure.
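Inference with the trained model can be sketched as follows, under the same assumed interface:

import torch

@torch.no_grad()
def predict(model, image):
    model.eval()
    saliency, _, _ = model(image.unsqueeze(0))   # add a batch dimension
    return saliency.squeeze(0)                   # saliency map in (0, 1)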