CN114821059A

CN114821059A - Salient object detection method and system based on boundary enhancement

Info

Publication number: CN114821059A
Application number: CN202210467623.4A
Authority: CN
Inventors: 田智强; 余凌昊; 陈张
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2022-04-29
Filing date: 2022-04-29
Publication date: 2022-07-29
Anticipated expiration: 2042-04-29
Also published as: CN114821059B

Abstract

The invention discloses a saliency target detection method and system based on boundary enhancement, which takes visual saliency image data as input, predicts the salient target area through a convolutional neural network, and solves the problem of scale change and variation in the salient target detection task. For the problem of blurred pixels in the boundary area, by using the feature information of different resolutions to complement each other, the expression ability of single resolution features is further enhanced; using multi-scale feature extraction, extracting information of different scales from fixed resolution features, better Solve the problem of target scale change; use boundary extraction to model saliency boundaries, and further supplement salient target feature information after extracting boundary information to solve the problem of unclear boundary pixels to a certain extent, and obtain the final salient target prediction. Hybrid loss function to supervise model training at different levels to highlight saliency target regions more uniformly.

Description

Salient object detection method and system based on boundary enhancement

Technical Field

The invention belongs to the technical field of image processing, and particularly relates to a method and a system for detecting a salient object based on boundary enhancement.

Background

Significant object detection has been studied for over twenty years. There are a total of three very important nodes. Significance target detection the first development surge originated from an article by Itti in 1998, which acts to mimic the human attention process, building a significance map from the bottom up using underlying features; the second node is a binary segmentation problem that the concept that the saliency detection is integrated into the target is changed into the saliency target, and the second node is more suitable for practical application; the third wave heat tide of the detection of the salient object is aroused by the open world of the convolutional neural network, and the convolutional neural network has great feature extraction capability and can obtain a larger receptive field, so that the salient region in the image can be better detected, and the method is also a mainstream method at present. At present, the significance detection field has produced great theoretical and application value, but another value thereof is as an aid to many other visual tasks, for example, as a preprocessing of tasks such as object recognition, image editing and semantic segmentation.

The scale change is one of the major challenges for the SOD task. CNN is difficult to handle this problem, limited by the downsampling operation. Different levels of feature layers have the ability to handle only a particular scale, with different resolutions of features having different amounts of target information embedded therein. One way is to output features of each layer laterally in a top-down path, upsample to a uniform resolution, and then fuse to obtain an output containing multi-scale information, but this approach uses only a single resolution feature in each layer, which is not sufficient to cope with various scale problems. Yet another simple strategy is to integrate feature layer information of different resolutions, but this way of fusion is prone to information redundancy and noise interference. The processing mode aiming at the scale change problem still has a space for further optimization.

Detail information is continuously lost in the feature extraction process, and the boundary area of the saliency target obtained by the pixel-level saliency method is often unsatisfactory. In order to obtain a fine and significant boundary, there are many innovative methods in addition to the multi-scale feature fusion method. Some methods use a recursive method to refine high-level features by using low-level local information, and some methods use superpixels to perform preprocessing before saliency detection to extract boundaries or CRF to perform post-processing on a saliency prediction map to maintain object boundaries, and such methods need additional processing procedures and are relatively inefficient. For the selection of the loss function, a training loss function commonly used for salient object detection is binary cross entropy loss, but the confidence of the binary cross entropy loss is low when boundary pixels are judged, so that the boundary is very fuzzy, and meanwhile, the consistency of a salient region cannot be ensured. There are many possibilities for improvement in the design of the network structure and the loss function of the boundary information extraction.

Disclosure of Invention

The invention aims to provide a method and a system for detecting a salient object based on boundary enhancement, so as to overcome the defects of the prior art.

In order to achieve the purpose, the invention adopts the following technical scheme:

a salient object detection method based on boundary enhancement comprises the following steps:

s1, extracting abstract feature maps with different resolutions from the training set image, performing multi-level fusion on the abstract feature maps to obtain multi-level fusion feature maps, and processing the multi-level fusion feature maps to obtain feature maps containing multi-scale information;

s2, performing information conversion on the obtained feature map containing the multi-scale information, splicing and fusing to obtain features containing boundary information, simultaneously obtaining a boundary detection result by using the features after each stage of conversion, and further splicing and fusing to obtain a fused boundary detection result;

s3, extracting multi-scale information of the feature graph containing the multi-scale information, and splicing the feature graph containing the multi-scale information with the features containing the boundary information to obtain a significant target detection result;

s4, training the saliency target detection model by using the saliency target detection result, the boundary detection result of each level, the fused boundary detection result and the corresponding training set until the convergence condition is met, and performing target detection by using the trained saliency target detection model;

further, in step S1, the convolutional neural network uses ResNet-50 trained on ImageNet as a backbone of the network, and removes the final pooling layer and the full-link layer to obtain five feature maps with different sizes.

Further, a network structure formed by using ResNet-50 trained on ImageNet as a backbone of the network comprises a multi-stage feature aggregation module, a multi-scale information extraction module and a boundary information extraction module.

Furthermore, the sizes are kept consistent through upsampling or pooling, information supplementation is carried out through element addition, characteristics after information supplementation are aggregated, five multilevel aggregation characteristic graphs with different sizes are obtained, and the expression capacity of the characteristics can be improved.

Further, the cavity convolutions with different cavity rates are used for sampling, information with different scales is obtained through different receptive fields, and the detection capability of the network model on the scale change target is improved.

Furthermore, in each step-by-step feature fusion process, multi-scale information extraction is carried out on features of different sizes for multiple times, and multi-scale information is further fused.

Furthermore, the step-by-step fusion features are used as input, boundary detection is carried out on each level of features, boundary information of a target can be extracted, and a significant target detection result of the network is further refined;

further, the training process uses a loss function and makes parameter adjustments as the loss propagates backwards.

Further, the loss function includes a significant target detection loss and a significant boundary detection loss, wherein the significant target detection loss is used for guiding the correct classification of the significant target pixel points, and the significant boundary detection loss is used for guiding the correct classification of the significant target boundary region pixel points.

Further, the loss of saliency target detection includes cross entropy loss BCE for individual pixels and consistency enhancement loss CEL for the entire image, which can make the detection result more uniformly highlighted.

Compared with the prior art, the invention has the following beneficial technical effects:

the invention relates to a salient object detection method based on boundary enhancement, which comprises the steps of extracting abstract feature maps with different resolutions from a training set image, carrying out multi-level fusion on the abstract feature maps to obtain multi-level fusion feature maps, and processing the multi-level fusion feature maps to obtain feature maps containing multi-scale information; the method comprises the steps that visual saliency image data are used as input, a saliency target area is predicted through a convolutional neural network, the problems of scale change and boundary area pixel blurring in a saliency target detection task are solved, feature information of different resolutions is used for mutual complementation, and the expression capacity of a single resolution feature is further enhanced; by using multi-scale feature extraction, information of different scales is extracted from the fixed resolution features, so that the problem of target scale change is solved better; and a mixed loss function is used for supervising model training from different layers so as to highlight the salient object area more uniformly and brightly.

Furthermore, multi-stage feature aggregation is adopted, so that features of different scales are mutually aggregated, and the expression capability of the features of fixed scales is enhanced.

Furthermore, multi-scale information is extracted from the features of fixed scale through multi-scale information extraction, and the detection capability of the network model on scenes with large target size changes is enhanced.

Furthermore, after the boundary information of the salient object is extracted, the salient object is supplemented, and the quality of model prediction is further improved.

Furthermore, the loss of the significant target detection is composed of binary cross entropy loss and consistency enhancement loss, and especially the consistency enhancement loss is supervised from the whole image layer, so that on one hand, a loss function can be more emphasized on the foreground, and on the other hand, the loss can be prevented from being interfered by scale change, and the significant target detection effect is improved.

Drawings

Fig. 1 is a flowchart of an implementation of a method for detecting a salient object with enhanced boundary in an embodiment of the present invention.

Fig. 2 is a network structure diagram of a salient object detection model with enhanced boundary in an embodiment of the present invention.

Fig. 3 is an internal structural diagram of a multi-stage feature aggregation module in an embodiment of the present invention.

Fig. 4 is an internal structure diagram of a multi-scale information extraction module in the embodiment of the present invention.

Fig. 5 is an internal structure diagram of the boundary information extraction module in the embodiment of the present invention.

Fig. 6 is a detection effect diagram of the salient object detection model with enhanced boundary in the embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention is described in further detail below with reference to the accompanying drawings:

as shown in fig. 1, a salient object detection method based on boundary enhancement includes the following steps;

s1, abstract feature maps with different resolutions are extracted from the training set image, multi-level fusion is carried out on the abstract feature maps to obtain multi-level fusion feature maps, and the multi-level fusion feature maps are processed to obtain feature maps containing multi-scale information

the application adopts public data as a data set, and the data set is divided into a training set and a testing set.

The network structure design is as shown in fig. 2, and in the feature extraction stage: the convolutional neural network adopts ResNet-50 trained on ImageNet as a network backbone, removes the last pooling layer and the full connection layer, inputs an input image into the backbone network, and extracts five abstract feature maps F1-F5 with different levels after five groups of convolution operations, wherein the abstract feature maps are 1/2, 1/4, 1/8, 1/16 and 1/32 with the sizes of input, the number of channels is 64, 256, 512, 1024 and 2048, low-level detail information from F1 to F5 is continuously reduced, and high-level semantic information is gradually increased.

The multistage feature aggregation module is shown in FIG. 3 and is divided into two stages, complementary and aggregated, from F _i-1 To F _i+1 The feature separation rate is gradually reduced, and the number of channels is gradually increased. In a complementary phase S ₁ In the method, input three features are firstly preprocessed through a 1 × 1 convolution to keep the number of channels consistent, so that on one hand, the calculated amount can be reduced, and on the other hand, the subsequent element fusion is facilitated; then the input feature F _i Pooling and upsampling separately for F _i-1 And F _i+1 Is supplemented by the information of F _i-1 And F _i+1 Pooling and upsampling pairs F, respectively _i Performing information supplementation, pooling and upsampling operations so that the mutually supplemented features have the same resolution, and the feature complementation process is expressed as follows:

in the formula: f' _j Represents F _j Features after reducing channel dimensions;

represents the complementary phase S ₁ Supplemented ith-level features; conv (·) represents that the convolution is responsible for changing the channel dimensions; ReLU stands for ReLU nonlinear activation function; up (-) represents the upsampling operation; AvgPool (·) stands for mean pooling operation. It is worth noting that the top-most and bottom features have only one neighbor, so that in the complementary phase there is only L each ₂ +L ₃ And L ₁ +L ₂ Two channels.

The second stage is a feature aggregation stage S ₂ In the stage, the complementary features from different channels are aggregated to obtain a feature containing multi-level information and the feature is output to a decoder as a transverse feature, and the specific formula is as follows:

in the formula: the features are fused in an element-by-element addition mode. Like the complementary stages, MF ₁ And MF ₅ Feature of only one neighbor aggregated. In addition, the above two stages of all element fusion operations are followed by a set of 3 × 3 convolutions, and a combination of regularization and nonlinear variation ReLU is used to further abstract features.

The reverse step-by-step feature fusion is performed on top MF5, multi-scale information extraction is performed on MF5, the feature containing the multi-scale information is up-sampled to 2 times of the original size, a 1 × 1 convolution operation is performed to reduce the channel dimension and keep the feature MF4 consistent, and element-by-element addition can be performed on the two features; in addition, after the addition of the characteristic picture element, a convolution operation of 3 multiplied by 3 is added for further fusion; in this way, the multi-stage polymerization features M1, M2, M3, M4 and M5 with different scales are finally fused step by step, and four step-by-step fusion features h1, h2, h3 and h4 are respectively obtained.

As shown in fig. 4, specifically, given an input feature h, in the forward process, the feature h is first extracted by the void convolution with the void ratios of 2, 4, and 8 to obtain a feature sh containing information of different scales ₁ 、sh ₂ And sh ₃ . And secondly, performing element addition on the original features and the sampled features by using a residual operation, and then further aggregating the features and improving the nonlinear capability by using a convolution operation and an activation function to finally obtain the features M containing multi-scale information. As shown in the following formula:

in the formula: conv _3×3 (. smallcircle.) denotes a 3 × 3 convolution operation;

is the fusion of the feature layer element by element addition.

Boundary information extraction Module As shown in FIG. 5, 4 progressive fusion features h are generated in the decoder ₄ 、h ₃ 、h ₂ 、h ₁ As input, firstly, the feature eh containing boundary information is extracted through an information conversion module ₄ 、eh ₃ 、eh ₂ 、eh ₁ The information conversion module is composed of two convolution groups of 1 × 1, 3 × 3 and 1 × 1, and the convolution group comprises a residual connecting operation; then, the features containing the boundary information are subjected to 1 multiplied by 1 convolution to reduce the number of channels and then are subjected to up-sampling to obtain a significant boundary prediction result e ₄ 、e ₃ 、e ₂ 、e ₁ . In order to transmit the extracted boundary information of the salient object to the salient object prediction branch to make up for the loss of details, the extracted multistage boundary characteristics eh ₄ 、eh ₃ 、eh ₂ 、eh ₁ And after upsampling, splicing along a channel, inputting the upsampled feature into a boundary feature aggregation module to obtain a final feature EF containing boundary information, wherein the boundary aggregation module consists of four 3 multiplied by 3 convolutions containing residual error operation. The final boundary characteristics are shown below:

EF＝EdgeInfo(Concat(Up(eh ₁ ),Up(eh ₁ ),Up(eh ₃ ),Up(eh ₄ )))

in the formula: EdgeInfo (·) represents the boundary feature aggregation module; and EF represents the significance boundary features of the aggregated multilevel information, and can be used for fusing with significance target features in the next step.

In the training process, parameters of the network are optimized by using a back propagation strategy, a loss function is used for assisting training, and the loss functions used in the training process are divided into two types according to different tasks: the detection loss of the significant boundary and the detection loss of the significant target, and the total loss function formula in the training process is expressed as follows:

Loss＝L _sod +λ ₁ L _edge

in the formula: lambda [ alpha ] ₁ Is a hyperparameter to balance the loss of both tasks, and its value was set to 10 in the experiment.

Due to the high sparsity of boundary pixels, the high imbalance of the number of boundary pixels and the number of non-boundary pixels, the problem of pixel imbalance can be solved by using balanced binary entropy loss to supervise the significant boundary learning process. The balanced binary entropy loss formula is expressed as follows:

in the formula: beta is the proportion of the number of non-boundary pixels to the number of all pixels.

The significant target detection loss function is formed by combining two loss functions with different emphasis points, and comprises binary cross entropy loss aiming at a single pixel point and consistency enhancement loss aiming at a foreground area, wherein the total loss is expressed as follows:

L _sod ＝L _bce +L _cel

the binary cross entropy penalty is the most used penalty function in the salient object detection task, which is a penalty at the pixel level that converges on all pixels, and the formula shows:

in the formula: p represents a saliency target prediction map; p represents a pixel point in P; g represents a true value map; g represents a pixel point in G; log (-) pixel level logarithm operation.

The consistency enhancement penalty is an image-level penalty that can make the penalty function more focused on the foreground on the one hand and free from scale-variation on the other hand. The consistency enhancement loss function is formulated as follows:

in the formula: p represents a saliency target prediction map; p represents a pixel point in the prediction map.

The invention relates to a salient target detection method based on boundary enhancement, which can solve the problems of large size change of a target in a visual scene, fuzzy prediction of a boundary region of the salient target and non-uniform pixels in the region aiming at a salient data set.

A multi-feature aggregation module is inserted into a network transmission layer, and the expression capability of the fixed resolution features is enhanced by aggregating feature information of different respective rates of adjacent layers.

A multi-scale information extraction module is inserted into each level of a network decoder, and the capability of the network facing a scene with large target size change is enhanced by extracting the multi-scale information of each level of features.

And based on the step-by-step fusion characteristics, detecting the boundary of the salient object by using a boundary extraction module, fusing the boundary characteristics and the salient object characteristics, and enhancing the detection effect of the network model in the boundary area.

A mixed loss function is used for a significant target detection task, supervision is performed from two layers of a pixel level and an image level, return of a gradient is promoted, model convergence is enhanced, and a model training effect is further improved;

the application achieves competitive Fmax and MAE results on four sets of published significant detection data sets, and the performance is superior to that of several popular significant target detection methods.

Examples

s1, four sets of public significance data sets were used as the experimental data sets. The specific working process is as follows:

(1.1) adopting the training set part of the maximum data set as the training set of the model, and using the test set of the maximum data set and other three data sets as test sets;

(1.2) randomly turning the image data set horizontally before inputting the image data set into the network training to realize data expansion.

And S2, extracting abstract feature maps with different resolutions and different channel numbers by using a feature extraction network. The specific working process is as follows:

(2.1), removing the final pooling layer and the full-connection layer of the ResNet50 network, and only reserving the rest network structures;

and (2.2) inputting the data processed in the step (1.2) into a ResNet50 feature extraction network according to the dimension of (N, C, H, W) to obtain five groups of abstract feature maps with different resolutions and channel numbers.

S3, using the multi-stage feature aggregation module to enhance the expressive power of the features extracted by the encoder, as shown in the figure. The specific working process is as follows:

(3.1) carrying out convolution processing on the abstract feature diagram extracted in the step (2.2) to change the number of channels, and keeping the number of channels of all features consistent;

(3.2) supplementing the features obtained in the step (3.1), specifically, performing element fusion on the feature with low resolution after upsampling on the feature with high resolution between adjacent layers, performing element fusion on the feature with high resolution after pooling on the feature with low resolution, and supplementing the features with different resolutions;

and (3.3) aggregating the mutually complemented features obtained in the step (3.2), specifically, for each level of features, if higher level features exist, upsampling the higher level features and adding elements of the upsampled features, if lower level features exist, pooling the lower level features and adding elements of the pooled lower level features, and obtaining a corresponding aggregated feature for each level of features in the step (2.2), so that the expression capability of the fixed resolution features can be enhanced.

And S4, extracting multi-scale information from the fixed features by using a multi-scale information extraction module, and enhancing the detection capability of the network on different-scale targets. The specific working process is as follows:

(4.1) inputting the multi-stage aggregation characteristics extracted in the step (3.3) into three branches in parallel from top to bottom, sampling the three branches by convolution with different void rates of 2, 4 and 8 respectively, and performing element addition on the original characteristics and the sampled characteristics by using a residual operation;

and (4.2) upsampling the features containing the multi-scale information extracted in the step (4.1), adding elements of the corresponding party features in the step (3.3), and performing 3 x 3 convolution to obtain features after gradual fusion.

And S5, extracting significant boundary information from the step-by-step fusion features by using a boundary information extraction module, further supplementing significant target information, and enhancing the detection effect of the network. The specific working process is as follows:

(5.1) carrying out information conversion on the features obtained in the step (4.2) to obtain a feature map containing boundary information;

(5.2) reducing the channel dimension to 1 through 1 × 1 convolution on the features obtained in the step (5.1), obtaining boundary output through up-sampling, and fusing a plurality of boundary outputs to obtain fused boundary output;

and (5.3) performing upsampling on the boundary features in the step (5.1), splicing along the channels, and further fusing and changing the number of the channels through a boundary aggregation module to obtain the boundary features.

And S6, fusing the boundary information and the target information to obtain the final significance prediction. The specific working process is as follows:

(6.1) carrying out multi-scale information extraction again on the last feature in the (4.2), wherein the last feature is the bottom layer, so that element addition with the transverse feature is not needed;

and (6.2) splicing the boundary characteristics of (5.3) and the characteristics obtained in (6.1) along a channel, further performing convolution fusion, and obtaining a final significance prediction map through channel transformation and upsampling.

S7, training the target detection model by using the obtained boundary detection result, target detection result and corresponding training set image: in the training process, binary cross entropy loss is used, consistency enhancement loss and balance binary cross entropy loss are used for promoting gradient return, model convergence is enhanced, and the training effect is further improved.

S8, regarding the trained salient object detection model, using the test image as input, and obtaining a salient object detection result, as shown in fig. 6. The specific working process is as follows:

(8.1) regarding the significance target detection model in the step S7, taking the test set in the step (1.1) as an input to obtain a significance detection result;

(8.2) comparing the detection result of the saliency target detection model in the step (8.1) with the actual saliency target true value map, the saliency target detection model in the step (8.1) achieves excellent detection effect, and the detection result performs very well on Fmax, MAE and Em indexes on the four data sets, as shown in the figure.

Claims

1. A salient object detection method based on boundary enhancement is characterized by comprising the following steps:

and S4, training the saliency target detection model by using the saliency target detection result, the boundary detection result and the corresponding training set until a convergence condition is met, and performing target detection by using the trained saliency target detection model.

2. The method for detecting the salient object based on the boundary enhancement as claimed in claim 1, wherein ResNet-50 trained on ImageNet is used as a backbone of a network to extract abstract feature maps with different resolutions from a training set image, and a final pooling layer and a full connection layer are removed to obtain five multi-level fusion feature maps with different sizes.

3. The method for detecting the salient object based on the boundary enhancement as claimed in claim 1, wherein the obtained multi-level fusion feature maps with different scales are fused from top to bottom in a reverse step-by-step manner, the multi-scale information extraction is performed on the current multi-level fusion feature map before each fusion, and then the up-sampling and the previous multi-level fusion feature map are fused to obtain the feature map containing the multi-scale information.

4. The method for detecting a salient object based on boundary enhancement as claimed in claim 1, wherein in step S2, the fusion of multilevel features is to keep the feature layer and its neighboring feature layer consistent in size through upsampling or pooling, supplement information by adding elements to each other, and aggregate the supplemented features to obtain five multilevel aggregate feature maps with different sizes.

5. The method for detecting the salient object based on the boundary enhancement as claimed in claim 1, is characterized in that the sampling is performed by using the hole convolution with different hole rates, and the information with different scales is obtained through different receptive fields, so that the detection capability of a network model for the object with the scale change is improved.

6. The method for detecting the salient object based on the boundary enhancement as claimed in claim 1, wherein in each step-by-step feature fusion process, multi-scale information extraction is performed on features of different sizes for multiple times, and the multi-scale information is further fused.

7. The method as claimed in claim 1, wherein the step-by-step fusion features are used as input, and the boundary detection is performed on each step of features to extract the boundary information of the target.

8. The method of claim 1, wherein the training process uses a loss function and adjusts parameters when the loss is propagated backwards.

9. The method of claim 8, wherein the loss function comprises significant object detection loss and significant boundary detection loss, wherein the significant object detection loss is used to guide the correct classification of the pixels in the significant object boundary region, and the significant boundary detection loss is used to guide the correct classification of the pixels in the significant object boundary region.

10. A salient object detection system based on boundary enhancement is characterized by comprising a convolution feature extraction network module, a multi-stage feature aggregation module, a boundary information extraction module, a multi-scale information extraction module and a detection module;

the convolutional feature extraction network is used for extracting abstract feature maps with different resolutions from a training set image, the multi-level feature aggregation module is used for carrying out multi-level fusion on the abstract feature maps to obtain multi-level fusion feature maps, and the multi-scale information extraction module is used for carrying out multi-scale extraction on the multi-level fusion feature maps to obtain feature maps containing multi-scale information;

the boundary information extraction module is used for carrying out information conversion on the obtained characteristic diagram containing the multi-scale information and then splicing and fusing the characteristic diagram to obtain characteristics containing the boundary information, obtaining a boundary detection result by utilizing the characteristics after each level of conversion, further splicing and fusing the characteristics to obtain a fused boundary detection result, and splicing the characteristic diagram containing the multi-scale information with the characteristic information containing the boundary information to obtain a significant target detection result after the multi-scale information extraction;

the detection module is used for training a significance target detection model according to the detected target detection result and the corresponding training set image until the loss value meets the convergence condition, and performing target detection by using the trained significance target detection model.