Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an image contour detection method based on multi-level feature-channel optimized coding.
Although the NSCT transform has excellent performance in characterizing image details, it usually performs optimized encoding by weighting the decomposition results over scale and direction, and the artificial setting of these weighting parameters introduces large uncertainty into the detection results. Considering the effectiveness of the Gabor filter in perceiving the scale and direction of image targets, the invention first calculates, for the input image I(x, y), the optimal scale m_opt and direction θ_opt of the Gabor filter, and uses the obtained m_opt and θ_opt as the frequency separation parameters of the NSCT transform, replacing the traditional redundant fusion mode in which Gabor and NSCT must traverse all scales and directions. In addition, the invention performs feature enhancement fusion between the contour subgraph obtained by the NSCT and I(x, y), which helps obtain the primary contour response E(x, y) of I(x, y) efficiently and accurately. E(x, y) is then fed into a full convolutional neural network composed of FCN-32S, FCN-16S and FCN-8S network units; active learning of the network parameters is realized by the convolution and pooling modules of the feature encoder, an image contour mask map corresponding to I(x, y) is obtained through the deconvolution and upsampling modules of the feature decoder, and a dot multiplication operation between the mask map and I(x, y) finally realizes accurate detection of the image contour. The method specifically comprises the following steps:
Step 1, acquiring a primary contour response of the input image I(x, y). First, the Gabor filter response of the input image I(x, y) is calculated and denoted G_{m,n}(x, y), as shown in formulas (1) to (4).
In the formulas: G_{m,n}(x, y) represents the Gabor feature information of the image I(x, y) obtained through a Gabor filter at scale m and direction θ = nπ/K; σ_x and σ_y respectively represent the standard deviations of the Gabor wavelet basis function along the x-axis and the y-axis; ω is the complex modulation frequency of the Gaussian function; the Gabor filter ψ_{m,n}(x, y) is obtained by taking ψ(x, y) as the mother wavelet and performing scale and rotation transformations on it; u × v is the template size of ψ_{m,n}(x, y); m = 0, …, S−1 and n = 0, …, K−1, where S and K respectively denote the numbers of scales and directions; α is the scale factor of ψ(x, y), with α > 1.
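Formulas (1) to (4) are not reproduced in this text. As an illustration only, the following Python sketch builds a comparable multi-scale, multi-direction Gabor filter bank with OpenCV; the parameter values (ksize, sigma, lambd, gamma) are placeholders rather than values taken from the patent, and the scale factor alpha > 1 dilates the kernel between scales in the spirit of the ψ_{m,n}(x, y) construction.

```python
import cv2
import numpy as np

def gabor_bank_responses(image, S=4, K=8, ksize=31, sigma=4.0,
                         lambd=10.0, gamma=0.5, alpha=2.0):
    """Return the Gabor response magnitude G_{m,n} of `image` for every
    scale m = 0..S-1 and direction theta = n*pi/K, n = 0..K-1."""
    img = image.astype(np.float64)
    responses = {}
    for m in range(S):
        for n in range(K):
            theta = n * np.pi / K
            kernel = cv2.getGaborKernel(
                (ksize, ksize),
                sigma * alpha ** m,      # envelope grows with scale m
                theta,
                lambd * alpha ** m,      # wavelength grows with scale m
                gamma, psi=0, ktype=cv2.CV_64F)
            responses[(m, n)] = np.abs(cv2.filter2D(img, cv2.CV_64F, kernel))
    return responses
```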
The optimal scale m_opt and direction θ_opt of the Gabor filter are then calculated based on the structural similarity index SSIM, as shown in formulas (5) to (8). Here SSIM(G_{m,n}, I_mark) represents the similarity between the filter response G_{m,n}(x, y) and the known contour marker image I_mark; when SSIM(G_{m,n}, I_mark) takes its maximum value, the optimal scale m_opt and direction θ_opt are obtained. l(G_{m,n}, I_mark), c(G_{m,n}, I_mark) and s(G_{m,n}, I_mark) respectively represent quantitative similarity measures of brightness, contrast and structure between G_{m,n} and I_mark; u_Gabor and u_mark respectively represent the brightness means of the images G_{m,n} and I_mark; δ_Gabor and δ_mark respectively represent their brightness standard deviations; δ²_Gabor and δ²_mark respectively represent their brightness variances; and δ_{G,m} represents the brightness covariance of G_{m,n} and I_mark. In I_mark, pixels in the contour region are 1 and all other pixels are 0. To avoid system instability caused by the denominators in formulas (6) to (8) approaching zero, C_1, C_2 and C_3 are set to small positive constants, each less than 3% of the mean brightness value of the filter response G_{m,n}(x, y).
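Formulas (5) to (8) are not reproduced in this text, but the definitions above match the standard SSIM decomposition, which in the notation used here reads:

```latex
\mathrm{SSIM}(G_{m,n}, I_{mark}) = l \cdot c \cdot s, \qquad
l = \frac{2 u_{Gabor} u_{mark} + C_1}{u_{Gabor}^2 + u_{mark}^2 + C_1}, \quad
c = \frac{2 \delta_{Gabor} \delta_{mark} + C_2}{\delta_{Gabor}^2 + \delta_{mark}^2 + C_2}, \quad
s = \frac{\delta_{G,m} + C_3}{\delta_{Gabor} \delta_{mark} + C_3}
```

A sketch of the (m_opt, θ_opt) search follows, reusing the hypothetical gabor_bank_responses helper above and scikit-image's structural_similarity as a stand-in for formulas (5) to (8):

```python
import numpy as np
from skimage.metrics import structural_similarity

def find_optimal_scale_direction(image, contour_mark, S=4, K=8):
    """Return (m_opt, theta_opt) maximizing SSIM(G_{m,n}, I_mark)."""
    responses = gabor_bank_responses(image, S=S, K=K)  # sketch from step 1
    best, m_opt, n_opt = -np.inf, 0, 0
    for (m, n), resp in responses.items():
        score = structural_similarity(
            resp, contour_mark.astype(np.float64),
            data_range=max(resp.max() - resp.min(), 1e-12))
        if score > best:
            best, m_opt, n_opt = score, m, n
    return m_opt, n_opt * np.pi / K
```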
The obtained m_opt and θ_opt are taken as the frequency separation parameters of the NSCT, and the NSCT decomposition of the image I(x, y) yields a contour subgraph C(x, y). Since the NSCT decomposition leaves the image dimensions unchanged, C(x, y) is directly fused with I(x, y) in a pixel-level feature enhancement operation, finally giving the primary contour response E(x, y) of the input image I(x, y), as shown in formulas (9) and (10). In the formulas, NSCT_{m_opt,θ_opt}(·) represents the non-subsampled contourlet transform under the scale m_opt and direction θ_opt parameter conditions, and C(x, y) represents the corresponding NSCT contour subgraph; T represents the brightness mean of the contour subgraph C(x, y); max is the maximum-value function, the same below.
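Formulas (9) and (10) are likewise not reproduced here; the text fixes only the ingredients (the subgraph C(x, y), its brightness mean T, and the max function). The sketch below is therefore one plausible reading rather than the patent's exact rule: pixels where the subgraph response exceeds T are enhanced with the max function, all others keep the original intensity. The NSCT decomposition itself is assumed to come from an external toolbox.

```python
import numpy as np

def primary_contour_response(image, contour_subgraph):
    """Illustrative pixel-level feature enhancement fusion of I(x, y)
    with the NSCT contour subgraph C(x, y); not the patent's exact
    formulas (9)-(10), which are not reproduced in this text."""
    I = image.astype(np.float64)
    C = contour_subgraph.astype(np.float64)
    T = C.mean()  # brightness mean of the contour subgraph
    return np.where(C > T, np.maximum(I, C), I)
```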
Step 2: the primary contour response E(x, y) obtained in step 1 is fed into a full convolutional neural network, obtaining heat maps F_5, F_4 and F_3 trained by the FCN-32S, FCN-16S and FCN-8S network units, respectively. The full convolutional neural network is divided into a feature encoder and a feature decoder; the whole network comprises 8 convolution blocks, 5 maximum pooling layers, 5 upsampling layers and 2 convolutional layers. The concrete structure is as follows:
1. Feature encoder
Taking VGG-16 as the basic network, the full convolutional neural network is optimized and reconstructed on top of it. To improve the network computation speed and enhance the generalization capability, a 1 × 1 convolution kernel is inserted between every two 3 × 3 convolution kernels in a convolution block, giving a (3 × 3, 1 × 1, 3 × 3) structure; to strengthen the nonlinearity and translation invariance of the learned image features, a maximum pooling layer is added after each convolution module. After pooling layer Max pool5, the size of E(x, y) becomes 1/32 of I(x, y); the result, denoted E_{1/32}, represents the feature map output after training of the FCN-32S network unit. After pooling layer Max pool4 and a 1 × 1 convolutional layer, the size becomes 1/16 of I(x, y); the result, denoted E_{1/16}, represents the feature map output after training of the FCN-16S network unit. Similarly, after pooling layer Max pool3 and a 1 × 1 convolutional layer, the size becomes 1/8 of I(x, y); the result, denoted E_{1/8}, represents the feature map output after training of the FCN-8S network unit. The output of each pooling layer passes through a ReLU activation function to realize sparse coding. The feature encoder comprises the following thirteen-layer structure, in which all strides are 1 (an illustrative code sketch follows the list):
the first layer: convolutional layer CONV1-1, 8 channels, 3 × 3 convolution kernel; CONV1-2, 8 channels, 3 × 3 convolution kernel;
the second layer: maximum pooling layer Max pool1, 2 × 2 pooling area;
the third layer: convolutional layer CONV2-1, 16 channels, 3 × 3 convolution kernel; CONV2-2, 16 channels, 1 × 1 convolution kernel; CONV2-3, 16 channels, 3 × 3 convolution kernel;
the fourth layer: maximum pooling layer Max pool2, 2 × 2 pooling area;
the fifth layer: convolutional layer CONV3-1, 32 channels, 3 × 3 convolution kernel; CONV3-2, 32 channels, 1 × 1 convolution kernel; CONV3-3, 32 channels, 3 × 3 convolution kernel;
the sixth layer: maximum pooling layer Max pool3, 2 × 2 pooling area;
the seventh layer: convolutional layer CONV4-1, 64 channels, 3 × 3 convolution kernel; CONV4-2, 64 channels, 1 × 1 convolution kernel; CONV4-3, 64 channels, 3 × 3 convolution kernel;
the eighth layer: maximum pooling layer Max pool4, 2 × 2 pooling area;
the ninth layer: convolutional layer CONV5-1, 128 channels, 3 × 3 convolution kernel; CONV5-2, 128 channels, 1 × 1 convolution kernel; CONV5-3, 128 channels, 3 × 3 convolution kernel;
the tenth layer: maximum pooling layer Max pool5, 2 × 2 pooling area;
the eleventh layer: convolutional layer CONV6, 256 channels, 1 × 1 convolution kernel;
the twelfth layer: convolutional layer CONV7, 256 channels, 1 × 1 convolution kernel;
the thirteenth layer: convolutional layer CONV8, 1 channel, 1 × 1 convolution kernel;
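As one concrete reading of the thirteen-layer table above, here is a minimal PyTorch sketch of the feature encoder, assuming a single-channel input E(x, y), stride 1 everywhere, and padding=1 on the 3 × 3 kernels so that only the pooling layers change the spatial size; the exact placement of the ReLU activations and all names are illustrative.

```python
import torch.nn as nn

class FeatureEncoder(nn.Module):
    """VGG-16-style encoder with (3x3, 1x1, 3x3) blocks and 2x2 max pooling.
    Returns single-channel maps E_{1/8}, E_{1/16}, E_{1/32}."""
    def __init__(self):
        super().__init__()
        def block(cin, cout, with_1x1=True):
            layers = [nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True)]
            if with_1x1:
                layers += [nn.Conv2d(cout, cout, 1), nn.ReLU(inplace=True)]
            layers += [nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True)]
            return nn.Sequential(*layers)
        self.conv1 = block(1, 8, with_1x1=False)   # CONV1-1, CONV1-2
        self.conv2 = block(8, 16)                  # CONV2-1..CONV2-3
        self.conv3 = block(16, 32)                 # CONV3-1..CONV3-3
        self.conv4 = block(32, 64)                 # CONV4-1..CONV4-3
        self.conv5 = block(64, 128)                # CONV5-1..CONV5-3
        self.pool = nn.MaxPool2d(2)                # Max pool1..Max pool5
        self.pred3 = nn.Conv2d(32, 1, 1)           # 1x1 prediction conv after pool3
        self.pred4 = nn.Conv2d(64, 1, 1)           # 1x1 prediction conv after pool4
        self.conv678 = nn.Sequential(              # CONV6, CONV7, CONV8
            nn.Conv2d(128, 256, 1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 1, 1))

    def forward(self, e):
        p1 = self.pool(self.conv1(e))    # size 1/2
        p2 = self.pool(self.conv2(p1))   # size 1/4
        p3 = self.pool(self.conv3(p2))   # size 1/8
        p4 = self.pool(self.conv4(p3))   # size 1/16
        p5 = self.pool(self.conv5(p4))   # size 1/32
        return self.pred3(p3), self.pred4(p4), self.conv678(p5)
```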
2. Feature decoder
After feature encoding, the primary contour response E(x, y) has been successively reduced to 1/8, 1/16 and 1/32 of its original size, so the resulting feature maps have low resolution; a feature decoder is therefore added to perform bilinear upsampling on these low-resolution feature maps. The 32-times-downsampled image E_{1/32} is upsampled by 32-fold bilinear interpolation to obtain a heat map of the same size as I(x, y), denoted F_5. A 1 × 1 prediction convolutional layer for adjusting the number of feature-map channels is added after pooling layer Max pool4 and outputs the image E_{1/16}; this is added element-wise to the result of 2-fold upsampling of the 32-times-downsampled image E_{1/32}, and 16-fold bilinear upsampling of the sum gives a heat map of the same size as I(x, y), denoted F_4. Likewise, a 1 × 1 prediction convolutional layer for adjusting the number of feature-map channels is added after pooling layer Max pool3 and outputs the image E_{1/8}; this is added element-wise to the result of 2-fold upsampling of the 16-times-downsampled fusion result, and 8-fold bilinear upsampling gives a heat map of the same size as I(x, y), denoted F_3.
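A minimal sketch of the decoder's skip fusion, assuming the three encoder outputs have already been reduced to a single channel by their 1 × 1 prediction convolutions; torch.nn.functional.interpolate stands in for the bilinear upsampling layers, and all names are illustrative.

```python
import torch.nn.functional as F

def decode_heatmaps(e8, e16, e32):
    """FCN-style decoding of E_{1/8}, E_{1/16}, E_{1/32} into F5, F4, F3,
    each upsampled back to the input size by bilinear interpolation."""
    up = lambda x, s: F.interpolate(x, scale_factor=s, mode="bilinear",
                                    align_corners=False)
    f5 = up(e32, 32)          # 32x bilinear upsampling of E_{1/32}
    s16 = e16 + up(e32, 2)    # element-wise sum at 1/16 resolution
    f4 = up(s16, 16)          # 16x bilinear upsampling
    s8 = e8 + up(s16, 2)      # element-wise sum at 1/8 resolution
    f3 = up(s8, 8)            # 8x bilinear upsampling
    return f5, f4, f3
```

Under these assumptions, the full forward pass of step 2 is simply the composition f5, f4, f3 = decode_heatmaps(*FeatureEncoder()(e)).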
Step 3: for the heat maps F_5, F_4 and F_3 obtained in step 2, the max function takes the maximum pixel value at each pixel location, and the results are fused into an image contour mask map F. After the action of a ReLU activation function, a loss operation is performed between F and the known contour marker image I_mark, and the result is recorded as loss. The parameters of each network layer are continuously and iteratively updated by stochastic gradient descent; training ends when the loss value is smaller than a threshold ε, set to 1–3% of the total number of pixels in the training image samples, yielding the trained full convolutional neural network.
Step 4: the image to be detected is passed through the Gabor filter constructed in steps 1–3, the non-subsampled contourlet transform, and the trained full convolutional neural network to obtain an image contour mask map; a dot multiplication operation between the mask map and the image to be detected finally yields the image contour detection result.
The invention has the following beneficial effects:
1. A novel primary contour response method with multi-level feature-channel optimized coding is provided. Although the NSCT transform can simulate the frequency-domain separation performed by the lateral geniculate nucleus (LGN) in visual information processing, the artificial setting of weighting parameters during image decomposition gives its detection results large uncertainty. Considering that the response characteristics of the Gabor filter are similar to those of the human visual system, with a degree of robustness to illumination and pose as well as high-quality spatial locality and orientation selectivity, the invention proposes searching, for each picture, the optimal scale and direction of the Gabor filter response and then using them as the direct basis for setting the frequency separation parameters of the NSCT; the contour subgraph obtained by the NSCT is fused with the original image through feature enhancement, which helps obtain the primary contour response efficiently and accurately. The constructed primary contour response method with multi-level feature-channel optimized coding yields a low-dimensional, low-redundancy image feature channel and has important application prospects for relieving network pressure, reducing the computational complexity of the convolutional neural network, and improving the training efficiency of the network.
2. A full convolutional neural network is constructed for multi-scale training, so that the complementary characteristics of FCN-32S, FCN-16S and FCN-8S in heat map expression, such as smoothness and fineness, are fully exploited. The network is divided into a feature encoder and a feature decoder and works end to end, without region selection on the target image: the feature encoder continuously and actively learns feature parameters through convolution and pooling while shrinking the feature maps proportionally, and the feature decoder preserves the two-dimensional character of the extracted features through deconvolution and upsampling, representing the main contours of the image with heat maps of the same size as the original image. This realizes a prediction for every pixel while retaining the spatial information of the original image.
Detailed Description
The present invention is further described below with reference to the accompanying drawings; the overall flow is shown in FIG. 1.
Step 1, acquiring a primary contour response of the input image I(x, y). First, the Gabor filter response of the input image I(x, y) is calculated and denoted G_{m,n}(x, y), as shown in formulas (11) to (14).
In the formulas: G_{m,n}(x, y) represents the Gabor feature information of the image I(x, y) obtained through a Gabor filter at scale m and direction θ = nπ/K; σ_x and σ_y respectively represent the standard deviations of the Gabor wavelet basis function along the x-axis and the y-axis; ω is the complex modulation frequency of the Gaussian function; the Gabor filter ψ_{m,n}(x, y) is obtained by taking ψ(x, y) as the mother wavelet and performing scale and rotation transformations on it; u × v is the template size of ψ_{m,n}(x, y); m = 0, …, S−1 and n = 0, …, K−1, where S and K respectively denote the numbers of scales and directions; α is the scale factor of ψ(x, y), with α > 1.
The optimal scale m_opt and direction θ_opt of the Gabor filter are then calculated based on the structural similarity index SSIM, as shown in formulas (15) to (18). Here SSIM(G_{m,n}, I_mark) represents the similarity between the filter response G_{m,n}(x, y) and the known contour marker image I_mark; when SSIM(G_{m,n}, I_mark) takes its maximum value, the optimal scale m_opt and direction θ_opt are obtained. l(G_{m,n}, I_mark), c(G_{m,n}, I_mark) and s(G_{m,n}, I_mark) respectively represent quantitative similarity measures of brightness, contrast and structure between G_{m,n} and I_mark; u_Gabor and u_mark respectively represent the brightness means of the images G_{m,n} and I_mark; δ_Gabor and δ_mark respectively represent their brightness standard deviations; δ²_Gabor and δ²_mark respectively represent their brightness variances; and δ_{G,m} represents the brightness covariance of G_{m,n} and I_mark. In I_mark, pixels in the contour region are 1 and all other pixels are 0. To avoid system instability caused by the denominators in formulas (16) to (18) approaching zero, C_1, C_2 and C_3 are set to small positive constants, each less than 3% of the mean brightness value of the filter response G_{m,n}(x, y).
The obtained m_opt and θ_opt are taken as the frequency separation parameters of the NSCT, and the NSCT decomposition yields the contour subgraph C(x, y). Since the NSCT keeps the dimensions unchanged after decomposing the image I(x, y), C(x, y) is directly fused with I(x, y) in a pixel-level feature enhancement operation, finally giving the primary contour response E(x, y) of the input image I(x, y), as shown in formulas (19) and (20). In the formulas, NSCT_{m_opt,θ_opt}(·) represents the non-subsampled contourlet transform under the scale m_opt and direction θ_opt parameter conditions, and C(x, y) represents the corresponding NSCT contour subgraph; T represents the brightness mean of the contour subgraph C(x, y); max is the maximum-value function, the same below.
Step 2: as shown in FIG. 2, a full convolutional neural network is constructed, obtaining heat maps F_5, F_4 and F_3 trained by the FCN-32S, FCN-16S and FCN-8S network units, respectively. The full convolutional neural network is divided into a feature encoder and a feature decoder; the whole network comprises 8 convolution blocks, 5 maximum pooling layers, 5 upsampling layers and 2 convolutional layers. The concrete structure is as follows:
1. Feature encoder
(1) The primary contour response E(x, y) passes through the CONV1 convolution block (3 × 3-8, 3 × 3-8), followed by 2 × 2 maximum pooling and a ReLU activation function, giving the image E_{1/2}, as shown in formula (21); it represents I(x, y) after convolution and Max pool1, with the size reduced to 1/2.
Here conv1() represents the convolution operation of the first layer; pool1() represents the first maximum pooling operation; relu() represents the activation function used to sparsify the result, the same below.
(2) E_{1/2} passes through the CONV2 convolution block (3 × 3-16, 1 × 1-16, 3 × 3-16), followed by 2 × 2 maximum pooling, a 1 × 1 prediction convolution for adjusting the number of feature-map channels, and a ReLU activation function, giving the image E_{1/4}, as shown in formula (22); it represents the image E_{1/2} after convolution and Max pool2, with the size becoming 1/4 of I(x, y).
Here conv2() represents the second-layer convolution operation; pool2() represents the second maximum pooling operation.
(3) E_{1/4} passes through the CONV3 convolution block (3 × 3-32, 1 × 1-32, 3 × 3-32), followed by 2 × 2 maximum pooling, a 1 × 1 prediction convolution, and a ReLU activation function, giving the image E_{1/8}, as shown in formula (23); it represents the image E_{1/4} after the convolution block–Max pool3–prediction convolution, with the size becoming 1/8 of I(x, y).
Here conv3() represents the third-layer convolution operation; pool3() represents the third maximum pooling operation; conv1×1() represents a 1 × 1 convolution kernel, the same below.
(4) E_{1/8} passes through the CONV4 convolution block (3 × 3-64, 1 × 1-64, 3 × 3-64), followed by 2 × 2 maximum pooling, a 1 × 1 prediction convolution for adjusting the number of feature-map channels, and a ReLU activation function, giving the image E_{1/16}, as shown in formula (24); it represents the image E_{1/8} after the convolution block–Max pool4–prediction convolution, with the size becoming 1/16 of I(x, y).
Here conv4() represents the fourth-layer convolution operation; pool4() represents the fourth maximum pooling operation.
(5) E_{1/16} passes through the CONV5 convolution block (3 × 3-128, 1 × 1-128, 3 × 3-128), followed by 2 × 2 maximum pooling and a ReLU activation function, giving the image E_{1/32}, as shown in formula (25); it represents the image E_{1/16} after convolution and Max pool5, with the size becoming 1/32 of I(x, y).
Here conv5() represents the fifth-layer convolution operation; pool5() represents the fifth maximum pooling operation.
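Formulas (21) to (25) are not reproduced in this text; from the operators defined above (conv1()…conv5(), pool1()…pool5(), conv1×1(), relu()), a plausible reconstruction of the five encoder stages is:

```latex
\begin{aligned}
E_{1/2}  &= relu(pool_1(conv_1(E(x,y)))) \\
E_{1/4}  &= relu(conv_{1\times 1}(pool_2(conv_2(E_{1/2})))) \\
E_{1/8}  &= relu(conv_{1\times 1}(pool_3(conv_3(E_{1/4})))) \\
E_{1/16} &= relu(conv_{1\times 1}(pool_4(conv_4(E_{1/8})))) \\
E_{1/32} &= relu(pool_5(conv_5(E_{1/16})))
\end{aligned}
```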
2. Feature decoder
(1) The image E_{1/32} is upsampled by 32-fold bilinear interpolation to obtain a heat map of the same size as I(x, y), denoted F_5, as shown in formula (26).
Here bilinear() represents the bilinear upsampling operation, the same below.
(2) A 1 × 1 prediction convolutional layer for adjusting the number of feature-map channels is added after pooling layer Max pool4 and outputs the image E_{1/16}; this is added element-wise to the result of 2-fold upsampling of the 32-times-downsampled image E_{1/32}, and 16-fold bilinear upsampling of the sum gives a heat map of the same size as I(x, y), denoted F_4, as shown in formula (27).
Here sum() represents the matrix addition operation, the same below.
(3) A 1 × 1 prediction convolutional layer for adjusting the number of feature-map channels is added after pooling layer Max pool3 and outputs the image E_{1/8}; this is added element-wise to the result of 2-fold upsampling of the 16-times-downsampled fusion result, and 8-fold bilinear upsampling gives a heat map of the same size as I(x, y), denoted F_3, as shown in formula (28).
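Formulas (26) to (28) are not reproduced in this text; from the bilinear() and sum() operators defined above, a plausible reconstruction is:

```latex
\begin{aligned}
F_5 &= bilinear_{32\times}(E_{1/32}) \\
F_4 &= bilinear_{16\times}(sum(E_{1/16},\, bilinear_{2\times}(E_{1/32}))) \\
F_3 &= bilinear_{8\times}(sum(E_{1/8},\, bilinear_{2\times}(sum(E_{1/16},\, bilinear_{2\times}(E_{1/32})))))
\end{aligned}
```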
Step 3: for the heat maps F_5, F_4 and F_3 obtained in step 2, the max function takes the maximum pixel value at each pixel location, and the results are fused into the image contour mask map F, as shown in formula (29). After the action of a ReLU activation function, a loss operation is performed against the training images with known manually marked contours, and the result is recorded as loss, as shown in formula (30). The parameters of each network layer are continuously and iteratively updated by stochastic gradient descent (SGD); training ends when the loss value is smaller than the threshold ε, set to 1–3% of the total number of pixels in the training image samples, yielding the trained full convolutional neural network.
F = max(F_5, F_4, F_3)    (29)
Here M and N are the numbers of rows and columns of the training image, F_{i,j} represents the pixel value of the image contour mask map F at coordinate (i, j), and I_{mark,i,j} is the pixel value of the known contour marker image I_mark at coordinate (i, j).
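Formula (30) is not reproduced here; since ε is specified as 1–3% of the total pixel count, a pixel-wise absolute difference between F and I_mark is one loss form consistent with that threshold, and it is what the sketch below assumes. The model is taken to map E(x, y) to the three heat maps, for instance by composing the hypothetical FeatureEncoder and decode_heatmaps sketched in step 2.

```python
import torch

def train_fcn(model, samples, lr=1e-3, eps_ratio=0.02, max_epochs=1000):
    """SGD training loop; stops when loss < eps.

    `samples` yields (e, mark) pairs: the primary contour response and the
    binary contour marker image, both as (1, 1, H, W) tensors."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(max_epochs):
        for e, mark in samples:
            f5, f4, f3 = model(e)
            F = torch.relu(torch.max(torch.max(f5, f4), f3))  # formula (29) + ReLU
            loss = (F - mark).abs().sum()   # assumed pixel-wise loss, cf. (30)
            opt.zero_grad()
            loss.backward()
            opt.step()
            if loss.item() < eps_ratio * mark.numel():  # eps = 1-3% of pixels
                return model
    return model
```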
Step 4: the image to be detected is passed through the Gabor filter constructed in steps 1–3, the non-subsampled contourlet transform, and the trained full convolutional neural network to obtain an image contour mask map; a dot multiplication operation between the mask map and the image to be detected finally yields the image contour detection result.
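Putting step 4 together, an end-to-end inference sketch under the assumptions of the earlier snippets: every helper name here is hypothetical, and nsct_contour_subgraph in particular is a placeholder for whatever NSCT implementation is used, since the patent does not name a library.

```python
import numpy as np
import torch

def detect_contours(image, m_opt, theta_opt, model, nsct_contour_subgraph):
    """Illustrative detection pipeline: NSCT subgraph -> primary contour
    response -> trained FCN -> mask -> element-wise product with the image."""
    C = nsct_contour_subgraph(image, m_opt, theta_opt)   # placeholder NSCT call
    E = primary_contour_response(image, C)               # fusion sketch, step 1
    e = torch.from_numpy(E).float()[None, None]          # shape (1, 1, H, W)
    with torch.no_grad():
        f5, f4, f3 = model(e)
        mask = torch.relu(torch.max(torch.max(f5, f4), f3))[0, 0].numpy()
    return mask * image                                  # dot multiplication
```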