
CN117036656A - A method for identifying floating objects on the water surface in complex scenes

Info

Publication number: CN117036656A (granted as CN117036656B)
Application number: CN202311038529.8A
Authority: CN (China)
Prior art keywords: network, feature, convolution, loss, attention
Inventors: 俞万能, 曾广淼, 王荣杰, 李慧慧, 商逸帆
Applicant and assignee: Jimei University
Other languages: Chinese (zh)
Legal status: Granted; Active

Classifications

    • G06V 10/255 Detecting or recognising potential candidate objects based on visual cues, e.g. shapes
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/08 Learning methods
    • G06V 10/82 Image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 20/10 Terrestrial scenes
    • G06V 2201/07 Target detection


Abstract

The invention relates to the technical field of water surface floating object treatment, and in particular to a method for identifying floating objects on the water surface in complex scenes. The method comprises the following specific steps. Network model structural design: the Yolov7 algorithm is a target detection algorithm proposed in 2022, and test results on the COCO dataset show that its speed and accuracy are superior to those of target detectors such as YOLOR, YOLOX, Scaled-Yolov4 and Yolov5. The water surface floating object detection problem is analyzed further, and a target detection algorithm integrating an attention mechanism, Yolov7-FC, is proposed; its network structure comprises Backbone, Attention, FPN and Head layers. The network optimization method comprises a loss function and image preprocessing. The invention provides a water surface floating object identification algorithm integrating an attention mechanism, which is tested on floating objects in different scenes; it improves identification precision and recall, and testing on an offline small mobile platform shows that it meets the requirement of a shipborne platform for real-time detection.

Description

Method for identifying floating objects on the water surface in complex scenes
Technical Field
The invention relates to the technical field of water surface floating object treatment, and in particular to a method for identifying floating objects on the water surface in complex scenes.
Background
Marine waste pollution has become a global environmental problem, in inland rivers as well as in the ocean, and it severely threatens the balance of the ecological environment. Most waste is carried into the ocean via large urban rivers, and during this process large plastics break down into microplastics, which have an even more serious impact on the aquatic ecosystem. If plastic garbage can be cleaned up as thoroughly as possible, the impact of plastics on marine ecology can be slowed.

For marine garbage, many coastal countries have paid increasing attention and carried out a certain amount of cleaning work, but manual cleaning is still the main method. The inefficiency and high cost of manual cleaning remain an important problem for marine garbage of large range and large quantity. Therefore, an automatic cleaning scheme integrating multiple sensors is designed: remote sensing image information is analyzed to determine the distribution range of water surface floating objects and grade the degree of pollution of different water areas; unmanned aerial vehicles then search a polluted area and capture targets that are difficult to collect manually. For areas where a large amount of floating garbage is distributed, an unmanned boat carrying a cleaning robot can be dispatched to perform autonomous cleaning operations. For underwater garbage, unmanned underwater vehicles can likewise be used for collection. Such unmanned equipment can complete this work efficiently. Thanks to the rapid development of deep learning in recent years, the technology has matured in the field of marine environment perception and the intelligence of unmanned robots has become more complete, so using unmanned equipment instead of manpower for marine garbage cleaning has become a trend.

However, for an unmanned aerial vehicle, unmanned boat or unmanned underwater vehicle to accurately identify the targets to be cleaned, a reliable automatic recognition system is required for target detection. Deep-learning-based target recognition technology, which has already been widely used in many fields, can provide several feasible and reliable schemes for this purpose, so that accurate real-time detection of the targets to be cleaned can be accomplished.

Detection of small targets is more difficult than detection of large surface targets. These small floating objects also come in different kinds, such as wood blocks, aquatic plants, plastic bottles and plastic bags, among which only plastic products are the floating garbage targets that the unmanned boat needs to salvage.

Taking water surface floating objects as the research object and the plastic bottle as an example, small floating objects are characterized by small volume and transparent color and blend easily into the background; radar can hardly distinguish them from interfering objects, but a visual sensor can capture their features. Therefore, a target detection algorithm is designed based on the visible light sensor and visual information to complete the task of identifying small floating object targets. The recognition algorithm must not only achieve high recognition accuracy but also run fast enough to meet the requirements of autonomous cleaning operation by the unmanned boat.

The complex and changeable water surface environment, such as light spot reflection, insufficient light and interfering objects, greatly affects the recognition system and reduces the accuracy of target recognition, so a small-target recognition method for water surface floating objects that handles different complex scenes needs to be designed.

In addition, current small-target recognition algorithms are all studied on public datasets, such as the COCO dataset and the PASCAL VOC dataset, but the patterns of small targets in public datasets still differ from those of water surface floating object targets, and many excellent small-target recognition algorithms are not optimized for the water surface application scene. The invention adopts the FloW dataset, a dataset produced specifically for water surface floating objects, for training and testing, and verifies the advantages of the proposed algorithm by comparing and evaluating multiple algorithms.
Disclosure of Invention
The invention provides a water surface floating object identification algorithm integrating an attention mechanism, which is tested on floating objects in different scenes; it improves identification precision and recall, and testing on an offline small mobile platform shows that it meets the requirement of a shipborne platform for real-time detection.
The technical scheme adopted by the invention is as follows: a method for identifying floating objects on the water surface in complex scenes, comprising the following specific steps:

S1: network model structural design. The Yolov7 algorithm is a target detection algorithm proposed in 2022, and test results on the COCO dataset show that its speed and accuracy are superior to those of target detectors such as YOLOR, YOLOX, Scaled-Yolov4 and Yolov5. Analyzing the water surface floating object detection problem further, a target detection algorithm integrating an attention mechanism, Yolov7-FC, is proposed. Its network structure comprises Backbone, Attention, FPN and Head layers, representing the framework of the network at different stages of computation from shallow to deep. CBS denotes a convolution block composed of a convolution layer (Conv), a batch normalization layer (BN) and an activation function (SiLU); different CBS colors indicate different sizes of the convolution kernel k and stride s of the convolution layer. CBM also denotes a convolution block; unlike CBS, the activation function in CBM is the sigmoid function. The two activation functions are calculated as shown in equations 1 and 2:

silu(x) = x · σ(x) = x/(1 + e^(-x))  (1)
σ(x) = 1/(1 + e^(-x))  (2)

UPS denotes an upsampling layer, computed by nearest-neighbor interpolation (nearest); ELAN denotes a multi-branch stacking module, in which concat denotes the concatenation operation, O=I denotes that the number of output channels equals the number of input channels, and O=I/2 denotes that the number of output channels equals half the number of input channels; MP denotes a downsampling transition module, MaxPool denotes max pooling, and the CBS blocks within it perform output computation with O=I/2.
S2: the network optimization method comprises a loss function and image preprocessing.

The loss function is specifically: the loss function of the Yolov7-FC algorithm consists of three parts: the target regression loss (Loss_reg), the classification loss (Loss_cls) and the location loss (Loss_loc), as shown in equations 10 to 14:

Loss = λ_reg·Loss_reg + λ_cls·Loss_cls + λ_loc·Loss_loc  (10)

where λ_reg, λ_cls and λ_loc respectively represent the weights of the three categories of loss in the loss function. The Yolov7-FC network divides each input picture into K×K cells and generates M anchor boxes (anchors) in each cell; after the forward computation of the network, each anchor box yields an adjusted bounding box, for a total of K×K×M bounding boxes. I_ij^obj and I_ij^noobj indicate whether the center coordinates of the target fall in the j-th anchor box of the i-th cell: if so, the former equals 1 and the latter equals 0, and otherwise the reverse. C_i is the confidence of the real box in the i-th cell, and Ĉ_i is the confidence of the predicted box in the i-th cell. p_i(k) is the conditional probability that the real box in the i-th cell contains an object of the k-th class, and p̂_i(k) is the conditional probability that the predicted box in the i-th cell contains an object of the k-th class.
The image preprocessing is specifically: the mosaic image enhancement method (Mosaic) was developed from the mixed image enhancement method (Mixup), producing a new data enhancement algorithm; unlike the two-picture overlay fusion of the cut-and-mix approach, it cuts and stitches four pictures to form a new picture. This method enriches the background of the target object and prevents the generalization ability of the network from degrading because the backgrounds in the training set are similar. However, since the distribution of the generated training pictures differs considerably from that of natural pictures, the method must be disabled after z iterations so that the network deepens its understanding of natural pictures.

Image data enhancement is performed by combining Mosaic and Mixup; the judgment formula for whether the data enhancement method is adopted in each training epoch is shown in equation 23, where Bool() represents the Boolean operation, & represents the AND operation, θ_1 and θ_2 are the judgment parameters, and z is 0.7.
As a further scheme of the invention: the network model structural design includes Backbone, Attention, FPN and Head.
As a further scheme of the invention: in the Backbone network, Yolov7-FC first uses ELAN modules for feature extraction and then uses transition modules for downsampling, thereby obtaining three effective feature layers for the next stage of network construction.
In the ELAN module, the network divides the input feature into 5 branches for computation, which pass through 1, 3, 5 and 7 convolution blocks in sequence; after the concat computation, the result is output through 1 convolution block. Through this dense residual structure, features of 5 different depths can be fused, and by using skip-connected residual blocks, the influence of the gradient vanishing problem caused by increasing network depth is reduced.

In the MP module, the network divides the input feature into 2 branches for computation: the 1st branch is max pooling plus a convolution block, and the 2nd branch is two convolution blocks with different convolution kernels and strides; the subsequent output result is obtained by applying the concat computation to the two branches.
As a further scheme of the invention: after feature extraction of the input picture through the backbone network, the network uses an attention mechanism to increase the attention paid to effective features; attention information is generated in the two dimensions of channel and space by the convolutional block attention module (CBAM) and combined to generate a new feature map.
For the network input feature map F, the computation is divided into an upper and a lower process, generating the channel attention feature map M_c and the spatial attention feature map M_s respectively, as shown in equations 2 to 4:

F ∈ R^(C×H×W)  (2)
M_c ∈ R^(C×1×1)  (3)
M_s ∈ R^(1×H×W)  (4)

where C, H and W represent the number of channels, height and width of the feature map, respectively.
In the computation on the feature map F, the channel attention algorithm is first used to generate the feature map F′, and the spatial attention algorithm is then used to generate the feature map F″, as shown in equations 6 and 7:

F′ = M_c(F) ⊗ F  (6)
F″ = M_s(F′) ⊗ F′  (7)

where ⊗ represents element-wise multiplication of corresponding elements.
In the channel attention module, the network compresses the spatial dimension to focus on the discriminative features of the image in the channel dimension. Feature extraction uses both average pooling and max pooling: average pooling F_avg^c captures the overall features of the feature region, and max pooling F_max^c captures its salient features. A weight-shared multi-layer perceptron (MLP) network then performs feature fusion to obtain the final channel attention feature map M_c. The calculation process is shown in equation 8:

M_c(F) = σ(MLP(F_avg^c) + MLP(F_max^c)) = σ(W_1(W_0(F_avg^c)) + W_1(W_0(F_max^c)))  (8)

where σ represents the sigmoid activation function, W_0 and W_1 are the weights of the MLP, and r is the dimension-reduction coefficient used in the MLP.
In the spatial attention module, the network compresses the channel dimension to focus on the positional features of the image in the spatial dimension, and likewise performs feature extraction with average pooling and max pooling: average pooling F_avg^s captures the overall features of the feature region, and max pooling F_max^s captures its salient features. A convolution layer then performs feature fusion to obtain the final spatial attention feature map M_s. The calculation process is shown in equation 9:

M_s(F) = σ(f^(7×7)([F_avg^s; F_max^s]))  (9)

where f^(7×7) denotes a convolution operation with a convolution kernel size of 7×7.
As a further scheme of the invention: after feature enhancement through the attention mechanism network, the feature map F″ enters the feature pyramid stage for processing and undergoes feature enhancement through a fast spatial pyramid pooling optimization (SPPFCSPC) module. Compared with the spatial pyramid pooling optimization (SPPCSPC) module, the improved SPPFCSPC module replaces three convolution kernels of different sizes with kernels of the same size and reuses the same convolution module structure in series, making the network structure more efficient.

In the network structure of the remaining part of the feature pyramid (FPN), feature layers of different scales are fused after the downsampling and upsampling operations of several convolution blocks, so that shallow and deep network features are mixed and better feature values are extracted.
As a further scheme of the invention: after the feature pyramid module, three enhanced feature maps are obtained. Each feature layer has a width, height and number of channels; the network treats each feature map as a collection of feature points and uses three prior boxes of different sizes to make judgments at each feature point. The prior boxes contained in the neural network are then adjusted according to the judgment feedback, and targets of different sizes in the original image are identified and detected using non-maximum suppression, improving the overall detection capability of the neural network for multi-scale targets.
As a further scheme of the invention: meanwhile, the complete intersection-over-union loss (CIoU) is used in the calculation of the position loss function, instead of the binary cross-entropy loss used for the regression loss and the classification loss, as it describes position information more accurately. The calculation of the complete IoU loss is shown in equations 15 to 19:

Loss_CIoU = 1 - IoU + R_CIoU(B, B^gt)  (15)
IoU = |B ∩ B^gt| / |B ∪ B^gt|  (16)
R_CIoU(B, B^gt) = ρ²(b, b^gt)/c² + αv  (17)
v = (4/π²)·(arctan(w^gt/h^gt) - arctan(w/h))²  (18)
α = v/((1 - IoU) + v)  (19)

where IoU is the intersection over union; the predicted box B = (x, y, w, h) and the real box B^gt = (x^gt, y^gt, w^gt, h^gt) consist of the x, y coordinates of the center point and the width w and height h. R_CIoU(B, B^gt) is the penalty term between the predicted box B and the real box B^gt; b and b^gt represent the center points of B and B^gt, ρ(·) denotes the Euclidean distance, and c is the diagonal length of the smallest box that can contain both the predicted box and the real box. α is a positive trade-off parameter, and v measures the consistency of the aspect ratio; in the regression calculation this gives factors in the overlapping region of the predicted and real boxes higher priority than non-overlapping parts.
When judging whether a predicted box is a positive or negative sample, the simOTA algorithm is used: the IoU loss between the predicted box and the real box, loss_reg, is combined with the classification loss between the predicted box and the real box, loss_cls, to form the cost matrix Cost, as shown in equation (20), where the balance coefficient is set to 3 to balance the identification difficulty of the two losses.
From the cost matrix it can be seen that the higher the degree of overlap between the real box and the predicted box, and the more accurate the classification, the lower the cost; in this way, the predicted boxes that best fit each real box are found adaptively.

Then, the candidate boxes with the N largest IoUs are selected according to the cost value, and an appropriate number M of positive samples is allocated to each target to be identified, as shown in equation 21.

When the same candidate box is matched to several real boxes, the real box with the smaller cost is selected as its unique matching target.
In the initial stage of training, a large learning rate allows the network to converge quickly, while in the later stage a small learning rate is more conducive to converging to the optimum. Training therefore uses a learning rate exponential decay strategy; the learning rate γ is calculated as shown in equation 22:

γ = ε^τ·γ_0  (22)

where γ_0 represents the initial learning rate, ε is the decay rate, and τ is the iteration number of the training network.
The invention has the beneficial effects that:
the invention provides a water surface target recognition algorithm integrating an attention mechanism, which focuses the algorithm on the distinguishing characteristics of images in the channel dimension by compressing the space dimension, and utilizes a rapid space pyramid pooling to improve the operation speed of a model.
Drawings
FIG. 1 is the network structure diagram of the Yolov7-FC algorithm of the method for identifying floating objects on the water surface in complex scenes of the present invention.
FIG. 2 is the algorithm flow chart of the CBAM attention mechanism of the method of the present invention.
FIG. 3 is a schematic diagram of the network structure of the SPPFCSPC module of the method of the present invention before and after improvement.
FIG. 4 shows the distribution of targets of different sizes in the dataset of the method of the present invention.
FIG. 5 is a schematic diagram of the integration positions of the attention mechanism of the method of the present invention.
FIG. 6 compares the test results of the Yolov7 algorithm under different optimizers of the method of the present invention.
FIG. 7 compares the test results of the Yolov7-FC algorithm under different optimizers of the method of the present invention.
FIG. 8 compares the detection effects of different algorithms in the test video of the method of the present invention.
Detailed Description
The present invention will be further described below.
A method for identifying floating objects on the water surface in complex scenes comprises the following specific steps:

S1: network model structural design. The Yolov7 algorithm is a target detection algorithm proposed in 2022, and test results on the COCO dataset show that its speed and accuracy are superior to those of target detectors such as YOLOR, YOLOX, Scaled-Yolov4 and Yolov5. Analyzing the water surface floating object detection problem further, a target detection algorithm integrating an attention mechanism, Yolov7-FC, is proposed; its network structure diagram is shown in FIG. 1.

The upper half of FIG. 1 is the overall framework diagram, comprising the Backbone, Attention, FPN and Head layers, and represents the framework of the network at different stages of computation from shallow to deep. The lower half of FIG. 1 shows the composition of the different structural blocks. CBS denotes a convolution block composed of a convolution layer (Conv), a batch normalization layer (BN) and an activation function (SiLU); different CBS colors indicate different sizes of the convolution kernel k and stride s of the convolution layer. CBM also denotes a convolution block; unlike CBS, the activation function in CBM is the sigmoid function. The two activation functions are calculated as shown in equations 1 and 2:

silu(x) = x · σ(x) = x/(1 + e^(-x))  (1)
σ(x) = 1/(1 + e^(-x))  (2)

UPS denotes an upsampling layer, computed by nearest-neighbor interpolation (nearest); ELAN denotes a multi-branch stacking module, in which concat denotes the concatenation operation, O=I denotes that the number of output channels equals the number of input channels, and O=I/2 denotes that the number of output channels equals half the number of input channels; MP denotes a downsampling transition module, MaxPool denotes max pooling, and the CBS blocks within it perform output computation with O=I/2.
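For illustration, a minimal PyTorch sketch of the CBS and CBM convolution blocks as described above is given below; the class names and default kernel parameters are assumptions for this example, not taken from the patent.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Convolution block: Conv -> BatchNorm -> SiLU, as described for the CBS unit."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3, s: int = 1):
        super().__init__()
        # padding k//2 keeps the spatial size unchanged when s == 1
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=s,
                              padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()  # silu(x) = x * sigmoid(x), equation 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.conv(x)))

class CBM(CBS):
    """Same block but with a sigmoid activation, as described for the CBM unit."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3, s: int = 1):
        super().__init__(in_ch, out_ch, k, s)
        self.act = nn.Sigmoid()  # sigmoid(x) = 1 / (1 + e^(-x)), equation 2
```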
The network model structural design includes Backbone, attention, FPN, head.
In the Backbone network, Yolov7-FC first uses ELAN modules for feature extraction and then uses transition modules for downsampling, thereby obtaining three effective feature layers for the next stage of network construction.
In the ELAN module, the network divides the input feature into 5 branches for computation, which pass through 1, 3, 5 and 7 convolution blocks in sequence; after the concat computation, the result is output through 1 convolution block. Through this dense residual structure, features of 5 different depths can be fused, and by using skip-connected residual blocks, the influence of the gradient vanishing problem caused by increasing network depth is reduced.

In the MP module, the network divides the input feature into 2 branches for computation: the 1st branch is max pooling plus a convolution block, and the 2nd branch is two convolution blocks with different convolution kernels and strides; the subsequent output result is obtained by applying the concat computation to the two branches (a code sketch follows below).
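A minimal sketch of the ELAN and MP modules as described above. The exact tap pattern of the five branches and the channel widths are assumptions for illustration; the helper cbs() stands in for the CBS block sketched earlier.

```python
import torch
import torch.nn as nn

def cbs(in_ch: int, out_ch: int, k: int = 3, s: int = 1) -> nn.Sequential:
    # Conv -> BN -> SiLU block (the CBS unit described in the text)
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, s, k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.SiLU(),
    )

class ELAN(nn.Module):
    """Multi-branch stacking module: taps of depth 1, 3, 5 and 7 convolution
    blocks are concatenated and fused by one final convolution block."""
    def __init__(self, ch: int):
        super().__init__()
        half = ch // 2                      # O = I/2 channel split per branch
        self.short = cbs(ch, half, k=1)     # 1-block branch
        self.first = cbs(ch, half, k=1)     # first block of the deep path
        self.pair1 = nn.Sequential(cbs(half, half), cbs(half, half))  # depth 3
        self.pair2 = nn.Sequential(cbs(half, half), cbs(half, half))  # depth 5
        self.pair3 = nn.Sequential(cbs(half, half), cbs(half, half))  # depth 7
        self.fuse = cbs(half * 5, ch, k=1)  # concat of 5 taps -> 1 block

    def forward(self, x):
        t1 = self.short(x)
        t2 = self.first(x)
        t3 = self.pair1(t2)
        t4 = self.pair2(t3)
        t5 = self.pair3(t4)
        return self.fuse(torch.cat([t1, t2, t3, t4, t5], dim=1))

class MP(nn.Module):
    """Downsampling transition: branch 1 is max pooling plus a convolution
    block, branch 2 is two convolution blocks with different kernels and
    strides; both emit O = I/2 channels and are concatenated."""
    def __init__(self, ch: int):
        super().__init__()
        half = ch // 2
        self.b1 = nn.Sequential(nn.MaxPool2d(2, 2), cbs(ch, half, k=1))
        self.b2 = nn.Sequential(cbs(ch, half, k=1), cbs(half, half, k=3, s=2))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x)], dim=1)
```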
Attention: after feature extraction of the input picture through the backbone network, the network uses an attention mechanism to increase the attention paid to effective features; attention information is generated in the two dimensions of channel and space by the convolutional block attention module (CBAM) and combined to generate a new feature map. The network structure of CBAM is shown in FIG. 2, where (a) shows how the channel attention feature map and the spatial attention feature map are generated and (b) shows the operation process on the input feature map F.

As can be seen from FIG. 2(a), the computation on the network input feature map F is divided into an upper and a lower process, generating the channel attention feature map M_c and the spatial attention feature map M_s respectively, as shown in equations 2 to 4:

F ∈ R^(C×H×W)  (2)
M_c ∈ R^(C×1×1)  (3)
M_s ∈ R^(1×H×W)  (4)

where C, H and W represent the number of channels, height and width of the feature map, respectively.
FIG. 2(b) shows the operation on the feature map F: the channel attention algorithm is first used to generate the feature map F′, and the spatial attention algorithm is then used to generate the feature map F″, as shown in equations 6 and 7:

F′ = M_c(F) ⊗ F  (6)
F″ = M_s(F′) ⊗ F′  (7)

where ⊗ represents element-wise multiplication of corresponding elements.
In the channel attention module, the network compresses the spatial dimension to focus on the discriminative features of the image in the channel dimension. Feature extraction uses both average pooling and max pooling: average pooling F_avg^c captures the overall features of the feature region, and max pooling F_max^c captures its salient features. A weight-shared multi-layer perceptron (MLP) network then performs feature fusion to obtain the final channel attention feature map M_c. The calculation process is shown in equation 8:

M_c(F) = σ(MLP(F_avg^c) + MLP(F_max^c)) = σ(W_1(W_0(F_avg^c)) + W_1(W_0(F_max^c)))  (8)

where σ represents the sigmoid activation function, W_0 and W_1 are the weights of the MLP, and r is the dimension-reduction coefficient used in the MLP.
In the spatial attention module, the network compresses the channel dimension to focus on the positional features of the image in the spatial dimension, and likewise performs feature extraction with average pooling and max pooling: average pooling F_avg^s captures the overall features of the feature region, and max pooling F_max^s captures its salient features. A convolution layer then performs feature fusion to obtain the final spatial attention feature map M_s. The calculation process is shown in equation 9:

M_s(F) = σ(f^(7×7)([F_avg^s; F_max^s]))  (9)

where f^(7×7) denotes a convolution operation with a convolution kernel size of 7×7.
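The CBAM computation of equations 2 to 9 can be sketched in PyTorch as follows; the dimension-reduction ratio r = 16 is an assumed value for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as Fn

class ChannelAttention(nn.Module):
    """Equation 8: sigmoid(MLP(avg-pooled F) + MLP(max-pooled F))."""
    def __init__(self, ch: int, r: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(                 # weight-shared MLP: W0 then W1
            nn.Conv2d(ch, ch // r, 1, bias=False),
            nn.ReLU(),
            nn.Conv2d(ch // r, ch, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(Fn.adaptive_avg_pool2d(x, 1))  # F_avg^c
        mx = self.mlp(Fn.adaptive_max_pool2d(x, 1))   # F_max^c
        return torch.sigmoid(avg + mx)                # M_c, shape C x 1 x 1

class SpatialAttention(nn.Module):
    """Equation 9: sigmoid(7x7 conv over [avg; max] channel-compressed maps)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, 7, padding=3, bias=False)  # f^(7x7)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)   # F_avg^s
        mx, _ = x.max(dim=1, keepdim=True)  # F_max^s
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # M_s

class CBAM(nn.Module):
    """Equations 6 and 7: F' = M_c(F) * F, then F'' = M_s(F') * F'."""
    def __init__(self, ch: int, r: int = 16):
        super().__init__()
        self.ca, self.sa = ChannelAttention(ch, r), SpatialAttention()

    def forward(self, f):
        f1 = self.ca(f) * f
        return self.sa(f1) * f1
```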
FPN: after feature enhancement through the attention mechanism network, the feature map F″ enters the feature pyramid stage for processing and undergoes feature enhancement through a fast spatial pyramid pooling optimization (SPPFCSPC) module. Compared with the spatial pyramid pooling optimization (SPPCSPC) module, the improved SPPFCSPC module replaces three convolution kernels of different sizes with kernels of the same size and reuses the same convolution module structure in series, making the network structure more efficient (a sketch of this serial-reuse idea follows below). The network structure of the SPPFCSPC module before and after the improvement is shown in FIG. 3, where (a) is the module before the improvement and (b) is the module after the improvement.
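The serial-reuse idea can be sketched as below. This simplified block omits the CSP-style surrounding convolutions of the full SPPFCSPC module, and the kernel size k = 5 is an assumption for illustration.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Apply one k x k max pool three times in series: the stacked pools cover
    the receptive fields of three parallel pools of different sizes while
    sharing intermediate computation."""
    def __init__(self, ch: int, k: int = 5):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
        self.fuse = nn.Conv2d(ch * 4, ch, kernel_size=1)

    def forward(self, x):
        p1 = self.pool(x)    # one 5x5 pool
        p2 = self.pool(p1)   # equivalent receptive field of a 9x9 pool
        p3 = self.pool(p2)   # equivalent receptive field of a 13x13 pool
        return self.fuse(torch.cat([x, p1, p2, p3], dim=1))
```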
The network structure of the remaining part of the feature pyramid (FPN) is shown in FIG. 1: feature layers of different scales are fused after the downsampling and upsampling operations of several convolution blocks, so that shallow and deep network features are mixed and better feature values are extracted.
Head: after the feature pyramid module, three enhanced feature maps are obtained. Each feature layer has a width, height and number of channels; the network treats each feature map as a collection of feature points and uses three prior boxes of different sizes to make judgments at each feature point. The prior boxes contained in the neural network are then adjusted according to the judgment feedback, and targets of different sizes in the original image are identified and detected using non-maximum suppression, improving the overall detection capability of the neural network for multi-scale targets.
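As an illustrative sketch of the final selection step, the confidence filtering and non-maximum suppression can be expressed with torchvision; the thresholds are assumed values, and the box-decoding details of the patented head are omitted.

```python
import torch
from torchvision.ops import nms

def select_detections(boxes: torch.Tensor, scores: torch.Tensor,
                      score_thr: float = 0.25, iou_thr: float = 0.45):
    """boxes: (N, 4) in (x1, y1, x2, y2) corner format; scores: (N,).
    Filter low-confidence boxes, then apply non-maximum suppression."""
    keep = scores > score_thr
    boxes, scores = boxes[keep], scores[keep]
    idx = nms(boxes, scores, iou_thr)  # indices of boxes kept by NMS
    return boxes[idx], scores[idx]
```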
S2: the network optimization method comprises a loss function and image preprocessing.

The loss function is specifically: the loss function of the Yolov7-FC algorithm consists of three parts: the target regression loss (Loss_reg), the classification loss (Loss_cls) and the location loss (Loss_loc), as shown in equations 10 to 14:

Loss = λ_reg·Loss_reg + λ_cls·Loss_cls + λ_loc·Loss_loc  (10)

where λ_reg, λ_cls and λ_loc respectively represent the weights of the three categories of loss in the loss function. The Yolov7-FC network divides each input picture into K×K cells and generates M anchor boxes (anchors) in each cell; after the forward computation of the network, each anchor box yields an adjusted bounding box, for a total of K×K×M bounding boxes. I_ij^obj and I_ij^noobj indicate whether the center coordinates of the target fall in the j-th anchor box of the i-th cell: if so, the former equals 1 and the latter equals 0, and otherwise the reverse. C_i is the confidence of the real box in the i-th cell, and Ĉ_i is the confidence of the predicted box in the i-th cell. p_i(k) is the conditional probability that the real box in the i-th cell contains an object of the k-th class, and p̂_i(k) is the conditional probability that the predicted box in the i-th cell contains an object of the k-th class.
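Equation 10 is a plain weighted sum, as in the following sketch; the default weights are placeholders, since the patent does not state the values of λ_reg, λ_cls and λ_loc.

```python
import torch

def yolov7_fc_loss(loss_reg: torch.Tensor, loss_cls: torch.Tensor,
                   loss_loc: torch.Tensor,
                   lam_reg: float = 1.0, lam_cls: float = 1.0,
                   lam_loc: float = 1.0) -> torch.Tensor:
    """Equation 10: Loss = lam_reg*Loss_reg + lam_cls*Loss_cls + lam_loc*Loss_loc.
    The weight values here are illustrative assumptions only."""
    return lam_reg * loss_reg + lam_cls * loss_cls + lam_loc * loss_loc
```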
Meanwhile, the complete intersection-over-union loss (CIoU) is used in the calculation of the position loss function, instead of the binary cross-entropy loss used for the regression loss and the classification loss, as it describes position information more accurately. The calculation of the complete IoU loss is shown in equations 15 to 19:

Loss_CIoU = 1 - IoU + R_CIoU(B, B^gt)  (15)
IoU = |B ∩ B^gt| / |B ∪ B^gt|  (16)
R_CIoU(B, B^gt) = ρ²(b, b^gt)/c² + αv  (17)
v = (4/π²)·(arctan(w^gt/h^gt) - arctan(w/h))²  (18)
α = v/((1 - IoU) + v)  (19)

where IoU is the intersection over union; the predicted box B = (x, y, w, h) and the real box B^gt = (x^gt, y^gt, w^gt, h^gt) consist of the x, y coordinates of the center point and the width w and height h. R_CIoU(B, B^gt) is the penalty term between the predicted box B and the real box B^gt; b and b^gt represent the center points of B and B^gt, ρ(·) denotes the Euclidean distance, and c is the diagonal length of the smallest box that can contain both the predicted box and the real box. α is a positive trade-off parameter, and v measures the consistency of the aspect ratio; in the regression calculation this gives factors in the overlapping region of the predicted and real boxes higher priority than non-overlapping parts.
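A self-contained sketch of the complete IoU loss of equations 15 to 19 for boxes in (x, y, w, h) center format:

```python
import math
import torch

def ciou_loss(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1e-7):
    """Complete IoU loss: 1 - IoU + rho^2/c^2 + alpha*v (equations 15-19)."""
    # Convert center format to corner coordinates
    px1, py1 = pred[..., 0] - pred[..., 2] / 2, pred[..., 1] - pred[..., 3] / 2
    px2, py2 = pred[..., 0] + pred[..., 2] / 2, pred[..., 1] + pred[..., 3] / 2
    gx1, gy1 = gt[..., 0] - gt[..., 2] / 2, gt[..., 1] - gt[..., 3] / 2
    gx2, gy2 = gt[..., 0] + gt[..., 2] / 2, gt[..., 1] + gt[..., 3] / 2

    # Intersection over union (equation 16)
    iw = (torch.min(px2, gx2) - torch.max(px1, gx1)).clamp(min=0)
    ih = (torch.min(py2, gy2) - torch.max(py1, gy1)).clamp(min=0)
    inter = iw * ih
    union = pred[..., 2] * pred[..., 3] + gt[..., 2] * gt[..., 3] - inter + eps
    iou = inter / union

    # Center distance rho^2 over squared diagonal c^2 of the enclosing box
    rho2 = (pred[..., 0] - gt[..., 0]) ** 2 + (pred[..., 1] - gt[..., 1]) ** 2
    cw = torch.max(px2, gx2) - torch.min(px1, gx1)
    ch = torch.max(py2, gy2) - torch.min(py1, gy1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency v (equation 18) and trade-off alpha (equation 19)
    v = (4 / math.pi ** 2) * (torch.atan(gt[..., 2] / (gt[..., 3] + eps))
                              - torch.atan(pred[..., 2] / (pred[..., 3] + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v  # equation 15 with 17 expanded
```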
When judging whether a predicted box is a positive or negative sample, the simOTA algorithm is used: the IoU loss between the predicted box and the real box, loss_reg, is combined with the classification loss between the predicted box and the real box, loss_cls, to form the cost matrix Cost, as shown in equation (20), where the balance coefficient is set to 3 to balance the identification difficulty of the two losses.
From the cost matrix it can be seen that the higher the degree of overlap between the real box and the predicted box, and the more accurate the classification, the lower the cost; in this way, the predicted boxes that best fit each real box are found adaptively.

Then, the candidate boxes with the N largest IoUs are selected according to the cost value, and an appropriate number M of positive samples is allocated to each target to be identified, as shown in equation 21.

When the same candidate box is matched to several real boxes, the real box with the smaller cost is selected as its unique matching target (see the sketch after this paragraph).
In the initial stage of training, a large learning rate allows the network to converge quickly, while in the later stage a small learning rate is more conducive to converging to the optimum. Training therefore uses a learning rate exponential decay strategy; the learning rate γ is calculated as shown in equation 22:

γ = ε^τ·γ_0  (22)

where γ_0 represents the initial learning rate, ε is the decay rate, and τ is the iteration number of the training network.
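Equation 22 in code form; the example values are illustrative assumptions only (in PyTorch the same schedule is also available as torch.optim.lr_scheduler.ExponentialLR).

```python
def exponential_lr(gamma0: float, epsilon: float, tau: int) -> float:
    """Equation 22: gamma = epsilon**tau * gamma0, with gamma0 the initial
    learning rate, epsilon the decay rate (< 1) and tau the iteration count."""
    return (epsilon ** tau) * gamma0

# Illustrative values only:
# exponential_lr(1e-2, 0.95, 0)   -> 0.01
# exponential_lr(1e-2, 0.95, 100) -> about 5.9e-5
```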
The image preprocessing is specifically: the mosaic image enhancement method (Mosaic) was developed from the mixed image enhancement method (Mixup), producing a new data enhancement algorithm; unlike the two-picture overlay fusion of the cut-and-mix approach, it cuts and stitches four pictures to form a new picture. This method enriches the background of the target object and prevents the generalization ability of the network from degrading because the backgrounds in the training set are similar. However, since the distribution of the generated training pictures differs considerably from that of natural pictures, the method must be disabled after z iterations so that the network deepens its understanding of natural pictures.

Image data enhancement is performed by combining Mosaic and Mixup; the judgment formula for whether the data enhancement method is adopted in each training epoch is shown in equation 23, where Bool() represents the Boolean operation, & represents the AND operation, θ_1 and θ_2 are the judgment parameters, and z is 0.7.
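Since equation 23 is not reproduced in the source text, the sketch below encodes one plausible reading of the gate described above; the draw-against-threshold logic and the interpretation of z as a fraction of the total epochs are assumptions for illustration.

```python
import random

def use_strong_augmentation(epoch: int, total_epochs: int,
                            theta1: float = 0.5, theta2: float = 0.5,
                            z: float = 0.7) -> bool:
    """Decide per epoch whether to apply Mosaic + Mixup. Two random draws are
    tested against the thresholds theta1 and theta2 (AND-combined), and the
    strong augmentation is disabled for the last (1 - z) share of training so
    the network sees natural pictures."""
    if epoch >= int(z * total_epochs):  # z = 0.7: stop after 70% of epochs
        return False
    return (random.random() < theta1) and (random.random() < theta2)
```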
The method described above was tested as follows:
1. Dataset planning: the dataset used for training and testing in the present invention is the FloW dataset, which contains 2000 images in total at a resolution of 1280×720 with 5271 floating objects, divided into a training set and a test set at a ratio of 3:2, i.e., 1200 training images and 800 test images. Small targets of fewer than 32×32 pixels account for 2996 instances in the dataset, exceeding 50% of the total number of targets. There are 1974 medium targets of between 32×32 and 96×96 pixels; medium and small targets together exceed 90% of the total, and even a medium target occupies no more than 1% of the pixels of the whole picture, which poses a certain challenge to the recognition capability of a target detection algorithm. The size distribution of targets in the data is shown in FIG. 4.
2. Training and testing results: the algorithm of the invention is implemented on the open-source neural network framework PyTorch (Python 3.8.5). The computing workstation configuration contained 1 GPU (GeForce RTX 3090 Ti), a CPU (AMD Ryzen 9 3950X, 16 cores / 3.5 GHz / 72 MB cache) and 128 GB of RAM. The small mobile test platform was built on an NVIDIA Jetson AGX Orin (275 TOPS, 60 W) development board.
First, attention enhancement algorithms were integrated into the network structure of the Yolov7 algorithm to increase attention to floating targets; the test results of the Yolov7 algorithm before and after the improvement are shown in Table 1. In Table 1, the CBAM algorithm and the ECA algorithm are respectively incorporated into the Yolov7 structure, and the suitability of the shallow position A and the deep position B of the neural network for the attention mechanism is compared; these positions are shown in FIG. 5. In addition, the MAP improvement rate and MAR improvement rate in Table 1 are relative to the test results of the Yolov7 algorithm without an attention mechanism on the SGD optimizer.
Table 1: comparison of Yolov7 algorithm test results before and after the attention enhancement method
From Table 1, the network structure with the best detection effect integrates a CBAM attention mechanism at position A and is trained with the ADAM optimizer, followed by the structure that integrates an ECA attention mechanism at position B and is trained with the SGD optimizer; a comparison of the test results is shown in FIG. 6. As FIG. 6 shows, among the different combinations, integrating the attention mechanism at position A gives the better performance improvement, and networks trained with the ADAM optimizer give better average test results, reflected in both the precision and recall indexes. For the original algorithm without an attention mechanism, however, the ADAM optimizer is far less effective than the SGD optimizer.
To further improve the performance of the algorithm, the SPPFCSPC structure was used to optimize the FPN layer while the attention mechanism was integrated; the test results of the Yolov7-F algorithm before and after the improvement are shown in Table 2. As before, the CBAM algorithm and the ECA algorithm are respectively integrated into the Yolov7-F structure, and the suitability of the shallow position A and the deep position B of the neural network for the structure is compared. Unlike Table 1, the MAP and MAR improvement rates in Table 2 are relative to the results of the Yolov7 algorithm of Table 1 optimized on the SGD optimizer, in order to verify the advantages of the SPPFCSPC structure.
Table 2: accuracy of the Yolov7-F algorithm before and after the attention enhancement method
It can be seen from Table 2 that after the SPPFCSPC structure is used, the previously optimal combination gains a further improvement in precision and recall; a comparison of the test results of all combinations is shown in FIG. 7. As FIG. 7 shows, the networks trained with the ADAM optimizer all improve, while the networks trained with the SGD optimizer improve in recall but decline in precision, so the SPPFCSPC structure combines better with the ADAM optimizer, which can also be seen in the network without the fused attention mechanism. In addition, the test results change steadily regardless of whether the attention mechanism is integrated at position A or position B, so the SPPFCSPC structure has no decisive influence on where the attention mechanism is integrated.
Table 3 compares the test results of the Yolov7-FC algorithm with those of other algorithms. It can be seen that the Yolov7-FC algorithm has certain advantages in precision and recall; compared with the Cascade R-CNN algorithm [17], which performs excellently on this dataset, it still improves precision by 0.9%. Table 4 compares the running speeds of the algorithms on the small mobile device. The improved algorithm has increased structural complexity compared with the original algorithm, and although its running speed is 2.8% lower, it meets the requirements of offline real-time detection on shipborne small mobile devices. The Cascade R-CNN algorithm [17], despite its good accuracy, reaches only 23.8% of the detection speed of the Yolov7-FC algorithm and basically cannot meet the requirement of real-time operation. The DSSD and Yolov3 algorithms, which have high detection speeds, have low recognition precision and recall and cannot satisfy the requirements of floating object detection well.

Table 3: comparison of test results of the Yolov7-FC algorithm and other algorithms

Table 4: comparison of the running speeds of the algorithms on the small mobile device
The actual test results of the different algorithms are shown in FIG. 8, covering scenes under different weather conditions. FIG. 8(a) shows the detection effect when an interfering object is present and the targets are relatively far away; although all algorithms miss some detections, the Yolov7, Yolov3 and YoloX algorithms miss more, and Yolov7 wrongly recognizes the radar in the lower left corner of the frame as a floating object, while the improved algorithm, by integrating the attention mechanism, completes the target recognition task better. In FIG. 8(b) there is not only an interfering object but also a green floating object whose color is similar to the water surface, which the Yolov3 and Yolov5 algorithms cannot identify accurately. FIG. 8(c) is a very difficult light spot situation, where the detection effect of all algorithms is poor; the Yolov7-FC and YoloX algorithms recognize relatively many targets, but YoloX wrongly recognizes the light spot in the lower right corner. FIG. 8(d) is a dense multi-target case where all algorithms perform well, Yolov7-FC and YoloX best; the afternoon weather conditions are good, visibility is high and the floating objects are clear. FIG. 8(e) is a dim background, where the Yolov3 algorithm wrongly recognizes the distant white vessel as a floating object and Yolov5 also wrongly recognizes the radar in the lower left corner as a floating object. FIG. 8(f) is a distant weak target: for green floating objects far away and close in color to the background, most algorithms fail to identify them; Yolov7-FC and Yolov5 perform better, but Yolov5 still makes a wrong identification.
Comparing the recognition effects in these various settings shows that light spots have the most serious influence on the recognition algorithm, while good detection results can be obtained with interference and with a background that is not severely dim. Although distant small targets have a certain influence on the algorithm, they can still be identified accurately when their contrast with the background is high; once the color of the target is similar to the background, the difficulty of identification increases greatly and the identification precision and recall decrease.
In the test experiments, the improved algorithm improves the recognition precision by 1.9% and the recall by 1.6%; it shows an excellent recognition effect under different backgrounds such as light spot reflection, insufficient light and interfering objects, and meets the requirement of real-time detection on a small mobile device.
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (7)

1. A method for identifying floating objects on the water surface in complex scenes, characterized in that the method comprises the following specific steps:

S1: network model structural design: the Yolov7 algorithm is a target detection algorithm proposed in 2022, and test results on the COCO dataset show that its speed and accuracy are superior to those of target detectors such as YOLOR, YOLOX, Scaled-Yolov4 and Yolov5; analyzing the water surface floating object detection problem further, a target detection algorithm integrating an attention mechanism, Yolov7-FC, is proposed, whose network structure comprises Backbone, Attention, FPN and Head layers, representing the framework of the network at different stages of computation from shallow to deep; CBS denotes a convolution block consisting of a convolution layer Conv, a batch normalization layer BN and an activation function silu, and different CBS colors denote different sizes of the convolution kernel k and stride s of the convolution layer; CBM also denotes a convolution block, and unlike CBS, the activation function in CBM is the sigmoid function; the two activation functions are calculated as shown in equations 1 and 2:

silu(x) = x · σ(x) = x/(1 + e^(-x))  (1)
σ(x) = 1/(1 + e^(-x))  (2)

UPS denotes an upsampling layer, computed by nearest-neighbor interpolation nearest; ELAN denotes a multi-branch stacking module, in which concat denotes the concatenation operation, O=I denotes that the number of output channels equals the number of input channels, and O=I/2 denotes that the number of output channels equals half the number of input channels; MP denotes a downsampling transition module, MaxPool denotes max pooling, and the CBS blocks within it perform output computation with O=I/2;

S2: the network optimization method comprises a loss function and image preprocessing;

the loss function is specifically: the loss function of the Yolov7-FC algorithm consists of three parts: the target regression loss Loss_reg, the classification loss Loss_cls and the location loss Loss_loc, as shown in equations 10 to 14:

Loss = λ_reg·Loss_reg + λ_cls·Loss_cls + λ_loc·Loss_loc  (10)

where λ_reg, λ_cls and λ_loc respectively represent the weights of the three categories of loss in the loss function; the Yolov7-FC network divides each input picture into K×K cells and generates M anchor boxes (anchors) in each cell, and after the forward computation of the network each anchor box yields an adjusted bounding box, for a total of K×K×M bounding boxes; I_ij^obj and I_ij^noobj indicate whether the center coordinates of the target fall in the j-th anchor box of the i-th cell, the former being equal to 1 and the latter to 0 if so, and the reverse otherwise; C_i is the confidence of the real box in the i-th cell, and Ĉ_i is the confidence of the predicted box in the i-th cell; p_i(k) is the conditional probability that the real box in the i-th cell contains an object of the k-th class, and p̂_i(k) is the conditional probability that the predicted box in the i-th cell contains an object of the k-th class;

the image preprocessing is specifically: image data enhancement is performed by combining Mosaic and Mixup, and the judgment formula for whether the data enhancement method is adopted in each training epoch is shown in equation 23, where Bool() represents the Boolean operation, & represents the AND operation, θ_1 and θ_2 are the judgment parameters, and z is 0.7.
2. The method for identifying floating objects on the water surface in complex scenes according to claim 1, characterized in that: the network model structural design includes Backbone, Attention, FPN and Head.
3. The method for identifying floating objects on the water surface in complex scenes according to claim 2, characterized in that: in the Backbone network, Yolov7-FC first uses ELAN modules for feature extraction and then uses transition modules for downsampling, thereby obtaining three effective feature layers for the next stage of network construction;

in the ELAN module, the network divides the input feature into 5 branches for computation, which pass through 1, 3, 5 and 7 convolution blocks in sequence, and after the concat computation the result is output through 1 convolution block; through this dense residual structure, features of 5 different depths can be fused, and by using skip-connected residual blocks, the influence of the gradient vanishing problem caused by increasing network depth is reduced;

in the MP module, the network divides the input feature into 2 branches for computation: the 1st branch is max pooling plus a convolution block, and the 2nd branch is two convolution blocks with different convolution kernels and strides; the subsequent output result is obtained by applying the concat computation to the two branches.
4. The method for identifying floating objects on the water surface in complex scenes according to claim 3, characterized in that: after feature extraction of the input picture through the backbone network, the network uses an attention mechanism to increase the attention paid to effective features; attention information is generated in the two dimensions of channel and space by the convolutional block attention module CBAM and combined to generate a new feature map;

for the network input feature map F, the computation is divided into an upper and a lower process, generating the channel attention feature map M_c and the spatial attention feature map M_s respectively, as shown in equations 2 to 4:

F ∈ R^(C×H×W)  (2)
M_c ∈ R^(C×1×1)  (3)
M_s ∈ R^(1×H×W)  (4)

where C, H and W represent the number of channels, height and width of the feature map, respectively;

in the computation on the feature map F, the channel attention algorithm is first used to generate the feature map F′, and the spatial attention algorithm is then used to generate the feature map F″, as shown in equations 6 and 7:

F′ = M_c(F) ⊗ F  (6)
F″ = M_s(F′) ⊗ F′  (7)

where ⊗ represents element-wise multiplication of corresponding elements;

in the channel attention module, the network compresses the spatial dimension to focus on the discriminative features of the image in the channel dimension, and performs feature extraction with average pooling and max pooling: average pooling F_avg^c captures the overall features of the feature region, and max pooling F_max^c captures its salient features; a weight-shared multi-layer perceptron (MLP) network then performs feature fusion to obtain the final channel attention feature map M_c, the calculation process being shown in equation 8:

M_c(F) = σ(MLP(F_avg^c) + MLP(F_max^c)) = σ(W_1(W_0(F_avg^c)) + W_1(W_0(F_max^c)))  (8)

where σ represents the sigmoid activation function, W_0 and W_1 are the weights of the MLP, and r is the dimension-reduction coefficient used in the MLP;

in the spatial attention module, the network compresses the channel dimension to focus on the positional features of the image in the spatial dimension, and likewise performs feature extraction with average pooling and max pooling: average pooling F_avg^s captures the overall features of the feature region, and max pooling F_max^s captures its salient features; a convolution layer then performs feature fusion to obtain the final spatial attention feature map M_s, the calculation process being shown in equation 9:

M_s(F) = σ(f^(7×7)([F_avg^s; F_max^s]))  (9)

where f^(7×7) denotes a convolution operation with a convolution kernel size of 7×7.
5. The method for identifying the water surface floaters in the complex scene according to claim 4, wherein the method comprises the following steps: after the FPN performs feature enhancement through the attention mechanism network, a feature map F' enters a feature pyramid stage to be processed, and the feature enhancement is performed through a rapid spatial pyramid pooling optimization (SPPFCSPC) module; compared with a space pyramid pooling optimization SPPCSPC module, the improved SPPFCSPC module replaces three convolution kernels with different dimensions by using the convolution kernels with the same dimension for the convolution modules, and the same convolution module structure is reused in a series connection mode, so that the efficiency of the network structure is higher;
and the remainder of the feature pyramid (FPN) applies downsampling and upsampling through a series of convolution blocks and then fuses feature layers of different scales, mixing shallow and deep network features so that more informative feature values are extracted.
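The efficiency gain of the SPPFCSPC module described in this claim comes from reusing one kernel size in series. Below is a minimal sketch of that serial-pooling idea only; the full SPPFCSPC also wraps this in CSP-style convolution branches, which are omitted here, and the kernel size $k = 5$ is an assumption.

```python
import torch
import torch.nn as nn

class SerialSPP(nn.Module):
    """Serial pooling trick assumed for SPPFCSPC: three chained 5x5 max-pools
    reproduce the receptive fields of parallel 5x5 / 9x9 / 13x13 pools."""
    def __init__(self, k=5):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        p1 = self.pool(x)    # ~5x5 receptive field
        p2 = self.pool(p1)   # ~9x9 receptive field
        p3 = self.pool(p2)   # ~13x13 receptive field
        return torch.cat([x, p1, p2, p3], dim=1)
```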
6. The method for identifying the water surface floaters in the complex scene according to claim 5, wherein: three enhanced feature maps are obtained after the feature pyramid module, each feature layer having a width, a height and a number of channels; the network treats each feature map as a set of feature points and evaluates each point against three prior boxes of different sizes; the prior boxes held by the neural network are then adjusted according to this judgment feedback, and targets of different sizes in the original image are identified and detected using non-maximum suppression, improving the overall detection capability of the neural network for multi-scale targets.
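As an illustration of the non-maximum suppression step mentioned in this claim, the following sketch filters decoded prediction boxes with `torchvision.ops.nms`; the thresholds and helper name are illustrative values and assumptions, not taken from the patent.

```python
import torch
from torchvision.ops import nms

def filter_boxes(boxes, scores, iou_thresh=0.45, score_thresh=0.25):
    """Keep high-confidence boxes and suppress overlapping duplicates.

    boxes:  (N, 4) tensor in (x1, y1, x2, y2) format, decoded from the priors.
    scores: (N,) per-box confidence.
    """
    keep = scores > score_thresh              # drop low-confidence detections
    boxes, scores = boxes[keep], scores[keep]
    kept = nms(boxes, scores, iou_thresh)     # greedy non-maximum suppression
    return boxes[kept], scores[kept]
```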
7. The method for identifying the water surface floaters in the complex scene according to claim 1, wherein: the complete intersection-over-union (CIoU) loss is adopted for the position loss, in contrast to the binary cross-entropy loss used for the regression loss and the classification loss, so that position information is described more accurately; the complete intersection-over-union loss is computed as in formulas 15-19:

$L_{CIoU} = 1 - \mathrm{IoU} + R_{CIoU}(B, B^{gt})$ (15)

$\mathrm{IoU} = \dfrac{|B \cap B^{gt}|}{|B \cup B^{gt}|}$ (16)

$R_{CIoU}(B, B^{gt}) = \dfrac{\rho^2(b, b^{gt})}{c^2} + \alpha v$ (17)

$v = \dfrac{4}{\pi^2}\left(\arctan\dfrac{w^{gt}}{h^{gt}} - \arctan\dfrac{w}{h}\right)^2$ (18)

$\alpha = \dfrac{v}{(1 - \mathrm{IoU}) + v}$ (19)
where $\mathrm{IoU}$ is the intersection-over-union; the prediction box is $B = (x, y, w, h)$ and the ground-truth box is $B^{gt} = (x^{gt}, y^{gt}, w^{gt}, h^{gt})$, each consisting of the $(x, y)$ coordinates of the center point together with the width $w$ and height $h$; $R_{CIoU}(B, B^{gt})$ is the penalty term between the prediction box $B$ and the ground-truth box $B^{gt}$; $b$ and $b^{gt}$ denote the center points of $B$ and $B^{gt}$, $\rho(\cdot)$ denotes the Euclidean distance, and $c$ is the diagonal length of the smallest box enclosing both the prediction box and the ground-truth box; $\alpha$ is a positive trade-off parameter and $v$ measures aspect-ratio consistency, which in the regression calculation gives higher priority to the overlapping region of the prediction and ground-truth boxes than to the non-overlapping parts;
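A minimal PyTorch sketch of the CIoU loss following formulas 15-19 is given below; the (cx, cy, w, h) box layout and the `eps` stabilizer are assumptions.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss for boxes given as (cx, cy, w, h) tensors of shape (N, 4)."""
    # Corner coordinates of both boxes.
    px1, py1 = pred[:, 0] - pred[:, 2] / 2, pred[:, 1] - pred[:, 3] / 2
    px2, py2 = pred[:, 0] + pred[:, 2] / 2, pred[:, 1] + pred[:, 3] / 2
    tx1, ty1 = target[:, 0] - target[:, 2] / 2, target[:, 1] - target[:, 3] / 2
    tx2, ty2 = target[:, 0] + target[:, 2] / 2, target[:, 1] + target[:, 3] / 2

    # IoU (formula 16).
    inter = (torch.min(px2, tx2) - torch.max(px1, tx1)).clamp(0) * \
            (torch.min(py2, ty2) - torch.max(py1, ty1)).clamp(0)
    union = pred[:, 2] * pred[:, 3] + target[:, 2] * target[:, 3] - inter + eps
    iou = inter / union

    # Squared center distance rho^2 and enclosing-box diagonal c^2 (formula 17).
    rho2 = (pred[:, 0] - target[:, 0]) ** 2 + (pred[:, 1] - target[:, 1]) ** 2
    cw = torch.max(px2, tx2) - torch.min(px1, tx1)
    ch = torch.max(py2, ty2) - torch.min(py1, ty1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio term v and trade-off alpha (formulas 18 and 19).
    v = (4 / math.pi ** 2) * (torch.atan(target[:, 2] / (target[:, 3] + eps)) -
                              torch.atan(pred[:, 2] / (pred[:, 3] + eps))) ** 2
    with torch.no_grad():
        alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v   # formula 15
```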
when judging whether a prediction box is a positive or a negative sample, the simOTA algorithm is used: the IoU loss $loss_{reg}$ between the prediction box and the ground-truth box is combined with the classification loss $loss_{cls}$ between the prediction and ground-truth boxes to form the cost matrix $Cost$, as shown in formula 20:

$Cost = loss_{cls} + \lambda \, loss_{reg}$ (20)

where the balance coefficient $\lambda$ is set to 3 to balance the relative difficulty of the two losses;
the cost matrix shows that the higher the overlap between a ground-truth box and a prediction box, and the more accurate the classification, the lower the cost; in this way several prediction boxes that best fit each ground-truth box are found adaptively;
then, the N candidate boxes with the largest IoU are selected according to the cost values, and an appropriate number M of positive samples is assigned to each target to be identified, as shown in formula 21:

$M = \left\lfloor \sum_{i=1}^{N} \mathrm{IoU}_i \right\rfloor$ (21)
when the same candidate box is matched to multiple ground-truth boxes, the ground-truth box with the smaller cost is selected as its unique matching target;
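The following is a rough sketch of the simOTA dynamic-k assignment described above; the candidate count `n_candidates = 10` and the exact tie-breaking details are assumptions based on common simOTA implementations.

```python
import torch

def simota_assign(cost, ious, n_candidates=10):
    """Dynamic-k positive-sample assignment in the spirit of simOTA.

    cost: (num_gt, num_pred) cost matrix, e.g. loss_cls + 3 * loss_reg (formula 20).
    ious: (num_gt, num_pred) IoU between each ground-truth and prediction box.
    """
    num_gt, num_pred = cost.shape
    matching = torch.zeros_like(cost, dtype=torch.bool)

    # Dynamic k per ground truth: floored sum of the top-N IoUs, at least 1 (formula 21).
    topk_ious, _ = ious.topk(min(n_candidates, num_pred), dim=1)
    dynamic_k = topk_ious.sum(dim=1).int().clamp(min=1)

    # For each ground truth, take its k lowest-cost predictions as positives.
    for g in range(num_gt):
        _, idx = cost[g].topk(int(dynamic_k[g]), largest=False)
        matching[g, idx] = True

    # If a prediction matches several ground truths, keep only the cheapest one.
    multi = matching.sum(dim=0) > 1
    if multi.any():
        best_gt = cost[:, multi].argmin(dim=0)
        matching[:, multi] = False
        matching[best_gt, multi] = True
    return matching
```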
in the initial stage of training, a large learning rate enables the network to converge quickly, while in the later stage a small learning rate is more conducive to converging to the optimal value; training therefore uses an exponential learning-rate decay strategy, with the learning rate $\gamma$ computed as in formula 22:
$\gamma = \varepsilon^{\tau} \, \gamma_0$ (22)
where $\gamma_0$ denotes the initial learning rate, $\varepsilon$ is the decay rate, and $\tau$ is the iteration number of the training network.
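For reference, formula 22 corresponds to the behavior of PyTorch's built-in exponential scheduler; the sketch below uses illustrative values for $\gamma_0$ and $\varepsilon$, and a placeholder model in place of the detection network.

```python
import torch

gamma0 = 1e-3   # initial learning rate gamma_0 (illustrative value)
eps = 0.98      # decay rate epsilon (illustrative value)

model = torch.nn.Linear(10, 2)  # placeholder for the detection network
optimizer = torch.optim.SGD(model.parameters(), lr=gamma0)
# ExponentialLR multiplies the lr by `gamma` (here eps) every step,
# giving lr = eps**tau * gamma0 after tau steps, i.e. formula 22.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=eps)

for tau in range(5):
    optimizer.step()   # training step placeholder
    scheduler.step()   # lr = eps**(tau + 1) * gamma0
    print(optimizer.param_groups[0]["lr"])
```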
CN202311038529.8A 2023-08-17 2023-08-17 A method for identifying floating objects on the water surface in complex scenarios Active CN117036656B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311038529.8A CN117036656B (en) 2023-08-17 2023-08-17 A method for identifying floating objects on the water surface in complex scenarios

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311038529.8A CN117036656B (en) 2023-08-17 2023-08-17 A method for identifying floating objects on the water surface in complex scenarios

Publications (2)

Publication Number Publication Date
CN117036656A true CN117036656A (en) 2023-11-10
CN117036656B CN117036656B (en) 2025-12-05

Family

ID=88640937

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311038529.8A Active CN117036656B (en) 2023-08-17 2023-08-17 A method for identifying floating objects on the water surface in complex scenarios

Country Status (1)

Country Link
CN (1) CN117036656B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11222217B1 (en) * 2020-08-14 2022-01-11 Tsinghua University Detection method using fusion network based on attention mechanism, and terminal device
CN114782772A (en) * 2022-04-08 2022-07-22 河海大学 A detection and identification method of floating objects on water based on improved SSD algorithm
CN115205264A (en) * 2022-07-21 2022-10-18 南京工程学院 A high-resolution remote sensing ship detection method based on improved YOLOv4

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIE YU et al.: "Enhanced YOLOv7 integrated with small target enhancement for rapid detection of objects on water surfaces", FRONTIERS IN NEUROROBOTICS, vol. 17, 14 December 2023 (2023-12-14), pages 1 - 14 *
LIU YAN: "Research on small target detection methods based on feature enhancement in remote sensing images", China Master's Theses Full-text Database, Engineering Science & Technology II, 15 July 2023 (2023-07-15), pages 028 - 26 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117237614A (en) * 2023-11-10 2023-12-15 江西啄木蜂科技有限公司 Deep learning-based lake surface floater small target detection method
CN118262299A (en) * 2024-04-10 2024-06-28 中国人民解放军海军潜艇学院 Small ship detection method and system based on novel neck network and loss function
CN119135852A (en) * 2024-11-14 2024-12-13 广州中海电信有限公司 A cruise ship safety video monitoring method and system based on video classification
CN120853066A (en) * 2025-09-24 2025-10-28 北京中科浩电科技有限公司 High-precision identification and warning method and system for floating objects for UAVs

Also Published As

Publication number Publication date
CN117036656B (en) 2025-12-05

Similar Documents

Publication Publication Date Title
CN113569667B (en) Inland ship target identification method and system based on lightweight neural network model
CN117036656B (en) A method for identifying floating objects on the water surface in complex scenarios
Chen et al. A deep neural network based on an attention mechanism for SAR ship detection in multiscale and complex scenarios
CN117292313A (en) A small target floating garbage detection method based on the improved YOLOv7 model
Xiao et al. FDLR-Net: A feature decoupling and localization refinement network for object detection in remote sensing images
CN115471746A (en) A ship target recognition and detection method based on deep learning
CN115439689B (en) Nearshore visual ship target detection method based on directly guided mask detection network
Zhang et al. MLBR-YOLOX: An efficient SAR ship detection network with multilevel background removing modules
Wang et al. Yolox-BTFPN: An anchor-free conveyor belt damage detector with a biased feature extraction network
Zhang et al. Light-SDNet: a lightweight CNN architecture for ship detection
CN118781478B (en) An unmanned boat obstacle recognition method based on image analysis and its model construction method
CN114241189A (en) Ship black smoke identification method based on deep learning
CN117351345A (en) Stereoscopic vision-based marine target identification and positioning method
Dai et al. GCD-YOLOv5: An armored target recognition algorithm in complex environments based on array lidar
CN118968017A (en) Ship detection method and system combining spatial explicit vision and improved large kernel attention
CN119360196A (en) Ship target detection method in SAR images based on multi-attention fusion
CN116704354B (en) Multi-scale ship identification method
CN117218502A (en) A multi-scale feature fusion method for target detection in sonar images
CN120071304B (en) A method and system for adaptively sensing marine environment based on boundary reinforcement
CN120510444A (en) Complex background-oriented light unmanned aerial vehicle aerial photographing road disease detection method and system
Zhang et al. Using improved YOLOX for underwater object recognition
CN118262242A (en) SAR remote sensing ship target detection method based on RFLA and attention mechanism
CN116912802A (en) Ship detection method and system based on improved Yolov7
CN116452965A (en) An underwater target detection and recognition method based on acousto-optic fusion
Dong et al. Research on Sea Surface Target Detection Algorithm Based on Deep Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant