Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The terms referred to in the present application are explained first:
target detection: through training, enabling a computer to identify and locate a specified target in an image; the key technologies include deep convolutional neural networks, feature extraction, feature matching, target positioning, and the like;
deep convolutional neural network: a class of feedforward neural networks that contain convolution computations and have a deep structure;
feature extraction: in machine learning, pattern recognition, and image processing, feature extraction starts from an initial set of measurement data and builds derived values (features) intended to be informative and non-redundant, thereby facilitating the subsequent learning and generalization steps and, in some cases, improving interpretability. Feature extraction is related to dimensionality reduction, and the quality of the features has a crucial influence on generalization capability;
feature matching: an image matching method in which the features extracted from images are taken as conjugate entities, the extracted feature attributes or description parameters (in effect features of the features, which can also be regarded as features of the images) are taken as matching entities, and similarity measures between the matching entities are computed to register the conjugate entities;
feature fusion: combining features of different dimensions or different perception perspectives to adapt to more complex dimensions or perspectives, which can achieve the effect of complementary advantages in pattern recognition.
A specific application scenario of the present application may be a face detection system in a banking platform: the face detection system acquires images captured by a camera and detects a face target from the images.
At present, there are many target detection algorithms, but almost none of them can be deployed on mobile devices because their models are too large and place high computing power demands on the hardware, and their overall detection accuracy still does not reach the level required for industrial application.
Illustratively, the YOLOv4 target detection algorithm uses CSPDarknet53 as the feature extraction network, adopts the idea of SPP (Spatial Pyramid Pooling) to enlarge the receptive field, and uses PAN (Path Aggregation Network) as the feature fusion structure to enhance the bottom-up path. However, the CSPDarknet53 network uses a complex convolutional structure with a large number of 3 × 3 convolutions; such a convolution structure consumes a large amount of computing power and increases time consumption, and the PAN structure also consumes a large amount of computing power.
The present application provides a target detection method, apparatus, and device, aiming to solve the above technical problems in the prior art.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 1 is a schematic flow diagram of a target detection method according to an exemplary embodiment. As shown in Fig. 1, the method includes:
S101, acquiring an image to be detected.
The execution subject of this embodiment may be a server, a terminal, or a system including a server and a terminal, which is not limited here. This embodiment is described by taking a terminal as the execution subject.
The image to be detected is an image on which target detection needs to be performed, and it contains a target object. Target detection is performed on the image to be detected in order to identify and locate the target object in the image to be detected.
S102, performing feature extraction on the image to be detected to obtain a multi-channel feature map corresponding to the image to be detected.
When extracting features from the image to be detected, one or more layers of convolution processing may be performed on the image to be detected, and pooling may also be performed after each convolution, to obtain the multi-channel feature map corresponding to the image to be detected. The multi-channel feature map includes the feature maps of a plurality of channels, where the number of channels equals the number of convolution kernels used in the convolution processing.
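Illustratively, a minimal sketch of this step, assuming a PyTorch implementation, is given below; the input size, kernel count, and layer choices are illustrative only and are not taken from the drawings. It shows that the number of channels of the resulting feature map equals the number of convolution kernels:

```python
import torch
import torch.nn as nn

# Minimal sketch: one convolution layer followed by pooling; the number of
# channels in the resulting feature map equals the number of convolution
# kernels (out_channels). Sizes are illustrative assumptions.
stem = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),
)
image = torch.randn(1, 3, 416, 416)     # image to be detected (RGB), size assumed
feature_map = stem(image)
print(feature_map.shape)                # torch.Size([1, 16, 104, 104]) -> a 16-channel feature map
```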
S103, grouping the channels of the multi-channel feature map, respectively extracting the features of each group of channel feature map to obtain the features of each channel, and recombining the features of all the channels to obtain a recombined feature map.
The channels of the multi-channel feature map are grouped to obtain a plurality of groups of channel feature maps, each group containing a plurality of channels. Taking each group as a unit, convolution processing is performed on the channel feature maps of each group to obtain the features of each channel in the group. The features of all channels are then shuffled and recombined into a new feature map, i.e., the recombined feature map.
For example, assume that the multi-channel feature map includes feature maps of 9 channels (channel 1, channel 2, channel 3, channel 4, channel 5, channel 6, channel 7, channel 8, and channel 9 in sequence). The multi-channel feature map is divided into 3 groups of channels, each containing 3 channels: a first group (channel 1, channel 2, channel 3), a second group (channel 4, channel 5, channel 6), and a third group (channel 7, channel 8, channel 9). The grouped channels are arranged as a two-dimensional matrix, and the matrix is transposed to obtain the recombined channel arrangement, so that the channels in the recombined feature map are channel 1, channel 4, channel 7, channel 2, channel 5, channel 8, channel 3, channel 6, and channel 9 in sequence.
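Illustratively, the grouping, transposing, and flattening described above corresponds to a standard channel shuffle operation; the following is a minimal sketch assuming a PyTorch implementation with an NCHW tensor (the function name and tensor sizes are illustrative only):

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Recombine channels by grouping, transposing, and flattening.

    For 9 channels and 3 groups this yields the order
    1, 4, 7, 2, 5, 8, 3, 6, 9 described above.
    """
    n, c, h, w = x.shape
    assert c % groups == 0, "channel count must be divisible by the group count"
    # (N, C, H, W) -> (N, groups, C//groups, H, W)
    x = x.view(n, groups, c // groups, h, w)
    # transpose the two channel axes, i.e. transpose the 2-D channel matrix
    x = x.transpose(1, 2).contiguous()
    # flatten back to (N, C, H, W) with the recombined channel order
    return x.view(n, c, h, w)

# Example: 9 channels split into 3 groups of 3
x = torch.arange(9, dtype=torch.float32).view(1, 9, 1, 1)
print(channel_shuffle(x, groups=3).flatten().tolist())
# -> [0.0, 3.0, 6.0, 1.0, 4.0, 7.0, 2.0, 5.0, 8.0]  (channels 1, 4, 7, 2, 5, 8, 3, 6, 9)
```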
S104, downsampling the recombined feature map to obtain first feature maps of multiple preset scales.
In a target detection task, accuracy is improved by extracting features at multiple scales. Downsampling the recombined feature map reduces the feature map scale so as to obtain first feature maps of the desired scales.
Illustratively, three scales of 13 × 13, 26 × 26, and 52 × 52 are preset: downsampling is performed on the recombined feature map to obtain a 52 × 52 first feature map, downsampling is performed on the 52 × 52 first feature map to obtain a 26 × 26 first feature map, and downsampling is performed on the 26 × 26 first feature map to obtain a 13 × 13 first feature map.
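Illustratively, the successive downsampling can be sketched as follows, assuming stride-2 convolutions in a PyTorch implementation; the channel counts and the input size of the recombined feature map are illustrative assumptions:

```python
import torch
import torch.nn as nn

# A minimal sketch of successive stride-2 downsampling producing the three
# preset scales; channel counts and input size are assumptions, not Table 1 values.
down1 = nn.Conv2d(24, 48, kernel_size=3, stride=2, padding=1)    # 104 -> 52
down2 = nn.Conv2d(48, 96, kernel_size=3, stride=2, padding=1)    # 52 -> 26
down3 = nn.Conv2d(96, 192, kernel_size=3, stride=2, padding=1)   # 26 -> 13

recombined = torch.randn(1, 24, 104, 104)        # recombined feature map (size assumed)
p52 = down1(recombined)                          # 52 x 52 first feature map
p26 = down2(p52)                                 # 26 x 26 first feature map
p13 = down3(p26)                                 # 13 x 13 first feature map
print(p52.shape[-1], p26.shape[-1], p13.shape[-1])  # 52 26 13
```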
S105, performing target detection based on the first feature maps of the multiple preset scales, and determining the category and position information of the target object detected from the image to be detected.
Target detection is performed based on the first feature maps of the multiple preset scales to obtain a target detection result corresponding to the image to be detected, and the category and position information of the target object detected from the image to be detected are determined based on the target detection result. The category of the target object indicates what the target object specifically is (for example, a person, a vehicle, or a tree), and the position information of the target object is used to locate the target object.
In this embodiment, grouping the channels, extracting features group by group, and then recombining the extracted channel features reduces the amount of computation, thereby reducing the consumption of computing resources and the computing power required of the hardware, so that target detection can be realized on mobile devices with relatively low computing power. Meanwhile, recombining the features of different channels yields richer feature information, which helps to improve the target detection accuracy.
In one example, performing feature extraction on each group of channel feature maps to obtain the features of each channel includes: extracting a first channel feature of each channel from each group of channel feature maps; determining a weight for each channel according to the first channel features of all the channels, and weighting the first channel feature of each channel based on its weight to obtain a second channel feature of each channel; and adding the first channel feature and the second channel feature of each channel to obtain the feature of each channel.
The weight of a channel represents the importance of the channel's features. By introducing an attention mechanism, the features of each channel are re-weighted and reconstructed based on global information, so that important channel features are strengthened and less important channel features are suppressed, improving the feature extraction capability.
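Illustratively, a minimal sketch of this weighting scheme for one group of channels, assuming a PyTorch implementation with an SE-style (squeeze-and-excitation) weighting, is given below; the layer used to extract the first channel features, the reduction ratio, and the module name are assumptions:

```python
import torch
import torch.nn as nn

class GroupChannelAttention(nn.Module):
    """Hedged sketch of the per-group weighting described above: the first channel
    features are re-weighted from global information (SE-style squeeze-and-excitation)
    and then added back to the unweighted features."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.extract = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)  # first channel features (depthwise, assumed)
        self.squeeze = nn.AdaptiveAvgPool2d(1)                  # global information per channel
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        first = self.extract(x)                                 # first channel features
        n, c, _, _ = first.shape
        w = self.excite(self.squeeze(first).view(n, c)).view(n, c, 1, 1)  # per-channel weights
        second = first * w                                      # second channel features (weighted)
        return first + second                                   # feature of each channel

feats = GroupChannelAttention(channels=32)(torch.randn(1, 32, 26, 26))
print(feats.shape)  # torch.Size([1, 32, 26, 26])
```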
Illustratively, combining a Channel Shuffle structure with the MobileNetV3 network yields a lightweight feature extraction network based on the Channel Shuffle structure (denoted CS-MobileNet). Optionally, CS-MobileNet includes a plurality of convolutional layers (Conv), a max pooling layer (MaxPool), and convolutional blocks (Stage); its specific structure is shown in Table 1 below:
TABLE 1
In Table 1, Layer denotes a structural layer in the CS-MobileNet network; Input denotes the input, and the data in this column denotes the input size; Operator denotes the operation performed on the input; K Size denotes the convolution kernel size; Stride denotes the stride; Filters denotes the number of convolution kernels; SE (Squeeze-and-Excitation) denotes the squeeze-and-excitation structure; and Repeat denotes the number of repetitions.
Feature extraction is performed on the image to be detected by Conv1, Conv2, and MaxPool to obtain the multi-channel feature map corresponding to the image to be detected, and channel grouping, feature extraction, feature recombination, and downsampling are performed on the multi-channel feature map by Stages 1 to 5 to obtain first feature maps at the 13 × 13, 26 × 26, and 52 × 52 scales.
FIG. 2 is a schematic diagram of a single Stage in CS-MobileNet according to an exemplary embodiment, where Channel Split denotes channel splitting; Conv denotes convolution; DWConv (Depthwise Conv) denotes depthwise convolution; FC (Fully Connected) denotes a fully connected layer; BN (Batch Normalization) denotes batch normalization; ReLU and h-swish denote activation functions; Concat denotes feature concatenation and fusion; and Channel Shuffle denotes channel shuffling.
As can be seen from Fig. 2, the lightweight attention SE structure is used in the convolutional layers of CS-MobileNet, and the structure of each convolutional layer is modified by combining Channel Shuffle, so that a new network structure is constructed through layer-by-layer feature fusion combined with Channel Split, instead of a simple residual hourglass structure. The left part of Fig. 2 shows the Channel Shuffle branch without downsampling, and the right part shows the Channel Shuffle branch with downsampling. The activation function is h-swish. Regarding the number of channels of the convolutional layers, CS-MobileNet reduces the number of channels to reduce computation, and no dimensionality reduction is performed after the Concat feature fusion structure of each Channel Shuffle, so as to preserve sufficient semantic information. In the target detection model, the lightweight network is used only as the basic feature extraction network; therefore, in the final target detection model, the lightweight network does not include the classification convolutional layer of the final stage.
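Illustratively, a hedged sketch of the non-downsampling Stage branch, assuming a PyTorch implementation, is given below; the exact layer ordering, channel widths, and kernel sizes of Fig. 2 are not reproduced here and should be read as assumptions:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation weighting used inside the Stage (reduction ratio assumed)."""
    def __init__(self, c, r=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(nn.Linear(c, c // r), nn.ReLU(inplace=True),
                                nn.Linear(c // r, c), nn.Hardsigmoid())

    def forward(self, x):
        n, c, _, _ = x.shape
        return x * self.fc(self.pool(x).view(n, c)).view(n, c, 1, 1)

class CSStageUnit(nn.Module):
    """Non-downsampling branch: Channel Split -> (PW Conv, DWConv, SE, PW Conv) -> Concat -> Channel Shuffle."""
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.branch = nn.Sequential(
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half), nn.Hardswish(),
            nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False),  # DWConv
            nn.BatchNorm2d(half), SEBlock(half),
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half), nn.Hardswish())

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)                     # Channel Split
        out = torch.cat([x1, self.branch(x2)], dim=1)  # Concat, no dimensionality reduction afterwards
        n, c, h, w = out.shape                         # Channel Shuffle over 2 groups
        return out.view(n, 2, c // 2, h, w).transpose(1, 2).reshape(n, c, h, w)

print(CSStageUnit(116)(torch.randn(1, 116, 52, 52)).shape)  # torch.Size([1, 116, 52, 52])
```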
The results of CS-MobileNet in ImageNet classification experiments, compared with other network structures, are shown in Table 2 below:
TABLE 2
In Table 2, Accuracy denotes the classification accuracy, Mult-Adds denotes the number of multiply-add operations, and Parameters denotes the parameter complexity. As can be seen from Table 2, CS-MobileNet is superior to MobileNetV3 in both classification performance and speed. In general, CS-MobileNet has stronger feature extraction capability and an obvious speed advantage, and its network structure is not as simple as those of the other networks, so CS-MobileNet is more suitable for target detection tasks than the other networks.
Fig. 3 is a flowchart of a target detection method according to an exemplary embodiment. As shown in Fig. 3, the method includes:
S301, acquiring an image to be detected.
S302, performing feature extraction on the image to be detected to obtain a multi-channel feature map corresponding to the image to be detected.
S303, grouping the channels of the multi-channel feature map, respectively performing feature extraction on each group of channel feature maps to obtain the features of each channel, and recombining the features of all the channels to obtain a recombined feature map.
S304, downsampling the recombined feature map to obtain first feature maps with multiple preset scales.
Exemplarily, steps S301 to S304 are implemented similarly to steps S101 to S104, and details are not repeated here.
S305, performing feature fusion based on the first feature maps with multiple preset scales to obtain fusion feature maps with multiple preset scales.
In an example, the plurality of preset scales includes a first scale, a second scale larger than the first scale, and a third scale larger than the second scale. Optionally, the first scale is 13 × 13, the second scale is 26 × 26, and the third scale is 52 × 52.
In one example, performing feature fusion based on first feature maps with multiple preset scales to obtain a fusion feature map with multiple preset scales includes: determining the first feature map of the first scale as a fused feature map of the first scale; performing multi-channel up-sampling on the first feature map of the first scale to obtain a second feature map of a second scale, and fusing the second feature map of the second scale and the first feature map of the second scale to obtain a fused feature map of the second scale; and performing multi-channel up-sampling on the fusion feature map of the second scale to obtain a second feature map of a third scale, and fusing the second feature map of the third scale and the first feature map of the third scale to obtain a fusion feature map of the third scale.
In this embodiment, multi-channel upsampling increases the richness of the positioning information in the upsampled feature map; compared with ordinary upsampling that does not distinguish between channels, the feature extraction capability is stronger.
In one example, multi-channel upsampling is performed on a first feature map at a first scale to obtain a second feature map at a second scale, and the method comprises the following steps: dividing the first feature map of the first scale into a first channel feature, a second channel feature and a third channel feature; deconvolution is carried out on the first channel characteristics to obtain first up-sampling characteristics, deconvolution and convolution are carried out on the second channel characteristics in sequence to obtain second up-sampling characteristics, and convolution, deconvolution and convolution are carried out on the third channel characteristics in sequence to obtain third up-sampling characteristics; wherein the convolution kernel size of the convolution is smaller than the convolution kernel size of the deconvolution; and fusing the first upsampling feature, the second upsampling feature and the third upsampling feature to obtain a second feature map of a second scale.
In one example, performing multi-channel upsampling on the fused feature map at the second scale to obtain a second feature map at a third scale includes: dividing the fused feature map of the second scale into a fourth channel feature, a fifth channel feature and a sixth channel feature; deconvolution is carried out on the fourth channel characteristic to obtain a fourth up-sampling characteristic, deconvolution and convolution are carried out on the fifth channel characteristic in sequence to obtain a fifth up-sampling characteristic, and convolution, deconvolution and convolution are carried out on the sixth channel characteristic in sequence to obtain a sixth up-sampling characteristic; wherein the convolution kernel size of the convolution is smaller than the convolution kernel size of the deconvolution; and fusing the fourth upsampling feature, the fifth upsampling feature and the sixth upsampling feature to obtain a second feature map of a third scale.
Illustratively, a Channel Shuffle-based upsampling structure (denoted as Shuffle _ Upsample) is used to perform multi-Channel upsampling on a first feature map at a first scale and to perform multi-Channel upsampling on a fused feature map at a second scale.
FIG. 4 is a schematic diagram of the structure of Shuffle_Upsample according to an exemplary embodiment, where s denotes the input feature map size; 2s × c denotes the size of the fused features; DeConv denotes deconvolution; DWDeConv denotes depthwise deconvolution; 1 × 1 Conv denotes pointwise convolution (PointWise Conv, PW convolution); BN (Batch Normalization) denotes batch normalization; ReLU denotes the activation function; and Concat denotes feature concatenation and fusion.
As can be seen from Fig. 4, the Shuffle_Upsample structure is a multi-channel convolution fusion structure based on Channel Shuffle, and the whole upsampling structure is divided into three branches: the leftmost branch uses a 3 × 3 DeConv for upsampling to obtain a feature map twice the size of the input feature map; the middle branch uses a 3 × 3 DWDeConv for upsampling and then a 1 × 1 Conv for dimensionality reduction, finally obtaining a feature map twice the size of the input feature map; and the rightmost branch first raises the dimensionality with a 1 × 1 Conv, then upsamples with a 3 × 3 DWDeConv, and finally reduces the dimensionality with a 1 × 1 Conv. The Shuffle_Upsample structure simplifies the computation while increasing the number of feature map fusions, so that the consumption of computing resources is reduced while richer feature information is extracted, improving detection accuracy.
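Illustratively, a hedged sketch of the three-branch Shuffle_Upsample structure, assuming a PyTorch implementation, is given below; the channel split into thirds, the normalization placement, and the fusion by concatenation are assumptions rather than the exact layers of Fig. 4:

```python
import torch
import torch.nn as nn

class ShuffleUpsample(nn.Module):
    """Sketch of the three-branch upsampling of Fig. 4 (channel split into thirds assumed)."""
    def __init__(self, c_in):
        super().__init__()
        c = c_in // 3
        # leftmost branch: 3x3 DeConv upsampling (x2)
        self.left = nn.ConvTranspose2d(c, c, 3, stride=2, padding=1, output_padding=1)
        # middle branch: 3x3 DWDeConv upsampling, then 1x1 Conv dimensionality reduction
        self.mid = nn.Sequential(
            nn.ConvTranspose2d(c, c, 3, stride=2, padding=1, output_padding=1, groups=c),
            nn.Conv2d(c, c, 1), nn.BatchNorm2d(c), nn.ReLU(inplace=True))
        # rightmost branch: 1x1 Conv to raise dimension, 3x3 DWDeConv upsampling, 1x1 Conv to reduce dimension
        c3 = c_in - 2 * c
        self.right = nn.Sequential(
            nn.Conv2d(c3, 2 * c3, 1),
            nn.ConvTranspose2d(2 * c3, 2 * c3, 3, stride=2, padding=1, output_padding=1, groups=2 * c3),
            nn.Conv2d(2 * c3, c3, 1), nn.BatchNorm2d(c3), nn.ReLU(inplace=True))

    def forward(self, x):
        c = x.shape[1] // 3
        x1, x2, x3 = x[:, :c], x[:, c:2 * c], x[:, 2 * c:]
        # fuse the three upsampled branches by concatenation
        return torch.cat([self.left(x1), self.mid(x2), self.right(x3)], dim=1)

up = ShuffleUpsample(96)
print(up(torch.randn(1, 96, 13, 13)).shape)  # torch.Size([1, 96, 26, 26])
```

In this sketch an input of size 13 × 13 is upsampled to 26 × 26, with each branch processing one group of channels.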
Illustratively, the output feature map of the Shuffle_Upsample upsampling structure and the feature map of the original size are fused using a Feature Pyramid Network (FPN) structure to obtain the fused feature map of the second scale and the fused feature map of the third scale. Combining the Shuffle_Upsample upsampling structure with the FPN structure yields a lightweight feature fusion structure (denoted Shuffle_fpn_block).
Fig. 5 is a schematic diagram illustrating a structure of a Shuffle _ fpn _ block according to an exemplary embodiment, where SU represents a Shuffle _ Upsample structure; DBL represents DW convolution + BN + Leaky _ ReLU; PDPBL represents PW convolution + DW convolution + PW convolution + BN + Leaky _ ReLU; 255 denotes 3 × (80+4+1), where 80 corresponds to the number of categories, 4 corresponds to the number of pieces of position information (including the coordinates of the center point and the width and height of the bounding box), and 1 corresponds to the category confidence.
In the lightweight feature fusion structure Shuffle_fpn_block, convolution is performed using depthwise separable convolution (DSC), and upsampling is performed using Shuffle_Upsample. The 13 × 13 first feature map output by the basic feature extraction network (CS-MobileNet) is upsampled by Shuffle_Upsample to obtain a channel-fused upsampled feature map (the 26 × 26 second feature map), which is fused for the first time with the 26 × 26 first feature map output by CS-MobileNet to obtain the 26 × 26 fused feature map. The 26 × 26 fused feature map is then upsampled to obtain a channel-fused upsampled feature map (the 52 × 52 second feature map), which is fused for the second time with the 52 × 52 first feature map output by CS-MobileNet to obtain the 52 × 52 fused feature map. Through this process, feature maps at three scales are obtained: the 13 × 13 feature map output by the basic feature extraction network, the 26 × 26 feature map after the first fusion, and the 52 × 52 feature map after the second fusion.
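Illustratively, the data flow of Shuffle_fpn_block described above can be sketched as follows, assuming a PyTorch implementation; for brevity a plain transposed convolution stands in for the Shuffle_Upsample module sketched after Fig. 4, and the 1 × 1 fusion convolutions are assumptions (the DBL and PDPBL layers of Fig. 5 are omitted):

```python
import torch
import torch.nn as nn

class ShuffleFPNBlockSketch(nn.Module):
    """Sketch of the 13 -> 26 -> 52 top-down fusion flow; channel counts are assumptions."""
    def __init__(self, c13, c26, c52):
        super().__init__()
        self.up1 = nn.ConvTranspose2d(c13, c13, 3, stride=2, padding=1, output_padding=1)  # 13 -> 26 (stand-in for Shuffle_Upsample)
        self.fuse1 = nn.Conv2d(c13 + c26, c26, 1)                                          # first fusion
        self.up2 = nn.ConvTranspose2d(c26, c26, 3, stride=2, padding=1, output_padding=1)  # 26 -> 52 (stand-in for Shuffle_Upsample)
        self.fuse2 = nn.Conv2d(c26 + c52, c52, 1)                                          # second fusion

    def forward(self, p13, p26, p52):
        f13 = p13                                              # 13x13 map: backbone output, used as-is
        f26 = self.fuse1(torch.cat([self.up1(p13), p26], 1))   # 26x26 fused feature map
        f52 = self.fuse2(torch.cat([self.up2(f26), p52], 1))   # 52x52 fused feature map
        return f13, f26, f52

fpn = ShuffleFPNBlockSketch(c13=256, c26=128, c52=64)
outs = fpn(torch.randn(1, 256, 13, 13), torch.randn(1, 128, 26, 26), torch.randn(1, 64, 52, 52))
print([o.shape[-1] for o in outs])   # [13, 26, 52]
```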
The parameter comparison between Shuffle_fpn_block and res_fpn_block is shown in Table 3 below:
TABLE 3
As can be seen from Table 3, Shuffle_fpn_block has 30% fewer parameters than res_fpn_block and is "lighter" in terms of model complexity.
S306, performing target detection based on the fusion feature maps of the multiple preset scales, and determining the category and position information of the target object detected from the image to be detected.
In one example, performing target detection based on the fusion feature maps of the multiple preset scales and determining the category and position information of the target object detected from the image to be detected includes: performing target detection based on the fusion feature map of each preset scale to obtain target feature information of each preset scale; separating the target feature information of each preset scale to obtain the classification features and the first positioning features of each preset scale; performing full convolution and reconstruction on the first positioning features of the multiple preset scales to obtain the second positioning features of the multiple preset scales; and determining the category and position information of the target object detected from the image to be detected based on the classification features and the second positioning features of the multiple preset scales.
The target feature information includes positioning feature information and classification feature information. In existing target detection regression networks, the output mixes the positioning information and the classification information into one feature matrix; during feature extraction, the feature map is convolved only by weight-shared convolution kernels, and the amount of classification feature data is far larger than that of positioning feature data, which may affect feature extraction and the back-propagation adjustment of network parameters, causing the network to overfit the classification features and underfit the positioning features during training. Therefore, a feature extraction method that simply mixes classification features and positioning features cannot handle the positioning task in target detection well, which affects target detection accuracy. Based on this, in the present embodiment, the positioning feature information and the classification feature information are separated and extracted, and a full convolution localization separation network (FCLSN) is designed to improve target detection accuracy.
Fig. 6 is a schematic diagram of the FCLSN structure according to an exemplary embodiment. As shown in Fig. 6, the FCLSN structure includes two fully convolutional networks (FCNs) and two fully connected (FC) layers. The target feature information of each preset scale is separated to obtain the classification features (with sizes 13 × 13 × 80 × 3, 26 × 26 × 80 × 3, and 52 × 52 × 80 × 3, respectively) and the first positioning features of each preset scale. The first positioning features are convolved into one-dimensional feature matrices by the two FCNs; the size of these matrices is independent of the size of the data set and depends only on the scales of the three feature maps, the one-dimensional feature matrix sizes being 1 × (4+1) × 13 × 13, 1 × (4+1) × 26 × 26, and 1 × (4+1) × 52 × 52, respectively. The one-dimensional positioning feature matrices corresponding to the three scales are then used as inputs to the FC layers, and the FC outputs are reconstructed (reshape) into three positioning feature matrices (the second positioning features) with sizes 13 × 13 × (4+1) × 3, 26 × 26 × (4+1) × 3, and 52 × 52 × (4+1) × 3. Finally, the three positioning feature matrices are fused with the classification feature matrices of the corresponding scales by concatenation (Concat) to obtain three final output feature matrices with sizes 13 × 13 × (4+1+80) × 3, 26 × 26 × (4+1+80) × 3, and 52 × 52 × (4+1+80) × 3.
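Illustratively, a simplified single-scale sketch of the separation described above, assuming a PyTorch implementation, is given below; it uses two convolutional layers as the FCN part and one fully connected layer per scale, whereas Fig. 6 shows two FCNs and two FC layers across the structure, so the layer counts, kernel sizes, and class names are assumptions:

```python
import torch
import torch.nn as nn

class FCLSNHead(nn.Module):
    """Sketch for one scale s (13, 26, or 52): the (4+1)x3 localization part is
    reduced by fully convolutional layers to a 1 x (4+1) x s x s matrix, passed
    through an FC layer, reshaped, and concatenated back with the classification part."""

    def __init__(self, s: int, num_classes: int = 80, anchors: int = 3):
        super().__init__()
        self.s, self.a, self.nc = s, anchors, num_classes
        loc_c = 5 * anchors                                    # (4+1) x 3 localization channels
        self.fcn = nn.Sequential(                              # FCN part, output 1 x (4+1) x s x s
            nn.Conv2d(loc_c, loc_c, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(loc_c, 5, 1))
        self.fc = nn.Linear(5 * s * s, loc_c * s * s)          # FC layer; its size depends only on the scale

    def forward(self, feat):
        n = feat.shape[0]
        cls = feat[:, : self.nc * self.a]                      # classification features, s x s x 80 x 3
        loc = feat[:, self.nc * self.a:]                       # first positioning features, s x s x (4+1) x 3
        one_d = self.fcn(loc).flatten(1)                       # one-dimensional positioning feature matrix
        loc2 = self.fc(one_d).view(n, 5 * self.a, self.s, self.s)  # second positioning features (reshape)
        return torch.cat([loc2, cls], dim=1)                   # s x s x (4+1+80) x 3 output

head = FCLSNHead(s=13)
print(head(torch.randn(1, (80 + 5) * 3, 13, 13)).shape)        # torch.Size([1, 255, 13, 13])
```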
In one example, the classification feature includes class information of the detected bounding box; the second positioning characteristics comprise position information and confidence information of the detected bounding box, wherein the position information comprises a central point coordinate and width and height; determining the category and the position information of the target object detected from the image to be detected based on the classification features and the second positioning features of multiple preset scales, wherein the method comprises the following steps: screening the detected boundary frame by using a non-maximum suppression mode according to the reference confidence coefficient and the category information, the position information and the confidence coefficient of the detected boundary frame to obtain the category information and the position information of the target detection frame; and determining the category and the position information of the target object detected from the image to be detected according to the category information and the position information of the target detection frame.
Specifically, a lightweight target detection network is obtained by combining the lightweight feature extraction network CS-MobileNet with the lightweight feature fusion structure Shuffle_fpn_block; this lightweight target detection network is denoted LSCS-YOLO (Location Separation Channel-YOLO), and its accuracy is improved by a candidate box screening method that combines the relative entropy loss function (KL-Loss) with the Softer-NMS non-maximum suppression method.
Illustratively, KL-Loss is used to calculate the loss of the classification part, and the loss function of the target detection model combining the Softer-NMS algorithm with the FCLSN structure is as follows:
where S² represents the scale; B represents the number of categories; x and y represent the center point coordinates of the detected bounding box; x' and y' represent the center point coordinates of the labeled box; w and h represent the width and height of the bounding box; w' and h' represent the width and height of the labeled box; C represents the confidence that a target object exists in the bounding box; C' represents the confidence that a target object exists in the labeled box; p represents the class probability of the bounding box; p' represents the class probability of the labeled box; and λ, l, and α are constants.
Fig. 7 is a schematic diagram of the structure combining the Softer-NMS algorithm with FCLSN according to an exemplary embodiment, where box std denotes the labeled box information, and AbsVal denotes calculating the standard deviation between the labeled box information and the bounding box information in order to add confidence information; after the model is trained, the calculated standard deviation is used as the reference confidence to assist candidate box screening when the model is used. With the full-convolution localization separation structure FCLSN and the Softer-NMS box screening algorithm, LSCS-YOLO localizes small targets and occluded targets better than the YOLOv3 model; localization of occluded targets is no longer a weakness of regression-based algorithms and lightweight models, and its accuracy on occluded targets exceeds that of the YOLOv3 model. Compared with other target detection networks, LSCS-YOLO has advantages in speed, light weight, and accuracy.
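Illustratively, a simplified sketch of variance-based candidate box screening in the spirit of Softer-NMS is given below; it assumes a single predicted variance per box and an illustrative weighting formula, so it shows the screening idea rather than the exact algorithm of the patent or the original Softer-NMS paper:

```python
import numpy as np

def iou(box, others):
    """IoU between one box and an array of boxes, all in [x1, y1, x2, y2] format."""
    x1 = np.maximum(box[0], others[:, 0]); y1 = np.maximum(box[1], others[:, 1])
    x2 = np.minimum(box[2], others[:, 2]); y2 = np.minimum(box[3], others[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (others[:, 2] - others[:, 0]) * (others[:, 3] - others[:, 1])
    return inter / (area_a + area_b - inter)

def softer_nms_sketch(boxes, scores, variances, iou_thr=0.5, sigma_t=0.02):
    """Variance-voting style screening: the kept box's coordinates become a weighted
    average of its overlapping candidates, where a lower predicted variance and a
    higher IoU give a larger weight (weighting formula assumed for illustration)."""
    order = np.argsort(scores)[::-1]          # process boxes in descending score order
    keep = []
    while order.size > 0:
        i = order[0]
        ious = iou(boxes[i], boxes[order])
        mask = ious > iou_thr                 # overlapping candidates (includes the box itself)
        neighbors = order[mask]
        weights = np.exp(-((1 - ious[mask]) ** 2) / sigma_t) / variances[neighbors]
        boxes[i] = (weights[:, None] * boxes[neighbors]).sum(axis=0) / weights.sum()
        keep.append(i)
        order = order[~mask]                  # drop the box and its overlapping candidates
    return keep, boxes

# toy usage: two overlapping boxes are merged into a variance-weighted box, the third is kept
boxes = np.array([[10., 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]])
scores = np.array([0.9, 0.8, 0.7]); variances = np.array([0.1, 0.4, 0.2])
keep, refined = softer_nms_sketch(boxes.copy(), scores, variances)
print(keep, refined[keep])
```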
Fig. 8 is a schematic structural diagram of a target detection apparatus according to an exemplary embodiment. The apparatus may be part of a computer device, implemented as a software module, a hardware module, or a combination of the two. As shown in Fig. 8, the apparatus 800 includes:
an obtaining unit 810, configured to obtain an image to be detected;
the first feature extraction unit 820 is configured to perform feature extraction on an image to be detected to obtain a multi-channel feature map corresponding to the image to be detected;
the second feature extraction unit 830 is configured to perform channel grouping on the multi-channel feature maps, perform feature extraction on each group of channel feature maps respectively to obtain features of each channel, and recombine the features of all the channels to obtain a recombined feature map;
the third feature extraction unit 840 is configured to perform downsampling on the reconstructed feature map to obtain first feature maps of multiple preset scales;
and the target detection unit 850 is used for performing target detection based on the first feature maps with multiple preset scales, and determining the category and position information of the target object detected from the image to be detected.
In an example, when the second feature extraction unit 830 performs feature extraction on each group of channel feature maps to obtain features of each channel, specifically: extracting first channel characteristics of each channel for each group of channel characteristic graphs; determining the weight of each channel according to the first channel characteristics of the channels, and weighting the first channel characteristics of each channel respectively based on the weight of each channel to obtain the second channel characteristics of each channel; and respectively adding the first channel characteristics and the second channel characteristics of each channel to obtain the characteristics of each channel.
Fig. 9 is a schematic structural diagram of a target detection apparatus according to an exemplary embodiment. On the basis of the embodiment shown in Fig. 8, as shown in Fig. 9, in the apparatus 900, the target detection unit 850 includes:
the feature fusion module 851 is used for performing feature fusion based on the first feature maps with multiple preset scales to obtain fusion feature maps with multiple preset scales;
the target detection module 852 is configured to perform target detection based on the fusion feature maps of multiple preset scales, and determine category and position information of a target object detected from the image to be detected.
In one example, the plurality of preset scales includes a first scale, a second scale larger than the first scale, and a third scale larger than the second scale; when performing feature fusion based on the first feature maps of the multiple preset scales to obtain the fusion feature maps of the multiple preset scales, the feature fusion module 851 is specifically configured to: determine the first feature map of the first scale as the fused feature map of the first scale; perform multi-channel upsampling on the first feature map of the first scale to obtain a second feature map of the second scale, and fuse the second feature map of the second scale with the first feature map of the second scale to obtain the fused feature map of the second scale; and perform multi-channel upsampling on the fused feature map of the second scale to obtain a second feature map of the third scale, and fuse the second feature map of the third scale with the first feature map of the third scale to obtain the fused feature map of the third scale.
In an example, when performing multi-channel upsampling on the first feature map at the first scale to obtain the second feature map at the second scale, the feature fusion module 851 is specifically configured to: dividing the first feature map of the first scale into a first channel feature, a second channel feature and a third channel feature; deconvolution is carried out on the first channel characteristics to obtain first up-sampling characteristics, deconvolution and convolution are carried out on the second channel characteristics in sequence to obtain second up-sampling characteristics, and convolution, deconvolution and convolution are carried out on the third channel characteristics in sequence to obtain third up-sampling characteristics; wherein the convolution kernel size of the convolution is smaller than the convolution kernel size of the deconvolution; and fusing the first upsampling feature, the second upsampling feature and the third upsampling feature to obtain a second feature map of a second scale.
In an example, when performing multi-channel upsampling on the fused feature map at the second scale to obtain the second feature map at the third scale, the feature fusion module 851 is specifically configured to: dividing the fused feature map of the second scale into a fourth channel feature, a fifth channel feature and a sixth channel feature; deconvolution is carried out on the fourth channel characteristic to obtain a fourth up-sampling characteristic, deconvolution and convolution are carried out on the fifth channel characteristic in sequence to obtain a fifth up-sampling characteristic, and convolution, deconvolution and convolution are carried out on the sixth channel characteristic in sequence to obtain a sixth up-sampling characteristic; wherein the convolution kernel size of the convolution is smaller than that of the deconvolution; and fusing the fourth upsampling feature, the fifth upsampling feature and the sixth upsampling feature to obtain a second feature map of a third scale.
In an example, when the target detection module 852 performs target detection based on a fusion feature map with multiple preset scales and determines the category and the position information of a target object detected from an image to be detected, the target detection module is specifically configured to: target detection is carried out on the basis of the fusion feature map of each preset scale, and target feature information of each preset scale is obtained; separating the target characteristic information of each preset scale to obtain the classification characteristic and the first positioning characteristic of each preset scale; respectively carrying out full convolution and reconstruction on the first positioning features of various preset scales to obtain second positioning features of various preset scales; and determining the category and the position information of the target object detected from the image to be detected based on the classification features and the second positioning features of multiple preset scales.
In one example, the classification feature includes class information of the detected bounding box; the second positioning characteristics comprise position information and confidence information of the detected bounding box, wherein the position information comprises a central point coordinate and width and height; the target detection module 852 is specifically configured to, when determining the category and the position information of the target object detected from the image to be detected based on the classification features and the second positioning features of multiple preset scales: screening the detected boundary frame by using a non-maximum suppression mode according to the reference confidence coefficient and the category information, the position information and the confidence coefficient of the detected boundary frame to obtain the category information and the position information of the target detection frame; and determining the category and the position information of the target object detected from the image to be detected according to the category information and the position information of the target detection frame.
Fig. 10 is a schematic structural diagram of an electronic device according to an exemplary embodiment. The electronic device includes: a processor 1001 and a memory 1002 communicatively connected to the processor 1001; the memory 1002 stores computer-executable instructions; and the processor 1001 executes the computer-executable instructions stored in the memory 1002 to implement the target detection method provided in the above embodiments.
The electronic device further comprises a receiver 1003 and a transmitter 1004. The receiver 1003 is used for receiving instructions and data transmitted from an external device, and the transmitter 1004 is used for transmitting instructions and data to the external device.
Fig. 11 is a block diagram illustrating a terminal device, which may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, etc., according to one exemplary embodiment.
The apparatus 1100 may include one or more of the following components: processing component 1102, memory 1104, power component 1106, multimedia component 1108, audio component 1110, input/output (I/O) interfaces 1112, sensor component 1114, and communications component 1116.
The processing component 1102 generally controls the overall operation of the device 1100, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 1102 may include one or more processors 1120 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 1102 may include one or more modules that facilitate interaction between the processing component 1102 and other components. For example, the processing component 1102 may include a multimedia module to facilitate interaction between the multimedia component 1108 and the processing component 1102.
The memory 1104 is configured to store various types of data to support operations at the apparatus 1100. Examples of such data include instructions for any application or method operating on device 1100, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 1104 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
A power component 1106 provides power to the various components of the device 1100. The power components 1106 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 1100.
The multimedia component 1108 includes a screen that provides an output interface between the device 1100 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 1108 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 1100 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 1110 is configured to output and/or input audio signals. For example, the audio component 1110 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 1100 is in operating modes, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 1104 or transmitted via the communication component 1116. In some embodiments, the audio assembly 1110 further includes a speaker for outputting audio signals.
The I/O interface 1112 provides an interface between the processing component 1102 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 1114 includes one or more sensors for providing various aspects of state assessment for the apparatus 1100. For example, the sensor assembly 1114 may detect an open/closed state of the apparatus 1100 and the relative positioning of components, such as the display and keypad of the apparatus 1100; the sensor assembly 1114 may also detect a change in position of the apparatus 1100 or a component of the apparatus 1100, the presence or absence of user contact with the apparatus 1100, the orientation or acceleration/deceleration of the apparatus 1100, and a change in temperature of the apparatus 1100. The sensor assembly 1114 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 1114 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 1114 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 1116 is configured to facilitate wired or wireless communication between the apparatus 1100 and other devices. The apparatus 1100 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 1116 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 1116 also includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 1100 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 1104 comprising instructions, executable by the processor 1120 of the apparatus 1100 to perform the method described above is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Embodiments of the present application further provide a non-transitory computer-readable storage medium, where instructions in the storage medium, when executed by a processor of a computer device, enable the computer device to perform the method provided in any of the above embodiments.
An embodiment of the present invention further provides a computer program product, where the computer program product includes: a computer program, the computer program being stored in a readable storage medium, from which the computer program can be read by at least one processor of a computer device, execution of the computer program by the at least one processor causing the computer device to perform the method provided by any of the embodiments described above.
It should be understood that the terms "first", "second", and the like in the above embodiments are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Further, in the description of the present application, "a plurality" means at least two unless otherwise specified.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.