WO2023284182A1 - Training method for recognizing moving target, method and device for recognizing moving target - Google Patents
Training method for recognizing moving target, method and device for recognizing moving target
- Publication number
- WO2023284182A1 (PCT application PCT/CN2021/128515)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- features
- target
- layer
- class
- consecutive images
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/761—Proximity, similarity or dissimilarity measures
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/103—Static body considered as a whole, e.g. static pedestrian or occupant recognition
Definitions
- the present disclosure relates to the field of computer vision and machine learning, and in particular to a training method for recognizing a moving target, a method and an apparatus for recognizing a moving target.
- Recognizing a moving target refers to recognizing a pedestrian target in an image, wherein the image is captured while the pedestrian is walking.
- relatively developed methods in the art for recognizing the pedestrian may include two types: a person re-identification method and a gait recognition method.
- the former method may extract static external features from the image, such as clothing of the pedestrian, a hairstyle of the pedestrian, a backpack of the pedestrian, an umbrella of the pedestrian, and the like.
- the latter method may learn dynamic features, such as a walking posture, an amplitude of arm swinging, head shaking and shoulder shrugging, sensitivity of a motor nerve, and the like, based on continuous movements of the pedestrian.
- the applicant discovers that, when the methods in the art are performed to recognize the moving target, a single feature is relied on, such as a static RGB image or a contour image. Robustness of such a feature is not sufficient, and therefore accuracy of the recognition result may be low.
- some technical solutions in the art recognize the moving target based on feature fusion. For example, global features of an RGB image may be fused with local features of the same RGB image. In this way, the feature modality is relatively unitary; performance of an apparatus may be sacrificed, whereas accuracy of matching may not be improved.
- the present disclosure provides a training method for recognizing a moving target, a method and an apparatus for recognizing a moving target. In this way, robustness and accuracy of recognizing the moving target may be improved.
- a training method of recognizing a moving target includes: obtaining a plurality of consecutive images; inputting the plurality of consecutive images successively into an input end of an inner layer of a two-layer ViT feature fusion model to obtain a first class of static features and a second class of static features of the target in each of the plurality of consecutive images; fusing the first class of static features and the second class of static features in each of the plurality of consecutive images at an output end of the inner layer of the two-layer ViT feature fusion model to obtain fused features; and inputting the fused features of at least some of the plurality of consecutive images successively into an input end of an outer layer of the two-layer ViT feature fusion model for classification training until the entire network is converged.
- obtaining the first class of static features and the second class of static features of the target in each of the plurality of consecutive images includes: obtaining fine-grained static features and fine-grained contour features of the target in each of the plurality of consecutive images.
- the obtaining fine-grained static features and fine-grained contour features of the target in each of the plurality of consecutive images includes: segmenting the target into a plurality of portions, and inputting the plurality of portions successively into the first input end of the inner layer of the two-layer ViT feature fusion model to obtain the fine-grained static features; and segmenting a contour of the target into a plurality of contour portions, and inputting the plurality of contour portions successively into a second input end of the inner layer of the two-layer ViT feature fusion model to obtain the fine-grained contour features.
- fusing the first class of static features and the second class of static features in each of the plurality of consecutive images to obtain fused features includes: fusing the fine-grained static features and the fine-grained contour features by weighted average at an output end of the inner layer of the two-layer ViT feature fusion model to obtain the fused features.
- the inputting the fused features of at least some of the plurality of consecutive images successively into an input end of an outer layer of the two-layer ViT feature fusion model for classification training includes: inputting the fused features of the at least some of the plurality of consecutive images successively into an input layer of the outer layer of the two-layer ViT feature fusion model, and performing classification training based on normalized exponential loss, wherein a dimension of an embedding layer is set to be a positive integer multiple of 128, until the entire network is converged.
- a method for recognizing a moving target includes: obtaining a plurality of consecutive images of a target to be recognized; inputting the plurality of consecutive images successively into an input end of an inner layer of a two-layer ViT feature fusion model to obtain a first class of static features and a second class of static features of the target to be recognized in each of the plurality of consecutive images; fusing the first class of static features and the second class of static features in each of the plurality of consecutive images at an output end of the inner layer of the two-layer ViT feature fusion model to obtain fused features; inputting the fused features of at least some of the plurality of consecutive images successively into an input end of an outer layer of the two-layer ViT feature fusion model for fusing to obtain dynamic features; and obtaining a recognition result based on the dynamic features.
- the obtaining a recognition result based on the dynamic features includes: calculating cosine similarity between the dynamic features and each of all features stored in a base library of the moving target one by one; placing the cosine similarity in an order and obtaining a maximum cosine similarity; determining whether the maximum cosine similarity is greater than a predetermined recognition threshold; and obtaining a stored feature corresponding to the maximum cosine similarity, and taking identity information corresponding to the stored feature as a recognition result of the target to be recognized, in response to the maximum cosine similarity being greater than the predetermined recognition threshold.
- before the obtaining a plurality of consecutive images of a target to be recognized, the method further includes: establishing the base library of the moving target, wherein the base library of the moving target is configured to store all identity information of the targets to be stored and the stored features.
- an apparatus for recognizing a moving target includes a memory and a processor coupled to the memory.
- the memory stores program instructions, and the program instructions are configured to be executed by the processor to implement the method for recognizing the moving target according to any one of the above embodiments.
- a training method for recognizing a moving target, a method and an apparatus for recognizing a moving target includes: obtaining a plurality of images taken at various time points; obtaining a first class of static features and a second class of static features of the target in each of the plurality of images; fusing the first class of static features and the second class of static features in each of the plurality of images to obtain a fused feature; performing classification training on the fused feature of at least some of the plurality of images until the entire network is converged.
- the two classes of static features in one image are extracted, spliced and fused.
- a plurality of consecutive fused features are input to a classification trainer.
- FIG. 1 is a flow chart of a training method for recognizing a moving target according to an embodiment of the present disclosure.
- FIG. 2 is a flow chart of an operation S102 shown in FIG. 1 according to an embodiment of the present disclosure.
- FIG. 3 is a network structural schematic view of a training method for recognizing a moving target according to an embodiment of the present disclosure.
- FIG. 4 is a flow chart of a method for recognizing a moving target according to an embodiment of the present disclosure.
- FIG. 5 is a flow chart of an operation S305 shown in FIG. 4 according to an embodiment of the present disclosure.
- FIG. 6 is a flow chart of operations performed before the operation S401 shown in FIG. 5 according to an embodiment of the present disclosure.
- FIG. 7 is a diagram of an apparatus for recognizing a moving target according to an embodiment of the present disclosure.
- FIG. 8 is a structural schematic view of an apparatus for recognizing a moving target according to an embodiment of the present disclosure.
- FIG. 9 is a diagram of a computer-readable storage medium according to an embodiment of the present disclosure.
- FIG. 1 is a flow chart of a training method for recognizing a moving target according to an embodiment of the present disclosure.
- the method includes following operations.
- a pedestrian area is labeled as 255, and a background area is labeled as 0.
- the RGB images and the contour image of the same person are labeled with identity information. So far, by performing the above operations, a standard set of RGB images and a standard set of contour images are obtained based on a same set of template RGB images. Further, consecutive RGB images and consecutive contour images cooperatively constitute the plurality of consecutive images.
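- a minimal sketch of this labeling step is given below (this is illustrative only, not the patented implementation; the segmentation mask, frame arrays and identity value are assumptions):

```python
import numpy as np

def make_contour_image(person_mask: np.ndarray) -> np.ndarray:
    """Label the pedestrian area as 255 and the background area as 0.

    `person_mask` is assumed to be a 0/1 segmentation mask of the pedestrian
    in one frame, produced by any segmentation tool.
    """
    contour = np.zeros_like(person_mask, dtype=np.uint8)
    contour[person_mask > 0] = 255
    return contour

def label_sequence(rgb_frames, person_masks, identity_id):
    """Pair each standard RGB frame with its contour image and identity label."""
    samples = []
    for rgb, mask in zip(rgb_frames, person_masks):
        samples.append({
            "rgb": rgb,                        # normalized RGB frame
            "contour": make_contour_image(mask),
            "identity": identity_id,           # same identity for the whole sequence
        })
    return samples
```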
- the first class of static features of the target is obtained based on detailed features in the RGB images obtained in the operation S101, such as a clothing feature, a hairstyle feature, a backpack feature, and the like.
- the second class of static features of the target is obtained based on the contour image obtained in the operation S101.
- the first class of static features in the operation S102 refers to fine-grained static features of the target in each image
- the second class of static features refers to the fine-grained contour features.
- coarse-grained static features and coarse-grained contour features of the target in each image may be extracted, serving as the first class of static features and the second class of static features, respectively. Recognition of the moving target may also be achieved in this way.
- FIG. 2 is a flow chart of an operation S102 shown in FIG. 1 according to an embodiment of the present disclosure.
- the operation S102 may include following operations.
- the moving target is segmented into a plurality of portions, the plurality of portions are successively input into a first input end of an inner layer of a two-layer Vision Transformer (ViT) feature fusion model to obtain the fine-grained static features.
- the ViT-based two-layer feature fusion model may process image sequence data in which the target is continuously shown.
- the ViT algorithm may generate a small amount of computation for training and inference, and the ViT algorithm may be lightweight.
- the static features corresponding to the target may also be obtained by applying a feature fusion model based on a convolutional neural network algorithm to perform inference and computation on the image.
- FIG. 3 is a network structural schematic view of a training method for recognizing a moving target according to an embodiment of the present disclosure.
- first, the target may be segmented.
- the RGB image may be segmented into 6 equally sized portions in an order of a head of the target, a middle portion of the target, and a lower portion of the target. Subsequently, the 6 portions are successively input into the first input end of the inner layer of the two-layer ViT feature fusion model, i.e., into an RGB image input end, such that the fine-grained static features of the target are obtained.
- a contour of the target is segmented into a plurality of portions in the manner mentioned above, and the plurality of portions are input into a second input end of the inner layer of the two-layer ViT feature fusion model to obtain the fine-grained contour features.
- the contour of the target is segmented into 6 equally sized portions. Subsequently, the 6 portions are successively input into the second input end of the inner layer of the two-layer ViT feature fusion model, i.e., a contour image input end, to obtain the fine-grained contour features of the target.
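- as a rough illustration of this segmentation and inner-layer encoding (not the patented implementation; tensor shapes, the linear patch embedding, and the encoder depth and head count are assumptions), one portion per token could be fed to a small Transformer encoder as follows:

```python
import torch
import torch.nn as nn

def split_into_portions(image: torch.Tensor, num_portions: int = 6) -> torch.Tensor:
    """Split a (C, H, W) image into `num_portions` equally sized horizontal strips.

    Assumes H is divisible by `num_portions`; returns a (num_portions, C*H/num_portions*W)
    tensor, one flattened portion per row, ready to be embedded as inner-layer tokens.
    """
    portions = torch.chunk(image, num_portions, dim=1)   # split along the height axis
    return torch.stack([p.reshape(-1) for p in portions])

class InnerLayerEncoder(nn.Module):
    """Hypothetical inner-layer ViT branch; one such branch would serve the RGB
    portions (first input end) and another the contour portions (second input end)."""

    def __init__(self, portion_dim: int, embed_dim: int = 128, depth: int = 2):
        super().__init__()
        self.embed = nn.Linear(portion_dim, embed_dim)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, portions: torch.Tensor) -> torch.Tensor:
        tokens = self.embed(portions).unsqueeze(0)        # (1, 6, embed_dim)
        encoded = self.encoder(tokens)
        return encoded.mean(dim=1).squeeze(0)             # one feature vector per image
```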
- the first class of static features and the second class of static features in each image are fused to obtain the fused feature.
- the first class of static features is obtained based on one RGB image, and the second class of static features is obtained based on one contour image.
- the fine-grained static features and the fine-grained contour features are fused by weighted average at an output end of the inner layer of the two-layer ViT feature fusion model to obtain the fused feature.
- a weight factor of the fine-grained static features is set to be 0.5
- a weight factor of the fine-grained contour features is 0.5.
- the fused feature is a sum of a product of 0.5 and the fine-grained static features and a product of 0.5 and the fine-grained contour features.
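- a one-line sketch of this weighted-average fusion at the output end of the inner layer, using the 0.5/0.5 weights mentioned above (the feature tensors are assumed to come from the inner-layer branches):

```python
import torch

def fuse_features(static_feat: torch.Tensor, contour_feat: torch.Tensor,
                  w_static: float = 0.5, w_contour: float = 0.5) -> torch.Tensor:
    """Fused feature = 0.5 * fine-grained static features + 0.5 * fine-grained contour features."""
    return w_static * static_feat + w_contour * contour_feat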
- classification training is performed on the fused features of at least some of the images until the entire network is converged.
- the at least some of the images refer to some consecutive frames of images selected from all of the plurality of images obtained in the operation S101.
- the fused features corresponding to the some consecutive frames of images may express the dynamic features of the target while the target is walking, such that an expression ability of the model may be improved.
- Preferably, five consecutive frames of RGB images and contour images are selected for classification training. In this way, the accuracy of the recognition result may be ensured, and the amount of computation may be reduced as much as possible.
- fused features of the five frames of images are successively input to the input end of the outer layer of the two-layer ViT feature fusion model for classification training until the entire network is converged.
- classification training based on normalized exponential loss (i.e., softmax loss) may be applied, wherein a dimension of an embedding layer is set to be a positive integer multiple of 128, such as 128, 512, 1024, and the like, until the entire network is converged to obtain a recognition result of the moving target that meets a predefined condition.
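- a hedged training-loop sketch of the outer-layer classification training is given below; it assumes five fused per-frame features per sample, an embedding dimension that is a multiple of 128 (1024 here), a softmax (normalized exponential) classification head, and a hypothetical number of identities and optimizer supplied by the caller:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OuterLayerClassifier(nn.Module):
    """Hypothetical outer layer: fuses the fused features of five consecutive
    frames into one dynamic feature, then classifies the identity."""

    def __init__(self, feat_dim: int = 128, embed_dim: int = 1024, num_ids: int = 1000):
        super().__init__()
        assert embed_dim % 128 == 0, "embedding dim is a positive integer multiple of 128"
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.embedding = nn.Linear(feat_dim, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_ids)

    def forward(self, frame_features: torch.Tensor):
        # frame_features: (batch, 5, feat_dim) fused features of 5 consecutive frames
        encoded = self.encoder(frame_features).mean(dim=1)
        dynamic = self.embedding(encoded)                  # dynamic feature vector
        return dynamic, self.classifier(dynamic)

def train_step(model, optimizer, frame_features, identity_labels):
    """One classification-training step with normalized exponential (softmax) loss."""
    _, logits = model(frame_features)
    loss = F.cross_entropy(logits, identity_labels)        # softmax + negative log-likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```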
- the fine-grained static features and the fine-grained contour features are extracted from one RGB image and one contour image.
- the two-layer ViT feature fusion model may be applied to fuse the three types of features. In this way, the final trained model has a stronger feature expression ability, higher robustness and a better differentiation ability. Applying the model to recognize the moving target may improve the accuracy of the recognition result.
- FIG. 4 is a flow chart of a method for recognizing a moving target according to an embodiment of the present disclosure.
- the method for recognizing the moving target according to the embodiment of the present disclosure includes following operations.
- a video showing that the target to be recognized is moving is obtained and pre-processed first.
- a target RGB image sequence is obtained by a pedestrian detection and tracking tool.
- the RGB images are then normalized to obtain a standard target RGB image sequence.
- the standard target RGB image sequence is copied, and the foreground and the background of the target are annotated to obtain the target contour images.
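- a sketch of this pre-processing is given below; `detect_and_track` and `segment_person` are placeholder names for whatever pedestrian detection/tracking tool and foreground segmentation step are used, and the normalization size is an assumption:

```python
import numpy as np
import cv2

STANDARD_SIZE = (128, 256)  # (width, height); an assumed normalization size

def normalize_crops(crops):
    """Resize pedestrian crops to a standard size to obtain the standard RGB sequence."""
    return [cv2.resize(c, STANDARD_SIZE) for c in crops]

def preprocess_video(frames, detect_and_track, segment_person):
    """Pre-process a video of the target to be recognized.

    `detect_and_track` and `segment_person` stand in for the pedestrian
    detection/tracking tool and the foreground annotation step mentioned above.
    """
    crops = detect_and_track(frames)             # per-frame pedestrian crops
    rgb_seq = normalize_crops(crops)             # standard target RGB image sequence
    contour_seq = []
    for rgb in rgb_seq:
        mask = segment_person(rgb)               # foreground/background annotation
        contour = np.where(mask > 0, 255, 0).astype(np.uint8)
        contour_seq.append(contour)
    return rgb_seq, contour_seq
```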
- the RGB images and the contour images obtained in the operation S301 are segmented in the same manner and are successively input into the first input end and the second input end of the inner layer of the two-layer ViT feature fusion model, respectively, to obtain the fine-grained static features and the fine-grained contour features.
- the operation S303 is similar to the operation S103 in FIG. 1.
- the operation S303 will not be repeatedly described for providing a concise description.
- the fused features of at least some of the images are fused to obtain the dynamic features.
- the fused features corresponding to the plurality of consecutive frames of images are input to the input end of the outer layer of the two-layer ViT feature fusion model and are fused to obtain the dynamic features corresponding to the target to be recognized.
- the dimension of the embedding layer is set to be 1024, and the output dynamic features are represented by a 1024-dimensional feature vector.
- the recognition result is obtained based on the dynamic features.
- FIG. 5 is a flow chart of an operation S305 shown in FIG. 4 according to an embodiment of the present disclosure.
- the operation S305 may include following operations.
- 100 features are stored in the base library of the moving target.
- the dynamic features of the target to be recognized are compared to each of the 100 stored features one by one, and the cosine similarity therebetween is calculated. In this way, 100 cosine similarity values are obtained.
- the cosine similarity values are placed in an order, and a maximum cosine similarity value is obtained.
- the above 100 cosine similarity values are placed in the order, such that the maximum cosine similarity value is obtained.
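- a plain-NumPy sketch of this matching step is shown below: compute the cosine similarity against each stored feature, take the maximum, and accept the match only if it exceeds the recognition threshold (the threshold value 0.7 is an assumption):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def recognize(dynamic_feature: np.ndarray, base_library: dict, threshold: float = 0.7):
    """Return the identity whose stored feature is most similar to the query,
    or None if the maximum cosine similarity does not exceed the threshold."""
    scores = {identity: cosine_similarity(dynamic_feature, stored)
              for identity, stored in base_library.items()}
    best_identity = max(scores, key=scores.get)
    if scores[best_identity] > threshold:
        return best_identity, scores[best_identity]
    return None, scores[best_identity]
```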
- before performing the operation S401, the method further includes a process of establishing the base library of the moving target.
- FIG. 6 is a flow chart of operations performed before the operation S401 shown in FIG. 5 according to an embodiment of the present disclosure.
- the process of establishing the base library of the moving target includes following operations.
- each of all the videos is pre-processed, and a plurality of consecutive images in each of the videos are obtained successively.
- the plurality of images are input into the trained two-layer ViT feature fusion model to obtain the dynamic features corresponding to each pedestrian target to be stored.
- a mapping relationship between each pedestrian to be stored and the corresponding dynamic features is constructed, and the mapping relationship is stored into the base library of the moving target.
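- a sketch of building the base library as a simple identity-to-feature mapping follows; `extract_dynamic_feature` is assumed to wrap the trained two-layer ViT feature fusion model, and `preprocess_video` is assumed to be a pre-applied version of the routine sketched earlier:

```python
def build_base_library(videos_by_identity, preprocess_video, extract_dynamic_feature):
    """Construct the mapping from each pedestrian identity to its dynamic features.

    `videos_by_identity` maps identity info -> raw video frames; the callables are
    placeholders for the pre-processing step and the trained feature extractor.
    """
    base_library = {}
    for identity, frames in videos_by_identity.items():
        rgb_seq, contour_seq = preprocess_video(frames)
        base_library[identity] = extract_dynamic_feature(rgb_seq, contour_seq)
    return base_library
```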
- the fine-grained static features and the fine-grained contour features in one RGB image and one contour image are extracted.
- the two classes of static features are fully utilized, and the dynamic features of pedestrians included in a sequence of consecutive frames in the video are focused, such that the problem of the feature modality in the art being unitary may be solved.
- the two-layer ViT feature fusion model may be applied to fuse the three types of features, effectively improving the accuracy of recognition result.
- FIG. 7 is a diagram of an apparatus for recognizing a moving target according to an embodiment of the present disclosure.
- the apparatus includes an obtaining module 10, a fusing module 12 and a training module 14.
- the obtaining module 10 is configured to obtain a plurality of images taken at various time points and to obtain the first class of static features and the second class of static features of the target in each of the plurality of images.
- the fusing module 12 is configured to fuse the first class of static features and the second class of static features in each of the plurality of images to obtain the fused feature.
- the training module 14 is configured to perform classification training on the fused feature of at least some of the plurality of images until the entire network is converged.
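- as a rough illustration only (the module wiring below is an assumption; the names mirror FIG. 7), the three modules could be organized as follows:

```python
class RecognitionTrainingApparatus:
    """Hypothetical organization of the apparatus of FIG. 7: an obtaining module,
    a fusing module and a training module, invoked sequentially."""

    def __init__(self, obtaining_module, fusing_module, training_module):
        self.obtaining_module = obtaining_module   # yields per-image static and contour features
        self.fusing_module = fusing_module         # fuses the two classes of static features
        self.training_module = training_module     # runs classification training to convergence

    def run(self, video_frames):
        static_feats, contour_feats = self.obtaining_module(video_frames)
        fused = [self.fusing_module(s, c) for s, c in zip(static_feats, contour_feats)]
        return self.training_module(fused)
```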
- FIG. 8 is a structural schematic view of an apparatus for recognizing a moving target according to an embodiment of the present disclosure.
- the apparatus 20 includes a memory 100 and a processor 102 coupled to the memory 100.
- Program instructions are stored in the memory 100.
- the processor 102 is configured to execute the program instructions to implement the method according to any one of embodiments in the above.
- the processor 102 may also be referred to as a Central Processing Unit (CPU).
- the processor 102 may be an integrated circuit chip able to process signals.
- the processor 102 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic devices, a discrete gate or transistor logic device, or a discrete hardware component.
- the general purpose processor may be a microprocessor or any conventional processor.
- the processor 102 may be implemented by a plurality of integrated circuit chips together.
- FIG. 9 is a diagram of a computer-readable storage medium according to an embodiment of the present disclosure.
- the computer-readable storage medium 30 stores computer programs 300, which can be read by a computer.
- the computer programs 300 can be executed by a processor to implement the method mentioned in any of the above embodiments.
- the computer programs 300 may be stored in a form of a software product on the computer readable storage medium 30 as described above, and may include a number of instructions to enable a computer device (which may be a personal computer, a server, or a network device, and the like) or a processor to perform all or some of the operations of the method described in the various embodiments of the present disclosure.
- the computer-readable storage medium 30 that has the storage function may be a universal serial bus disc, a portable hard disc, a Read-Only Memory (ROM), a Random Access Memory (RAM), magnetic discs or optical discs, or various media that can store program codes, or terminal devices such as a computer, a server, a mobile phone, a tablet, and the like.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Molecular Biology (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Biomedical Technology (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Human Computer Interaction (AREA)
- Biodiversity & Conservation Biology (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Image Analysis (AREA)
Claims (9)
- A training method of recognizing a moving target, comprising: obtaining a plurality of consecutive images; inputting the plurality of consecutive images successively into an input end of an inner layer of a two-layer ViT feature fusion model to obtain a first class of static features and a second class of static features of the target in each of the plurality of consecutive images; fusing the first class of static features and the second class of static features in each of the plurality of consecutive images at an output end of the inner layer of the two-layer ViT feature fusion model to obtain fused features; and inputting the fused features of at least some of the plurality of consecutive images successively into an input end of an outer layer of the two-layer ViT feature fusion model for classification training until the entire network is converged.
- The training method according to claim 1, wherein obtaining the first class of static features and the second class of static features of the target in each of the plurality of consecutive images comprises: obtaining fine-grained static features and fine-grained contour features of the target in each of the plurality of consecutive images.
- The training method according to claim 2, wherein the obtaining fine-grained static features and fine-grained contour features of the target in each of the plurality of consecutive images comprises: segmenting the target into a plurality of portions, and inputting the plurality of portions successively into the first input end of the inner layer of the two-layer ViT feature fusion model to obtain the fine-grained static features; and segmenting a contour of the target into a plurality of contour portions, and inputting the plurality of contour portions successively into a second input end of the inner layer of the two-layer ViT feature fusion model to obtain the fine-grained contour features.
- The training method according to claim 3, wherein fusing the first class of static features and the second class of static features in each of the plurality of consecutive images to obtain fused features comprises: fusing the fine-grained static features and the fine-grained contour features by weighted average at an output end of the inner layer of the two-layer ViT feature fusion model to obtain the fused features.
- The training method according to claim 1, wherein the inputting the fused features of at least some of the plurality of consecutive images successively into an input end of an outer layer of the two-layer ViT feature fusion model for classification training comprises: inputting the fused features of the at least some of the plurality of consecutive images successively into an input layer of the outer layer of the two-layer ViT feature fusion model, and performing classification training based on normalized exponential loss, wherein a dimension of an embedding layer is set to be a positive integer multiple of 128, until the entire network is converged.
- A method for recognizing a moving target, comprising: obtaining a plurality of consecutive images of a target to be recognized; inputting the plurality of consecutive images successively into an input end of an inner layer of a two-layer ViT feature fusion model to obtain a first class of static features and a second class of static features of the target to be recognized in each of the plurality of consecutive images; fusing the first class of static features and the second class of static features in each of the plurality of consecutive images at an output end of the inner layer of the two-layer ViT feature fusion model to obtain fused features; inputting the fused features of at least some of the plurality of consecutive images successively into an input end of an outer layer of the two-layer ViT feature fusion model for fusing to obtain dynamic features; and obtaining a recognition result based on the dynamic features.
- The method according to claim 6, wherein the obtaining a recognition result based on the dynamic features comprises: calculating cosine similarity between the dynamic features and each of all features stored in a base library of the moving target one by one; placing the cosine similarity in an order and obtaining a maximum cosine similarity; determining whether the maximum cosine similarity is greater than a predetermined recognition threshold; and obtaining a stored feature corresponding to the maximum cosine similarity, and taking identity information corresponding to the stored feature as a recognition result of the target to be recognized, in response to the maximum cosine similarity being greater than the predetermined recognition threshold.
- The method according to claim 7, wherein before the obtaining a plurality of consecutive images of a target to be recognized, the method further comprises: establishing the base library of the moving target, wherein the base library of the moving target is configured to store all identity information of the target to be stored and the stored features.
- An apparatus for recognizing a moving target, comprising a memory and a processor coupled to the memory, wherein the memory stores program instructions, the program instructions are configured to be executed by the processor to implement the method for recognizing the moving target according to any one of claims 6 to 8.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110802833.X | 2021-07-15 | ||
| CN202110802833.XA CN113255630B (en) | 2021-07-15 | 2021-07-15 | Moving target recognition training method, moving target recognition method and device |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2023284182A1 true WO2023284182A1 (en) | 2023-01-19 |
Family
ID=77180490
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2021/128515 Ceased WO2023284182A1 (en) | 2021-07-15 | 2021-11-03 | Training method for recognizing moving target, method and device for recognizing moving target |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN113255630B (en) |
| WO (1) | WO2023284182A1 (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119169620A (en) * | 2024-08-27 | 2024-12-20 | 清华大学 | Video fine-grained understanding and annotation method and device based on multimodal large model |
Families Citing this family (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113255630B (en) * | 2021-07-15 | 2021-10-15 | 浙江大华技术股份有限公司 | Moving target recognition training method, moving target recognition method and device |
| CN113688745B (en) * | 2021-08-27 | 2024-04-05 | 大连海事大学 | A gait recognition method based on automatic mining of related nodes and statistical information |
| CN114724176A (en) * | 2022-03-09 | 2022-07-08 | 海纳云物联科技有限公司 | Tumble identification method |
| CN116110131B (en) * | 2023-04-11 | 2023-06-30 | 深圳未来立体教育科技有限公司 | Body interaction behavior recognition method and VR system |
| CN116844217B (en) * | 2023-08-30 | 2023-11-14 | 成都睿瞳科技有限责任公司 | Image processing system and method for generating face data |
| CN119295741B (en) * | 2024-12-10 | 2025-04-08 | 哈尔滨工程大学三亚南海创新发展基地 | Target tracking method, device, electronic equipment and storage medium for polar unmanned boat |
Family Cites Families (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109766925B (en) * | 2018-12-20 | 2021-05-11 | 深圳云天励飞技术有限公司 | Feature fusion method and device, electronic equipment and storage medium |
| US10977525B2 (en) * | 2019-03-29 | 2021-04-13 | Fuji Xerox Co., Ltd. | Indoor localization using real-time context fusion of visual information from static and dynamic cameras |
| CN110246518A (en) * | 2019-06-10 | 2019-09-17 | 深圳航天科技创新研究院 | Speech-emotion recognition method, device, system and storage medium based on more granularity sound state fusion features |
| CN111160194B (en) * | 2019-12-23 | 2022-06-24 | 浙江理工大学 | A still gesture image recognition method based on multi-feature fusion |
| CN111582126B (en) * | 2020-04-30 | 2024-02-27 | 浙江工商大学 | Pedestrian re-recognition method based on multi-scale pedestrian contour segmentation fusion |
| CN111814857B (en) * | 2020-06-29 | 2021-07-06 | 浙江大华技术股份有限公司 | Target re-identification method, network training method thereof and related device |
| CN111860291A (en) * | 2020-07-16 | 2020-10-30 | 上海交通大学 | Multimodal pedestrian identification method and system based on pedestrian appearance and gait information |
| CN112633058B (en) * | 2020-11-05 | 2024-05-31 | 北京工业大学 | Feature fusion-based frontal gait recognition method |
- 2021-07-15 CN CN202110802833.XA patent/CN113255630B/en active Active
- 2021-11-03 WO PCT/CN2021/128515 patent/WO2023284182A1/en not_active Ceased
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20190095764A1 (en) * | 2017-09-26 | 2019-03-28 | Panton, Inc. | Method and system for determining objects depicted in images |
| CN110555406A (en) * | 2019-08-31 | 2019-12-10 | 武汉理工大学 | Video moving target identification method based on Haar-like characteristics and CNN matching |
| CN112686193A (en) * | 2021-01-06 | 2021-04-20 | 东北大学 | Action recognition method and device based on compressed video and computer equipment |
| CN113096131A (en) * | 2021-06-09 | 2021-07-09 | 紫东信息科技(苏州)有限公司 | Gastroscope picture multi-label classification system based on VIT network |
| CN113255630A (en) * | 2021-07-15 | 2021-08-13 | 浙江大华技术股份有限公司 | Moving target recognition training method, moving target recognition method and device |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119169620A (en) * | 2024-08-27 | 2024-12-20 | 清华大学 | Video fine-grained understanding and annotation method and device based on multimodal large model |
| CN119169620B (en) * | 2024-08-27 | 2025-09-19 | 清华大学 | Video fine granularity understanding labeling method and device based on multi-mode large model |
Also Published As
| Publication number | Publication date |
|---|---|
| CN113255630B (en) | 2021-10-15 |
| CN113255630A (en) | 2021-08-13 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US12094247B2 (en) | Expression recognition method and related apparatus | |
| WO2023284182A1 (en) | Training method for recognizing moving target, method and device for recognizing moving target | |
| US12131580B2 (en) | Face detection method, apparatus, and device, and training method, apparatus, and device for image detection neural network | |
| CN110555481B (en) | Portrait style recognition method, device and computer readable storage medium | |
| US12314342B2 (en) | Object recognition method and apparatus | |
| CN109902548B (en) | Object attribute identification method and device, computing equipment and system | |
| US12210687B2 (en) | Gesture recognition method, electronic device, computer-readable storage medium, and chip | |
| CN112070044B (en) | Video object classification method and device | |
| WO2021068323A1 (en) | Multitask facial action recognition model training method, multitask facial action recognition method and apparatus, computer device, and storage medium | |
| WO2020103700A1 (en) | Image recognition method based on micro facial expressions, apparatus and related device | |
| WO2020228446A1 (en) | Model training method and apparatus, and terminal and storage medium | |
| US10339369B2 (en) | Facial expression recognition using relations determined by class-to-class comparisons | |
| CN114519877A (en) | Face recognition method, face recognition device, computer equipment and storage medium | |
| CN110866469A (en) | A method, device, equipment and medium for facial feature recognition | |
| CN111108508A (en) | Facial emotion recognition method, smart device and computer-readable storage medium | |
| CN111126358B (en) | Face detection method, device, storage medium and equipment | |
| CN114038045A (en) | Cross-modal face recognition model construction method and device and electronic equipment | |
| CN116152938A (en) | Identity recognition model training and electronic resource transfer method, device and equipment | |
| CN113822871A (en) | Object detection method, device, storage medium and device based on dynamic detection head | |
| CN114241559A (en) | Face recognition method, device, equipment and storage medium | |
| CN114764936A (en) | Image key point detection method and related equipment | |
| US12299911B2 (en) | Image processing method and apparatus, and storage medium | |
| CN116912924A (en) | A target image recognition method and device | |
| WO2020132825A1 (en) | Image processing apparatus and image processing method | |
| CN113887338A (en) | Method, device, terminal and storage medium for determining obstacle attributes |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 21949954; Country of ref document: EP; Kind code of ref document: A1 |
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21949954; Country of ref document: EP; Kind code of ref document: A1 |
| | 32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 09.09.2024) |