WO2023284182A1 - Training method for recognizing moving target, method and device for recognizing moving target - Google Patents
Training method for recognizing moving target, method and device for recognizing moving target
- Publication number
- WO2023284182A1 (PCT application PCT/CN2021/128515)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- features
- target
- layer
- class
- consecutive images
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/761—Proximity, similarity or dissimilarity measures
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/103—Static body considered as a whole, e.g. static pedestrian or occupant recognition
Definitions
- the present disclosure relates to the field of computer vision and machine learning, and in particular to a training method for recognizing a moving target, a method and an apparatus for recognizing a moving target.
- Recognizing a moving target refers to recognizing a pedestrian target in an image, wherein the image is captured while the pedestrian is walking.
- relatively developed methods in the art for recognizing the pedestrian may include two types: a person re-identification method and a gait recognition method.
- the former method may extract static external features from the image, such as clothing of the pedestrian, a hairstyle of the pedestrian, a backpack of the pedestrian, an umbrella of the pedestrian, and the like.
- the latter method may learn dynamic features, such as a walking posture, an amplitude of arm swinging, head shaking and shoulder shrugging, sensitivity of a motor nerve, and the like, based on continuous movements of the pedestrian.
- the applicant discovers that, when the methods in the art are performed to recognize the moving target, a single feature is relied on, such as a static RGB image or a contour image. Robustness of such a feature is not sufficient, and therefore accuracy of the recognition result may be low.
- some technical solutions in the art recognize the moving target based on feature fusion. For example, global features of an RGB image may be fused with local features of the same RGB image. In this way, the feature modality is relatively unitary; performance of an apparatus may be sacrificed, whereas accuracy of matching may not be improved.
- the present disclosure provides a training method for recognizing a moving target, a method and an apparatus for recognizing a moving target. In this way, robustness and accuracy of recognizing the moving target may be improved.
- a training method of recognizing a moving target includes: obtaining a plurality of consecutive images; inputting the plurality of consecutive images successively into an input end of an inner layer of a two-layer ViT feature fusion model to obtain a first class of static features and a second class of static features of the target in each of the plurality of consecutive images; fusing the first class of static features and the second class of static features in each of the plurality of consecutive images at an output end of the inner layer of the two-layer ViT feature fusion model to obtain fused features; and inputting the fused features of at least some of the plurality of consecutive images successively into an input end of an outer layer of the two-layer ViT feature fusion model for classification training until the entire network is converged.
- obtaining the first class of static features and the second class of static features of the target in each of the plurality of consecutive images includes: obtaining fine-grained static features and fine-grained contour features of the target in each of the plurality of consecutive images.
- the obtaining fine-grained static features and fine-grained contour features of the target in each of the plurality of consecutive images includes: segmenting the target into a plurality of portions, and inputting the plurality of portions successively into the first input end of the inner layer of the two-layer ViT feature fusion model to obtain the fine-grained static features; and segmenting a contour of the target into a plurality of contour portions, and inputting the plurality of contour portions successively into a second input end of the inner layer of the two-layer ViT feature fusion model to obtain the fine-grained contour features.
- fusing the first class of static features and the second class of static features in each of the plurality of consecutive images to obtain fused features includes: fusing the fine-grained static features and the fine-grained contour features by weighted average at an output end of the inner layer of the two-layer ViT feature fusion model to obtain the fused features.
- the inputting the fused features of at least some of the plurality of consecutive images successively into an input end of an outer layer of the two-layer ViT feature fusion model for classification training includes: inputting the fused features of the at least some of the plurality of consecutive images successively into an input layer of the outer layer of the two-layer ViT feature fusion model, and performing classification training based on normalized exponential loss, wherein a dimension of an embedding layer is set to be a positive integer multiple of 128, until the entire network is converged.
- a method for recognizing a moving target includes: obtaining a plurality of consecutive images of a target to be recognized; inputting the plurality of consecutive images successively into an input end of an inner layer of a two-layer ViT feature fusion model to obtain a first class of static features and a second class of static features of the target to be recognized in each of the plurality of consecutive images; fusing the first class of static features and the second class of static features in each of the plurality of consecutive images at an output end of the inner layer of the two-layer ViT feature fusion model to obtain fused features; inputting the fused features of at least some of the plurality of consecutive images successively into an input end of an outer layer of the two-layer ViT feature fusion model for fusing to obtain dynamic features; and obtaining a recognition result based on the dynamic features.
- the obtaining a recognition result based on the dynamic features includes: calculating cosine similarity between the dynamic features and each of all features stored in a base library of the moving target one by one; placing the cosine similarity in an order and obtaining a maximum cosine similarity; determining whether the maximum cosine similarity is greater than a predetermined recognition threshold; and obtaining a stored feature corresponding to the maximum cosine similarity, and taking identity information corresponding to the stored feature as a recognition result of the target to be recognized, in response to the maximum cosine similarity being greater than the predetermined recognition threshold.
- before the obtaining a plurality of consecutive images of a target to be recognized, the method further includes: establishing the base library of the moving target, wherein the base library of the moving target is configured to store all identity information of the targets to be stored and the stored features.
- an apparatus for recognizing a moving target includes a memory and a processor coupled to the memory.
- the memory stores program instructions, and the program instructions are configured to be executed by the processor to implement the method for recognizing the moving target according to any one of the above embodiments.
- a training method for recognizing a moving target, a method and an apparatus for recognizing a moving target includes: obtaining a plurality of images taken at various time points; obtaining a first class of static features and a second class of static features of the target in each of the plurality of images; fusing the first class of static features and the second class of static features in each of the plurality of images to obtain a fused feature; performing classification training on the fused feature of at least some of the plurality of images until the entire network is converged.
- the two classes of static features in one image are extracted, spliced and fused.
- a plurality of consecutive fused features are input to a classification trainer.
- FIG. 1 is a flow chart of a training method for recognizing a moving target according to an embodiment of the present disclosure.
- FIG. 2 is a flow chart of an operation S102 shown in FIG. 1 according to an embodiment of the present disclosure.
- FIG. 3 is a network structural schematic view of a training method for recognizing a moving target according to an embodiment of the present disclosure.
- FIG. 4 is a flow chart of a method for recognizing a moving target according to an embodiment of the present disclosure.
- FIG. 5 is a flow chart of an operation S305 shown in FIG. 4 according to an embodiment of the present disclosure.
- FIG. 6 is a flow chart of operations performed before the operation S401 shown in FIG. 5 according to an embodiment of the present disclosure.
- FIG. 7 is a diagram of an apparatus for recognizing a moving target according to an embodiment of the present disclosure.
- FIG. 8 is a structural schematic view of an apparatus for recognizing a moving target according to an embodiment of the present disclosure.
- FIG. 9 is a diagram of a computer-readable storage medium according to an embodiment of the present disclosure.
- FIG. 1 is a flow chart of a training method for recognizing a moving target according to an embodiment of the present disclosure.
- the method includes following operations.
- a pedestrian area is labeled as 255, and a background area is labeled as 0.
- the RGB images and the contour image of the same person are labeled with identity information. So far, by performing the above operations, a standard set of RGB images and a standard set of contour images are obtained based on a same set of template RGB images. Further, consecutive RGB images and consecutive contour images cooperatively constitute the plurality of consecutive images.
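- a minimal sketch of this labeling step is given below (this is illustrative only, not the patented implementation; the segmentation mask, frame arrays and identity value are assumptions):

```python
import numpy as np

def make_contour_image(person_mask: np.ndarray) -> np.ndarray:
    """Label the pedestrian area as 255 and the background area as 0.

    `person_mask` is assumed to be a 0/1 segmentation mask of the pedestrian
    in one frame, produced by any segmentation tool.
    """
    contour = np.zeros_like(person_mask, dtype=np.uint8)
    contour[person_mask > 0] = 255
    return contour

def label_sequence(rgb_frames, person_masks, identity_id):
    """Pair each standard RGB frame with its contour image and identity label."""
    samples = []
    for rgb, mask in zip(rgb_frames, person_masks):
        samples.append({
            "rgb": rgb,                        # normalized RGB frame
            "contour": make_contour_image(mask),
            "identity": identity_id,           # same identity for the whole sequence
        })
    return samples
```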
- the first class of static features of the target is obtained based on detailed features in the RGB images obtained in the operation S101, such as a clothing feature, a hairstyle feature, a backpack feature, and the like.
- the second class of static features of the target is obtained based on the contour image obtained in the operation S101.
- the first class of static features in the operation S102 refers to fine-grained static features of the target in each image
- the second class of static features refers to the fine-grained contour features.
- coarse-grained static features and coarse-grained contour features of the target in each image may be extracted, serving as the first class of static features and the second class of static features, respectively. Recognition of the moving target may also be achieved in this way.
- FIG. 2 is a flow chart of an operation S102 shown in FIG. 1 according to an embodiment of the present disclosure.
- the operation S102 may include following operations.
- the moving target is segmented into a plurality of portions, the plurality of portions are successively input into a first input end of an inner layer of a two-layer Vision Transformer (ViT) feature fusion model to obtain the fine-grained static features.
- the ViT-based two-layer feature fusion model may process image sequence data in which the target is continuously shown.
- the ViT algorithm may generate a small amount of computation for training and inference, and the ViT algorithm may be lightweight.
- the static features corresponding to the target may also be obtained by applying a feature fusion model based on a convolutional neural network algorithm to perform inference and computation on the image.
- FIG. 3 is a network structural schematic view of a training method for recognizing a moving target according to an embodiment of the present disclosure.
- first, the target may be segmented.
- the RGB image may be segmented into 6 equally sized portions in an order of a head of the target, a middle portion of the target, and a lower portion of the target. Subsequently, the 6 portions are successively input into the first input end of the inner layer of the two-layer ViT feature fusion model, i.e., into an RGB image input end, such that the fine-grained static features of the target are obtained.
- a contour of the target is segmented into a plurality of portions in the manner mentioned above, and the plurality of portions are input into a second input end of the inner layer of the two-layer ViT feature fusion model to obtain the fine-grained contour features.
- the contour of the target is segmented into 6 equally sized portions. Subsequently, the 6 portions are successively input into the second input end of the inner layer of the two-layer ViT feature fusion model, i.e., a contour image input end, to obtain the fine-grained contour features of the target.
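- as a rough illustration of this segmentation and inner-layer encoding (not the patented implementation; tensor shapes, the linear patch embedding, and the encoder depth and head count are assumptions), one portion per token could be fed to a small Transformer encoder as follows:

```python
import torch
import torch.nn as nn

def split_into_portions(image: torch.Tensor, num_portions: int = 6) -> torch.Tensor:
    """Split a (C, H, W) image into `num_portions` equally sized horizontal strips.

    Assumes H is divisible by `num_portions`; returns a (num_portions, C*H/num_portions*W)
    tensor, one flattened portion per row, ready to be embedded as inner-layer tokens.
    """
    portions = torch.chunk(image, num_portions, dim=1)   # split along the height axis
    return torch.stack([p.reshape(-1) for p in portions])

class InnerLayerEncoder(nn.Module):
    """Hypothetical inner-layer ViT branch; one such branch would serve the RGB
    portions (first input end) and another the contour portions (second input end)."""

    def __init__(self, portion_dim: int, embed_dim: int = 128, depth: int = 2):
        super().__init__()
        self.embed = nn.Linear(portion_dim, embed_dim)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, portions: torch.Tensor) -> torch.Tensor:
        tokens = self.embed(portions).unsqueeze(0)        # (1, 6, embed_dim)
        encoded = self.encoder(tokens)
        return encoded.mean(dim=1).squeeze(0)             # one feature vector per image
```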
- the first class of static features and the second class of static features in each image are fused to obtain the fused feature.
- the first class of static features is obtained based on one RGB image, and the second class of static features is obtained based on one contour image.
- the fine-grained static features and the fine-grained contour features are fused by weighted average at an output end of the inner layer of the two-layer ViT feature fusion model to obtain the fused feature.
- a weight factor of the fine-grained static features is set to be 0.5
- a weight factor of the fine-grained contour features is 0.5.
- the fused feature is a sum of a product of 0.5 and the fine-grained static features and a product of 0.5 and the fine-grained contour features.
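- a one-line sketch of this weighted-average fusion at the output end of the inner layer, using the 0.5/0.5 weights mentioned above (the feature tensors are assumed to come from the inner-layer branches):

```python
import torch

def fuse_features(static_feat: torch.Tensor, contour_feat: torch.Tensor,
                  w_static: float = 0.5, w_contour: float = 0.5) -> torch.Tensor:
    """Fused feature = 0.5 * fine-grained static features + 0.5 * fine-grained contour features."""
    return w_static * static_feat + w_contour * contour_feat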
- classification training is performed on the fused features of at least some of the images until the entire network is converged.
- the at least some of the images refer to some consecutive frames of images selected from all of the plurality of images obtained in the operation S101.
- the fused features corresponding to the some consecutive frames of images may express the dynamic features of the target while the target is walking, such that an expression ability of the model may be improved.
- Preferably, five consecutive frames of RGB images and contour images are selected for classification training. In this way, the accuracy of the recognition result may be ensured, and the amount of computation may be reduced as much as possible.
- fused features of the five frames of images are successively input to the input end of the outer layer of the two-layer ViT feature fusion model for classification training until the entire network is converged.
- classification training based on normalized exponential loss (i.e., softmax loss) may be applied, wherein a dimension of an embedding layer is set to be a positive integer multiple of 128, such as 128, 512, 1024, and the like, until the entire network is converged to obtain a recognition result of the moving target that meets a predefined condition.
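- a hedged training-loop sketch of the outer-layer classification training is given below; it assumes five fused per-frame features per sample, an embedding dimension that is a multiple of 128 (1024 here), a softmax (normalized exponential) classification head, and a hypothetical number of identities and optimizer supplied by the caller:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OuterLayerClassifier(nn.Module):
    """Hypothetical outer layer: fuses the fused features of five consecutive
    frames into one dynamic feature, then classifies the identity."""

    def __init__(self, feat_dim: int = 128, embed_dim: int = 1024, num_ids: int = 1000):
        super().__init__()
        assert embed_dim % 128 == 0, "embedding dim is a positive integer multiple of 128"
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.embedding = nn.Linear(feat_dim, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_ids)

    def forward(self, frame_features: torch.Tensor):
        # frame_features: (batch, 5, feat_dim) fused features of 5 consecutive frames
        encoded = self.encoder(frame_features).mean(dim=1)
        dynamic = self.embedding(encoded)                  # dynamic feature vector
        return dynamic, self.classifier(dynamic)

def train_step(model, optimizer, frame_features, identity_labels):
    """One classification-training step with normalized exponential (softmax) loss."""
    _, logits = model(frame_features)
    loss = F.cross_entropy(logits, identity_labels)        # softmax + negative log-likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```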
- the fine-grained static features and the fine-grained contour features are extracted from one RGB image and one contour image.
- the two-layer ViT feature fusion model may be applied to fuse the three types of features. In this way, the final trained model has a stronger feature expression ability, higher robustness and a better differentiation ability. Applying the model to recognize the moving target may improve the accuracy of the recognition result.
- FIG. 4 is a flow chart of a method for recognizing a moving target according to an embodiment of the present disclosure.
- the method for recognizing the moving target according to the embodiment of the present disclosure includes following operations.
- a video showing that the target to be recognized is moving is obtained and pre-processed first.
- a target RGB image sequence is obtained by a pedestrian detection and tracking tool.
- the RGB images are then normalized to obtain a standard target RGB image sequence.
- the standard target RGB image sequence is copied, and the foreground and the background of the target are annotated to obtain the target contour images.
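- a sketch of this pre-processing is given below; `detect_and_track` and `segment_person` are placeholder names for whatever pedestrian detection/tracking tool and foreground segmentation step are used, and the normalization size is an assumption:

```python
import numpy as np
import cv2

STANDARD_SIZE = (128, 256)  # (width, height); an assumed normalization size

def normalize_crops(crops):
    """Resize pedestrian crops to a standard size to obtain the standard RGB sequence."""
    return [cv2.resize(c, STANDARD_SIZE) for c in crops]

def preprocess_video(frames, detect_and_track, segment_person):
    """Pre-process a video of the target to be recognized.

    `detect_and_track` and `segment_person` stand in for the pedestrian
    detection/tracking tool and the foreground annotation step mentioned above.
    """
    crops = detect_and_track(frames)             # per-frame pedestrian crops
    rgb_seq = normalize_crops(crops)             # standard target RGB image sequence
    contour_seq = []
    for rgb in rgb_seq:
        mask = segment_person(rgb)               # foreground/background annotation
        contour = np.where(mask > 0, 255, 0).astype(np.uint8)
        contour_seq.append(contour)
    return rgb_seq, contour_seq
```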
- the RGB images and the contour images obtained in the operation S301 are segmented in the same manner and are successively input into the first input end and the second input end of the inner layer of the two-layer ViT feature fusion model, respectively, to obtain the fine-grained static features and the fine-grained contour features.
- the operation S303 is similar to the operation S103 in FIG. 1.
- the operation S303 will not be repeatedly described for providing a concise description.
- the fused features of at least some of the images are fused to obtain the dynamic features.
- the fused features corresponding to the plurality of consecutive frames of images are input to the input end of the outer layer of the two-layer ViT feature fusion model and are fused to obtain the dynamic features corresponding to the target to be recognized.
- the dimension of the embedding layer is set to be 1024, and the output dynamic features are represented by a 1024-dimensional feature vector.
- the recognition result is obtained based on the dynamic features.
- FIG. 5 is a flow chart of an operation S305 shown in FIG. 4 according to an embodiment of the present disclosure.
- the operation S305 may include following operations.
- 100 features are stored in the base library of the moving target.
- the dynamic features of the target to be recognized are compared to each of the 100 stored features one by one, and the cosine similarity therebetween is calculated. In this way, 100 cosine similarity values are obtained.
- the cosine similarity values are placed in an order, and a maximum cosine similarity value is obtained.
- the above 100 cosine similarity values are placed in the order, such that the maximum cosine similarity value is obtained.
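- a plain-NumPy sketch of this matching step is shown below: compute the cosine similarity against each stored feature, take the maximum, and accept the match only if it exceeds the recognition threshold (the threshold value 0.7 is an assumption):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def recognize(dynamic_feature: np.ndarray, base_library: dict, threshold: float = 0.7):
    """Return the identity whose stored feature is most similar to the query,
    or None if the maximum cosine similarity does not exceed the threshold."""
    scores = {identity: cosine_similarity(dynamic_feature, stored)
              for identity, stored in base_library.items()}
    best_identity = max(scores, key=scores.get)
    if scores[best_identity] > threshold:
        return best_identity, scores[best_identity]
    return None, scores[best_identity]
```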
- before performing the operation S401, the method further includes a process of establishing the base library of the moving target.
- FIG. 6 is a flow chart of operations performed before the operation S401 shown in FIG. 5 according to an embodiment of the present disclosure.
- the process of establishing the base library of the moving target includes following operations.
- each of all the videos is pre-processed, and a plurality of consecutive images in each of the videos are obtained successively.
- the plurality of images are input into the trained two-layer ViT feature fusion model to obtain the dynamic features corresponding to each pedestrian target to be stored.
- a mapping relationship between each pedestrian to be stored and the corresponding dynamic features is constructed, and the mapping relationship is stored into the base library of the moving target.
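- a sketch of building the base library as a simple identity-to-feature mapping follows; `extract_dynamic_feature` is assumed to wrap the trained two-layer ViT feature fusion model, and `preprocess_video` is assumed to be a pre-applied version of the routine sketched earlier:

```python
def build_base_library(videos_by_identity, preprocess_video, extract_dynamic_feature):
    """Construct the mapping from each pedestrian identity to its dynamic features.

    `videos_by_identity` maps identity info -> raw video frames; the callables are
    placeholders for the pre-processing step and the trained feature extractor.
    """
    base_library = {}
    for identity, frames in videos_by_identity.items():
        rgb_seq, contour_seq = preprocess_video(frames)
        base_library[identity] = extract_dynamic_feature(rgb_seq, contour_seq)
    return base_library
```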
- the fine-grained static features and the fine-grained contour features in one RGB image and one contour image are extracted.
- the two classes of static features are fully utilized, and the dynamic features of pedestrians included in a sequence of consecutive frames in the video are focused, such that the problem of the feature modality in the art being unitary may be solved.
- the two-layer ViT feature fusion model may be applied to fuse the three types of features, effectively improving the accuracy of recognition result.
- FIG. 7 is a diagram of an apparatus for recognizing a moving target according to an embodiment of the present disclosure.
- the apparatus includes an obtaining module 10, a fusing module 12 and a training module 14.
- the obtaining module 10 is configured to obtain a plurality of images taken at various time points and to obtain the first class of static features and the second class of static features of the target in each of the plurality of images.
- the fusing module 12 is configured to fuse the first class of static features and the second class of static features in each of the plurality of images to obtain the fused feature.
- the training module 14 is configured to perform classification training on the fused feature of at least some of the plurality of images until the entire network is converged.
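- as a rough illustration only (the module wiring below is an assumption; the names mirror FIG. 7), the three modules could be organized as follows:

```python
class RecognitionTrainingApparatus:
    """Hypothetical organization of the apparatus of FIG. 7: an obtaining module,
    a fusing module and a training module, invoked sequentially."""

    def __init__(self, obtaining_module, fusing_module, training_module):
        self.obtaining_module = obtaining_module   # yields per-image static and contour features
        self.fusing_module = fusing_module         # fuses the two classes of static features
        self.training_module = training_module     # runs classification training to convergence

    def run(self, video_frames):
        static_feats, contour_feats = self.obtaining_module(video_frames)
        fused = [self.fusing_module(s, c) for s, c in zip(static_feats, contour_feats)]
        return self.training_module(fused)
```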
- FIG. 8 is a structural schematic view of an apparatus for recognizing a moving target according to an embodiment of the present disclosure.
- the apparatus 20 includes a memory 100 and a processor 102 coupled to the memory 100.
- Program instructions are stored in the memory 100.
- the processor 102 is configured to execute the program instructions to implement the method according to any one of embodiments in the above.
- the processor 102 may also be referred to as a Central Processing Unit (CPU).
- the processor 102 may be an integrated circuit chip able to process signals.
- the processor 102 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic devices, a discrete gate or transistor logic device, or a discrete hardware component.
- the general purpose processor may be a microprocessor or any conventional processor.
- the processor 102 may be implemented by a plurality of integrated circuit chips together.
- FIG. 9 is a diagram of a computer-readable storage medium according to an embodiment of the present disclosure.
- the computer-readable storage medium 30 stores computer programs 300, which can be read by a computer.
- the computer programs 300 can be executed by a processor to implement the method mentioned in any of the above embodiments.
- the computer programs 300 may be stored in a form of a software product on the computer readable storage medium 30 as described above, and may include a number of instructions to enable a computer device (which may be a personal computer, a server, or a network device, and the like) or a processor to perform all or some of the operations of the method described in the various embodiments of the present disclosure.
- the computer-readable storage medium 30 that has the storage function may be a universal serial bus disc, a portable hard disc, a Read-Only Memory (ROM), a Random Access Memory (RAM), magnetic discs or optical discs, or various media that can store program codes, or terminal devices such as a computer, a server, a mobile phone, a tablet, and the like.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Molecular Biology (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Biomedical Technology (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Human Computer Interaction (AREA)
- Biodiversity & Conservation Biology (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Image Analysis (AREA)
Claims (9)
- A training method of recognizing a moving target, comprising: obtaining a plurality of consecutive images; inputting the plurality of consecutive images successively into an input end of an inner layer of a two-layer ViT feature fusion model to obtain a first class of static features and a second class of static features of the target in each of the plurality of consecutive images; fusing the first class of static features and the second class of static features in each of the plurality of consecutive images at an output end of the inner layer of the two-layer ViT feature fusion model to obtain fused features; and inputting the fused features of at least some of the plurality of consecutive images successively into an input end of an outer layer of the two-layer ViT feature fusion model for classification training until the entire network is converged.
- The training method according to claim 1, wherein obtaining the first class of static features and the second class of static features of the target in each of the plurality of consecutive images comprises: obtaining fine-grained static features and fine-grained contour features of the target in each of the plurality of consecutive images.
- The training method according to claim 2, wherein the obtaining fine-grained static features and fine-grained contour features of the target in each of the plurality of consecutive images comprises: segmenting the target into a plurality of portions, and inputting the plurality of portions successively into the first input end of the inner layer of the two-layer ViT feature fusion model to obtain the fine-grained static features; and segmenting a contour of the target into a plurality of contour portions, and inputting the plurality of contour portions successively into a second input end of the inner layer of the two-layer ViT feature fusion model to obtain the fine-grained contour features.
- The training method according to claim 3, wherein fusing the first class of static features and the second class of static features in each of the plurality of consecutive images to obtain fused features comprises: fusing the fine-grained static features and the fine-grained contour features by weighted average at an output end of the inner layer of the two-layer ViT feature fusion model to obtain the fused features.
- The training method according to claim 1, wherein the inputting the fused features of at least some of the plurality of consecutive images successively into an input end of an outer layer of the two-layer ViT feature fusion model for classification training comprises: inputting the fused features of the at least some of the plurality of consecutive images successively into an input layer of the outer layer of the two-layer ViT feature fusion model, and performing classification training based on normalized exponential loss, wherein a dimension of an embedding layer is set to be a positive integer multiple of 128, until the entire network is converged.
- A method for recognizing a moving target, comprising: obtaining a plurality of consecutive images of a target to be recognized; inputting the plurality of consecutive images successively into an input end of an inner layer of a two-layer ViT feature fusion model to obtain a first class of static features and a second class of static features of the target to be recognized in each of the plurality of consecutive images; fusing the first class of static features and the second class of static features in each of the plurality of consecutive images at an output end of the inner layer of the two-layer ViT feature fusion model to obtain fused features; inputting the fused features of at least some of the plurality of consecutive images successively into an input end of an outer layer of the two-layer ViT feature fusion model for fusing to obtain dynamic features; and obtaining a recognition result based on the dynamic features.
- The method according to claim 6, wherein the obtaining a recognition result based on the dynamic features comprises: calculating cosine similarity between the dynamic features and each of all features stored in a base library of the moving target one by one; placing the cosine similarity in an order and obtaining a maximum cosine similarity; determining whether the maximum cosine similarity is greater than a predetermined recognition threshold; and obtaining a stored feature corresponding to the maximum cosine similarity, and taking identity information corresponding to the stored feature as a recognition result of the target to be recognized, in response to the maximum cosine similarity being greater than the predetermined recognition threshold.
- The method according to claim 7, wherein before the obtaining a plurality of consecutive images of a target to be recognized, the method further comprises: establishing the base library of the moving target, wherein the base library of the moving target is configured to store all identity information of the target to be stored and the stored features.
- An apparatus for recognizing a moving target, comprising a memory and a processor coupled to the memory, wherein the memory stores program instructions, the program instructions are configured to be executed by the processor to implement the method for recognizing the moving target according to any one of claims 6 to 8.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110802833.X | 2021-07-15 | ||
| CN202110802833.XA CN113255630B (en) | 2021-07-15 | 2021-07-15 | Moving target recognition training method, moving target recognition method and device |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2023284182A1 true WO2023284182A1 (en) | 2023-01-19 |
Family
ID=77180490
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2021/128515 Ceased WO2023284182A1 (en) | 2021-07-15 | 2021-11-03 | Training method for recognizing moving target, method and device for recognizing moving target |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN113255630B (en) |
| WO (1) | WO2023284182A1 (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119169620A (en) * | 2024-08-27 | 2024-12-20 | 清华大学 | Video fine-grained understanding and annotation method and device based on multimodal large model |
Families Citing this family (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113255630B (en) * | 2021-07-15 | 2021-10-15 | 浙江大华技术股份有限公司 | Moving target recognition training method, moving target recognition method and device |
| CN113688745B (en) * | 2021-08-27 | 2024-04-05 | 大连海事大学 | A gait recognition method based on automatic mining of related nodes and statistical information |
| CN114724176A (en) * | 2022-03-09 | 2022-07-08 | 海纳云物联科技有限公司 | Tumble identification method |
| CN116110131B (en) * | 2023-04-11 | 2023-06-30 | 深圳未来立体教育科技有限公司 | Body interaction behavior recognition method and VR system |
| CN116844217B (en) * | 2023-08-30 | 2023-11-14 | 成都睿瞳科技有限责任公司 | Image processing system and method for generating face data |
| CN119295741B (en) * | 2024-12-10 | 2025-04-08 | 哈尔滨工程大学三亚南海创新发展基地 | Target tracking method, device, electronic equipment and storage medium for polar unmanned boat |
Family Cites Families (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109766925B (en) * | 2018-12-20 | 2021-05-11 | 深圳云天励飞技术有限公司 | Feature fusion method and device, electronic equipment and storage medium |
| US10977525B2 (en) * | 2019-03-29 | 2021-04-13 | Fuji Xerox Co., Ltd. | Indoor localization using real-time context fusion of visual information from static and dynamic cameras |
| CN110246518A (en) * | 2019-06-10 | 2019-09-17 | 深圳航天科技创新研究院 | Speech-emotion recognition method, device, system and storage medium based on more granularity sound state fusion features |
| CN111160194B (en) * | 2019-12-23 | 2022-06-24 | 浙江理工大学 | A still gesture image recognition method based on multi-feature fusion |
| CN111582126B (en) * | 2020-04-30 | 2024-02-27 | 浙江工商大学 | Pedestrian re-recognition method based on multi-scale pedestrian contour segmentation fusion |
| CN111814857B (en) * | 2020-06-29 | 2021-07-06 | 浙江大华技术股份有限公司 | Target re-identification method, network training method thereof and related device |
| CN111860291A (en) * | 2020-07-16 | 2020-10-30 | 上海交通大学 | Multimodal pedestrian identification method and system based on pedestrian appearance and gait information |
| CN112633058B (en) * | 2020-11-05 | 2024-05-31 | 北京工业大学 | Feature fusion-based frontal gait recognition method |
- 2021-07-15 CN CN202110802833.XA patent/CN113255630B/en active Active
- 2021-11-03 WO PCT/CN2021/128515 patent/WO2023284182A1/en not_active Ceased
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20190095764A1 (en) * | 2017-09-26 | 2019-03-28 | Panton, Inc. | Method and system for determining objects depicted in images |
| CN110555406A (en) * | 2019-08-31 | 2019-12-10 | 武汉理工大学 | Video moving target identification method based on Haar-like characteristics and CNN matching |
| CN112686193A (en) * | 2021-01-06 | 2021-04-20 | 东北大学 | Action recognition method and device based on compressed video and computer equipment |
| CN113096131A (en) * | 2021-06-09 | 2021-07-09 | 紫东信息科技(苏州)有限公司 | Gastroscope picture multi-label classification system based on VIT network |
| CN113255630A (en) * | 2021-07-15 | 2021-08-13 | 浙江大华技术股份有限公司 | Moving target recognition training method, moving target recognition method and device |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119169620A (en) * | 2024-08-27 | 2024-12-20 | 清华大学 | Video fine-grained understanding and annotation method and device based on multimodal large model |
| CN119169620B (en) * | 2024-08-27 | 2025-09-19 | 清华大学 | Video fine granularity understanding labeling method and device based on multi-mode large model |
Also Published As
| Publication number | Publication date |
|---|---|
| CN113255630B (en) | 2021-10-15 |
| CN113255630A (en) | 2021-08-13 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US12094247B2 (en) | Expression recognition method and related apparatus | |
| WO2023284182A1 (en) | Training method for recognizing moving target, method and device for recognizing moving target | |
| US12131580B2 (en) | Face detection method, apparatus, and device, and training method, apparatus, and device for image detection neural network | |
| CN110555481B (en) | Portrait style recognition method, device and computer readable storage medium | |
| US12314342B2 (en) | Object recognition method and apparatus | |
| CN109902548B (en) | Object attribute identification method and device, computing equipment and system | |
| US12210687B2 (en) | Gesture recognition method, electronic device, computer-readable storage medium, and chip | |
| CN112070044B (en) | Video object classification method and device | |
| WO2021068323A1 (en) | Multitask facial action recognition model training method, multitask facial action recognition method and apparatus, computer device, and storage medium | |
| WO2020103700A1 (en) | Image recognition method based on micro facial expressions, apparatus and related device | |
| WO2020228446A1 (en) | Model training method and apparatus, and terminal and storage medium | |
| US10339369B2 (en) | Facial expression recognition using relations determined by class-to-class comparisons | |
| CN114519877A (en) | Face recognition method, face recognition device, computer equipment and storage medium | |
| CN110866469A (en) | A method, device, equipment and medium for facial feature recognition | |
| CN111108508A (en) | Facial emotion recognition method, smart device and computer-readable storage medium | |
| CN111126358B (en) | Face detection method, device, storage medium and equipment | |
| CN114038045A (en) | Cross-modal face recognition model construction method and device and electronic equipment | |
| CN116152938A (en) | Identity recognition model training and electronic resource transfer method, device and equipment | |
| CN113822871A (en) | Object detection method, device, storage medium and device based on dynamic detection head | |
| CN114241559A (en) | Face recognition method, device, equipment and storage medium | |
| CN114764936A (en) | Image key point detection method and related equipment | |
| US12299911B2 (en) | Image processing method and apparatus, and storage medium | |
| CN116912924A (en) | A target image recognition method and device | |
| WO2020132825A1 (en) | Image processing apparatus and image processing method | |
| CN113887338A (en) | Method, device, terminal and storage medium for determining obstacle attributes |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 21949954; Country of ref document: EP; Kind code of ref document: A1 |
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21949954; Country of ref document: EP; Kind code of ref document: A1 |
| | 32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 09.09.2024) |