CN119600653A - A cross-modal pedestrian re-identification method and system - Google Patents
A cross-modal pedestrian re-identification method and system
- Publication number
- CN119600653A (application number CN202411690034.8A)
- Authority
- CN
- China
- Prior art keywords
- visible light
- features
- infrared
- image
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/0475—Generative networks
- G06N3/048—Activation functions
- G06N3/094—Adversarial learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- Y02T10/40—Engine management systems
Abstract
The invention relates to a cross-modal pedestrian re-identification method and system. The method comprises: randomly selecting a plurality of pedestrians from an infrared-visible light dataset in each training batch; preprocessing the visible-light images and infrared images; inputting the preprocessed visible-light and infrared images into a deep learning model and extracting visible-light features, infrared features and mixed-modality features through the deep learning model; constructing a total loss and judging whether training of the deep learning model is complete, training ending if the total loss is below a preset threshold and otherwise continuing until the total loss falls below the preset threshold; and identifying pedestrians in visible-light or infrared images to be identified with the trained deep learning model. The constructed deep learning model enables effective cross-modal pedestrian re-identification.
Description
Technical Field
The invention relates to the technical field of pedestrian re-identification, and in particular to a cross-modal pedestrian re-identification method and system.
Background
In recent years, pedestrian re-identification has become an important task in computer vision and video surveillance: by comparing pedestrian images captured by different cameras, the same individual is automatically identified and matched, with wide application in intelligent surveillance, video retrieval, intelligent transportation and related fields. However, conventional pedestrian re-identification methods typically rely on single-modality visual feature extraction and show significant limitations in cross-modality and low-light scenes. For example, there is a large modality difference between infrared and visible-light images, which makes feature alignment and cross-modal matching very challenging.
To address this problem, deep-learning-based infrared-visible light cross-modal pedestrian re-identification methods have shown excellent performance: cross-modal features can be better extracted and aligned, so that identification accuracy is significantly improved. Infrared-visible light cross-modal pedestrian re-identification exploits the correlation between visible-light and infrared images to match pedestrians across modalities, which is important in all-weather surveillance scenarios, especially at night or under low illumination, and it has therefore become an important research direction in computer vision.
Infrared-visible light cross-modal pedestrian re-identification aims to reduce the modality difference between infrared-image and visible-light-image features so as to achieve cross-modal pedestrian matching and identification. It nevertheless faces several challenges, the most significant of which are the large modality difference and the scarcity of cross-modal training data.
(1) Large modality difference
Infrared and visible-light images differ markedly in their visual characteristics. A visible-light image is formed by natural-light reflection and provides rich colour information, texture detail and contour features, whereas an infrared image is formed from the thermal radiation of objects, contains no colour information and mainly reflects the temperature distribution of the scene. Because of this large modality difference, the appearance of the same person differs greatly between infrared and visible-light images, which makes direct matching with traditional feature-extraction methods difficult.
In a cross-modal pedestrian re-identification system, these heterogeneous visual features therefore need to be aligned to narrow the modality gap. Two families of methods exist. Feature-level methods attempt to project the visible-light and infrared features extracted by the backbone network into a common embedding space and to perform matching in that space, thereby reducing the influence of the modality difference. Owing to the large modality gap, however, it is difficult to project cross-modal images directly into a feature space in which infrared and visible-light features are aligned, and even when such a mapping is achieved, many useful shared features are lost in the conversion from the original to the new feature space. Image-level methods instead use generative adversarial networks, either to translate the infrared and visible-light images into a single unified modality before feature extraction, or to compensate for the missing modality, for example by converting an infrared (or visible-light) image into a visible-light (or infrared) image, fusing its features with those of the existing modality and matching in the fused feature space. Because of the large modality difference and the lack of paired infrared-visible light images, generating such cross-modal images is difficult and usually accompanied by considerable noise.
(2) Scarce cross-modal training data
The other major difficulty is the scarcity of cross-modal training data. Compared with single-modality visible-light pedestrian re-identification, an infrared-visible light cross-modal dataset must contain paired infrared and visible-light images, yet such datasets are expensive to collect and complex to annotate. In practical surveillance scenarios, especially at night, infrared images are easy to acquire but the corresponding visible-light images are often missing, leading to a shortage of paired infrared and visible-light data.
Insufficient data directly affects the generalisation ability of a deep learning model, since such models usually require a large amount of annotated data to learn effective cross-modal features. This is particularly true for models whose backbone is a Transformer or one of its variants: they have no spatial or temporal inductive bias and a very large number of parameters, and they need sufficient data to exploit their structural advantages, establish global dependencies and generalise well. Owing to the scarcity of cross-modal datasets, such models easily overfit during training, which degrades recognition performance in practical applications. To alleviate this problem, current infrared-visible light cross-modal pedestrian re-identification methods either apply data-augmentation techniques such as channel erasing, channel swapping, random erasing and grayscale augmentation to add new samples, all derived from the original images and therefore lacking diversity, or generate additional same-modality samples with generative adversarial networks or by mixing local body features of pedestrians. All of these, however, augment data within a single modality, so the challenge posed by data scarcity remains largely unsolved.
In summary, the prior art cannot effectively handle the large modality difference between infrared-image and visible-light-image features or the scarcity of cross-modal training data, so cross-modal pedestrian matching and identification remain poor.
Disclosure of Invention
The invention therefore aims to solve the technical problem that, in the prior art, the large modality difference between infrared and visible-light images and the scarcity of cross-modal training data lead to poor cross-modal pedestrian matching and identification.
To solve the above technical problem, the invention provides a cross-modal pedestrian re-identification method comprising the following steps:
Step S1: randomly selecting a plurality of pedestrians from an infrared-visible light dataset in each training batch, each pedestrian comprising 4 visible-light images and 4 infrared images;
Step S2: preprocessing the visible-light images and infrared images;
Step S3: inputting the preprocessed visible-light and infrared images into a deep learning model and extracting visible-light features, infrared features and mixed-modality features through the deep learning model;
Step S4: constructing a total loss and judging whether training of the deep learning model is complete; if the total loss is below a preset threshold, training ends, and if it is above the threshold, training continues until the total loss falls below the threshold,
wherein the total loss includes a local maximum mean discrepancy loss constructed from the visible-light and infrared features, an inter-modal divergence loss constructed from the visible-light and infrared features, a circle loss constructed from the visible-light, infrared and mixed-modality features, and a cross-entropy loss constructed from the visible-light, infrared and mixed-modality features;
Step S5: identifying pedestrians in the visible-light or infrared images to be identified with the trained deep learning model.
In one embodiment of the invention, the local maximum mean discrepancy loss in step S4 is constructed from the visible-light and infrared features, wherein $\mathcal{H}$ denotes the reproducing kernel Hilbert space, $\phi(\cdot)$ denotes the mapping into that space, $N_{id}$ denotes the total number of pedestrian identity classes, $N_V$ denotes the number of visible-light images in the training batch, $f_i^V$ denotes the feature of the i-th visible-light image and $w_{i,k}^V$ the probability weight that $f_i^V$ belongs to class k, $N_I$ denotes the number of infrared images in the training batch, $f_j^I$ denotes the feature of the j-th infrared image and $w_{j,k}^I$ the probability weight that $f_j^I$ belongs to class k, and $\|a-b\|^2$ denotes the squared distance between a and b.
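A formulation of this loss consistent with the definitions above (a reconstruction for readability; the weight symbols $w_{i,k}^V$ and $w_{j,k}^I$ are notation introduced here, and the exact rendering in the original drawings may differ) is:

$$L_{lmmd}=\frac{1}{N_{id}}\sum_{k=1}^{N_{id}}\Big\|\sum_{i=1}^{N_V} w_{i,k}^{V}\,\phi\!\left(f_i^{V}\right)-\sum_{j=1}^{N_I} w_{j,k}^{I}\,\phi\!\left(f_j^{I}\right)\Big\|_{\mathcal{H}}^{2}$$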
In one embodiment of the invention, the inter-modal divergence loss in step S4 is constructed from the visible-light and infrared features, wherein $N_V$ and $N_I$ denote the numbers of visible-light and infrared images in the training batch, $C_V$ and $C_I$ denote the visible-light and infrared image classifiers respectively, $f_j^I$ denotes the feature of the j-th infrared image, and $f_i^V$ denotes the feature of the i-th visible-light image.
In one embodiment of the invention, the circle loss in step S4 is constructed from the visible-light, infrared and mixed-modality features, wherein $s_p = w_y^{\top} f^m / (\|w_y\|\,\|f^m\|)$ and $s_n^j = w_j^{\top} f^m / (\|w_j\|\,\|f^m\|)$ with $m \in \{V, I, H\}$; $s_p$ and $s_n^j$ denote the within-class and between-class similarity scores, $w_j$ and $w_y$ denote the non-target and target classification weights and $w_j^{\top}$, $w_y^{\top}$ their transposes, $f^m$ denotes an image feature of modality m, where V, I and H denote the visible-light, infrared and mixed modalities respectively, $\|\cdot\|$ denotes the L2 norm, $\alpha_n$ and $\alpha_p$ denote the first and second weighting coefficients of the circle loss, and $\Delta_n$, $\Delta_p$ and $\gamma$ are the first, second and third hyper-parameters.
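One reconstruction, consistent with the definitions above and with the standard circle-loss formulation, summed over all images in the batch, is:

$$L_{circle}=\sum_{m\in\{V,I,H\}}\sum_{i=1}^{N_m}\log\Big[1+\sum_{j\neq y}\exp\big(\gamma\,\alpha_n\,(s_n^{j}-\Delta_n)\big)\exp\big(-\gamma\,\alpha_p\,(s_p-\Delta_p)\big)\Big]$$

with $\alpha_p$ and $\alpha_n$ the non-negative weighting coefficients referenced above.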
In one embodiment of the invention, the cross-entropy loss constructed from the visible-light, infrared and mixed-modality features in step S4 is obtained as follows:
the visible-light, infrared and mixed-modality features are passed through a classification layer to obtain their respective category scores, and the cross-entropy loss is constructed from these scores, wherein $N_V$, $N_I$ and $N_H$ denote the numbers of visible-light, infrared and mixed-modality images in the training batch, $N_{id}$ denotes the total number of pedestrian identity classes, $y_{i,k}$ denotes the true category score of the i-th image for the k-th pedestrian, $\tilde{z}_{i,k}^{m}$ denotes the predicted category score of the i-th image for the k-th pedestrian, m is the modality of the image feature, and V, I and H denote the visible-light, infrared and mixed modalities respectively.
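A reconstruction of this term consistent with the definitions above (the normalisation over the batch and the symbol $\tilde{z}_{i,k}^{m}$ for the prediction are assumptions) is:

$$L_{ce}=-\sum_{m\in\{V,I,H\}}\frac{1}{N_m}\sum_{i=1}^{N_m}\sum_{k=1}^{N_{id}} y_{i,k}\,\log\tilde{z}_{i,k}^{\,m}$$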
In one embodiment of the invention, the deep learning model in step S3 comprises a first stage, a mixed-modality generation module, a second stage, a third stage and a fourth stage connected in sequence, each stage including at least one Swin Transformer block that appears in pairs, each paired Swin Transformer block comprising a first sub-block and a second sub-block connected in sequence, wherein
the first sub-block comprises a first normalization layer LN, a window-based multi-head self-attention layer W-MSA, a second normalization layer LN and a first multi-layer perceptron MLP connected in sequence, the input of the first normalization layer LN and the output of the window-based multi-head self-attention layer W-MSA being added element-wise to form the input of the second normalization layer LN, and the input of the second normalization layer LN and the output of the first multi-layer perceptron MLP being added element-wise to form the input of the second sub-block;
the second sub-block comprises a third normalization layer LN, a shifted-window multi-head self-attention layer SW-MSA, a fourth normalization layer LN and a second multi-layer perceptron MLP connected in sequence, the input of the third normalization layer LN and the output of the shifted-window multi-head self-attention layer SW-MSA being added element-wise to form the input of the fourth normalization layer LN, and the input of the fourth normalization layer LN and the output of the second multi-layer perceptron MLP being added element-wise to form the output of the paired Swin Transformer block;
the first Swin Transformer block of the first stage is preceded by a Linear Embedding module used to convert the image into the patches required by the Swin Transformer blocks, and the first Swin Transformer block of each of the second, third and fourth stages is preceded by a Patch Merging module used to reduce the number of patches and increase their dimension so as to form a hierarchical feature representation.
In one embodiment of the invention, obtaining the mixed-modality image features from the visible-light and infrared image features output by the first stage through the mixed-modality generation module comprises:
setting the feature of each visible-light image output by the first stage as $x_V$ and the feature of each infrared image as $x_I$;
passing $x_V$ through first dilated (atrous) convolutions with dilation rates of 1, 2 and 4 respectively, adding and averaging the results, and obtaining the visible-light-part generation feature through a fully connected layer and a ReLU activation function;
passing $x_I$ through second dilated convolutions with dilation rates of 1, 2 and 4, and obtaining the infrared-part generation feature through a fully connected layer and a ReLU activation function;
to extract scene-level context shared by the infrared and visible-light images and generate background features, passing $x_V$ and $x_I$ through the same adaptive pooling and 1×1 convolution and taking a pixel-wise dot product to obtain the common background feature, resizing the common background feature to the same size as the image features output by the first stage via bilinear interpolation, and obtaining the global background generation feature through a ReLU activation function;
finally, adding and averaging the visible-light-part generation feature, the infrared-part generation feature and the global background generation feature to obtain the final mixed-modality image feature.
To maintain consistency of the final mixed-modality features, a modality-consistency loss is introduced into the total loss, wherein $L_2(\cdot)$ denotes L2 regularization, $f_t^H$ and $f_s^H$ denote the t-th and s-th generated mixed-modality features, $\|\cdot\|_2$ denotes the Euclidean distance, and n is the number of mixed-modality features of the same pedestrian.
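One reading of this term that matches the definitions above, averaging over all pairs of mixed-modality features of the same pedestrian (the pairwise normalisation is an assumption), is:

$$L_{mc}=\frac{2}{n(n-1)}\sum_{t=1}^{n}\sum_{s=t+1}^{n}\big\|L_2(f_t^{H})-L_2(f_s^{H})\big\|_2$$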
In one embodiment of the invention, the mixed-modality generation module further performs feature alignment, through adversarial learning, between the visible-light image features $x_V$ output by the first stage and the mixed-modality image features generated by the module, and between the infrared image features $x_I$ output by the first stage and the generated mixed-modality image features, comprising:
reducing the modality differences between $x_V$ and the generated mixed-modality image features and between $x_I$ and the generated mixed-modality image features by means of a generator and a discriminator, wherein
the generator consists of three fully connected layers with ReLU activation functions between them to increase its nonlinearity; the visible-light image feature $x_V$ and the infrared image feature $x_I$ output by the first stage are each added to a compensation embedding vector whose height and width match the original input image and whose dimension is 1, and the results are passed through the corresponding generators to obtain forged mixed-modality features;
the discriminator is a Patch GAN discriminator; the mixed-modality image features forged from the visible-light image feature $x_V$ or the infrared image feature $x_I$ and the mixed-modality image features generated by the mixed-modality generation module are input into the discriminator, and an adversarial loss relating to the discriminator is constructed in the total loss function, the adversarial loss being used to judge whether the forged mixed-modality features are the same as the mixed-modality image features generated by the mixed-modality generation module; the adversarial loss comprises the discriminator loss, the generator loss and the joint adversarial loss, with the formula:
L_GAN = L_D + L_G;
wherein $L_D$ is the discriminator loss, $L_G$ is the generator loss, $L_{GAN}$ is the joint adversarial loss, $N_V$, $N_I$ and $N_H$ are the numbers of visible-light, infrared and mixed-modality images in the training batch, $D(\cdot)$ is the discriminator, the i-th visible-light image, the j-th infrared image and the k-th mixed-modality image are the inputs to the losses, $y_V$, $y_I$ and $y_H$ are the labels of the visible-light image, the infrared image and the mixed modality, $G_{V-H}$ is the visible-light-to-mixed-modality generator, and $G_{I-H}$ is the infrared-to-mixed-modality generator.
In one embodiment of the invention, a component detection and exchange module is further provided after the last Swin Transformer block of the third stage, the component detection and exchange module comprising:
the global image features $G_V$, $G_H$, $G_I$ of the visible-light, mixed and infrared modalities output by the last Swin Transformer block of the third stage, together with the features output by the first and second stages, are fed into the body part detector (Component Detector) to obtain initial part-level local features composed of several body-part local features; the initial part-level local features are passed through a part prediction network (Predictor Network) consisting of a fully connected layer and a Sigmoid function, which predicts a score for each body-part local feature in the initial part-level local features; body-part features in the initial part-level local features are then exchanged according to the prediction scores: if the score of a body-part feature of the mixed modality is lower than the scores of the corresponding body-part features of the other two modalities, the body-part feature with the highest score among the other two modalities replaces that body-part feature of the mixed modality, so that redundant, overlapping body-part features in the mixed modality are replaced, yielding the part-level local features $P_V$, $P_H$, $P_I$;
the body part detector (Component Detector) passes the features output by the first and second stages and the features output by the last Swin Transformer block of the third stage through a 1×1 convolutional neural network and a Sigmoid function to obtain attention masks at different levels, resizes these masks to the same size and multiplies them pixel-wise to obtain a final mask; the final mask is then multiplied by each of the global image features $G_V$, $G_H$, $G_I$ of the visible-light, mixed and infrared modalities output by the last Swin Transformer block of the third stage, and the results are finally concatenated along the feature dimension to obtain the initial part-level local features;
a part identity loss over the part features is constructed in the total loss function, wherein N denotes the total number of images in the training batch, $N_c$ denotes the number of body-part local features, $N_{id}$ is the total number of pedestrian identity classes, $y_{i,k}$ denotes the true category score of the i-th image for the k-th pedestrian, and $z_{i,j,k}$ denotes the predicted category score of the j-th body-part local feature of the i-th image for the k-th pedestrian.
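A reconstruction of the part identity loss consistent with the definitions above is:

$$L_{pid}=-\frac{1}{N\,N_c}\sum_{i=1}^{N}\sum_{j=1}^{N_c}\sum_{k=1}^{N_{id}} y_{i,k}\,\log z_{i,j,k}$$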
To solve the above technical problem, the invention also provides a cross-modal pedestrian re-identification system comprising:
a selection module for randomly selecting a plurality of pedestrians from an infrared-visible light dataset in each training batch, each pedestrian comprising 4 visible-light images and 4 infrared images;
a preprocessing module for preprocessing the visible-light images and infrared images;
a feature extraction module for inputting the preprocessed visible-light and infrared images into a deep learning model and extracting visible-light features, infrared features and mixed-modality features through the deep learning model;
a construction module for constructing a total loss and judging whether training of the deep learning model is complete, training ending if the total loss is below a preset threshold and otherwise continuing until the total loss falls below the threshold,
wherein the total loss includes a local maximum mean discrepancy loss constructed from the visible-light and infrared features, an inter-modal divergence loss constructed from the visible-light and infrared features, a circle loss constructed from the visible-light, infrared and mixed-modality features, and a cross-entropy loss constructed from the visible-light, infrared and mixed-modality features;
and a recognition module for identifying pedestrians in the visible-light or infrared images to be identified with the trained deep learning model.
To solve the above technical problem, the invention further provides an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above cross-modal pedestrian re-identification method when executing the computer program.
To solve the above technical problem, the invention provides a computer-readable storage medium having a computer program stored thereon which, when executed by a processor, implements the steps of the above cross-modal pedestrian re-identification method.
Compared with the prior art, the technical scheme of the invention has the following advantages:
the deep learning model constructed by the invention uses a novel three-modality branch structure in which the mixed modality is learned from the visible-light and infrared images, which effectively increases the number of training samples and benefits subsequent pedestrian re-identification;
the invention introduces a mixed-modality generation module into the deep learning model and aligns the visible-light images with the mixed-modality images and the infrared images with the mixed-modality images through learned modality-compensation embeddings and a generative adversarial network, which effectively reduces the modality difference between the infrared and visible-light modalities in actual retrieval;
the invention introduces a component detection and exchange module into the deep learning model, which explores subtle differences between images while reducing the difficulty of alignment.
Drawings
In order that the invention may be more readily understood, a more particular description of the invention will be rendered by reference to specific embodiments thereof that are illustrated in the appended drawings.
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of a deep learning model structure in an embodiment of the invention;
FIG. 3 is a structural diagram of a Swin Transformer block in an embodiment of the present invention;
FIG. 4 is a schematic diagram of a hybrid modality generation module according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a component detection and switching module in accordance with an embodiment of the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings and specific examples, which are not intended to be limiting, so that those skilled in the art will better understand the invention and practice it.
Example 1
Referring to FIG. 1, the invention relates to a cross-modal pedestrian re-identification method comprising the following steps:
Step S1: randomly selecting a plurality of pedestrians from an infrared-visible light dataset in each training batch, each pedestrian comprising 4 visible-light images and 4 infrared images;
Step S2: preprocessing the visible-light images and infrared images;
Step S3: inputting the preprocessed visible-light and infrared images into a deep learning model and extracting visible-light features, infrared features and mixed-modality features through the deep learning model;
Step S4: constructing a total loss and judging whether training of the deep learning model is complete; if the total loss is below a preset threshold, training ends, and if it is above the threshold, training continues until the total loss falls below the threshold,
wherein the total loss includes a local maximum mean discrepancy loss constructed from the visible-light and infrared features, an inter-modal divergence loss constructed from the visible-light and infrared features, a circle loss constructed from the visible-light, infrared and mixed-modality features, and a cross-entropy loss constructed from the visible-light, infrared and mixed-modality features;
Step S5: identifying pedestrians in the visible-light or infrared images to be identified with the trained deep learning model. Specifically, the model identifies a pedestrian in an infrared image to be identified given a visible-light image, or identifies a pedestrian in a visible-light image to be identified given an infrared image. It should be noted that, since the deep learning model trained in this embodiment has three branches, in actual use the branch corresponding to the infrared or visible-light image can be selected according to the application scenario.
The present embodiment is described in detail below:
The deep learning model provided by the invention is trained with a three-branch network whose backbone is Swin Transformer V2. The three branches of the model are a visible-light branch, an infrared branch and a mixed-modality branch, and the model parameters of the three branches are shared. The training data of the mixed-modality branch are formed by mixing visible-light and infrared images, which effectively increases the number of training samples; at the same time the model aligns the visible-light and infrared image features with the mixed-modality features, which effectively reduces the difference between modalities and mitigates overfitting to the training samples. To discover subtle differences between images, the model extracts local features by generating masks in the third stage and performs progressive replacement of body-part local features among the three modality branches, so that redundant, overlapping body-part features in the mixed modality are replaced and the local features of the mixed modality become purer. FIG. 2 shows a schematic structural diagram of the whole model, and the specific training process of the model is as follows:
S1: in each training batch, 4 pedestrians are randomly selected from the infrared-visible light dataset, each pedestrian comprising 4 visible-light images and 4 infrared images.
S2: all pedestrian images are preprocessed. All images are first scaled to 288×144×3 and data augmentation is applied: the visible-light images undergo random cropping, random horizontal flipping and random erasing, plus one randomly chosen augmentation among channel random erasing, random channel swapping, colour jitter and spectrum jitter; the infrared images undergo only random cropping, random horizontal flipping and random erasing. Finally, the images are normalized with mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225].
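For illustration only, a minimal preprocessing pipeline along these lines could be written with torchvision transforms; the padding size, and the use of ColorJitter/RandomGrayscale as stand-ins for the colour- and spectrum-jitter options, are assumptions, and the channel-erasing and channel-swapping augmentations would be small custom transforms not shown here.

```python
import torchvision.transforms as T

IMG_H, IMG_W = 288, 144
NORM_MEAN, NORM_STD = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]

# Geometric augmentations shared by both modalities.
base_aug = [
    T.Resize((IMG_H, IMG_W)),
    T.Pad(10),                           # pad-then-crop; the padding size is an assumption
    T.RandomCrop((IMG_H, IMG_W)),
    T.RandomHorizontalFlip(p=0.5),
]

# Visible-light branch: one extra colour/channel-level augmentation chosen per image.
visible_transform = T.Compose(
    base_aug
    + [T.RandomChoice([T.ColorJitter(0.2, 0.2, 0.2, 0.1), T.RandomGrayscale(p=1.0)])]
    + [T.ToTensor(), T.Normalize(NORM_MEAN, NORM_STD), T.RandomErasing(p=0.5)]
)

# Infrared branch: only crop, flip and random erasing, as described above.
infrared_transform = T.Compose(
    base_aug + [T.ToTensor(), T.Normalize(NORM_MEAN, NORM_STD), T.RandomErasing(p=0.5)]
)
```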
S3: the processed visible-light and infrared images are input into the model to obtain the features of the three modalities, namely the visible-light, infrared and mixed-modality features generated and extracted by the model;
S4: the total loss is constructed and it is judged whether training of the deep learning model is complete; if the total loss is below a preset threshold, training ends, and if it is above the threshold, training continues until the total loss falls below the threshold;
Specifically, the total loss includes a local maximum mean discrepancy loss, an inter-modal divergence loss, a circle loss and a cross-entropy loss, as follows:
(1) The extracted visible-light feature $f^V$ and infrared feature $f^I$ are supervised by the local maximum mean discrepancy loss and the inter-modal divergence loss, wherein, for the local maximum mean discrepancy loss, $\mathcal{H}$ denotes the reproducing kernel Hilbert space, $\phi(\cdot)$ denotes the mapping into that space, $N_{id}$ denotes the total number of pedestrian identity classes, $N_V$ denotes the number of visible-light images in the training batch, $f_i^V$ denotes the feature of the i-th visible-light image and $w_{i,k}^V$ the probability weight that $f_i^V$ belongs to class k, $N_I$ denotes the number of infrared images in the training batch, $f_j^I$ denotes the feature of the j-th infrared image and $w_{j,k}^I$ the probability weight that $f_j^I$ belongs to class k, and $\|a-b\|^2$ denotes the squared distance between a and b.
(2) For the inter-modal divergence loss, $N_V$ and $N_I$ denote the numbers of visible-light and infrared images in the training batch, $C_V$ and $C_I$ denote the visible-light and infrared image classifiers respectively, $f_j^I$ denotes the feature of the j-th infrared image, and $f_i^V$ denotes the feature of the i-th visible-light image.
(3) The visible-light feature $f^V$, the infrared feature $f^I$ and the mixed-modality feature $f^H$ are supervised by the circle loss, wherein $s_p = w_y^{\top} f^m / (\|w_y\|\,\|f^m\|)$ and $s_n^j = w_j^{\top} f^m / (\|w_j\|\,\|f^m\|)$ with $m \in \{V, I, H\}$; $s_p$ and $s_n^j$ denote the within-class and between-class similarity scores, $w_j$ and $w_y$ denote the non-target and target classification weights and $w_j^{\top}$, $w_y^{\top}$ their transposes, $f^m$ denotes an image feature of modality m, with V, I and H denoting the visible-light, infrared and mixed modalities, $\|\cdot\|$ denotes the L2 norm, $\alpha_n$ and $\alpha_p$ denote the first and second weighting coefficients of the circle loss, and $\Delta_n$, $\Delta_p$ and $\gamma$ are the first, second and third hyper-parameters, taking the values 0.5, 0.5 and 30 respectively.
(4) The three modality features are passed through a classification layer (i.e. a fully connected layer, the FC layer in FIG. 2) to obtain category scores $z^V$, $z^I$ and $z^H$, which are supervised by the cross-entropy loss, wherein $N_V$, $N_I$ and $N_H$ denote the numbers of visible-light, infrared and mixed-modality images in the training batch, $N_{id}$ denotes the total number of pedestrian identity classes, $y_{i,k}$ denotes the true category score of the i-th image for the k-th pedestrian, $\tilde{z}_{i,k}^{m}$ denotes the predicted category score of the i-th image for the k-th pedestrian, m is the modality of the image feature, and V, I and H denote the visible-light, infrared and mixed modalities respectively.
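To make the training criterion of step S4 concrete, a schematic assembly of the total loss and the stopping check is sketched below; the per-term weights and the threshold value are not specified above and are treated as assumed, configurable hyper-parameters.

```python
import torch

def total_loss(losses: dict, weights: dict) -> torch.Tensor:
    """Weighted sum of the individual loss terms described above.

    `losses` holds the per-term values (lmmd, divergence, circle, cross-entropy,
    modality consistency, adversarial, part identity); `weights` holds assumed
    per-term coefficients, since the text does not state them explicitly.
    """
    return sum(weights.get(name, 1.0) * value for name, value in losses.items())

THRESHOLD = 0.1   # placeholder value; the preset threshold is not given in the text

def train(model, loader, optimizer, compute_losses, weights, max_epochs=40):
    for epoch in range(max_epochs):
        for batch in loader:
            optimizer.zero_grad()
            losses = compute_losses(model, batch)    # dict of the loss terms above
            loss = total_loss(losses, weights)
            loss.backward()
            optimizer.step()
        if loss.item() < THRESHOLD:                  # convergence check from step S4
            break
```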
S5: pedestrians are identified in the visible-light or infrared images to be identified with the trained deep learning model.
The specific flow of the step S2 is as follows:
S2-1. The deep learning model of this embodiment adopts Swin Transformer V2 as the backbone network and comprises a first stage (Stage 1), a mixed-modality generation module (HMG Module), a second stage (Stage 2), a third stage (Stage 3) and a fourth stage (Stage 4) connected in sequence, with a component detection and exchange module (CDS Module) further arranged in the third stage (Stage 3). Swin Transformer V2 first partitions the visible-light and infrared images into patches through the Patch Partition module in FIG. 2. So that each patch fully perceives the features in its neighbourhood and the model fully exploits the image features, overlapping sampling is used for the partition: a two-dimensional convolution with stride 3 and kernel size 4 is applied to the input image. The input image received by Swin Transformer V2 has size [288, 144, 3]; the convolutional partition yields 96×48 patches, each of dimension 128. The patches are then linearly embedded (the Linear Embedding module of the first stage), and the 4608 patches are flattened to give image features of size [4608, 128], which are fed into the Swin Transformer blocks of the four stages; the numbers of Swin Transformer blocks in the four stages are [2, 2, 18, 2] respectively.
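As an illustration only, a minimal PyTorch sketch of this overlapping patch partition is given below; a padding of 1 is an assumption made so that a 288×144 input yields exactly 96×48 patches.

```python
import torch
import torch.nn as nn

class OverlapPatchPartition(nn.Module):
    """Overlapping patch split via a strided 2-D convolution, as described above."""
    def __init__(self, in_ch=3, embed_dim=128, kernel=4, stride=3, padding=1):
        super().__init__()
        # padding=1 is assumed so that 288x144 -> 96x48 patches
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=kernel,
                              stride=stride, padding=padding)

    def forward(self, x):                      # x: [B, 3, 288, 144]
        x = self.proj(x)                       # [B, 128, 96, 48]
        return x.flatten(2).transpose(1, 2)    # [B, 4608, 128] flattened patch tokens

tokens = OverlapPatchPartition()(torch.randn(2, 3, 288, 144))
print(tokens.shape)   # torch.Size([2, 4608, 128])
```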
In this embodiment the Swin Transformer blocks consisting of W-MSA and SW-MSA always appear in pairs; referring to FIG. 3, each paired Swin Transformer block comprises a first sub-block and a second sub-block connected in sequence, wherein
the first sub-block comprises a first normalization layer LN, a window-based multi-head self-attention layer W-MSA, a second normalization layer LN and a first multi-layer perceptron MLP connected in sequence, the input of the first normalization layer LN and the output of the window-based multi-head self-attention layer W-MSA being added element-wise to form the input of the second normalization layer LN, and the input of the second normalization layer LN and the output of the first multi-layer perceptron MLP being added element-wise to form the input of the second sub-block;
the second sub-block comprises a third normalization layer LN, a shifted-window multi-head self-attention layer SW-MSA, a fourth normalization layer LN and a second multi-layer perceptron MLP connected in sequence, the input of the third normalization layer LN and the output of the shifted-window multi-head self-attention layer SW-MSA being added element-wise to form the input of the fourth normalization layer LN, and the input of the fourth normalization layer LN and the output of the second multi-layer perceptron MLP being added element-wise to form the output of the paired Swin Transformer block;
the first Swin Transformer block of the first stage is preceded by a Linear Embedding module, which converts the image into the patches required by the Swin Transformer blocks (i.e. the partitioned patches are mapped to a new dimension, a reshaping and flattening operation, to facilitate subsequent Swin Transformer block processing), and the first Swin Transformer block of each of the second, third and fourth stages is preceded by a Patch Merging module, which reduces the number of patches and increases their dimension so as to form a hierarchical feature representation. Since the Linear Embedding and Patch Merging modules belong to the prior art, their description is omitted in this embodiment.
The multi-layer perceptron MLP is composed of two fully connected layers, and a GELU nonlinear activation function is used between the two fully connected layers.
Further, the Swin Transformer block consisting of W-MSA and SW-MSA always appears in pairs and is divided into two sub-blocks, a first sub-block and a second sub-block, through which relationships between patches in different regions can be established. The image features output by a paired Swin Transformer block are calculated as follows,
wherein the four intermediate variables denote, respectively, the features output by the multi-head self-attention layer of the first sub-block, the features output by the multi-head self-attention layer of the second sub-block, the features output by the multi-layer perceptron of the first sub-block and the features output by the multi-layer perceptron of the second sub-block; WMSA(·) is the window-based multi-head self-attention layer, MLP(·) is the multi-layer perceptron, SWMSA(·) is the shifted-window multi-head self-attention layer, m is the modality of the image feature, and V, I and H denote the visible-light, infrared and mixed-modality features respectively. To produce a hierarchical representation, patch merging layers between the Swin stages reduce the number of patches while increasing the patch dimension: assuming each patch has dimension C, the patch merging layer concatenates the features of each group of 2×2 adjacent patches and uses a linear layer to reduce the concatenated features of dimension 4C to 2C.
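A reconstruction of these update equations, consistent with the sub-block structure described above (the hat notation for the attention outputs is introduced here), is:

$$\hat{z}_m^{\,l}=\mathrm{WMSA}\big(\mathrm{LN}(z_m^{\,l-1})\big)+z_m^{\,l-1},\qquad z_m^{\,l}=\mathrm{MLP}\big(\mathrm{LN}(\hat{z}_m^{\,l})\big)+\hat{z}_m^{\,l}$$
$$\hat{z}_m^{\,l+1}=\mathrm{SWMSA}\big(\mathrm{LN}(z_m^{\,l})\big)+z_m^{\,l},\qquad z_m^{\,l+1}=\mathrm{MLP}\big(\mathrm{LN}(\hat{z}_m^{\,l+1})\big)+\hat{z}_m^{\,l+1},\qquad m\in\{V,I,H\}$$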
Referring to FIG. 2, after the Swin Transformer blocks of the first stage (Stage 1), the output visible-light and infrared modality features pass through the mixed-modality generation module (HMG Module) to obtain three branches, which output the visible-light features, the mixed-modality image features and the infrared image features respectively; the three modality features are then fed into the Swin Transformer blocks of the last three stages (Stages 2, 3 and 4).
Referring to FIG. 2, the global image features corresponding to the three modalities output by the last Swin Transformer block of the third stage (Stage 3) (i.e. $G_V$, $G_H$, $G_I$ in FIG. 2), together with the features output by the first and second stages (Stages 1 and 2), are sent to the component detection and exchange module (CDS Module); the corresponding part-level local features $P_V$, $P_H$, $P_I$ are extracted and output together with the global image features $G_V$, $G_H$, $G_I$. The global image features ($G_V$, $G_H$, $G_I$) and part-level local features ($P_V$, $P_H$, $P_I$) of the three modalities are sent to the Swin Transformer blocks of the last stage (Stage 4), after which feature concatenation is performed (the "C" after the fourth stage in FIG. 2, meaning connect), splicing each modality's global image features and part-level local features along the patch dimension; the final feature outputs $f^V$, $f^I$ and $f^H$ are obtained through an average pooling layer (POOL) and a batch normalization layer (BN).
S2-2 referring to FIG. 4, the mixed mode generation Module (HMG Module) is specifically as follows:
For each pedestrian in a training batch, the 4 visible-light images and 4 infrared images are combined to generate 16 mixed-modality images through the mixed-modality generation module (HMG Module) (each of the 4 visible-light images can be paired with each of the 4 infrared images, giving 16 images). The generation flow is as follows. First, the visible-light and infrared image features are each added to their respective modality-compensation embeddings and restored to the original image proportion. For each visible-light image feature $x_V$ and infrared image feature $x_I$ output by the first stage (Stage 1), to capture and generate image information at different scales, $x_V$ is passed through first dilated (atrous) convolutions with dilation rates of 1, 2 and 4 respectively, the results are added and averaged, and the visible-light-part generation feature is obtained through a fully connected layer and a ReLU activation function. A similar operation is performed on $x_I$: it is passed through second dilated convolutions, with the same dilation rates of 1, 2 and 4 but different parameters from the first dilated convolutions, and the infrared-part generation feature is obtained through a fully connected layer and a ReLU activation function. In addition, to extract scene-level context shared by the infrared and visible-light images and generate background features, $x_V$ and $x_I$ are passed through the same adaptive average pooling and 1×1 convolution, and a pixel-wise dot product is taken to obtain the common background feature, which is then resized to the same size as the image features output by the first stage (Stage 1) via bilinear interpolation; the global background generation feature is obtained through a ReLU activation. Finally, the generation features of the several parts are fused: the visible-light-part generation feature, the infrared-part generation feature and the global background generation feature are added and averaged to obtain the final mixed-modality image features. To keep the generated mixed-modality image features consistent, this embodiment introduces a modality-consistency loss into the total loss,
wherein $L_2(\cdot)$ denotes L2 regularization, $f_t^H$ and $f_s^H$ denote the t-th and s-th generated mixed-modality features, $\|\cdot\|_2$ denotes the Euclidean distance, and n is the number of mixed-modality features of the same pedestrian.
For the 16 generated mixed-modality image features, prediction labels are obtained through the subsequent Patch GAN discriminator, and the 4 mixed-modality image features closest to the mixed-modality label are selected as the final outputs (4 are selected here so that the number of mixed-modality image features equals the number of visible-light/infrared images originally input to the deep learning model). It should be noted that the mixed-modality generation module (HMG Module) generates not only the final mixed-modality image features but also forged mixed-modality features based on the visible-light image features $x_V$ and forged mixed-modality features based on the infrared image features $x_I$; in short, the HMG Module generates the features of the three branches, visible-light, mixed-modality and infrared, and feeds the generated features of the three branches into the second stage (Stage 2) respectively.
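A rough sketch of the mixed-modality generation described above follows (dilation rates, pooling, interpolation and averaging as stated; the channel width, pooled size and layer shapes are assumptions, and the modality-compensation embeddings and discriminator-based selection are omitted).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedModalityGenerator(nn.Module):
    """Sketch of the HMG Module: dilated-conv branches for the visible-light and
    infrared features plus a shared background branch, averaged into a mixed feature."""
    def __init__(self, channels=128):
        super().__init__()
        # Separate dilated (atrous) convolution sets for the visible and infrared inputs.
        self.vis_convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=d, dilation=d) for d in (1, 2, 4)])
        self.ir_convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=d, dilation=d) for d in (1, 2, 4)])
        self.vis_fc = nn.Linear(channels, channels)
        self.ir_fc = nn.Linear(channels, channels)
        # Shared background branch: adaptive pooling followed by a 1x1 convolution.
        self.pool = nn.AdaptiveAvgPool2d(8)      # pooled size is an assumption
        self.bg_conv = nn.Conv2d(channels, channels, 1)

    def _branch(self, x, convs, fc):
        y = torch.stack([c(x) for c in convs]).mean(0)        # average over dilation rates
        y = fc(y.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)     # fully connected over channels
        return F.relu(y)

    def forward(self, x_v, x_i):                              # [B, C, H, W] stage-1 features
        gen_v = self._branch(x_v, self.vis_convs, self.vis_fc)   # visible-part feature
        gen_i = self._branch(x_i, self.ir_convs, self.ir_fc)     # infrared-part feature
        # Common background: same pooling + 1x1 conv on both, then pixel-wise product.
        bg = self.bg_conv(self.pool(x_v)) * self.bg_conv(self.pool(x_i))
        bg = F.relu(F.interpolate(bg, size=x_v.shape[-2:],
                                  mode="bilinear", align_corners=False))
        return (gen_v + gen_i + bg) / 3.0                     # final mixed-modality feature
```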
The model then performs feature alignment between the visible-light and mixed modalities and between the infrared and mixed modalities through adversarial learning. The core of the alignment part is a Generator and a discriminator D, and the modality differences between the visible-light and infrared image features and the mixed-modality features are reduced through the mutual game between the Generator and the discriminator D. The Generator aims to make the mixed-modality features generated from the other modalities as close as possible to the actual mixed-modality features, while the discriminator D aims to distinguish the mixed-modality features from the features of the other two modalities. The visible-light and infrared features, each with its modality compensation added as an embedding (i.e. Visible Compensation Embedding and Infrared Compensation Embedding in FIG. 4, two embedding vectors with the same height and width as the original input image and a dimension of 1), are passed to the Generators, which are divided into a visible-light-to-mixed-modality generator $G_{V-H}$ and an infrared-to-mixed-modality generator $G_{I-H}$. Each Generator is a cascade of three fully connected layers with ReLU activation functions between them to increase the nonlinearity of the Generator, and forged features are obtained after passing through the respective Generators. The discriminators D (i.e. $D_{V-H}$ and $D_{I-H}$ in FIG. 4) have the classical Patch GAN discriminator structure. The mixed-modality features forged from the visible-light image feature $x_V$ or the infrared image feature $x_I$, together with the mixed-modality image features generated by the mixed-modality generation module, are input into the corresponding discriminator D, and adversarial losses relating to the discriminator D are constructed in the total loss function. The adversarial loss is used to judge whether the forged mixed-modality features are the same as the mixed-modality image features generated by the mixed-modality generation module (if the adversarial loss is below a preset value they are considered the same, indicating that the actually generated mixed-modality image features are effective; if it is above the preset value they are considered different, indicating that the actually generated mixed-modality image features are not ideal), thereby achieving feature alignment. The adversarial loss comprises the discriminator loss, the generator loss and the joint adversarial loss, with the formula:
L_GAN = L_D + L_G;
wherein $L_D$ is the discriminator loss, $L_G$ is the generator loss, $L_{GAN}$ is the joint adversarial loss, $N_V$ is the number of visible-light images in the training batch, $N_I$ is the number of infrared images in the training batch, $N_H$ is the number of mixed-modality images in the training batch, $D(\cdot)$ is the discriminator, the i-th visible-light image, the j-th infrared image and the k-th mixed-modality image are the inputs to the losses, $y_V$ is the label of the visible-light image, $y_I$ is the label of the infrared image, $y_H$ is the label of the mixed modality, $G_{V-H}$ is the visible-light-to-mixed-modality generator, and $G_{I-H}$ is the infrared-to-mixed-modality generator.
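A minimal sketch of this adversarial alignment is given below; the hidden widths, the binary real/fake objective and the PatchGAN layer configuration are assumptions, since the text only specifies three fully connected layers with ReLU for each generator and a Patch GAN discriminator.

```python
import torch
import torch.nn as nn

class ModalityToMixedGenerator(nn.Module):
    """Generator G_{V-H} / G_{I-H}: three fully connected layers with ReLU in between."""
    def __init__(self, dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, dim),
        )

    def forward(self, tokens):            # tokens: [B, N, dim] compensated modality features
        return self.net(tokens)           # forged mixed-modality features

class PatchDiscriminator(nn.Module):
    """PatchGAN-style discriminator producing a per-patch real/fake score map."""
    def __init__(self, in_ch=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 1, 3, padding=1),
        )

    def forward(self, feat_map):          # feat_map: [B, C, H, W]
        return self.net(feat_map)

bce = nn.BCEWithLogitsLoss()              # real/fake objective used here as an assumption

def adversarial_losses(disc, real_mixed, fake_mixed):
    d_real = disc(real_mixed)
    d_fake = disc(fake_mixed.detach())
    loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    d_fake_for_g = disc(fake_mixed)
    loss_g = bce(d_fake_for_g, torch.ones_like(d_fake_for_g))
    return loss_d, loss_g, loss_d + loss_g      # L_D, L_G, L_GAN = L_D + L_G
```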
Finally, the aligned infrared, visible-light and mixed-modality features are sent to the subsequent parts of the model for further processing.
S2-3. Referring to FIG. 5, the component detection and exchange module (CDS Module) is specified as follows:
A component detection and exchange module is provided after the last Swin Transformer block of the third stage. The global image features $G_V$, $G_H$, $G_I$ corresponding to the three modalities output by the last Swin Transformer block of the third stage, together with the features output by the first and second stages, are sent to the body part detector (Component Detector) to obtain initial part-level local features composed of several body-part local features. The initial part-level local features are passed through a part prediction network (Predictor Network), which consists of a fully connected layer and a Sigmoid function and predicts a score for each body-part local feature in the initial part-level local features. Body-part local features in the initial part-level local features are then replaced according to the prediction scores: since the generated initial part-level local features inevitably contain redundancy, the body-part local features corresponding to the mixed modality are gradually replaced with the local features detected in the infrared and visible-light modalities through the Swapping module in FIG. 5. If the score of a body-part local feature of the mixed modality is lower than the scores of the corresponding body-part local features of the other two modalities, it is replaced by the higher-scoring one of the two, so that redundant, overlapping body-part local features in the mixed modality are replaced and the local features of the mixed modality become purer. It should be noted that the initial part-level local features corresponding to the infrared and visible-light branches are identical to the final part-level local features $P_I$ and $P_V$; only the part-level local features corresponding to the mixed-modality branch are changed.
Specifically, the body part detector (Component Detector) passes the features output by the first and second stages and the global image features ($G_V$, $G_H$, $G_I$) corresponding to the visible-light, mixed and infrared modalities output by the last Swin Transformer block of the third stage through a 1×1 convolutional neural network (CNN) and a Sigmoid function to obtain attention masks at different levels, resizes them to the same size and multiplies them pixel-wise to obtain a final mask (Mask). The final mask is then multiplied by the three modality features ($G_V$, $G_H$, $G_I$) output by the last Swin Transformer block of the third stage, and the results are finally concatenated along the dimension to obtain the initial part-level local features.
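The mask-based part detection described above can be sketched roughly as follows; the channel counts, the number of parts and the pooling of each masked region to a vector are assumptions, and the Predictor Network and swapping step are only indicated in the trailing comment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ComponentDetector(nn.Module):
    """Sketch of the Component Detector: multi-level 1x1-conv + Sigmoid attention masks,
    fused by pixel-wise multiplication and applied to the stage-3 global features."""
    def __init__(self, chans=(128, 256, 512), num_parts=6):   # channel sizes / part count assumed
        super().__init__()
        self.mask_convs = nn.ModuleList([nn.Conv2d(c, num_parts, 1) for c in chans])

    def forward(self, feats, global_feat):
        # feats: stage-1, stage-2 and stage-3 feature maps; global_feat: [B, C3, H, W]
        h, w = global_feat.shape[-2:]
        masks = [torch.sigmoid(conv(f)) for conv, f in zip(self.mask_convs, feats)]
        masks = [F.interpolate(m, size=(h, w), mode="bilinear", align_corners=False)
                 for m in masks]
        final_mask = masks[0] * masks[1] * masks[2]            # pixel-wise fusion -> [B, P, H, W]
        # Apply each part mask to the global feature and pool to one vector per part.
        parts = [(global_feat * final_mask[:, p:p + 1]).mean(dim=(-2, -1))   # [B, C3]
                 for p in range(final_mask.shape[1])]
        return torch.stack(parts, dim=1)                       # initial part features [B, P, C3]

# The Predictor Network (fully connected layer + Sigmoid) would then score each part,
# and a mixed-modality part scoring below both other modalities would be replaced by
# the higher-scoring infrared/visible-light part, as described above.
```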
A loss related to the body local features is constructed in the total loss function, with the formula as follows:
where N denotes the total number of images in the training batch, N_c denotes the number of body local features, N_id is the total number of pedestrian identity categories, y_i,k denotes the ground-truth category score of the i-th image belonging to the k-th pedestrian, and z_i,j,k denotes the predicted category score of the j-th body local feature of the i-th image belonging to the k-th pedestrian.
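The explicit formula is likewise not reproduced above; a plausible form consistent with these symbols, stated as an assumption rather than the patent's exact expression, is a cross-entropy averaged over images and body local features:

$$
L_{\mathrm{part}}=-\frac{1}{N\,N_c}\sum_{i=1}^{N}\sum_{j=1}^{N_c}\sum_{k=1}^{N_{id}} y_{i,k}\,\log z_{i,j,k}
$$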
The experimental analysis is as follows:
1. Implementation details
This embodiment implements the constructed deep learning model with the PyTorch framework on one NVIDIA RTX 4090D GPU. The backbone network uses Swin Transformer V2 pre-trained on ImageNet-22K, with the final classification layer of Swin Transformer V2 removed so that it directly outputs a 1024-dimensional feature. During testing, this embodiment uses cosine similarity to compare the distance between pedestrian features in the query set and those in the gallery set.
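As a minimal illustration of this test-time comparison (a sketch, not the patent's code; the 1024-dimensional features come from the backbone described above):

```python
import torch
import torch.nn.functional as F

def rank_gallery(query_feats: torch.Tensor, gallery_feats: torch.Tensor) -> torch.Tensor:
    """Rank gallery samples for each query by cosine similarity.

    query_feats: (Nq, 1024), gallery_feats: (Ng, 1024); returns (Nq, Ng) indices,
    most similar first.
    """
    q = F.normalize(query_feats, dim=1)            # L2-normalize so the dot product
    g = F.normalize(gallery_feats, dim=1)          # equals cosine similarity
    sim = q @ g.t()                                # (Nq, Ng) similarity matrix
    return sim.argsort(dim=1, descending=True)
```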
2. Sampling strategy
During training, this embodiment randomly samples 4 pedestrians per batch and extracts 4 visible light images and 4 infrared images for each pedestrian to form a mini-batch of training samples. Before being fed into the model, all images are resized to 288×144×3 and data augmentation is applied: the visible light images undergo random cropping, random horizontal flipping and random erasing, plus one randomly chosen enhancement among channel random erasing, random channel exchange, color jitter and spectrum jitter; the infrared images undergo only random cropping, random horizontal flipping and random erasing. Finally, the images are normalized with a mean of [0.485, 0.456, 0.406] and a standard deviation of [0.229, 0.224, 0.225], reducing the risk of overfitting during training.
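A sketch of the visible-light augmentation pipeline in torchvision terms is given below. Channel random erasing and spectrum jitter are not standard torchvision transforms, so only a hypothetical channel-exchange transform is shown alongside color jitter; the crop padding, probabilities and jitter strengths are assumptions.

```python
import torch
from torchvision import transforms

class RandomChannelExchange:
    """Hypothetical transform: randomly permute the RGB channels of a tensor image."""
    def __call__(self, img: torch.Tensor) -> torch.Tensor:
        return img[torch.randperm(3), :, :]

visible_train_transform = transforms.Compose([
    transforms.Resize((288, 144)),
    transforms.RandomCrop((288, 144), padding=10),   # random cropping (padding assumed)
    transforms.RandomHorizontalFlip(p=0.5),          # random horizontal flipping
    transforms.ToTensor(),
    transforms.RandomChoice([                        # one randomly chosen enhancement
        RandomChannelExchange(),                     # random channel exchange
        transforms.ColorJitter(0.2, 0.2, 0.2),       # color jitter (strength assumed)
    ]),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.5),                 # random erasing
])
```

The infrared pipeline would drop the RandomChoice step and keep only cropping, flipping, erasing and normalization.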
During testing, this embodiment uses a batch size of 64 to extract the features of the query set and the gallery set respectively. Both the visible light images and the infrared images are resized to 288×144×3 before being fed into the model, and the images are then normalized with a mean of [0.485, 0.456, 0.406] and a standard deviation of [0.229, 0.224, 0.225].
3. Training parameter settings
The model is trained for a total of 40 epochs. All parameters of the network model are randomly initialized with Kaiming initialization before training begins. During training, an SGD optimizer is used to update the parameters, with a momentum of 0.9 and a weight decay factor of 0.001, and an initial learning rate is set for gradient descent. A warmup learning rate schedule (Warmup) is adopted with 5 warmup iterations and a warmup factor of 0.01, the warmup learning rate increasing linearly. The learning rate is then decayed to 0.002 times the initial learning rate following a cosine decay schedule.
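A minimal sketch of this optimizer and warmup-plus-cosine schedule is shown below; the base learning rate, which is not specified here, appears as a placeholder, and the per-epoch granularity of the warmup is an assumption.

```python
import math
import torch
import torch.nn as nn

model = nn.Linear(1024, 1000)   # stand-in module for the actual re-identification model

base_lr = 0.01                  # placeholder: the initial learning rate is not given here
warmup_epochs, warmup_factor = 5, 0.01
total_epochs, final_ratio = 40, 0.002

optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                            momentum=0.9, weight_decay=0.001)

def lr_lambda(epoch: int) -> float:
    if epoch < warmup_epochs:                          # linear warmup from warmup_factor to 1
        alpha = epoch / warmup_epochs
        return warmup_factor * (1 - alpha) + alpha
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay toward final_ratio
    return final_ratio + (1 - final_ratio) * cosine

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(total_epochs):
    # ... one training epoch over the sampled mini-batches ...
    scheduler.step()
```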
4. Comparison of Performance with common models
The deep learning model constructed in this embodiment is compared with current state-of-the-art models on the SYSU-MM01 dataset. No post-processing techniques such as re-ranking or fusion ranking are used when the model performs pedestrian re-identification, and it can easily be seen from Table 1 that the deep learning model constructed in this embodiment achieves a good effect.
Table 1 model comparison table
Example two
This embodiment provides a cross-modal pedestrian re-identification system, which comprises:
the selection module is used for randomly selecting a plurality of pedestrians from the infrared visible light data set in each training batch, wherein each pedestrian comprises 4 visible light images and 4 infrared images;
the preprocessing module is used for preprocessing the visible light image and the infrared image;
the feature extraction module is used for inputting the preprocessed visible light images and infrared images into a deep learning model, and extracting visible light features, infrared features and mixed mode features through the deep learning model;
the construction module is used for constructing the total loss and judging whether training of the deep learning model is complete: if the total loss is lower than a preset threshold, training ends; if the total loss is higher than the preset threshold, training continues until the total loss falls below the preset threshold,
wherein the total loss includes a local maximum mean discrepancy loss constructed based on the visible light features and infrared features, an inter-modal divergence loss constructed based on the visible light features and infrared features, a circle loss constructed based on the visible light features, infrared features and mixed-modality features, and a cross-entropy loss constructed based on the visible light features, infrared features and mixed-modality features (a minimal sketch of the total-loss assembly is given after this list);
and the recognition module is used for recognizing pedestrians from the visible light images or infrared images to be recognized, using the trained deep learning model.
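As a minimal sketch of how the construction module could assemble the total loss and apply the threshold test, assuming equal weights and a placeholder threshold (neither is specified in the text):

```python
# Minimal sketch; equal loss weights and the threshold value are assumptions.
from typing import Dict


def total_loss(terms: Dict[str, float]) -> float:
    """Sum the four loss terms listed above."""
    return terms["lmmd"] + terms["divergence"] + terms["circle"] + terms["cross_entropy"]


def training_finished(loss_value: float, threshold: float = 0.1) -> bool:
    """Training ends once the total loss falls below the preset threshold."""
    return loss_value < threshold
```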
Example III
This embodiment provides an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the cross-modal pedestrian re-identification method of embodiment one.
Example IV
This embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the cross-modal pedestrian re-identification method of embodiment one.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The schemes in the embodiments of the present application may be implemented in various computer languages, such as the object-oriented programming language Java and the interpreted scripting language JavaScript.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It is apparent that the above examples are given by way of illustration only and do not limit the embodiments. Other variations and modifications will be apparent to those of ordinary skill in the art in light of the foregoing description. It is neither necessary nor possible to exhaustively enumerate all embodiments here, and obvious variations or modifications derived therefrom remain within the scope of the present invention.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411690034.8A CN119600653B (en) | 2024-11-25 | 2024-11-25 | A cross-modal person re-identification method and system |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411690034.8A CN119600653B (en) | 2024-11-25 | 2024-11-25 | A cross-modal person re-identification method and system |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN119600653A true CN119600653A (en) | 2025-03-11 |
| CN119600653B CN119600653B (en) | 2025-10-14 |
Family
ID=94843393
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202411690034.8A Active CN119600653B (en) | 2024-11-25 | 2024-11-25 | A cross-modal person re-identification method and system |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN119600653B (en) |
Patent Citations (12)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11804060B1 (en) * | 2021-07-26 | 2023-10-31 | Amazon Technologies, Inc. | System for multi-modal anomaly detection |
| CN113989851A (en) * | 2021-11-10 | 2022-01-28 | 合肥工业大学 | A Cross-modal Person Re-identification Method Based on Heterogeneous Fusion Graph Convolutional Networks |
| US20230162023A1 (en) * | 2021-11-25 | 2023-05-25 | Mitsubishi Electric Research Laboratories, Inc. | System and Method for Automated Transfer Learning with Domain Disentanglement |
| CN114220124A (en) * | 2021-12-16 | 2022-03-22 | 华南农业大学 | Near-infrared-visible light cross-modal double-flow pedestrian re-identification method and system |
| CN114495010A (en) * | 2022-02-14 | 2022-05-13 | 广东工业大学 | A cross-modal pedestrian re-identification method and system based on multi-feature learning |
| WO2023222643A1 (en) * | 2022-05-17 | 2023-11-23 | Continental Automotive Technologies GmbH | Method for image segmentation matching |
| WO2024021394A1 (en) * | 2022-07-29 | 2024-02-01 | 南京邮电大学 | Person re-identification method and apparatus for fusing global features with ladder-shaped local features |
| CN115359554A (en) * | 2022-08-08 | 2022-11-18 | 天津师范大学 | A Cross-modal Pedestrian Retrieval Method for Autonomous Driving |
| US20240386653A1 (en) * | 2023-05-17 | 2024-11-21 | Salesforce, Inc. | Systems and methods for reconstructing a three-dimensional object from an image |
| CN116958584A (en) * | 2023-09-21 | 2023-10-27 | 腾讯科技(深圳)有限公司 | Key point detection method, regression model training method and device and electronic equipment |
| CN117975556A (en) * | 2024-01-15 | 2024-05-03 | 东北大学佛山研究生创新学院 | Cross-modal pedestrian re-recognition method based on three-modal collaborative learning |
| CN118038499A (en) * | 2024-04-12 | 2024-05-14 | 北京航空航天大学 | Cross-mode pedestrian re-identification method based on mode conversion |
Non-Patent Citations (6)
| Title |
|---|
| WENBO YU ET AL.: "Crossmodal Sequential Interaction Network for Hyperspectral and LiDAR Data Joint Classification", 《IEEE GEOSCIENCE AND REMOTE SENSING LETTERS》, vol. 21, 13 February 2024 (2024-02-13), pages 1 - 5 * |
| YUKANG ZHANG ET AL.: "Adaptive Middle Modality Alignment Learning for Visible-Infrared Person Re-identification", 《INTERNATIONAL JOURNAL OF COMPUTER VISION》, vol. 133, 9 November 2024 (2024-11-09), pages 2176, XP038124276, DOI: 10.1007/s11263-024-02276-4 * |
| 冯敏;张智成;吕进;余磊;韩斌;: "基于生成对抗网络的跨模态行人重识别研究", 《现代信息科技》, vol. 4, no. 04, 25 February 2020 (2020-02-25), pages 107 - 109 * |
| 张典;汪海涛;姜瑛;陈星;: "基于轻量网络的近红外光和可见光融合的异质人脸识别", 《小型微型计算机系统》, vol. 41, no. 04, 9 April 2020 (2020-04-09), pages 807 - 811 * |
| 范慧杰等: "可见光红外跨模态行人重识别方法综述", 《信息与控制 》, vol. 54, no. 01, 27 September 2024 (2024-09-27), pages 50 - 65 * |
| 陈丹;李永忠;于沛泽;邵长斌;: "跨模态行人重识别研究与展望", 《计算机系统应用》, vol. 29, no. 10, 13 October 2020 (2020-10-13), pages 20 - 28 * |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN120071051A (en) * | 2025-04-28 | 2025-05-30 | 北京航空航天大学杭州创新研究院 | Method and device for generating countermeasure sample for intelligent driving perception test |
Also Published As
| Publication number | Publication date |
|---|---|
| CN119600653B (en) | 2025-10-14 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Li et al. | ConvTransNet: A CNN–transformer network for change detection with multiscale global–local representations | |
| Zhang et al. | Cross-modality interactive attention network for multispectral pedestrian detection | |
| CN114783024A (en) | Face recognition system of gauze mask is worn in public place based on YOLOv5 | |
| CN117292313A (en) | A small target floating garbage detection method based on the improved YOLOv7 model | |
| CN116704611B (en) | A cross-view gait recognition method based on motion feature mixing and fine-grained multi-stage feature extraction | |
| CN113689382B (en) | Tumor postoperative survival prediction method and system based on medical images and pathological images | |
| CN113344814A (en) | High-resolution countermeasure sample synthesis method based on generation mechanism | |
| CN120580436B (en) | Image segmentation method based on deep learning remote sensing image | |
| CN119600653B (en) | A cross-modal person re-identification method and system | |
| CN118015342A (en) | A multi-source target detection method based on prototype network dynamic balance optimization strategy | |
| Chen et al. | DDGAN: dense residual module and dual-stream attention-guided generative adversarial network for colorizing near-infrared images | |
| Zhang et al. | Adaptive transformer with Pyramid Fusion for cloth-changing Person Re-Identification | |
| Lu et al. | Cross-modality person re-identification based on intermediate modal generation | |
| Wang et al. | A Novel Transformer-based Multiscale Siamese Framework for High-resolution Remote Sensing Change Detection | |
| Yuan et al. | Instant pose extraction based on mask transformer for occluded person re-identification | |
| Lan et al. | Infrared dim and small targets detection via self-attention mechanism and pipeline correlator | |
| CN120543828A (en) | A small target detection method for industrial defect detection scenarios | |
| CN114419729A (en) | A Behavior Recognition Method Based on Lightweight Dual-Stream Network | |
| CN120108040A (en) | A multi-scale adaptive gait recognition method and system based on inner convolution | |
| Chen et al. | Two-stage dual-resolution face network for cross-resolution face recognition in surveillance systems | |
| Li et al. | A New Method for Vehicle Logo Recognition Based on Swin Transformer | |
| Sun et al. | Ship re-identification in foggy weather: A two-branch network with dynamic feature enhancement and dual attention | |
| CN119600305B (en) | Multi-scale feature extraction method, system, equipment and storage medium for enhancing image similarity matching | |
| Dong et al. | Cycle Translation-Based Collaborative Training for Hyperspectral-RGB Multimodal Change Detection | |
| Wu et al. | Image Colorization Algorithm Based on Improved GAN |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||