
CN119600653A - A cross-modal pedestrian re-identification method and system - Google Patents

A cross-modal pedestrian re-identification method and system

Info

Publication number
CN119600653A
CN119600653A (Application No. CN202411690034.8A)
Authority
CN
China
Prior art keywords
visible light
features
infrared
image
feature
Prior art date
Legal status
Granted
Application number
CN202411690034.8A
Other languages
Chinese (zh)
Other versions
CN119600653B (en)
Inventor
黄鹤
徐毅
Current Assignee
Suzhou University
Original Assignee
Suzhou University
Priority date
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN202411690034.8A priority Critical patent/CN119600653B/en
Publication of CN119600653A publication Critical patent/CN119600653A/en
Application granted granted Critical
Publication of CN119600653B publication Critical patent/CN119600653B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0475Generative networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/094Adversarial learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a cross-modal pedestrian re-identification method and system. The method comprises: randomly selecting a plurality of pedestrians from an infrared-visible dataset in each training batch; preprocessing the visible light images and infrared images; inputting the preprocessed visible light and infrared images into a deep learning model and extracting visible light features, infrared features and mixed-modality features through the deep learning model; constructing a total loss and judging whether training of the deep learning model is complete, wherein training ends if the total loss is below a preset threshold and otherwise continues until the total loss falls below the preset threshold; and identifying pedestrians in the visible light or infrared images to be recognized with the trained deep learning model. The deep learning model constructed by the invention enables effective cross-modal pedestrian re-identification.

Description

Cross-modal pedestrian re-identification method and system
Technical Field
The invention relates to the technical field of pedestrian re-identification, and in particular to a cross-modal pedestrian re-identification method and system.
Background
In recent years, pedestrian re-identification has become an important task in computer vision and video surveillance. By comparing pedestrian images captured by different cameras, it automatically identifies and matches the same individual, and it is therefore widely applied in intelligent surveillance, video retrieval, intelligent transportation and other fields. However, conventional pedestrian re-identification methods typically rely on single-modality visual feature extraction and show significant limitations in cross-modal and low-light scenes. For example, there are substantial modality differences between infrared and visible light images, which makes feature alignment and cross-modal matching very challenging.
To address this problem, infrared-visible cross-modal pedestrian re-identification methods based on deep learning have shown excellent performance: they extract and align cross-modal features better and thereby markedly improve recognition accuracy. Infrared-visible cross-modal pedestrian re-identification exploits the correlation between visible light and infrared images to achieve cross-modal pedestrian recognition, which is of great significance in all-weather surveillance, especially at night or under low illumination, and has therefore become an important research direction in computer vision.
As such a research direction, infrared-visible cross-modal pedestrian re-identification aims to reduce the modality differences between infrared and visible light image features so as to achieve cross-modal pedestrian matching and recognition. It nevertheless faces several challenges, the most significant of which are the large modality differences and the scarcity of cross-modal training data.
(1) Large modality differences
Infrared and visible light images differ markedly in their visual characteristics. A visible light image is formed by the reflection of natural light and provides rich color information, texture details and contour features, whereas an infrared image is formed by the thermal radiation of objects, contains no color information, and mainly reflects the temperature distribution of the scene. This pronounced modality difference means that the appearance of the same person differs greatly between infrared and visible light images, so direct matching with conventional feature extraction methods is difficult.
A cross-modal pedestrian re-identification implementation therefore needs to align these heterogeneous visual features to narrow the modality gap. Two families of methods exist. The first is feature-level: the visible light and infrared features extracted by the backbone network are projected into the same embedding space, and matching is performed in the projected space to reduce the influence of modality differences. However, because the modality differences are large, such methods struggle to project cross-modal images directly into a feature space in which infrared and visible light features align, and even when a feature mapping is learned, many useful shared features are lost in the conversion from the original features to the new ones. The second is image-level: a generative adversarial network is used either to convert infrared and visible light images into a unified modality for feature extraction, or to patch the missing modality, for example converting an infrared (or visible light) image into a visible light (or infrared) image, after which features are fused with the existing modality and matching is performed in the fused feature space. Because the modality differences are large and paired infrared-visible images are scarce, generating cross-modal images is difficult and usually introduces considerable noise.
(2) Scarce cross-modal training data
Another major difficulty is the scarcity of cross-modal training data. Compared with the single-modality visible light pedestrian re-identification task, an infrared-visible cross-modal dataset must contain paired infrared and visible light images, yet such datasets are costly to collect and complex to annotate. In real surveillance scenes, especially at night, infrared images are easier to acquire while the corresponding visible light images are often missing, which leads to a shortage of paired infrared-visible data.
Insufficient data directly affects the generalization ability of deep learning models, which typically require a large amount of annotated data to learn effective cross-modal features. This is especially true for models whose backbone is a Transformer or one of its variants: such models have no spatial or temporal inductive bias and a huge number of parameters, and they need enough data to exploit their structural advantages, establish global dependencies and generalize well. Because cross-modal datasets are scarce, the model easily overfits during training, which degrades recognition performance in practical applications. To alleviate this, current infrared-visible cross-modal pedestrian re-identification methods either apply data augmentation techniques such as channel erasing, channel exchange, random erasing and grayscale augmentation to add new samples, all of which are derived from the original images and therefore lack diversity, or generate more samples within the same modality, for example with generative adversarial networks or by mixing local body features of pedestrians. All of these, however, only augment data within a modality, and the challenges caused by data scarcity remain hard to resolve completely.
In summary, the prior art cannot effectively handle the large modality differences between infrared and visible light image features together with the scarcity of cross-modal training data, so cross-modal pedestrian matching and recognition results are poor.
Disclosure of Invention
Therefore, the technical problem to be solved by the invention is that, in the prior art, infrared and visible light images have large modality differences and cross-modal training data are scarce, so that cross-modal pedestrian matching and recognition perform poorly.
In order to solve the above technical problem, the invention provides a cross-modal pedestrian re-identification method comprising the following steps:
Step S1: randomly selecting a plurality of pedestrians from an infrared-visible dataset in each training batch, wherein each pedestrian includes 4 visible light images and 4 infrared images;
Step S2: preprocessing the visible light images and infrared images;
Step S3: inputting the preprocessed visible light and infrared images into a deep learning model, and extracting visible light features, infrared features and mixed-modality features through the deep learning model;
Step S4: constructing a total loss and judging whether training of the deep learning model is complete, wherein training ends if the total loss is below a preset threshold and otherwise continues until the total loss falls below the preset threshold,
the total loss including a local maximum mean discrepancy loss constructed based on the visible light features and infrared features, an inter-modality divergence loss constructed based on the visible light features and infrared features, a circle loss constructed based on the visible light features, infrared features and mixed-modality features, and a cross-entropy loss constructed based on the visible light features, infrared features and mixed-modality features;
Step S5: identifying pedestrians in the visible light or infrared images to be recognized with the trained deep learning model.
In one embodiment of the present invention, the local maximum mean discrepancy loss constructed in step S4 based on the visible light features and infrared features is:
L_{lmmd} = \frac{1}{N_{id}} \sum_{k=1}^{N_{id}} \Big\| \sum_{i=1}^{N_V} w_{i,k}^{V} \phi(f_i^V) - \sum_{j=1}^{N_I} w_{j,k}^{I} \phi(f_j^I) \Big\|_{\mathcal{H}}^{2}
wherein \mathcal{H} denotes the reproducing kernel Hilbert space, \phi(\cdot) denotes the mapping into the reproducing kernel Hilbert space, N_{id} denotes the total number of pedestrian identity classes, N_V denotes the number of visible light images in a training batch, f_i^V denotes the feature of the i-th visible light image, w_{i,k}^{V} denotes the probability weight that f_i^V belongs to class k, N_I denotes the number of infrared images in the training batch, f_j^I denotes the feature of the j-th infrared image, w_{j,k}^{I} denotes the probability weight that f_j^I belongs to class k, and \|a-b\|^2 denotes the squared distance between a and b.
In one embodiment of the present invention, the inter-modality divergence loss constructed in step S4 based on the visible light features and infrared features is defined with:
N_V and N_I denoting the numbers of visible light and infrared images in the training batch, C_V and C_I denoting the visible light and infrared image classifiers respectively, f_j^I denoting the feature of the j-th infrared image, and f_i^V denoting the feature of the i-th visible light image.
In one embodiment of the present invention, the circle loss constructed in step S4 based on the visible light features, infrared features and mixed-modality features is:
L_{circle} = \log\Big[1 + \sum_{j \neq y} \exp\big(\gamma\,\alpha_n^{j}(s_n^{j} - \Delta_n)\big)\,\exp\big(-\gamma\,\alpha_p(s_p - \Delta_p)\big)\Big]
wherein s_p = w_y^{\mathrm{T}} f_m / (\|w_y\|\,\|f_m\|), s_n^{j} = w_j^{\mathrm{T}} f_m / (\|w_j\|\,\|f_m\|), m \in \{V, I, H\}; s_p and s_n^{j} denote the within-class and between-class similarity scores, w_j and w_y denote the non-target and target classification weights, w_j^{\mathrm{T}} and w_y^{\mathrm{T}} denote the transposes of w_j and w_y, f_m denotes the image feature belonging to modality m, m is the modality to which the image feature belongs, V, I and H denote the visible light, infrared and mixed modalities respectively, \|\cdot\| denotes the L2 norm, \alpha_p and \alpha_n denote the first and second weight coefficients of the circle loss, and \Delta_n, \Delta_p and \gamma are the first, second and third hyperparameters respectively.
In one embodiment of the present invention, the cross-entropy loss constructed in step S4 based on the visible light features, infrared features and mixed-modality features is specifically constructed as follows:
the visible light features, infrared features and mixed-modality features are passed through classification layers to obtain their respective class scores, and the cross-entropy loss is constructed based on these class scores, with the formula:
L_{ce} = -\sum_{m \in \{V, I, H\}} \frac{1}{N_m} \sum_{i=1}^{N_m} \sum_{k=1}^{N_{id}} y_{i,k} \log \hat{y}_{i,k}^{m}
wherein N_V, N_I and N_H denote the numbers of visible light, infrared and mixed-modality images in the training batch, N_{id} denotes the total number of pedestrian identity classes, y_{i,k} denotes the true class score of the i-th image for the k-th pedestrian, \hat{y}_{i,k}^{m} denotes the predicted class score of the i-th image for the k-th pedestrian, m is the modality of the image feature, and V, I and H denote the visible light, infrared and mixed modalities respectively.
In one embodiment of the present invention, the deep learning model in step S3 includes a first stage, a mixed-modality generation module, a second stage, a third stage and a fourth stage connected in sequence, each stage including at least one Swin Transformer block appearing in pairs, each paired Swin Transformer block including a first sub-block and a second sub-block connected in sequence, wherein
the first sub-block includes a first normalization layer LN, a window-based multi-head self-attention layer W-MSA, a second normalization layer LN and a first multi-layer perceptron MLP connected in sequence, wherein the input of the first normalization layer LN and the output of the window-based multi-head self-attention layer W-MSA are added element-wise to form the input of the second normalization layer LN, and the input of the second normalization layer LN and the output of the first multi-layer perceptron MLP are added element-wise to form the input of the second sub-block;
the second sub-block includes a third normalization layer LN, a shifted-window multi-head self-attention layer SW-MSA, a fourth normalization layer LN and a second multi-layer perceptron MLP connected in sequence, wherein the input of the third normalization layer LN and the output of the shifted-window multi-head self-attention layer SW-MSA are added element-wise to form the input of the fourth normalization layer LN, and the input of the fourth normalization layer LN and the output of the second multi-layer perceptron MLP are added element-wise to form the output of the paired Swin Transformer block;
the first Swin Transformer block of the first stage is preceded by a Linear Embedding module, which converts the image into the patches required for the Swin Transformer block operations, and the first Swin Transformer block of each of the second, third and fourth stages is preceded by a Patch Merging module, which reduces the number of patches while increasing their dimension so as to form a hierarchical feature representation.
In one embodiment of the present invention, obtaining the mixed-modality image features from the visible light image features and infrared image features output by the first stage through the mixed-modality generation module includes:
setting the feature of each visible light image output by the first stage to x_V and the feature of each infrared image to x_I;
convolving x_V with first dilated convolutions having dilation rates of 1, 2 and 4 respectively, adding and averaging the results, and obtaining the generated features of the visible light part through a fully connected layer and a ReLU activation function;
convolving x_I with second dilated convolutions having dilation rates of 1, 2 and 4, and obtaining the generated features of the infrared part through a fully connected layer and a ReLU activation function;
in order to extract scene-level context information common to the infrared and visible light images and generate background features, applying the same adaptive pooling and 1×1 convolution to x_V and x_I and taking their pixel-wise dot product to obtain common background features, resizing the common background features to the same size as the image features output by the first stage through bilinear interpolation, and obtaining the global background generated features through a ReLU activation function;
finally, adding and averaging the generated features of the visible light part, the generated features of the infrared part and the global background generated features to obtain the final mixed-modality image features;
In order to maintain the consistency of the final mixed-modality features, a modality consistency loss is introduced into the total loss,
wherein L_2(\cdot) denotes L2 regularization, f_t^H denotes the t-th generated image feature of the mixed modality, f_s^H denotes the s-th generated image feature of the mixed modality, \|\cdot\|_2 denotes the Euclidean distance, and n is the number of mixed-modality features of the same pedestrian.
In one embodiment of the invention, the mixed-modality generation module further performs, through adversarial learning, feature alignment between the visible light image features x_V output by the first stage and the mixed-modality image features generated by the mixed-modality generation module, and between the infrared image features x_I output by the first stage and the mixed-modality image features generated by the mixed-modality generation module, including:
reducing, through a generator and a discriminator, the modality differences between the visible light image features x_V and the mixed-modality image features generated by the mixed-modality generation module, and between the infrared image features x_I and the mixed-modality image features generated by the mixed-modality generation module, wherein
the generator consists of three fully connected layers with ReLU activation functions between them to increase the nonlinearity of the generator; the visible light image features x_V and infrared image features x_I output by the first stage are added to compensation embedding vectors having the same length and width as the original input image and a dimension of 1, and forged mixed-modality features are obtained after passing through the corresponding generators;
the discriminator is a PatchGAN discriminator; the mixed-modality image features forged from the visible light image features x_V or infrared image features x_I and the mixed-modality image features generated by the mixed-modality generation module are input into the discriminator, and adversarial losses with respect to the discriminator are constructed in the total loss function, the adversarial losses being used to judge whether the forged mixed-modality features are the same as the mixed-modality image features generated by the mixed-modality generation module; the adversarial losses include the discriminator loss, the generator loss and the joint adversarial loss, with the formula:
L_GAN = L_D + L_G;
wherein L_D is the discriminator loss, L_G is the generator loss, L_GAN is the joint adversarial loss, N_V is the number of visible light images in the training batch, N_I is the number of infrared images in the training batch, N_H is the number of mixed-modality images in the training batch, D(\cdot) is the discriminator, x_i^V is the i-th visible light image, x_j^I is the j-th infrared image, x_k^H is the k-th mixed-modality image, y_V is the label of a visible light image, y_I is the label of an infrared image, y_H is the label of the mixed modality, G_{V-H} is the visible-to-mixed-modality generator, and G_{I-H} is the infrared-to-mixed-modality generator.
In one embodiment of the present invention, a component detection and exchange module is further provided after the last Swin Transformer block of the third stage, the component detection and exchange module including:
the global image features G_V, G_H, G_I corresponding to the visible light, mixed and infrared modalities output by the last Swin Transformer block of the third stage, together with the features output by the first and second stages, are sent to the body part detector (Component Detector) to obtain initial part-local features composed of a plurality of body part-local features; the initial part-local features pass through a part prediction network (Predictor Network), which consists of a fully connected layer and a Sigmoid function and predicts the score of each body part-local feature among the initial part-local features; body part features in the initial part-local features are then replaced according to the predicted scores: if a body part feature score of the mixed modality is smaller than the corresponding body part feature scores of the other two modalities, the body part feature of the mixed modality is replaced with the body part feature having the larger score from the other two modalities, so that the overlapping and redundant body part-local features of the mixed modality are replaced, yielding the part-local features P_V, P_H, P_I;
the body part detector (Component Detector) passes the features output by the first and second stages and the features output by the last Swin Transformer block of the third stage through a 1×1 convolutional neural network and a Sigmoid function, respectively, to obtain attention masks at different levels, resizes the attention masks to the same size and multiplies them pixel-wise to obtain a final mask, then multiplies the final mask with the global image features G_V, G_H, G_I corresponding to the visible light, mixed and infrared modalities output by the last Swin Transformer block of the third stage, and finally concatenates the results along the dimension to obtain the initial part-local features;
a part identity loss with respect to the part features is constructed in the total loss function, with the formula:
L_{pid} = -\frac{1}{N\,N_c} \sum_{i=1}^{N} \sum_{j=1}^{N_c} \sum_{k=1}^{N_{id}} y_{i,k} \log z_{i,j,k}
wherein N denotes the total number of images in the training batch, N_c denotes the number of body part-local features, N_{id} is the total number of pedestrian identity classes, y_{i,k} denotes the true class score of the i-th image for the k-th pedestrian, and z_{i,j,k} denotes the predicted class score with which the j-th body part-local feature of the i-th image belongs to the k-th pedestrian.
In order to solve the above technical problem, the invention further provides a cross-modal pedestrian re-identification system, including:
a selection module, configured to randomly select a plurality of pedestrians from an infrared-visible dataset in each training batch, wherein each pedestrian includes 4 visible light images and 4 infrared images;
a preprocessing module, configured to preprocess the visible light images and infrared images;
a feature extraction module, configured to input the preprocessed visible light and infrared images into a deep learning model and extract visible light features, infrared features and mixed-modality features through the deep learning model;
a construction module, configured to construct a total loss and judge whether training of the deep learning model is complete, wherein training ends if the total loss is below a preset threshold and otherwise continues until the total loss falls below the preset threshold,
the total loss including a local maximum mean discrepancy loss constructed based on the visible light features and infrared features, an inter-modality divergence loss constructed based on the visible light features and infrared features, a circle loss constructed based on the visible light features, infrared features and mixed-modality features, and a cross-entropy loss constructed based on the visible light features, infrared features and mixed-modality features; and
a recognition module, configured to identify pedestrians in the visible light or infrared images to be recognized with the trained deep learning model.
In order to solve the above technical problem, the invention further provides an electronic device, including a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above cross-modal pedestrian re-identification method when executing the computer program.
In order to solve the above technical problem, the invention further provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the above cross-modal pedestrian re-identification method.
Compared with the prior art, the technical scheme of the invention has the following advantages:
the deep learning model constructed by the invention adopts a novel three-modality branch structure and learns the mixed modality from the visible light and infrared images, which effectively increases the number of training samples and benefits subsequent pedestrian re-identification;
the invention introduces a mixed-modality generation module into the deep learning model and realizes alignment between the visible light images and the mixed-modality images and between the infrared images and the mixed-modality images through learnable modality compensation embeddings and an adversarial generation network, which effectively reduces the modality difference between the infrared and visible light modalities in actual retrieval;
the invention introduces a component detection and exchange module into the deep learning model, which explores subtle differences among images while reducing the difficulty of alignment.
Drawings
In order that the invention may be more readily understood, a more particular description of the invention will be rendered by reference to specific embodiments thereof that are illustrated in the appended drawings.
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of a deep learning model structure in an embodiment of the invention;
FIG. 3 is a structural diagram of a Swin Transformer block according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a hybrid modality generation module according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a component detection and switching module in accordance with an embodiment of the present invention.
Detailed Description
The present invention will be further described below with reference to the accompanying drawings and specific embodiments, which are not intended to limit the invention, so that those skilled in the art can better understand and practice the invention.
Example 1
Referring to Fig. 1, the cross-modal pedestrian re-identification method of the invention comprises the following steps:
Step S1: randomly selecting a plurality of pedestrians from an infrared-visible dataset in each training batch, wherein each pedestrian includes 4 visible light images and 4 infrared images;
Step S2: preprocessing the visible light images and infrared images;
Step S3: inputting the preprocessed visible light and infrared images into a deep learning model, and extracting visible light features, infrared features and mixed-modality features through the deep learning model;
Step S4: constructing a total loss and judging whether training of the deep learning model is complete, wherein training ends if the total loss is below a preset threshold and otherwise continues until the total loss falls below the preset threshold,
the total loss including a local maximum mean discrepancy loss constructed based on the visible light features and infrared features, an inter-modality divergence loss constructed based on the visible light features and infrared features, a circle loss constructed based on the visible light features, infrared features and mixed-modality features, and a cross-entropy loss constructed based on the visible light features, infrared features and mixed-modality features;
Step S5: identifying pedestrians in the visible light or infrared images to be recognized with the trained deep learning model. Specifically, the model identifies a pedestrian in an infrared image to be recognized given a visible light image, or identifies a pedestrian in a visible light image to be recognized given an infrared image. It should be noted that, since the deep learning model trained in this embodiment has three branches, the branch corresponding to the infrared or visible light images can be selected in actual use according to the application scenario.
The present embodiment is described in detail below:
The deep learning model provided by the invention is trained with a three-branch network whose backbone is Swin Transformer V2. The three branches of the model are a visible light branch, an infrared branch and a mixed-modality branch, and the model parameters of the three branches are shared. The training data of the mixed-modality branch are formed by mixing visible light and infrared images, which effectively increases the number of samples for model training; at the same time, the model aligns the visible light and infrared image features with the mixed-modality features, which effectively reduces the differences between modalities and the overfitting to the training samples. To discover subtle differences among images, the model extracts local features by generating masks in the third stage and performs progressive replacement of body part-local features among the three modality branches, so that the overlapping and redundant body part-local features of the mixed modality are replaced and the local features of the mixed modality become purer. Fig. 2 shows the structure of the whole model, and the specific training process of the model is as follows:
S1. During each training batch, 4 pedestrians are randomly selected from the infrared-visible dataset, and each pedestrian includes 4 visible light images and 4 infrared images.
S2. All pedestrian images are preprocessed. All images are first scaled to 288 × 144 × 3 and data augmentation is applied: visible light images undergo random cropping, random horizontal flipping and random erasing plus one randomly chosen augmentation among random channel erasing, random channel exchange, color jitter and spectrum jitter, while infrared images undergo only random cropping, random horizontal flipping and random erasing; finally, the images are normalized with a mean of [0.485, 0.456, 0.406] and a standard deviation of [0.229, 0.224, 0.225]. A minimal sketch of this pipeline is given below.
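The following is a minimal torchvision-style sketch of this preprocessing, given only as an illustration; RandomChannelExchange is a hypothetical stand-in for the channel-level augmentations (random channel erasing, random channel exchange, color jitter, spectrum jitter), whose exact implementations are not given here.

```python
import random
import torch
from torchvision import transforms

class RandomChannelExchange:
    """Randomly permute the RGB channels of a tensor image (illustrative augmentation)."""
    def __init__(self, p=0.5):
        self.p = p
    def __call__(self, img):
        if random.random() < self.p:
            img = img[torch.randperm(3), :, :]
        return img

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

visible_transform = transforms.Compose([
    transforms.Resize((288, 144)),
    transforms.RandomCrop((288, 144), padding=10),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    RandomChannelExchange(p=0.5),     # stand-in for the channel-level augmentations
    normalize,
    transforms.RandomErasing(p=0.5),
])

infrared_transform = transforms.Compose([
    transforms.Resize((288, 144)),
    transforms.RandomCrop((288, 144), padding=10),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    normalize,
    transforms.RandomErasing(p=0.5),
])
```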
S3. The processed visible light and infrared images are input into the model to obtain the features of the three modalities, visible light, infrared and mixed, generated and extracted by the model.
S4. The total loss is constructed and it is judged whether the deep learning model has finished training: if the total loss is below a preset threshold, training ends; otherwise, training continues until the total loss falls below the preset threshold.
Specifically, the total loss includes a local maximum mean discrepancy loss, an inter-modality divergence loss, a circle loss and a cross-entropy loss, as follows:
(1) The extracted visible light features f^V and infrared features f^I are supervised with the local maximum mean discrepancy loss and the inter-modality divergence loss, where the local maximum mean discrepancy loss function is:
L_{lmmd} = \frac{1}{N_{id}} \sum_{k=1}^{N_{id}} \Big\| \sum_{i=1}^{N_V} w_{i,k}^{V} \phi(f_i^V) - \sum_{j=1}^{N_I} w_{j,k}^{I} \phi(f_j^I) \Big\|_{\mathcal{H}}^{2}
wherein \mathcal{H} denotes the reproducing kernel Hilbert space, \phi(\cdot) denotes the mapping into the reproducing kernel Hilbert space, N_{id} denotes the total number of pedestrian identity classes, N_V denotes the number of visible light images in a training batch, f_i^V denotes the feature of the i-th visible light image, w_{i,k}^{V} denotes the probability weight that f_i^V belongs to class k, N_I denotes the number of infrared images in the training batch, f_j^I denotes the feature of the j-th infrared image, w_{j,k}^{I} denotes the probability weight that f_j^I belongs to class k, and \|a-b\|^2 denotes the squared distance between a and b.
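For illustration, a simplified PyTorch-style sketch of this class-weighted discrepancy term is given below, assuming a linear kernel so that \phi reduces to the identity map; the function name and the weight normalization are assumptions, and the actual implementation may use a different kernel.

```python
import torch

def lmmd_linear(feat_v, feat_i, prob_v, prob_i):
    """feat_v: [Nv, d] visible features; feat_i: [Ni, d] infrared features;
    prob_v: [Nv, K], prob_i: [Ni, K] class-probability weights."""
    # normalize the weights per class so each class-wise weighted sum acts as a mean
    w_v = prob_v / (prob_v.sum(dim=0, keepdim=True) + 1e-6)   # [Nv, K]
    w_i = prob_i / (prob_i.sum(dim=0, keepdim=True) + 1e-6)   # [Ni, K]
    mean_v = w_v.t() @ feat_v                                 # [K, d] per-class visible means
    mean_i = w_i.t() @ feat_i                                 # [K, d] per-class infrared means
    return ((mean_v - mean_i) ** 2).sum(dim=1).mean()         # squared distance averaged over classes
```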
(2) The inter-modality divergence loss is constructed from N_V and N_I, the numbers of visible light and infrared images in the training batch, the visible light and infrared image classifiers C_V and C_I, the feature f_i^V of the i-th visible light image, and the feature f_j^I of the j-th infrared image.
(3) The visible light features f^V, the infrared features f^I and the mixed-modality features f^H are supervised with the circle loss:
L_{circle} = \log\Big[1 + \sum_{j \neq y} \exp\big(\gamma\,\alpha_n^{j}(s_n^{j} - \Delta_n)\big)\,\exp\big(-\gamma\,\alpha_p(s_p - \Delta_p)\big)\Big]
wherein s_p = w_y^{\mathrm{T}} f_m / (\|w_y\|\,\|f_m\|), s_n^{j} = w_j^{\mathrm{T}} f_m / (\|w_j\|\,\|f_m\|), m \in \{V, I, H\}; s_p and s_n^{j} denote the within-class and between-class similarity scores, w_j and w_y denote the non-target and target classification weights, w_j^{\mathrm{T}} and w_y^{\mathrm{T}} denote the transposes of w_j and w_y, f_m denotes the image feature belonging to modality m, m is the modality to which the image feature belongs, V, I and H denote the visible light, infrared and mixed modalities respectively, \|\cdot\| denotes the L2 norm, \alpha_p and \alpha_n denote the first and second weight coefficients of the circle loss, and \Delta_n, \Delta_p and \gamma are the first, second and third hyperparameters, taking the values 0.5, 0.5 and 30 respectively.
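For illustration, a PyTorch-style sketch of a classification-form circle loss consistent with the symbols above is given below; it follows the standard circle loss formulation with a margin of 0.5 (so that \Delta_p = \Delta_n = 0.5) and \gamma = 30, and the exact weighting used in this embodiment may differ.

```python
import torch
import torch.nn.functional as F

def circle_loss(features, weights, labels, margin=0.5, gamma=30.0):
    """features: [N, d]; weights: [K, d] classifier weights; labels: [N] int64 identities.
    margin=0.5 gives delta_p = 1 - margin = 0.5 and delta_n = margin = 0.5 as in the text."""
    sim = F.normalize(features, dim=1) @ F.normalize(weights, dim=1).t()   # cosine scores [N, K]
    one_hot = F.one_hot(labels, num_classes=weights.size(0)).bool()
    s_p = sim[one_hot].view(-1, 1)                   # target-class similarity, [N, 1]
    s_n = sim[~one_hot].view(sim.size(0), -1)        # non-target similarities, [N, K-1]
    alpha_p = torch.clamp_min(1.0 + margin - s_p, 0.0)   # adaptive positive weight
    alpha_n = torch.clamp_min(s_n + margin, 0.0)         # adaptive negative weights
    delta_p, delta_n = 1.0 - margin, margin
    logit_p = -gamma * alpha_p * (s_p - delta_p)
    logit_n = gamma * alpha_n * (s_n - delta_n)
    # log(1 + sum_j exp(logit_n_j) * exp(logit_p)) = softplus(logsumexp(logit_n) + logit_p)
    return F.softplus(torch.logsumexp(logit_n, dim=1) + logit_p.squeeze(1)).mean()
```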
(4) The features of the three modalities are passed through a classification layer (namely a fully connected layer, the FC layer in Fig. 2) to obtain the class scores z^V, z^I and z^H, which are supervised with the cross-entropy loss; the loss function is:
L_{ce} = -\sum_{m \in \{V, I, H\}} \frac{1}{N_m} \sum_{i=1}^{N_m} \sum_{k=1}^{N_{id}} y_{i,k} \log \hat{y}_{i,k}^{m}
wherein N_V, N_I and N_H denote the numbers of visible light, infrared and mixed-modality images in the training batch, N_{id} denotes the total number of pedestrian identity classes, y_{i,k} denotes the true class score of the i-th image for the k-th pedestrian, \hat{y}_{i,k}^{m} denotes the predicted class score of the i-th image for the k-th pedestrian, m is the modality of the image feature, and V, I and H denote the visible light, infrared and mixed modalities respectively.
S5. The trained deep learning model identifies pedestrians in the visible light or infrared images to be recognized.
The specific flow of step S3 is as follows:
S3-1. The deep learning model of this embodiment adopts Swin Transformer V2 as the backbone network and includes a first stage (Stage 1), a mixed-modality generation module (HMG Module), a second stage (Stage 2), a third stage (Stage 3) and a fourth stage (Stage 4) connected in sequence, with a component detection and exchange module (CDS Module) further provided in the third stage (Stage 3). Swin Transformer V2 first partitions the received visible light and infrared images into patches through the Patch Partition module in Fig. 2. So that each patch fully perceives the features in its neighborhood and the model fully utilizes the image features, the model adopts overlapping partitioning, implemented as a two-dimensional convolution with stride 3 and kernel size 4 on the input image. The input image received by Swin Transformer V2 has size [288, 144, 3]; the convolutional partitioning yields 96 × 48 patches, each of dimension 128. The patches are then linearly embedded (the Linear Embedding module in the first stage), the 4608 patches are flattened to obtain image features of size [4608, 128], and these are fed into the Swin Transformer blocks of the four stages, whose numbers of Swin Transformer blocks are [2, 2, 18, 2] respectively. A sketch of the overlapping patch partition is given below.
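For illustration, the overlapping patch partition can be sketched as follows; the padding value is an assumption chosen so that a 288 × 144 input yields the 96 × 48 patches stated above.

```python
import torch
import torch.nn as nn

# overlapping patch partition: kernel 4, stride 3; padding=1 is an assumption chosen so
# that a [3, 288, 144] image yields 96 x 48 patch tokens of dimension 128
patch_embed = nn.Conv2d(in_channels=3, out_channels=128, kernel_size=4, stride=3, padding=1)

x = torch.randn(1, 3, 288, 144)             # one preprocessed image
tokens = patch_embed(x)                     # [1, 128, 96, 48]
tokens = tokens.flatten(2).transpose(1, 2)  # [1, 4608, 128] patch sequence for Stage 1
```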
In this embodiment, the Swin Transformer blocks consisting of W-MSA and SW-MSA always appear in pairs. Referring to Fig. 3, each paired Swin Transformer block includes a first sub-block and a second sub-block connected in sequence, wherein
the first sub-block includes a first normalization layer LN, a window-based multi-head self-attention layer W-MSA, a second normalization layer LN and a first multi-layer perceptron MLP connected in sequence, wherein the input of the first normalization layer LN and the output of the window-based multi-head self-attention layer W-MSA are added element-wise to form the input of the second normalization layer LN, and the input of the second normalization layer LN and the output of the first multi-layer perceptron MLP are added element-wise to form the input of the second sub-block;
the second sub-block includes a third normalization layer LN, a shifted-window multi-head self-attention layer SW-MSA, a fourth normalization layer LN and a second multi-layer perceptron MLP connected in sequence, wherein the input of the third normalization layer LN and the output of the shifted-window multi-head self-attention layer SW-MSA are added element-wise to form the input of the fourth normalization layer LN, and the input of the fourth normalization layer LN and the output of the second multi-layer perceptron MLP are added element-wise to form the output of the paired Swin Transformer block;
the first Swin Transformer block of the first stage is preceded by a Linear Embedding module, which converts the image into the patches required for the Swin Transformer block operations (that is, the patches obtained by the partitioning are mapped and flattened into a new dimension convenient for the subsequent Swin Transformer blocks), and the first Swin Transformer block of each of the second, third and fourth stages is preceded by a Patch Merging module, which reduces the number of patches while increasing their dimension so as to form a hierarchical feature representation. Since the Linear Embedding module and the Patch Merging module belong to the prior art, they are not described further in this embodiment.
The multi-layer perceptron MLP consists of two fully connected layers with a GELU nonlinear activation function between them.
Further, the Swin Transformer blocks consisting of W-MSA and SW-MSA always appear in pairs and are divided into two sub-blocks, a first sub-block and a second sub-block, through which relations between patches of different regions are established. The image features output by a paired Swin Transformer block with its two sub-blocks are calculated as follows:
\hat{z}_1^{m} = \mathrm{WMSA}(\mathrm{LN}(z_0^{m})) + z_0^{m}
z_1^{m} = \mathrm{MLP}(\mathrm{LN}(\hat{z}_1^{m})) + \hat{z}_1^{m}
\hat{z}_2^{m} = \mathrm{SWMSA}(\mathrm{LN}(z_1^{m})) + z_1^{m}
z_2^{m} = \mathrm{MLP}(\mathrm{LN}(\hat{z}_2^{m})) + \hat{z}_2^{m}
wherein z_0^{m} is the input of the paired block, \hat{z}_1^{m} and \hat{z}_2^{m} are respectively the features output by the multi-head self-attention layer of the first sub-block and of the second sub-block, z_1^{m} and z_2^{m} are respectively the features output by the multi-layer perceptron of the first sub-block and of the second sub-block, WMSA(·) is the window-based multi-head self-attention layer, MLP(·) is the multi-layer perceptron, SWMSA(·) is the shifted-window multi-head self-attention layer, m is the modality to which the image features belong, and V, I and H denote the visible light, infrared and mixed-modality features respectively. To produce a hierarchical representation, Patch Merging layers between the Swin stages reduce the number of patches while increasing their dimension: assuming each patch has dimension C, the Patch Merging layer concatenates the features of each group of 2 × 2 adjacent patches and uses a linear layer to reduce the concatenated features of dimension 4C to dimension 2C.
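For illustration, a structural PyTorch-style sketch of one paired block with the pre-norm residual layout described above is given below; nn.MultiheadAttention is used only as a stand-in for the window-based and shifted-window attention, whose window partitioning logic is omitted, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class SwinSubBlock(nn.Module):
    """Pre-norm residual sub-block: LN -> attention -> residual, LN -> MLP -> residual."""
    def __init__(self, dim=128, heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # stand-in for (S)W-MSA
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim),
                                 nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))

    def forward(self, x):                    # x: [B, N_patches, dim]
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]        # residual over the attention branch
        x = x + self.mlp(self.norm2(x))      # residual over the MLP branch
        return x

class PairedSwinBlock(nn.Module):
    """First sub-block uses W-MSA, second uses SW-MSA (structurally identical here)."""
    def __init__(self, dim=128):
        super().__init__()
        self.block_w = SwinSubBlock(dim)
        self.block_sw = SwinSubBlock(dim)

    def forward(self, x):
        return self.block_sw(self.block_w(x))
```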
Referring to Fig. 2, after the Swin Transformer blocks of the first stage (Stage 1), the output visible light and infrared modality features pass through the mixed-modality generation module (HMG Module) to form three branches, which respectively output the visible light features, the mixed-modality image features and the infrared image features; the features of the three modalities are then sent respectively to the Swin Transformer blocks of the last three stages (Stages 2, 3 and 4).
Referring to Fig. 2, the global image features corresponding to the three modalities (G_V, G_H, G_I in Fig. 2) output by the last Swin Transformer block of the third stage (Stage 3), together with the features output by the first and second stages (Stages 1 and 2), are sent to the component detection and exchange module (CDS Module), which extracts the corresponding part-local features P_V, P_H, P_I and outputs them together with the global image features G_V, G_H, G_I. The global image features (G_V, G_H, G_I) and part-local features (P_V, P_H, P_I) of the three modalities are sent respectively to the Swin Transformer blocks of the last stage (Stage 4), and finally feature concatenation is performed (the "C" after the fourth stage in Fig. 2, meaning concatenate), splicing the respective global image features and part-local features along the patch dimension; the final feature outputs f_V, f_I and f_H are obtained through an average pooling layer (POOL) and a batch normalization layer (BN).
S3-2. Referring to Fig. 4, the mixed-modality generation module (HMG Module) is specified as follows:
For each pedestrian in a training batch, the 4 visible light images and 4 infrared images are combined and 16 mixed-modality images are generated through the mixed-modality generation module (HMG Module) (one visible light image can be combined with each of the 4 infrared images to generate mixed-modality images, so 4 visible light images and 4 infrared images generate 16 images). The specific generation flow is as follows. First, the visible light image features and infrared image features are added to their respective modality compensation embeddings and restored to the original image proportions. For each visible light image feature x_V and infrared image feature x_I output by the first stage (Stage 1), in order to capture and generate image information at different scales, x_V is convolved by first dilated convolutions with dilation rates of 1, 2 and 4 respectively, the results are added and averaged, and the generated features of the visible light part are obtained through a fully connected layer and a ReLU activation function. A similar operation is performed on x_I: the generated features of the infrared part are obtained through second dilated convolutions with the same dilation rates of 1, 2 and 4 but with parameters different from those of the first dilated convolutions, followed by a fully connected layer and a ReLU activation function. In addition, in order to extract scene-level context information common to the infrared and visible light images and generate background features, x_V and x_I are passed through the same adaptive average pooling and 1 × 1 convolution and their pixel-wise dot product is taken to obtain the common background features, which are then resized to the same size as the image features output by the first stage (Stage 1) through bilinear interpolation, and the global background generated features are obtained through a ReLU activation. Finally, the generated features of the several parts are fused: the generated features of the visible light part, the generated features of the infrared part and the global background generated features are added and averaged to obtain the final mixed-modality image features. In order to keep the generated mixed-modality image features consistent, this embodiment introduces a modality consistency loss into the total loss,
wherein L_2(\cdot) denotes L2 regularization, f_t^H and f_s^H denote the t-th and s-th generated image features of the mixed modality, \|\cdot\|_2 denotes the Euclidean distance, and n is the number of mixed-modality features of the same pedestrian.
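For illustration, a simplified PyTorch-style sketch of this generation flow is given below; the channel width, the pooling size and the placement of the fully connected layers are assumptions not specified above, and HybridModalityGenerator is a hypothetical class name.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridModalityGenerator(nn.Module):
    """Generates a mixed-modality feature map from Stage-1 visible and infrared features."""
    def __init__(self, channels=128):
        super().__init__()
        # dilated (atrous) convolutions with rates 1, 2, 4 for each modality branch
        self.dil_v = nn.ModuleList([nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
                                    for d in (1, 2, 4)])
        self.dil_i = nn.ModuleList([nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
                                    for d in (1, 2, 4)])
        self.fc_v = nn.Linear(channels, channels)
        self.fc_i = nn.Linear(channels, channels)
        self.bg_conv = nn.Conv2d(channels, channels, 1)   # shared 1x1 conv on pooled features

    def forward(self, x_v, x_i):            # x_v, x_i: [B, C, H, W]
        b, c, h, w = x_v.shape
        # visible branch: average of dilated convolutions -> FC -> ReLU
        v = sum(conv(x_v) for conv in self.dil_v) / 3
        v = F.relu(self.fc_v(v.permute(0, 2, 3, 1))).permute(0, 3, 1, 2)
        # infrared branch: same structure with separate parameters
        i = sum(conv(x_i) for conv in self.dil_i) / 3
        i = F.relu(self.fc_i(i.permute(0, 2, 3, 1))).permute(0, 3, 1, 2)
        # shared background: adaptive pooling, 1x1 conv, pixel-wise product, upsample, ReLU
        pv = self.bg_conv(F.adaptive_avg_pool2d(x_v, (h // 4, w // 4)))
        pi = self.bg_conv(F.adaptive_avg_pool2d(x_i, (h // 4, w // 4)))
        bg = F.relu(F.interpolate(pv * pi, size=(h, w), mode='bilinear', align_corners=False))
        # final mixed-modality feature: average of the three generated features
        return (v + i + bg) / 3
```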
For the 16 generated mixed-modality image features, prediction labels are obtained through the subsequent PatchGAN discriminator, and the 4 mixed-modality image features closest to the mixed-modality label are selected as the final outputs (4 mixed-modality image features are selected here so that their number equals the number of visible light / infrared images originally input into the deep learning model). It should be noted that the mixed-modality generation module (HMG Module) generates not only the final mixed-modality image features, but also forged mixed-modality features based on the visible light image features x_V and forged mixed-modality features based on the infrared image features x_I; in short, the mixed-modality generation module (HMG Module) generates the features of the three branches, visible light, mixed modality and infrared, and inputs the generated features of the three branches into the second stage (Stage 2) respectively.
The model then performs feature alignment between the visible-mixed modalities and between the infrared-mixed modalities through adversarial learning. The core of the alignment part is a generator (Generator) and a discriminator D, and the modality differences between the visible light and infrared image features and the mixed-modality features are reduced through the mutual game between the generator and the discriminator. The generator aims to make the mixed-modality features generated from the other modalities as close as possible to the actual mixed-modality features, while the discriminator D aims to distinguish the mixed-modality features from the features of the other two modalities. The visible light and infrared features, each added to its modality compensation embedding (Visible Compensation Embedding and Infrared Compensation Embedding in Fig. 4, two embedding vectors with the same length and width as the original input image and a dimension of 1), are fed to the generators, which are divided into a visible-to-mixed-modality generator G_V-H and an infrared-to-mixed-modality generator G_I-H; each generator is a cascade of three fully connected layers with ReLU activation functions between them to increase its nonlinearity, and forged features are obtained after passing through the respective generators. The discriminators D (D_V-H and D_I-H in Fig. 4) use the classical PatchGAN discriminator structure. The mixed-modality features forged from the visible light image features x_V or infrared image features x_I and the mixed-modality image features generated by the mixed-modality generation module are input into the corresponding discriminator D, and adversarial losses with respect to the discriminators are constructed in the total loss function. The adversarial losses are used to judge whether the forged mixed-modality features are the same as the mixed-modality image features generated by the mixed-modality generation module (when the adversarial loss is smaller than a preset value they are regarded as the same, indicating that the actually generated mixed-modality image features are effective; when it is larger than the preset value they are regarded as different, indicating that the actually generated mixed-modality image features are not ideal), thereby achieving feature alignment. The adversarial losses include the discriminator loss, the generator loss and the joint adversarial loss, with the formula:
L_GAN = L_D + L_G;
wherein L_D is the discriminator loss, L_G is the generator loss, L_GAN is the joint adversarial loss, N_V is the number of visible light images in the training batch, N_I is the number of infrared images in the training batch, N_H is the number of mixed-modality images in the training batch, D(\cdot) is the discriminator, x_i^V is the i-th visible light image, x_j^I is the j-th infrared image, x_k^H is the k-th mixed-modality image, y_V is the label of a visible light image, y_I is the label of an infrared image, y_H is the label of the mixed modality, G_{V-H} is the visible-to-mixed-modality generator, and G_{I-H} is the infrared-to-mixed-modality generator.
Finally, the aligned infrared, visible light and mixed-modality image features are sent to the subsequent stages of the model for processing. A structural sketch of the generator and discriminator is given below.
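For illustration, the generator and discriminator structures can be sketched as follows; the layer widths and class names are assumptions, and the exact forms of L_D and L_G are not reproduced here.

```python
import torch
import torch.nn as nn

class ModalityGenerator(nn.Module):
    """Three cascaded fully connected layers with ReLU in between (e.g. G_V-H or G_I-H)."""
    def __init__(self, dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(inplace=True),
            nn.Linear(dim, dim), nn.ReLU(inplace=True),
            nn.Linear(dim, dim),
        )

    def forward(self, x):                    # x: modality feature plus compensation embedding
        return self.net(x)                   # forged mixed-modality feature

class PatchDiscriminator(nn.Module):
    """A PatchGAN-style discriminator producing per-patch real/fake scores."""
    def __init__(self, channels=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 1, 4, stride=1, padding=1),
        )

    def forward(self, x):                    # x: [B, C, H, W] feature map
        return self.net(x)
```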
S3-3. Referring to Fig. 5, the component detection and exchange module (CDS Module) is specified as follows:
A component detection and exchange module is further provided after the last Swin Transformer block of the third stage. The global image features G_V, G_H, G_I corresponding to the three modalities output by the last Swin Transformer block of the third stage, together with the features output by the first and second stages, are sent to the body part detector (Component Detector) to obtain initial part-local features composed of a plurality of body part-local features. The initial part-local features pass through a part prediction network (Predictor Network), which consists of a fully connected layer and a Sigmoid function and predicts the score of each body part-local feature among the initial part-local features, and the body part-local features in the initial part-local features are replaced according to the predicted scores. Since the generated initial part-local features inevitably contain redundancy, the body part-local features corresponding to the mixed modality are progressively replaced with the part-local features detected in the infrared and visible light modalities through the Swapping module in Fig. 5: if a body part-local feature score of the mixed modality is smaller than the corresponding scores of the other two modalities, the body part-local feature of the mixed modality is replaced with the body part-local feature having the higher score from the other two modalities, so that the overlapping and redundant body part-local features of the mixed modality are replaced and the part-local features of the mixed modality become purer. It should be noted that the initial part-local features corresponding to the infrared and visible light branches are identical to the final part-local features P_I, P_V; only the initial part-local features corresponding to the mixed-modality branch are changed.
Specifically, the Component Detector passes the features output by the first and second stages, as well as the global image features (G_V, G_H, G_I) of the visible-light, mixed, and infrared modalities output by the last Swin Transformer block of the third stage, each through a 1×1 convolutional neural network (CNN) and a Sigmoid function to obtain attention masks at different levels. The attention masks are resized to the same size and multiplied pixel by pixel to obtain the final Mask, which is then multiplied with the three modality features (G_V, G_H, G_I) output by the last Swin Transformer block of the third stage, and the results are finally concatenated along the channel dimension to obtain the initial part-level features.
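A compact PyTorch-style sketch of the mask construction, part scoring, and score-based swapping described above is given below. The number of parts, tensor shapes, the use of bilinear interpolation when resizing the masks, and the flattening of masked features are illustrative assumptions; the detector would be applied once per modality (visible, mixed, infrared) to obtain the three sets of initial part-level features.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ComponentDetector(nn.Module):
        # Builds per-level attention masks with a 1x1 convolution and Sigmoid, fuses them by
        # pixel-wise multiplication, and applies the fused mask to the stage-3 global features.
        def __init__(self, channels_per_level, num_parts=6):
            super().__init__()
            self.mask_heads = nn.ModuleList(
                [nn.Conv2d(c, num_parts, kernel_size=1) for c in channels_per_level]
            )

        def forward(self, level_feats, global_feat):
            # level_feats: list of (B, C_l, H_l, W_l); global_feat: (B, C, H, W)
            H, W = global_feat.shape[-2:]
            fused = None
            for head, feat in zip(self.mask_heads, level_feats):
                m = torch.sigmoid(head(feat))                                  # (B, P, H_l, W_l)
                m = F.interpolate(m, size=(H, W), mode="bilinear", align_corners=False)
                fused = m if fused is None else fused * m                      # pixel-wise product
            parts = [(global_feat * fused[:, p:p + 1]).flatten(1)              # mask each part
                     for p in range(fused.shape[1])]
            return torch.stack(parts, dim=1)                                   # (B, P, C*H*W)

    class PartPredictor(nn.Module):
        # One fully connected layer followed by Sigmoid: a score per body-part feature.
        def __init__(self, dim):
            super().__init__()
            self.fc = nn.Linear(dim, 1)

        def forward(self, parts):                                              # parts: (B, P, dim)
            return torch.sigmoid(self.fc(parts)).squeeze(-1)                   # (B, P)

    def swap_redundant_parts(p_h, p_v, p_i, s_h, s_v, s_i):
        # Replace a mixed-modality part feature whenever its score is lower than both
        # single-modality scores, using the higher-scoring of the two as the substitute.
        donor = torch.where((s_v >= s_i).unsqueeze(-1), p_v, p_i)              # (B, P, dim)
        replace = ((s_h < s_v) & (s_h < s_i)).unsqueeze(-1)                    # (B, P, 1)
        return torch.where(replace, donor, p_h)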
A loss on the body-part features is constructed in the total loss function; the formula is as follows:
where N denotes the total number of images in the training batch, N_c denotes the number of body-part features, N_id is the total number of pedestrian identity classes, y_{i,k} denotes the true class score of the i-th image belonging to the k-th pedestrian, and z_{i,j,k} denotes the predicted class score of the j-th body-part feature of the i-th image belonging to the k-th pedestrian.
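The display equation for this loss is likewise not reproduced in this text. A plausible reading consistent with the symbol definitions above, namely a cross-entropy averaged over the images and their body-part features, is given here only as an illustrative reconstruction:

    % Illustrative reconstruction only; not the patent's exact equation.
    L_{\text{part}} \;=\; -\frac{1}{N\,N_c}\sum_{i=1}^{N}\sum_{j=1}^{N_c}\sum_{k=1}^{N_{id}} y_{i,k}\,\log z_{i,j,k}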
The experimental analysis is as follows:
1. Implementation details
The present embodiment implements the constructed deep learning model on one NVIDIA GTX 4090D GPU based on the PyTorch framework. The backbone network uses Swin Transformer V2 pre-trained on ImageNet-22K, with the last classification layer of Swin Transformer V2 removed so that it directly outputs a 1024-dimensional feature. During testing, the present embodiment uses cosine similarity to measure the distance between pedestrian features in the query set and in the gallery set.
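For the cosine-similarity comparison between query and gallery features, a minimal sketch (assuming the 1024-dimensional features have already been extracted as row vectors) might look like this:

    import torch
    import torch.nn.functional as F

    def rank_gallery(query_feats, gallery_feats):
        # query_feats: (Q, D), gallery_feats: (G, D); returns, for every query, the gallery
        # indices sorted from most to least similar under cosine similarity.
        q = F.normalize(query_feats, dim=1)
        g = F.normalize(gallery_feats, dim=1)
        sim = q @ g.t()                                  # (Q, G) cosine-similarity matrix
        return sim.argsort(dim=1, descending=True)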
2. Sampling strategy
During training, the present embodiment randomly samples 4 pedestrians per batch and extracts 4 visible-light images and 4 infrared images for each pedestrian to form a mini-batch of training samples. Before being fed into the model for training, all images are resized to 288×144×3 and data augmentation is applied: the visible-light images undergo random cropping, random horizontal flipping, and random erasing, plus one augmentation chosen at random from channel random erasing, random channel exchange, color jitter, and spectrum jitter; the infrared images undergo only random cropping, random horizontal flipping, and random erasing. Finally, the images are normalized with a mean of [0.485, 0.456, 0.406] and a standard deviation of [0.229, 0.224, 0.225], which reduces the risk of overfitting after training.
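The augmentation pipeline above can be sketched with standard torchvision transforms as follows. The padding amount before random cropping and the erasing probability are assumptions of this sketch, and the channel-level augmentations (channel random erasing, random channel exchange, color jitter, spectrum jitter) are custom operations in the original that are only stubbed here as user-supplied callables.

    import random
    import torchvision.transforms as T

    MEAN, STD = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]

    def visible_transform(channel_level_augs):
        # channel_level_augs: list of callables implementing channel random erasing, random
        # channel exchange, color jitter and spectrum jitter (custom ops, assumed to exist).
        return T.Compose([
            T.Resize((288, 144)),
            T.Pad(10),
            T.RandomCrop((288, 144)),
            T.RandomHorizontalFlip(),
            T.Lambda(lambda img: random.choice(channel_level_augs)(img)),  # pick one at random
            T.ToTensor(),
            T.Normalize(MEAN, STD),
            T.RandomErasing(p=0.5),
        ])

    def infrared_transform():
        return T.Compose([
            T.Resize((288, 144)),
            T.Pad(10),
            T.RandomCrop((288, 144)),
            T.RandomHorizontalFlip(),
            T.ToTensor(),
            T.Normalize(MEAN, STD),
            T.RandomErasing(p=0.5),
        ])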
During testing, the present embodiment uses a batch size of 64 to extract the features of the query set and the gallery set, respectively. The visible-light and infrared images are resized to 288×144×3 before being fed into the model, and are then normalized with a mean of [0.485, 0.456, 0.406] and a standard deviation of [0.229, 0.224, 0.225].
3. Training parameter settings
The model is trained for a total of 40 epochs. Before training begins, all parameters of the network model are randomly initialized with Kaiming initialization. During training, an SGD optimizer is used to update the parameters, with a momentum of 0.9 and a weight-decay factor of 0.001, and an initial learning rate is set for gradient descent. A learning-rate warmup is adopted, with 5 warmup iterations and a warmup factor of 0.01, and the warmup learning rate increases linearly. Meanwhile, following a cosine decay schedule, the learning rate finally decays to 0.002 times the initial learning rate.
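A sketch of the optimizer and learning-rate schedule described above follows. The base learning rate is passed in as base_lr because its value is not reproduced in this text, and whether the warmup unit is an epoch or an iteration follows whichever granularity the scheduler is stepped at; both points are assumptions of this sketch.

    import math
    import torch

    def build_optimizer_and_scheduler(model, base_lr, total_steps=40,
                                      warmup_steps=5, warmup_factor=0.01, final_ratio=0.002):
        # SGD with momentum 0.9 and weight decay 0.001, linear warmup over `warmup_steps`
        # scheduler steps starting from warmup_factor * base_lr, then cosine decay down
        # to final_ratio * base_lr.
        optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                    momentum=0.9, weight_decay=1e-3)

        def lr_lambda(step):
            if step < warmup_steps:                                   # linear warmup
                alpha = step / warmup_steps
                return warmup_factor * (1 - alpha) + alpha
            progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
            cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
            return final_ratio + (1.0 - final_ratio) * cosine         # decays to final_ratio

        scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
        return optimizer, scheduler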
4. Comparison of Performance with common models
The deep learning model constructed in this embodiment is compared with current state-of-the-art models on the SYSU-MM01 dataset. No post-processing techniques such as re-ranking or fusion ranking are used during pedestrian re-identification, and Table 1 readily shows that the deep learning model constructed in this embodiment performs well.
Table 1 model comparison table
Example two
This embodiment provides a cross-modal pedestrian re-identification system, which comprises:
a selection module, used for randomly selecting a number of pedestrians from the infrared and visible light dataset in each training batch, wherein each pedestrian comprises 4 visible-light images and 4 infrared images;
a preprocessing module, used for preprocessing the visible-light images and infrared images;
a feature extraction module, used for inputting the preprocessed visible-light images and infrared images into a deep learning model and extracting visible-light features, infrared features, and mixed-modality features through the deep learning model;
a construction module, used for constructing the total loss and judging from it whether training of the deep learning model is finished: if the total loss is below a preset threshold, training ends; if the total loss is above the preset threshold, training continues until the total loss falls below the preset threshold, wherein
the total loss includes a local maximum average difference loss constructed based on the visible-light features and infrared features, an inter-modal divergence loss constructed based on the visible-light features and infrared features, a circle loss constructed based on the visible-light features, infrared features, and mixed-modality features, and a cross-entropy loss constructed based on the visible-light features, infrared features, and mixed-modality features;
and an identification module, used for identifying pedestrians from the visible-light images or infrared images to be identified with the trained deep learning model.
Example III
The present embodiment provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the steps of the cross-modal pedestrian re-identification method of embodiment one are implemented when the processor executes the computer program.
Example IV
The present embodiment provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the cross-modality pedestrian re-identification method of embodiment one.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The scheme in the embodiments of the present application can be realized in various computer languages, such as the object-oriented programming language Java and the scripting language JavaScript.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It is apparent that the above examples are given by way of illustration only and are not limiting of the embodiments. Other variations and modifications of the present invention will be apparent to those of ordinary skill in the art in light of the foregoing description. It is not necessary here nor is it exhaustive of all embodiments. And obvious variations or modifications thereof are contemplated as falling within the scope of the present invention.

Claims (10)

1. A cross-modal pedestrian re-identification method, characterized by comprising:
step S1: in each training batch, randomly selecting a number of pedestrians from an infrared and visible light dataset, each pedestrian comprising 4 visible-light images and 4 infrared images;
step S2: preprocessing the visible-light images and the infrared images;
step S3: inputting the preprocessed visible-light images and infrared images into a deep learning model, and extracting visible-light features, infrared features, and mixed-modality features through the deep learning model;
step S4: constructing a total loss and judging from the total loss whether training of the deep learning model is finished; if the total loss is below a preset threshold, training ends; if the total loss is above the preset threshold, training continues until the total loss falls below the preset threshold, wherein
the total loss includes a local maximum average difference loss constructed based on the visible-light features and infrared features, an inter-modal divergence loss constructed based on the visible-light features and infrared features, a circle loss constructed based on the visible-light features, infrared features, and mixed-modality features, and a cross-entropy loss constructed based on the visible-light features, infrared features, and mixed-modality features;
step S5: using the trained deep learning model to identify pedestrians from the visible-light images or infrared images to be identified.

2. The cross-modal pedestrian re-identification method according to claim 1, characterized in that the formula of the local maximum average difference loss constructed based on the visible-light features and infrared features in step S4 is:
wherein H denotes the reproducing kernel Hilbert space, φ(·) denotes the mapping into the reproducing kernel Hilbert space, N_id denotes the total number of pedestrian identity classes, N_V denotes the number of visible-light images in the training batch, x_i^V denotes the feature of the i-th visible-light image with a corresponding probability weight of belonging to class k, N_I denotes the number of infrared images in the training batch, x_j^I denotes the feature of the j-th infrared image with a corresponding probability weight of belonging to class k, and ||a-b||^2 denotes the squared distance between a and b.

3. The cross-modal pedestrian re-identification method according to claim 1, characterized in that the formula of the inter-modal divergence loss constructed based on the visible-light features and infrared features in step S4 is:
wherein N_V and N_I denote the numbers of visible-light images and infrared images in the training batch, respectively, C_V and C_I denote the visible-light image classifier and the infrared image classifier, respectively, x_j^I denotes the feature of the j-th infrared image, and x_i^V denotes the feature of the i-th visible-light image.

4. The cross-modal pedestrian re-identification method according to claim 1, characterized in that the formula of the circle loss constructed based on the visible-light features, infrared features, and mixed-modality features in step S4 is:
wherein s_p = w_y f_m / (||w_y|| ||f_m||), m ∈ {V, I, H}; s_p and s_n denote the intra-class similarity score and the inter-class similarity score, respectively; w_j and w_y denote the classification weight of a non-target class and the classification weight of the target class, respectively, and their transposes are used in the formula; f_m denotes an image feature belonging to modality m, where m is the modality to which the image feature belongs and V, I, H denote the visible-light, infrared, and mixed modalities, respectively; ||·|| denotes the L2 norm; the first and second weight coefficients of the circle loss appear in the formula; and Δ_n, Δ_p, and γ are the first, second, and third hyperparameters, respectively.

5. The cross-modal pedestrian re-identification method according to claim 1, characterized in that the cross-entropy loss constructed based on the visible-light features, infrared features, and mixed-modality features in step S4 is specifically obtained by passing the visible-light features, infrared features, and mixed-modality features through a classification layer to obtain their respective class scores and constructing the cross-entropy loss from these class scores, with the formula:
wherein N_V, N_I, and N_H denote the numbers of visible-light images, infrared images, and mixed-modality images in the training batch, respectively; N_id denotes the total number of pedestrian identity classes; y_{i,k} denotes the true class score of the i-th image belonging to the k-th pedestrian; the predicted class score of the i-th image belonging to the k-th pedestrian also appears in the formula; m is the modality to which an image feature belongs; and V, I, H denote the visible-light, infrared, and mixed modalities, respectively.

6. The cross-modal pedestrian re-identification method according to claim 1, characterized in that the deep learning model in step S3 comprises a first stage, a mixed-modality generation module, a second stage, a third stage, and a fourth stage connected in sequence, each stage comprising at least one Swin Transformer block that appears in pairs, each paired Swin Transformer block comprising a first sub-block and a second sub-block connected in sequence, wherein
the first sub-block comprises a first normalization layer LN, a window-based multi-head self-attention layer W-MSA, a second normalization layer LN, and a first multi-layer perceptron MLP connected in sequence; the input of the first normalization layer LN and the output of the window-based multi-head self-attention layer W-MSA are added element-wise as the input of the second normalization layer LN, and the input of the second normalization layer LN and the output of the first multi-layer perceptron MLP are added element-wise as the input of the second sub-block;
the second sub-block comprises a third normalization layer LN, a shifted-window-based multi-head self-attention layer SW-MSA, a fourth normalization layer LN, and a second multi-layer perceptron MLP connected in sequence; the input of the third normalization layer LN and the output of the shifted-window-based multi-head self-attention layer SW-MSA are added element-wise as the input of the fourth normalization layer LN, and the input of the fourth normalization layer LN and the output of the second multi-layer perceptron MLP are added element-wise as the output of the paired Swin Transformer block;
a Linear Embedding module is arranged before the first Swin Transformer block of the first stage and is used to convert the image into the patches required by the Swin Transformer block operations; a Patch Merging module is arranged before the first Swin Transformer block of each of the second, third, and fourth stages and is used to reduce the number of patches while increasing their dimensionality, so as to form a hierarchical representation of features.

7. The cross-modal pedestrian re-identification method according to claim 6, characterized in that obtaining the mixed-modality image features from the visible-light image features and infrared image features output by the first stage through the mixed-modality generation module comprises:
letting the feature of each visible-light image output by the first stage be x_V and the feature of each infrared image be x_I;
passing x_V through first dilated convolutions with dilation rates of 1, 2, and 4, adding and averaging the results, and then passing them through a fully connected layer and a ReLU activation function to obtain the generated features of the visible-light part;
passing x_I through second dilated convolutions with dilation rates of 1, 2, and 4 and then through a fully connected layer and a ReLU activation function to obtain the generated features of the infrared part;
in order to extract scene-level context information common to the infrared and visible-light images and to generate background features, passing x_V and x_I through the same adaptive average pooling and 1×1 convolution and taking their pixel-wise dot product to obtain common background features, resizing these through bilinear interpolation to the same size as the image features output by the first stage, and applying a ReLU activation function to obtain global background generated features;
finally adding and averaging the generated features of the visible-light part, the generated features of the infrared part, and the global background generated features to obtain the final mixed-modality image features;
in order to keep the final mixed-modality features consistent, a modality consistency loss is introduced into the total loss, with the formula:
wherein l2(·) denotes L2 regularization, f_t^H denotes the t-th generated mixed-modality image, f_s^H denotes the s-th generated mixed-modality image, ||·||_2 denotes the Euclidean distance, and n is the number of mixed-modality images of the same pedestrian.

8. The cross-modal pedestrian re-identification method according to claim 7, characterized in that the mixed-modality generation module further performs, through adversarial learning, feature alignment between the visible-light image features x_V output by the first stage and the mixed-modality image features generated by the mixed-modality generation module, and between the infrared image features x_I output by the first stage and the mixed-modality image features generated by the mixed-modality generation module, comprising:
reducing, through a generator and a discriminator, the modality differences between the visible-light image features x_V and the mixed-modality image features generated by the mixed-modality generation module, and between the infrared image features x_I and the mixed-modality image features generated by the mixed-modality generation module, wherein
the generator consists of three cascaded fully connected layers with a ReLU activation function between the fully connected layers to increase the nonlinearity of the generator; after the visible-light image features x_V and the infrared image features x_I output by the first stage are combined with compensation embedding vectors having the same length and width as the original input image and a dimension of 1, forged mixed-modality features are obtained through the respective generators;
the discriminator is a PatchGAN discriminator; the mixed-modality features forged from the visible-light image features x_V or the infrared image features x_I and the mixed-modality image features generated by the mixed-modality generation module are input into the discriminator, an adversarial loss on the discriminator is constructed in the total loss function, and the adversarial loss is used to judge whether the forged mixed-modality features are the same as the mixed-modality image features generated by the mixed-modality generation module, so as to achieve feature alignment; the adversarial loss includes a discriminator loss, a generator loss, and a joint adversarial loss, with the formula:
L_GAN = L_D + L_G;
wherein L_D is the discriminator loss, L_G is the generator loss, L_GAN is the joint adversarial loss, N_V is the number of visible-light images in the training batch, N_I is the number of infrared images in the training batch, N_H is the number of mixed-modality images in the training batch, D(·) is the discriminator, x_i^V is the i-th visible-light image, x_j^I is the j-th infrared image, x_k^H is the k-th mixed-modality image, y_V is the label of the visible-light image, y_I is the label of the infrared image, y_H is the label of the mixed modality, G_V-H is the visible-mixed modality generator, and G_I-H is the infrared-mixed modality generator.

9. The cross-modal pedestrian re-identification method according to claim 7, characterized in that a component detection and swap module is further arranged after the last Swin Transformer block of the third stage, the component detection and swap module comprising:
sending the global image features G_V, G_H, G_I corresponding to the visible-light, mixed, and infrared modalities output by the last Swin Transformer block of the third stage, together with the features output by the first and second stages, to a body part detector (Component Detector) to obtain initial part-level features composed of several body-part features; passing the initial part-level features through a part prediction network (Predictor Network) consisting of a fully connected layer and a Sigmoid function and used to predict a score for each body-part feature in the initial part-level features; and replacing body-part features in the initial part-level features according to the predicted scores: if the score of a body-part feature of the mixed modality is smaller than the scores of the body-part features of the other two modalities, the highest-scoring body-part feature of the other two modalities replaces the mixed-modality body-part feature, so that overlapping and redundant body-part features in the mixed modality are replaced, yielding the part-level features P_V, P_H, P_I, wherein
the Component Detector passes the features output by the first and second stages and the features output by the last Swin Transformer block of the third stage each through a 1×1 convolutional neural network and a Sigmoid function to obtain attention masks at different levels, resizes the attention masks to the same size and multiplies them pixel by pixel to obtain a final mask, multiplies the final mask with the global image features G_V, G_H, G_I corresponding to the visible-light, mixed, and infrared modalities output by the last Swin Transformer block of the third stage, respectively, and finally concatenates the results along the channel dimension to obtain the initial part-level features;
a loss on the body-part features is constructed in the total loss function, with the formula:
wherein N denotes the total number of images in the training batch, N_c denotes the number of body-part features, N_id is the total number of pedestrian identity classes, y_{i,k} denotes the true class score of the i-th image belonging to the k-th pedestrian, and z_{i,j,k} denotes the predicted class score of the j-th body-part feature of the i-th image belonging to the k-th pedestrian.

10. A cross-modal pedestrian re-identification system, characterized by comprising:
a selection module, used for randomly selecting a number of pedestrians from an infrared and visible light dataset in each training batch, each pedestrian comprising 4 visible-light images and 4 infrared images;
a preprocessing module, used for preprocessing the visible-light images and infrared images;
a feature extraction module, used for inputting the preprocessed visible-light images and infrared images into a deep learning model and extracting visible-light features, infrared features, and mixed-modality features through the deep learning model;
a construction module, used for constructing a total loss and judging from the total loss whether training of the deep learning model is finished; if the total loss is below a preset threshold, training ends; if the total loss is above the preset threshold, training continues until the total loss falls below the preset threshold, wherein
the total loss includes a local maximum average difference loss constructed based on the visible-light features and infrared features, an inter-modal divergence loss constructed based on the visible-light features and infrared features, a circle loss constructed based on the visible-light features, infrared features, and mixed-modality features, and a cross-entropy loss constructed based on the visible-light features, infrared features, and mixed-modality features; and
an identification module, used for identifying pedestrians from the visible-light images or infrared images to be identified with the trained deep learning model.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411690034.8A CN119600653B (en) 2024-11-25 2024-11-25 A cross-modal person re-identification method and system


Publications (2)

Publication Number Publication Date
CN119600653A true CN119600653A (en) 2025-03-11
CN119600653B CN119600653B (en) 2025-10-14

Family

ID=94843393


Country Status (1)

Country Link
CN (1) CN119600653B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120071051A (en) * 2025-04-28 2025-05-30 北京航空航天大学杭州创新研究院 Method and device for generating countermeasure sample for intelligent driving perception test

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113989851A (en) * 2021-11-10 2022-01-28 合肥工业大学 A Cross-modal Person Re-identification Method Based on Heterogeneous Fusion Graph Convolutional Networks
CN114220124A (en) * 2021-12-16 2022-03-22 华南农业大学 Near-infrared-visible light cross-modal double-flow pedestrian re-identification method and system
CN114495010A (en) * 2022-02-14 2022-05-13 广东工业大学 A cross-modal pedestrian re-identification method and system based on multi-feature learning
CN115359554A (en) * 2022-08-08 2022-11-18 天津师范大学 A Cross-modal Pedestrian Retrieval Method for Autonomous Driving
US20230162023A1 (en) * 2021-11-25 2023-05-25 Mitsubishi Electric Research Laboratories, Inc. System and Method for Automated Transfer Learning with Domain Disentanglement
CN116958584A (en) * 2023-09-21 2023-10-27 腾讯科技(深圳)有限公司 Key point detection method, regression model training method and device and electronic equipment
US11804060B1 (en) * 2021-07-26 2023-10-31 Amazon Technologies, Inc. System for multi-modal anomaly detection
WO2023222643A1 (en) * 2022-05-17 2023-11-23 Continental Automotive Technologies GmbH Method for image segmentation matching
WO2024021394A1 (en) * 2022-07-29 2024-02-01 南京邮电大学 Person re-identification method and apparatus for fusing global features with ladder-shaped local features
CN117975556A (en) * 2024-01-15 2024-05-03 东北大学佛山研究生创新学院 Cross-modal pedestrian re-recognition method based on three-modal collaborative learning
CN118038499A (en) * 2024-04-12 2024-05-14 北京航空航天大学 Cross-mode pedestrian re-identification method based on mode conversion
US20240386653A1 (en) * 2023-05-17 2024-11-21 Salesforce, Inc. Systems and methods for reconstructing a three-dimensional object from an image


Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Wenbo Yu et al.: "Crossmodal Sequential Interaction Network for Hyperspectral and LiDAR Data Joint Classification", IEEE Geoscience and Remote Sensing Letters, vol. 21, 13 February 2024 (2024-02-13), pages 1-5 *
Yukang Zhang et al.: "Adaptive Middle Modality Alignment Learning for Visible-Infrared Person Re-identification", International Journal of Computer Vision, vol. 133, 9 November 2024 (2024-11-09), page 2176, XP038124276, DOI: 10.1007/s11263-024-02276-4 *
Feng Min; Zhang Zhicheng; Lyu Jin; Yu Lei; Han Bin: "Research on Cross-modal Person Re-identification Based on Generative Adversarial Networks", Modern Information Technology, vol. 4, no. 04, 25 February 2020 (2020-02-25), pages 107-109 *
Zhang Dian; Wang Haitao; Jiang Ying; Chen Xing: "Heterogeneous Face Recognition Based on a Lightweight Network Fusing Near-Infrared and Visible Light", Journal of Chinese Computer Systems, vol. 41, no. 04, 9 April 2020 (2020-04-09), pages 807-811 *
Fan Huijie et al.: "A Survey of Visible-Infrared Cross-modal Person Re-identification Methods", Information and Control, vol. 54, no. 01, 27 September 2024 (2024-09-27), pages 50-65 *
Chen Dan; Li Yongzhong; Yu Peize; Shao Changbin: "Research and Prospects of Cross-modal Person Re-identification", Computer Systems & Applications, vol. 29, no. 10, 13 October 2020 (2020-10-13), pages 20-28 *


Also Published As

Publication number Publication date
CN119600653B (en) 2025-10-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant