
CN119600653A - A cross-modal pedestrian re-identification method and system - Google Patents

A cross-modal pedestrian re-identification method and system

Info

Publication number
CN119600653A
CN119600653A (Application No. CN202411690034.8A)
Authority
CN
China
Prior art keywords
visible light
features
infrared
image
feature
Prior art date
Legal status
Granted
Application number
CN202411690034.8A
Other languages
Chinese (zh)
Other versions
CN119600653B (en)
Inventor
黄鹤
徐毅
Current Assignee
Suzhou University
Original Assignee
Suzhou University
Priority date
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN202411690034.8A priority Critical patent/CN119600653B/en
Publication of CN119600653A publication Critical patent/CN119600653A/en
Application granted granted Critical
Publication of CN119600653B publication Critical patent/CN119600653B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0475Generative networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/094Adversarial learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a cross-modal pedestrian re-identification method and system. The method comprises: randomly selecting a plurality of pedestrians from an infrared-visible dataset in each training batch; preprocessing the visible light images and infrared images; inputting the preprocessed visible light and infrared images into a deep learning model and extracting visible light features, infrared features and mixed-modality features through the deep learning model; constructing a total loss and judging whether training of the deep learning model is complete, wherein training ends if the total loss is below a preset threshold and otherwise continues until the total loss falls below the preset threshold; and identifying pedestrians in the visible light or infrared images to be recognized with the trained deep learning model. The deep learning model constructed by the invention enables effective cross-modal pedestrian re-identification.

Description

Cross-modal pedestrian re-identification method and system
Technical Field
The invention relates to the technical field of pedestrian re-identification, and in particular to a cross-modal pedestrian re-identification method and system.
Background
In recent years, pedestrian re-identification has become an important task in computer vision and video surveillance. By comparing pedestrian images captured by different cameras, it automatically identifies and matches the same individual, and it is therefore widely applied in intelligent surveillance, video retrieval, intelligent transportation and other fields. However, conventional pedestrian re-identification methods typically rely on single-modality visual feature extraction and show significant limitations in cross-modal and low-light scenes. For example, there are substantial modality differences between infrared and visible light images, which makes feature alignment and cross-modal matching very challenging.
To address this problem, infrared-visible cross-modal pedestrian re-identification methods based on deep learning have shown excellent performance: they extract and align cross-modal features better and thereby markedly improve recognition accuracy. Infrared-visible cross-modal pedestrian re-identification exploits the correlation between visible light and infrared images to achieve cross-modal pedestrian recognition, which is of great significance in all-weather surveillance, especially at night or under low illumination, and has therefore become an important research direction in computer vision.
As such a research direction, infrared-visible cross-modal pedestrian re-identification aims to reduce the modality differences between infrared and visible light image features so as to achieve cross-modal pedestrian matching and recognition. It nevertheless faces several challenges, the most significant of which are the large modality differences and the scarcity of cross-modal training data.
(1) Large modality differences
Infrared and visible light images differ markedly in their visual characteristics. A visible light image is formed by the reflection of natural light and provides rich color information, texture details and contour features, whereas an infrared image is formed by the thermal radiation of objects, contains no color information, and mainly reflects the temperature distribution of the scene. This pronounced modality difference means that the appearance of the same person differs greatly between infrared and visible light images, so direct matching with conventional feature extraction methods is difficult.
A cross-modal pedestrian re-identification implementation therefore needs to align these heterogeneous visual features to narrow the modality gap. Two families of methods exist. The first is feature-level: the visible light and infrared features extracted by the backbone network are projected into the same embedding space, and matching is performed in the projected space to reduce the influence of modality differences. However, because the modality differences are large, such methods struggle to project cross-modal images directly into a feature space in which infrared and visible light features align, and even when a feature mapping is learned, many useful shared features are lost in the conversion from the original features to the new ones. The second is image-level: a generative adversarial network is used either to convert infrared and visible light images into a unified modality for feature extraction, or to patch the missing modality, for example converting an infrared (or visible light) image into a visible light (or infrared) image, after which features are fused with the existing modality and matching is performed in the fused feature space. Because the modality differences are large and paired infrared-visible images are scarce, generating cross-modal images is difficult and usually introduces considerable noise.
(2) Scarce cross-modal training data
Another major difficulty is the scarcity of cross-modal training data. Compared with the single-modality visible light pedestrian re-identification task, an infrared-visible cross-modal dataset must contain paired infrared and visible light images, yet such datasets are costly to collect and complex to annotate. In real surveillance scenes, especially at night, infrared images are easier to acquire while the corresponding visible light images are often missing, which leads to a shortage of paired infrared-visible data.
Insufficient data directly affects the generalization ability of deep learning models, which typically require a large amount of annotated data to learn effective cross-modal features. This is especially true for models whose backbone is a Transformer or one of its variants: such models have no spatial or temporal inductive bias and a huge number of parameters, and they need enough data to exploit their structural advantages, establish global dependencies and generalize well. Because cross-modal datasets are scarce, the model easily overfits during training, which degrades recognition performance in practical applications. To alleviate this, current infrared-visible cross-modal pedestrian re-identification methods either apply data augmentation techniques such as channel erasing, channel exchange, random erasing and grayscale augmentation to add new samples, all of which are derived from the original images and therefore lack diversity, or generate more samples within the same modality, for example with generative adversarial networks or by mixing local body features of pedestrians. All of these, however, only augment data within a modality, and the challenges caused by data scarcity remain hard to resolve completely.
In summary, the prior art cannot effectively handle the large modality differences between infrared and visible light image features together with the scarcity of cross-modal training data, so cross-modal pedestrian matching and recognition results are poor.
Disclosure of Invention
Therefore, the technical problem to be solved by the invention is that, in the prior art, infrared and visible light images have large modality differences and cross-modal training data are scarce, so that cross-modal pedestrian matching and recognition perform poorly.
In order to solve the above technical problem, the invention provides a cross-modal pedestrian re-identification method comprising the following steps:
Step S1: randomly selecting a plurality of pedestrians from an infrared-visible dataset in each training batch, wherein each pedestrian includes 4 visible light images and 4 infrared images;
Step S2: preprocessing the visible light images and infrared images;
Step S3: inputting the preprocessed visible light and infrared images into a deep learning model, and extracting visible light features, infrared features and mixed-modality features through the deep learning model;
Step S4: constructing a total loss and judging whether training of the deep learning model is complete, wherein training ends if the total loss is below a preset threshold and otherwise continues until the total loss falls below the preset threshold,
the total loss including a local maximum mean discrepancy loss constructed based on the visible light features and infrared features, an inter-modality divergence loss constructed based on the visible light features and infrared features, a circle loss constructed based on the visible light features, infrared features and mixed-modality features, and a cross-entropy loss constructed based on the visible light features, infrared features and mixed-modality features;
Step S5: identifying pedestrians in the visible light or infrared images to be recognized with the trained deep learning model.
In one embodiment of the present invention, the local maximum mean discrepancy loss constructed in step S4 based on the visible light features and infrared features is:
L_{lmmd} = \frac{1}{N_{id}} \sum_{k=1}^{N_{id}} \Big\| \sum_{i=1}^{N_V} w_{i,k}^{V} \phi(f_i^V) - \sum_{j=1}^{N_I} w_{j,k}^{I} \phi(f_j^I) \Big\|_{\mathcal{H}}^{2}
wherein \mathcal{H} denotes the reproducing kernel Hilbert space, \phi(\cdot) denotes the mapping into the reproducing kernel Hilbert space, N_{id} denotes the total number of pedestrian identity classes, N_V denotes the number of visible light images in a training batch, f_i^V denotes the feature of the i-th visible light image, w_{i,k}^{V} denotes the probability weight that f_i^V belongs to class k, N_I denotes the number of infrared images in the training batch, f_j^I denotes the feature of the j-th infrared image, w_{j,k}^{I} denotes the probability weight that f_j^I belongs to class k, and \|a-b\|^2 denotes the squared distance between a and b.
In one embodiment of the present invention, the inter-modality divergence loss constructed in step S4 based on the visible light features and infrared features is defined with:
N_V and N_I denoting the numbers of visible light and infrared images in the training batch, C_V and C_I denoting the visible light and infrared image classifiers respectively, f_j^I denoting the feature of the j-th infrared image, and f_i^V denoting the feature of the i-th visible light image.
In one embodiment of the present invention, the circle loss constructed in step S4 based on the visible light features, infrared features and mixed-modality features is:
L_{circle} = \log\Big[1 + \sum_{j \neq y} \exp\big(\gamma\,\alpha_n^{j}(s_n^{j} - \Delta_n)\big)\,\exp\big(-\gamma\,\alpha_p(s_p - \Delta_p)\big)\Big]
wherein s_p = w_y^{\mathrm{T}} f_m / (\|w_y\|\,\|f_m\|), s_n^{j} = w_j^{\mathrm{T}} f_m / (\|w_j\|\,\|f_m\|), m \in \{V, I, H\}; s_p and s_n^{j} denote the within-class and between-class similarity scores, w_j and w_y denote the non-target and target classification weights, w_j^{\mathrm{T}} and w_y^{\mathrm{T}} denote the transposes of w_j and w_y, f_m denotes the image feature belonging to modality m, m is the modality to which the image feature belongs, V, I and H denote the visible light, infrared and mixed modalities respectively, \|\cdot\| denotes the L2 norm, \alpha_p and \alpha_n denote the first and second weight coefficients of the circle loss, and \Delta_n, \Delta_p and \gamma are the first, second and third hyperparameters respectively.
In one embodiment of the present invention, the cross-entropy loss constructed in step S4 based on the visible light features, infrared features and mixed-modality features is specifically constructed as follows:
the visible light features, infrared features and mixed-modality features are passed through classification layers to obtain their respective class scores, and the cross-entropy loss is constructed based on these class scores, with the formula:
L_{ce} = -\sum_{m \in \{V, I, H\}} \frac{1}{N_m} \sum_{i=1}^{N_m} \sum_{k=1}^{N_{id}} y_{i,k} \log \hat{y}_{i,k}^{m}
wherein N_V, N_I and N_H denote the numbers of visible light, infrared and mixed-modality images in the training batch, N_{id} denotes the total number of pedestrian identity classes, y_{i,k} denotes the true class score of the i-th image for the k-th pedestrian, \hat{y}_{i,k}^{m} denotes the predicted class score of the i-th image for the k-th pedestrian, m is the modality of the image feature, and V, I and H denote the visible light, infrared and mixed modalities respectively.
In one embodiment of the present invention, the deep learning model in step S3 includes a first stage, a mixed-modality generation module, a second stage, a third stage and a fourth stage connected in sequence, each stage including at least one Swin Transformer block appearing in pairs, each paired Swin Transformer block including a first sub-block and a second sub-block connected in sequence, wherein
the first sub-block includes a first normalization layer LN, a window-based multi-head self-attention layer W-MSA, a second normalization layer LN and a first multi-layer perceptron MLP connected in sequence, wherein the input of the first normalization layer LN and the output of the window-based multi-head self-attention layer W-MSA are added element-wise to form the input of the second normalization layer LN, and the input of the second normalization layer LN and the output of the first multi-layer perceptron MLP are added element-wise to form the input of the second sub-block;
the second sub-block includes a third normalization layer LN, a shifted-window multi-head self-attention layer SW-MSA, a fourth normalization layer LN and a second multi-layer perceptron MLP connected in sequence, wherein the input of the third normalization layer LN and the output of the shifted-window multi-head self-attention layer SW-MSA are added element-wise to form the input of the fourth normalization layer LN, and the input of the fourth normalization layer LN and the output of the second multi-layer perceptron MLP are added element-wise to form the output of the paired Swin Transformer block;
the first Swin Transformer block of the first stage is preceded by a Linear Embedding module, which converts the image into the patches required for the Swin Transformer block operations, and the first Swin Transformer block of each of the second, third and fourth stages is preceded by a Patch Merging module, which reduces the number of patches while increasing their dimension so as to form a hierarchical feature representation.
In one embodiment of the present invention, obtaining the mixed-modality image features from the visible light image features and infrared image features output by the first stage through the mixed-modality generation module includes:
setting the feature of each visible light image output by the first stage to x_V and the feature of each infrared image to x_I;
convolving x_V with first dilated convolutions having dilation rates of 1, 2 and 4 respectively, adding and averaging the results, and obtaining the generated features of the visible light part through a fully connected layer and a ReLU activation function;
convolving x_I with second dilated convolutions having dilation rates of 1, 2 and 4, and obtaining the generated features of the infrared part through a fully connected layer and a ReLU activation function;
in order to extract scene-level context information common to the infrared and visible light images and generate background features, applying the same adaptive pooling and 1×1 convolution to x_V and x_I and taking their pixel-wise dot product to obtain common background features, resizing the common background features to the same size as the image features output by the first stage through bilinear interpolation, and obtaining the global background generated features through a ReLU activation function;
finally, adding and averaging the generated features of the visible light part, the generated features of the infrared part and the global background generated features to obtain the final mixed-modality image features;
In order to maintain the consistency of the final mixed-modality features, a modality consistency loss is introduced into the total loss,
wherein L_2(\cdot) denotes L2 regularization, f_t^H denotes the t-th generated image feature of the mixed modality, f_s^H denotes the s-th generated image feature of the mixed modality, \|\cdot\|_2 denotes the Euclidean distance, and n is the number of mixed-modality features of the same pedestrian.
In one embodiment of the invention, the mixed-modality generation module further performs, through adversarial learning, feature alignment between the visible light image features x_V output by the first stage and the mixed-modality image features generated by the mixed-modality generation module, and between the infrared image features x_I output by the first stage and the mixed-modality image features generated by the mixed-modality generation module, including:
reducing, through a generator and a discriminator, the modality differences between the visible light image features x_V and the mixed-modality image features generated by the mixed-modality generation module, and between the infrared image features x_I and the mixed-modality image features generated by the mixed-modality generation module, wherein
the generator consists of three fully connected layers with ReLU activation functions between them to increase the nonlinearity of the generator; the visible light image features x_V and infrared image features x_I output by the first stage are added to compensation embedding vectors having the same length and width as the original input image and a dimension of 1, and forged mixed-modality features are obtained after passing through the corresponding generators;
the discriminator is a PatchGAN discriminator; the mixed-modality image features forged from the visible light image features x_V or infrared image features x_I and the mixed-modality image features generated by the mixed-modality generation module are input into the discriminator, and adversarial losses with respect to the discriminator are constructed in the total loss function, the adversarial losses being used to judge whether the forged mixed-modality features are the same as the mixed-modality image features generated by the mixed-modality generation module; the adversarial losses include the discriminator loss, the generator loss and the joint adversarial loss, with the formula:
L_GAN = L_D + L_G;
wherein L_D is the discriminator loss, L_G is the generator loss, L_GAN is the joint adversarial loss, N_V is the number of visible light images in the training batch, N_I is the number of infrared images in the training batch, N_H is the number of mixed-modality images in the training batch, D(\cdot) is the discriminator, x_i^V is the i-th visible light image, x_j^I is the j-th infrared image, x_k^H is the k-th mixed-modality image, y_V is the label of a visible light image, y_I is the label of an infrared image, y_H is the label of the mixed modality, G_{V-H} is the visible-to-mixed-modality generator, and G_{I-H} is the infrared-to-mixed-modality generator.
In one embodiment of the present invention, a component detection and exchange module is further provided after the last Swin Transformer block of the third stage, the component detection and exchange module including:
the global image features G_V, G_H, G_I corresponding to the visible light, mixed and infrared modalities output by the last Swin Transformer block of the third stage, together with the features output by the first and second stages, are sent to the body part detector (Component Detector) to obtain initial part-local features composed of a plurality of body part-local features; the initial part-local features pass through a part prediction network (Predictor Network), which consists of a fully connected layer and a Sigmoid function and predicts the score of each body part-local feature among the initial part-local features; body part features in the initial part-local features are then replaced according to the predicted scores: if a body part feature score of the mixed modality is smaller than the corresponding body part feature scores of the other two modalities, the body part feature of the mixed modality is replaced with the body part feature having the larger score from the other two modalities, so that the overlapping and redundant body part-local features of the mixed modality are replaced, yielding the part-local features P_V, P_H, P_I;
the body part detector (Component Detector) passes the features output by the first and second stages and the features output by the last Swin Transformer block of the third stage through a 1×1 convolutional neural network and a Sigmoid function, respectively, to obtain attention masks at different levels, resizes the attention masks to the same size and multiplies them pixel-wise to obtain a final mask, then multiplies the final mask with the global image features G_V, G_H, G_I corresponding to the visible light, mixed and infrared modalities output by the last Swin Transformer block of the third stage, and finally concatenates the results along the dimension to obtain the initial part-local features;
a part identity loss with respect to the part features is constructed in the total loss function, with the formula:
L_{pid} = -\frac{1}{N\,N_c} \sum_{i=1}^{N} \sum_{j=1}^{N_c} \sum_{k=1}^{N_{id}} y_{i,k} \log z_{i,j,k}
wherein N denotes the total number of images in the training batch, N_c denotes the number of body part-local features, N_{id} is the total number of pedestrian identity classes, y_{i,k} denotes the true class score of the i-th image for the k-th pedestrian, and z_{i,j,k} denotes the predicted class score with which the j-th body part-local feature of the i-th image belongs to the k-th pedestrian.
In order to solve the above technical problem, the invention further provides a cross-modal pedestrian re-identification system, including:
a selection module, configured to randomly select a plurality of pedestrians from an infrared-visible dataset in each training batch, wherein each pedestrian includes 4 visible light images and 4 infrared images;
a preprocessing module, configured to preprocess the visible light images and infrared images;
a feature extraction module, configured to input the preprocessed visible light and infrared images into a deep learning model and extract visible light features, infrared features and mixed-modality features through the deep learning model;
a construction module, configured to construct a total loss and judge whether training of the deep learning model is complete, wherein training ends if the total loss is below a preset threshold and otherwise continues until the total loss falls below the preset threshold,
the total loss including a local maximum mean discrepancy loss constructed based on the visible light features and infrared features, an inter-modality divergence loss constructed based on the visible light features and infrared features, a circle loss constructed based on the visible light features, infrared features and mixed-modality features, and a cross-entropy loss constructed based on the visible light features, infrared features and mixed-modality features; and
a recognition module, configured to identify pedestrians in the visible light or infrared images to be recognized with the trained deep learning model.
In order to solve the above technical problem, the invention further provides an electronic device, including a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above cross-modal pedestrian re-identification method when executing the computer program.
In order to solve the above technical problem, the invention further provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the above cross-modal pedestrian re-identification method.
Compared with the prior art, the technical scheme of the invention has the following advantages:
the deep learning model constructed by the invention adopts a novel three-modality branch structure and learns the mixed modality from the visible light and infrared images, which effectively increases the number of training samples and benefits subsequent pedestrian re-identification;
the invention introduces a mixed-modality generation module into the deep learning model and realizes alignment between the visible light images and the mixed-modality images and between the infrared images and the mixed-modality images through learnable modality compensation embeddings and an adversarial generation network, which effectively reduces the modality difference between the infrared and visible light modalities in actual retrieval;
the invention introduces a component detection and exchange module into the deep learning model, which explores subtle differences among images while reducing the difficulty of alignment.
Drawings
In order that the invention may be more readily understood, a more particular description of the invention will be rendered by reference to specific embodiments thereof that are illustrated in the appended drawings.
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of a deep learning model structure in an embodiment of the invention;
FIG. 3 is a structural diagram of a Swin Transformer block according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a hybrid modality generation module according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a component detection and switching module in accordance with an embodiment of the present invention.
Detailed Description
The present invention will be further described below with reference to the accompanying drawings and specific embodiments, which are not intended to limit the invention, so that those skilled in the art can better understand and practice the invention.
Example 1
Referring to Fig. 1, the cross-modal pedestrian re-identification method of the invention comprises the following steps:
Step S1: randomly selecting a plurality of pedestrians from an infrared-visible dataset in each training batch, wherein each pedestrian includes 4 visible light images and 4 infrared images;
Step S2: preprocessing the visible light images and infrared images;
Step S3: inputting the preprocessed visible light and infrared images into a deep learning model, and extracting visible light features, infrared features and mixed-modality features through the deep learning model;
Step S4: constructing a total loss and judging whether training of the deep learning model is complete, wherein training ends if the total loss is below a preset threshold and otherwise continues until the total loss falls below the preset threshold,
the total loss including a local maximum mean discrepancy loss constructed based on the visible light features and infrared features, an inter-modality divergence loss constructed based on the visible light features and infrared features, a circle loss constructed based on the visible light features, infrared features and mixed-modality features, and a cross-entropy loss constructed based on the visible light features, infrared features and mixed-modality features;
Step S5: identifying pedestrians in the visible light or infrared images to be recognized with the trained deep learning model. Specifically, the model identifies a pedestrian in an infrared image to be recognized given a visible light image, or identifies a pedestrian in a visible light image to be recognized given an infrared image. It should be noted that, since the deep learning model trained in this embodiment has three branches, the branch corresponding to the infrared or visible light images can be selected in actual use according to the application scenario.
The present embodiment is described in detail below:
The deep learning model provided by the invention is trained with a three-branch network whose backbone is Swin Transformer V2. The three branches of the model are a visible light branch, an infrared branch and a mixed-modality branch, and the model parameters of the three branches are shared. The training data of the mixed-modality branch are formed by mixing visible light and infrared images, which effectively increases the number of samples for model training; at the same time, the model aligns the visible light and infrared image features with the mixed-modality features, which effectively reduces the differences between modalities and the overfitting to the training samples. To discover subtle differences among images, the model extracts local features by generating masks in the third stage and performs progressive replacement of body part-local features among the three modality branches, so that the overlapping and redundant body part-local features of the mixed modality are replaced and the local features of the mixed modality become purer. Fig. 2 shows the structure of the whole model, and the specific training process of the model is as follows:
S1. During each training batch, 4 pedestrians are randomly selected from the infrared-visible dataset, and each pedestrian includes 4 visible light images and 4 infrared images.
S2. All pedestrian images are preprocessed. All images are first scaled to 288 × 144 × 3 and data augmentation is applied: visible light images undergo random cropping, random horizontal flipping and random erasing plus one randomly chosen augmentation among random channel erasing, random channel exchange, color jitter and spectrum jitter, while infrared images undergo only random cropping, random horizontal flipping and random erasing; finally, the images are normalized with a mean of [0.485, 0.456, 0.406] and a standard deviation of [0.229, 0.224, 0.225]. A minimal sketch of this pipeline is given below.
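The following is a minimal torchvision-style sketch of this preprocessing, given only as an illustration; RandomChannelExchange is a hypothetical stand-in for the channel-level augmentations (random channel erasing, random channel exchange, color jitter, spectrum jitter), whose exact implementations are not given here.

```python
import random
import torch
from torchvision import transforms

class RandomChannelExchange:
    """Randomly permute the RGB channels of a tensor image (illustrative augmentation)."""
    def __init__(self, p=0.5):
        self.p = p
    def __call__(self, img):
        if random.random() < self.p:
            img = img[torch.randperm(3), :, :]
        return img

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

visible_transform = transforms.Compose([
    transforms.Resize((288, 144)),
    transforms.RandomCrop((288, 144), padding=10),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    RandomChannelExchange(p=0.5),     # stand-in for the channel-level augmentations
    normalize,
    transforms.RandomErasing(p=0.5),
])

infrared_transform = transforms.Compose([
    transforms.Resize((288, 144)),
    transforms.RandomCrop((288, 144), padding=10),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    normalize,
    transforms.RandomErasing(p=0.5),
])
```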
S3. The processed visible light and infrared images are input into the model to obtain the features of the three modalities, visible light, infrared and mixed, generated and extracted by the model.
S4. The total loss is constructed and it is judged whether the deep learning model has finished training: if the total loss is below a preset threshold, training ends; otherwise, training continues until the total loss falls below the preset threshold.
Specifically, the total loss includes a local maximum mean discrepancy loss, an inter-modality divergence loss, a circle loss and a cross-entropy loss, as follows:
(1) The extracted visible light features f^V and infrared features f^I are supervised with the local maximum mean discrepancy loss and the inter-modality divergence loss, where the local maximum mean discrepancy loss function is:
L_{lmmd} = \frac{1}{N_{id}} \sum_{k=1}^{N_{id}} \Big\| \sum_{i=1}^{N_V} w_{i,k}^{V} \phi(f_i^V) - \sum_{j=1}^{N_I} w_{j,k}^{I} \phi(f_j^I) \Big\|_{\mathcal{H}}^{2}
wherein \mathcal{H} denotes the reproducing kernel Hilbert space, \phi(\cdot) denotes the mapping into the reproducing kernel Hilbert space, N_{id} denotes the total number of pedestrian identity classes, N_V denotes the number of visible light images in a training batch, f_i^V denotes the feature of the i-th visible light image, w_{i,k}^{V} denotes the probability weight that f_i^V belongs to class k, N_I denotes the number of infrared images in the training batch, f_j^I denotes the feature of the j-th infrared image, w_{j,k}^{I} denotes the probability weight that f_j^I belongs to class k, and \|a-b\|^2 denotes the squared distance between a and b.
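For illustration, a simplified PyTorch-style sketch of this class-weighted discrepancy term is given below, assuming a linear kernel so that \phi reduces to the identity map; the function name and the weight normalization are assumptions, and the actual implementation may use a different kernel.

```python
import torch

def lmmd_linear(feat_v, feat_i, prob_v, prob_i):
    """feat_v: [Nv, d] visible features; feat_i: [Ni, d] infrared features;
    prob_v: [Nv, K], prob_i: [Ni, K] class-probability weights."""
    # normalize the weights per class so each class-wise weighted sum acts as a mean
    w_v = prob_v / (prob_v.sum(dim=0, keepdim=True) + 1e-6)   # [Nv, K]
    w_i = prob_i / (prob_i.sum(dim=0, keepdim=True) + 1e-6)   # [Ni, K]
    mean_v = w_v.t() @ feat_v                                 # [K, d] per-class visible means
    mean_i = w_i.t() @ feat_i                                 # [K, d] per-class infrared means
    return ((mean_v - mean_i) ** 2).sum(dim=1).mean()         # squared distance averaged over classes
```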
(2) The inter-modality divergence loss is constructed from N_V and N_I, the numbers of visible light and infrared images in the training batch, the visible light and infrared image classifiers C_V and C_I, the feature f_i^V of the i-th visible light image, and the feature f_j^I of the j-th infrared image.
(3) The visible light features f^V, the infrared features f^I and the mixed-modality features f^H are supervised with the circle loss:
L_{circle} = \log\Big[1 + \sum_{j \neq y} \exp\big(\gamma\,\alpha_n^{j}(s_n^{j} - \Delta_n)\big)\,\exp\big(-\gamma\,\alpha_p(s_p - \Delta_p)\big)\Big]
wherein s_p = w_y^{\mathrm{T}} f_m / (\|w_y\|\,\|f_m\|), s_n^{j} = w_j^{\mathrm{T}} f_m / (\|w_j\|\,\|f_m\|), m \in \{V, I, H\}; s_p and s_n^{j} denote the within-class and between-class similarity scores, w_j and w_y denote the non-target and target classification weights, w_j^{\mathrm{T}} and w_y^{\mathrm{T}} denote the transposes of w_j and w_y, f_m denotes the image feature belonging to modality m, m is the modality to which the image feature belongs, V, I and H denote the visible light, infrared and mixed modalities respectively, \|\cdot\| denotes the L2 norm, \alpha_p and \alpha_n denote the first and second weight coefficients of the circle loss, and \Delta_n, \Delta_p and \gamma are the first, second and third hyperparameters, taking the values 0.5, 0.5 and 30 respectively.
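For illustration, a PyTorch-style sketch of a classification-form circle loss consistent with the symbols above is given below; it follows the standard circle loss formulation with a margin of 0.5 (so that \Delta_p = \Delta_n = 0.5) and \gamma = 30, and the exact weighting used in this embodiment may differ.

```python
import torch
import torch.nn.functional as F

def circle_loss(features, weights, labels, margin=0.5, gamma=30.0):
    """features: [N, d]; weights: [K, d] classifier weights; labels: [N] int64 identities.
    margin=0.5 gives delta_p = 1 - margin = 0.5 and delta_n = margin = 0.5 as in the text."""
    sim = F.normalize(features, dim=1) @ F.normalize(weights, dim=1).t()   # cosine scores [N, K]
    one_hot = F.one_hot(labels, num_classes=weights.size(0)).bool()
    s_p = sim[one_hot].view(-1, 1)                   # target-class similarity, [N, 1]
    s_n = sim[~one_hot].view(sim.size(0), -1)        # non-target similarities, [N, K-1]
    alpha_p = torch.clamp_min(1.0 + margin - s_p, 0.0)   # adaptive positive weight
    alpha_n = torch.clamp_min(s_n + margin, 0.0)         # adaptive negative weights
    delta_p, delta_n = 1.0 - margin, margin
    logit_p = -gamma * alpha_p * (s_p - delta_p)
    logit_n = gamma * alpha_n * (s_n - delta_n)
    # log(1 + sum_j exp(logit_n_j) * exp(logit_p)) = softplus(logsumexp(logit_n) + logit_p)
    return F.softplus(torch.logsumexp(logit_n, dim=1) + logit_p.squeeze(1)).mean()
```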
(4) The features of the three modalities are passed through a classification layer (namely a fully connected layer, the FC layer in Fig. 2) to obtain the class scores z^V, z^I and z^H, which are supervised with the cross-entropy loss; the loss function is:
L_{ce} = -\sum_{m \in \{V, I, H\}} \frac{1}{N_m} \sum_{i=1}^{N_m} \sum_{k=1}^{N_{id}} y_{i,k} \log \hat{y}_{i,k}^{m}
wherein N_V, N_I and N_H denote the numbers of visible light, infrared and mixed-modality images in the training batch, N_{id} denotes the total number of pedestrian identity classes, y_{i,k} denotes the true class score of the i-th image for the k-th pedestrian, \hat{y}_{i,k}^{m} denotes the predicted class score of the i-th image for the k-th pedestrian, m is the modality of the image feature, and V, I and H denote the visible light, infrared and mixed modalities respectively.
S5. The trained deep learning model identifies pedestrians in the visible light or infrared images to be recognized.
The specific flow of step S3 is as follows:
S3-1. The deep learning model of this embodiment adopts Swin Transformer V2 as the backbone network and includes a first stage (Stage 1), a mixed-modality generation module (HMG Module), a second stage (Stage 2), a third stage (Stage 3) and a fourth stage (Stage 4) connected in sequence, with a component detection and exchange module (CDS Module) further provided in the third stage (Stage 3). Swin Transformer V2 first partitions the received visible light and infrared images into patches through the Patch Partition module in Fig. 2. So that each patch fully perceives the features in its neighborhood and the model fully utilizes the image features, the model adopts overlapping partitioning, implemented as a two-dimensional convolution with stride 3 and kernel size 4 on the input image. The input image received by Swin Transformer V2 has size [288, 144, 3]; the convolutional partitioning yields 96 × 48 patches, each of dimension 128. The patches are then linearly embedded (the Linear Embedding module in the first stage), the 4608 patches are flattened to obtain image features of size [4608, 128], and these are fed into the Swin Transformer blocks of the four stages, whose numbers of Swin Transformer blocks are [2, 2, 18, 2] respectively. A sketch of the overlapping patch partition is given below.
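For illustration, the overlapping patch partition can be sketched as follows; the padding value is an assumption chosen so that a 288 × 144 input yields the 96 × 48 patches stated above.

```python
import torch
import torch.nn as nn

# overlapping patch partition: kernel 4, stride 3; padding=1 is an assumption chosen so
# that a [3, 288, 144] image yields 96 x 48 patch tokens of dimension 128
patch_embed = nn.Conv2d(in_channels=3, out_channels=128, kernel_size=4, stride=3, padding=1)

x = torch.randn(1, 3, 288, 144)             # one preprocessed image
tokens = patch_embed(x)                     # [1, 128, 96, 48]
tokens = tokens.flatten(2).transpose(1, 2)  # [1, 4608, 128] patch sequence for Stage 1
```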
In this embodiment, the Swin Transformer blocks consisting of W-MSA and SW-MSA always appear in pairs. Referring to Fig. 3, each paired Swin Transformer block includes a first sub-block and a second sub-block connected in sequence, wherein
the first sub-block includes a first normalization layer LN, a window-based multi-head self-attention layer W-MSA, a second normalization layer LN and a first multi-layer perceptron MLP connected in sequence, wherein the input of the first normalization layer LN and the output of the window-based multi-head self-attention layer W-MSA are added element-wise to form the input of the second normalization layer LN, and the input of the second normalization layer LN and the output of the first multi-layer perceptron MLP are added element-wise to form the input of the second sub-block;
the second sub-block includes a third normalization layer LN, a shifted-window multi-head self-attention layer SW-MSA, a fourth normalization layer LN and a second multi-layer perceptron MLP connected in sequence, wherein the input of the third normalization layer LN and the output of the shifted-window multi-head self-attention layer SW-MSA are added element-wise to form the input of the fourth normalization layer LN, and the input of the fourth normalization layer LN and the output of the second multi-layer perceptron MLP are added element-wise to form the output of the paired Swin Transformer block;
the first Swin Transformer block of the first stage is preceded by a Linear Embedding module, which converts the image into the patches required for the Swin Transformer block operations (that is, the patches obtained by the partitioning are mapped and flattened into a new dimension convenient for the subsequent Swin Transformer blocks), and the first Swin Transformer block of each of the second, third and fourth stages is preceded by a Patch Merging module, which reduces the number of patches while increasing their dimension so as to form a hierarchical feature representation. Since the Linear Embedding module and the Patch Merging module belong to the prior art, they are not described further in this embodiment.
The multi-layer perceptron MLP consists of two fully connected layers with a GELU nonlinear activation function between them.
Further, the Swin Transformer blocks consisting of W-MSA and SW-MSA always appear in pairs and are divided into two sub-blocks, a first sub-block and a second sub-block, through which relations between patches of different regions are established. The image features output by a paired Swin Transformer block with its two sub-blocks are calculated as follows:
\hat{z}_1^{m} = \mathrm{WMSA}(\mathrm{LN}(z_0^{m})) + z_0^{m}
z_1^{m} = \mathrm{MLP}(\mathrm{LN}(\hat{z}_1^{m})) + \hat{z}_1^{m}
\hat{z}_2^{m} = \mathrm{SWMSA}(\mathrm{LN}(z_1^{m})) + z_1^{m}
z_2^{m} = \mathrm{MLP}(\mathrm{LN}(\hat{z}_2^{m})) + \hat{z}_2^{m}
wherein z_0^{m} is the input of the paired block, \hat{z}_1^{m} and \hat{z}_2^{m} are respectively the features output by the multi-head self-attention layer of the first sub-block and of the second sub-block, z_1^{m} and z_2^{m} are respectively the features output by the multi-layer perceptron of the first sub-block and of the second sub-block, WMSA(·) is the window-based multi-head self-attention layer, MLP(·) is the multi-layer perceptron, SWMSA(·) is the shifted-window multi-head self-attention layer, m is the modality to which the image features belong, and V, I and H denote the visible light, infrared and mixed-modality features respectively. To produce a hierarchical representation, Patch Merging layers between the Swin stages reduce the number of patches while increasing their dimension: assuming each patch has dimension C, the Patch Merging layer concatenates the features of each group of 2 × 2 adjacent patches and uses a linear layer to reduce the concatenated features of dimension 4C to dimension 2C.
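For illustration, a structural PyTorch-style sketch of one paired block with the pre-norm residual layout described above is given below; nn.MultiheadAttention is used only as a stand-in for the window-based and shifted-window attention, whose window partitioning logic is omitted, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class SwinSubBlock(nn.Module):
    """Pre-norm residual sub-block: LN -> attention -> residual, LN -> MLP -> residual."""
    def __init__(self, dim=128, heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # stand-in for (S)W-MSA
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim),
                                 nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))

    def forward(self, x):                    # x: [B, N_patches, dim]
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]        # residual over the attention branch
        x = x + self.mlp(self.norm2(x))      # residual over the MLP branch
        return x

class PairedSwinBlock(nn.Module):
    """First sub-block uses W-MSA, second uses SW-MSA (structurally identical here)."""
    def __init__(self, dim=128):
        super().__init__()
        self.block_w = SwinSubBlock(dim)
        self.block_sw = SwinSubBlock(dim)

    def forward(self, x):
        return self.block_sw(self.block_w(x))
```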
Referring to Fig. 2, after the Swin Transformer blocks of the first stage (Stage 1), the output visible light and infrared modality features pass through the mixed-modality generation module (HMG Module) to form three branches, which respectively output the visible light features, the mixed-modality image features and the infrared image features; the features of the three modalities are then sent respectively to the Swin Transformer blocks of the last three stages (Stages 2, 3 and 4).
Referring to Fig. 2, the global image features corresponding to the three modalities (G_V, G_H, G_I in Fig. 2) output by the last Swin Transformer block of the third stage (Stage 3), together with the features output by the first and second stages (Stages 1 and 2), are sent to the component detection and exchange module (CDS Module), which extracts the corresponding part-local features P_V, P_H, P_I and outputs them together with the global image features G_V, G_H, G_I. The global image features (G_V, G_H, G_I) and part-local features (P_V, P_H, P_I) of the three modalities are sent respectively to the Swin Transformer blocks of the last stage (Stage 4), and finally feature concatenation is performed (the "C" after the fourth stage in Fig. 2, meaning concatenate), splicing the respective global image features and part-local features along the patch dimension; the final feature outputs f_V, f_I and f_H are obtained through an average pooling layer (POOL) and a batch normalization layer (BN).
S3-2. Referring to Fig. 4, the mixed-modality generation module (HMG Module) is specified as follows:
For each pedestrian in a training batch, the 4 visible light images and 4 infrared images are combined and 16 mixed-modality images are generated through the mixed-modality generation module (HMG Module) (one visible light image can be combined with each of the 4 infrared images to generate mixed-modality images, so 4 visible light images and 4 infrared images generate 16 images). The specific generation flow is as follows. First, the visible light image features and infrared image features are added to their respective modality compensation embeddings and restored to the original image proportions. For each visible light image feature x_V and infrared image feature x_I output by the first stage (Stage 1), in order to capture and generate image information at different scales, x_V is convolved by first dilated convolutions with dilation rates of 1, 2 and 4 respectively, the results are added and averaged, and the generated features of the visible light part are obtained through a fully connected layer and a ReLU activation function. A similar operation is performed on x_I: the generated features of the infrared part are obtained through second dilated convolutions with the same dilation rates of 1, 2 and 4 but with parameters different from those of the first dilated convolutions, followed by a fully connected layer and a ReLU activation function. In addition, in order to extract scene-level context information common to the infrared and visible light images and generate background features, x_V and x_I are passed through the same adaptive average pooling and 1 × 1 convolution and their pixel-wise dot product is taken to obtain the common background features, which are then resized to the same size as the image features output by the first stage (Stage 1) through bilinear interpolation, and the global background generated features are obtained through a ReLU activation. Finally, the generated features of the several parts are fused: the generated features of the visible light part, the generated features of the infrared part and the global background generated features are added and averaged to obtain the final mixed-modality image features. In order to keep the generated mixed-modality image features consistent, this embodiment introduces a modality consistency loss into the total loss,
wherein L_2(\cdot) denotes L2 regularization, f_t^H and f_s^H denote the t-th and s-th generated image features of the mixed modality, \|\cdot\|_2 denotes the Euclidean distance, and n is the number of mixed-modality features of the same pedestrian.
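For illustration, a simplified PyTorch-style sketch of this generation flow is given below; the channel width, the pooling size and the placement of the fully connected layers are assumptions not specified above, and HybridModalityGenerator is a hypothetical class name.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridModalityGenerator(nn.Module):
    """Generates a mixed-modality feature map from Stage-1 visible and infrared features."""
    def __init__(self, channels=128):
        super().__init__()
        # dilated (atrous) convolutions with rates 1, 2, 4 for each modality branch
        self.dil_v = nn.ModuleList([nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
                                    for d in (1, 2, 4)])
        self.dil_i = nn.ModuleList([nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
                                    for d in (1, 2, 4)])
        self.fc_v = nn.Linear(channels, channels)
        self.fc_i = nn.Linear(channels, channels)
        self.bg_conv = nn.Conv2d(channels, channels, 1)   # shared 1x1 conv on pooled features

    def forward(self, x_v, x_i):            # x_v, x_i: [B, C, H, W]
        b, c, h, w = x_v.shape
        # visible branch: average of dilated convolutions -> FC -> ReLU
        v = sum(conv(x_v) for conv in self.dil_v) / 3
        v = F.relu(self.fc_v(v.permute(0, 2, 3, 1))).permute(0, 3, 1, 2)
        # infrared branch: same structure with separate parameters
        i = sum(conv(x_i) for conv in self.dil_i) / 3
        i = F.relu(self.fc_i(i.permute(0, 2, 3, 1))).permute(0, 3, 1, 2)
        # shared background: adaptive pooling, 1x1 conv, pixel-wise product, upsample, ReLU
        pv = self.bg_conv(F.adaptive_avg_pool2d(x_v, (h // 4, w // 4)))
        pi = self.bg_conv(F.adaptive_avg_pool2d(x_i, (h // 4, w // 4)))
        bg = F.relu(F.interpolate(pv * pi, size=(h, w), mode='bilinear', align_corners=False))
        # final mixed-modality feature: average of the three generated features
        return (v + i + bg) / 3
```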
For the 16 generated mixed-modality image features, prediction labels are obtained through the subsequent PatchGAN discriminator, and the 4 mixed-modality image features closest to the mixed-modality label are selected as the final outputs (4 mixed-modality image features are selected here so that their number equals the number of visible light / infrared images originally input into the deep learning model). It should be noted that the mixed-modality generation module (HMG Module) generates not only the final mixed-modality image features, but also forged mixed-modality features based on the visible light image features x_V and forged mixed-modality features based on the infrared image features x_I; in short, the mixed-modality generation module (HMG Module) generates the features of the three branches, visible light, mixed modality and infrared, and inputs the generated features of the three branches into the second stage (Stage 2) respectively.
The model then performs feature alignment between the visible-mixed modalities and between the infrared-mixed modalities through adversarial learning. The core of the alignment part is a generator (Generator) and a discriminator D, and the modality differences between the visible light and infrared image features and the mixed-modality features are reduced through the mutual game between the generator and the discriminator. The generator aims to make the mixed-modality features generated from the other modalities as close as possible to the actual mixed-modality features, while the discriminator D aims to distinguish the mixed-modality features from the features of the other two modalities. The visible light and infrared features, each added to its modality compensation embedding (Visible Compensation Embedding and Infrared Compensation Embedding in Fig. 4, two embedding vectors with the same length and width as the original input image and a dimension of 1), are fed to the generators, which are divided into a visible-to-mixed-modality generator G_V-H and an infrared-to-mixed-modality generator G_I-H; each generator is a cascade of three fully connected layers with ReLU activation functions between them to increase its nonlinearity, and forged features are obtained after passing through the respective generators. The discriminators D (D_V-H and D_I-H in Fig. 4) use the classical PatchGAN discriminator structure. The mixed-modality features forged from the visible light image features x_V or infrared image features x_I and the mixed-modality image features generated by the mixed-modality generation module are input into the corresponding discriminator D, and adversarial losses with respect to the discriminators are constructed in the total loss function. The adversarial losses are used to judge whether the forged mixed-modality features are the same as the mixed-modality image features generated by the mixed-modality generation module (when the adversarial loss is smaller than a preset value they are regarded as the same, indicating that the actually generated mixed-modality image features are effective; when it is larger than the preset value they are regarded as different, indicating that the actually generated mixed-modality image features are not ideal), thereby achieving feature alignment. The adversarial losses include the discriminator loss, the generator loss and the joint adversarial loss, with the formula:
L_GAN = L_D + L_G;
wherein L_D is the discriminator loss, L_G is the generator loss, L_GAN is the joint adversarial loss, N_V is the number of visible light images in the training batch, N_I is the number of infrared images in the training batch, N_H is the number of mixed-modality images in the training batch, D(\cdot) is the discriminator, x_i^V is the i-th visible light image, x_j^I is the j-th infrared image, x_k^H is the k-th mixed-modality image, y_V is the label of a visible light image, y_I is the label of an infrared image, y_H is the label of the mixed modality, G_{V-H} is the visible-to-mixed-modality generator, and G_{I-H} is the infrared-to-mixed-modality generator.
Finally, the aligned infrared, visible light and mixed-modality image features are sent to the subsequent stages of the model for processing. A structural sketch of the generator and discriminator is given below.
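For illustration, the generator and discriminator structures can be sketched as follows; the layer widths and class names are assumptions, and the exact forms of L_D and L_G are not reproduced here.

```python
import torch
import torch.nn as nn

class ModalityGenerator(nn.Module):
    """Three cascaded fully connected layers with ReLU in between (e.g. G_V-H or G_I-H)."""
    def __init__(self, dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(inplace=True),
            nn.Linear(dim, dim), nn.ReLU(inplace=True),
            nn.Linear(dim, dim),
        )

    def forward(self, x):                    # x: modality feature plus compensation embedding
        return self.net(x)                   # forged mixed-modality feature

class PatchDiscriminator(nn.Module):
    """A PatchGAN-style discriminator producing per-patch real/fake scores."""
    def __init__(self, channels=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 1, 4, stride=1, padding=1),
        )

    def forward(self, x):                    # x: [B, C, H, W] feature map
        return self.net(x)
```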
S3-3. Referring to Fig. 5, the component detection and exchange module (CDS Module) is specified as follows:
A component detection and exchange module is further provided after the last Swin Transformer block of the third stage. The global image features G_V, G_H, G_I corresponding to the three modalities output by the last Swin Transformer block of the third stage, together with the features output by the first and second stages, are sent to the body part detector (Component Detector) to obtain initial part-local features composed of a plurality of body part-local features. The initial part-local features pass through a part prediction network (Predictor Network), which consists of a fully connected layer and a Sigmoid function and predicts the score of each body part-local feature among the initial part-local features, and the body part-local features in the initial part-local features are replaced according to the predicted scores. Since the generated initial part-local features inevitably contain redundancy, the body part-local features corresponding to the mixed modality are progressively replaced with the part-local features detected in the infrared and visible light modalities through the Swapping module in Fig. 5: if a body part-local feature score of the mixed modality is smaller than the corresponding scores of the other two modalities, the body part-local feature of the mixed modality is replaced with the body part-local feature having the higher score from the other two modalities, so that the overlapping and redundant body part-local features of the mixed modality are replaced and the part-local features of the mixed modality become purer. It should be noted that the initial part-local features corresponding to the infrared and visible light branches are identical to the final part-local features P_I, P_V; only the initial part-local features corresponding to the mixed-modality branch are changed.
Specifically, the Component Detector passes the features output by the first and second stages, as well as the global image features (G_V, G_H, G_I) of the visible-light, mixed, and infrared modalities output by the last Swin Transformer block of the third stage, each through a 1×1 convolutional neural network (CNN) and a Sigmoid function to obtain attention masks at different levels. The attention masks are resized to the same size and multiplied pixel by pixel to obtain the final Mask, which is then multiplied with the three modality features (G_V, G_H, G_I) output by the last Swin Transformer block of the third stage, and the results are finally concatenated along the channel dimension to obtain the initial part-level features.
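A compact PyTorch-style sketch of the mask construction, part scoring, and score-based swapping described above is given below. The number of parts, tensor shapes, the use of bilinear interpolation when resizing the masks, and the flattening of masked features are illustrative assumptions; the detector would be applied once per modality (visible, mixed, infrared) to obtain the three sets of initial part-level features.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ComponentDetector(nn.Module):
        # Builds per-level attention masks with a 1x1 convolution and Sigmoid, fuses them by
        # pixel-wise multiplication, and applies the fused mask to the stage-3 global features.
        def __init__(self, channels_per_level, num_parts=6):
            super().__init__()
            self.mask_heads = nn.ModuleList(
                [nn.Conv2d(c, num_parts, kernel_size=1) for c in channels_per_level]
            )

        def forward(self, level_feats, global_feat):
            # level_feats: list of (B, C_l, H_l, W_l); global_feat: (B, C, H, W)
            H, W = global_feat.shape[-2:]
            fused = None
            for head, feat in zip(self.mask_heads, level_feats):
                m = torch.sigmoid(head(feat))                                  # (B, P, H_l, W_l)
                m = F.interpolate(m, size=(H, W), mode="bilinear", align_corners=False)
                fused = m if fused is None else fused * m                      # pixel-wise product
            parts = [(global_feat * fused[:, p:p + 1]).flatten(1)              # mask each part
                     for p in range(fused.shape[1])]
            return torch.stack(parts, dim=1)                                   # (B, P, C*H*W)

    class PartPredictor(nn.Module):
        # One fully connected layer followed by Sigmoid: a score per body-part feature.
        def __init__(self, dim):
            super().__init__()
            self.fc = nn.Linear(dim, 1)

        def forward(self, parts):                                              # parts: (B, P, dim)
            return torch.sigmoid(self.fc(parts)).squeeze(-1)                   # (B, P)

    def swap_redundant_parts(p_h, p_v, p_i, s_h, s_v, s_i):
        # Replace a mixed-modality part feature whenever its score is lower than both
        # single-modality scores, using the higher-scoring of the two as the substitute.
        donor = torch.where((s_v >= s_i).unsqueeze(-1), p_v, p_i)              # (B, P, dim)
        replace = ((s_h < s_v) & (s_h < s_i)).unsqueeze(-1)                    # (B, P, 1)
        return torch.where(replace, donor, p_h)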
A loss on the body-part features is constructed in the total loss function; the formula is as follows:
where N denotes the total number of images in the training batch, N_c denotes the number of body-part features, N_id is the total number of pedestrian identity classes, y_{i,k} denotes the true class score of the i-th image belonging to the k-th pedestrian, and z_{i,j,k} denotes the predicted class score of the j-th body-part feature of the i-th image belonging to the k-th pedestrian.
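The display equation for this loss is likewise not reproduced in this text. A plausible reading consistent with the symbol definitions above, namely a cross-entropy averaged over the images and their body-part features, is given here only as an illustrative reconstruction:

    % Illustrative reconstruction only; not the patent's exact equation.
    L_{\text{part}} \;=\; -\frac{1}{N\,N_c}\sum_{i=1}^{N}\sum_{j=1}^{N_c}\sum_{k=1}^{N_{id}} y_{i,k}\,\log z_{i,j,k}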
The experimental analysis is as follows:
1. Implementation details
The present embodiment implements the constructed deep learning model on one NVIDIA GTX 4090D GPU based on the PyTorch framework. The backbone network uses Swin Transformer V2 pre-trained on ImageNet-22K, with the last classification layer of Swin Transformer V2 removed so that it directly outputs a 1024-dimensional feature. During testing, the present embodiment uses cosine similarity to measure the distance between pedestrian features in the query set and in the gallery set.
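For the cosine-similarity comparison between query and gallery features, a minimal sketch (assuming the 1024-dimensional features have already been extracted as row vectors) might look like this:

    import torch
    import torch.nn.functional as F

    def rank_gallery(query_feats, gallery_feats):
        # query_feats: (Q, D), gallery_feats: (G, D); returns, for every query, the gallery
        # indices sorted from most to least similar under cosine similarity.
        q = F.normalize(query_feats, dim=1)
        g = F.normalize(gallery_feats, dim=1)
        sim = q @ g.t()                                  # (Q, G) cosine-similarity matrix
        return sim.argsort(dim=1, descending=True)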
2. Sampling strategy
During training, the present embodiment randomly samples 4 pedestrians per batch and extracts 4 visible-light images and 4 infrared images for each pedestrian to form a mini-batch of training samples. Before being fed into the model for training, all images are resized to 288×144×3 and data augmentation is applied: the visible-light images undergo random cropping, random horizontal flipping, and random erasing, plus one augmentation chosen at random from channel random erasing, random channel exchange, color jitter, and spectrum jitter; the infrared images undergo only random cropping, random horizontal flipping, and random erasing. Finally, the images are normalized with a mean of [0.485, 0.456, 0.406] and a standard deviation of [0.229, 0.224, 0.225], which reduces the risk of overfitting after training.
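The augmentation pipeline above can be sketched with standard torchvision transforms as follows. The padding amount before random cropping and the erasing probability are assumptions of this sketch, and the channel-level augmentations (channel random erasing, random channel exchange, color jitter, spectrum jitter) are custom operations in the original that are only stubbed here as user-supplied callables.

    import random
    import torchvision.transforms as T

    MEAN, STD = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]

    def visible_transform(channel_level_augs):
        # channel_level_augs: list of callables implementing channel random erasing, random
        # channel exchange, color jitter and spectrum jitter (custom ops, assumed to exist).
        return T.Compose([
            T.Resize((288, 144)),
            T.Pad(10),
            T.RandomCrop((288, 144)),
            T.RandomHorizontalFlip(),
            T.Lambda(lambda img: random.choice(channel_level_augs)(img)),  # pick one at random
            T.ToTensor(),
            T.Normalize(MEAN, STD),
            T.RandomErasing(p=0.5),
        ])

    def infrared_transform():
        return T.Compose([
            T.Resize((288, 144)),
            T.Pad(10),
            T.RandomCrop((288, 144)),
            T.RandomHorizontalFlip(),
            T.ToTensor(),
            T.Normalize(MEAN, STD),
            T.RandomErasing(p=0.5),
        ])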
During testing, the present embodiment uses a batch size of 64 to extract the features of the query set and the gallery set, respectively. The visible-light and infrared images are resized to 288×144×3 before being fed into the model, and are then normalized with a mean of [0.485, 0.456, 0.406] and a standard deviation of [0.229, 0.224, 0.225].
3. Training parameter settings
The model is trained for a total of 40 epochs. Before training begins, all parameters of the network model are randomly initialized with Kaiming initialization. During training, an SGD optimizer is used to update the parameters, with a momentum of 0.9 and a weight-decay factor of 0.001, and an initial learning rate is set for gradient descent. A learning-rate warmup is adopted, with 5 warmup iterations and a warmup factor of 0.01, and the warmup learning rate increases linearly. Meanwhile, following a cosine decay schedule, the learning rate finally decays to 0.002 times the initial learning rate.
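A sketch of the optimizer and learning-rate schedule described above follows. The base learning rate is passed in as base_lr because its value is not reproduced in this text, and whether the warmup unit is an epoch or an iteration follows whichever granularity the scheduler is stepped at; both points are assumptions of this sketch.

    import math
    import torch

    def build_optimizer_and_scheduler(model, base_lr, total_steps=40,
                                      warmup_steps=5, warmup_factor=0.01, final_ratio=0.002):
        # SGD with momentum 0.9 and weight decay 0.001, linear warmup over `warmup_steps`
        # scheduler steps starting from warmup_factor * base_lr, then cosine decay down
        # to final_ratio * base_lr.
        optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                    momentum=0.9, weight_decay=1e-3)

        def lr_lambda(step):
            if step < warmup_steps:                                   # linear warmup
                alpha = step / warmup_steps
                return warmup_factor * (1 - alpha) + alpha
            progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
            cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
            return final_ratio + (1.0 - final_ratio) * cosine         # decays to final_ratio

        scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
        return optimizer, scheduler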
4. Comparison of Performance with common models
The deep learning model constructed in this embodiment is compared with current state-of-the-art models on the SYSU-MM01 dataset. No post-processing techniques such as re-ranking or fusion ranking are used during pedestrian re-identification, and Table 1 readily shows that the deep learning model constructed in this embodiment performs well.
Table 1 model comparison table
Example two
This embodiment provides a cross-modal pedestrian re-identification system, which comprises:
a selection module, used for randomly selecting a number of pedestrians from the infrared and visible light dataset in each training batch, wherein each pedestrian comprises 4 visible-light images and 4 infrared images;
a preprocessing module, used for preprocessing the visible-light images and infrared images;
a feature extraction module, used for inputting the preprocessed visible-light images and infrared images into a deep learning model and extracting visible-light features, infrared features, and mixed-modality features through the deep learning model;
a construction module, used for constructing the total loss and judging from it whether training of the deep learning model is finished: if the total loss is below a preset threshold, training ends; if the total loss is above the preset threshold, training continues until the total loss falls below the preset threshold, wherein
the total loss includes a local maximum average difference loss constructed based on the visible-light features and infrared features, an inter-modal divergence loss constructed based on the visible-light features and infrared features, a circle loss constructed based on the visible-light features, infrared features, and mixed-modality features, and a cross-entropy loss constructed based on the visible-light features, infrared features, and mixed-modality features;
and an identification module, used for identifying pedestrians from the visible-light images or infrared images to be identified with the trained deep learning model.
Example III
The present embodiment provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the steps of the cross-modal pedestrian re-identification method of embodiment one are implemented when the processor executes the computer program.
Example IV
The present embodiment provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the cross-modality pedestrian re-identification method of embodiment one.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The scheme in the embodiments of the present application can be realized in various computer languages, such as the object-oriented programming language Java and the scripting language JavaScript.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It is apparent that the above examples are given by way of illustration only and are not limiting of the embodiments. Other variations and modifications of the present invention will be apparent to those of ordinary skill in the art in light of the foregoing description. It is not necessary here nor is it exhaustive of all embodiments. And obvious variations or modifications thereof are contemplated as falling within the scope of the present invention.

Claims (10)

1. A cross-modal pedestrian re-identification method, characterized by comprising:
step S1: in each training batch, randomly selecting a number of pedestrians from an infrared and visible light dataset, each pedestrian comprising 4 visible-light images and 4 infrared images;
step S2: preprocessing the visible-light images and the infrared images;
step S3: inputting the preprocessed visible-light images and infrared images into a deep learning model, and extracting visible-light features, infrared features, and mixed-modality features through the deep learning model;
step S4: constructing a total loss and judging from the total loss whether training of the deep learning model is finished; if the total loss is below a preset threshold, training ends; if the total loss is above the preset threshold, training continues until the total loss falls below the preset threshold, wherein
the total loss includes a local maximum average difference loss constructed based on the visible-light features and infrared features, an inter-modal divergence loss constructed based on the visible-light features and infrared features, a circle loss constructed based on the visible-light features, infrared features, and mixed-modality features, and a cross-entropy loss constructed based on the visible-light features, infrared features, and mixed-modality features;
step S5: using the trained deep learning model to identify pedestrians from the visible-light images or infrared images to be identified.

2. The cross-modal pedestrian re-identification method according to claim 1, characterized in that the formula of the local maximum average difference loss constructed based on the visible-light features and infrared features in step S4 is:
wherein H denotes the reproducing kernel Hilbert space, φ(·) denotes the mapping into the reproducing kernel Hilbert space, N_id denotes the total number of pedestrian identity classes, N_V denotes the number of visible-light images in the training batch, x_i^V denotes the feature of the i-th visible-light image with a corresponding probability weight of belonging to class k, N_I denotes the number of infrared images in the training batch, x_j^I denotes the feature of the j-th infrared image with a corresponding probability weight of belonging to class k, and ||a-b||^2 denotes the squared distance between a and b.

3. The cross-modal pedestrian re-identification method according to claim 1, characterized in that the formula of the inter-modal divergence loss constructed based on the visible-light features and infrared features in step S4 is:
wherein N_V and N_I denote the numbers of visible-light images and infrared images in the training batch, respectively, C_V and C_I denote the visible-light image classifier and the infrared image classifier, respectively, x_j^I denotes the feature of the j-th infrared image, and x_i^V denotes the feature of the i-th visible-light image.

4. The cross-modal pedestrian re-identification method according to claim 1, characterized in that the formula of the circle loss constructed based on the visible-light features, infrared features, and mixed-modality features in step S4 is:
wherein s_p = w_y f_m / (||w_y|| ||f_m||), m ∈ {V, I, H}; s_p and s_n denote the intra-class similarity score and the inter-class similarity score, respectively; w_j and w_y denote the classification weight of a non-target class and the classification weight of the target class, respectively, and their transposes are used in the formula; f_m denotes an image feature belonging to modality m, where m is the modality to which the image feature belongs and V, I, H denote the visible-light, infrared, and mixed modalities, respectively; ||·|| denotes the L2 norm; the first and second weight coefficients of the circle loss appear in the formula; and Δ_n, Δ_p, and γ are the first, second, and third hyperparameters, respectively.

5. The cross-modal pedestrian re-identification method according to claim 1, characterized in that the cross-entropy loss constructed based on the visible-light features, infrared features, and mixed-modality features in step S4 is specifically obtained by passing the visible-light features, infrared features, and mixed-modality features through a classification layer to obtain their respective class scores and constructing the cross-entropy loss from these class scores, with the formula:
wherein N_V, N_I, and N_H denote the numbers of visible-light images, infrared images, and mixed-modality images in the training batch, respectively; N_id denotes the total number of pedestrian identity classes; y_{i,k} denotes the true class score of the i-th image belonging to the k-th pedestrian; the predicted class score of the i-th image belonging to the k-th pedestrian also appears in the formula; m is the modality to which an image feature belongs; and V, I, H denote the visible-light, infrared, and mixed modalities, respectively.

6. The cross-modal pedestrian re-identification method according to claim 1, characterized in that the deep learning model in step S3 comprises a first stage, a mixed-modality generation module, a second stage, a third stage, and a fourth stage connected in sequence, each stage comprising at least one Swin Transformer block that appears in pairs, each paired Swin Transformer block comprising a first sub-block and a second sub-block connected in sequence, wherein
the first sub-block comprises a first normalization layer LN, a window-based multi-head self-attention layer W-MSA, a second normalization layer LN, and a first multi-layer perceptron MLP connected in sequence; the input of the first normalization layer LN and the output of the window-based multi-head self-attention layer W-MSA are added element-wise as the input of the second normalization layer LN, and the input of the second normalization layer LN and the output of the first multi-layer perceptron MLP are added element-wise as the input of the second sub-block;
the second sub-block comprises a third normalization layer LN, a shifted-window-based multi-head self-attention layer SW-MSA, a fourth normalization layer LN, and a second multi-layer perceptron MLP connected in sequence; the input of the third normalization layer LN and the output of the shifted-window-based multi-head self-attention layer SW-MSA are added element-wise as the input of the fourth normalization layer LN, and the input of the fourth normalization layer LN and the output of the second multi-layer perceptron MLP are added element-wise as the output of the paired Swin Transformer block;
a Linear Embedding module is arranged before the first Swin Transformer block of the first stage and is used to convert the image into the patches required by the Swin Transformer block operations; a Patch Merging module is arranged before the first Swin Transformer block of each of the second, third, and fourth stages and is used to reduce the number of patches while increasing their dimensionality, so as to form a hierarchical representation of features.

7. The cross-modal pedestrian re-identification method according to claim 6, characterized in that obtaining the mixed-modality image features from the visible-light image features and infrared image features output by the first stage through the mixed-modality generation module comprises:
letting the feature of each visible-light image output by the first stage be x_V and the feature of each infrared image be x_I;
passing x_V through first dilated convolutions with dilation rates of 1, 2, and 4, adding and averaging the results, and then passing them through a fully connected layer and a ReLU activation function to obtain the generated features of the visible-light part;
passing x_I through second dilated convolutions with dilation rates of 1, 2, and 4 and then through a fully connected layer and a ReLU activation function to obtain the generated features of the infrared part;
in order to extract scene-level context information common to the infrared and visible-light images and to generate background features, passing x_V and x_I through the same adaptive average pooling and 1×1 convolution and taking their pixel-wise dot product to obtain common background features, resizing these through bilinear interpolation to the same size as the image features output by the first stage, and applying a ReLU activation function to obtain global background generated features;
finally adding and averaging the generated features of the visible-light part, the generated features of the infrared part, and the global background generated features to obtain the final mixed-modality image features;
in order to keep the final mixed-modality features consistent, a modality consistency loss is introduced into the total loss, with the formula:
wherein l2(·) denotes L2 regularization, f_t^H denotes the t-th generated mixed-modality image, f_s^H denotes the s-th generated mixed-modality image, ||·||_2 denotes the Euclidean distance, and n is the number of mixed-modality images of the same pedestrian.

8. The cross-modal pedestrian re-identification method according to claim 7, characterized in that the mixed-modality generation module further performs, through adversarial learning, feature alignment between the visible-light image features x_V output by the first stage and the mixed-modality image features generated by the mixed-modality generation module, and between the infrared image features x_I output by the first stage and the mixed-modality image features generated by the mixed-modality generation module, comprising:
reducing, through a generator and a discriminator, the modality differences between the visible-light image features x_V and the mixed-modality image features generated by the mixed-modality generation module, and between the infrared image features x_I and the mixed-modality image features generated by the mixed-modality generation module, wherein
the generator consists of three cascaded fully connected layers with a ReLU activation function between the fully connected layers to increase the nonlinearity of the generator; after the visible-light image features x_V and the infrared image features x_I output by the first stage are combined with compensation embedding vectors having the same length and width as the original input image and a dimension of 1, forged mixed-modality features are obtained through the respective generators;
the discriminator is a PatchGAN discriminator; the mixed-modality features forged from the visible-light image features x_V or the infrared image features x_I and the mixed-modality image features generated by the mixed-modality generation module are input into the discriminator, an adversarial loss on the discriminator is constructed in the total loss function, and the adversarial loss is used to judge whether the forged mixed-modality features are the same as the mixed-modality image features generated by the mixed-modality generation module, so as to achieve feature alignment; the adversarial loss includes a discriminator loss, a generator loss, and a joint adversarial loss, with the formula:
L_GAN = L_D + L_G;
wherein L_D is the discriminator loss, L_G is the generator loss, L_GAN is the joint adversarial loss, N_V is the number of visible-light images in the training batch, N_I is the number of infrared images in the training batch, N_H is the number of mixed-modality images in the training batch, D(·) is the discriminator, x_i^V is the i-th visible-light image, x_j^I is the j-th infrared image, x_k^H is the k-th mixed-modality image, y_V is the label of the visible-light image, y_I is the label of the infrared image, y_H is the label of the mixed modality, G_V-H is the visible-mixed modality generator, and G_I-H is the infrared-mixed modality generator.

9. The cross-modal pedestrian re-identification method according to claim 7, characterized in that a component detection and swap module is further arranged after the last Swin Transformer block of the third stage, the component detection and swap module comprising:
sending the global image features G_V, G_H, G_I corresponding to the visible-light, mixed, and infrared modalities output by the last Swin Transformer block of the third stage, together with the features output by the first and second stages, to a body part detector (Component Detector) to obtain initial part-level features composed of several body-part features; passing the initial part-level features through a part prediction network (Predictor Network) consisting of a fully connected layer and a Sigmoid function and used to predict a score for each body-part feature in the initial part-level features; and replacing body-part features in the initial part-level features according to the predicted scores: if the score of a body-part feature of the mixed modality is smaller than the scores of the body-part features of the other two modalities, the highest-scoring body-part feature of the other two modalities replaces the mixed-modality body-part feature, so that overlapping and redundant body-part features in the mixed modality are replaced, yielding the part-level features P_V, P_H, P_I, wherein
the Component Detector passes the features output by the first and second stages and the features output by the last Swin Transformer block of the third stage each through a 1×1 convolutional neural network and a Sigmoid function to obtain attention masks at different levels, resizes the attention masks to the same size and multiplies them pixel by pixel to obtain a final mask, multiplies the final mask with the global image features G_V, G_H, G_I corresponding to the visible-light, mixed, and infrared modalities output by the last Swin Transformer block of the third stage, respectively, and finally concatenates the results along the channel dimension to obtain the initial part-level features;
a loss on the body-part features is constructed in the total loss function, with the formula:
wherein N denotes the total number of images in the training batch, N_c denotes the number of body-part features, N_id is the total number of pedestrian identity classes, y_{i,k} denotes the true class score of the i-th image belonging to the k-th pedestrian, and z_{i,j,k} denotes the predicted class score of the j-th body-part feature of the i-th image belonging to the k-th pedestrian.

10. A cross-modal pedestrian re-identification system, characterized by comprising:
a selection module, used for randomly selecting a number of pedestrians from an infrared and visible light dataset in each training batch, each pedestrian comprising 4 visible-light images and 4 infrared images;
a preprocessing module, used for preprocessing the visible-light images and infrared images;
a feature extraction module, used for inputting the preprocessed visible-light images and infrared images into a deep learning model and extracting visible-light features, infrared features, and mixed-modality features through the deep learning model;
a construction module, used for constructing a total loss and judging from the total loss whether training of the deep learning model is finished; if the total loss is below a preset threshold, training ends; if the total loss is above the preset threshold, training continues until the total loss falls below the preset threshold, wherein
the total loss includes a local maximum average difference loss constructed based on the visible-light features and infrared features, an inter-modal divergence loss constructed based on the visible-light features and infrared features, a circle loss constructed based on the visible-light features, infrared features, and mixed-modality features, and a cross-entropy loss constructed based on the visible-light features, infrared features, and mixed-modality features; and
an identification module, used for identifying pedestrians from the visible-light images or infrared images to be identified with the trained deep learning model.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411690034.8A CN119600653B (en) 2024-11-25 2024-11-25 A cross-modal person re-identification method and system


Publications (2)

Publication Number Publication Date
CN119600653A true CN119600653A (en) 2025-03-11
CN119600653B CN119600653B (en) 2025-10-14

Family

ID=94843393


Country Status (1)

Country Link
CN (1) CN119600653B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120071051A (en) * 2025-04-28 2025-05-30 北京航空航天大学杭州创新研究院 Method and device for generating countermeasure sample for intelligent driving perception test

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113989851A (en) * 2021-11-10 2022-01-28 合肥工业大学 A Cross-modal Person Re-identification Method Based on Heterogeneous Fusion Graph Convolutional Networks
CN114220124A (en) * 2021-12-16 2022-03-22 华南农业大学 Near-infrared-visible light cross-modal double-flow pedestrian re-identification method and system
CN114495010A (en) * 2022-02-14 2022-05-13 广东工业大学 A cross-modal pedestrian re-identification method and system based on multi-feature learning
CN115359554A (en) * 2022-08-08 2022-11-18 天津师范大学 A Cross-modal Pedestrian Retrieval Method for Autonomous Driving
US20230162023A1 (en) * 2021-11-25 2023-05-25 Mitsubishi Electric Research Laboratories, Inc. System and Method for Automated Transfer Learning with Domain Disentanglement
CN116958584A (en) * 2023-09-21 2023-10-27 腾讯科技(深圳)有限公司 Key point detection method, regression model training method and device and electronic equipment
US11804060B1 (en) * 2021-07-26 2023-10-31 Amazon Technologies, Inc. System for multi-modal anomaly detection
WO2023222643A1 (en) * 2022-05-17 2023-11-23 Continental Automotive Technologies GmbH Method for image segmentation matching
WO2024021394A1 (en) * 2022-07-29 2024-02-01 南京邮电大学 Person re-identification method and apparatus for fusing global features with ladder-shaped local features
CN117975556A (en) * 2024-01-15 2024-05-03 东北大学佛山研究生创新学院 Cross-modal pedestrian re-recognition method based on three-modal collaborative learning
CN118038499A (en) * 2024-04-12 2024-05-14 北京航空航天大学 Cross-mode pedestrian re-identification method based on mode conversion
US20240386653A1 (en) * 2023-05-17 2024-11-21 Salesforce, Inc. Systems and methods for reconstructing a three-dimensional object from an image


Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Wenbo Yu et al.: "Crossmodal Sequential Interaction Network for Hyperspectral and LiDAR Data Joint Classification", IEEE Geoscience and Remote Sensing Letters, vol. 21, 13 February 2024 (2024-02-13), pages 1-5 *
Yukang Zhang et al.: "Adaptive Middle Modality Alignment Learning for Visible-Infrared Person Re-identification", International Journal of Computer Vision, vol. 133, 9 November 2024 (2024-11-09), page 2176, XP038124276, DOI: 10.1007/s11263-024-02276-4 *
Feng Min; Zhang Zhicheng; Lyu Jin; Yu Lei; Han Bin: "Research on Cross-modal Person Re-identification Based on Generative Adversarial Networks", Modern Information Technology, vol. 4, no. 04, 25 February 2020 (2020-02-25), pages 107-109 *
Zhang Dian; Wang Haitao; Jiang Ying; Chen Xing: "Heterogeneous Face Recognition Based on a Lightweight Network Fusing Near-Infrared and Visible Light", Journal of Chinese Computer Systems, vol. 41, no. 04, 9 April 2020 (2020-04-09), pages 807-811 *
Fan Huijie et al.: "A Survey of Visible-Infrared Cross-modal Person Re-identification Methods", Information and Control, vol. 54, no. 01, 27 September 2024 (2024-09-27), pages 50-65 *
Chen Dan; Li Yongzhong; Yu Peize; Shao Changbin: "Research and Prospects of Cross-modal Person Re-identification", Computer Systems & Applications, vol. 29, no. 10, 13 October 2020 (2020-10-13), pages 20-28 *


Also Published As

Publication number Publication date
CN119600653B (en) 2025-10-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant