WO2023068821A1 - Multi-object tracking device and method based on self-supervised learning - Google Patents

Info

Publication number
WO2023068821A1
Authority
WO
WIPO (PCT)
Prior art keywords
self
tracking
supervised learning
object tracking
branch
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/KR2022/015985
Other languages
English (en)
Korean (ko)
Inventor
고병철
김상원
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industry Academic Cooperation Foundation of Keimyung University
Original Assignee
Industry Academic Cooperation Foundation of Keimyung University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industry Academic Cooperation Foundation of Keimyung University
Publication of WO2023068821A1

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/187 Segmentation; Edge detection involving region growing; involving region merging; involving connected component labelling
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30232 Surveillance

Definitions

  • The present invention relates to a multi-object tracking apparatus and method, and more particularly to a multi-object tracking apparatus and method based on self-supervised learning.
  • Object tracking can be mainly divided into single-object tracking (SOT) and multi-object tracking (MOT).
  • MOT, which tracks multiple objects simultaneously, is receiving more attention than SOT, which tracks only one object.
  • The tracking-by-detection paradigm, the most common approach in MOT, depends largely on two factors.
  • The first is object detection performance. Objects must be accurately detected in every frame to prevent trajectories from being broken or wrongly linked during subsequent tracking. Recently, various high-performance CNN-based object detectors have been introduced, and the deterioration in tracking performance caused by false detections has been resolved to some extent. However, object detectors may still detect false objects or miss objects because of occlusion or camera shake.
  • The second is data association performance. To compensate for MOT inaccuracy caused by false detections, data association links previously acquired trajectories with new detections. In online MOT, real-time data association is emphasized more than in offline MOT.
  • The key problem of real-time data association is to determine the optimal association between detection objects and tracking objects.
  • The most representative data association methods are bipartite assignment and Hungarian-based approaches. These methods model the weights between a set of existing trajectory nodes and a set of new detection nodes as an affinity matrix, and matches between the nodes are determined according to the weights of the affinity matrix. Dynamic programming can be used to find the shortest path between detections and tracklets and match them. Min-cost flow and conditional random field methods treat data association as a graph problem: detections or tracklets are represented as nodes in a graph, a flow model or label predicts the edge strength, and nodes connected by strong edges are linked.
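  • As a minimal sketch of bipartite assignment over an affinity matrix, the example below pairs trajectories (rows) with detections (columns) so that the summed affinity is maximal. It uses an exhaustive search in place of the Hungarian algorithm, which is only practical for tiny matrices; the function name `best_assignment` and the sample values are hypothetical and not taken from the source.

```python
from itertools import permutations

def best_assignment(affinity):
    """Return (total, pairs) maximizing the summed affinity.

    Rows are existing trajectories, columns are new detections.
    Exhaustive search (illustrative stand-in for the Hungarian algorithm);
    assumes the number of rows does not exceed the number of columns.
    """
    n_tracks = len(affinity)
    n_dets = len(affinity[0])
    best_total, best_pairs = float("-inf"), []
    # Try every way of giving each trajectory a distinct detection.
    for cols in permutations(range(n_dets), n_tracks):
        total = sum(affinity[row][col] for row, col in enumerate(cols))
        if total > best_total:
            best_total, best_pairs = total, list(enumerate(cols))
    return best_total, best_pairs

# Hypothetical affinities: 2 trajectories, 3 detections.
affinity = [
    [0.9, 0.1, 0.2],
    [0.2, 0.8, 0.3],
]
total, pairs = best_assignment(affinity)
```

  • In practice a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would usually replace the exhaustive search.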
  • Siamese networks and modified triplet and quadruplet networks apply the same network to detection and tracking objects and calculate similarity based on differences in the output features.
  • Siamese network-based data association can encode reliable pairwise interactions between objects, but it is not suitable when considering high-dimensional information where multiple objects and states exist in a scene.
  • Self-supervised learning (SSL) is a technique that finds and analyzes rules by itself from a minimal amount of data, and addresses a limitation of existing supervised learning, in which labels had to be attached to the training data one by one.
  • However, self-supervised learning has not yet been applied to online multi-object tracking.
  • As related prior art, Patent Publication No. 10-2021-0111417 (Title of Invention: Robust Multi-Object Detection Apparatus and Method Using Siamese Network, Publication Date: September 13, 2021) has been disclosed.
  • The present invention is proposed to solve the above problems of previously proposed methods and is the first attempt to apply self-supervised learning to online multi-object tracking. It applies PN-GAN (pose normalized GAN) and image augmentation; through PN-GAN, it addresses the problem of insufficient training data, so that an MOT model can be designed that obtains better results with a small amount of data. It also proposes a new loss function, ACD (Affinity Correlation Distance), so that more discriminative feature vectors can be generated when training the encoder network. The model can be trained end to end and requires no additional fine-tuning for classification, clustering, or regression. The purpose is therefore to provide a multi-object tracking device and method based on self-supervised learning that can be applied immediately without additional transfer learning, occupies less memory than other state-of-the-art methods, and provides fast, high tracking performance.
  • The multi-object tracking device based on self-supervised learning according to the features of the present invention for achieving the above object includes:
  • a data augmentation unit that generates pose changes from an input image using a pre-trained pose normalized generative adversarial network (PN-GAN) to obtain a plurality of pose images, and generates augmented images by applying image augmentation to the input image and the pose images;
  • a self-supervised learning unit that trains an encoder network with a contrastive structure for multi-object tracking (MOT) by a self-supervised learning (SSL) method using positive pairs of the augmented images; and
  • an object tracking unit that estimates the similarity between a tracking object and a detection object using the encoder network trained in the self-supervised learning unit and performs object tracking.
  • Preferably, the self-supervised learning unit constitutes an SSL-MOT that includes two encoder networks with a contrastive structure, and the SSL-MOT may be composed of a first branch and a second branch that train the respective encoder networks.
  • The encoder network of the first branch and the encoder network of the second branch may share weights.
  • More preferably, the self-supervised learning unit inputs the two augmented images to the first branch and the second branch, respectively, to train the SSL-MOT. The SSL-MOT is trained using a loss function based on the Affinity Correlation Distance (ACD), which calculates the similarity between the two augmented images by combining the cross-correlation computed from the output batch matrices of the two augmented image pairs with an invariance term and a redundancy reduction term.
  • The ACD can be calculated by obtaining an affinity correlation matrix (A) through the dot product of the two batch matrices generated in the first branch and the second branch, respectively, and constructing the invariance term and the redundancy reduction term from the affinity correlation matrix.
  • More preferably, the self-supervised learning unit applies a stochastic gradient descent (SGD) optimizer to backpropagate the average loss during training, but may use a stop-gradient so that the parameters of the encoder network of the second branch are not updated.
  • Preferably, the object tracking unit may set final detection-tracking object pairs by estimating an affinity correlation matrix (A) based on pairwise similarities between a plurality of tracking feature vectors and detection feature vectors.
  • More preferably, the object tracking unit may update the state of the tracking object by combining it with the detection object of the final detection-tracking object pair through data association and Hungarian matching.
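  • The multi-object tracking method based on self-supervised learning according to the features of the present invention for achieving the above object includes:
  • (1) a data augmentation step of generating pose changes from an input image using a pre-trained pose normalized generative adversarial network (PN-GAN) to obtain a plurality of pose images, and generating augmented images by applying image augmentation to the input image and the pose images;
  • (2) a self-supervised learning step of training an encoder network with a contrastive structure for multi-object tracking (MOT) by a self-supervised learning (SSL) method using positive pairs of the augmented images; and
  • (3) an object tracking step of estimating the similarity between a tracking object and a detection object using the encoder network trained in step (2) and performing object tracking.
  • Preferably, in step (2), an SSL-MOT is constituted that includes two encoder networks with a contrastive structure, and the SSL-MOT may be composed of a first branch and a second branch that train the respective encoder networks.
  • The encoder network of the first branch and the encoder network of the second branch may share weights.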
  • More preferably, in step (2), the two augmented images are input to the first branch and the second branch, respectively, to train the SSL-MOT. The SSL-MOT is trained using a loss function based on the Affinity Correlation Distance (ACD), which calculates the similarity between the two augmented images by combining the cross-correlation computed from the output batch matrices of the two augmented image pairs with an invariance term and a redundancy reduction term.
  • The ACD can be calculated by obtaining an affinity correlation matrix (A) through the dot product of the two batch matrices generated in the first branch and the second branch, respectively, and constructing the invariance term and the redundancy reduction term from the affinity correlation matrix.
  • More preferably, in step (2), a stochastic gradient descent (SGD) optimizer is applied to backpropagate the average loss during training, but a stop-gradient may be used so that the parameters of the encoder network of the second branch are not updated.
  • Preferably, in step (3), a tracking set composed of a plurality of tracking objects and a detection set composed of a plurality of detection objects are configured, and all elements of the tracking set and the detection set are input to the encoder network trained in step (2).
  • Final detection-tracking object pairs may then be set by estimating an affinity correlation matrix (A) based on the pairwise similarity between the resulting tracking feature vectors and detection feature vectors.
  • More preferably, in step (3), the state of the tracking object may be updated by combining it with the detection object of the final detection-tracking object pair through data association and Hungarian matching.
  • The self-supervised learning-based multi-object tracking apparatus and method proposed in the present invention are the first attempt to apply self-supervised learning to online multi-object tracking. They apply PN-GAN (pose normalized GAN) and image augmentation; through PN-GAN, they address the problem of insufficient training data, so that an MOT model can be designed that obtains better results with a small amount of data. They also propose a new loss function, ACD (Affinity Correlation Distance), so that more discriminative feature vectors can be generated when training the encoder network. The model can be trained end to end and requires no additional fine-tuning for classification, clustering, or regression. It can therefore be applied immediately without additional transfer learning, occupies less memory than other state-of-the-art methods, and provides fast, high tracking performance.
  • FIG. 1 is a diagram showing the configuration of a multi-object tracking device based on self-supervised learning according to an embodiment of the present invention.
  • FIG. 2 is a diagram showing a learning process performed in a multi-object tracking device based on self-supervised learning according to an embodiment of the present invention.
  • FIG. 3 is a diagram showing a learning algorithm of a multi-object tracking device based on self-supervised learning according to an embodiment of the present invention.
  • FIG. 4 is a diagram showing a test process performed in a multi-object tracking device based on self-supervised learning according to an embodiment of the present invention.
  • FIG. 5 is a flowchart illustrating a multi-object tracking method based on self-supervised learning according to an embodiment of the present invention.
  • FIG. 6 is a view showing a comparison of MOT performance results according to a data augmentation method in the multi-object tracking apparatus and method based on self-supervised learning according to an embodiment of the present invention.
  • FIG. 7 is a diagram showing performance results according to a similarity metric used in a loss function in the multi-object tracking apparatus and method based on self-supervised learning according to an embodiment of the present invention.
  • FIG. 8 is a visual representation of discriminant power of an output vector of an encoder network according to a loss function in a multi-object tracking apparatus and method based on self-supervised learning according to an embodiment of the present invention.
  • FIG. 10 is a diagram illustrating a comparison of experimental results of a multi-object tracking device and method based on self-supervised learning and the latest methods according to an embodiment of the present invention using MOT 16 dataset.
  • FIG. 11 is a diagram showing a comparison of experimental results of a multi-object tracking device and method based on self-supervised learning and the latest methods according to an embodiment of the present invention using MOT 17 dataset.
  • FIG. 12 is a diagram illustrating a comparison of experimental results of a multi-object tracking device and method based on self-supervised learning according to an embodiment of the present invention and the latest methods using the MOT 20 dataset.
  • The multi-object tracking device 100 based on self-supervised learning according to an embodiment of the present invention may be configured to include a data augmentation unit 110, a self-supervised learning unit 120, and an object tracking unit 130.
  • The data augmentation unit 110 and the self-supervised learning unit 120 handle the training process, and the object tracking unit 130 handles the process of tracking objects when the trained model is applied to a test or to actual online MOT.
  • The multi-object tracking apparatus 100 and method based on self-supervised learning are the first attempt to apply self-supervised learning to online multi-object tracking, and have the following characteristics: (1) PN-GAN (pose normalized GAN) and image augmentation are applied for training that takes the pose changes of objects into account. (2) Through PN-GAN, the problem of insufficient training data is addressed, so that an MOT model can be designed that obtains better results with a small amount of data. (3) By proposing a new loss function, ACD (Affinity Correlation Distance), more discriminative feature vectors can be generated when training the encoder network.
  • SimCLR (Chen et al., 2020)
  • BYOL (Grill et al., 2020)
  • Barlow Twins (Zbontar et al., 2021) fed two identical networks with two distorted samples of an input image and measured the cross-correlation matrix between the two outputs.
  • This method applies an objective function that naturally avoids a collapsed representation by making the cross-correlation matrix as close as possible to the identity matrix. However, it still requires a large batch size to avoid a collapsed representation, and it is not suitable for real-time association because the output dimension is as large as 4096. SimSiam (Chen and He, 2021) directly maximizes the similarity of two views of an image using a symmetric loss, without negative pairs or momentum encoders. It uses stop-gradient operations to avoid representation collapse and works with normal batch sizes rather than relying on large-batch training.
  • SCARF (Bahri et al., 2021) consists of two encoders with shared weights, but both branches have pretrained heads with shared weights.
  • The existing SSL methods described above do not apply strict end-to-end learning; after the encoder is trained, transfer learning must be performed to apply it to clustering, classification, or regression.
  • In the present invention, the basic SSL structure follows the one described by Chen and He (2021) (Chen, X.; and He, K. 2021. Exploring Simple Siamese Representation Learning. In CVPR.), while the internal training procedure was reorganized for MOT so that it is robust to changes in human pose and enables end-to-end training and testing. This is the first application of SSL to online MOT.
  • The data augmentation unit 110 acquires a plurality of pose images by generating pose changes from an input image using a pre-trained pose normalized generative adversarial network (PN-GAN), and can generate augmented images by applying image augmentation to the input image and the pose images.
  • FIG. 2 is a diagram illustrating a learning process performed in the multi-object tracking apparatus 100 based on self-supervised learning according to an embodiment of the present invention.
  • The multi-object tracking apparatus 100 based on self-supervised learning according to an embodiment of the present invention can train the SSL-MOT in the self-supervised learning unit 120 using the augmented images generated by the data augmentation unit 110.
  • Image augmentation techniques used in SSL may include random cropping, resizing, flipping, color jittering, Gaussian blur, and the like.
  • In MOT, however, pose changes occur frequently, and it is difficult to match images accurately with simple image augmentation alone.
  • Therefore, the data augmentation unit 110 of the multi-object tracking apparatus 100 based on self-supervised learning according to an embodiment of the present invention, as shown in “Data Augmentation” on the left side of FIG. 1, applies PN-GAN (Qian et al., 2018) as preprocessing for image augmentation.
  • the data augmentation unit 110 may obtain a generated pose image by supplying the input image x to the PN-GAN, and randomly apply image augmentation to the pose image.
  • Because PN-GAN can generate a new pose only when the whole body is included in the image, PN-GAN is applied only when a pose is detected using OpenPose (Cao et al., 2021) as preprocessing. That is, when a whole-body pose is detected in the input image, the data augmentation unit 110 generates a plurality of pose changes using PN-GAN and applies image augmentation to the generated pose images; when no whole-body pose is detected, only image augmentation is applied.
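  • The branching described above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: `detect_full_body`, `generate_poses`, and `augment` are hypothetical callables standing in for OpenPose, PN-GAN, and the image augmentation pipeline, and drawing both views at random from the candidate pool is an assumed way of forming a positive pair.

```python
import random

def build_positive_pair(image, detect_full_body, generate_poses, augment, rng=None):
    """Return two augmented views of the same identity.

    detect_full_body, generate_poses and augment are hypothetical stand-ins
    for OpenPose, PN-GAN and the image augmentation pipeline.
    """
    rng = rng or random.Random()
    candidates = [image]
    # Pose synthesis is attempted only when a whole-body pose is detected.
    if detect_full_body(image):
        candidates.extend(generate_poses(image))
    # Assumption: each view is an augmented sample drawn from the pool.
    view_a = augment(rng.choice(candidates))
    view_b = augment(rng.choice(candidates))
    return view_a, view_b

# Toy usage with strings in place of images.
no_pose = build_positive_pair(
    "img", lambda im: False, lambda im: ["p1", "p2"], lambda im: im + "_aug")
with_pose = build_positive_pair(
    "img", lambda im: True, lambda im: ["p1", "p2"], lambda im: im + "_aug",
    rng=random.Random(0))
```

  • When no whole-body pose is found, both views fall back to plain augmentations of the input image, which mirrors the fallback described in the text.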
  • The self-supervised learning unit 120 trains an encoder network with a contrastive structure for multi-object tracking (MOT) by a self-supervised learning (SSL) method using positive pairs of augmented images.
  • The SSL-MOT network can use two augmented images to train the encoder.
  • The SSL-MOT network consists of two encoder networks with a contrastive structure, and may consist of a first branch and a second branch that train the respective encoder networks. The encoder network of the first branch and the encoder network of the second branch may share weights.
  • The first branch includes an encoder network that outputs a representation vector from an input augmented image, and a predictor that transforms the representation vector output by the encoder network into a new output vector. That is, the first network of the first branch outputs representation vectors from the augmented input image using the encoder network f(·), which combines a ResNet50 backbone with a projection MLP. The other network, the prediction head h(·), serves to transform one view so that it matches the other.
  • The second branch may include an encoder network that shares weights with the encoder network of the first branch. That is, the second branch may consist of only one encoder network f(·); the projection head uses an MLP and may be used only for training, not for testing.
  • Because SSL-MOT outputs an affinity matrix that indicates the similarity of the two input augmented images, end-to-end learning is necessary.
  • FIG. 3 is a diagram showing a learning algorithm of the multi-object tracking device 100 based on self-supervised learning according to an embodiment of the present invention. That is, FIG. 3 shows a process of learning SSL-MOT using N unlabeled images.
  • Hereinafter, the SSL-MOT training process of the self-supervised learning unit 120 is described in detail with reference to FIGS. 2 and 3.
  • Image augmentation is applied to image x_i as shown in “Data Augmentation” on the left side of FIG. 2; one of the augmented images serves as a tracking object tr_i and the other as a detection object d_i.
  • In the second branch, the encoder can likewise output an M-dimensional vector z_d, which is used without being supplied to the prediction head.
  • The self-supervised learning unit 120 of the multi-object tracking device 100 based on self-supervised learning uses a new loss function so that end-to-end learning can be performed and the model can be applied to classification, clustering, regression, and the like without transfer learning.
  • The self-supervised learning unit 120 trains the SSL-MOT by inputting the two augmented images into the first branch and the second branch, respectively, using a loss function based on the Affinity Correlation Distance (ACD), which computes the similarity between the two augmented images by combining the cross-correlation calculated from the output batch matrices of the two augmented image pairs with an invariance term and a redundancy reduction term.
  • The ACD can be calculated by obtaining an affinity correlation matrix (A) through the dot product of the two batch matrices generated in the first branch and the second branch, respectively, and constructing the invariance term and the redundancy reduction term from it.
  • Loss functions commonly used in SSL include mean square error (MSE) (Grill et al., 2020), L1 loss, cosine similarity (Chen and He, 2021), cross entropy, and the cross-correlation matrix (Zbontar et al., 2021).
  • An affinity correlation matrix A may be obtained through the dot product of the two batch matrices of N samples, to which L2 normalization is applied.
  • The affinity correlation matrix A is used to calculate the ACD as in Equation 2.
  • In Equation 2, the weight trades off the importance of the first and second terms of the loss, and i and j are the indices of batch samples.
  • The first term, the invariance term, attempts to produce a value of 0 for pairs of the same sample, and the second term, the redundancy reduction term, attempts to obtain a value close to 0 for pairs of different samples.
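  • Equation 2 itself is not reproduced in this text, so the following is only a sketch of one loss that matches the description: a Barlow Twins-style combination in which the invariance term penalizes (1 - A_ii)^2 on the diagonal and the redundancy reduction term penalizes A_ij^2 off the diagonal. The exact form, the averaging over the batch, and the weight name `lam` with its default value are assumptions.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def affinity_correlation_matrix(z_tr, z_d):
    """A = Z_tr . Z_d^T for two L2-normalized batch matrices (N x M each)."""
    rows = [l2_normalize(v) for v in z_tr]
    cols = [l2_normalize(v) for v in z_d]
    return [[sum(a * b for a, b in zip(r, c)) for c in cols] for r in rows]

def acd_loss(z_tr, z_d, lam=0.5):
    """Assumed ACD-style loss: invariance + lam * redundancy reduction."""
    A = affinity_correlation_matrix(z_tr, z_d)
    n = len(A)
    # Invariance term: the same sample in both branches should give A_ii near 1.
    invariance = sum((1.0 - A[i][i]) ** 2 for i in range(n))
    # Redundancy reduction term: different samples should give A_ij near 0.
    redundancy = sum(A[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
    return (invariance + lam * redundancy) / n

aligned = acd_loss([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 3.0]])
swapped = acd_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

  • With perfectly aligned, mutually orthogonal features both terms vanish, whereas swapping the two samples in one branch raises both terms.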
  • The self-supervised learning unit 120 applies a stochastic gradient descent (SGD) optimizer to backpropagate the average loss in all training steps, but may use a stop-gradient so that the parameters of the encoder network of the second branch are not updated. That is, only the parameters θ of the encoder and prediction network of the first branch are updated, and the encoder parameters of the second branch are not.
  • Equation 3 may be rewritten as Equation 5.
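  • The effect of a stop-gradient can be illustrated with a scalar toy model. This is a hand-derived sketch, not the update rule of the source (whose equations are not reproduced here): a single shared weight `w` plays the role of the encoder, the loss is a squared difference between the two branch outputs, and the second-branch output is treated as a constant when the gradient is taken. All names and values are hypothetical.

```python
def sgd_step_with_stop_gradient(w, x_first, x_second, lr=0.1):
    """One SGD step on loss = (w * x_first - stopgrad(w * x_second)) ** 2."""
    target = w * x_second            # second-branch output, held constant
    pred = w * x_first               # first-branch output, differentiated
    # Gradient flows through the first branch only: d(pred)/dw = x_first.
    grad = 2.0 * (pred - target) * x_first
    return w - lr * grad

def sgd_step_without_stop_gradient(w, x_first, x_second, lr=0.1):
    """Same loss, but the gradient also flows through the second branch."""
    grad = 2.0 * (w * x_first - w * x_second) * (x_first - x_second)
    return w - lr * grad

w_stop = sgd_step_with_stop_gradient(1.0, 2.0, 1.0)
w_full = sgd_step_without_stop_gradient(1.0, 2.0, 1.0)
```

  • In an automatic differentiation framework the same effect is usually obtained by detaching the second-branch output from the computation graph before computing the loss.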
  • the object tracking unit 130 may estimate the similarity between the tracking object and the detection object using the encoder network learned in the self-supervised learning unit 120 and perform object tracking.
  • The object tracking unit 130 configures a tracking set composed of a plurality of tracking objects and a detection set composed of a plurality of detection objects, inputs all elements of the tracking set and the detection set to the encoder network trained by the self-supervised learning unit 120, and may set final detection-tracking object pairs by estimating an affinity correlation matrix (A) based on the pairwise similarity between the resulting tracking feature vectors and detection feature vectors. More specifically, the object tracking unit 130 may update the state of the tracking object by combining it with the detection object of the final detection-tracking object pair through data association and Hungarian matching.
  • FIG. 4 is a diagram showing a test process performed in the multi-object tracking device 100 based on self-supervised learning according to an embodiment of the present invention.
  • the object tracking unit 130 of the multi-object tracking device 100 based on self-supervised learning according to an embodiment of the present invention will be described in detail with reference to FIG. 4 .
  • the learned SSL-MOT can be applied to online object tracking. More specifically, after the self-supervised learning unit 120 learns the SSL-MOT, an affinity matrix between the tracking object and the detection object during actual tracking may be generated using the learned SSL-MOT.
  • Each element d_g of the detection set D in every frame is input to the trained encoder f, and L2 normalization can be applied to the output to generate a new d-dimensional feature vector z_dg.
  • The set of transformed detection objects can be obtained through the same operation for all elements of the set D.
  • The same encoder can be applied to each element tr_h of the tracking set T to obtain a new feature vector z_trh.
  • The set of transformed tracking objects can also be obtained through the same operation for all elements of the set T.
  • The goal of SSL-MOT is to estimate the H×G affinity correlation matrix A based on the pairwise similarity of the features extracted from the tracking set T and the detection set D in online MOT. To create the affinity correlation matrix A between the tracking objects and the detection objects, the set of tracking feature vectors is arranged as an (H×d) matrix, and the set of detection feature vectors, arranged as a (G×d) matrix, is transposed. Then, as in Equation 6, the dot product between the two matrices is taken.
  • Equation 6 also includes a constant value used to widen the difference between the output values.
  • The element z_hg can have a large value if the detection object d_g^t is related to the tracking object tr_h^(t-1), and a small value if the two objects are not related.
  • If the association score is smaller than the threshold value τ_1, the matching speed may be improved by filtering the association score of the corresponding pair to 0.
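  • The test-time computation described above can be sketched as follows, under stated assumptions: feature vectors are plain Python lists standing in for encoder outputs, the constant of Equation 6 that widens score differences is omitted because its form is not given here, and the threshold name `tau1` and its value are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def association_scores(track_feats, det_feats, tau1=0.3):
    """Return the H x G matrix of dot products between normalized features.

    Scores below tau1 are set to 0 so that unlikely pairs are ignored
    by the subsequent matching step.
    """
    tracks = [l2_normalize(v) for v in track_feats]   # H x d
    dets = [l2_normalize(v) for v in det_feats]       # G x d
    scores = []
    for t in tracks:
        row = []
        for d in dets:
            s = sum(a * b for a, b in zip(t, d))
            row.append(s if s >= tau1 else 0.0)
        scores.append(row)
    return scores

# Hypothetical features: 2 tracking objects, 3 detections, d = 2.
scores = association_scores(
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [1.0, 1.0], [0.0, 3.0]],
    tau1=0.8,
)
```

  • Because every vector is normalized first, each score is a cosine similarity, so a single threshold can be applied uniformly to all pairs.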
  • The association score can be combined with the relative L1 centroid distance Dis(tr_h^(t-1), d_g^t) in the appearance space using a weighted sum, as in Equation 7.
  • The two weights of the weighted sum may each be set to 0.5.
  • The resulting cost function c can be used for Hungarian matching.
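  • Equation 7 is not reproduced in this text, so the sketch below shows only one plausible way to build such a cost matrix. It assumes that the appearance part of the cost is one minus the association score and that the "relative" L1 centroid distance is normalized by the image width and height; the function and parameter names are hypothetical.

```python
def relative_l1_distance(c1, c2, width, height):
    """L1 distance between two centroids, normalized by the image size."""
    return abs(c1[0] - c2[0]) / width + abs(c1[1] - c2[1]) / height

def cost_matrix(scores, track_centroids, det_centroids, width, height,
                w_app=0.5, w_dist=0.5):
    """Weighted sum of appearance cost and relative centroid distance.

    Assumed form: cost = w_app * (1 - score) + w_dist * distance.
    """
    return [
        [
            w_app * (1.0 - scores[h][g])
            + w_dist * relative_l1_distance(tc, dc, width, height)
            for g, dc in enumerate(det_centroids)
        ]
        for h, tc in enumerate(track_centroids)
    ]

# Hypothetical example: 2 tracking objects and 2 detections in a 200 x 200 frame.
cost = cost_matrix(
    scores=[[0.9, 0.1], [0.2, 0.8]],
    track_centroids=[(10, 10), (100, 100)],
    det_centroids=[(12, 11), (98, 103)],
    width=200, height=200,
)
```

  • The resulting matrix is what a Hungarian solver would minimize; in this toy case the diagonal entries are clearly the cheapest pairs.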
  • The tracking objects and the detection objects can generate trajectories by updating new associations through the proposed SSL-MOT-based data association and Hungarian matching.
  • The tunable parameter that controls the state update of the tracking object can generally be set to 0.5.
  • If a detection object is not matched with any existing tracking object, it is assigned as a new potential tracking object. If the potential tracking object is matched τ_2 or more times, it is newly assigned as a true tracking object (T_t ← T_(t-1) ∪ (d_g^t, h+1)); otherwise, it is recognized as a false tracking object and can be removed from the tracking object pool.
  • Conversely, because a tracking object may temporarily disappear due to occlusion, it is not removed immediately; its state is maintained and observed for τ_2 frames.
  • The threshold value τ_2 can be adjusted according to the characteristics of the dataset.
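  • The track lifecycle described above can be sketched as a small state update. Several details are assumptions rather than statements of the source: the `Track` fields, counting the creating detection as the first match, and dropping an unconfirmed track on its first missed frame.

```python
from dataclasses import dataclass

@dataclass
class Track:
    track_id: int
    hits: int = 1            # assumption: creation counts as the first match
    misses: int = 0
    confirmed: bool = False

def update_track_states(tracks, matched_ids, tau2=3):
    """Update counters for one frame and return the tracks that survive."""
    survivors = []
    for track in tracks:
        if track.track_id in matched_ids:
            track.hits += 1
            track.misses = 0
            if track.hits >= tau2:
                track.confirmed = True   # potential track becomes a true track
            survivors.append(track)
        elif track.confirmed:
            track.misses += 1
            if track.misses <= tau2:     # keep through short occlusions
                survivors.append(track)
        # Unmatched potential tracks are treated as false tracks and dropped.
    return survivors

tracks = [Track(1), Track(2)]
tracks = update_track_states(tracks, {1})   # track 2 is dropped
tracks = update_track_states(tracks, {1})   # track 1 reaches tau2 hits
confirmed_after_two = tracks[0].confirmed
for _ in range(3):
    tracks = update_track_states(tracks, set())
alive_after_three_misses = len(tracks)
tracks = update_track_states(tracks, set())
alive_after_four_misses = len(tracks)
```

  • Using the same threshold for confirmation and for the occlusion grace period follows the text, which uses one threshold for both.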
  • The multi-object tracking method based on self-supervised learning according to an embodiment of the present invention may be implemented to include: a data augmentation step (S100) of acquiring a plurality of pose images by generating pose changes from an input image using a pre-trained pose normalized generative adversarial network (PN-GAN), and generating augmented images by applying image augmentation to the input image and the pose images; a self-supervised learning step (S200) of training an encoder network with a contrastive structure for multi-object tracking (MOT) by a self-supervised learning (SSL) method using positive pairs of the augmented images; and an object tracking step (S300) of estimating the similarity between the tracking object and the detection object using the encoder network trained in step S200 and performing object tracking.
  • Each step may be performed in the multi-object tracking device 100 based on self-supervised learning according to an embodiment of the present invention, which may consist of software recorded in hardware including a memory and a processor. For example, it may be stored and implemented in a personal computer, a notebook computer, a server computer, a PDA, a smartphone, a tablet PC, and the like.
  • the SSL-MOT proposed in the present invention was implemented using PyTorch and PyTorch Geometric, and trained and tested on an Intel Core i9-9900K CPU cluster using an Nvidia GeForce RTX 2080 Ti GPU.
  • the network configuration and learning settings of the present invention are as follows.
  • OpenPose consists of multi-level CNNs.
  • the first step predicts the part affinity fields, and the second step predicts the confidence map.
  • the PN-GAN generator is designed based on ResNet-50 and consists of an encoder-decoder network that generates four consecutive standard poses for a given input image.
  • the image augmentation pipeline consists of five image transformations (Gaussian blurring, horizontal flipping, perspective, cutout, and brightness jittering), which are randomly applied with a certain probability.
  • SGD stochastic gradient descent optimizer
  • the prediction network h consists of two fc layers and its output is 2048-d.
  • MOTA multiple object tracking accuracy
  • IDF1 identity F1 score
  • FP number of false positives
  • FN number of false negatives
  • IDsw number of ID switches
  • Hz frames per second
  • a generalized SSL-MOT network optimized for object tracking is learned using PN-GAN and image augmentation. Rather than normal image augmentation, it recognizes a person's pose and creates a new pose that can be changed in the next frame.
  • to evaluate PN-GAN-based data augmentation, comparative experiments were performed on four augmentation cases: (1) normal image augmentation in SSL, (2) PN-GAN without image augmentation, (3) PN-GAN with image augmentation, and (4) OpenPose, PN-GAN, and image augmentation (the proposed method).
  • the proposed method applies image augmentation by selecting, resizing, and horizontally flipping only random patches in consideration of frequent occlusion of pedestrians and camera viewpoints.
  • FIG. 6 is a diagram illustrating a comparison of MOT performance results according to a data augmentation method in the multi-object tracking apparatus 100 and method based on self-supervised learning according to an embodiment of the present invention.
  • SSL-MOT learns by predicting the posture of an object
  • the overall performance of the PN-GAN-based methods ((2), (3), and (4)) improved compared to the basic image augmentation method (1).
  • Method (4) increases MOTA by 0.8% and IDF1 by 3.2% compared to method (3) because learning takes into account not only pose change but also image enhancement.
  • method (2) showed slightly better performance than method (4) in IDF1 and FP
  • method (4) showed more than 3% higher performance in MOTA, an important indicator in MOT. Therefore, in the experiment below, the SSL-MOT model trained by combining PN-GAN and OpenPose and applying method (4) using image augmentation is used.
  • FIG. 7 is a diagram showing performance results according to a similarity metric used in a loss function in the multi-object tracking apparatus 100 and method based on self-supervised learning according to an embodiment of the present invention.
  • MSE showed the lowest performance in the three indicators.
  • ACD showed the best performance using dot product results that can maintain angular and magnitude properties between output vectors.
  • the encoder is trained to reduce intra-class variability and increase inter-class variability, and an encoder output vector has a characteristic of increasing inter-class discrimination using an ACD invariant and redundancy reduction term.
  • FIG. 8 is a diagram showing the discriminant power of an output vector of an encoder network according to a loss function in a self-supervised learning-based multi-object tracking apparatus 100 and method according to an embodiment of the present invention.
  • FIG. 8 shows t-SNE results obtained by inputting 10 person images after encoder learning, for (a) MSE, (b) cosine similarity, and (c) the proposed ACD, respectively.
  • the proposed ACD showed excellent discriminant power in terms of intra-class and inter-class variability compared to MSE and cosine similarity, and maximized the similarity between the same person within the embedded feature space.
  • the proposed SSL-MOT was tested using the MOT17 dataset: four performance metrics were measured and compared for BYOL (Grill et al., 2020), SimSiam (Chen and He, 2021), Barlow Twins (Zbontar et al., 2021), and SSL-MOT, each trained using data augmentation.
  • FIG. 9 is a diagram showing the result of performing MOT according to the SSL approach.
  • the proposed SSL-MOT showed the best performance in all metrics.
  • SimSiam and SSL-MOT have a similar basic network structure, but SSL-MOT can improve tracking object matching performance compared to SimSiam through an ACD-based loss function.
  • SSL-MOT can estimate an accurate affinity score by calculating only the dot product of the output vectors, resulting in shorter data association times and faster rates (Hz) than other methods, even when there are many matches.
  • MOT performance on the MOT16 dataset was measured using four SOTA online MOT approaches: (1) BLSTM_MTP_O (Kim et al., 2021), (2) MFI_TST (Yang et al., 2021), (3) ArTIST-T (Saleh et al., 2021), (4) SiamesRF (Lee et al., 2020).
  • FIG. 10 is a diagram showing a comparison of results of an experiment using the MOT 16 dataset with the multi-object tracking apparatus 100 and method based on self-supervised learning according to an embodiment of the present invention and the latest methods.
  • the proposed method SSL-MOT obtained the best MOTA and FN scores among all methods while maintaining a fast tracking speed.
  • FIG. 11 is a diagram illustrating a comparison between the results of an experiment using the MOT 17 dataset with the multi-object tracking apparatus 100 and method based on self-supervised learning according to an embodiment of the present invention and the latest methods.
  • CenterTrack showed excellent performance for MOT17.
  • because CenterTrack uses open detection, it cannot be directly compared to other methods that only use MOT open detection.
  • the proposed method achieved the best MOTA and FN, similar to the MOT16 results.
  • FIG. 12 is a diagram illustrating a comparison of experimental results of the self-supervised learning-based multi-object tracking apparatus 100 and method according to an embodiment of the present invention and the latest methods using the MOT 20 dataset.
  • TMOH showed the best performance among online MOT methods, but, like CenterTrack, this method uses an open detection method, and it has a very slow processing speed of 0.6 Hz.
  • the proposed SSL-MOT method showed low performance because it did not use detection refinement and excluded MOT20 training data from learning. However, the proposed method showed the fastest tracking speed among online methods, 1.6 Hz.
  • the proposed method exhibits excellent performance in terms of tracking accuracy and speed among SOTA online MOT approaches.
  • according to the self-supervised learning-based multi-object tracking apparatus 100 and method proposed in the present invention, self-supervised learning is applied to online multi-object tracking for the first time; PN-GAN (pose normalized GAN) and image augmentation are applied for learning that considers the object's pose change; and MOT models can be designed that obtain better results with a small amount of data by solving the problem of insufficient training data through PN-GAN.
  • PN-GAN pose normalized GAN
  • ACD Affinity Correlation Distance
  • the present invention may include a computer-readable medium including program instructions for performing operations implemented in various communication terminals.
  • computer-readable media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, flash memory, and the like.
  • Such computer-readable media may include program instructions, data files, data structures, etc. alone or in combination.
  • program instructions recorded on a computer-readable medium may be specially designed and configured to implement the present invention, or may be known and usable to those skilled in computer software.
  • it may include high-level language codes that can be executed by a computer using an interpreter, as well as machine language codes generated by a compiler.
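
The track-state handling described earlier in this description (a new detection starts as a potential tracking object, becomes a true tracking object after a threshold number of matches, and a tracking object that disappears is kept for a threshold number of frames before removal) can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `Track`, `update_track`, `confirm_hits`, and `max_misses` are hypothetical, and dropping an unconfirmed track on its first missed frame is an assumption, since the text does not say exactly when a potential tracking object is declared false.

```python
from dataclasses import dataclass


@dataclass
class Track:
    """State of one tracking object (all field names are hypothetical)."""
    track_id: int
    hits: int = 0            # number of frames in which the track was matched
    misses: int = 0          # consecutive frames without a match
    confirmed: bool = False  # False = potential track, True = true track


def update_track(track: Track, matched: bool,
                 confirm_hits: int, max_misses: int) -> bool:
    """Update a track for the current frame.

    Returns True if the track stays in the pool, False if it should be removed.
    """
    if matched:
        track.hits += 1
        track.misses = 0
        # Promote a potential track once it has matched often enough.
        if track.hits >= confirm_hits:
            track.confirmed = True
        return True

    track.misses += 1
    if not track.confirmed:
        # Assumption: an unconfirmed track that fails to match is treated as false.
        return False
    # A confirmed track is kept through short occlusions.
    return track.misses <= max_misses


if __name__ == "__main__":
    track = Track(track_id=1)
    for frame_matched in (True, True, False, False):
        keep = update_track(track, frame_matched, confirm_hits=2, max_misses=3)
        print(track, "keep" if keep else "remove")
```

Using two separate thresholds keeps confirmation and occlusion tolerance independently tunable; both would be adjusted to the characteristics of the dataset, as the description notes.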

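The description states that SSL-MOT estimates the affinity between tracking objects and detection objects by computing only the dot product of the encoder output vectors. The sketch below illustrates that idea in plain Python. The greedy one-to-one assignment, the function names, and the `min_affinity` cutoff are assumptions made for illustration; the text reproduced here does not specify the assignment procedure, and the value 0.5 in the usage lines is only an example.

```python
from typing import Dict, List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length embedding vectors."""
    return sum(a * b for a, b in zip(u, v))


def associate(track_vecs: List[Sequence[float]],
              det_vecs: List[Sequence[float]],
              min_affinity: float) -> Dict[int, int]:
    """Greedily match tracks to detections by descending dot-product affinity.

    Returns a mapping {track index: detection index}. Pairs whose affinity is
    below `min_affinity` (a hypothetical cutoff) are left unmatched.
    """
    scored = sorted(
        ((dot(t, d), ti, di)
         for ti, t in enumerate(track_vecs)
         for di, d in enumerate(det_vecs)),
        reverse=True,
    )
    matches: Dict[int, int] = {}
    used_detections = set()
    for score, ti, di in scored:
        if score < min_affinity:
            break  # all remaining pairs score lower
        if ti in matches or di in used_detections:
            continue  # keep the assignment one-to-one
        matches[ti] = di
        used_detections.add(di)
    return matches


if __name__ == "__main__":
    tracks = [[1.0, 0.0], [0.0, 1.0]]
    detections = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.3]]
    matches = associate(tracks, detections, min_affinity=0.5)
    unmatched = [i for i in range(len(detections)) if i not in matches.values()]
    print(matches, unmatched)
```

Detections left unmatched by such a step would be the candidates for new potential tracking objects.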
Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The self-supervised learning-based multi-object tracking device and method described in the present invention are the first attempt to apply self-supervised learning to online multi-object tracking; apply a pose normalized GAN (PN-GAN) and image augmentation for learning that takes into account changes in an object's pose; can design an MOT model that produces better results even with a smaller amount of data by solving the problem of insufficient training data by means of the PN-GAN; can generate more discriminative feature vectors during training of an encoder network by proposing an affinity correlation distance (ACD), which is a new loss function; can learn using an end-to-end system; can be applied directly without additional transfer learning because additional fine-tuning for classification, clustering, and regression is not required; and can occupy less memory than other state-of-the-art methods while offering fast, high-performance tracking.
PCT/KR2022/015985 2021-10-22 2022-10-20 Device and method for multi-object tracking based on self-supervised learning Ceased WO2023068821A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2021-0141928 2021-10-22
KR1020210141928A KR102763536B1 (ko) 2021-10-22 2021-10-22 Self-supervised learning-based multi-object tracking apparatus and method

Publications (1)

Publication Number Publication Date
WO2023068821A1 (fr) 2023-04-27

Family

ID=86059483

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2022/015985 Ceased WO2023068821A1 (fr) 2021-10-22 2022-10-20 Device and method for multi-object tracking based on self-supervised learning

Country Status (2)

Country Link
KR (1) KR102763536B1 (fr)
WO (1) WO2023068821A1 (fr)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
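CN116580060A (zh) * 2023-05-31 2023-08-11 重庆理工大学 Unsupervised tracking model training method based on contrastive loss
CN117197060A (zh) * 2023-08-29 2023-12-08 浪潮软件集团有限公司 Method and system for detecting cable surface anomalies based on self-supervised contrastive learning
CN119540650A (zh) * 2024-11-26 2025-02-28 西安交通大学 Asymmetric self-supervised molten pool defect recognition method based on feature redundancy loss, and related system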

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102675168B1 (ko) * 2023-12-11 2024-06-17 주식회사 하이브마인드 Copyright infringement detection system and method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20200077370A (ko) * 2019-04-22 2020-06-30 주식회사 로민 Image masking apparatus and image masking method
US20200380274A1 (en) * 2019-06-03 2020-12-03 Nvidia Corporation Multi-object tracking using correlation filters in video analytics applications
US20200387866A1 (en) * 2019-06-05 2020-12-10 Inokyo, Inc. Environment tracking
US20200404245A1 (en) * 2011-08-04 2020-12-24 Trx Systems, Inc. Mapping and tracking system with features in three-dimensional space
US20210279475A1 (en) * 2016-07-29 2021-09-09 Unifai Holdings Limited Computer vision systems

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KIM SANGWON, LEE JIMI, KO BYOUNG CHUL: "SSL-MOT: self-supervised learning based multi-object tracking", APPLIED INTELLIGENCE., KLUWER ACADEMIC PUBLISHERS, DORDRECHT., NL, vol. 53, no. 1, 22 April 2022 (2022-04-22), NL , pages 930 - 940, XP093059001, ISSN: 0924-669X, DOI: 10.1007/s10489-022-03473-9 *

Also Published As

Publication number Publication date
KR102763536B1 (ko) 2025-02-07
KR20230057765A (ko) 2023-05-02

Similar Documents

Publication Publication Date Title
WO2023068821A1 (fr) Device and method for multi-object tracking based on self-supervised learning
WO2020190112A1 (fr) Method, apparatus, device and medium for generating captioning information of multimedia data
WO2021150017A1 (fr) Method for interactive segmentation of an object in an image and electronic computing device implementing said method
EP3908943A1 (fr) Method, apparatus, electronic device and computer-readable storage medium for searching for an image
WO2022154457A1 (fr) Action localization method, device, electronic equipment, and computer-readable storage medium
WO2015088179A1 (fr) Method and device for positioning with respect to key points of a face
WO2021157902A1 (fr) Device-free localization robust to environmental changes
WO2020149454A1 (fr) Electronic device for performing user authentication and operating method thereof
WO2013009020A4 (fr) Method and apparatus for generating viewer face-tracing information, recording medium therefor, and three-dimensional display apparatus
WO2021006404A1 (fr) Artificial intelligence server
WO2023038414A1 (fr) Information processing method, apparatus, electronic device, storage medium and program product
WO2019135621A1 (fr) Video playback device and control method thereof
WO2021167210A1 (fr) Server, electronic device, and control methods thereof
WO2022075668A1 (fr) Distributed processing system for an artificial intelligence model, and operating method thereof
WO2023167532A1 (fr) Method and apparatus for video action classification
WO2024043428A1 (fr) Apparatus for performing multi-frame de-fencing and method thereof
WO2022092451A1 (fr) Indoor location positioning method using deep learning
WO2021125521A1 (fr) Action recognition method using sequential feature data and apparatus therefor
WO2023224430A1 (fr) Method and apparatus for on-device personalized analysis using a machine learning model
WO2024122990A1 (fr) Method and device for performing visual localization
WO2023121408A1 (fr) Method and apparatus for performing a sparse-data-based convolution operation using an artificial neural network
WO2023085587A1 (fr) Artificial intelligence apparatus and method thereof for detecting items belonging to an "unseen" class
WO2021206363A1 (fr) Method and device for energy-efficient pupil tracking based on simplification of a cascaded regression forest
WO2023224428A1 (fr) Cooperative architecture for unsupervised learning of causal relationships in data generation
WO2025058432A1 (fr) Method and electronic device for accelerating language model inference with dynamic multi-token sampling

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 22884043; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 22884043; Country of ref document: EP; Kind code of ref document: A1)