
WO2025186903A1 - Information processing device, information processing method, and recording medium - Google Patents

Information processing device, information processing method, and recording medium

Info

Publication number
WO2025186903A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
information processing
tracking
processing device
feature amount
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/JP2024/008288
Other languages
French (fr)
Japanese (ja)
Other versions
WO2025186903A8 (en)
Inventor
宏 福井
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC Corp
Original Assignee
NEC Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Corp filed Critical NEC Corp
Priority to PCT/JP2024/008288 priority Critical patent/WO2025186903A1/en
Publication of WO2025186903A1 publication Critical patent/WO2025186903A1/en
Publication of WO2025186903A8 publication Critical patent/WO2025186903A8/en
Pending legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments

Definitions

  • This disclosure relates to the technical fields of information processing devices, information processing methods, and recording media.
  • Patent Document 1 discloses generating a first feature vector indicating the position information of an object in a first image and a second feature vector indicating the position information of the object in a second image, and using the first feature vector and the second feature vector to generate correspondence information indicating the correspondence between the objects, thereby tracking the objects.
  • the objective of this disclosure is to provide an information processing device, an information processing method, and a recording medium that aim to improve upon the technology disclosed in prior art documents.
  • One aspect of the information processing device disclosed herein comprises an acquisition means for acquiring images from a video, a detection means for detecting the position of a tracked object contained in the images, a conversion means for converting position information relating to the position of the tracked object into a feature quantity indicating the characteristics of the position information, an update means for updating the feature quantity using a cross-attention mechanism capable of matching the tracked object between multiple images, and a restoration means for restoring the updated feature quantity to the position information.
  • One aspect of the information processing method disclosed herein involves using at least one computer to acquire images from a video, detect the position of a tracked object contained in the images, convert the position information regarding the position of the tracked object into features indicating the characteristics of the position information, update the features using a cross-attention mechanism capable of matching the tracked object across multiple images, and restore the updated features to the position information.
  • One aspect of the recording medium of this disclosure is a recording medium on which is recorded a computer program that causes at least one computer to execute an information processing method that acquires images from a video, detects the position of a tracked target included in the images, converts position information regarding the position of the tracked target into feature quantities that indicate the characteristics of the position information, updates the feature quantities using a cross-attention mechanism that can match the tracked target between multiple images, and restores the updated feature quantities to the position information.
  • FIG. 1 is a block diagram showing a hardware configuration of a first information processing apparatus.
  • FIG. 2 is a conceptual diagram illustrating an example of a tracking technique using query propagation.
  • FIG. 3 is a block diagram showing a functional configuration of a first information processing apparatus.
  • FIG. 4 is a flowchart showing the flow of a tracking process by the first information processing device.
  • FIG. 5 is a block diagram showing the configuration of a cross-attention mechanism in the first information processing device.
  • FIG. 6 is a plan view illustrating an example of an affinity matrix calculated by a cross-attention mechanism.
  • FIG. 7 is a block diagram showing a functional configuration of a second information processing apparatus.
  • FIG. 8 is a conceptual diagram illustrating a method for generating training data according to a comparative example.
  • FIG. 9 is a conceptual diagram illustrating a method for generating learning data according to the second information processing device.
  • FIG. 10 is a conceptual diagram illustrating query propagation in a learning operation by the second information processing device.
  • FIG. 11 is a flowchart showing the flow of a learning operation by the second information processing device.
  • Fig. 1 is a block diagram showing the hardware configuration of the first information processing apparatus.
  • the first information processing device 1 includes a processor 11, a RAM (Random Access Memory) 12, a ROM (Read Only Memory) 13, a storage device 14, an input device 15, and an output device 16.
  • the processor 11, RAM 12, ROM 13, storage device 14, input device 15, and output device 16 are each connected via a data bus 17.
  • the data bus 17 may be an interface other than a data bus (for example, a LAN, USB, etc.).
  • the processor 11 loads a computer program.
  • the processor 11 is configured to load a computer program stored in at least one of the RAM 12, ROM 13, and storage device 14.
  • the processor 11 may load a computer program stored in a computer-readable storage medium using a storage medium reading device (not shown).
  • the processor 11 may also obtain (i.e., load) the computer program from a device (not shown) located outside the first information processing device 1 via a network interface.
  • the processor 11 performs various processes by executing the loaded computer program.
  • functional blocks related to the tracking process performed by the first information processing device 1 are realized within the processor 11.
  • the processor 11 may function as a controller that executes each control in the first information processing device 1.
  • Processor 11 may be configured as, for example, a CPU (Central Processing Unit), GPU (Graphics Processing Unit), FPGA (Field-Programmable Gate Array), DSP (Digital Signal Processor), ASIC (Application Specific Integrated Circuit), or quantum processor. Processor 11 may be configured as one of these, or as multiple processors operating in parallel.
  • RAM 12 temporarily stores computer programs executed by processor 11.
  • RAM 12 temporarily stores data that processor 11 uses temporarily while it is executing a computer program.
  • RAM 12 may be, for example, D-RAM (Dynamic Random Access Memory) or SRAM (Static Random Access Memory). Also, other types of volatile memory may be used instead of RAM 12.
  • ROM 13 stores computer programs executed by processor 11. ROM 13 may also store fixed data. ROM 13 may be, for example, a P-ROM (Programmable Read Only Memory) or an EPROM (Erasable Read Only Memory). Also, other types of non-volatile memory may be used instead of ROM 13.
  • the storage device 14 stores data that the first information processing device 1 saves over the long term.
  • the storage device 14 may operate as a temporary storage device for the processor 11.
  • the storage device 14 may also store computer programs executed by the processor 11.
  • the storage device 14 may include, for example, at least one of a hard disk device, a magneto-optical disk device, an SSD (Solid State Drive), and a disk array device.
  • the input device 15 is a device that receives input instructions from a user of the information processing device 1.
  • the input device 15 may include, for example, at least one of a keyboard, a mouse, and a touch panel.
  • the input device 15 may also be a device that allows voice input, for example, including a microphone.
  • the input device 15 may also be configured as various types of terminals, such as a smartphone, tablet, or laptop computer.
  • the output device 16 is a device that outputs information related to the information processing device 1 to the outside.
  • the output device 16 may be a display device (e.g., a display or digital signage) that can display information related to the information processing device 1.
  • the output device 16 may also be a speaker or the like that can output information related to the first information processing device 1 as audio.
  • the output device 16 may be configured as various terminals, such as a smartphone, tablet, or laptop computer.
  • the first information processing device 1 may be configured to include some of the components described in FIG. 1.
  • the first information processing device 1 may be configured to include a processor 11, RAM 12, and ROM 13.
  • the storage device 14, input device 15, and output device 16 may each be configured as an external device connected to the first information processing device 1.
  • some of the calculation functions of the first information processing device 1 may be realized by an external server, cloud, etc.
  • Fig. 2 is a conceptual diagram showing an example of a tracking technique using query propagation.
  • the first information processing device 1 is configured to be able to execute a tracking process for tracking a tracking target included in a video.
  • the tracking target may be, for example, a person or an animal, or an object such as luggage or a car.
  • a feature based on the position information of the tracking target detected from the video is acquired as a "detection query."
  • This detection query is then used to update a "tracking query," which is a query for tracking the target.
  • the first information processing device 1 tracks the tracking target by propagating this tracking query in chronological order. Note that these queries are set for each tracking target. Therefore, if a video includes multiple tracking targets, the queries corresponding to each of the multiple tracking targets are updated.
  • the detection query at time T1 includes detection queries for objects A and B.
  • the tracking query is then updated using the detection query at time T1, and as a result, the tracking query at time T2 includes tracking queries for objects A and B.
  • the detection query at time T2 includes the detection query for object C.
  • the tracking query is then updated using the detection query at time T2.
  • the tracking query at time T3 includes the tracking queries for objects A and B that were included in the tracking query at time T2 (in other words, propagated from the previous time), as well as the tracking query for the newly detected object C.
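  • The query bookkeeping described above can be illustrated with a short sketch. The example below is conceptual only and is not code from this publication: the helper names (propagate, fuse) and the target IDs A, B, and C are assumptions made for the illustration.

```python
# Conceptual sketch of query propagation across frames (illustration only).
# Target IDs "A", "B", "C", the helper names, and the toy 1-D query vectors
# are assumptions for this example, not identifiers from the publication.

def propagate(tracking_queries, detection_queries, fuse):
    """Update per-target tracking queries with the current frame's detection queries.

    tracking_queries:  dict mapping target ID -> query (carried over from past frames)
    detection_queries: dict mapping target ID -> query (derived from the current frame)
    fuse:              function combining an old tracking query with a new detection query
    """
    updated = dict(tracking_queries)            # queries propagated from the previous time
    for target_id, det_q in detection_queries.items():
        if target_id in updated:
            updated[target_id] = fuse(updated[target_id], det_q)  # existing target: update
        else:
            updated[target_id] = det_q          # newly detected target: start a new query
    return updated

# Time T1: targets A and B are detected; time T2: target C newly appears.
queries_t2 = propagate({}, {"A": [0.1], "B": [0.2]}, fuse=lambda old, new: new)
queries_t3 = propagate(queries_t2, {"C": [0.3]}, fuse=lambda old, new: new)
print(sorted(queries_t3))   # ['A', 'B', 'C'] -- A and B propagated, C newly added
```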
  • Fig. 3 is a block diagram showing the functional configuration of the first information processing apparatus.
  • the first information processing device 1 includes, as components for realizing its functions, an image acquisition unit 110, a target position detection unit 120, a feature conversion unit 130, a feature update unit 140, a location information restoration unit 150, and a storage unit 155.
  • Each of the image acquisition unit 110, target position detection unit 120, feature conversion unit 130, feature update unit 140, and location information restoration unit 150 may be a processing block realized by the processor 11 described above (see FIG. 1).
  • the storage unit 155 may be realized by the storage device 14 described above (see FIG. 1), etc.
  • the image acquisition unit 110 is configured to be able to acquire videos and images. More specifically, the image acquisition unit 110 may acquire images of each frame constituting a video sequentially in chronological order. Alternatively, it may acquire images arranged in chronological order. For example, images may be acquired from each frame constituting a video at predetermined intervals. The image acquisition unit 110 may also be configured to acquire images in real time while capturing a video with a camera. The images acquired by the image acquisition unit 110 are configured to be output to the target position detection unit 120.
  • the target position detection unit 120 is configured to be able to detect the position of the tracking target from the image acquired by the image acquisition unit 110.
  • the target position detection unit 120 may be configured to, for example, detect a person included in the image and output position information indicating the position of the detected person. Note that if the image includes multiple tracking targets, the target position detection unit 120 may detect position information for each of the multiple tracking targets.
  • the target position detection unit 120 may be configured as a trained detection model made up of, for example, a neural network.
  • the position information of the tracking target detected by the target position detection unit 120 is configured to be output to the feature conversion unit 130.
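  • As a rough illustration of this detection step, the sketch below assumes a trained detector that maps an image to one bounding box per tracked target; the box format (x, y, w, h), the function names, and the dummy detector are hypothetical and are not taken from this publication.

```python
# Illustrative sketch of the detection step (assumed interface, not the actual model).
from typing import Callable, List, Tuple

import numpy as np

Box = Tuple[float, float, float, float]   # assumed position format: (x, y, width, height)

def detect_positions(image: np.ndarray, detector: Callable[[np.ndarray], List[Box]]) -> List[Box]:
    """Run a trained detection model and return position information per tracked target."""
    return [tuple(float(v) for v in box) for box in detector(image)]

# Dummy detector standing in for a trained neural-network detector: always reports two people.
dummy_detector = lambda img: [(10, 20, 50, 120), (200, 40, 45, 110)]
frame = np.zeros((480, 640, 3), dtype=np.uint8)
print(detect_positions(frame, dummy_detector))
```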
  • the feature conversion unit 130 is configured to be able to convert the position information of the tracked target detected by the target position detection unit 120 into feature amounts.
  • the feature amounts here may be feature vectors indicating the position information of the tracked target.
  • the feature conversion unit 130 may be configured as an encoder equipped with multiple feature extraction blocks.
  • the feature conversion unit 130 may be constructed, for example, with approximately three fully connected layers in a neural network.
  • the feature amounts converted by the feature conversion unit 130 may be used, for example, as detection queries (see Figure 2) used in tracking processing.
  • the feature amounts converted by the feature conversion unit 130 are configured to be output to the feature update unit 140.
  • the feature update unit 140 is configured to be able to update the features converted by the feature conversion unit 130 using a cross-attention mechanism 200.
  • the cross-attention mechanism 200 has the function of comparing the position information of the tracked target between multiple images. Specifically, the cross-attention mechanism 200 has the function of associating targets included in each of consecutive images acquired in time series to determine whether they are the same person. The specific configuration of the cross-attention mechanism 200 will be explained in detail later.
  • the features updated by the feature update unit 140 (hereinafter referred to as "updated features") may be used, for example, as a tracking query (see Figure 2) used in the tracking process. In this case, the updated features may be temporarily stored in the storage unit 155, which stores tracking queries.
  • the updated features are also configured to be output to the position information restoration unit 150.
  • the location information restoration unit 150 is configured to be able to restore the updated features updated by the feature update unit 140 to location information.
  • the location information restoration unit 150 may be configured as a decoder equipped with multiple feature extraction blocks.
  • the location information restoration unit 150 may be constructed, for example, with approximately three fully connected layers in a neural network.
  • the location information restoration unit 150 may also have a function to output the restored location information.
  • the storage unit 155 is configured to store feature amounts and location information for each tracking target.
  • the storage unit 155 may be configured to store detection queries and tracking queries for each tracking target.
  • the storage unit 155 may have multiple memory areas, and store tracking queries for tracking a single person in one memory area so that they are accumulated in chronological order.
  • the storage unit 155 may also be configured to store feature amounts and location information for each tracking target, linked to the corresponding ID.
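  • A minimal sketch of the feature conversion unit 130 and the location information restoration unit 150 follows, assuming each is roughly three fully connected layers as described above, mapping a box (x, y, w, h) to a d-dimensional feature and back. The dimensions, class names, and use of PyTorch are illustrative assumptions, not details from this publication.

```python
# Minimal sketch of the feature conversion unit 130 (encoder) and the location
# information restoration unit 150 (decoder), assuming roughly three fully
# connected layers each. Dimensions and class names are illustrative assumptions.
import torch
import torch.nn as nn

class PositionEncoder(nn.Module):      # corresponds to feature conversion unit 130
    def __init__(self, pos_dim: int = 4, feat_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pos_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, positions: torch.Tensor) -> torch.Tensor:
        # positions: (num_targets, 4) boxes -> (num_targets, feat_dim) detection queries
        return self.net(positions)

class PositionDecoder(nn.Module):      # corresponds to location information restoration unit 150
    def __init__(self, pos_dim: int = 4, feat_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, pos_dim),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # updated features -> restored position information (boxes)
        return self.net(features)

# Example: two detected targets, each described by an (x, y, w, h) box.
boxes = torch.tensor([[10.0, 20.0, 50.0, 120.0], [200.0, 40.0, 45.0, 110.0]])
queries = PositionEncoder()(boxes)          # detection queries
restored = PositionDecoder()(queries)       # restored position information
```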
  • Fig. 4 is a flowchart showing the flow of the tracking process by the first information processing device.
  • the image acquisition unit 110 first acquires an image from the video (step S101). Then, the target position detection unit 120 detects the position of the tracking target contained in the image from the image acquired by the image acquisition unit 110 (step S102).
  • the feature conversion unit 130 converts the position information of the tracked target detected by the target position detection unit 120 into a feature (step S103). Then, the feature update unit 140 updates the feature converted by the feature conversion unit 130 using the cross-attention mechanism 200 (step S104).
  • the location information restoration unit 150 restores the updated feature amounts updated by the feature amount update unit 140 into location information (step S105). Then, the first information processing device 1 determines whether to end the tracking process (step S106).
  • If the tracking process is not to be ended (step S106: NO), the process may be executed again from step S101. That is, the next frame image may be acquired, and the above-described process may be executed repeatedly.
  • If the tracking process is to be ended (step S106: YES), the series of processes ends.
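  • Putting steps S101 to S106 together, the overall loop can be sketched as below. The names read_frames, detector, encoder, cross_attention_update, and decoder are hypothetical placeholders for the components described above, not identifiers from this publication.

```python
# Sketch of the tracking flow of Fig. 4 (steps S101 to S106). All argument names are
# hypothetical placeholders standing in for the components described in the text.
def run_tracking(read_frames, detector, encoder, cross_attention_update, decoder):
    stored_features = {}                                # per-target features (storage unit 155)
    for frame in read_frames():                         # S101: acquire an image from the video
        boxes = detector(frame)                         # S102: detect positions of tracked targets
        features = encoder(boxes)                       # S103: convert position info into features
        updated, stored_features = cross_attention_update(features, stored_features)  # S104
        positions = decoder(updated)                    # S105: restore updated features to positions
        yield positions                                 # S106: repeat for the next frame until the video ends
```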
  • Fig. 5 is a block diagram showing the configuration of the cross-attention mechanism in the first information processing apparatus.
  • the cross-attention mechanism 200 includes three feature embedding units 210, 220, and 230 corresponding to the query, key, and value, respectively, a matrix multiplication unit 240, a normalization unit 250, a matrix multiplication unit 260, a residual processing unit 270, and a memory update unit 280.
  • the feature embedding processor 210 is configured to extract a query from the feature values at time t (i.e., the feature values corresponding to the frame captured at time t) input from the feature converter 130.
  • the feature embedding processor 220 is configured to extract a key from the feature values at time t-τ calculated in a past tracking process (i.e., the feature values corresponding to the frame captured at time t-τ, prior to time t).
  • the feature embedding processor 230 is configured to extract a value from the feature values at time t-τ calculated in a past tracking process.
  • the query and key are configured to be output to the matrix multiplication processor 240.
  • the value is configured to be output to the matrix multiplication processor 260.
  • the matrix multiplication calculation unit 240 is configured to calculate a weight (Attention Weight) indicating the correlation between the query and the key by calculating the matrix product of the query and the key.
  • the matrix multiplication calculation unit 240 is configured to calculate a weight indicating the correlation between the feature corresponding to the frame captured at time t and the feature corresponding to the frame captured at time t-τ.
  • the matrix multiplication calculation unit 240 may calculate (use) an affinity matrix (Affinity Matrix) in which the vertical axis represents the feature corresponding to the frame captured at time t and the horizontal axis represents the feature corresponding to the frame captured at time t-τ as the weight (Attention Weight) of the cross-attention mechanism 200.
  • the normalization unit 250 is configured to be able to perform normalization processing on the weights calculated by the matrix multiplication unit 240.
  • the normalization unit 250 may, for example, perform processing to normalize the similarity matrix calculated by the matrix multiplication unit 240 using a cross-softmax function.
  • the weights normalized by the normalization unit 250 are configured to be output to the matrix multiplication unit 260.
  • the matrix multiplication unit 260 is configured to perform processing to reflect the weight in the value by calculating the matrix product of the output from the normalization unit 250 and the value.
  • the matrix product in this embodiment may typically be a tensor product (in other words, a direct product).
  • the matrix product may be a Kronecker product.
  • the calculation result of the matrix multiplication unit 260 is configured to be output to the residual processing unit 270.
  • the residual processing unit 270 is configured to perform residual processing on the calculation results of the matrix multiplication calculation unit 260.
  • This residual processing may be a process of adding the calculation results of the matrix multiplication calculation unit 260 and the feature values input to the cross-attention mechanism 200 (specifically, the feature values at time t). This prevents the feature values from disappearing from the output of the cross-attention mechanism 200 when no correlation is calculated. For example, if 0 is calculated as the correlation (weight), the value will be multiplied by that 0, causing the feature value in the calculation results of the matrix multiplication calculation unit 260 to become 0 (that is, to disappear). To prevent this, the residual processing unit 270 performs the residual processing described above. The calculation results of the residual processing unit 270 are output from the cross-attention mechanism 200 as the updated feature values at time t.
  • the memory update unit 280 updates the stored feature quantities corresponding to the tracked object.
  • the memory update unit 280 may update only the feature quantities stored in the memory means corresponding to the updated feature quantities output by the matrix multiplication calculation unit 260, or may overwrite the feature quantities output by the calculation results of the residual processing unit 270 in the memory means to update them.
  • the tracked object may be identified by weights calculated from the query and key by the matrix multiplication calculation unit 240, and it may be determined which tracked object to update from the multiple tracked objects stored in the memory unit 155.
  • the updated feature quantities calculated by the matrix multiplication calculation unit 260 from the normalized weights and values may be determined as the updated quantities for the feature quantities of the tracked object stored in the memory unit 155.
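  • The computation described above can be sketched as follows. The sketch assumes single-head attention with linear embeddings for the query, key, and value, a plain row-wise softmax standing in for the cross-softmax normalization mentioned above, and an ordinary matrix product; the dimensions and class name are illustrative assumptions, and the memory update step is omitted.

```python
# Minimal sketch of the cross-attention update of Fig. 5. The linear embeddings,
# the plain softmax used in place of the cross-softmax, and all dimensions are
# simplifying assumptions, not the publication's exact design.
import torch
import torch.nn as nn

class CrossAttentionUpdate(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.embed_q = nn.Linear(feat_dim, feat_dim)  # feature embedding unit 210 (query)
        self.embed_k = nn.Linear(feat_dim, feat_dim)  # feature embedding unit 220 (key)
        self.embed_v = nn.Linear(feat_dim, feat_dim)  # feature embedding unit 230 (value)

    def forward(self, feat_t: torch.Tensor, feat_past: torch.Tensor) -> torch.Tensor:
        # feat_t:    (N_t, d) features at time t (from the feature conversion unit)
        # feat_past: (N_p, d) stored features at time t-tau (from the storage unit)
        q = self.embed_q(feat_t)
        k = self.embed_k(feat_past)
        v = self.embed_v(feat_past)
        weight = q @ k.t()                      # matrix multiplication unit 240: affinity matrix
        weight = weight.softmax(dim=-1)         # normalization unit 250 (cross-softmax stand-in)
        attended = weight @ v                   # matrix multiplication unit 260: weight applied to value
        return feat_t + attended                # residual processing unit 270: updated features at time t

# Example: 3 targets at time t matched against 2 stored targets from time t-tau.
update = CrossAttentionUpdate(feat_dim=8)
out = update(torch.randn(3, 8), torch.randn(2, 8))   # -> (3, 8) updated features
```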
  • the first information processing device 1 essentially focuses on the similarity between the tracking process and the operations performed by the cross-attention mechanism 200, and can be said to perform an operation to update features using information generated when matching objects.
  • the tracking process involves a process to detect the tracked object, a process to match the tracked object, and a process to update the tracked object detection results.
  • the cross-attention mechanism 200 involves a process to extract features related to the tracked object, a process to calculate weights, and a process to update the features related to the tracked object.
  • the first information processing device 1 essentially reuses the process of calculating weights in the cross-attention mechanism 200 as the process of matching the tracked object in the tracking process. Therefore, it can also be said that the first information processing device 1 realizes the operations of detecting an object, matching the object, and updating the detection results using the cross-attention mechanism 200.
  • the cross-attention mechanism 200 uses the feature quantity corresponding to the frame captured at time t as a query, and obtains and uses the feature quantities contained in frames captured up to time t-τ before time t as keys and values from the memory unit 160, which stores these feature quantities, to track the target.
  • the memory update unit 280 may also update the memory means by overwriting the feature quantity output by the calculation results of the residual processing unit 270.
  • the target may be identified from the weights calculated by the matrix multiplication calculation unit 240, and the feature quantity corresponding to the ID of the identified target, which is stored in the memory means, may be updated using the updated feature quantity output by the matrix multiplication calculation unit 260. In this way, it is possible to accurately track a target included in a video using a relatively simple algorithm.
  • Fig. 6 is a plan view showing an example of the similarity matrix calculated by the cross-attention mechanism.
  • the similarity matrix AM used as a weight by the cross attention mechanism 200 is information indicating the correspondence between the tracking target O t-τ at time t-τ and the tracking target O t at time t.
  • the similarity matrix AM is information indicating that (1) a first tracking target O t-τ among the multiple tracking targets O t-τ corresponds to a first tracking target O t among the multiple tracking targets O t (that is, both are the same person), (2) a second tracking target O t-τ among the multiple tracking targets O t-τ corresponds to a second tracking target O t among the multiple tracking targets O t, ..., (N) an Nth tracking target O t-τ among the multiple tracking targets O t-τ corresponds to an Nth tracking target O t among the multiple tracking targets O t.
  • the similarity matrix AM is information indicating the correspondence between the tracking target O t-τ and the tracking target O t, and therefore may be referred to as correspondence information.
  • the similarity matrix AM can be considered to be a matrix whose vertical axis corresponds to the vector components of the feature vector CV t-τ and whose horizontal axis corresponds to the vector components of the feature vector CV t. Therefore, the size of the vertical axis of the similarity matrix AM is the size of the feature vector CV t-τ, which corresponds to the size of the image captured at time t-τ (i.e., the number of pixels). Similarly, the size of the horizontal axis of the similarity matrix AM is the size of the feature vector CV t, which corresponds to the size of the image captured at time t (i.e., the number of pixels).
  • the similarity matrix AM can be considered to be a matrix whose vertical axis corresponds to the detection result of the tracking target O t-τ reflected in the image at time t-τ (i.e., the detected position of the tracking target O t-τ) and whose horizontal axis corresponds to the detection result of the tracking target O t reflected in the image at time t (i.e., the detected position of the tracking target O t).
  • an element of the similarity matrix AM reacts (typically has a non-zero value) at a position where a vector component on the vertical axis corresponding to a certain tracking target O t-τ intersects with a vector component on the horizontal axis corresponding to the same tracking target O t.
  • an element of the similarity matrix AM reacts at a position where a detection result for tracking target O t-τ on the vertical axis intersects with a detection result for tracking target O t on the horizontal axis.
  • the similarity matrix AM is typically a matrix in which the value of an element at a position where a vector component corresponding to tracking target O t-τ included in feature vector CV t-τ intersects with a vector component corresponding to the same tracking target O t included in feature vector CV t is a value obtained by multiplying both vector components (i.e., a non-zero value), while the values of the other elements are 0.
  • the elements of the similarity matrix AM react at positions where the detection result of tracking target O#k reflected in the image captured at time t-τ intersects with the detection result of tracking target O#k reflected in the image captured at time t.
  • If an element of the similarity matrix AM does not react (typically becomes 0) at the position where the vector component corresponding to the tracking target O t-τ included in the feature vector CV t-τ intersects with the vector component corresponding to the same tracking target O t included in the feature vector CV t, it is estimated that the tracking target O t-τ that was reflected in the image captured at time t-τ is not reflected in the image captured at time t (for example, it has moved outside the angle of view of the camera).
  • the similarity matrix AM can be used as information indicating the correspondence between the tracking target O t-τ and the tracking target O t.
  • the similarity matrix AM can be used as information indicating the result of matching the tracking target O t-τ reflected in the image captured at time t-τ with the object O t reflected in the image captured at time t. Therefore, the similarity matrix AM can be used as information for tracking the position of the tracking target O t-τ reflected in the image captured at time t-τ within the image captured at time t.
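  • A toy numeric example of reading the similarity matrix AM as correspondence information is given below; the values are invented purely for illustration and do not come from this publication.

```python
# Toy illustration of reading the similarity matrix AM as correspondence information.
# Rows: targets detected at time t-tau; columns: targets detected at time t.
# The numbers are made up purely to show how a row's peak identifies the same target.
import numpy as np

AM = np.array([
    [0.9, 0.1, 0.0],   # target 1 at t-tau responds most strongly to target 1 at t
    [0.0, 0.8, 0.2],   # target 2 at t-tau responds most strongly to target 2 at t
    [0.0, 0.0, 0.0],   # target 3 at t-tau has no counterpart at t (e.g. left the camera view)
])

for i, row in enumerate(AM):
    if row.max() > 0:
        print(f"target {i + 1} at t-tau  ->  target {int(row.argmax()) + 1} at t")
    else:
        print(f"target {i + 1} at t-tau  ->  no longer visible at t")
```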
  • the operation of this cross-attention mechanism 200 is used to perform tracking processing for the tracked object.
  • tracking processing can be performed more appropriately compared to when the cross-attention mechanism 200 described in this embodiment is not used.
  • In a comparative configuration that uses a self-attention mechanism, learning is required so that the weights in the self-attention mechanism react strongly between the same tracked objects.
  • realizing such a configuration requires a large number of self-attention mechanisms, which poses a technical problem in that the algorithm used for tracking processing becomes complicated.
  • In contrast, in this embodiment, the tracking processing algorithm can be constructed with a simple structure, making it possible to achieve highly accurate tracking processing while suppressing computational costs.
  • the second information processing device 1 will be described with reference to Figures 7 to 11.
  • the second information processing device 1 differs in some configurations and operations from the first information processing device 1 described above, but other parts may be similar to the first information processing device 1. Therefore, the following will describe in detail the parts that differ from the first embodiment, and will omit explanations of other overlapping parts as appropriate.
  • Fig. 7 is a block diagram showing the functional configuration of the second information processing device.
  • the second information processing device 1 includes, as components for realizing its functions, an image acquisition unit 110, a target position detection unit 120, a feature conversion unit 130, a feature update unit 140, a position information restoration unit 150, and a learning unit 160. That is, the second information processing device 1 further includes a learning unit 160 in addition to the configuration already described in the first embodiment (see FIG. 3). Note that the learning unit 160 may be a processing block realized by the above-mentioned processor 11 (see FIG. 1).
  • the learning unit 160 is configured to be able to perform learning related to the tracking process of the tracked object executed by the second information processing device 1. More specifically, the learning unit 160 may perform machine learning on the tracking model 50 that performs the tracking process (i.e., a model having the functions of the target position detection unit 120, feature conversion unit 130, feature update unit 140, and position information restoration unit 150) to enable tracking with higher accuracy.
  • the learning unit 160 may perform learning related to the operation of the cross-attention mechanism 200 used by the feature update unit 140. For example, the learning unit 160 may perform learning so that the operation of matching tracked objects performed by the cross-attention mechanism 200 can be performed more accurately. Specifically, the learning unit 160 may learn so that the similarity matrix AM used by the cross-attention mechanism 200 reacts more strongly to the same tracked object.
  • the specific learning method used by the learning unit 160 will be explained in detail below.
  • Fig. 8 is a conceptual diagram showing a learning data generation technique according to a comparative example.
  • Fig. 9 is a conceptual diagram showing a learning data generation technique according to a second information processing device.
  • Fig. 10 is a conceptual diagram showing query propagation in the learning operation by the second information processing device.
  • the learning unit 160 in the second information processing device 1 performs mini-batch conversion on a single video and uses the result as training data.
  • the learning unit 160 may use all of the video frames included in the mini-batch for training the query propagation of the tracking model 50.
  • the training data may include each frame included in a single video and ground truth data indicating the correspondence between the tracked targets captured in each frame (i.e., which people are the same person).
  • the learning unit 160 uses multiple frames (frame t1, frame t2, frame t3, ...) arranged in chronological order from a single video as learning data. In this way, it is possible to significantly increase the number of frames that can be used to learn query propagation in the tracking model 50, and it is also possible to learn query propagation by the tracking model 50 in chronological order.
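  • A minimal sketch of building such training data from a single video follows; the data layout (a list of frames paired with per-frame lists of ground-truth target IDs) is an assumption made purely for illustration.

```python
# Sketch of building training data from a single video, as described above:
# all consecutive frames of one video go into one mini-batch, each paired with
# ground-truth correspondences between tracked targets. The layout is illustrative.
def make_minibatch(frames, ground_truth_ids):
    """frames: list of images in chronological order (frame t1, t2, t3, ...).
    ground_truth_ids: list of per-frame target ID lists; equal IDs across frames
    mean the same person, which supervises the matching (query propagation)."""
    assert len(frames) == len(ground_truth_ids)
    return [
        {"frame": frame, "ids": ids}
        for frame, ids in zip(frames, ground_truth_ids)
    ]

# Example: a 3-frame video in which persons "A" and "B" appear first and "C" joins later.
batch = make_minibatch(
    frames=["img_t1", "img_t2", "img_t3"],
    ground_truth_ids=[["A", "B"], ["A", "B"], ["A", "B", "C"]],
)
```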
  • Fig. 11 is a flowchart showing the flow of the learning operation by the second information processing device.
  • the learning unit 160 first batch-converts one video to create learning data (step S201).
  • the learning unit 160 then inputs the learning data into the tracking model 50 (step S202).
  • the learning unit 160 compares the output result of the tracking model 50 with the ground truth data to calculate a loss function (step S203). Then, the learning unit 160 calculates the gradient of the loss function (step S204).
  • the learning unit 160 updates the parameters of the tracking model based on the calculated gradient so as to reduce the loss function (step S205). After that, the learning unit 160 determines whether learning has been performed using all frames of the training data (step S206).
  • If not all frames have been used for learning (step S206: NO), the learning unit 160 starts processing again from step S202. That is, the learning unit 160 repeats the process from inputting images, which are learning data, into the tracking model 50 to updating the parameters. On the other hand, if all frames have been used for learning (step S206: YES), the learning unit 160 determines that learning has ended and saves the learned model (step S207).
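  • The learning loop of steps S201 to S207 can be sketched as below. The tracking model interface, the loss function, the optimizer choice, and the save path are placeholder assumptions; only the overall flow follows the steps described above.

```python
# Minimal sketch of the learning loop of Fig. 11 (steps S201 to S207). The tracking
# model interface, the loss function, and the optimizer choice are placeholder
# assumptions; only the overall flow mirrors the steps described above.
import torch

def train_on_one_video(tracking_model, frames_batch, loss_fn, lr: float = 1e-4):
    optimizer = torch.optim.Adam(tracking_model.parameters(), lr=lr)
    for sample in frames_batch:                       # S201: mini-batch built from one video
        output = tracking_model(sample["frame"])      # S202: input learning data into the model
        loss = loss_fn(output, sample["ids"])         # S203: compare with ground truth
        optimizer.zero_grad()
        loss.backward()                               # S204: compute the gradient of the loss
        optimizer.step()                              # S205: update parameters to reduce the loss
    # S206/S207: once all frames have been used, learning ends and the model is saved.
    torch.save(tracking_model.state_dict(), "tracking_model.pt")
```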
  • As described in Figures 6 to 11, in the second information processing device 1, multiple frames included in one video are converted into one mini-batch for learning. In this way, it is possible to significantly increase the number of frames that can be used in learning query propagation for the tracking model 50, compared to, for example, when multiple videos are each converted into mini-batches. Note that there is no particular limit on the number of frames in the tracking process performed by the second information processing device 1. Therefore, by realizing learning using a large number of frames (in other words, long-term time-series learning), it is possible to effectively improve tracking accuracy.
  • the information processing device 1 can be applied to various systems that perform tracking processing on an object.
  • the information processing device 1 can be applied to a gateless authentication system that tracks an object passing through a predetermined area and performs authentication processing using biometric information (e.g., facial information, iris information, etc.) of the object being tracked.
  • each embodiment also includes a processing method in which a program that operates the configuration of each embodiment to realize the functions of the above-mentioned embodiments is recorded on a recording medium, the program recorded on the recording medium is read as code, and the program is executed on a computer.
  • computer-readable recording media are also included in the scope of each embodiment.
  • each embodiment includes not only the recording medium on which the above-mentioned program is recorded, but also the program itself.
  • the recording medium may be, for example, a floppy disk, hard disk, optical disk, magneto-optical disk, CD-ROM, magnetic tape, non-volatile memory card, or ROM.
  • the scope of each embodiment is not limited to programs recorded on the recording medium that execute processing by themselves, but also includes programs that execute processing by running on an OS in conjunction with other software or expansion board functions.
  • the program itself may be stored on a server, and part or all of the program may be downloadable from the server to a user terminal.
  • the program may also be provided to the user in, for example, a SaaS (Software as a Service) format.
  • the information processing device described in Supplementary Note 1 is an information processing device that includes an acquisition means for acquiring an image from a video, a detection means for detecting the position of a tracked object included in the image, a conversion means for converting position information regarding the position of the tracked object into a feature quantity indicating the characteristics of the position information, an update means for updating the feature quantity using a cross-attention mechanism that can match the tracked object between a plurality of the images, and a restoration means for restoring the updated feature quantity to the position information.
  • The information processing device described in Supplementary Note 2 is the information processing device described in Supplementary Note 1, in which the cross-attention mechanism matches the tracked target using a weight calculated from a first feature that is the feature related to a first image and a second feature that is the feature related to a second image that was taken before the first image.
  • the information processing device described in Supplementary Note 3 is the information processing device described in Supplementary Note 2, in which the cross-attention mechanism matches the tracked target by using a similarity matrix obtained by calculating a matrix product of the first feature and the second feature as the weight.
  • the information processing device described in Supplementary Note 4 is the information processing device described in any one of Supplementary Notes 1 to 3, further comprising: a storage means capable of storing the feature; and a storage update means that updates the feature stored in the storage means based on the feature updated by the update means.
  • the information processing device described in Supplementary Note 5 is the information processing device described in any one of Supplementary Notes 1 to 3, further including a learning unit that converts multiple frames included in one video into one mini-batch to use as training data, and performs learning related to matching of the tracked target.
  • the information processing method described in Supplementary Note 6 is an information processing method that, by at least one computer, acquires an image from a video, detects a position of a tracked target included in the image, converts position information regarding the position of the tracked target into a feature amount indicating characteristics of the position information, updates the feature amount using a cross-attention mechanism that can match the tracked target between a plurality of the images, and restores the updated feature amount to the position information.
  • the recording medium described in Supplementary Note 7 is a recording medium having recorded thereon a computer program for causing at least one computer to execute an information processing method of acquiring images from a video, detecting a position of a tracked target included in the images, converting position information regarding the position of the tracked target into feature quantities indicating characteristics of the position information, updating the feature quantities using a cross-attention mechanism capable of matching the tracked target between a plurality of the images, and restoring the updated feature quantities to the position information.
  • the computer program described in Supplementary Note 8 is a computer program that causes at least one computer to execute an information processing method of acquiring images from a video, detecting a position of a tracked target included in the images, converting position information regarding the position of the tracked target into features indicating characteristics of the position information, updating the features using a cross-attention mechanism that can match the tracked target between a plurality of the images, and restoring the updated features to the position information.
  • the tracking device described in Supplementary Note 9 is a tracking device including: an acquisition means for acquiring an image from a video; a detection means for detecting a position of a tracking target included in the image; a conversion means for converting position information regarding the position of the tracking target into a feature quantity indicating characteristics of the position information; an update means for updating the feature quantity using a cross-attention mechanism capable of matching the tracking target between a plurality of the images; a restoration means for restoring the updated feature quantity to the position information; and a tracking means for tracking the tracking target based on the restored position information.
  • the tracking method described in Supplementary Note 10 is a tracking method that, by at least one computer, acquires an image from a video, detects a position of a tracked object included in the image, converts position information regarding the position of the tracked object into a feature amount indicating characteristics of the position information, updates the feature amount using a cross-attention mechanism that can match the tracked object between a plurality of the images, restores the updated feature amount to the position information, and tracks the tracked object based on the restored position information.
  • the recording medium described in Supplementary Note 11 is a recording medium having recorded thereon a computer program for causing at least one computer to execute a tracking method of acquiring images from a video, detecting a position of a tracked target included in the images, converting position information regarding the position of the tracked target into feature amounts indicating characteristics of the position information, updating the feature amounts using a cross-attention mechanism capable of matching the tracked target between a plurality of the images, restoring the updated feature amounts to the position information, and tracking the tracked target based on the restored position information.
  • the computer program described in Supplementary Note 12 is a computer program that causes at least one computer to execute a tracking method of acquiring images from a video, detecting a position of a tracked target included in the images, converting position information regarding the position of the tracked target into feature amounts indicating characteristics of the position information, updating the feature amounts using a cross-attention mechanism that can match the tracked target across a plurality of the images, restoring the updated feature amounts to the position information, and tracking the tracked target based on the restored position information.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Studio Devices (AREA)

Abstract

This information processing device includes: an acquisition means for acquiring an image from a video; a detection means for detecting the position of a tracking target included in the image; a conversion means for converting position information related to the position of the tracking target into a feature amount indicating a feature of the position information; an update means for updating the feature amount by using a cross-attention mechanism capable of matching the tracking target between a plurality of the images; and a restoration means for restoring the updated feature amount to the position information. According to such an information processing device, it is possible to accurately track a tracking target included in a video.

Description

Information processing device, information processing method, and recording medium

This disclosure relates to the technical fields of information processing devices, information processing methods, and recording media.

A known device of this type is one that performs tracking of objects contained in video. For example, Patent Document 1 discloses generating a first feature vector indicating the position information of an object in a first image and a second feature vector indicating the position information of the object in a second image, and using the first feature vector and the second feature vector to generate correspondence information indicating the correspondence between the objects, thereby tracking the objects.

International Publication No. 2021/130951

The objective of this disclosure is to provide an information processing device, an information processing method, and a recording medium that aim to improve upon the technology disclosed in prior art documents.

One aspect of the information processing device disclosed herein comprises an acquisition means for acquiring images from a video, a detection means for detecting the position of a tracked object contained in the images, a conversion means for converting position information relating to the position of the tracked object into a feature quantity indicating the characteristics of the position information, an update means for updating the feature quantity using a cross-attention mechanism capable of matching the tracked object between multiple images, and a restoration means for restoring the updated feature quantity to the position information.

One aspect of the information processing method disclosed herein involves using at least one computer to acquire images from a video, detect the position of a tracked object contained in the images, convert the position information regarding the position of the tracked object into features indicating the characteristics of the position information, update the features using a cross-attention mechanism capable of matching the tracked object across multiple images, and restore the updated features to the position information.

One aspect of the recording medium of this disclosure is a recording medium on which is recorded a computer program that causes at least one computer to execute an information processing method that acquires images from a video, detects the position of a tracked target included in the images, converts position information regarding the position of the tracked target into feature quantities that indicate the characteristics of the position information, updates the feature quantities using a cross-attention mechanism that can match the tracked target between multiple images, and restores the updated feature quantities to the position information.

FIG. 1 is a block diagram showing the hardware configuration of the first information processing device.
FIG. 2 is a conceptual diagram showing an example of a tracking technique using query propagation.
FIG. 3 is a block diagram showing the functional configuration of the first information processing device.
FIG. 4 is a flowchart showing the flow of the tracking process by the first information processing device.
FIG. 5 is a block diagram showing the configuration of the cross-attention mechanism in the first information processing device.
FIG. 6 is a plan view showing an example of the similarity matrix calculated by the cross-attention mechanism.
FIG. 7 is a block diagram showing the functional configuration of the second information processing device.
FIG. 8 is a conceptual diagram showing a method for generating learning data according to a comparative example.
FIG. 9 is a conceptual diagram showing a method for generating learning data according to the second information processing device.
FIG. 10 is a conceptual diagram showing query propagation in the learning operation by the second information processing device.
FIG. 11 is a flowchart showing the flow of the learning operation by the second information processing device.

Below, embodiments of an information processing device, an information processing method, and a recording medium will be described with reference to the drawings.

First Embodiment
The first information processing apparatus will be described with reference to FIGS. 1 to 6.

(Hardware configuration)
First, the hardware configuration of the first information processing apparatus will be described with reference to Fig. 1. Fig. 1 is a block diagram showing the hardware configuration of the first information processing apparatus.

As shown in FIG. 1, the first information processing device 1 includes a processor 11, a RAM (Random Access Memory) 12, a ROM (Read Only Memory) 13, a storage device 14, an input device 15, and an output device 16. The processor 11, RAM 12, ROM 13, storage device 14, input device 15, and output device 16 are each connected via a data bus 17. Note that the data bus 17 may be an interface other than a data bus (for example, a LAN, USB, etc.).

The processor 11 loads a computer program. For example, the processor 11 is configured to load a computer program stored in at least one of the RAM 12, ROM 13, and storage device 14. Alternatively, the processor 11 may load a computer program stored in a computer-readable storage medium using a storage medium reading device (not shown). The processor 11 may also obtain (i.e., load) the computer program from a device (not shown) located outside the first information processing device 1 via a network interface. The processor 11 performs various processes by executing the loaded computer program. In particular, in this embodiment, when the processor 11 executes the loaded computer program, functional blocks related to the tracking process performed by the first information processing device 1 are realized within the processor 11. In other words, the processor 11 may function as a controller that executes each control in the first information processing device 1.

Processor 11 may be configured as, for example, a CPU (Central Processing Unit), GPU (Graphics Processing Unit), FPGA (Field-Programmable Gate Array), DSP (Digital Signal Processor), ASIC (Application Specific Integrated Circuit), or quantum processor. Processor 11 may be configured as one of these, or as multiple processors operating in parallel.

RAM 12 temporarily stores computer programs executed by processor 11. RAM 12 temporarily stores data that processor 11 uses temporarily while it is executing a computer program. RAM 12 may be, for example, D-RAM (Dynamic Random Access Memory) or SRAM (Static Random Access Memory). Also, other types of volatile memory may be used instead of RAM 12.

ROM 13 stores computer programs executed by processor 11. ROM 13 may also store fixed data. ROM 13 may be, for example, a P-ROM (Programmable Read Only Memory) or an EPROM (Erasable Read Only Memory). Also, other types of non-volatile memory may be used instead of ROM 13.

The storage device 14 stores data that the first information processing device 1 saves over the long term. The storage device 14 may operate as a temporary storage device for the processor 11. The storage device 14 may also store computer programs executed by the processor 11. The storage device 14 may include, for example, at least one of a hard disk device, a magneto-optical disk device, an SSD (Solid State Drive), and a disk array device.

 入力装置15は、情報処理装置1のユーザからの入力指示を受け取る装置である。入力装置15は、例えば、キーボード、マウス及びタッチパネルのうちの少なくとも一つを含んでいてもよい。入力装置15は、例えばマイクを含む音声入力が可能な装置であってもよい。入力装置15は、例えばスマートフォン、タブレット、ノートパソコン等の各種端末として構成されていてもよい。 The input device 15 is a device that receives input instructions from a user of the information processing device 1. The input device 15 may include, for example, at least one of a keyboard, a mouse, and a touch panel. The input device 15 may also be a device that allows voice input, for example, including a microphone. The input device 15 may also be configured as various types of terminals, such as a smartphone, tablet, or laptop computer.

 出力装置16は、情報処理装置1に関する情報を外部に対して出力する装置である。例えば、出力装置16は、情報処理装置1に関する情報を表示可能な表示装置(例えば、ディスプレイやデジタルサイネージ等)であってもよい。また、出力装置16は、第1の情報処理装置1に関する情報を音声出力可能なスピーカ等であってもよい。出力装置16は、例えばスマートフォン、タブレット、ノートパソコン等の各種端末として構成されていてもよい。 The output device 16 is a device that outputs information related to the information processing device 1 to the outside. For example, the output device 16 may be a display device (e.g., a display or digital signage) that can display information related to the information processing device 1. The output device 16 may also be a speaker or the like that can output information related to the first information processing device 1 as audio. The output device 16 may be configured as various terminals, such as a smartphone, tablet, or laptop computer.

 なお、第1の情報処理装置1は、図1で説明した各構成要素の一部を含むものとして構成されてよい。例えば、第1の情報処理装置1は、プロセッサ11、RAM12、及びROM13を含んで構成されてよい。この場合、記憶装置14、入力装置15、及び出力装置16の各々は、第1の情報処理装置1と接続される外部の装置として構成されてもよい。また、第1の情報処理装置1における演算機能の一部は、外部サーバやクラウド等によって実現されてもよい。 The first information processing device 1 may be configured to include some of the components described in FIG. 1. For example, the first information processing device 1 may be configured to include a processor 11, RAM 12, and ROM 13. In this case, the storage device 14, input device 15, and output device 16 may each be configured as an external device connected to the first information processing device 1. Furthermore, some of the calculation functions of the first information processing device 1 may be realized by an external server, cloud, etc.

 (追跡手法)
 次に、図2を参照しながら、第1の情報処理装置1が実行する追跡処理の手法について説明する。図2は、クエリ伝搬を用いた追跡手法の一例を示す概念図である。
(Tracking method)
Next, a tracing process technique executed by the first information processing device 1 will be described with reference to Fig. 2. Fig. 2 is a conceptual diagram showing an example of a tracing technique using query propagation.

 図2において、第1の情報処理装置1は、動画に含まれる追跡対象を追跡する追跡処理を実行可能に構成されている。追跡対象は、例えば人物や動物であってもよいし、荷物や車などの物体であってもよい。第1の情報処理装置が実行する追跡処理では、動画から検出した追跡対象の位置情報に基づく特徴量が「検出クエリ」として取得される。そして、この検出クエリを用いて、対象を追跡するためのクエリである「追跡クエリ」が更新される。第1の情報処理装置1は、この追跡クエリを時系列で伝搬していくことにより追跡対象を追跡する。なお、これらのクエリは、追跡対象ごとに設定されるものである。このため、動画に複数の追跡対象が含まれている場合には、複数の追跡対象の各々に対応するクエリが更新されていく。 In FIG. 2, the first information processing device 1 is configured to be able to execute a tracking process for tracking a tracking target included in a video. The tracking target may be, for example, a person or an animal, or an object such as luggage or a car. In the tracking process executed by the first information processing device, a feature based on the position information of the tracking target detected from the video is acquired as a "detection query." This detection query is then used to update a "tracking query," which is a query for tracking the target. The first information processing device 1 tracks the tracking target by propagating this tracking query in chronological order. Note that these queries are set for each tracking target. Therefore, if a video includes multiple tracking targets, the queries corresponding to each of the multiple tracking targets are updated.

 例えば、図2に示す例では、時刻T1に撮影されたフレームから対象A及び対象Bが検出されている。このため、時刻T1の検出クエリには、対象A及びBの検出クエリが含まれている。そして、時刻T1の検出クエリを用いて追跡クエリが更新される。この結果、時刻T2の追跡クエリには、対象A及びBの追跡クエリが含まれている。 For example, in the example shown in Figure 2, objects A and B are detected from a frame captured at time T1. Therefore, the detection query at time T1 includes detection queries for objects A and B. The tracking query is then updated using the detection query at time T1, and as a result, the tracking query at time T2 includes tracking queries for objects A and B.

 続いて、時刻T2に撮影されたフレームからは、新たに対象Cが検出されている。このため、時刻T2の検出クエリには、対象Cの検出クエリが含まれている。そして、時刻T2の検出クエリを用いて追跡クエリが更新される。この結果、時刻T3の追跡クエリには、時刻T2の追跡クエリに含まれていた(言いかえれば、前の時刻から伝搬された)対象A及びBの追跡クエリと、新たに検出された対象Cの追跡クエリとが含まれている。 Subsequently, a new object C is detected in the frame captured at time T2. Therefore, the detection query at time T2 includes the detection query for object C. The tracking query is then updated using the detection query at time T2. As a result, the tracking query at time T3 includes the tracking queries for objects A and B that were included in the tracking query at time T2 (in other words, propagated from the previous time), as well as the tracking query for the newly detected object C.

 なお、時刻T3に撮影されたフレームからは、対象A、対象B及び対象Cがそれぞれ検出されているが、時刻T4に撮影されたフレームからは、対象A及び対象Cのみが検出され、対象Bは検出されていない。このため、時刻T4の追跡クエリからは、対象Bの追跡クエリが消失し、対象A及びCの追跡クエリが含まれている。 Note that in the frame captured at time T3, objects A, B, and C are each detected, but in the frame captured at time T4, only objects A and C are detected, and object B is not. As a result, the tracking query for object B has disappeared from the tracking query at time T4, and tracking queries for objects A and C are included.
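 By way of illustration only, the following is a minimal sketch of how per-target queries could be held and propagated frame by frame as described above. The class name TrackState, the function propagate, the feature size of 4, and the simple blending rule are assumptions made for this sketch and are not taken from the disclosure; the actual update is performed by the cross-attention mechanism 200 described later.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class TrackState:
    """Per-target record: one tracking query (feature vector) per tracked object ID."""
    queries: dict = field(default_factory=dict)  # target ID -> tracking query (np.ndarray)

def propagate(state: TrackState, detection_queries: dict, blend: float = 0.5) -> TrackState:
    """Carry existing tracking queries forward and merge in this frame's detection queries.

    IDs not seen before start a new track; IDs seen before are updated (here by a
    simple blend, purely as a placeholder for the cross-attention update).
    """
    for target_id, det_q in detection_queries.items():
        if target_id in state.queries:
            state.queries[target_id] = (1.0 - blend) * state.queries[target_id] + blend * det_q
        else:
            state.queries[target_id] = det_q  # newly detected target (e.g. target C at T2)
    return state

# Example: targets A and B are detected at T1, target C appears at T2.
state = TrackState()
state = propagate(state, {"A": np.ones(4), "B": np.zeros(4)})
state = propagate(state, {"C": np.full(4, 2.0)})
print(sorted(state.queries))  # ['A', 'B', 'C']
```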

 (機能的構成)
 次に、図3を参照しながら、上述した追跡処理を実行するための機能的構成について説明する。図3は、第1の情報処理装置の機能的構成を示すブロック図である。
(Functional configuration)
Next, the functional configuration for executing the above-mentioned tracking process will be described with reference to Fig. 3. Fig. 3 is a block diagram showing the functional configuration of the first information processing apparatus.

 図3に示すように、第1の情報処理装置1は、その機能を実現するための構成要素として、画像取得部110と、対象位置検出部120と、特徴量変換部130と、特徴量更新部140と、位置情報復元部150と、記憶部155と、を備えている。なお、画像取得部110、対象位置検出部120、特徴量変換部130、特徴量更新部140、及び位置情報復元部150の各々は、上述したプロセッサ11（図1参照）によって実現される処理ブロックであってよい。記憶部155は、上述した記憶装置14（図1参照）等によって実現されるものであってよい。 As shown in FIG. 3, the first information processing device 1 includes, as components for realizing its functions, an image acquisition unit 110, a target position detection unit 120, a feature conversion unit 130, a feature update unit 140, a location information restoration unit 150, and a storage unit 155. Each of the image acquisition unit 110, target position detection unit 120, feature conversion unit 130, feature update unit 140, and location information restoration unit 150 may be a processing block realized by the processor 11 described above (see FIG. 1). The storage unit 155 may be realized by the storage device 14 described above (see FIG. 1), etc.

 画像取得部110は、動画や画像を取得可能に構成されている。より具体的には、画像取得部110は、動画を構成する各フレームの画像を時系列で逐次的に取得してもよい。また、時系列で並んだ画像を取得してもよい。例えば、動画を構成する各フレームの画像から所定周期毎に画像を取得してもよい。なお、画像取得部110は、カメラで動画を撮像しながら、リアルタイムで画像を取得するように構成されてもよい。画像取得部110で取得された画像は、対象位置検出部120に出力される構成となっている。 The image acquisition unit 110 is configured to be able to acquire videos and images. More specifically, the image acquisition unit 110 may acquire images of each frame constituting a video sequentially in chronological order. Alternatively, it may acquire images arranged in chronological order. For example, images may be acquired from each frame constituting a video at predetermined intervals. The image acquisition unit 110 may also be configured to acquire images in real time while capturing a video with a camera. The images acquired by the image acquisition unit 110 are configured to be output to the target position detection unit 120.

 対象位置検出部120は、画像取得部110で取得した画像から、追跡対象の位置を検出可能に構成されている。対象位置検出部120は、例えば画像に含まれる人物を検出して、検出した人物の位置を示す位置情報を出力するように構成されてよい。なお、画像に複数の追跡対象が含まれている場合、対象位置検出部120は、複数の追跡対象の各々について位置情報を検出してよい。対象位置検出部120は、例えばニューラルネットワークで構成された学習済みの検出モデルとして構成されてもよい。対象位置検出部120で検出された追跡対象の位置情報は、特徴量変換部130に出力される構成となっている。 The target position detection unit 120 is configured to be able to detect the position of the tracking target from the image acquired by the image acquisition unit 110. The target position detection unit 120 may be configured to, for example, detect a person included in the image and output position information indicating the position of the detected person. Note that if the image includes multiple tracking targets, the target position detection unit 120 may detect position information for each of the multiple tracking targets. The target position detection unit 120 may be configured as a trained detection model made up of, for example, a neural network. The position information of the tracking target detected by the target position detection unit 120 is configured to be output to the feature conversion unit 130.

 特徴量変換部130は、対象位置検出部120で検出された追跡対象の位置情報を特徴量に変換可能に構成されている。ここでの特徴量は、追跡対象の位置情報を示す特徴ベクトルであってよい。特徴量変換部130は、複数の特徴抽出ブロックを備えたエンコーダとして構成されてよい。特徴量変換部130は、例えばニューラルネットワークにおける全結合層3層程度で構築されてよい。特徴量変換部130で変換された特徴量は、例えば追跡処理に用いる検出クエリ(図2参照)として用いられてよい。特徴量変換部130で変換された特徴量は、特徴量更新部140に出力される構成となっている。 The feature conversion unit 130 is configured to be able to convert the position information of the tracked target detected by the target position detection unit 120 into feature amounts. The feature amounts here may be feature vectors indicating the position information of the tracked target. The feature conversion unit 130 may be configured as an encoder equipped with multiple feature extraction blocks. The feature conversion unit 130 may be constructed, for example, with approximately three fully connected layers in a neural network. The feature amounts converted by the feature conversion unit 130 may be used, for example, as detection queries (see Figure 2) used in tracking processing. The feature amounts converted by the feature conversion unit 130 are configured to be output to the feature update unit 140.
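 As a rough illustration of such an encoder, the sketch below maps a detected position (assumed here to be a four-value bounding box) to a feature vector with about three fully connected layers. The input and output dimensions, the ReLU activations, and the use of PyTorch are assumptions; the description only specifies an encoder of roughly three fully connected layers.

```python
import torch
import torch.nn as nn

class PositionEncoder(nn.Module):
    """Encoder sketch: position information (here a 4-value box) -> feature vector."""
    def __init__(self, in_dim: int = 4, hidden_dim: int = 64, feat_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim),   # roughly three fully connected layers
        )

    def forward(self, positions: torch.Tensor) -> torch.Tensor:
        return self.net(positions)             # (num_targets, feat_dim) detection queries

boxes = torch.tensor([[0.1, 0.2, 0.3, 0.5], [0.6, 0.1, 0.2, 0.4]])  # two detected targets
print(PositionEncoder()(boxes).shape)          # torch.Size([2, 64])
```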

 特徴量更新部140は、特徴量変換部130で変換された特徴量を、交差注意機構(Cross‐attention機構)200を用いて更新可能に構成されている。交差注意機構200は、複数の画像間で追跡対象の位置情報を照合する機能を有している。具体的には、交差注意機構200は、時系列で取得される連続した画像の各々に含まれる対象について、どの対象とどの対象とが同一人物であるかの対応付けを行う機能を有している。交差注意機構200の具体的な構成については、後に詳しく説明する。特徴量更新部140で更新された特徴量(以下、適宜「更新特徴量」と称する)は、例えば追跡処理に用いる追跡クエリ(図2参照)として用いられてよい。この場合、更新特徴量は、追跡クエリを記憶する記憶部155によって一時的に記憶されてよい。また、更新特徴量は、位置情報復元部150に出力される構成となっている。 The feature update unit 140 is configured to be able to update the features converted by the feature conversion unit 130 using a cross-attention mechanism 200. The cross-attention mechanism 200 has the function of comparing the position information of the tracked target between multiple images. Specifically, the cross-attention mechanism 200 has the function of associating targets included in each of consecutive images acquired in time series to determine whether they are the same person. The specific configuration of the cross-attention mechanism 200 will be explained in detail later. The features updated by the feature update unit 140 (hereinafter referred to as "updated features") may be used, for example, as a tracking query (see Figure 2) used in the tracking process. In this case, the updated features may be temporarily stored in the memory unit 155, which stores tracking queries. The updated features are also configured to be output to the position information restoration unit 150.

 位置情報復元部150は、特徴量更新部140で更新された更新特徴量を位置情報に復元可能に構成されている。位置情報復元部150は、複数の特徴抽出ブロックを備えたデコーダとして構成されてよい。位置情報復元部150は、例えばニューラルネットワークにおける全結合層3層程度で構築されてよい。位置情報復元部150は、復元した位置情報を出力する機能を有していてもよい。 The location information restoration unit 150 is configured to be able to restore the updated features updated by the feature update unit 140 to location information. The location information restoration unit 150 may be configured as a decoder equipped with multiple feature extraction blocks. The location information restoration unit 150 may be constructed, for example, with approximately three fully connected layers in a neural network. The location information restoration unit 150 may also have a function to output the restored location information.
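 A corresponding sketch of the decoder, again assuming about three fully connected layers and a four-value box as the restored position information, might look as follows; the dimensions and activations are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PositionDecoder(nn.Module):
    """Decoder sketch: updated feature vector -> restored position information."""
    def __init__(self, feat_dim: int = 64, hidden_dim: int = 64, out_dim: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),    # again about three fully connected layers
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features)              # (num_targets, 4) restored boxes

print(PositionDecoder()(torch.rand(2, 64)).shape)  # torch.Size([2, 4])
```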

 記憶部155は、追跡対象毎に、特徴量及び位置情報を記憶するように構成されている。例えば、記憶部155は、追跡対象毎に検出クエリや追跡クエリを記憶するように構成されてもよい。例えば、記憶部155は、複数のメモリ領域を備え、1つのメモリ領域に対してある1人の人物に対して追跡を行った追跡クエリを時系列で蓄積するよう保存してもよい。また、追跡対象毎に対応するIDに紐づけて、特徴量及び位置情報を記憶するように構成されてもよい。 The storage unit 155 is configured to store feature amounts and location information for each tracking target. For example, the storage unit 155 may be configured to store detection queries and tracking queries for each tracking target. For example, the storage unit 155 may have multiple memory areas, and store tracking queries for tracking a single person in one memory area so that they are accumulated in chronological order. The storage unit 155 may also be configured to store feature amounts and location information for each tracking target, linked to the corresponding ID.
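 A minimal sketch of such per-ID, time-ordered storage is shown below; the class name QueryMemory and the dictionary-of-lists layout are assumptions made only to illustrate the description above.

```python
from collections import defaultdict
import numpy as np

class QueryMemory:
    """Sketch of the storage unit: per-target-ID, time-ordered list of queries."""
    def __init__(self):
        self._store = defaultdict(list)        # target ID -> [entry at t1, entry at t2, ...]

    def append(self, target_id: int, query: np.ndarray, position: np.ndarray) -> None:
        self._store[target_id].append({"query": query, "position": position})

    def latest(self, target_id: int) -> dict:
        return self._store[target_id][-1]

memory = QueryMemory()
memory.append(0, np.random.rand(64), np.array([0.1, 0.2, 0.3, 0.5]))
print(memory.latest(0)["position"])
```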

 (追跡処理の流れ)
 次に、図4を参照しながら、第1の情報処理装置1が実行する追跡処理の流れについて説明する。図4は、第1の情報処理装置による追跡処理の流れを示すフローチャートである。
(Tracking process flow)
Next, the flow of the tracking process executed by the first information processing device 1 will be described with reference to Fig. 4. Fig. 4 is a flowchart showing the flow of the tracking process by the first information processing device.

 図4に示すように、第1の情報処理装置1による追跡処理が開始されると、まず画像取得部110が、動画から画像を取得する（ステップS101）。そして、対象位置検出部120が、画像取得部110で取得した画像から、画像に含まれる追跡対象の位置を検出する（ステップS102）。 As shown in FIG. 4, when the tracking process by the first information processing device 1 begins, the image acquisition unit 110 first acquires an image from the video (step S101). Then, the target position detection unit 120 detects the position of the tracking target contained in the image from the image acquired by the image acquisition unit 110 (step S102).

 続いて、特徴量変換部130が、対象位置検出部120で検出された追跡対象の位置情報を特徴量に変換する（ステップS103）。そして、特徴量更新部140は、特徴量変換部130で変換された特徴量を、交差注意機構200を用いて更新する（ステップS104）。 Next, the feature conversion unit 130 converts the position information of the tracked target detected by the target position detection unit 120 into a feature (step S103). Then, the feature update unit 140 updates the feature converted by the feature conversion unit 130 using the cross-attention mechanism 200 (step S104).

 その後、位置情報復元部150は、特徴量更新部140で更新された更新特徴量を位置情報に復元する(ステップS105)。その後、第1の情報処理装置1は、追跡処理を終了するか否かを判定する(ステップS106)。 Then, the location information restoration unit 150 restores the updated feature amounts updated by the feature amount update unit 140 into location information (step S105). Then, the first information processing device 1 determines whether to end the tracking process (step S106).

 追跡処理を終了しない場合（ステップS106：NO）、再びステップS101からの処理が実行されてよい。即ち、次のフレームの画像を取得して、上述した処理が繰り返し実行されてよい。一方、追跡処理を終了する場合（ステップS106：YES）、一連の処理は終了することになる。 If the tracking process is not to be ended (step S106: NO), the process may be executed again from step S101. That is, the next frame image may be acquired, and the above-described process may be executed repeatedly. On the other hand, if the tracking process is to be ended (step S106: YES), the series of processes will end.
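 The flow of steps S101 to S106 can be summarized as a per-frame loop. The sketch below strings the components together; run_tracking, _passthrough, and the other stand-in callables are placeholders introduced for illustration and do not reproduce the actual detection, encoding, or update logic.

```python
import numpy as np

def run_tracking(frames, detect_positions, encode, update_with_cross_attention, decode):
    """Per-frame tracking loop mirroring steps S101-S106 (sketch, not the disclosed implementation)."""
    tracking_queries = None                       # propagated across frames
    outputs = []
    for frame in frames:                          # S101: acquire an image from the video
        positions = detect_positions(frame)       # S102: detect tracked-object positions
        detection_queries = encode(positions)     # S103: positions -> features
        tracking_queries = update_with_cross_attention(detection_queries, tracking_queries)  # S104
        outputs.append(decode(tracking_queries))  # S105: features -> positions
    return outputs                                # S106: loop ends when the frames run out

def _passthrough(x, *_):
    return x

frames = [np.zeros((8, 8)) for _ in range(3)]
results = run_tracking(frames, lambda f: np.random.rand(2, 4), _passthrough, _passthrough, _passthrough)
print(len(results))  # 3
```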

 (交差注意機構)
 次に、図5を参照しながら、交差注意機構200の構成及び動作について説明する。図5は、第1の情報処理装置における交差注意機構の構成を示すブロック図である。
(Cross-Attention Mechanism)
Next, the configuration and operation of the cross-attention mechanism 200 will be described with reference to Fig. 5. Fig. 5 is a block diagram showing the configuration of the cross-attention mechanism in the first information processing apparatus.

 図5に示すように、交差注意機構200は、クエリ、キー、バリューの各々に対応する3つの特徴埋込処理部210、220、及び230と、行列積演算部240と、正規化部250と、行列積演算部260と、残差処理部270と、記憶更新部280とを備えている。 As shown in FIG. 5, the cross-attention mechanism 200 includes three feature embedding units 210, 220, and 230 corresponding to the query, key, and value, respectively, a matrix multiplication unit 240, a normalization unit 250, a matrix multiplication unit 260, a residual processing unit 270, and a memory update unit 280.

 特徴埋込処理部210は、特徴量変換部130から入力される時刻tの特徴量(即ち、時刻tに撮影されたフレームに対応する特徴量)からクエリを抽出可能に構成されている。特徴埋込処理部220は、過去の追跡処理において演算された時刻t-τの特徴量(即ち、時刻tよりも前の時刻t-τに撮影されたフレームに対応する特徴量)からキーを抽出可能に構成されている。特徴埋込処理部230は、過去の追跡処理において演算された時刻t-τの特徴量からバリューを抽出可能に構成されている。クエリ及びキーは、行列積演算部240に出力される構成となっている。他方、バリューは、行列積演算部260に出力される構成となっている The feature embedding processor 210 is configured to extract a query from the feature values at time t (i.e., the feature values corresponding to the frame captured at time t) input from the feature converter 130. The feature embedding processor 220 is configured to extract a key from the feature values at time t-τ calculated in a past tracking process (i.e., the feature values corresponding to the frame captured at time t-τ, prior to time t). The feature embedding processor 230 is configured to extract a value from the feature values at time t-τ calculated in a past tracking process. The query and key are configured to be output to the matrix multiplication processor 240. On the other hand, the value is configured to be output to the matrix multiplication processor 260.

 行列積演算部240は、クエリ及びキーの行列積を演算することで、クエリ及びキーの相関関係を示す重み（Attention Weight）を算出可能に構成されている。即ち、行列積演算部240は、時刻tに撮影されたフレームに対応する特徴量と、時刻t-τに撮影されたフレームに対応する特徴量との相関関係を示す重みを算出可能に構成されている。行列積演算部240は、例えば、時刻tに撮影されたフレームに対応する特徴量を縦軸、時刻t-τに撮影されたフレームに対応する特徴量を横軸とする類似性行列（Affinity Matrix）を交差注意機構200の重み（Attention Weight）として算出（使用）してもよい。 The matrix multiplication calculation unit 240 is configured to calculate a weight (Attention Weight) indicating the correlation between the query and the key by calculating the matrix product of the query and the key. In other words, the matrix multiplication calculation unit 240 is configured to calculate a weight indicating the correlation between the feature corresponding to the frame captured at time t and the feature corresponding to the frame captured at time t-τ. For example, the matrix multiplication calculation unit 240 may calculate (use) an affinity matrix (Affinity Matrix) in which the vertical axis represents the feature corresponding to the frame captured at time t and the horizontal axis represents the feature corresponding to the frame captured at time t-τ as the weight (Attention Weight) of the cross-attention mechanism 200.

 正規化部250は、行列積演算部240で演算した重みに対して正規化処理を実行可能に構成されている。正規化部250は、例えば、行列積演算部240が算出した類似性行列を、クロスソフトマックス(Cross‐softmax)関数を用いて正規化する処理を実行してよい。正規化部250が正規化した重みは、行列積演算部260に出力される構成となっている。 The normalization unit 250 is configured to be able to perform normalization processing on the weights calculated by the matrix multiplication unit 240. The normalization unit 250 may, for example, perform processing to normalize the similarity matrix calculated by the matrix multiplication unit 240 using a cross-softmax function. The weights normalized by the normalization unit 250 are configured to be output to the matrix multiplication unit 260.
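 Taken together, the weight computation of block 240 and the normalization of block 250 might be sketched as follows. The exact form of the cross-softmax is not spelled out in the description, so the reading used here (a softmax along each axis of the affinity matrix, combined by element-wise multiplication) is an assumption, as are the feature dimensions.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)       # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def affinity_and_weight(feat_t: np.ndarray, feat_prev: np.ndarray) -> np.ndarray:
    """feat_t: (N_t, d) features at time t; feat_prev: (N_prev, d) features at time t - tau.

    Returns a normalized attention weight with the past targets on the vertical axis and
    the current detections on the horizontal axis, as in the affinity matrix of Fig. 6.
    """
    affinity = feat_prev @ feat_t.T               # matrix product -> (N_prev, N_t)
    # Assumed reading of "cross-softmax": normalize along both axes and combine.
    return softmax(affinity, axis=0) * softmax(affinity, axis=1)

prev = np.random.rand(3, 8)   # three targets at time t - tau
curr = np.random.rand(4, 8)   # four targets at time t
print(affinity_and_weight(curr, prev).shape)  # (3, 4)
```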

 行列積演算部260は、正規化部250からの出力と、バリューとの行列積を演算することで、重みをバリューに反映する処理を実行可能に構成されている。なお、本実施形態における行列積は、典型的には、テンソル積(言い換えれば、直積)であってよい。例えば、行列積は、クロネッカー積であってもよい。行列積演算部260の演算結果は、残差処理部270に出力される構成となっている。 The matrix multiplication unit 260 is configured to perform processing to reflect the weight in the value by calculating the matrix product of the output from the normalization unit 250 and the value. Note that the matrix product in this embodiment may typically be a tensor product (in other words, a direct product). For example, the matrix product may be a Kronecker product. The calculation result of the matrix multiplication unit 260 is configured to be output to the residual processing unit 270.

 残差処理部270は、行列積演算部260の演算結果に対して、残差処理を実行可能に構成されている。この残差処理は、行列積演算部260の演算結果と、交差注意機構200に入力された特徴量（具体的には、時刻tの特徴量）とを加算する処理であってよい。これは、相関関係が仮に算出されなかった場合でも、交差注意機構200の演算結果としての特徴量が生成されなくなるのを防ぐためである。例えば、相関関係（重み）として0が算出されると、バリュー値に対してその0が乗算されることにより、行列積演算部260の演算結果における特徴値が0となる（消失する）ことになる。これを防ぐために、残差処理部270は上述した残差処理を実行する。残差処理部270の演算結果は、更新された時刻tの特徴量として交差注意機構200から出力されることになる。 The residual processing unit 270 is configured to perform residual processing on the calculation results of the matrix multiplication calculation unit 260. This residual processing may be a process of adding the calculation results of the matrix multiplication calculation unit 260 and the feature values input to the cross-attention mechanism 200 (specifically, the feature values at time t). This is to prevent the feature values from not being generated as the calculation results of the cross-attention mechanism 200 even if a correlation is not calculated. For example, if 0 is calculated as the correlation (weight), the value will be multiplied by that 0, causing the feature value in the calculation results of the matrix multiplication calculation unit 260 to become 0 (disappear). To prevent this, the residual processing unit 270 performs the residual processing described above. The calculation results of the residual processing unit 270 are output from the cross-attention mechanism 200 as the updated feature values at time t.

 記憶更新部280は、記憶されている追跡対象に対応する特徴量を更新する。記憶更新部280は、行列積演算部260にて、出力された更新特徴量に対応する記憶手段に記憶されている特徴量のみを更新してもよいし、残差処理部270の演算結果によって出力された特徴量を記憶手段に上書きして更新してもよい。例えば、行列積演算部240にてクエリ及びキーから演算された重みによって、追跡対象が特定され、記憶部155に記憶されている複数の追跡対象からどの追跡対象を更新するか決定されてもよい。さらに、行列積演算部260にて、正規化された重み及びバリューから演算された更新特徴量が、記憶部155に記憶されている追跡対象の特徴量の更新量として決定されてもよい。 The memory update unit 280 updates the stored feature quantities corresponding to the tracked object. The memory update unit 280 may update only the feature quantities stored in the memory means corresponding to the updated feature quantities output by the matrix multiplication calculation unit 260, or may overwrite the feature quantities output by the calculation results of the residual processing unit 270 in the memory means to update them. For example, the tracked object may be identified by weights calculated from the query and key by the matrix multiplication calculation unit 240, and it may be determined which tracked object to update from the multiple tracked objects stored in the memory unit 155. Furthermore, the updated feature quantities calculated by the matrix multiplication calculation unit 260 from the normalized weights and values may be determined as the updated quantities for the feature quantities of the tracked object stored in the memory unit 155.
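 Putting blocks 210 to 280 together, one cross-attention update could be sketched as below. The learned projection matrices W_q, W_k, and W_v, the feature dimension, and the normalization are illustrative assumptions; block 280 (writing the result back to the storage unit 155) is only indicated by a comment.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                             # feature dimension (assumed)
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_update(feat_t: np.ndarray, feat_prev: np.ndarray) -> np.ndarray:
    """One update of the time-t features using the time-(t - tau) features.

    feat_t:    (N_t, d) features from the current frame (queries, block 210).
    feat_prev: (N_prev, d) stored features from a past frame (keys/values, blocks 220/230).
    """
    q = feat_t @ W_q                              # feature embedding for the query
    k = feat_prev @ W_k                           # feature embedding for the key
    v = feat_prev @ W_v                           # feature embedding for the value
    weight = q @ k.T                              # block 240: (N_t, N_prev) attention weight
    weight = softmax(weight, axis=0) * softmax(weight, axis=1)  # block 250 (assumed cross-softmax)
    attended = weight @ v                         # block 260: reflect the weight onto the value
    updated = attended + feat_t                   # block 270: residual, keeps features from vanishing
    return updated                                # block 280 would write this back to storage

feat_t = rng.standard_normal((4, d))              # four detections at time t
feat_prev = rng.standard_normal((3, d))           # three stored tracks at time t - tau
print(cross_attention_update(feat_t, feat_prev).shape)  # (4, 8)
```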

 なお、第1の情報処理装置1は、実質的には、追跡処理と交差注意機構200で行われる動作とが似ていることに着目し、物体を照合する際に生成される情報を用いて特徴量を更新する動作を行っていると言える。例えば、追跡処理では、追跡対象を検出する処理、追跡対象を照合する処理、及び追跡対象の検出結果を更新する処理が行われる。一方で、交差注意機構200では、追跡対象に関する特徴量を抽出する処理、重みを算出する処理、及び追跡対象に関する特徴量を更新する処理が行われる。第1の情報処理装置1は、交差注意機構200において重みを算出する処理を、実質的には、追跡処理において追跡対象を照合する処理としても流用している。言い換えれば、第1の情報処理装置1は、追跡処理において追跡対象を照合する処理を、実質的には、交差注意機構200において重みを算出する処理としても流用している。従って、第1の情報処理装置1は、物体を検出する動作、物体を照合する動作、及び検出結果を更新する動作を、交差注意機構200を用いて実現しているとも言える。 Note that the first information processing device 1 essentially focuses on the similarity between the tracking process and the operations performed by the cross-attention mechanism 200, and can be said to perform an operation to update features using information generated when matching objects. For example, the tracking process involves a process to detect the tracked object, a process to match the tracked object, and a process to update the tracked object detection results. On the other hand, the cross-attention mechanism 200 involves a process to extract features related to the tracked object, a process to calculate weights, and a process to update the features related to the tracked object. The first information processing device 1 essentially reuses the process of calculating weights in the cross-attention mechanism 200 as the process of matching the tracked object in the tracking process. In other words, the first information processing device 1 essentially reuses the process of matching the tracked object in the tracking process as the process of calculating weights in the cross-attention mechanism 200. Therefore, it can also be said that the first information processing device 1 realizes the operations of detecting an object, matching the object, and updating the detection results using the cross-attention mechanism 200.

 より具体的には、交差注意機構200は、上述したように時刻tに撮影されたフレームに対応する特徴量をクエリとし、時刻tよりも前の時刻t-τまでに撮影されたフレームに含まれる特徴量を記憶した記憶部155からキー及びバリューとして取得し、用いることで追跡対象を追跡する処理を行っている。また、記憶更新部280は、残差処理部270の演算結果によって出力された特徴量を記憶手段に上書きして更新してもよい。また、行列積演算部240にて演算された重みから、追跡対象を特定し、記憶手段に記憶されている特定された追跡対象のIDに対応する特徴量を行列積演算部260にて出力された更新特徴量を用いて更新してもよい。このようにすれば、動画に含まれる追跡対象に対する追跡処理を、比較的簡素なアルゴリズムで精度よく実行することが可能である。 More specifically, as described above, the cross-attention mechanism 200 uses the feature quantity corresponding to the frame captured at time t as a query, and obtains and uses the feature quantities contained in frames captured up to time t-τ before time t as keys and values from the memory unit 155, which stores these feature quantities, to track the target. The memory update unit 280 may also update the memory means by overwriting the feature quantity output by the calculation results of the residual processing unit 270. Alternatively, the target may be identified from the weights calculated by the matrix multiplication calculation unit 240, and the feature quantity corresponding to the ID of the identified target, which is stored in the memory means, may be updated using the updated feature quantity output by the matrix multiplication calculation unit 260. In this way, it is possible to accurately track a target included in a video using a relatively simple algorithm.

 (類似性行列)
 次に、図6を参照しながら、上述した交差注意機構200が算出する類似性行列について具体的に説明する。図6は、交差注意機構により算出される類似性行列の一例を示す平面図である。
(similarity matrix)
Next, the similarity matrix calculated by the above-described intersection attention mechanism 200 will be specifically described with reference to Fig. 6. Fig. 6 is a plan view showing an example of the similarity matrix calculated by the intersection attention mechanism.

 図6に示すように、交差注意機構200が重みとして用いる類似性行列AMは、時刻t-τの追跡対象Ot-τと、時刻tの追跡対象Oとの対応関係を示す情報となる。例えば、類似性行列AMは、(1)複数の追跡対象Ot-τのうちの第1の追跡対象Ot-τが、複数の追跡対象Oのうちの第1の追跡対象Oに対応しており(つまり、両者が同一の人物であり)、(2)複数の追跡対象Ot-τのうちの第2の追跡対象Ot-τが、複数の追跡対象Oのうちの第2の追跡対象Oに対応しており、・・・、(N)複数の追跡対象Ot-τのうちの第Nの追跡対象Ot-τが、複数の追跡対象Oのうちの第Nの追跡対象Oに対応していることを示す情報となる。なお、類似性行列AMは、追跡対象Ot-τと追跡対象Oとの対応関係を示す情報であるがゆえに、対応情報と称してもよい。 As shown in FIG. 6, the similarity matrix AM used as a weight by the cross attention mechanism 200 is information indicating the correspondence between the tracking target O t- τ at time t-τ and the tracking target O t at time t. For example, the similarity matrix AM is information indicating that (1) a first tracking target O t among the multiple tracking targets O t- τ corresponds to a first tracking target O t among the multiple tracking targets O t (that is, both are the same person), (2) a second tracking target O t-τ among the multiple tracking targets O t- τ corresponds to a second tracking target O t among the multiple tracking targets O t, ..., (N) an Nth tracking target O t-τ among the multiple tracking targets O t- τ corresponds to an Nth tracking target O t among the multiple tracking targets O t. Note that the similarity matrix AM is information indicating the correspondence between the tracking target O t-τ and the tracking target O t , and therefore may be referred to as correspondence information.

 具体的には、類似性行列AMは、その縦軸が特徴ベクトルCVt-τのベクトル成分に対応しており且つその横軸が特徴ベクトルCVのベクトル成分に対応している行列であるとみなすことができる。このため、類似性行列AMの縦軸のサイズは、特徴ベクトルCVt-τのサイズであり、時刻t-τに撮影された画像のサイズ(つまり、画素数)に対応するサイズ)になる。同様に、類似性行列AMの横軸のサイズは、特徴ベクトルCVのサイズであり、時刻tに撮影された画のサイズ(つまり、画素数)に対応するサイズ)になる。言い換えれば、類似性行列AMは、その縦軸が時刻t-τの画像に映り込んでいる追跡対象Ot-τの検出結果(つまり、追跡対象Ot-τの検出位置)に対応しており、且つ、その横軸が時刻tの画像に映り込んでいる追跡対象Oの検出結果(つまり、追跡対象Oの検出位置)に対応している行列であるとみなすことができる。この場合、縦軸上のある追跡対象Ot-τに対応するベクトル成分と横軸上の同じ追跡対象Oに対応するベクトル成分とが交差する位置において、類似性行列AMの要素が反応する(典型的には、0でない値を有する)。言い換えれば、縦軸上の追跡対象Ot-τの検出結果と横軸上の追跡対象Oの検出結果とが交差する位置において、類似性行列AMの要素が反応する。つまり、類似性行列AMは、典型的には、特徴ベクトルCVt-τに含まれる追跡対象Ot-τに対応するベクトル成分と特徴ベクトルCVに含まれる同じ追跡対象Oに対応するベクトル成分とが交差する位置の要素の値が、両ベクトル成分を掛け合わせることで得られる値(つまり、0ではない値)となる一方で、それ以外の要素の値が0になる行列となる。 Specifically, the similarity matrix AM can be considered to be a matrix whose vertical axis corresponds to the vector components of the feature vector CV t-τ and whose horizontal axis corresponds to the vector components of the feature vector CV t . Therefore, the size of the vertical axis of the similarity matrix AM is the size of the feature vector CV t-τ , which corresponds to the size of the image captured at time t-τ (i.e., the number of pixels). Similarly, the size of the horizontal axis of the similarity matrix AM is the size of the feature vector CV t , which corresponds to the size of the image captured at time t (i.e., the number of pixels). In other words, the similarity matrix AM can be considered to be a matrix whose vertical axis corresponds to the detection result of the tracking target O t-τ reflected in the image at time t-τ (i.e., the detected position of the tracking target O t-τ ) and whose horizontal axis corresponds to the detection result of the tracking target O t reflected in the image at time t (i.e., the detected position of the tracking target O t ). In this case, an element of the similarity matrix AM reacts (typically has a non-zero value) at a position where a vector component on the vertical axis corresponding to a certain tracking target O t-τ intersects with a vector component on the horizontal axis corresponding to the same tracking target O t. In other words, an element of the similarity matrix AM reacts at a position where a detection result for tracking target O t-τ on the vertical axis intersects with a detection result for tracking target O t on the horizontal axis. In other words, the similarity matrix AM is typically a matrix in which the value of an element at a position where a vector component corresponding to tracking target O t- τ included in feature vector CV t- τ intersects with a vector component corresponding to the same tracking target O t included in feature vector CV t is a value obtained by multiplying both vector components (i.e., a non-zero value), while the values of the other elements are 0.

 例えば、図6に示す例では、特徴ベクトルCVt-τに含まれる追跡対象O#k（但し、kは、検出された追跡対象Oの数であり、図6に示す例では、k=1、2、3又は4）に対応するベクトル成分と特徴ベクトルCVに含まれる同じ追跡対象O#kに対応するベクトル成分とが交差する位置において、類似性行列AMの要素が反応する。つまり、時刻t-τに撮影された画像に映り込んだ追跡対象O#kの検出結果と、時刻tに撮影された画像に映り込んだ追跡対象O#kの検出結果とが交差する位置において、類似性行列AMの要素が反応する。 For example, in the example shown in Fig. 6, the elements of the similarity matrix AM react at positions where vector components corresponding to tracking target O#k (where k is the number of detected tracking targets O, and in the example shown in Fig. 6, k = 1, 2, 3, or 4) included in the feature vector CVt-τ intersect with vector components corresponding to the same tracking target O#k included in the feature vector CVt. In other words, the elements of the similarity matrix AM react at positions where the detection result of tracking target O#k reflected in the image captured at time t-τ intersects with the detection result of tracking target O#k reflected in the image captured at time t.

 逆に、特徴ベクトルCVt-τに含まれる追跡対象Ot-τに対応するベクトル成分と特徴ベクトルCVに含まれる同じ追跡対象Oに対応するベクトル成分とが交差する位置において類似性行列AMの要素が反応しない(典型的には、0になる)場合には、時刻t-τに撮影された画像に映り込んでいた追跡対象Ot-τは、時刻tに撮影された画像には映り込んでいない(例えば、カメラの撮影画角外へ出てしまった)と推定される。 Conversely, if the elements of the similarity matrix AM do not react (typically become 0) at the position where the vector component corresponding to the tracking target O t-τ included in the feature vector CV t- τ intersects with the vector component corresponding to the same tracking target O t included in the feature vector CV t, it is estimated that the tracking target O t-τ that was reflected in the image captured at time t-τ is not reflected in the image captured at time t (for example, it has moved outside the angle of view of the camera).

 このように、類似性行列AMは、追跡対象Ot-τと追跡対象Oとの対応関係を示す情報として利用可能である。つまり、類似性行列AMは、時刻t-τに撮影された画像に映り込んでいる追跡対象Ot-τと、時刻tに撮影された画像に映り込んでいる物体Oとの照合結果を示す情報として利用可能である。よって、類似性行列AMは、時刻t-τに撮影された画像に映り込んでいた追跡対象Ot-τの、時刻tに撮影された画像内での位置を追跡するための情報として利用可能である。このように類似性行列AMを用いれば、動画に含まれている追跡対象の追跡処理を精度よく実行することが可能である。 In this way, the similarity matrix AM can be used as information indicating the correspondence between the tracking target O t-τ and the tracking target O t . In other words, the similarity matrix AM can be used as information indicating the result of matching the tracking target O t-τ reflected in the image captured at time t-τ with the object O t reflected in the image captured at time t. Therefore, the similarity matrix AM can be used as information for tracking the position of the tracking target O t-τ reflected in the image captured at time t-τ within the image captured at time t. By using the similarity matrix AM in this way, it is possible to accurately perform tracking processing of the tracking target included in the video.
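 As a toy illustration of how correspondence could be read out of such a matrix, the sketch below assigns each past target to the current detection whose entry responds most strongly and treats rows with no response as targets that have left the frame. The threshold value and the greedy argmax rule are assumptions made for this sketch; the description does not prescribe a specific read-out procedure.

```python
import numpy as np

def read_matches(affinity: np.ndarray, threshold: float = 0.1):
    """affinity[i, j]: response between past target i (time t - tau) and detection j (time t)."""
    matches, lost = {}, []
    for i, row in enumerate(affinity):
        j = int(row.argmax())
        if row[j] > threshold:
            matches[i] = j                        # same object seen in both frames
        else:
            lost.append(i)                        # no response: e.g. the target left the field of view
    return matches, lost

# Targets 0 and 1 reappear as detections 1 and 0; target 2 has no counterpart at time t.
affinity = np.array([[0.0, 0.9, 0.0],
                     [0.8, 0.0, 0.0],
                     [0.0, 0.0, 0.02]])
print(read_matches(affinity))  # ({0: 1, 1: 0}, [2])
```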

 (技術的効果)
 次に、第1の情報処理装置1によって得られる技術的効果について説明する。
(Technical effect)
Next, the technical effects obtained by the first information processing device 1 will be described.

 図1から図6で説明したように、第1の情報処理装置1では、動画から取得された画像における対象の位置情報に関する特徴量が、対象を照合する機能を有する交差注意機構200を用いて更新される。第1の情報処理装置1では、この交差注意機構200の動作を利用して、追跡対象に対する追跡処理が実行される。このようにすれば、本実施形態で説明した交差注意機構200を使用しない場合と比較して、より適切に追跡処理を実行することが可能となる。例えば、追跡処理に自己注意機構(Self‐attention機構)を利用しようとする場合、同一の追跡対象間で自己注意機構における重みが強く反応するような学習が必要とされる。しかしながら、このような構成を実現するためには、大量の自己注意機構が要求されてしまい、追跡処理に用いるアルゴリズムが複雑化してしまうという技術的問題点が生ずる。しかるに本実施形態で説明した交差注意機構200を用いれば、追跡処理のアルゴリズムを簡素な構造で構築できるため、計算コストを抑制しつつ高精度な追跡処理を実現することが可能となる。 1 to 6, in the first information processing device 1, feature quantities related to the position information of an object in an image acquired from a video are updated using a cross-attention mechanism 200 having the function of matching objects. In the first information processing device 1, the operation of this cross-attention mechanism 200 is used to perform tracking processing for the tracked object. In this way, tracking processing can be performed more appropriately compared to when the cross-attention mechanism 200 described in this embodiment is not used. For example, when attempting to use a self-attention mechanism for tracking processing, learning is required so that the weights in the self-attention mechanism react strongly between the same tracked objects. However, realizing such a configuration requires a large number of self-attention mechanisms, which poses a technical problem in that the algorithm used for tracking processing becomes complicated. However, by using the cross-attention mechanism 200 described in this embodiment, the tracking processing algorithm can be constructed with a simple structure, making it possible to achieve highly accurate tracking processing while suppressing computational costs.

 <第2実施形態>
 第2の情報処理装置1について、図7から図11を参照して説明する。なお、第2の情報処理装置1は、上述した第1の情報処理装置1と比べて一部の構成及び動作が異なるものであり、その他の部分については第1の情報処理装置1と同様であってよい。このため、以下では、第1実施形態と異なる部分について詳しく説明し、他の重複する部分については適宜説明を省略するものとする。
Second Embodiment
The second information processing device 1 will be described with reference to Figures 7 to 11. The second information processing device 1 differs in some configurations and operations from the first information processing device 1 described above, but other parts may be similar to the first information processing device 1. Therefore, the following will describe in detail the parts that differ from the first embodiment, and will omit explanations of other overlapping parts as appropriate.

 (機能的構成)
 まず、図7を参照しながら、第2の情報処理装置1の機能的構成について説明する。図7は、第2の情報処理装置の機能的構成を示すブロック図である。
(Functional configuration)
First, the functional configuration of the second information processing device 1 will be described with reference to Fig. 7. Fig. 7 is a block diagram showing the functional configuration of the second information processing device.

 図7に示すように、第2の情報処理装置1は、その機能を実現するための構成要素として、画像取得部110と、対象位置検出部120と、特徴量変換部130と、特徴量更新部140と、位置情報復元部150と、学習部160と、を備えている。即ち、第2の情報処理装置1は、すでに第1実施形態で説明した構成(図3参照)に加えて、学習部160を更に備えている。なお、学習部160は、上述したプロセッサ11(図1参照)によって実現される処理ブロックであってよい。 As shown in FIG. 7, the second information processing device 1 includes, as components for realizing its functions, an image acquisition unit 110, a target position detection unit 120, a feature conversion unit 130, a feature update unit 140, a position information restoration unit 150, and a learning unit 160. That is, the second information processing device 1 further includes a learning unit 160 in addition to the configuration already described in the first embodiment (see FIG. 3). Note that the learning unit 160 may be a processing block realized by the above-mentioned processor 11 (see FIG. 1).

 学習部160は、第2の情報処理装置1が実行する追跡対象の追跡処理に関する学習を実行可能に構成されている。より具体的には、学習部160は、追跡処理を実行する追跡モデル50（即ち、対象位置検出部120、特徴量変換部130、特徴量更新部140、及び位置情報復元部150の機能を有するモデル）に対して、より高精度で追跡が行えるような機械学習を実行してよい。学習部160は、特徴量更新部140が用いる交差注意機構200の動作に関する学習を実行してよい。例えば、学習部160は、交差注意機構200が行う追跡対象を照合する動作がより正確に行われるように学習を実行してよい。具体的には、学習部160は、交差注意機構200が用いる類似性行列AMが同一の追跡対象でより強く反応するように学習してよい。学習部160が用いる具体的な学習手法については、以下で詳しく説明する。 The learning unit 160 is configured to be able to perform learning related to the tracking process of the tracked object executed by the second information processing device 1. More specifically, the learning unit 160 may perform machine learning on the tracking model 50 that performs the tracking process (i.e., a model having the functions of the object position detection unit 120, feature conversion unit 130, feature update unit 140, and position information restoration unit 150) to enable tracking with higher accuracy. The learning unit 160 may perform learning related to the operation of the cross-attention mechanism 200 used by the feature update unit 140. For example, the learning unit 160 may perform learning so that the operation of matching tracked objects performed by the cross-attention mechanism 200 can be performed more accurately. Specifically, the learning unit 160 may learn so that the similarity matrix AM used by the cross-attention mechanism 200 reacts more strongly to the same tracked object. The specific learning method used by the learning unit 160 will be explained in detail below.
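 The description states the goal (a similarity matrix that responds more strongly for the same tracked object) but not a concrete loss. Purely as an assumption, one common way to express such an objective is a cross-entropy between each row of the similarity matrix and the ground-truth correspondence, as sketched below; the function name correspondence_loss and the use of PyTorch are illustrative choices, not part of the disclosure.

```python
import torch
import torch.nn.functional as F

def correspondence_loss(affinity: torch.Tensor, gt_index: torch.Tensor) -> torch.Tensor:
    """affinity: (N_prev, N_t) raw similarity; gt_index[i]: index of the detection at time t
    that is the same object as past target i. Cross-entropy pushes the matching entry up."""
    return F.cross_entropy(affinity, gt_index)

affinity = torch.randn(3, 4, requires_grad=True)
gt_index = torch.tensor([1, 0, 3])                # ground-truth correspondences from the labels
loss = correspondence_loss(affinity, gt_index)
loss.backward()                                   # gradients flow back into the tracking model
print(float(loss))
```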

 (学習手法)
 次に、図8から図10を参照しながら、学習部が実行する学習の手法について具体的に説明する。図8は、比較例に係る学習データの生成手法を示す概念図である。図9は、第2の情報処理装置に係る学習データの生成手法を示す概念図である。図10は、第2の情報処理装置による学習動作におけるクエリ伝搬を示す概念図である。
(Learning method)
Next, the learning technique executed by the learning unit will be specifically described with reference to Fig. 8 to Fig. 10. Fig. 8 is a conceptual diagram showing a learning data generation technique according to a comparative example. Fig. 9 is a conceptual diagram showing a learning data generation technique according to a second information processing device. Fig. 10 is a conceptual diagram showing query propagation in the learning operation by the second information processing device.

 図8において、まず比較例に係る学習手法について説明する。比較例に係る学習手法では、複数の動画を学習データとして用いる。具体的には、複数の動画をそれぞれミニバッチに変換して学習データとして用いる。このような学習手法を、クエリ伝搬を用いる追跡処理の学習に適用しようとすると、例えばメモリ量の制限等により多くのフレームを学習に用いることができず、結果として適切な学習効果を得ることができない。 In Figure 8, we will first explain a learning method according to a comparative example. In the learning method according to the comparative example, multiple videos are used as learning data. Specifically, multiple videos are each converted into mini-batches and used as learning data. When attempting to apply this type of learning method to learning tracking processing using query propagation, it is not possible to use many frames for learning due to memory limitations, for example, and as a result, it is not possible to obtain appropriate learning results.

 他方、図9に示すように、第2の情報処理装置1における学習部160は、1本の動画に対してミニバッチ変換を行い学習データとして利用する。例えば、学習部160は、ミニバッチに含まれるすべての動画フレームを、追跡モデル50のクエリ伝搬の学習にすべて使用してもよい。学習データは、1本の動画に含まれる各フレームと、各フレームに映り込んだ追跡対象の対応関係(即ち、どの人物とどの人物が同一であるか)を示す正解データとを含むデータであってよい。 On the other hand, as shown in FIG. 9, the learning unit 160 in the second information processing device 1 performs mini-batch conversion on a single video and uses the result as training data. For example, the learning unit 160 may use all of the video frames included in the mini-batch for training the query propagation of the tracking model 50. The training data may include each frame included in a single video and ground truth data indicating the correspondence between the tracked targets captured in each frame (i.e., which people are the same person).

 図10に示すように、学習部160は、1つの動画から取得した時系列で並ぶ複数のフレーム（フレームt1、フレームt2、フレームt3、…）を学習データとして用いる。このようにすれば、追跡モデル50のクエリ伝搬の学習で使用できるフレーム数を大幅に増加させることが可能である。また、追跡モデル50によるクエリの伝搬を時系列で学習していくことも可能となる。 As shown in Figure 10, the learning unit 160 uses multiple frames (frame t1, frame t2, frame t3, ...) arranged in chronological order from a single video as learning data. In this way, it is possible to significantly increase the number of frames that can be used to learn query propagation in the tracking model 50, and it is also possible to learn query propagation by the tracking model 50 in chronological order.
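 A minimal sketch of converting a single video into one time-ordered mini-batch follows; the function video_to_minibatch, the annotation format, and the frame count are assumptions used only to illustrate the idea of keeping every frame of one video in one batch.

```python
import numpy as np

def video_to_minibatch(video_frames, annotations):
    """Turn the frames of a single video into one time-ordered mini-batch.

    video_frames: list of (H, W, 3) arrays for frames t1, t2, t3, ...
    annotations:  list of per-frame ground truth, e.g. {target_id: box} dictionaries.
    Every frame of the video is kept so that query propagation can be trained
    over the whole sequence.
    """
    images = np.stack(video_frames, axis=0)       # (num_frames, H, W, 3), time-ordered
    return {"images": images, "annotations": annotations}

frames = [np.zeros((4, 4, 3), dtype=np.float32) for _ in range(5)]
labels = [{0: [0.1, 0.1, 0.2, 0.2]} for _ in range(5)]
batch = video_to_minibatch(frames, labels)
print(batch["images"].shape)  # (5, 4, 4, 3)
```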

 (学習動作の流れ)
 次に、図11を参照しながら、第2の情報処理装置1が実行する学習動作(即ち、学習部160によって追跡モデル50を学習する際の動作)の流れについて説明する。図11は、第2の情報処理装置による学習動作の流れを示すフローチャートである。
(Learning operation flow)
Next, the flow of the learning operation executed by the second information processing device 1 (i.e., the operation when the learning unit 160 learns the tracking model 50) will be described with reference to Fig. 11. Fig. 11 is a flowchart showing the flow of the learning operation by the second information processing device.

 図11に示すように、第2の情報処理装置1による学習動作が開始されると、まず学習部160が、1つの動画をバッチ変換して学習データとする（ステップS201）。そして、学習部160は、学習データを追跡モデル50に入力する（ステップS202）。 As shown in FIG. 11, when the learning operation by the second information processing device 1 begins, the learning unit 160 first batch-converts one video to create learning data (step S201). The learning unit 160 then inputs the learning data into the tracking model 50 (step S202).

 続いて、学習部160は、追跡モデル50の出力結果と正解データとを比較して損失関数を算出する(ステップS203)。そして、学習部160は、損失関数の勾配を算出する(ステップS204)。 Next, the learning unit 160 compares the output result of the tracking model 50 with the ground truth data to calculate a loss function (step S203). Then, the learning unit 160 calculates the gradient of the loss function (step S204).

 続いて、学習部160は、算出した勾配に基づいて、損失関数が小さくなるように追跡モデルのパラメータを更新する(ステップS205)。その後、学習部160は、学習データの全フレームを用いて学習を実行したか否かを判定する(ステップS206)。 Next, the learning unit 160 updates the parameters of the tracking model based on the calculated gradient so as to reduce the loss function (step S205). After that, the learning unit 160 determines whether learning has been performed using all frames of the training data (step S206).

 全フレームを学習に用いていない場合（ステップS206：NO）、学習部160は、再びステップS202から処理を開始する。即ち、学習部160は、学習データである画像を追跡モデル50に入力してから、パラメータを更新する処理までの処理を繰り返す。一方で、全フレームを学習に用いている場合（ステップS206：YES）、学習部160は学習が終了したと判断し、学習済みモデルを保存する（ステップS207）。 If not all frames have been used for learning (step S206: NO), the learning unit 160 starts processing again from step S202. That is, the learning unit 160 repeats the process from inputting images, which are learning data, into the tracking model 50 to updating the parameters. On the other hand, if all frames have been used for learning (step S206: YES), the learning unit 160 determines that learning has ended and saves the learned model (step S207).
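 The loop of steps S201 to S207 might be sketched as below. The stand-in model, the mean-squared-error loss, the SGD optimizer, and the file name are all assumptions; they only illustrate the order of the operations (forward pass, loss, gradient, parameter update, repeat over all frames, then save).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

tracking_model = nn.Linear(4, 4)          # stand-in for the tracking model 50
optimizer = torch.optim.SGD(tracking_model.parameters(), lr=1e-2)
frames = [(torch.rand(2, 4), torch.rand(2, 4)) for _ in range(5)]  # (input, ground truth) per frame

for boxes_in, boxes_gt in frames:          # repeat until every frame of the mini-batch is used (S206)
    pred = tracking_model(boxes_in)        # S202: input the training data to the model
    loss = F.mse_loss(pred, boxes_gt)      # S203: compare the output with the ground-truth data
    optimizer.zero_grad()
    loss.backward()                        # S204: gradient of the loss function
    optimizer.step()                       # S205: update parameters so the loss becomes smaller

torch.save(tracking_model.state_dict(), "trained_tracking_model.pt")  # S207: save the trained model
```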

 (技術的効果)
 次に、第2の情報処理装置1によって得られる技術的効果について説明する。
(Technical effect)
Next, the technical effects obtained by the second information processing device 1 will be described.

 図7から図11で説明したように、第2の情報処理装置1では、1つの動画に含まれる複数フレームを1つのミニバッチに変換して学習が行われる。このようにすれば、例えば複数の動画をそれぞれミニバッチ変換する場合と比較して、追跡モデル50のクエリ伝搬の学習で使用できるフレーム数を大幅に増加させることが可能である。なお、第2の情報処理装置1が行う追跡処理は、フレーム数に対して特に制限が設けられない。このため、多くのフレームを用いた学習（言い換えれば、長期間の時系列学習）を実現することで、追跡の精度を効果的に高めることが可能である。 As described in Figures 7 to 11, in the second information processing device 1, multiple frames included in one video are converted into one mini-batch for learning. In this way, it is possible to significantly increase the number of frames that can be used in learning query propagation for the tracking model 50, compared to, for example, when multiple videos are each converted into mini-batches. Note that there is no particular limit on the number of frames in the tracking process performed by the second information processing device 1. Therefore, by realizing learning using a large number of frames (in other words, long-term time-series learning), it is possible to effectively improve tracking accuracy.

 なお、上述した各実施形態に係る情報処理装置1は、対象に対して追跡処理を実行する各種システムに適用することが可能である。例えば、情報処理装置1は、所定領域を通過する対象を追跡して、追跡している対象の生体情報(例えば、顔情報や虹彩情報等)を用いた認証処理を実行するゲートレス認証システムに適用することが可能である。 The information processing device 1 according to each of the above-described embodiments can be applied to various systems that perform tracking processing on an object. For example, the information processing device 1 can be applied to a gateless authentication system that tracks an object passing through a predetermined area and performs authentication processing using biometric information (e.g., facial information, iris information, etc.) of the object being tracked.

 上述した各実施形態の機能を実現するように該実施形態の構成を動作させるプログラムを記録媒体に記録させ、該記録媒体に記録されたプログラムをコードとして読み出し、コンピュータにおいて実行する処理方法も各実施形態の範疇に含まれる。すなわち、コンピュータ読取可能な記録媒体も各実施形態の範囲に含まれる。また、上述のプログラムが記録された記録媒体はもちろん、そのプログラム自体も各実施形態に含まれる。 The scope of each embodiment also includes a processing method in which a program that operates the configuration of each embodiment to realize the functions of the above-mentioned embodiments is recorded on a recording medium, the program recorded on the recording medium is read as code, and the program is executed on a computer. In other words, computer-readable recording media are also included in the scope of each embodiment. Furthermore, each embodiment includes not only the recording medium on which the above-mentioned program is recorded, but also the program itself.

 記録媒体としては例えばフロッピー(登録商標)ディスク、ハードディスク、光ディスク、光磁気ディスク、CD-ROM、磁気テープ、不揮発性メモリカード、ROMを用いることができる。また該記録媒体に記録されたプログラム単体で処理を実行しているものに限らず、他のソフトウェア、拡張ボードの機能と共同して、OS上で動作して処理を実行するものも各実施形態の範疇に含まれる。更に、プログラム自体がサーバに記憶され、ユーザ端末にサーバからプログラムの一部または全てをダウンロード可能なようにしてもよい。プログラムは、例えばSaaS(Software as a Service)形式でユーザに提供されてもよい。 The recording medium may be, for example, a floppy disk, hard disk, optical disk, magneto-optical disk, CD-ROM, magnetic tape, non-volatile memory card, or ROM. Furthermore, the scope of each embodiment is not limited to programs recorded on the recording medium that execute processing by themselves, but also includes programs that execute processing by running on an OS in conjunction with other software or expansion board functions. Furthermore, the program itself may be stored on a server, and part or all of the program may be downloadable from the server to a user terminal. The program may also be provided to the user in, for example, a SaaS (Software as a Service) format.

 <付記>
 以上説明した実施形態に関して、更に以下の付記のようにも記載されうるが、以下には限られない。
<Additional Notes>
The above-described embodiment may be further described as follows, but is not limited to the following.

 (付記1)
 付記1に記載の情報処理装置は、動画から画像を取得する取得手段と、前記画像に含まれる追跡対象の位置を検出する検出手段と、前記追跡対象の位置に関する位置情報を、前記位置情報の特徴を示す特徴量に変換する変換手段と、前記特徴量を、複数の前記画像間で前記追跡対象を照合可能な交差注意機構を用いて更新する更新手段と、更新された前記特徴量を前記位置情報に復元する復元手段と、を備える情報処理装置である。
(Appendix 1)
The information processing device described in Supplementary Note 1 is an information processing device that includes an acquisition means for acquiring an image from a video, a detection means for detecting the position of a tracked object included in the image, a conversion means for converting position information regarding the position of the tracked object into a feature quantity indicating the characteristics of the position information, an update means for updating the feature quantity using a cross-attention mechanism that can match the tracked object between a plurality of the images, and a restoration means for restoring the updated feature quantity to the position information.

 (付記2)
 付記2に記載の情報処理装置は、前記交差注意機構は、第1の画像に関する前記特徴量である第1特徴量と、前記第1の画像よりも前に撮影された第2の画像に関する前記特徴量である第2特徴量と、から算出される重みを用いて前記追跡対象を照合する、付記1に記載の情報処理装置である。
(Appendix 2)
The information processing device described in Appendix 2 is the information processing device described in Appendix 1, in which the cross-attention mechanism matches the tracked target using a weight calculated from a first feature that is the feature related to a first image and a second feature that is the feature related to a second image that was taken before the first image.

 (付記3)
 付記3に記載の情報処理装置は、前記交差注意機構は、前記第1特徴量と前記第2特徴量との行列積を演算して得られる類似性行列を前記重みとして用いることで前記追跡対象を照合する、付記2に記載の情報処理装置である。
(Appendix 3)
The information processing device described in Supplementary Note 3 is the information processing device described in Supplementary Note 2, in which the cross-attention mechanism matches the tracked target by using a similarity matrix obtained by calculating a matrix product of the first feature and the second feature as the weight.

 (付記4)
 付記4に記載の情報処理装置は、前記特徴量を記憶可能な記憶手段と、前記更新手段で更新された前記特徴量に基づいて、前記記憶手段に記憶された前記特徴量を更新する記憶更新手段と、を更に備える付記1から3のいずれか一項に記載の情報処理装置である。
(Appendix 4)
The information processing device described in Supplementary Note 4 is the information processing device described in any one of Supplementary Notes 1 to 3, further comprising: a storage means capable of storing the feature; and a storage update means that updates the feature stored in the storage means based on the feature updated by the update means.

 (付記5)
 付記5に記載の情報処理装置は、1つの動画に含まれる複数フレームを1つのミニバッチに変換して学習データとし、前記追跡対象の照合に関する学習を行う学習部を更に備える、付記1から3のいずれか一項に記載の情報処理装置である。
(Appendix 5)
The information processing device described in Supplementary Note 5 is the information processing device described in any one of Supplementary Notes 1 to 3, further including a learning unit that converts multiple frames included in one video into one mini-batch to use as training data, and performs learning related to matching of the tracked target.

 (付記6)
 付記6に記載の情報処理方法は、少なくとも1つのコンピュータによって、動画から画像を取得し、前記画像に含まれる追跡対象の位置を検出し、前記追跡対象の位置に関する位置情報を、前記位置情報の特徴を示す特徴量に変換し、前記特徴量を、複数の前記画像間で前記追跡対象を照合可能な交差注意機構を用いて更新し、更新された前記特徴量を前記位置情報に復元する、情報処理方法である。
(Appendix 6)
The information processing method described in Supplementary Note 6 is an information processing method that, by at least one computer, acquires an image from a video, detects a position of a tracked target included in the image, converts position information regarding the position of the tracked target into a feature amount indicating characteristics of the position information, updates the feature amount using a cross-attention mechanism that can match the tracked target between a plurality of the images, and restores the updated feature amount to the position information.

 (付記7)
 付記7に記載の記録媒体は、少なくとも1つのコンピュータに、動画から画像を取得し、前記画像に含まれる追跡対象の位置を検出し、前記追跡対象の位置に関する位置情報を、前記位置情報の特徴を示す特徴量に変換し、前記特徴量を、複数の前記画像間で前記追跡対象を照合可能な交差注意機構を用いて更新し、更新された前記特徴量を前記位置情報に復元する、情報処理方法を実行させるコンピュータプログラムが記録された記録媒体である。
(Appendix 7)
The recording medium described in Supplementary Note 7 is a recording medium having recorded thereon a computer program for causing at least one computer to execute an information processing method of acquiring images from a video, detecting a position of a tracked target included in the images, converting position information regarding the position of the tracked target into feature quantities indicating characteristics of the position information, updating the feature quantities using a cross-attention mechanism capable of matching the tracked target between a plurality of the images, and restoring the updated feature quantities to the position information.

 (付記8)
 付記8に記載のコンピュータプログラムは、少なくとも1つのコンピュータに、動画から画像を取得し、前記画像に含まれる追跡対象の位置を検出し、前記追跡対象の位置に関する位置情報を、前記位置情報の特徴を示す特徴量に変換し、前記特徴量を、複数の前記画像間で前記追跡対象を照合可能な交差注意機構を用いて更新し、更新された前記特徴量を前記位置情報に復元する、情報処理方法を実行させるコンピュータプログラムである。
(Appendix 8)
The computer program described in Supplementary Note 8 is a computer program that causes at least one computer to execute an information processing method of acquiring images from a video, detecting a position of a tracked target included in the images, converting position information regarding the position of the tracked target into features indicating characteristics of the position information, updating the features using a cross-attention mechanism that can match the tracked target between a plurality of the images, and restoring the updated features to the position information.

 (付記9)
 付記9に記載の追跡装置は、動画から画像を取得する取得手段と、前記画像に含まれる追跡対象の位置を検出する検出手段と、前記追跡対象の位置に関する位置情報を、前記位置情報の特徴を示す特徴量に変換する変換手段と、前記特徴量を、複数の前記画像間で前記追跡対象を照合可能な交差注意機構を用いて更新する更新手段と、更新された前記特徴量を前記位置情報に復元する復元手段と、復元された前記位置情報に基づいて前記追跡対象を追跡する追跡手段と、を備える追跡装置である。
(Appendix 9)
The tracking device described in Supplementary Note 9 is a tracking device including: an acquisition means for acquiring an image from a video; a detection means for detecting a position of a tracking target included in the image; a conversion means for converting position information regarding the position of the tracking target into a feature quantity indicating characteristics of the position information; an update means for updating the feature quantity using a cross-attention mechanism capable of matching the tracking target between a plurality of the images; a restoration means for restoring the updated feature quantity to the position information; and a tracking means for tracking the tracking target based on the restored position information.

 (付記10)
 付記10に記載の追跡方法は、少なくとも1つのコンピュータによって、動画から画像を取得し、前記画像に含まれる追跡対象の位置を検出し、前記追跡対象の位置に関する位置情報を、前記位置情報の特徴を示す特徴量に変換し、前記特徴量を、複数の前記画像間で前記追跡対象を照合可能な交差注意機構を用いて更新し、更新された前記特徴量を前記位置情報に復元し、復元された前記位置情報に基づいて前記追跡対象を追跡する、追跡方法である。
(Appendix 10)
The tracking method described in Supplementary Note 10 is a tracking method that, by at least one computer, acquires an image from a video, detects a position of a tracked object included in the image, converts position information regarding the position of the tracked object into a feature amount indicating characteristics of the position information, updates the feature amount using a cross-attention mechanism that can match the tracked object between a plurality of the images, restores the updated feature amount to the position information, and tracks the tracked object based on the restored position information.

 (付記11)
 付記11に記載の記録媒体は、少なくとも1つのコンピュータに、動画から画像を取得し、前記画像に含まれる追跡対象の位置を検出し、前記追跡対象の位置に関する位置情報を、前記位置情報の特徴を示す特徴量に変換し、前記特徴量を、複数の前記画像間で前記追跡対象を照合可能な交差注意機構を用いて更新し、更新された前記特徴量を前記位置情報に復元し、復元された前記位置情報に基づいて前記追跡対象を追跡する、追跡方法を実行させるコンピュータプログラムが記録された記録媒体である。
(Appendix 11)
The recording medium described in Supplementary Note 11 is a recording medium having recorded thereon a computer program for causing at least one computer to execute a tracking method of acquiring images from a video, detecting a position of a tracked target included in the images, converting position information regarding the position of the tracked target into feature amounts indicating characteristics of the position information, updating the feature amounts using a cross-attention mechanism capable of matching the tracked target between a plurality of the images, restoring the updated feature amounts to the position information, and tracking the tracked target based on the restored position information.

 (付記12)
 付記12に記載のコンピュータプログラムは、少なくとも1つのコンピュータに、動画から画像を取得し、前記画像に含まれる追跡対象の位置を検出し、前記追跡対象の位置に関する位置情報を、前記位置情報の特徴を示す特徴量に変換し、前記特徴量を、複数の前記画像間で前記追跡対象を照合可能な交差注意機構を用いて更新し、更新された前記特徴量を前記位置情報に復元し、復元された前記位置情報に基づいて前記追跡対象を追跡する、追跡方法を実行させるコンピュータプログラムである。
(Appendix 12)
The computer program described in Supplementary Note 12 is a computer program that causes at least one computer to execute a tracking method of acquiring images from a video, detecting a position of a tracked target included in the images, converting position information regarding the position of the tracked target into feature amounts indicating characteristics of the position information, updating the feature amounts using a cross-attention mechanism that can match the tracked target across a plurality of the images, restoring the updated feature amounts to the position information, and tracking the tracked target based on the restored position information.

 この開示は、請求の範囲及び明細書全体から読み取ることのできる発明の要旨又は思想に反しない範囲で適宜変更可能であり、そのような変更を伴う情報処理装置、情報処理方法、及び記録媒体もまたこの開示の技術思想に含まれる。 This disclosure may be modified as appropriate within the scope of the claims and the spirit or concept of the invention as can be read from the entire specification, and information processing devices, information processing methods, and recording media incorporating such modifications are also included within the technical concept of this disclosure.

 10 情報処理装置
 11 プロセッサ
 12 RAM
 13 ROM
 14 記憶装置
 15 入力装置
 16 出力装置
 17 データバス
 50 追跡モデル
 110 画像取得部
 120 対象位置検出部
 130 特徴量変換部
 140 特徴量更新部
 150 位置情報復元部
 155 記憶部
 160 学習部
 200 交差注意機構
 210 クエリ
 220 キー
 230 バリュー
 240 行列積演算部
 250 正規化部
 260 行列積演算部
 270 残差処理部
 280 記憶更新部
10 Information processing device 11 Processor 12 RAM
13 ROM
14 Storage device 15 Input device 16 Output device 17 Data bus 50 Tracking model 110 Image acquisition unit 120 Target position detection unit 130 Feature conversion unit 140 Feature update unit 150 Position information restoration unit 155 Storage unit 160 Learning unit 200 Intersection attention mechanism 210 Query 220 Key 230 Value 240 Matrix multiplication unit 250 Normalization unit 260 Matrix multiplication unit 270 Residual processing unit 280 Memory update unit

Claims (7)

An information processing device comprising:
an acquisition means for acquiring images from a video;
a detection means for detecting a position of a tracked object included in the images;
a conversion means for converting position information relating to the position of the tracked object into a feature amount indicating a feature of the position information;
an updating means for updating the feature amount using a cross-attention mechanism capable of matching the tracked object between a plurality of the images; and
a restoration means for restoring the updated feature amount to the position information.
The information processing device according to claim 1, wherein the cross-attention mechanism matches the tracked object using a weight calculated from a first feature amount, which is the feature amount relating to a first image, and a second feature amount, which is the feature amount relating to a second image captured before the first image.
The information processing device according to claim 2, wherein the cross-attention mechanism matches the tracked object by using, as the weight, a similarity matrix obtained by calculating a matrix product of the first feature amount and the second feature amount.
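As a minimal, non-authoritative sketch of the weighting described in claims 2 and 3: the first feature amount is used as the query, the second feature amount as the key and value, their matrix product yields a similarity matrix, and a normalized form of that matrix weights the values. The softmax normalization and the residual connection below mirror the components named in the reference numerals (240 to 270), but the exact operations, shapes, and function names are assumptions.

```python
# Illustrative cross-attention sketch: the similarity matrix Q @ K.T acts as the
# matching weight between objects in the current image and the earlier image.
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(first_features, second_features):
    """first_features: (N1, D) from the first image (query);
    second_features: (N2, D) from the earlier second image (key and value)."""
    q, k, v = first_features, second_features, second_features
    similarity = q @ k.T                     # (N1, N2) similarity matrix (matrix product)
    weights = softmax(similarity, axis=-1)   # normalization step (softmax is an assumption)
    attended = weights @ v                   # weighted sum of earlier features
    return first_features + attended         # residual connection (an assumption)
```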
The information processing device according to any one of claims 1 to 3, further comprising:
a storage means capable of storing the feature amount; and
a storage update means for updating the feature amount stored in the storage means based on the feature amount updated by the updating means.
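A toy sketch of the storage update in claim 4 follows. The disclosure does not specify the update rule, so the exponential moving average used here is purely an assumption intended to show where the stored feature amounts would be refreshed.

```python
# Illustrative storage update sketch: refresh stored feature amounts with the
# feature amounts produced by the updating means (the EMA rule is an assumption).
import numpy as np

class FeatureStorage:
    def __init__(self, momentum=0.9):
        self.features = None       # stored feature amounts, assumed shape (N, D)
        self.momentum = momentum

    def update(self, updated_features):
        if self.features is None or self.features.shape != updated_features.shape:
            # First frame, or the number of tracked objects changed: store as-is.
            self.features = updated_features.copy()
        else:
            # Blend the previous stored features with the newly updated ones.
            self.features = (self.momentum * self.features
                             + (1.0 - self.momentum) * updated_features)
        return self.features
```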
The information processing device according to any one of claims 1 to 3, further comprising a learning unit that converts a plurality of frames included in one video into one mini-batch as training data and performs learning related to matching of the tracked object.
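One way to read claim 5 is sketched below: consecutive frames drawn from a single video are stacked into one mini-batch before being fed to training. The frame count, the contiguous sampling strategy, and the function name are assumptions, not part of the disclosure.

```python
# Illustrative sketch: build one mini-batch from consecutive frames of one video
# (frames_per_batch and contiguous sampling are assumptions).
import numpy as np

def video_to_minibatch(video_frames, frames_per_batch=8, start=0):
    """video_frames: sequence of H x W x C arrays taken from a single video."""
    clip = video_frames[start:start + frames_per_batch]
    return np.stack(clip, axis=0)   # shape: (frames_per_batch, H, W, C)
```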
An information processing method comprising, by at least one computer:
acquiring images from a video;
detecting a position of a tracked object included in the images;
converting position information relating to the position of the tracked object into a feature amount indicating a feature of the position information;
updating the feature amount using a cross-attention mechanism capable of matching the tracked object between a plurality of the images; and
restoring the updated feature amount to the position information.
A recording medium having recorded thereon a computer program for causing at least one computer to execute an information processing method comprising:
acquiring images from a video;
detecting a position of a tracked object included in the images;
converting position information relating to the position of the tracked object into a feature amount indicating a feature of the position information;
updating the feature amount using a cross-attention mechanism capable of matching the tracked object between a plurality of the images; and
restoring the updated feature amount to the position information.
PCT/JP2024/008288 2024-03-05 2024-03-05 Information processing device, information processing method, and recording medium Pending WO2025186903A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2024/008288 WO2025186903A1 (en) 2024-03-05 2024-03-05 Information processing device, information processing method, and recording medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2024/008288 WO2025186903A1 (en) 2024-03-05 2024-03-05 Information processing device, information processing method, and recording medium

Publications (2)

Publication Number Publication Date
WO2025186903A1 true WO2025186903A1 (en) 2025-09-12
WO2025186903A8 WO2025186903A8 (en) 2025-10-02

Family

ID=96990125

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2024/008288 Pending WO2025186903A1 (en) 2024-03-05 2024-03-05 Information processing device, information processing method, and recording medium

Country Status (1)

Country Link
WO (1) WO2025186903A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021130951A1 * 2019-12-26 2021-07-01 NEC Corporation Object-tracking device, object-tracking method, and recording medium
JP2021189857A * 2020-06-01 2021-12-13 Canon Inc. Information processing equipment, information processing methods, and programs
WO2022102083A1 * 2020-11-13 2022-05-19 NEC Corporation Information processing device, information processing method, and computer program

Also Published As

Publication number Publication date
WO2025186903A8 (en) 2025-10-02

Similar Documents

Publication Publication Date Title
CN111091176A (en) Data recognition device and method and training device and method
US20220004904A1 (en) Deepfake detection models utilizing subject-specific libraries
KR20210077464A (en) Method and system for detecting duplicated document using vector quantization
Peng et al. BDNN: Binary convolution neural networks for fast object detection
JPWO2019220620A1 (en) Anomaly detection device, anomaly detection method and program
KR20200076461A (en) Method and apparatus for processing neural network based on nested bit representation
CN118095359B (en) Large language model training method and device for privacy protection, medium and equipment
KR20200083119A (en) User verification device and method
KR102029860B1 (en) Method for tracking multi objects by real time and apparatus for executing the method
US20210232855A1 (en) Movement state recognition model training device, movement state recognition device, methods and programs therefor
WO2022190301A1 (en) Learning device, learning method, and computer-readable medium
WO2025186903A1 (en) Information processing device, information processing method, and recording medium
US20240371142A1 (en) Video conferencing device and image quality verifying method thereof
KR102533512B1 (en) Personal information object detection method and device
Karathanasis et al. A Comparative Analysis of Compression and Transfer Learning Techniques in DeepFake Detection Models.
JP7661981B2 (en) Information processing device, information processing method, and computer program
KR20250034834A (en) Method and system for enhancing image understanding
KR102829378B1 (en) Method and apparatus for re-registering pre-registered identity information in new identity recognition system
KR20240066032A (en) Method and apparatus for analyzing object from image using attention network
KR20230054182A (en) Person re-identification method using artificial neural network and computing apparatus for performing the same
CN116888665A (en) Electronic equipment and control methods
US20250014196A1 (en) Scene flow estimation apparatus and method
US20240013407A1 (en) Information processing apparatus, information processing method, and non-transitory computer-readable storage medium
JP7683812B2 (en) Learning device, learning method, tracking device, tracking method, and recording medium
CN119091470B (en) A video-based single-stage multi-person two-dimensional human posture estimation method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24928563

Country of ref document: EP

Kind code of ref document: A1