
WO2025186903A1 - Information processing device, information processing method, and recording medium - Google Patents

Information processing device, information processing method, and recording medium

Info

Publication number
WO2025186903A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
information processing
tracking
processing device
feature amount
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/JP2024/008288
Other languages
French (fr)
Japanese (ja)
Other versions
WO2025186903A8 (en)
Inventor
宏 福井
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC Corp
Original Assignee
NEC Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Corp filed Critical NEC Corp
Priority to PCT/JP2024/008288 priority Critical patent/WO2025186903A1/en
Publication of WO2025186903A1 publication Critical patent/WO2025186903A1/en
Publication of WO2025186903A8 publication Critical patent/WO2025186903A8/en
Pending legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments

Definitions

  • This disclosure relates to the technical fields of information processing devices, information processing methods, and recording media.
  • Patent Document 1 discloses generating a first feature vector indicating the position information of an object in a first image and a second feature vector indicating the position information of the object in a second image, and using the first feature vector and the second feature vector to generate correspondence information indicating the correspondence between the objects, thereby tracking the objects.
  • the objective of this disclosure is to provide an information processing device, an information processing method, and a recording medium that aim to improve upon the technology disclosed in prior art documents.
  • One aspect of the information processing device disclosed herein comprises an acquisition means for acquiring images from a video, a detection means for detecting the position of a tracked object contained in the images, a conversion means for converting position information relating to the position of the tracked object into a feature quantity indicating the characteristics of the position information, an update means for updating the feature quantity using a cross-attention mechanism capable of matching the tracked object between multiple images, and a restoration means for restoring the updated feature quantity to the position information.
  • One aspect of the information processing method disclosed herein involves using at least one computer to acquire images from a video, detect the position of a tracked object contained in the images, convert the position information regarding the position of the tracked object into features indicating the characteristics of the position information, update the features using a cross-attention mechanism capable of matching the tracked object across multiple images, and restore the updated features to the position information.
  • One aspect of the recording medium of this disclosure is a recording medium on which is recorded a computer program that causes at least one computer to execute an information processing method that acquires images from a video, detects the position of a tracked target included in the images, converts position information regarding the position of the tracked target into feature quantities that indicate the characteristics of the position information, updates the feature quantities using a cross-attention mechanism that can match the tracked target between multiple images, and restores the updated feature quantities to the position information.
  • FIG. 1 is a block diagram showing a hardware configuration of a first information processing apparatus.
  • FIG. 2 is a conceptual diagram illustrating an example of a tracking technique using query propagation.
  • FIG. 3 is a block diagram showing a functional configuration of a first information processing apparatus.
  • FIG. 4 is a flowchart showing the flow of a tracking process by the first information processing device.
  • FIG. 5 is a block diagram showing the configuration of a cross-attention mechanism in the first information processing device.
  • FIG. 6 is a plan view illustrating an example of an affinity matrix calculated by a cross-attention mechanism.
  • FIG. 7 is a block diagram showing a functional configuration of a second information processing apparatus.
  • FIG. 8 is a conceptual diagram illustrating a method for generating training data according to a comparative example.
  • FIG. 9 is a conceptual diagram illustrating a method for generating learning data according to the second information processing device.
  • FIG. 10 is a conceptual diagram illustrating query propagation in a learning operation by the second information processing device.
  • FIG. 11 is a flowchart showing the flow of a learning operation by the second information processing device.
  • Fig. 1 is a block diagram showing the hardware configuration of the first information processing apparatus.
  • the first information processing device 1 includes a processor 11, a RAM (Random Access Memory) 12, a ROM (Read Only Memory) 13, a storage device 14, an input device 15, and an output device 16.
  • the processor 11, RAM 12, ROM 13, storage device 14, input device 15, and output device 16 are each connected via a data bus 17.
  • the data bus 17 may be an interface other than a data bus (for example, a LAN, USB, etc.).
  • the processor 11 loads a computer program.
  • the processor 11 is configured to load a computer program stored in at least one of the RAM 12, ROM 13, and storage device 14.
  • the processor 11 may load a computer program stored in a computer-readable storage medium using a storage medium reading device (not shown).
  • the processor 11 may also obtain (i.e., load) the computer program from a device (not shown) located outside the first information processing device 1 via a network interface.
  • the processor 11 performs various processes by executing the loaded computer program.
  • functional blocks related to the tracking process performed by the first information processing device 1 are realized within the processor 11.
  • the processor 11 may function as a controller that executes each control in the first information processing device 1.
  • Processor 11 may be configured as, for example, a CPU (Central Processing Unit), GPU (Graphics Processing Unit), FPGA (Field-Programmable Gate Array), DSP (Digital Signal Processor), ASIC (Application Specific Integrated Circuit), or quantum processor. Processor 11 may be configured as one of these, or as multiple processors operating in parallel.
  • RAM 12 temporarily stores computer programs executed by processor 11.
  • RAM 12 temporarily stores data that processor 11 uses temporarily while it is executing a computer program.
  • RAM 12 may be, for example, D-RAM (Dynamic Random Access Memory) or SRAM (Static Random Access Memory). Also, other types of volatile memory may be used instead of RAM 12.
  • ROM 13 stores computer programs executed by processor 11. ROM 13 may also store fixed data. ROM 13 may be, for example, a P-ROM (Programmable Read Only Memory) or an EPROM (Erasable Read Only Memory). Also, other types of non-volatile memory may be used instead of ROM 13.
  • the storage device 14 stores data that the first information processing device 1 saves over the long term.
  • the storage device 14 may operate as a temporary storage device for the processor 11.
  • the storage device 14 may also store computer programs executed by the processor 11.
  • the storage device 14 may include, for example, at least one of a hard disk device, a magneto-optical disk device, an SSD (Solid State Drive), and a disk array device.
  • the input device 15 is a device that receives input instructions from a user of the information processing device 1.
  • the input device 15 may include, for example, at least one of a keyboard, a mouse, and a touch panel.
  • the input device 15 may also be a device that allows voice input, for example, including a microphone.
  • the input device 15 may also be configured as various types of terminals, such as a smartphone, tablet, or laptop computer.
  • the output device 16 is a device that outputs information related to the information processing device 1 to the outside.
  • the output device 16 may be a display device (e.g., a display or digital signage) that can display information related to the information processing device 1.
  • the output device 16 may also be a speaker or the like that can output information related to the first information processing device 1 as audio.
  • the output device 16 may be configured as various terminals, such as a smartphone, tablet, or laptop computer.
  • the first information processing device 1 may be configured to include some of the components described in FIG. 1.
  • the first information processing device 1 may be configured to include a processor 11, RAM 12, and ROM 13.
  • the storage device 14, input device 15, and output device 16 may each be configured as an external device connected to the first information processing device 1.
  • some of the calculation functions of the first information processing device 1 may be realized by an external server, cloud, etc.
  • Fig. 2 is a conceptual diagram showing an example of a tracking technique using query propagation.
  • the first information processing device 1 is configured to be able to execute a tracking process for tracking a tracking target included in a video.
  • the tracking target may be, for example, a person or an animal, or an object such as luggage or a car.
  • a feature based on the position information of the tracking target detected from the video is acquired as a "detection query."
  • This detection query is then used to update a "tracking query," which is a query for tracking the target.
  • the first information processing device 1 tracks the tracking target by propagating this tracking query in chronological order. Note that these queries are set for each tracking target. Therefore, if a video includes multiple tracking targets, the queries corresponding to each of the multiple tracking targets are updated.
  • the detection query at time T1 includes detection queries for objects A and B.
  • the tracking query is then updated using the detection query at time T1, and as a result, the tracking query at time T2 includes tracking queries for objects A and B.
  • the detection query at time T2 includes the detection query for object C.
  • the tracking query is then updated using the detection query at time T2.
  • the tracking query at time T3 includes the tracking queries for objects A and B that were included in the tracking query at time T2 (in other words, propagated from the previous time), as well as the tracking query for the newly detected object C.
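  • The query bookkeeping described above can be illustrated with a short sketch. The example below is conceptual only and is not code from this publication: the helper names (propagate, fuse) and the target IDs A, B, and C are assumptions made for the illustration.

```python
# Conceptual sketch of query propagation across frames (illustration only).
# Target IDs "A", "B", "C", the helper names, and the toy 1-D query vectors
# are assumptions for this example, not identifiers from the publication.

def propagate(tracking_queries, detection_queries, fuse):
    """Update per-target tracking queries with the current frame's detection queries.

    tracking_queries:  dict mapping target ID -> query (carried over from past frames)
    detection_queries: dict mapping target ID -> query (derived from the current frame)
    fuse:              function combining an old tracking query with a new detection query
    """
    updated = dict(tracking_queries)            # queries propagated from the previous time
    for target_id, det_q in detection_queries.items():
        if target_id in updated:
            updated[target_id] = fuse(updated[target_id], det_q)  # existing target: update
        else:
            updated[target_id] = det_q          # newly detected target: start a new query
    return updated

# Time T1: targets A and B are detected; time T2: target C newly appears.
queries_t2 = propagate({}, {"A": [0.1], "B": [0.2]}, fuse=lambda old, new: new)
queries_t3 = propagate(queries_t2, {"C": [0.3]}, fuse=lambda old, new: new)
print(sorted(queries_t3))   # ['A', 'B', 'C'] -- A and B propagated, C newly added
```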
  • Fig. 3 is a block diagram showing the functional configuration of the first information processing apparatus.
  • the first information processing device 1 includes, as components for realizing its functions, an image acquisition unit 110, a target position detection unit 120, a feature conversion unit 130, a feature update unit 140, a location information restoration unit 150, and a storage unit 155.
  • Each of the image acquisition unit 110, target position detection unit 120, feature conversion unit 130, feature update unit 140, and location information restoration unit 150 may be a processing block realized by the processor 11 described above (see FIG. 1).
  • the storage unit 155 may be realized by the storage device 14 described above (see FIG. 1), etc.
  • the image acquisition unit 110 is configured to be able to acquire videos and images. More specifically, the image acquisition unit 110 may acquire images of each frame constituting a video sequentially in chronological order. Alternatively, it may acquire images arranged in chronological order. For example, images may be acquired from each frame constituting a video at predetermined intervals. The image acquisition unit 110 may also be configured to acquire images in real time while capturing a video with a camera. The images acquired by the image acquisition unit 110 are configured to be output to the target position detection unit 120.
  • the target position detection unit 120 is configured to be able to detect the position of the tracking target from the image acquired by the image acquisition unit 110.
  • the target position detection unit 120 may be configured to, for example, detect a person included in the image and output position information indicating the position of the detected person. Note that if the image includes multiple tracking targets, the target position detection unit 120 may detect position information for each of the multiple tracking targets.
  • the target position detection unit 120 may be configured as a trained detection model made up of, for example, a neural network.
  • the position information of the tracking target detected by the target position detection unit 120 is configured to be output to the feature conversion unit 130.
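  • As a rough illustration of this detection step, the sketch below assumes a trained detector that maps an image to one bounding box per tracked target; the box format (x, y, w, h), the function names, and the dummy detector are hypothetical and are not taken from this publication.

```python
# Illustrative sketch of the detection step (assumed interface, not the actual model).
from typing import Callable, List, Tuple

import numpy as np

Box = Tuple[float, float, float, float]   # assumed position format: (x, y, width, height)

def detect_positions(image: np.ndarray, detector: Callable[[np.ndarray], List[Box]]) -> List[Box]:
    """Run a trained detection model and return position information per tracked target."""
    return [tuple(float(v) for v in box) for box in detector(image)]

# Dummy detector standing in for a trained neural-network detector: always reports two people.
dummy_detector = lambda img: [(10, 20, 50, 120), (200, 40, 45, 110)]
frame = np.zeros((480, 640, 3), dtype=np.uint8)
print(detect_positions(frame, dummy_detector))
```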
  • the feature conversion unit 130 is configured to be able to convert the position information of the tracked target detected by the target position detection unit 120 into feature amounts.
  • the feature amounts here may be feature vectors indicating the position information of the tracked target.
  • the feature conversion unit 130 may be configured as an encoder equipped with multiple feature extraction blocks.
  • the feature conversion unit 130 may be constructed, for example, with approximately three fully connected layers in a neural network.
  • the feature amounts converted by the feature conversion unit 130 may be used, for example, as detection queries (see Figure 2) used in tracking processing.
  • the feature amounts converted by the feature conversion unit 130 are configured to be output to the feature update unit 140.
  • the feature update unit 140 is configured to be able to update the features converted by the feature conversion unit 130 using a cross-attention mechanism 200.
  • the cross-attention mechanism 200 has the function of comparing the position information of the tracked target between multiple images. Specifically, the cross-attention mechanism 200 has the function of associating targets included in each of consecutive images acquired in time series to determine whether they are the same person. The specific configuration of the cross-attention mechanism 200 will be explained in detail later.
  • the features updated by the feature update unit 140 (hereinafter referred to as "updated features") may be used, for example, as a tracking query (see Figure 2) used in the tracking process. In this case, the updated features may be temporarily stored in the storage unit 155, which stores tracking queries.
  • the updated features are also configured to be output to the position information restoration unit 150.
  • the location information restoration unit 150 is configured to be able to restore the updated features updated by the feature update unit 140 to location information.
  • the location information restoration unit 150 may be configured as a decoder equipped with multiple feature extraction blocks.
  • the location information restoration unit 150 may be constructed, for example, with approximately three fully connected layers in a neural network.
  • the location information restoration unit 150 may also have a function to output the restored location information.
  • the storage unit 155 is configured to store feature amounts and location information for each tracking target.
  • the storage unit 155 may be configured to store detection queries and tracking queries for each tracking target.
  • the storage unit 155 may have multiple memory areas, and store tracking queries for tracking a single person in one memory area so that they are accumulated in chronological order.
  • the storage unit 155 may also be configured to store feature amounts and location information for each tracking target, linked to the corresponding ID.
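  • A minimal sketch of the feature conversion unit 130 and the location information restoration unit 150 follows, assuming each is roughly three fully connected layers as described above, mapping a box (x, y, w, h) to a d-dimensional feature and back. The dimensions, class names, and use of PyTorch are illustrative assumptions, not details from this publication.

```python
# Minimal sketch of the feature conversion unit 130 (encoder) and the location
# information restoration unit 150 (decoder), assuming roughly three fully
# connected layers each. Dimensions and class names are illustrative assumptions.
import torch
import torch.nn as nn

class PositionEncoder(nn.Module):      # corresponds to feature conversion unit 130
    def __init__(self, pos_dim: int = 4, feat_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pos_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, positions: torch.Tensor) -> torch.Tensor:
        # positions: (num_targets, 4) boxes -> (num_targets, feat_dim) detection queries
        return self.net(positions)

class PositionDecoder(nn.Module):      # corresponds to location information restoration unit 150
    def __init__(self, pos_dim: int = 4, feat_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, pos_dim),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # updated features -> restored position information (boxes)
        return self.net(features)

# Example: two detected targets, each described by an (x, y, w, h) box.
boxes = torch.tensor([[10.0, 20.0, 50.0, 120.0], [200.0, 40.0, 45.0, 110.0]])
queries = PositionEncoder()(boxes)          # detection queries
restored = PositionDecoder()(queries)       # restored position information
```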
  • Fig. 4 is a flowchart showing the flow of the tracking process by the first information processing device.
  • the image acquisition unit 110 first acquires an image from the video (step S101). Then, the target position detection unit 120 detects the position of the tracking target contained in the image from the image acquired by the image acquisition unit 110 (step S102).
  • the feature conversion unit 130 converts the position information of the tracked target detected by the target position detection unit 120 into a feature (step S103). Then, the feature update unit 140 updates the feature converted by the feature conversion unit 130 using the cross-attention mechanism 200 (step S104).
  • the location information restoration unit 150 restores the updated feature amounts updated by the feature amount update unit 140 into location information (step S105). Then, the first information processing device 1 determines whether to end the tracking process (step S106).
  • If the tracking process is not to be ended (step S106: NO), the process may be executed again from step S101. That is, the next frame image may be acquired, and the above-described process may be executed repeatedly.
  • If the tracking process is to be ended (step S106: YES), the series of processes ends.
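  • Putting steps S101 to S106 together, the overall loop can be sketched as below. The names read_frames, detector, encoder, cross_attention_update, and decoder are hypothetical placeholders for the components described above, not identifiers from this publication.

```python
# Sketch of the tracking flow of Fig. 4 (steps S101 to S106). All argument names are
# hypothetical placeholders standing in for the components described in the text.
def run_tracking(read_frames, detector, encoder, cross_attention_update, decoder):
    stored_features = {}                                # per-target features (storage unit 155)
    for frame in read_frames():                         # S101: acquire an image from the video
        boxes = detector(frame)                         # S102: detect positions of tracked targets
        features = encoder(boxes)                       # S103: convert position info into features
        updated, stored_features = cross_attention_update(features, stored_features)  # S104
        positions = decoder(updated)                    # S105: restore updated features to positions
        yield positions                                 # S106: repeat for the next frame until the video ends
```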
  • Fig. 5 is a block diagram showing the configuration of the cross-attention mechanism in the first information processing apparatus.
  • the cross-attention mechanism 200 includes three feature embedding units 210, 220, and 230 corresponding to the query, key, and value, respectively, a matrix multiplication unit 240, a normalization unit 250, a matrix multiplication unit 260, a residual processing unit 270, and a memory update unit 280.
  • the feature embedding processor 210 is configured to extract a query from the feature values at time t (i.e., the feature values corresponding to the frame captured at time t) input from the feature converter 130.
  • the feature embedding processor 220 is configured to extract a key from the feature values at time t-τ calculated in a past tracking process (i.e., the feature values corresponding to the frame captured at time t-τ, prior to time t).
  • the feature embedding processor 230 is configured to extract a value from the feature values at time t-τ calculated in a past tracking process.
  • the query and key are configured to be output to the matrix multiplication processor 240.
  • the value is configured to be output to the matrix multiplication processor 260.
  • the matrix multiplication calculation unit 240 is configured to calculate a weight (Attention Weight) indicating the correlation between the query and the key by calculating the matrix product of the query and the key.
  • the matrix multiplication calculation unit 240 is configured to calculate a weight indicating the correlation between the feature corresponding to the frame captured at time t and the feature corresponding to the frame captured at time t-τ.
  • the matrix multiplication calculation unit 240 may calculate (use) an affinity matrix (Affinity Matrix) in which the vertical axis represents the feature corresponding to the frame captured at time t and the horizontal axis represents the feature corresponding to the frame captured at time t-τ as the weight (Attention Weight) of the cross-attention mechanism 200.
  • the normalization unit 250 is configured to be able to perform normalization processing on the weights calculated by the matrix multiplication unit 240.
  • the normalization unit 250 may, for example, perform processing to normalize the similarity matrix calculated by the matrix multiplication unit 240 using a cross-softmax function.
  • the weights normalized by the normalization unit 250 are configured to be output to the matrix multiplication unit 260.
  • the matrix multiplication unit 260 is configured to perform processing to reflect the weight in the value by calculating the matrix product of the output from the normalization unit 250 and the value.
  • the matrix product in this embodiment may typically be a tensor product (in other words, a direct product).
  • the matrix product may be a Kronecker product.
  • the calculation result of the matrix multiplication unit 260 is configured to be output to the residual processing unit 270.
  • the residual processing unit 270 is configured to perform residual processing on the calculation results of the matrix multiplication calculation unit 260.
  • This residual processing may be a process of adding the calculation results of the matrix multiplication calculation unit 260 and the feature values input to the cross-attention mechanism 200 (specifically, the feature values at time t). This prevents the feature values from disappearing from the output of the cross-attention mechanism 200 when no correlation is calculated. For example, if 0 is calculated as the correlation (weight), the value will be multiplied by that 0, causing the feature value in the calculation results of the matrix multiplication calculation unit 260 to become 0 (that is, to disappear). To prevent this, the residual processing unit 270 performs the residual processing described above. The calculation results of the residual processing unit 270 are output from the cross-attention mechanism 200 as the updated feature values at time t.
  • the memory update unit 280 updates the stored feature quantities corresponding to the tracked object.
  • the memory update unit 280 may update only the feature quantities stored in the memory means corresponding to the updated feature quantities output by the matrix multiplication calculation unit 260, or may overwrite the feature quantities output by the calculation results of the residual processing unit 270 in the memory means to update them.
  • the tracked object may be identified by weights calculated from the query and key by the matrix multiplication calculation unit 240, and it may be determined which tracked object to update from the multiple tracked objects stored in the memory unit 155.
  • the updated feature quantities calculated by the matrix multiplication calculation unit 260 from the normalized weights and values may be determined as the updated quantities for the feature quantities of the tracked object stored in the memory unit 155.
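  • The computation described above can be sketched as follows. The sketch assumes single-head attention with linear embeddings for the query, key, and value, a plain row-wise softmax standing in for the cross-softmax normalization mentioned above, and an ordinary matrix product; the dimensions and class name are illustrative assumptions, and the memory update step is omitted.

```python
# Minimal sketch of the cross-attention update of Fig. 5. The linear embeddings,
# the plain softmax used in place of the cross-softmax, and all dimensions are
# simplifying assumptions, not the publication's exact design.
import torch
import torch.nn as nn

class CrossAttentionUpdate(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.embed_q = nn.Linear(feat_dim, feat_dim)  # feature embedding unit 210 (query)
        self.embed_k = nn.Linear(feat_dim, feat_dim)  # feature embedding unit 220 (key)
        self.embed_v = nn.Linear(feat_dim, feat_dim)  # feature embedding unit 230 (value)

    def forward(self, feat_t: torch.Tensor, feat_past: torch.Tensor) -> torch.Tensor:
        # feat_t:    (N_t, d) features at time t (from the feature conversion unit)
        # feat_past: (N_p, d) stored features at time t-tau (from the storage unit)
        q = self.embed_q(feat_t)
        k = self.embed_k(feat_past)
        v = self.embed_v(feat_past)
        weight = q @ k.t()                      # matrix multiplication unit 240: affinity matrix
        weight = weight.softmax(dim=-1)         # normalization unit 250 (cross-softmax stand-in)
        attended = weight @ v                   # matrix multiplication unit 260: weight applied to value
        return feat_t + attended                # residual processing unit 270: updated features at time t

# Example: 3 targets at time t matched against 2 stored targets from time t-tau.
update = CrossAttentionUpdate(feat_dim=8)
out = update(torch.randn(3, 8), torch.randn(2, 8))   # -> (3, 8) updated features
```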
  • the first information processing device 1 essentially focuses on the similarity between the tracking process and the operations performed by the cross-attention mechanism 200, and can be said to perform an operation to update features using information generated when matching objects.
  • the tracking process involves a process to detect the tracked object, a process to match the tracked object, and a process to update the tracked object detection results.
  • the cross-attention mechanism 200 involves a process to extract features related to the tracked object, a process to calculate weights, and a process to update the features related to the tracked object.
  • the first information processing device 1 essentially reuses the process of calculating weights in the cross-attention mechanism 200 as the process of matching the tracked object in the tracking process. Therefore, it can also be said that the first information processing device 1 realizes the operations of detecting an object, matching the object, and updating the detection results using the cross-attention mechanism 200.
  • the cross-attention mechanism 200 uses the feature quantity corresponding to the frame captured at time t as a query, and obtains and uses the feature quantities contained in frames captured up to time t-τ before time t as keys and values from the memory unit 160, which stores these feature quantities, to track the target.
  • the memory update unit 280 may also update the memory means by overwriting the feature quantity output by the calculation results of the residual processing unit 270.
  • the target may be identified from the weights calculated by the matrix multiplication calculation unit 240, and the feature quantity corresponding to the ID of the identified target, which is stored in the memory means, may be updated using the updated feature quantity output by the matrix multiplication calculation unit 260. In this way, it is possible to accurately track a target included in a video using a relatively simple algorithm.
  • Fig. 6 is a plan view showing an example of the similarity matrix calculated by the cross-attention mechanism.
  • the similarity matrix AM used as a weight by the cross attention mechanism 200 is information indicating the correspondence between the tracking target O t-τ at time t-τ and the tracking target O t at time t.
  • the similarity matrix AM is information indicating that (1) a first tracking target O t-τ among the multiple tracking targets O t-τ corresponds to a first tracking target O t among the multiple tracking targets O t (that is, both are the same person), (2) a second tracking target O t-τ among the multiple tracking targets O t-τ corresponds to a second tracking target O t among the multiple tracking targets O t, ..., (N) an Nth tracking target O t-τ among the multiple tracking targets O t-τ corresponds to an Nth tracking target O t among the multiple tracking targets O t.
  • the similarity matrix AM is information indicating the correspondence between the tracking target O t-τ and the tracking target O t, and therefore may be referred to as correspondence information.
  • the similarity matrix AM can be considered to be a matrix whose vertical axis corresponds to the vector components of the feature vector CV t-τ and whose horizontal axis corresponds to the vector components of the feature vector CV t. Therefore, the size of the vertical axis of the similarity matrix AM is the size of the feature vector CV t-τ, which corresponds to the size of the image captured at time t-τ (i.e., the number of pixels). Similarly, the size of the horizontal axis of the similarity matrix AM is the size of the feature vector CV t, which corresponds to the size of the image captured at time t (i.e., the number of pixels).
  • the similarity matrix AM can be considered to be a matrix whose vertical axis corresponds to the detection result of the tracking target O t-τ reflected in the image at time t-τ (i.e., the detected position of the tracking target O t-τ) and whose horizontal axis corresponds to the detection result of the tracking target O t reflected in the image at time t (i.e., the detected position of the tracking target O t).
  • an element of the similarity matrix AM reacts (typically has a non-zero value) at a position where a vector component on the vertical axis corresponding to a certain tracking target O t-τ intersects with a vector component on the horizontal axis corresponding to the same tracking target O t.
  • an element of the similarity matrix AM reacts at a position where a detection result for tracking target O t-τ on the vertical axis intersects with a detection result for tracking target O t on the horizontal axis.
  • the similarity matrix AM is typically a matrix in which the value of an element at a position where a vector component corresponding to tracking target O t-τ included in feature vector CV t-τ intersects with a vector component corresponding to the same tracking target O t included in feature vector CV t is a value obtained by multiplying both vector components (i.e., a non-zero value), while the values of the other elements are 0.
  • the elements of the similarity matrix AM react at positions where the detection result of tracking target O#k reflected in the image captured at time t-τ intersects with the detection result of tracking target O#k reflected in the image captured at time t.
  • If an element of the similarity matrix AM does not react (typically becomes 0) at the position where the vector component corresponding to the tracking target O t-τ included in the feature vector CV t-τ intersects with the vector component corresponding to the same tracking target O t included in the feature vector CV t, it is estimated that the tracking target O t-τ that was reflected in the image captured at time t-τ is not reflected in the image captured at time t (for example, it has moved outside the angle of view of the camera).
  • the similarity matrix AM can be used as information indicating the correspondence between the tracking target O t-τ and the tracking target O t.
  • the similarity matrix AM can be used as information indicating the result of matching the tracking target O t-τ reflected in the image captured at time t-τ with the object O t reflected in the image captured at time t. Therefore, the similarity matrix AM can be used as information for tracking the position of the tracking target O t-τ reflected in the image captured at time t-τ within the image captured at time t.
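  • A toy numeric example of reading the similarity matrix AM as correspondence information is given below; the values are invented purely for illustration and do not come from this publication.

```python
# Toy illustration of reading the similarity matrix AM as correspondence information.
# Rows: targets detected at time t-tau; columns: targets detected at time t.
# The numbers are made up purely to show how a row's peak identifies the same target.
import numpy as np

AM = np.array([
    [0.9, 0.1, 0.0],   # target 1 at t-tau responds most strongly to target 1 at t
    [0.0, 0.8, 0.2],   # target 2 at t-tau responds most strongly to target 2 at t
    [0.0, 0.0, 0.0],   # target 3 at t-tau has no counterpart at t (e.g. left the camera view)
])

for i, row in enumerate(AM):
    if row.max() > 0:
        print(f"target {i + 1} at t-tau  ->  target {int(row.argmax()) + 1} at t")
    else:
        print(f"target {i + 1} at t-tau  ->  no longer visible at t")
```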
  • the operation of this cross-attention mechanism 200 is used to perform tracking processing for the tracked object.
  • tracking processing can be performed more appropriately compared to when the cross-attention mechanism 200 described in this embodiment is not used.
  • In a comparative configuration that uses a self-attention mechanism, learning is required so that the weights in the self-attention mechanism react strongly between the same tracked objects.
  • realizing such a configuration requires a large number of self-attention mechanisms, which poses a technical problem in that the algorithm used for tracking processing becomes complicated.
  • In contrast, in this embodiment, the tracking processing algorithm can be constructed with a simple structure, making it possible to achieve highly accurate tracking processing while suppressing computational costs.
  • the second information processing device 1 will be described with reference to Figures 7 to 11.
  • the second information processing device 1 differs in some configurations and operations from the first information processing device 1 described above, but other parts may be similar to the first information processing device 1. Therefore, the following will describe in detail the parts that differ from the first embodiment, and will omit explanations of other overlapping parts as appropriate.
  • Fig. 7 is a block diagram showing the functional configuration of the second information processing device.
  • the second information processing device 1 includes, as components for realizing its functions, an image acquisition unit 110, a target position detection unit 120, a feature conversion unit 130, a feature update unit 140, a position information restoration unit 150, and a learning unit 160. That is, the second information processing device 1 further includes a learning unit 160 in addition to the configuration already described in the first embodiment (see FIG. 3). Note that the learning unit 160 may be a processing block realized by the above-mentioned processor 11 (see FIG. 1).
  • the learning unit 160 is configured to be able to perform learning related to the tracking process of the tracked object executed by the second information processing device 1. More specifically, the learning unit 160 may perform machine learning on the tracking model 50 that performs the tracking process (i.e., a model having the functions of the target position detection unit 120, feature conversion unit 130, feature update unit 140, and position information restoration unit 150) to enable tracking with higher accuracy.
  • the learning unit 160 may perform learning related to the operation of the cross-attention mechanism 200 used by the feature update unit 140. For example, the learning unit 160 may perform learning so that the operation of matching tracked objects performed by the cross-attention mechanism 200 can be performed more accurately. Specifically, the learning unit 160 may learn so that the similarity matrix AM used by the cross-attention mechanism 200 reacts more strongly to the same tracked object.
  • the specific learning method used by the learning unit 160 will be explained in detail below.
  • Fig. 8 is a conceptual diagram showing a learning data generation technique according to a comparative example.
  • Fig. 9 is a conceptual diagram showing a learning data generation technique according to a second information processing device.
  • Fig. 10 is a conceptual diagram showing query propagation in the learning operation by the second information processing device.
  • the learning unit 160 in the second information processing device 1 performs mini-batch conversion on a single video and uses the result as training data.
  • the learning unit 160 may use all of the video frames included in the mini-batch for training the query propagation of the tracking model 50.
  • the training data may include each frame included in a single video and ground truth data indicating the correspondence between the tracked targets captured in each frame (i.e., which people are the same person).
  • the learning unit 160 uses multiple frames (frame t1, frame t2, frame t3, ...) arranged in chronological order from a single video as learning data. In this way, it is possible to significantly increase the number of frames that can be used to learn query propagation in the tracking model 50, and it is also possible to learn query propagation by the tracking model 50 in chronological order.
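  • A minimal sketch of building such training data from a single video follows; the data layout (a list of frames paired with per-frame lists of ground-truth target IDs) is an assumption made purely for illustration.

```python
# Sketch of building training data from a single video, as described above:
# all consecutive frames of one video go into one mini-batch, each paired with
# ground-truth correspondences between tracked targets. The layout is illustrative.
def make_minibatch(frames, ground_truth_ids):
    """frames: list of images in chronological order (frame t1, t2, t3, ...).
    ground_truth_ids: list of per-frame target ID lists; equal IDs across frames
    mean the same person, which supervises the matching (query propagation)."""
    assert len(frames) == len(ground_truth_ids)
    return [
        {"frame": frame, "ids": ids}
        for frame, ids in zip(frames, ground_truth_ids)
    ]

# Example: a 3-frame video in which persons "A" and "B" appear first and "C" joins later.
batch = make_minibatch(
    frames=["img_t1", "img_t2", "img_t3"],
    ground_truth_ids=[["A", "B"], ["A", "B"], ["A", "B", "C"]],
)
```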
  • Fig. 11 is a flowchart showing the flow of the learning operation by the second information processing device.
  • the learning unit 160 first batch-converts one video to create learning data (step S201).
  • the learning unit 160 then inputs the learning data into the tracking model 50 (step S202).
  • the learning unit 160 compares the output result of the tracking model 50 with the ground truth data to calculate a loss function (step S203). Then, the learning unit 160 calculates the gradient of the loss function (step S204).
  • the learning unit 160 updates the parameters of the tracking model based on the calculated gradient so as to reduce the loss function (step S205). After that, the learning unit 160 determines whether learning has been performed using all frames of the training data (step S206).
  • If not all frames have been used for learning (step S206: NO), the learning unit 160 starts processing again from step S202. That is, the learning unit 160 repeats the process from inputting images, which are learning data, into the tracking model 50 to updating the parameters. On the other hand, if all frames have been used for learning (step S206: YES), the learning unit 160 determines that learning has ended and saves the learned model (step S207).
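  • The learning loop of steps S201 to S207 can be sketched as below. The tracking model interface, the loss function, the optimizer choice, and the save path are placeholder assumptions; only the overall flow follows the steps described above.

```python
# Minimal sketch of the learning loop of Fig. 11 (steps S201 to S207). The tracking
# model interface, the loss function, and the optimizer choice are placeholder
# assumptions; only the overall flow mirrors the steps described above.
import torch

def train_on_one_video(tracking_model, frames_batch, loss_fn, lr: float = 1e-4):
    optimizer = torch.optim.Adam(tracking_model.parameters(), lr=lr)
    for sample in frames_batch:                       # S201: mini-batch built from one video
        output = tracking_model(sample["frame"])      # S202: input learning data into the model
        loss = loss_fn(output, sample["ids"])         # S203: compare with ground truth
        optimizer.zero_grad()
        loss.backward()                               # S204: compute the gradient of the loss
        optimizer.step()                              # S205: update parameters to reduce the loss
    # S206/S207: once all frames have been used, learning ends and the model is saved.
    torch.save(tracking_model.state_dict(), "tracking_model.pt")
```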
  • As described in Figures 6 to 11, in the second information processing device 1, multiple frames included in one video are converted into one mini-batch for learning. In this way, it is possible to significantly increase the number of frames that can be used in learning query propagation for the tracking model 50, compared to, for example, when multiple videos are each converted into mini-batches. Note that there is no particular limit on the number of frames in the tracking process performed by the second information processing device 1. Therefore, by realizing learning using a large number of frames (in other words, long-term time-series learning), it is possible to effectively improve tracking accuracy.
  • the information processing device 1 can be applied to various systems that perform tracking processing on an object.
  • the information processing device 1 can be applied to a gateless authentication system that tracks an object passing through a predetermined area and performs authentication processing using biometric information (e.g., facial information, iris information, etc.) of the object being tracked.
  • each embodiment also includes a processing method in which a program that operates the configuration of each embodiment to realize the functions of the above-mentioned embodiments is recorded on a recording medium, the program recorded on the recording medium is read as code, and the program is executed on a computer.
  • computer-readable recording media are also included in the scope of each embodiment.
  • each embodiment includes not only the recording medium on which the above-mentioned program is recorded, but also the program itself.
  • the recording medium may be, for example, a floppy disk, hard disk, optical disk, magneto-optical disk, CD-ROM, magnetic tape, non-volatile memory card, or ROM.
  • the scope of each embodiment is not limited to programs recorded on the recording medium that execute processing by themselves, but also includes programs that execute processing by running on an OS in conjunction with other software or expansion board functions.
  • the program itself may be stored on a server, and part or all of the program may be downloadable from the server to a user terminal.
  • the program may also be provided to the user in, for example, a SaaS (Software as a Service) format.
  • the information processing device described in Supplementary Note 1 is an information processing device that includes an acquisition means for acquiring an image from a video, a detection means for detecting the position of a tracked object included in the image, a conversion means for converting position information regarding the position of the tracked object into a feature quantity indicating the characteristics of the position information, an update means for updating the feature quantity using a cross-attention mechanism that can match the tracked object between a plurality of the images, and a restoration means for restoring the updated feature quantity to the position information.
  • The information processing device described in Supplementary Note 2 is the information processing device described in Supplementary Note 1, in which the cross-attention mechanism matches the tracked target using a weight calculated from a first feature that is the feature related to a first image and a second feature that is the feature related to a second image that was taken before the first image.
  • the information processing device described in Supplementary Note 3 is the information processing device described in Supplementary Note 2, in which the cross-attention mechanism matches the tracked target by using a similarity matrix obtained by calculating a matrix product of the first feature and the second feature as the weight.
  • the information processing device described in Supplementary Note 4 is the information processing device described in any one of Supplementary Notes 1 to 3, further comprising: a storage means capable of storing the feature; and a storage update means that updates the feature stored in the storage means based on the feature updated by the update means.
  • the information processing device described in Supplementary Note 5 is the information processing device described in any one of Supplementary Notes 1 to 3, further including a learning unit that converts multiple frames included in one video into one mini-batch to use as training data, and performs learning related to matching of the tracked target.
  • the information processing method described in Supplementary Note 6 is an information processing method that, by at least one computer, acquires an image from a video, detects a position of a tracked target included in the image, converts position information regarding the position of the tracked target into a feature amount indicating characteristics of the position information, updates the feature amount using a cross-attention mechanism that can match the tracked target between a plurality of the images, and restores the updated feature amount to the position information.
  • the recording medium described in Supplementary Note 7 is a recording medium having recorded thereon a computer program for causing at least one computer to execute an information processing method of acquiring images from a video, detecting a position of a tracked target included in the images, converting position information regarding the position of the tracked target into feature quantities indicating characteristics of the position information, updating the feature quantities using a cross-attention mechanism capable of matching the tracked target between a plurality of the images, and restoring the updated feature quantities to the position information.
  • the computer program described in Supplementary Note 8 is a computer program that causes at least one computer to execute an information processing method of acquiring images from a video, detecting a position of a tracked target included in the images, converting position information regarding the position of the tracked target into features indicating characteristics of the position information, updating the features using a cross-attention mechanism that can match the tracked target between a plurality of the images, and restoring the updated features to the position information.
  • the tracking device described in Supplementary Note 9 is a tracking device including: an acquisition means for acquiring an image from a video; a detection means for detecting a position of a tracking target included in the image; a conversion means for converting position information regarding the position of the tracking target into a feature quantity indicating characteristics of the position information; an update means for updating the feature quantity using a cross-attention mechanism capable of matching the tracking target between a plurality of the images; a restoration means for restoring the updated feature quantity to the position information; and a tracking means for tracking the tracking target based on the restored position information.
  • the tracking method described in Supplementary Note 10 is a tracking method that, by at least one computer, acquires an image from a video, detects a position of a tracked object included in the image, converts position information regarding the position of the tracked object into a feature amount indicating characteristics of the position information, updates the feature amount using a cross-attention mechanism that can match the tracked object between a plurality of the images, restores the updated feature amount to the position information, and tracks the tracked object based on the restored position information.
  • the recording medium described in Supplementary Note 11 is a recording medium having recorded thereon a computer program for causing at least one computer to execute a tracking method of acquiring images from a video, detecting a position of a tracked target included in the images, converting position information regarding the position of the tracked target into feature amounts indicating characteristics of the position information, updating the feature amounts using a cross-attention mechanism capable of matching the tracked target between a plurality of the images, restoring the updated feature amounts to the position information, and tracking the tracked target based on the restored position information.
  • the computer program described in Supplementary Note 12 is a computer program that causes at least one computer to execute a tracking method of acquiring images from a video, detecting a position of a tracked target included in the images, converting position information regarding the position of the tracked target into feature amounts indicating characteristics of the position information, updating the feature amounts using a cross-attention mechanism that can match the tracked target across a plurality of the images, restoring the updated feature amounts to the position information, and tracking the tracked target based on the restored position information.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Studio Devices (AREA)

Abstract

This information processing device includes: an acquisition means for acquiring an image from a video; a detection means for detecting the position of a tracking target included in the image; a conversion means for converting position information related to the position of the tracking target into a feature amount indicating a feature of the position information; an update means for updating the feature amount by using a cross-attention mechanism capable of matching the tracking target between a plurality of the images; and a restoration means for restoring the updated feature amount to the position information. According to such an information processing device, it is possible to accurately track a tracking target included in a video.

Description

Information processing device, information processing method, and recording medium

This disclosure relates to the technical fields of information processing devices, information processing methods, and recording media.

A known device of this type is one that performs tracking of objects contained in video. For example, Patent Document 1 discloses generating a first feature vector indicating the position information of an object in a first image and a second feature vector indicating the position information of the object in a second image, and using the first feature vector and the second feature vector to generate correspondence information indicating the correspondence between the objects, thereby tracking the objects.

International Publication No. 2021/130951

The objective of this disclosure is to provide an information processing device, an information processing method, and a recording medium that aim to improve upon the technology disclosed in prior art documents.

One aspect of the information processing device disclosed herein comprises an acquisition means for acquiring images from a video, a detection means for detecting the position of a tracked object contained in the images, a conversion means for converting position information relating to the position of the tracked object into a feature quantity indicating the characteristics of the position information, an update means for updating the feature quantity using a cross-attention mechanism capable of matching the tracked object between multiple images, and a restoration means for restoring the updated feature quantity to the position information.

One aspect of the information processing method disclosed herein involves using at least one computer to acquire images from a video, detect the position of a tracked object contained in the images, convert the position information regarding the position of the tracked object into features indicating the characteristics of the position information, update the features using a cross-attention mechanism capable of matching the tracked object across multiple images, and restore the updated features to the position information.

One aspect of the recording medium of this disclosure is a recording medium on which is recorded a computer program that causes at least one computer to execute an information processing method that acquires images from a video, detects the position of a tracked target included in the images, converts position information regarding the position of the tracked target into feature quantities that indicate the characteristics of the position information, updates the feature quantities using a cross-attention mechanism that can match the tracked target between multiple images, and restores the updated feature quantities to the position information.

FIG. 1 is a block diagram showing the hardware configuration of the first information processing device.
FIG. 2 is a conceptual diagram showing an example of a tracking technique using query propagation.
FIG. 3 is a block diagram showing the functional configuration of the first information processing device.
FIG. 4 is a flowchart showing the flow of the tracking process by the first information processing device.
FIG. 5 is a block diagram showing the configuration of the cross-attention mechanism in the first information processing device.
FIG. 6 is a plan view showing an example of the similarity matrix calculated by the cross-attention mechanism.
FIG. 7 is a block diagram showing the functional configuration of the second information processing device.
FIG. 8 is a conceptual diagram showing a method for generating learning data according to a comparative example.
FIG. 9 is a conceptual diagram showing a method for generating learning data according to the second information processing device.
FIG. 10 is a conceptual diagram showing query propagation in the learning operation by the second information processing device.
FIG. 11 is a flowchart showing the flow of the learning operation by the second information processing device.

Below, embodiments of an information processing device, an information processing method, and a recording medium will be described with reference to the drawings.

First Embodiment
The first information processing apparatus will be described with reference to FIGS. 1 to 6.

(Hardware configuration)
First, the hardware configuration of the first information processing apparatus will be described with reference to Fig. 1. Fig. 1 is a block diagram showing the hardware configuration of the first information processing apparatus.

As shown in FIG. 1, the first information processing device 1 includes a processor 11, a RAM (Random Access Memory) 12, a ROM (Read Only Memory) 13, a storage device 14, an input device 15, and an output device 16. The processor 11, RAM 12, ROM 13, storage device 14, input device 15, and output device 16 are each connected via a data bus 17. Note that the data bus 17 may be an interface other than a data bus (for example, a LAN, USB, etc.).

The processor 11 loads a computer program. For example, the processor 11 is configured to load a computer program stored in at least one of the RAM 12, ROM 13, and storage device 14. Alternatively, the processor 11 may load a computer program stored in a computer-readable storage medium using a storage medium reading device (not shown). The processor 11 may also obtain (i.e., load) the computer program from a device (not shown) located outside the first information processing device 1 via a network interface. The processor 11 performs various processes by executing the loaded computer program. In particular, in this embodiment, when the processor 11 executes the loaded computer program, functional blocks related to the tracking process performed by the first information processing device 1 are realized within the processor 11. In other words, the processor 11 may function as a controller that executes each control in the first information processing device 1.

Processor 11 may be configured as, for example, a CPU (Central Processing Unit), GPU (Graphics Processing Unit), FPGA (Field-Programmable Gate Array), DSP (Digital Signal Processor), ASIC (Application Specific Integrated Circuit), or quantum processor. Processor 11 may be configured as one of these, or as multiple processors operating in parallel.

RAM 12 temporarily stores computer programs executed by processor 11. RAM 12 temporarily stores data that processor 11 uses temporarily while it is executing a computer program. RAM 12 may be, for example, D-RAM (Dynamic Random Access Memory) or SRAM (Static Random Access Memory). Also, other types of volatile memory may be used instead of RAM 12.

ROM 13 stores computer programs executed by processor 11. ROM 13 may also store fixed data. ROM 13 may be, for example, a P-ROM (Programmable Read Only Memory) or an EPROM (Erasable Read Only Memory). Also, other types of non-volatile memory may be used instead of ROM 13.

The storage device 14 stores data that the first information processing device 1 saves over the long term. The storage device 14 may operate as a temporary storage device for the processor 11. The storage device 14 may also store computer programs executed by the processor 11. The storage device 14 may include, for example, at least one of a hard disk device, a magneto-optical disk device, an SSD (Solid State Drive), and a disk array device.

 入力装置15は、情報処理装置1のユーザからの入力指示を受け取る装置である。入力装置15は、例えば、キーボード、マウス及びタッチパネルのうちの少なくとも一つを含んでいてもよい。入力装置15は、例えばマイクを含む音声入力が可能な装置であってもよい。入力装置15は、例えばスマートフォン、タブレット、ノートパソコン等の各種端末として構成されていてもよい。 The input device 15 is a device that receives input instructions from a user of the information processing device 1. The input device 15 may include, for example, at least one of a keyboard, a mouse, and a touch panel. The input device 15 may also be a device that allows voice input, for example, including a microphone. The input device 15 may also be configured as various types of terminals, such as a smartphone, tablet, or laptop computer.

 出力装置16は、情報処理装置1に関する情報を外部に対して出力する装置である。例えば、出力装置16は、情報処理装置1に関する情報を表示可能な表示装置(例えば、ディスプレイやデジタルサイネージ等)であってもよい。また、出力装置16は、第1の情報処理装置1に関する情報を音声出力可能なスピーカ等であってもよい。出力装置16は、例えばスマートフォン、タブレット、ノートパソコン等の各種端末として構成されていてもよい。 The output device 16 is a device that outputs information related to the information processing device 1 to the outside. For example, the output device 16 may be a display device (e.g., a display or digital signage) that can display information related to the information processing device 1. The output device 16 may also be a speaker or the like that can output information related to the first information processing device 1 as audio. The output device 16 may be configured as various terminals, such as a smartphone, tablet, or laptop computer.

 なお、第1の情報処理装置1は、図1で説明した各構成要素の一部を含むものとして構成されてよい。例えば、第1の情報処理装置1は、プロセッサ11、RAM12、及びROM13を含んで構成されてよい。この場合、記憶装置14、入力装置15、及び出力装置16の各々は、第1の情報処理装置1と接続される外部の装置として構成されてもよい。また、第1の情報処理装置1における演算機能の一部は、外部サーバやクラウド等によって実現されてもよい。 The first information processing device 1 may be configured to include some of the components described in FIG. 1. For example, the first information processing device 1 may be configured to include a processor 11, RAM 12, and ROM 13. In this case, the storage device 14, input device 15, and output device 16 may each be configured as an external device connected to the first information processing device 1. Furthermore, some of the calculation functions of the first information processing device 1 may be realized by an external server, cloud, etc.

 (追跡手法)
 次に、図2を参照しながら、第1の情報処理装置1が実行する追跡処理の手法について説明する。図2は、クエリ伝搬を用いた追跡手法の一例を示す概念図である。
(Tracking method)
Next, a tracing process technique executed by the first information processing device 1 will be described with reference to Fig. 2. Fig. 2 is a conceptual diagram showing an example of a tracing technique using query propagation.

 図2において、第1の情報処理装置1は、動画に含まれる追跡対象を追跡する追跡処理を実行可能に構成されている。追跡対象は、例えば人物や動物であってもよいし、荷物や車などの物体であってもよい。第1の情報処理装置が実行する追跡処理では、動画から検出した追跡対象の位置情報に基づく特徴量が「検出クエリ」として取得される。そして、この検出クエリを用いて、対象を追跡するためのクエリである「追跡クエリ」が更新される。第1の情報処理装置1は、この追跡クエリを時系列で伝搬していくことにより追跡対象を追跡する。なお、これらのクエリは、追跡対象ごとに設定されるものである。このため、動画に複数の追跡対象が含まれている場合には、複数の追跡対象の各々に対応するクエリが更新されていく。 In FIG. 2, the first information processing device 1 is configured to be able to execute a tracking process for tracking a tracking target included in a video. The tracking target may be, for example, a person or an animal, or an object such as luggage or a car. In the tracking process executed by the first information processing device, a feature based on the position information of the tracking target detected from the video is acquired as a "detection query." This detection query is then used to update a "tracking query," which is a query for tracking the target. The first information processing device 1 tracks the tracking target by propagating this tracking query in chronological order. Note that these queries are set for each tracking target. Therefore, if a video includes multiple tracking targets, the queries corresponding to each of the multiple tracking targets are updated.

 例えば、図2に示す例では、時刻T1に撮影されたフレームから対象A及び対象Bが検出されている。このため、時刻T1の検出クエリには、対象A及びBの検出クエリが含まれている。そして、時刻T1の検出クエリを用いて追跡クエリが更新される。この結果、時刻T2の追跡クエリには、対象A及びBの追跡クエリが含まれている。 For example, in the example shown in Figure 2, objects A and B are detected from a frame captured at time T1. Therefore, the detection query at time T1 includes detection queries for objects A and B. The tracking query is then updated using the detection query at time T1, and as a result, the tracking query at time T2 includes tracking queries for objects A and B.

 続いて、時刻T2に撮影されたフレームからは、新たに対象Cが検出されている。このため、時刻T2の検出クエリには、対象Cの検出クエリが含まれている。そして、時刻T2の検出クエリを用いて追跡クエリが更新される。この結果、時刻T3の追跡クエリには、時刻T2の追跡クエリに含まれていた(言いかえれば、前の時刻から伝搬された)対象A及びBの追跡クエリと、新たに検出された対象Cの追跡クエリとが含まれている。 Subsequently, a new object C is detected in the frame captured at time T2. Therefore, the detection query at time T2 includes the detection query for object C. The tracking query is then updated using the detection query at time T2. As a result, the tracking query at time T3 includes the tracking queries for objects A and B that were included in the tracking query at time T2 (in other words, propagated from the previous time), as well as the tracking query for the newly detected object C.

 なお、時刻T3に撮影されたフレームからは、対象A、対象B及び対象Cがそれぞれ検出されているが、時刻T4に撮影されたフレームからは、対象A及び対象Cのみが検出され、対象Bは検出されていない。このため、時刻T4の追跡クエリからは、対象Bの追跡クエリが消失し、対象A及びCの追跡クエリが含まれている。 Note that in the frame captured at time T3, objects A, B, and C are each detected, but in the frame captured at time T4, only objects A and C are detected, and object B is not. As a result, the tracking query for object B has disappeared from the tracking query at time T4, and tracking queries for objects A and C are included.
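 By way of illustration only, the following is a minimal sketch of how per-target queries could be held and propagated frame by frame as described above. The class name TrackState, the function propagate, the feature size of 4, and the simple blending rule are assumptions made for this sketch and are not taken from the disclosure; the actual update is performed by the cross-attention mechanism 200 described later.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class TrackState:
    """Per-target record: one tracking query (feature vector) per tracked object ID."""
    queries: dict = field(default_factory=dict)  # target ID -> tracking query (np.ndarray)

def propagate(state: TrackState, detection_queries: dict, blend: float = 0.5) -> TrackState:
    """Carry existing tracking queries forward and merge in this frame's detection queries.

    IDs not seen before start a new track; IDs seen before are updated (here by a
    simple blend, purely as a placeholder for the cross-attention update).
    """
    for target_id, det_q in detection_queries.items():
        if target_id in state.queries:
            state.queries[target_id] = (1.0 - blend) * state.queries[target_id] + blend * det_q
        else:
            state.queries[target_id] = det_q  # newly detected target (e.g. target C at T2)
    return state

# Example: targets A and B are detected at T1, target C appears at T2.
state = TrackState()
state = propagate(state, {"A": np.ones(4), "B": np.zeros(4)})
state = propagate(state, {"C": np.full(4, 2.0)})
print(sorted(state.queries))  # ['A', 'B', 'C']
```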

 (機能的構成)
 次に、図3を参照しながら、上述した追跡処理を実行するための機能的構成について説明する。図3は、第1の情報処理装置の機能的構成を示すブロック図である。
(Functional configuration)
Next, the functional configuration for executing the above-mentioned tracking process will be described with reference to Fig. 3. Fig. 3 is a block diagram showing the functional configuration of the first information processing apparatus.

 図3に示すように、第1の情報処理装置1は、その機能を実現するための構成要素として、画像取得部110と、対象位置検出部120と、特徴量変換部130と、特徴量更新部140と、位置情報復元部150と、記憶部155と、を備えている。なお、画像取得部110、対象位置検出部120、特徴量変換部130、特徴量更新部140、及び位置情報復元部150の各々は、上述したプロセッサ11（図1参照）によって実現される処理ブロックであってよい。記憶部155は、上述した記憶装置14（図1参照）等によって実現されるものであってよい。 As shown in FIG. 3, the first information processing device 1 includes, as components for realizing its functions, an image acquisition unit 110, a target position detection unit 120, a feature conversion unit 130, a feature update unit 140, a location information restoration unit 150, and a storage unit 155. Each of the image acquisition unit 110, target position detection unit 120, feature conversion unit 130, feature update unit 140, and location information restoration unit 150 may be a processing block realized by the processor 11 described above (see FIG. 1). The storage unit 155 may be realized by the storage device 14 described above (see FIG. 1), etc.

 画像取得部110は、動画や画像を取得可能に構成されている。より具体的には、画像取得部110は、動画を構成する各フレームの画像を時系列で逐次的に取得してもよい。また、時系列で並んだ画像を取得してもよい。例えば、動画を構成する各フレームの画像から所定周期毎に画像を取得してもよい。なお、画像取得部110は、カメラで動画を撮像しながら、リアルタイムで画像を取得するように構成されてもよい。画像取得部110で取得された画像は、対象位置検出部120に出力される構成となっている。 The image acquisition unit 110 is configured to be able to acquire videos and images. More specifically, the image acquisition unit 110 may acquire images of each frame constituting a video sequentially in chronological order. Alternatively, it may acquire images arranged in chronological order. For example, images may be acquired from each frame constituting a video at predetermined intervals. The image acquisition unit 110 may also be configured to acquire images in real time while capturing a video with a camera. The images acquired by the image acquisition unit 110 are configured to be output to the target position detection unit 120.

 対象位置検出部120は、画像取得部110で取得した画像から、追跡対象の位置を検出可能に構成されている。対象位置検出部120は、例えば画像に含まれる人物を検出して、検出した人物の位置を示す位置情報を出力するように構成されてよい。なお、画像に複数の追跡対象が含まれている場合、対象位置検出部120は、複数の追跡対象の各々について位置情報を検出してよい。対象位置検出部120は、例えばニューラルネットワークで構成された学習済みの検出モデルとして構成されてもよい。対象位置検出部120で検出された追跡対象の位置情報は、特徴量変換部130に出力される構成となっている。 The target position detection unit 120 is configured to be able to detect the position of the tracking target from the image acquired by the image acquisition unit 110. The target position detection unit 120 may be configured to, for example, detect a person included in the image and output position information indicating the position of the detected person. Note that if the image includes multiple tracking targets, the target position detection unit 120 may detect position information for each of the multiple tracking targets. The target position detection unit 120 may be configured as a trained detection model made up of, for example, a neural network. The position information of the tracking target detected by the target position detection unit 120 is configured to be output to the feature conversion unit 130.

 特徴量変換部130は、対象位置検出部120で検出された追跡対象の位置情報を特徴量に変換可能に構成されている。ここでの特徴量は、追跡対象の位置情報を示す特徴ベクトルであってよい。特徴量変換部130は、複数の特徴抽出ブロックを備えたエンコーダとして構成されてよい。特徴量変換部130は、例えばニューラルネットワークにおける全結合層3層程度で構築されてよい。特徴量変換部130で変換された特徴量は、例えば追跡処理に用いる検出クエリ(図2参照)として用いられてよい。特徴量変換部130で変換された特徴量は、特徴量更新部140に出力される構成となっている。 The feature conversion unit 130 is configured to be able to convert the position information of the tracked target detected by the target position detection unit 120 into feature amounts. The feature amounts here may be feature vectors indicating the position information of the tracked target. The feature conversion unit 130 may be configured as an encoder equipped with multiple feature extraction blocks. The feature conversion unit 130 may be constructed, for example, with approximately three fully connected layers in a neural network. The feature amounts converted by the feature conversion unit 130 may be used, for example, as detection queries (see Figure 2) used in tracking processing. The feature amounts converted by the feature conversion unit 130 are configured to be output to the feature update unit 140.
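 As a rough illustration of such an encoder, the sketch below maps a detected position (assumed here to be a four-value bounding box) to a feature vector with about three fully connected layers. The input and output dimensions, the ReLU activations, and the use of PyTorch are assumptions; the description only specifies an encoder of roughly three fully connected layers.

```python
import torch
import torch.nn as nn

class PositionEncoder(nn.Module):
    """Encoder sketch: position information (here a 4-value box) -> feature vector."""
    def __init__(self, in_dim: int = 4, hidden_dim: int = 64, feat_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim),   # roughly three fully connected layers
        )

    def forward(self, positions: torch.Tensor) -> torch.Tensor:
        return self.net(positions)             # (num_targets, feat_dim) detection queries

boxes = torch.tensor([[0.1, 0.2, 0.3, 0.5], [0.6, 0.1, 0.2, 0.4]])  # two detected targets
print(PositionEncoder()(boxes).shape)          # torch.Size([2, 64])
```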

 特徴量更新部140は、特徴量変換部130で変換された特徴量を、交差注意機構(Cross‐attention機構)200を用いて更新可能に構成されている。交差注意機構200は、複数の画像間で追跡対象の位置情報を照合する機能を有している。具体的には、交差注意機構200は、時系列で取得される連続した画像の各々に含まれる対象について、どの対象とどの対象とが同一人物であるかの対応付けを行う機能を有している。交差注意機構200の具体的な構成については、後に詳しく説明する。特徴量更新部140で更新された特徴量(以下、適宜「更新特徴量」と称する)は、例えば追跡処理に用いる追跡クエリ(図2参照)として用いられてよい。この場合、更新特徴量は、追跡クエリを記憶する記憶部155によって一時的に記憶されてよい。また、更新特徴量は、位置情報復元部150に出力される構成となっている。 The feature update unit 140 is configured to be able to update the features converted by the feature conversion unit 130 using a cross-attention mechanism 200. The cross-attention mechanism 200 has the function of comparing the position information of the tracked target between multiple images. Specifically, the cross-attention mechanism 200 has the function of associating targets included in each of consecutive images acquired in time series to determine whether they are the same person. The specific configuration of the cross-attention mechanism 200 will be explained in detail later. The features updated by the feature update unit 140 (hereinafter referred to as "updated features") may be used, for example, as a tracking query (see Figure 2) used in the tracking process. In this case, the updated features may be temporarily stored in the memory unit 155, which stores tracking queries. The updated features are also configured to be output to the position information restoration unit 150.

 位置情報復元部150は、特徴量更新部140で更新された更新特徴量を位置情報に復元可能に構成されている。位置情報復元部150は、複数の特徴抽出ブロックを備えたデコーダとして構成されてよい。位置情報復元部150は、例えばニューラルネットワークにおける全結合層3層程度で構築されてよい。位置情報復元部150は、復元した位置情報を出力する機能を有していてもよい。 The location information restoration unit 150 is configured to be able to restore the updated features updated by the feature update unit 140 to location information. The location information restoration unit 150 may be configured as a decoder equipped with multiple feature extraction blocks. The location information restoration unit 150 may be constructed, for example, with approximately three fully connected layers in a neural network. The location information restoration unit 150 may also have a function to output the restored location information.
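 A corresponding sketch of the decoder, again assuming about three fully connected layers and a four-value box as the restored position information, might look as follows; the dimensions and activations are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PositionDecoder(nn.Module):
    """Decoder sketch: updated feature vector -> restored position information."""
    def __init__(self, feat_dim: int = 64, hidden_dim: int = 64, out_dim: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),    # again about three fully connected layers
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features)              # (num_targets, 4) restored boxes

print(PositionDecoder()(torch.rand(2, 64)).shape)  # torch.Size([2, 4])
```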

 記憶部155は、追跡対象毎に、特徴量及び位置情報を記憶するように構成されている。例えば、記憶部155は、追跡対象毎に検出クエリや追跡クエリを記憶するように構成されてもよい。例えば、記憶部155は、複数のメモリ領域を備え、1つのメモリ領域に対してある1人の人物に対して追跡を行った追跡クエリを時系列で蓄積するよう保存してもよい。また、追跡対象毎に対応するIDに紐づけて、特徴量及び位置情報を記憶するように構成されてもよい。 The storage unit 155 is configured to store feature amounts and location information for each tracking target. For example, the storage unit 155 may be configured to store detection queries and tracking queries for each tracking target. For example, the storage unit 155 may have multiple memory areas, and store tracking queries for tracking a single person in one memory area so that they are accumulated in chronological order. The storage unit 155 may also be configured to store feature amounts and location information for each tracking target, linked to the corresponding ID.
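 A minimal sketch of such per-ID, time-ordered storage is shown below; the class name QueryMemory and the dictionary-of-lists layout are assumptions made only to illustrate the description above.

```python
from collections import defaultdict
import numpy as np

class QueryMemory:
    """Sketch of the storage unit: per-target-ID, time-ordered list of queries."""
    def __init__(self):
        self._store = defaultdict(list)        # target ID -> [entry at t1, entry at t2, ...]

    def append(self, target_id: int, query: np.ndarray, position: np.ndarray) -> None:
        self._store[target_id].append({"query": query, "position": position})

    def latest(self, target_id: int) -> dict:
        return self._store[target_id][-1]

memory = QueryMemory()
memory.append(0, np.random.rand(64), np.array([0.1, 0.2, 0.3, 0.5]))
print(memory.latest(0)["position"])
```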

 (追跡処理の流れ)
 次に、図4を参照しながら、第1の情報処理装置1が実行する追跡処理の流れについて説明する。図4は、第1の情報処理装置による追跡処理の流れを示すフローチャートである。
(Tracking process flow)
Next, the flow of the tracking process executed by the first information processing device 1 will be described with reference to Fig. 4. Fig. 4 is a flowchart showing the flow of the tracking process by the first information processing device.

 図4に示すように、第1の情報処理装置1による追跡処理が開始されると、まず画像取得部110が、動画から画像を取得する（ステップS101）。そして、対象位置検出部120が、画像取得部110で取得した画像から、画像に含まれる追跡対象の位置を検出する（ステップS102）。 As shown in FIG. 4, when the tracking process by the first information processing device 1 begins, the image acquisition unit 110 first acquires an image from the video (step S101). Then, the target position detection unit 120 detects the position of the tracking target contained in the image from the image acquired by the image acquisition unit 110 (step S102).

 続いて、特徴量変換部130が、対象位置検出部120で検出された追跡対象の位置情報を特徴量に変換する（ステップS103）。そして、特徴量更新部140は、特徴量変換部130で変換された特徴量を、交差注意機構200を用いて更新する（ステップS104）。 Next, the feature conversion unit 130 converts the position information of the tracked target detected by the target position detection unit 120 into a feature (step S103). Then, the feature update unit 140 updates the feature converted by the feature conversion unit 130 using the cross-attention mechanism 200 (step S104).

 その後、位置情報復元部150は、特徴量更新部140で更新された更新特徴量を位置情報に復元する(ステップS105)。その後、第1の情報処理装置1は、追跡処理を終了するか否かを判定する(ステップS106)。 Then, the location information restoration unit 150 restores the updated feature amounts updated by the feature amount update unit 140 into location information (step S105). Then, the first information processing device 1 determines whether to end the tracking process (step S106).

 追跡処理を終了しない場合（ステップS106：NO）、再びステップS101からの処理が実行されてよい。即ち、次のフレームの画像を取得して、上述した処理が繰り返し実行されてよい。一方、追跡処理を終了する場合（ステップS106：YES）、一連の処理は終了することになる。 If the tracking process is not to be ended (step S106: NO), the process may be executed again from step S101. That is, the next frame image may be acquired, and the above-described process may be executed repeatedly. On the other hand, if the tracking process is to be ended (step S106: YES), the series of processes will end.
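 The flow of steps S101 to S106 can be summarized as a per-frame loop. The sketch below strings the components together; run_tracking, _passthrough, and the other stand-in callables are placeholders introduced for illustration and do not reproduce the actual detection, encoding, or update logic.

```python
import numpy as np

def run_tracking(frames, detect_positions, encode, update_with_cross_attention, decode):
    """Per-frame tracking loop mirroring steps S101-S106 (sketch, not the disclosed implementation)."""
    tracking_queries = None                       # propagated across frames
    outputs = []
    for frame in frames:                          # S101: acquire an image from the video
        positions = detect_positions(frame)       # S102: detect tracked-object positions
        detection_queries = encode(positions)     # S103: positions -> features
        tracking_queries = update_with_cross_attention(detection_queries, tracking_queries)  # S104
        outputs.append(decode(tracking_queries))  # S105: features -> positions
    return outputs                                # S106: loop ends when the frames run out

def _passthrough(x, *_):
    return x

frames = [np.zeros((8, 8)) for _ in range(3)]
results = run_tracking(frames, lambda f: np.random.rand(2, 4), _passthrough, _passthrough, _passthrough)
print(len(results))  # 3
```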

 (交差注意機構)
 次に、図5を参照しながら、交差注意機構200の構成及び動作について説明する。図5は、第1の情報処理装置における交差注意機構の構成を示すブロック図である。
(Cross-Attention Mechanism)
Next, the configuration and operation of the cross-attention mechanism 200 will be described with reference to Fig. 5. Fig. 5 is a block diagram showing the configuration of the cross-attention mechanism in the first information processing apparatus.

 図5に示すように、交差注意機構200は、クエリ、キー、バリューの各々に対応する3つの特徴埋込処理部210、220、及び230と、行列積演算部240と、正規化部250と、行列積演算部260と、残差処理部270と、記憶更新部280とを備えている。 As shown in FIG. 5, the cross-attention mechanism 200 includes three feature embedding units 210, 220, and 230 corresponding to the query, key, and value, respectively, a matrix multiplication unit 240, a normalization unit 250, a matrix multiplication unit 260, a residual processing unit 270, and a memory update unit 280.

 特徴埋込処理部210は、特徴量変換部130から入力される時刻tの特徴量(即ち、時刻tに撮影されたフレームに対応する特徴量)からクエリを抽出可能に構成されている。特徴埋込処理部220は、過去の追跡処理において演算された時刻t-τの特徴量(即ち、時刻tよりも前の時刻t-τに撮影されたフレームに対応する特徴量)からキーを抽出可能に構成されている。特徴埋込処理部230は、過去の追跡処理において演算された時刻t-τの特徴量からバリューを抽出可能に構成されている。クエリ及びキーは、行列積演算部240に出力される構成となっている。他方、バリューは、行列積演算部260に出力される構成となっている The feature embedding processor 210 is configured to extract a query from the feature values at time t (i.e., the feature values corresponding to the frame captured at time t) input from the feature converter 130. The feature embedding processor 220 is configured to extract a key from the feature values at time t-τ calculated in a past tracking process (i.e., the feature values corresponding to the frame captured at time t-τ, prior to time t). The feature embedding processor 230 is configured to extract a value from the feature values at time t-τ calculated in a past tracking process. The query and key are configured to be output to the matrix multiplication processor 240. On the other hand, the value is configured to be output to the matrix multiplication processor 260.

 行列積演算部240は、クエリ及びキーの行列積を演算することで、クエリ及びキーの相関関係を示す重み（Attention Weight）を算出可能に構成されている。即ち、行列積演算部240は、時刻tに撮影されたフレームに対応する特徴量と、時刻t-τに撮影されたフレームに対応する特徴量との相関関係を示す重みを算出可能に構成されている。行列積演算部240は、例えば、時刻tに撮影されたフレームに対応する特徴量を縦軸、時刻t-τに撮影されたフレームに対応する特徴量を横軸とする類似性行列（Affinity Matrix）を交差注意機構200の重み（Attention Weight）として算出（使用）してもよい。 The matrix multiplication calculation unit 240 is configured to calculate a weight (Attention Weight) indicating the correlation between the query and the key by calculating the matrix product of the query and the key. In other words, the matrix multiplication calculation unit 240 is configured to calculate a weight indicating the correlation between the feature corresponding to the frame captured at time t and the feature corresponding to the frame captured at time t-τ. For example, the matrix multiplication calculation unit 240 may calculate (use) an affinity matrix (Affinity Matrix) in which the vertical axis represents the feature corresponding to the frame captured at time t and the horizontal axis represents the feature corresponding to the frame captured at time t-τ as the weight (Attention Weight) of the cross-attention mechanism 200.

 正規化部250は、行列積演算部240で演算した重みに対して正規化処理を実行可能に構成されている。正規化部250は、例えば、行列積演算部240が算出した類似性行列を、クロスソフトマックス(Cross‐softmax)関数を用いて正規化する処理を実行してよい。正規化部250が正規化した重みは、行列積演算部260に出力される構成となっている。 The normalization unit 250 is configured to be able to perform normalization processing on the weights calculated by the matrix multiplication unit 240. The normalization unit 250 may, for example, perform processing to normalize the similarity matrix calculated by the matrix multiplication unit 240 using a cross-softmax function. The weights normalized by the normalization unit 250 are configured to be output to the matrix multiplication unit 260.
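 Taken together, the weight computation of block 240 and the normalization of block 250 might be sketched as follows. The exact form of the cross-softmax is not spelled out in the description, so the reading used here (a softmax along each axis of the affinity matrix, combined by element-wise multiplication) is an assumption, as are the feature dimensions.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)       # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def affinity_and_weight(feat_t: np.ndarray, feat_prev: np.ndarray) -> np.ndarray:
    """feat_t: (N_t, d) features at time t; feat_prev: (N_prev, d) features at time t - tau.

    Returns a normalized attention weight with the past targets on the vertical axis and
    the current detections on the horizontal axis, as in the affinity matrix of Fig. 6.
    """
    affinity = feat_prev @ feat_t.T               # matrix product -> (N_prev, N_t)
    # Assumed reading of "cross-softmax": normalize along both axes and combine.
    return softmax(affinity, axis=0) * softmax(affinity, axis=1)

prev = np.random.rand(3, 8)   # three targets at time t - tau
curr = np.random.rand(4, 8)   # four targets at time t
print(affinity_and_weight(curr, prev).shape)  # (3, 4)
```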

 行列積演算部260は、正規化部250からの出力と、バリューとの行列積を演算することで、重みをバリューに反映する処理を実行可能に構成されている。なお、本実施形態における行列積は、典型的には、テンソル積(言い換えれば、直積)であってよい。例えば、行列積は、クロネッカー積であってもよい。行列積演算部260の演算結果は、残差処理部270に出力される構成となっている。 The matrix multiplication unit 260 is configured to perform processing to reflect the weight in the value by calculating the matrix product of the output from the normalization unit 250 and the value. Note that the matrix product in this embodiment may typically be a tensor product (in other words, a direct product). For example, the matrix product may be a Kronecker product. The calculation result of the matrix multiplication unit 260 is configured to be output to the residual processing unit 270.

 残差処理部270は、行列積演算部260の演算結果に対して、残差処理を実行可能に構成されている。この残差処理は、行列積演算部260の演算結果と、交差注意機構200に入力された特徴量（具体的には、時刻tの特徴量）とを加算する処理であってよい。これは、相関関係が仮に算出されなかった場合でも、交差注意機構200の演算結果としての特徴量が生成されなくなるのを防ぐためである。例えば、相関関係（重み）として0が算出されると、バリュー値に対してその0が乗算されることにより、行列積演算部260の演算結果における特徴値が0となる（消失する）ことになる。これを防ぐために、残差処理部270は上述した残差処理を実行する。残差処理部270の演算結果は、更新された時刻tの特徴量として交差注意機構200から出力されることになる。 The residual processing unit 270 is configured to perform residual processing on the calculation results of the matrix multiplication calculation unit 260. This residual processing may be a process of adding the calculation results of the matrix multiplication calculation unit 260 and the feature values input to the cross-attention mechanism 200 (specifically, the feature values at time t). This is to prevent the feature values from not being generated as the calculation results of the cross-attention mechanism 200 even if a correlation is not calculated. For example, if 0 is calculated as the correlation (weight), the value will be multiplied by that 0, causing the feature value in the calculation results of the matrix multiplication calculation unit 260 to become 0 (disappear). To prevent this, the residual processing unit 270 performs the residual processing described above. The calculation results of the residual processing unit 270 are output from the cross-attention mechanism 200 as the updated feature values at time t.

 記憶更新部280は、記憶されている追跡対象に対応する特徴量を更新する。記憶更新部280は、行列積演算部260にて、出力された更新特徴量に対応する記憶手段に記憶されている特徴量のみを更新してもよいし、残差処理部270の演算結果によって出力された特徴量を記憶手段に上書きして更新してもよい。例えば、行列積演算部240にてクエリ及びキーから演算された重みによって、追跡対象が特定され、記憶部155に記憶されている複数の追跡対象からどの追跡対象を更新するか決定されてもよい。さらに、行列積演算部260にて、正規化された重み及びバリューから演算された更新特徴量が、記憶部155に記憶されている追跡対象の特徴量の更新量として決定されてもよい。 The memory update unit 280 updates the stored feature quantities corresponding to the tracked object. The memory update unit 280 may update only the feature quantities stored in the memory means corresponding to the updated feature quantities output by the matrix multiplication calculation unit 260, or may overwrite the feature quantities output by the calculation results of the residual processing unit 270 in the memory means to update them. For example, the tracked object may be identified by weights calculated from the query and key by the matrix multiplication calculation unit 240, and it may be determined which tracked object to update from the multiple tracked objects stored in the memory unit 155. Furthermore, the updated feature quantities calculated by the matrix multiplication calculation unit 260 from the normalized weights and values may be determined as the updated quantities for the feature quantities of the tracked object stored in the memory unit 155.
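 Putting blocks 210 to 280 together, one cross-attention update could be sketched as below. The learned projection matrices W_q, W_k, and W_v, the feature dimension, and the normalization are illustrative assumptions; block 280 (writing the result back to the storage unit 155) is only indicated by a comment.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                             # feature dimension (assumed)
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_update(feat_t: np.ndarray, feat_prev: np.ndarray) -> np.ndarray:
    """One update of the time-t features using the time-(t - tau) features.

    feat_t:    (N_t, d) features from the current frame (queries, block 210).
    feat_prev: (N_prev, d) stored features from a past frame (keys/values, blocks 220/230).
    """
    q = feat_t @ W_q                              # feature embedding for the query
    k = feat_prev @ W_k                           # feature embedding for the key
    v = feat_prev @ W_v                           # feature embedding for the value
    weight = q @ k.T                              # block 240: (N_t, N_prev) attention weight
    weight = softmax(weight, axis=0) * softmax(weight, axis=1)  # block 250 (assumed cross-softmax)
    attended = weight @ v                         # block 260: reflect the weight onto the value
    updated = attended + feat_t                   # block 270: residual, keeps features from vanishing
    return updated                                # block 280 would write this back to storage

feat_t = rng.standard_normal((4, d))              # four detections at time t
feat_prev = rng.standard_normal((3, d))           # three stored tracks at time t - tau
print(cross_attention_update(feat_t, feat_prev).shape)  # (4, 8)
```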

 なお、第1の情報処理装置1は、実質的には、追跡処理と交差注意機構200で行われる動作とが似ていることに着目し、物体を照合する際に生成される情報を用いて特徴量を更新する動作を行っていると言える。例えば、追跡処理では、追跡対象を検出する処理、追跡対象を照合する処理、及び追跡対象の検出結果を更新する処理が行われる。一方で、交差注意機構200では、追跡対象に関する特徴量を抽出する処理、重みを算出する処理、及び追跡対象に関する特徴量を更新する処理が行われる。第1の情報処理装置1は、交差注意機構200において重みを算出する処理を、実質的には、追跡処理において追跡対象を照合する処理としても流用している。言い換えれば、第1の情報処理装置1は、追跡処理において追跡対象を照合する処理を、実質的には、交差注意機構200において重みを算出する処理としても流用している。従って、第1の情報処理装置1は、物体を検出する動作、物体を照合する動作、及び検出結果を更新する動作を、交差注意機構200を用いて実現しているとも言える。 Note that the first information processing device 1 essentially focuses on the similarity between the tracking process and the operations performed by the cross-attention mechanism 200, and can be said to perform an operation to update features using information generated when matching objects. For example, the tracking process involves a process to detect the tracked object, a process to match the tracked object, and a process to update the tracked object detection results. On the other hand, the cross-attention mechanism 200 involves a process to extract features related to the tracked object, a process to calculate weights, and a process to update the features related to the tracked object. The first information processing device 1 essentially reuses the process of calculating weights in the cross-attention mechanism 200 as the process of matching the tracked object in the tracking process. In other words, the first information processing device 1 essentially reuses the process of matching the tracked object in the tracking process as the process of calculating weights in the cross-attention mechanism 200. Therefore, it can also be said that the first information processing device 1 realizes the operations of detecting an object, matching the object, and updating the detection results using the cross-attention mechanism 200.

 より具体的には、交差注意機構200は、上述したように時刻tに撮影されたフレームに対応する特徴量をクエリとし、時刻tよりも前の時刻t-τまでに撮影されたフレームに含まれる特徴量を記憶した記憶部155からキー及びバリューとして取得し、用いることで追跡対象を追跡する処理を行っている。また、記憶更新部280は、残差処理部270の演算結果によって出力された特徴量を記憶手段に上書きして更新してもよい。また、行列積演算部240にて演算された重みから、追跡対象を特定し、記憶手段に記憶されている特定された追跡対象のIDに対応する特徴量を行列積演算部260にて出力された更新特徴量を用いて更新してもよい。このようにすれば、動画に含まれる追跡対象に対する追跡処理を、比較的簡素なアルゴリズムで精度よく実行することが可能である。 More specifically, as described above, the cross-attention mechanism 200 uses the feature quantity corresponding to the frame captured at time t as a query, and obtains and uses the feature quantities contained in frames captured up to time t-τ before time t as keys and values from the memory unit 155, which stores these feature quantities, to track the target. The memory update unit 280 may also update the memory means by overwriting the feature quantity output by the calculation results of the residual processing unit 270. Alternatively, the target may be identified from the weights calculated by the matrix multiplication calculation unit 240, and the feature quantity corresponding to the ID of the identified target, which is stored in the memory means, may be updated using the updated feature quantity output by the matrix multiplication calculation unit 260. In this way, it is possible to accurately track a target included in a video using a relatively simple algorithm.

 (類似性行列)
 次に、図6を参照しながら、上述した交差注意機構200が算出する類似性行列について具体的に説明する。図6は、交差注意機構により算出される類似性行列の一例を示す平面図である。
(similarity matrix)
Next, the similarity matrix calculated by the above-described intersection attention mechanism 200 will be specifically described with reference to Fig. 6. Fig. 6 is a plan view showing an example of the similarity matrix calculated by the intersection attention mechanism.

 図6に示すように、交差注意機構200が重みとして用いる類似性行列AMは、時刻t-τの追跡対象Ot-τと、時刻tの追跡対象Oとの対応関係を示す情報となる。例えば、類似性行列AMは、(1)複数の追跡対象Ot-τのうちの第1の追跡対象Ot-τが、複数の追跡対象Oのうちの第1の追跡対象Oに対応しており(つまり、両者が同一の人物であり)、(2)複数の追跡対象Ot-τのうちの第2の追跡対象Ot-τが、複数の追跡対象Oのうちの第2の追跡対象Oに対応しており、・・・、(N)複数の追跡対象Ot-τのうちの第Nの追跡対象Ot-τが、複数の追跡対象Oのうちの第Nの追跡対象Oに対応していることを示す情報となる。なお、類似性行列AMは、追跡対象Ot-τと追跡対象Oとの対応関係を示す情報であるがゆえに、対応情報と称してもよい。 As shown in FIG. 6, the similarity matrix AM used as a weight by the cross attention mechanism 200 is information indicating the correspondence between the tracking target O t- τ at time t-τ and the tracking target O t at time t. For example, the similarity matrix AM is information indicating that (1) a first tracking target O t among the multiple tracking targets O t- τ corresponds to a first tracking target O t among the multiple tracking targets O t (that is, both are the same person), (2) a second tracking target O t-τ among the multiple tracking targets O t- τ corresponds to a second tracking target O t among the multiple tracking targets O t, ..., (N) an Nth tracking target O t-τ among the multiple tracking targets O t- τ corresponds to an Nth tracking target O t among the multiple tracking targets O t. Note that the similarity matrix AM is information indicating the correspondence between the tracking target O t-τ and the tracking target O t , and therefore may be referred to as correspondence information.

 具体的には、類似性行列AMは、その縦軸が特徴ベクトルCVt-τのベクトル成分に対応しており且つその横軸が特徴ベクトルCVのベクトル成分に対応している行列であるとみなすことができる。このため、類似性行列AMの縦軸のサイズは、特徴ベクトルCVt-τのサイズであり、時刻t-τに撮影された画像のサイズ(つまり、画素数)に対応するサイズ)になる。同様に、類似性行列AMの横軸のサイズは、特徴ベクトルCVのサイズであり、時刻tに撮影された画のサイズ(つまり、画素数)に対応するサイズ)になる。言い換えれば、類似性行列AMは、その縦軸が時刻t-τの画像に映り込んでいる追跡対象Ot-τの検出結果(つまり、追跡対象Ot-τの検出位置)に対応しており、且つ、その横軸が時刻tの画像に映り込んでいる追跡対象Oの検出結果(つまり、追跡対象Oの検出位置)に対応している行列であるとみなすことができる。この場合、縦軸上のある追跡対象Ot-τに対応するベクトル成分と横軸上の同じ追跡対象Oに対応するベクトル成分とが交差する位置において、類似性行列AMの要素が反応する(典型的には、0でない値を有する)。言い換えれば、縦軸上の追跡対象Ot-τの検出結果と横軸上の追跡対象Oの検出結果とが交差する位置において、類似性行列AMの要素が反応する。つまり、類似性行列AMは、典型的には、特徴ベクトルCVt-τに含まれる追跡対象Ot-τに対応するベクトル成分と特徴ベクトルCVに含まれる同じ追跡対象Oに対応するベクトル成分とが交差する位置の要素の値が、両ベクトル成分を掛け合わせることで得られる値(つまり、0ではない値)となる一方で、それ以外の要素の値が0になる行列となる。 Specifically, the similarity matrix AM can be considered to be a matrix whose vertical axis corresponds to the vector components of the feature vector CV t-τ and whose horizontal axis corresponds to the vector components of the feature vector CV t . Therefore, the size of the vertical axis of the similarity matrix AM is the size of the feature vector CV t-τ , which corresponds to the size of the image captured at time t-τ (i.e., the number of pixels). Similarly, the size of the horizontal axis of the similarity matrix AM is the size of the feature vector CV t , which corresponds to the size of the image captured at time t (i.e., the number of pixels). In other words, the similarity matrix AM can be considered to be a matrix whose vertical axis corresponds to the detection result of the tracking target O t-τ reflected in the image at time t-τ (i.e., the detected position of the tracking target O t-τ ) and whose horizontal axis corresponds to the detection result of the tracking target O t reflected in the image at time t (i.e., the detected position of the tracking target O t ). In this case, an element of the similarity matrix AM reacts (typically has a non-zero value) at a position where a vector component on the vertical axis corresponding to a certain tracking target O t-τ intersects with a vector component on the horizontal axis corresponding to the same tracking target O t. In other words, an element of the similarity matrix AM reacts at a position where a detection result for tracking target O t-τ on the vertical axis intersects with a detection result for tracking target O t on the horizontal axis. In other words, the similarity matrix AM is typically a matrix in which the value of an element at a position where a vector component corresponding to tracking target O t- τ included in feature vector CV t- τ intersects with a vector component corresponding to the same tracking target O t included in feature vector CV t is a value obtained by multiplying both vector components (i.e., a non-zero value), while the values of the other elements are 0.

 例えば、図6に示す例では、特徴ベクトルCVt-τに含まれる追跡対象O#k（但し、kは、検出された追跡対象Oの数であり、図6に示す例では、k=1、2、3又は4）に対応するベクトル成分と特徴ベクトルCVに含まれる同じ追跡対象O#kに対応するベクトル成分とが交差する位置において、類似性行列AMの要素が反応する。つまり、時刻t-τに撮影された画像に映り込んだ追跡対象O#kの検出結果と、時刻tに撮影された画像に映り込んだ追跡対象O#kの検出結果とが交差する位置において、類似性行列AMの要素が反応する。 For example, in the example shown in Fig. 6, the elements of the similarity matrix AM react at positions where vector components corresponding to tracking target O#k (where k is the number of detected tracking targets O, and in the example shown in Fig. 6, k = 1, 2, 3, or 4) included in the feature vector CVt-τ intersect with vector components corresponding to the same tracking target O#k included in the feature vector CVt. In other words, the elements of the similarity matrix AM react at positions where the detection result of tracking target O#k reflected in the image captured at time t-τ intersects with the detection result of tracking target O#k reflected in the image captured at time t.

 逆に、特徴ベクトルCVt-τに含まれる追跡対象Ot-τに対応するベクトル成分と特徴ベクトルCVに含まれる同じ追跡対象Oに対応するベクトル成分とが交差する位置において類似性行列AMの要素が反応しない(典型的には、0になる)場合には、時刻t-τに撮影された画像に映り込んでいた追跡対象Ot-τは、時刻tに撮影された画像には映り込んでいない(例えば、カメラの撮影画角外へ出てしまった)と推定される。 Conversely, if the elements of the similarity matrix AM do not react (typically become 0) at the position where the vector component corresponding to the tracking target O t-τ included in the feature vector CV t- τ intersects with the vector component corresponding to the same tracking target O t included in the feature vector CV t, it is estimated that the tracking target O t-τ that was reflected in the image captured at time t-τ is not reflected in the image captured at time t (for example, it has moved outside the angle of view of the camera).

 このように、類似性行列AMは、追跡対象Ot-τと追跡対象Oとの対応関係を示す情報として利用可能である。つまり、類似性行列AMは、時刻t-τに撮影された画像に映り込んでいる追跡対象Ot-τと、時刻tに撮影された画像に映り込んでいる物体Oとの照合結果を示す情報として利用可能である。よって、類似性行列AMは、時刻t-τに撮影された画像に映り込んでいた追跡対象Ot-τの、時刻tに撮影された画像内での位置を追跡するための情報として利用可能である。このように類似性行列AMを用いれば、動画に含まれている追跡対象の追跡処理を精度よく実行することが可能である。 In this way, the similarity matrix AM can be used as information indicating the correspondence between the tracking target O t-τ and the tracking target O t . In other words, the similarity matrix AM can be used as information indicating the result of matching the tracking target O t-τ reflected in the image captured at time t-τ with the object O t reflected in the image captured at time t. Therefore, the similarity matrix AM can be used as information for tracking the position of the tracking target O t-τ reflected in the image captured at time t-τ within the image captured at time t. By using the similarity matrix AM in this way, it is possible to accurately perform tracking processing of the tracking target included in the video.
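 As a toy illustration of how correspondence could be read out of such a matrix, the sketch below assigns each past target to the current detection whose entry responds most strongly and treats rows with no response as targets that have left the frame. The threshold value and the greedy argmax rule are assumptions made for this sketch; the description does not prescribe a specific read-out procedure.

```python
import numpy as np

def read_matches(affinity: np.ndarray, threshold: float = 0.1):
    """affinity[i, j]: response between past target i (time t - tau) and detection j (time t)."""
    matches, lost = {}, []
    for i, row in enumerate(affinity):
        j = int(row.argmax())
        if row[j] > threshold:
            matches[i] = j                        # same object seen in both frames
        else:
            lost.append(i)                        # no response: e.g. the target left the field of view
    return matches, lost

# Targets 0 and 1 reappear as detections 1 and 0; target 2 has no counterpart at time t.
affinity = np.array([[0.0, 0.9, 0.0],
                     [0.8, 0.0, 0.0],
                     [0.0, 0.0, 0.02]])
print(read_matches(affinity))  # ({0: 1, 1: 0}, [2])
```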

 (技術的効果)
 次に、第1の情報処理装置1によって得られる技術的効果について説明する。
(Technical effect)
Next, the technical effects obtained by the first information processing device 1 will be described.

 図1から図6で説明したように、第1の情報処理装置1では、動画から取得された画像における対象の位置情報に関する特徴量が、対象を照合する機能を有する交差注意機構200を用いて更新される。第1の情報処理装置1では、この交差注意機構200の動作を利用して、追跡対象に対する追跡処理が実行される。このようにすれば、本実施形態で説明した交差注意機構200を使用しない場合と比較して、より適切に追跡処理を実行することが可能となる。例えば、追跡処理に自己注意機構(Self‐attention機構)を利用しようとする場合、同一の追跡対象間で自己注意機構における重みが強く反応するような学習が必要とされる。しかしながら、このような構成を実現するためには、大量の自己注意機構が要求されてしまい、追跡処理に用いるアルゴリズムが複雑化してしまうという技術的問題点が生ずる。しかるに本実施形態で説明した交差注意機構200を用いれば、追跡処理のアルゴリズムを簡素な構造で構築できるため、計算コストを抑制しつつ高精度な追跡処理を実現することが可能となる。 1 to 6, in the first information processing device 1, feature quantities related to the position information of an object in an image acquired from a video are updated using a cross-attention mechanism 200 having the function of matching objects. In the first information processing device 1, the operation of this cross-attention mechanism 200 is used to perform tracking processing for the tracked object. In this way, tracking processing can be performed more appropriately compared to when the cross-attention mechanism 200 described in this embodiment is not used. For example, when attempting to use a self-attention mechanism for tracking processing, learning is required so that the weights in the self-attention mechanism react strongly between the same tracked objects. However, realizing such a configuration requires a large number of self-attention mechanisms, which poses a technical problem in that the algorithm used for tracking processing becomes complicated. However, by using the cross-attention mechanism 200 described in this embodiment, the tracking processing algorithm can be constructed with a simple structure, making it possible to achieve highly accurate tracking processing while suppressing computational costs.

 <第2実施形態>
 第2の情報処理装置1について、図7から図11を参照して説明する。なお、第2の情報処理装置1は、上述した第1の情報処理装置1と比べて一部の構成及び動作が異なるものであり、その他の部分については第1の情報処理装置1と同様であってよい。このため、以下では、第1実施形態と異なる部分について詳しく説明し、他の重複する部分については適宜説明を省略するものとする。
Second Embodiment
The second information processing device 1 will be described with reference to Figures 7 to 11. The second information processing device 1 differs in some configurations and operations from the first information processing device 1 described above, but other parts may be similar to the first information processing device 1. Therefore, the following will describe in detail the parts that differ from the first embodiment, and will omit explanations of other overlapping parts as appropriate.

 (機能的構成)
 まず、図7を参照しながら、第2の情報処理装置1の機能的構成について説明する。図7は、第2の情報処理装置の機能的構成を示すブロック図である。
(Functional configuration)
First, the functional configuration of the second information processing device 1 will be described with reference to Fig. 7. Fig. 7 is a block diagram showing the functional configuration of the second information processing device.

 図7に示すように、第2の情報処理装置1は、その機能を実現するための構成要素として、画像取得部110と、対象位置検出部120と、特徴量変換部130と、特徴量更新部140と、位置情報復元部150と、学習部160と、を備えている。即ち、第2の情報処理装置1は、すでに第1実施形態で説明した構成(図3参照)に加えて、学習部160を更に備えている。なお、学習部160は、上述したプロセッサ11(図1参照)によって実現される処理ブロックであってよい。 As shown in FIG. 7, the second information processing device 1 includes, as components for realizing its functions, an image acquisition unit 110, a target position detection unit 120, a feature conversion unit 130, a feature update unit 140, a position information restoration unit 150, and a learning unit 160. That is, the second information processing device 1 further includes a learning unit 160 in addition to the configuration already described in the first embodiment (see FIG. 3). Note that the learning unit 160 may be a processing block realized by the above-mentioned processor 11 (see FIG. 1).

 学習部160は、第2の情報処理装置1が実行する追跡対象の追跡処理に関する学習を実行可能に構成されている。より具体的には、学習部160は、追跡処理を実行する追跡モデル50（即ち、対象位置検出部120、特徴量変換部130、特徴量更新部140、及び位置情報復元部150の機能を有するモデル）に対して、より高精度で追跡が行えるような機械学習を実行してよい。学習部160は、特徴量更新部140が用いる交差注意機構200の動作に関する学習を実行してよい。例えば、学習部160は、交差注意機構200が行う追跡対象を照合する動作がより正確に行われるように学習を実行してよい。具体的には、学習部160は、交差注意機構200が用いる類似性行列AMが同一の追跡対象でより強く反応するように学習してよい。学習部160が用いる具体的な学習手法については、以下で詳しく説明する。 The learning unit 160 is configured to be able to perform learning related to the tracking process of the tracked object executed by the second information processing device 1. More specifically, the learning unit 160 may perform machine learning on the tracking model 50 that performs the tracking process (i.e., a model having the functions of the object position detection unit 120, feature conversion unit 130, feature update unit 140, and position information restoration unit 150) to enable tracking with higher accuracy. The learning unit 160 may perform learning related to the operation of the cross-attention mechanism 200 used by the feature update unit 140. For example, the learning unit 160 may perform learning so that the operation of matching tracked objects performed by the cross-attention mechanism 200 can be performed more accurately. Specifically, the learning unit 160 may learn so that the similarity matrix AM used by the cross-attention mechanism 200 reacts more strongly to the same tracked object. The specific learning method used by the learning unit 160 will be explained in detail below.
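 The description states the goal (a similarity matrix that responds more strongly for the same tracked object) but not a concrete loss. Purely as an assumption, one common way to express such an objective is a cross-entropy between each row of the similarity matrix and the ground-truth correspondence, as sketched below; the function name correspondence_loss and the use of PyTorch are illustrative choices, not part of the disclosure.

```python
import torch
import torch.nn.functional as F

def correspondence_loss(affinity: torch.Tensor, gt_index: torch.Tensor) -> torch.Tensor:
    """affinity: (N_prev, N_t) raw similarity; gt_index[i]: index of the detection at time t
    that is the same object as past target i. Cross-entropy pushes the matching entry up."""
    return F.cross_entropy(affinity, gt_index)

affinity = torch.randn(3, 4, requires_grad=True)
gt_index = torch.tensor([1, 0, 3])                # ground-truth correspondences from the labels
loss = correspondence_loss(affinity, gt_index)
loss.backward()                                   # gradients flow back into the tracking model
print(float(loss))
```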

 (学習手法)
 次に、図8から図10を参照しながら、学習部が実行する学習の手法について具体的に説明する。図8は、比較例に係る学習データの生成手法を示す概念図である。図9は、第2の情報処理装置に係る学習データの生成手法を示す概念図である。図10は、第2の情報処理装置による学習動作におけるクエリ伝搬を示す概念図である。
(Learning method)
Next, the learning technique executed by the learning unit will be specifically described with reference to Fig. 8 to Fig. 10. Fig. 8 is a conceptual diagram showing a learning data generation technique according to a comparative example. Fig. 9 is a conceptual diagram showing a learning data generation technique according to a second information processing device. Fig. 10 is a conceptual diagram showing query propagation in the learning operation by the second information processing device.

 図8において、まず比較例に係る学習手法について説明する。比較例に係る学習手法では、複数の動画を学習データとして用いる。具体的には、複数の動画をそれぞれミニバッチに変換して学習データとして用いる。このような学習手法を、クエリ伝搬を用いる追跡処理の学習に適用しようとすると、例えばメモリ量の制限等により多くのフレームを学習に用いることができず、結果として適切な学習効果を得ることができない。 In Figure 8, we will first explain a learning method according to a comparative example. In the learning method according to the comparative example, multiple videos are used as learning data. Specifically, multiple videos are each converted into mini-batches and used as learning data. When attempting to apply this type of learning method to learning tracking processing using query propagation, it is not possible to use many frames for learning due to memory limitations, for example, and as a result, it is not possible to obtain appropriate learning results.

 他方、図9に示すように、第2の情報処理装置1における学習部160は、1本の動画に対してミニバッチ変換を行い学習データとして利用する。例えば、学習部160は、ミニバッチに含まれるすべての動画フレームを、追跡モデル50のクエリ伝搬の学習にすべて使用してもよい。学習データは、1本の動画に含まれる各フレームと、各フレームに映り込んだ追跡対象の対応関係(即ち、どの人物とどの人物が同一であるか)を示す正解データとを含むデータであってよい。 On the other hand, as shown in FIG. 9, the learning unit 160 in the second information processing device 1 performs mini-batch conversion on a single video and uses the result as training data. For example, the learning unit 160 may use all of the video frames included in the mini-batch for training the query propagation of the tracking model 50. The training data may include each frame included in a single video and ground truth data indicating the correspondence between the tracked targets captured in each frame (i.e., which people are the same person).

 図10に示すように、学習部160は、1つの動画から取得した時系列で並ぶ複数のフレーム（フレームt1、フレームt2、フレームt3、…）を学習データとして用いる。このようにすれば、追跡モデル50のクエリ伝搬の学習で使用できるフレーム数を大幅に増加させることが可能である。また、追跡モデル50によるクエリの伝搬を時系列で学習していくことも可能となる。 As shown in Figure 10, the learning unit 160 uses multiple frames (frame t1, frame t2, frame t3, ...) arranged in chronological order from a single video as learning data. In this way, it is possible to significantly increase the number of frames that can be used to learn query propagation in the tracking model 50, and it is also possible to learn query propagation by the tracking model 50 in chronological order.
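 A minimal sketch of converting a single video into one time-ordered mini-batch follows; the function video_to_minibatch, the annotation format, and the frame count are assumptions used only to illustrate the idea of keeping every frame of one video in one batch.

```python
import numpy as np

def video_to_minibatch(video_frames, annotations):
    """Turn the frames of a single video into one time-ordered mini-batch.

    video_frames: list of (H, W, 3) arrays for frames t1, t2, t3, ...
    annotations:  list of per-frame ground truth, e.g. {target_id: box} dictionaries.
    Every frame of the video is kept so that query propagation can be trained
    over the whole sequence.
    """
    images = np.stack(video_frames, axis=0)       # (num_frames, H, W, 3), time-ordered
    return {"images": images, "annotations": annotations}

frames = [np.zeros((4, 4, 3), dtype=np.float32) for _ in range(5)]
labels = [{0: [0.1, 0.1, 0.2, 0.2]} for _ in range(5)]
batch = video_to_minibatch(frames, labels)
print(batch["images"].shape)  # (5, 4, 4, 3)
```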

 (学習動作の流れ)
 次に、図11を参照しながら、第2の情報処理装置1が実行する学習動作(即ち、学習部160によって追跡モデル50を学習する際の動作)の流れについて説明する。図11は、第2の情報処理装置による学習動作の流れを示すフローチャートである。
(Learning operation flow)
Next, the flow of the learning operation executed by the second information processing device 1 (i.e., the operation when the learning unit 160 learns the tracking model 50) will be described with reference to Fig. 11. Fig. 11 is a flowchart showing the flow of the learning operation by the second information processing device.

 図11に示すように、第2の情報処理装置1による学習動作が開始されると、まず学習部160が、1つの動画をバッチ変換して学習データとする（ステップS201）。そして、学習部160は、学習データを追跡モデル50に入力する（ステップS202）。 As shown in FIG. 11, when the learning operation by the second information processing device 1 begins, the learning unit 160 first batch-converts one video to create learning data (step S201). The learning unit 160 then inputs the learning data into the tracking model 50 (step S202).

 続いて、学習部160は、追跡モデル50の出力結果と正解データとを比較して損失関数を算出する(ステップS203)。そして、学習部160は、損失関数の勾配を算出する(ステップS204)。 Next, the learning unit 160 compares the output result of the tracking model 50 with the ground truth data to calculate a loss function (step S203). Then, the learning unit 160 calculates the gradient of the loss function (step S204).

 続いて、学習部160は、算出した勾配に基づいて、損失関数が小さくなるように追跡モデルのパラメータを更新する(ステップS205)。その後、学習部160は、学習データの全フレームを用いて学習を実行したか否かを判定する(ステップS206)。 Next, the learning unit 160 updates the parameters of the tracking model based on the calculated gradient so as to reduce the loss function (step S205). After that, the learning unit 160 determines whether learning has been performed using all frames of the training data (step S206).

 全フレームを学習に用いていない場合（ステップS206：NO）、学習部160は、再びステップS202から処理を開始する。即ち、学習部160は、学習データである画像を追跡モデル50に入力してから、パラメータを更新する処理までの処理を繰り返す。一方で、全フレームを学習に用いている場合（ステップS206：YES）、学習部160は学習が終了したと判断し、学習済みモデルを保存する（ステップS207）。 If not all frames have been used for learning (step S206: NO), the learning unit 160 starts processing again from step S202. That is, the learning unit 160 repeats the process from inputting images, which are learning data, into the tracking model 50 to updating the parameters. On the other hand, if all frames have been used for learning (step S206: YES), the learning unit 160 determines that learning has ended and saves the learned model (step S207).
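 The loop of steps S201 to S207 might be sketched as below. The stand-in model, the mean-squared-error loss, the SGD optimizer, and the file name are all assumptions; they only illustrate the order of the operations (forward pass, loss, gradient, parameter update, repeat over all frames, then save).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

tracking_model = nn.Linear(4, 4)          # stand-in for the tracking model 50
optimizer = torch.optim.SGD(tracking_model.parameters(), lr=1e-2)
frames = [(torch.rand(2, 4), torch.rand(2, 4)) for _ in range(5)]  # (input, ground truth) per frame

for boxes_in, boxes_gt in frames:          # repeat until every frame of the mini-batch is used (S206)
    pred = tracking_model(boxes_in)        # S202: input the training data to the model
    loss = F.mse_loss(pred, boxes_gt)      # S203: compare the output with the ground-truth data
    optimizer.zero_grad()
    loss.backward()                        # S204: gradient of the loss function
    optimizer.step()                       # S205: update parameters so the loss becomes smaller

torch.save(tracking_model.state_dict(), "trained_tracking_model.pt")  # S207: save the trained model
```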

 (技術的効果)
 次に、第2の情報処理装置1によって得られる技術的効果について説明する。
(Technical effect)
Next, the technical effects obtained by the second information processing device 1 will be described.

 図7から図11で説明したように、第2の情報処理装置1では、1つの動画に含まれる複数フレームを1つのミニバッチに変換して学習が行われる。このようにすれば、例えば複数の動画をそれぞれミニバッチ変換する場合と比較して、追跡モデル50のクエリ伝搬の学習で使用できるフレーム数を大幅に増加させることが可能である。なお、第2の情報処理装置1が行う追跡処理は、フレーム数に対して特に制限が設けられない。このため、多くのフレームを用いた学習（言い換えれば、長期間の時系列学習）を実現することで、追跡の精度を効果的に高めることが可能である。 As described in Figures 7 to 11, in the second information processing device 1, multiple frames included in one video are converted into one mini-batch for learning. In this way, it is possible to significantly increase the number of frames that can be used in learning query propagation for the tracking model 50, compared to, for example, when multiple videos are each converted into mini-batches. Note that there is no particular limit on the number of frames in the tracking process performed by the second information processing device 1. Therefore, by realizing learning using a large number of frames (in other words, long-term time-series learning), it is possible to effectively improve tracking accuracy.

 なお、上述した各実施形態に係る情報処理装置1は、対象に対して追跡処理を実行する各種システムに適用することが可能である。例えば、情報処理装置1は、所定領域を通過する対象を追跡して、追跡している対象の生体情報(例えば、顔情報や虹彩情報等)を用いた認証処理を実行するゲートレス認証システムに適用することが可能である。 The information processing device 1 according to each of the above-described embodiments can be applied to various systems that perform tracking processing on an object. For example, the information processing device 1 can be applied to a gateless authentication system that tracks an object passing through a predetermined area and performs authentication processing using biometric information (e.g., facial information, iris information, etc.) of the object being tracked.

 上述した各実施形態の機能を実現するように該実施形態の構成を動作させるプログラムを記録媒体に記録させ、該記録媒体に記録されたプログラムをコードとして読み出し、コンピュータにおいて実行する処理方法も各実施形態の範疇に含まれる。すなわち、コンピュータ読取可能な記録媒体も各実施形態の範囲に含まれる。また、上述のプログラムが記録された記録媒体はもちろん、そのプログラム自体も各実施形態に含まれる。 The scope of each embodiment also includes a processing method in which a program that operates the configuration of each embodiment to realize the functions of the above-mentioned embodiments is recorded on a recording medium, the program recorded on the recording medium is read as code, and the program is executed on a computer. In other words, computer-readable recording media are also included in the scope of each embodiment. Furthermore, each embodiment includes not only the recording medium on which the above-mentioned program is recorded, but also the program itself.

 記録媒体としては例えばフロッピー(登録商標)ディスク、ハードディスク、光ディスク、光磁気ディスク、CD-ROM、磁気テープ、不揮発性メモリカード、ROMを用いることができる。また該記録媒体に記録されたプログラム単体で処理を実行しているものに限らず、他のソフトウェア、拡張ボードの機能と共同して、OS上で動作して処理を実行するものも各実施形態の範疇に含まれる。更に、プログラム自体がサーバに記憶され、ユーザ端末にサーバからプログラムの一部または全てをダウンロード可能なようにしてもよい。プログラムは、例えばSaaS(Software as a Service)形式でユーザに提供されてもよい。 The recording medium may be, for example, a floppy disk, hard disk, optical disk, magneto-optical disk, CD-ROM, magnetic tape, non-volatile memory card, or ROM. Furthermore, the scope of each embodiment is not limited to programs recorded on the recording medium that execute processing by themselves, but also includes programs that execute processing by running on an OS in conjunction with other software or expansion board functions. Furthermore, the program itself may be stored on a server, and part or all of the program may be downloadable from the server to a user terminal. The program may also be provided to the user in, for example, a SaaS (Software as a Service) format.

 <付記>
 以上説明した実施形態に関して、更に以下の付記のようにも記載されうるが、以下には限られない。
<Additional Notes>
The above-described embodiment may be further described as follows, but is not limited to the following.

 (付記1)
 付記1に記載の情報処理装置は、動画から画像を取得する取得手段と、前記画像に含まれる追跡対象の位置を検出する検出手段と、前記追跡対象の位置に関する位置情報を、前記位置情報の特徴を示す特徴量に変換する変換手段と、前記特徴量を、複数の前記画像間で前記追跡対象を照合可能な交差注意機構を用いて更新する更新手段と、更新された前記特徴量を前記位置情報に復元する復元手段と、を備える情報処理装置である。
(Appendix 1)
The information processing device described in Supplementary Note 1 is an information processing device that includes an acquisition means for acquiring an image from a video, a detection means for detecting the position of a tracked object included in the image, a conversion means for converting position information regarding the position of the tracked object into a feature quantity indicating the characteristics of the position information, an update means for updating the feature quantity using a cross-attention mechanism that can match the tracked object between a plurality of the images, and a restoration means for restoring the updated feature quantity to the position information.

 (付記2)
 付記2に記載の情報処理装置は、前記交差注意機構は、第1の画像に関する前記特徴量である第1特徴量と、前記第1の画像よりも前に撮影された第2の画像に関する前記特徴量である第2特徴量と、から算出される重みを用いて前記追跡対象を照合する、付記1に記載の情報処理装置である。
(Appendix 2)
The information processing device described in Appendix 2 is the information processing device described in Appendix 1, in which the cross-attention mechanism matches the tracked target using a weight calculated from a first feature that is the feature related to a first image and a second feature that is the feature related to a second image that was taken before the first image.

 (付記3)
 付記3に記載の情報処理装置は、前記交差注意機構は、前記第1特徴量と前記第2特徴量との行列積を演算して得られる類似性行列を前記重みとして用いることで前記追跡対象を照合する、付記2に記載の情報処理装置である。
(Appendix 3)
The information processing device described in Supplementary Note 3 is the information processing device described in Supplementary Note 2, in which the cross-attention mechanism matches the tracked target by using a similarity matrix obtained by calculating a matrix product of the first feature and the second feature as the weight.

 (付記4)
 付記4に記載の情報処理装置は、前記特徴量を記憶可能な記憶手段と、前記更新手段で更新された前記特徴量に基づいて、前記記憶手段に記憶された前記特徴量を更新する記憶更新手段と、を更に備える付記1から3のいずれか一項に記載の情報処理装置である。
(Appendix 4)
The information processing device described in Supplementary Note 4 is the information processing device described in any one of Supplementary Notes 1 to 3, further comprising: a storage means capable of storing the feature; and a storage update means that updates the feature stored in the storage means based on the feature updated by the update means.

 (付記5)
 付記5に記載の情報処理装置は、1つの動画に含まれる複数フレームを1つのミニバッチに変換して学習データとし、前記追跡対象の照合に関する学習を行う学習部を更に備える、付記1から3のいずれか一項に記載の情報処理装置である。
(Appendix 5)
The information processing device described in Supplementary Note 5 is the information processing device described in any one of Supplementary Notes 1 to 3, further including a learning unit that converts multiple frames included in one video into one mini-batch to use as training data, and performs learning related to matching of the tracked target.

 (付記6)
 付記6に記載の情報処理方法は、少なくとも1つのコンピュータによって、動画から画像を取得し、前記画像に含まれる追跡対象の位置を検出し、前記追跡対象の位置に関する位置情報を、前記位置情報の特徴を示す特徴量に変換し、前記特徴量を、複数の前記画像間で前記追跡対象を照合可能な交差注意機構を用いて更新し、更新された前記特徴量を前記位置情報に復元する、情報処理方法である。
(Appendix 6)
The information processing method described in Supplementary Note 6 is an information processing method that, by at least one computer, acquires an image from a video, detects a position of a tracked target included in the image, converts position information regarding the position of the tracked target into a feature amount indicating characteristics of the position information, updates the feature amount using a cross-attention mechanism that can match the tracked target between a plurality of the images, and restores the updated feature amount to the position information.

 (付記7)
 付記7に記載の記録媒体は、少なくとも1つのコンピュータに、動画から画像を取得し、前記画像に含まれる追跡対象の位置を検出し、前記追跡対象の位置に関する位置情報を、前記位置情報の特徴を示す特徴量に変換し、前記特徴量を、複数の前記画像間で前記追跡対象を照合可能な交差注意機構を用いて更新し、更新された前記特徴量を前記位置情報に復元する、情報処理方法を実行させるコンピュータプログラムが記録された記録媒体である。
(Appendix 7)
The recording medium described in Supplementary Note 7 is a recording medium having recorded thereon a computer program for causing at least one computer to execute an information processing method of acquiring images from a video, detecting a position of a tracked target included in the images, converting position information regarding the position of the tracked target into feature quantities indicating characteristics of the position information, updating the feature quantities using a cross-attention mechanism capable of matching the tracked target between a plurality of the images, and restoring the updated feature quantities to the position information.

 (付記8)
 付記8に記載のコンピュータプログラムは、少なくとも1つのコンピュータに、動画から画像を取得し、前記画像に含まれる追跡対象の位置を検出し、前記追跡対象の位置に関する位置情報を、前記位置情報の特徴を示す特徴量に変換し、前記特徴量を、複数の前記画像間で前記追跡対象を照合可能な交差注意機構を用いて更新し、更新された前記特徴量を前記位置情報に復元する、情報処理方法を実行させるコンピュータプログラムである。
(Appendix 8)
The computer program described in Supplementary Note 8 is a computer program that causes at least one computer to execute an information processing method of acquiring images from a video, detecting a position of a tracked target included in the images, converting position information regarding the position of the tracked target into features indicating characteristics of the position information, updating the features using a cross-attention mechanism that can match the tracked target between a plurality of the images, and restoring the updated features to the position information.

 (付記9)
 付記9に記載の追跡装置は、動画から画像を取得する取得手段と、前記画像に含まれる追跡対象の位置を検出する検出手段と、前記追跡対象の位置に関する位置情報を、前記位置情報の特徴を示す特徴量に変換する変換手段と、前記特徴量を、複数の前記画像間で前記追跡対象を照合可能な交差注意機構を用いて更新する更新手段と、更新された前記特徴量を前記位置情報に復元する復元手段と、復元された前記位置情報に基づいて前記追跡対象を追跡する追跡手段と、を備える追跡装置である。
(Appendix 9)
The tracking device described in Supplementary Note 9 is a tracking device including: an acquisition means for acquiring an image from a video; a detection means for detecting a position of a tracking target included in the image; a conversion means for converting position information regarding the position of the tracking target into a feature quantity indicating characteristics of the position information; an update means for updating the feature quantity using a cross-attention mechanism capable of matching the tracking target between a plurality of the images; a restoration means for restoring the updated feature quantity to the position information; and a tracking means for tracking the tracking target based on the restored position information.

 (付記10)
 付記10に記載の追跡方法は、少なくとも1つのコンピュータによって、動画から画像を取得し、前記画像に含まれる追跡対象の位置を検出し、前記追跡対象の位置に関する位置情報を、前記位置情報の特徴を示す特徴量に変換し、前記特徴量を、複数の前記画像間で前記追跡対象を照合可能な交差注意機構を用いて更新し、更新された前記特徴量を前記位置情報に復元し、復元された前記位置情報に基づいて前記追跡対象を追跡する、追跡方法である。
(Appendix 10)
The tracking method described in Supplementary Note 10 is a tracking method that, by at least one computer, acquires an image from a video, detects a position of a tracked object included in the image, converts position information regarding the position of the tracked object into a feature amount indicating characteristics of the position information, updates the feature amount using a cross-attention mechanism that can match the tracked object between a plurality of the images, restores the updated feature amount to the position information, and tracks the tracked object based on the restored position information.

 (付記11)
 付記11に記載の記録媒体は、少なくとも1つのコンピュータに、動画から画像を取得し、前記画像に含まれる追跡対象の位置を検出し、前記追跡対象の位置に関する位置情報を、前記位置情報の特徴を示す特徴量に変換し、前記特徴量を、複数の前記画像間で前記追跡対象を照合可能な交差注意機構を用いて更新し、更新された前記特徴量を前記位置情報に復元し、復元された前記位置情報に基づいて前記追跡対象を追跡する、追跡方法を実行させるコンピュータプログラムが記録された記録媒体である。
(Appendix 11)
The recording medium described in Supplementary Note 11 is a recording medium having recorded thereon a computer program for causing at least one computer to execute a tracking method of acquiring images from a video, detecting a position of a tracked target included in the images, converting position information regarding the position of the tracked target into feature amounts indicating characteristics of the position information, updating the feature amounts using a cross-attention mechanism capable of matching the tracked target between a plurality of the images, restoring the updated feature amounts to the position information, and tracking the tracked target based on the restored position information.

 (付記12)
 付記12に記載のコンピュータプログラムは、少なくとも1つのコンピュータに、動画から画像を取得し、前記画像に含まれる追跡対象の位置を検出し、前記追跡対象の位置に関する位置情報を、前記位置情報の特徴を示す特徴量に変換し、前記特徴量を、複数の前記画像間で前記追跡対象を照合可能な交差注意機構を用いて更新し、更新された前記特徴量を前記位置情報に復元し、復元された前記位置情報に基づいて前記追跡対象を追跡する、追跡方法を実行させるコンピュータプログラムである。
(Appendix 12)
The computer program described in Supplementary Note 12 is a computer program that causes at least one computer to execute a tracking method of acquiring images from a video, detecting a position of a tracked target included in the images, converting position information regarding the position of the tracked target into feature amounts indicating characteristics of the position information, updating the feature amounts using a cross-attention mechanism that can match the tracked target across a plurality of the images, restoring the updated feature amounts to the position information, and tracking the tracked target based on the restored position information.

 この開示は、請求の範囲及び明細書全体から読み取ることのできる発明の要旨又は思想に反しない範囲で適宜変更可能であり、そのような変更を伴う情報処理装置、情報処理方法、及び記録媒体もまたこの開示の技術思想に含まれる。 This disclosure may be modified as appropriate within the scope of the claims and the spirit or concept of the invention as can be read from the entire specification, and information processing devices, information processing methods, and recording media incorporating such modifications are also included within the technical concept of this disclosure.

 10 情報処理装置
 11 プロセッサ
 12 RAM
 13 ROM
 14 記憶装置
 15 入力装置
 16 出力装置
 17 データバス
 50 追跡モデル
 110 画像取得部
 120 対象位置検出部
 130 特徴量変換部
 140 特徴量更新部
 150 位置情報復元部
 155 記憶部
 160 学習部
 200 交差注意機構
 210 クエリ
 220 キー
 230 バリュー
 240 行列積演算部
 250 正規化部
 260 行列積演算部
 270 残差処理部
 280 記憶更新部
10 Information processing device 11 Processor 12 RAM
13 ROM
14 Storage device 15 Input device 16 Output device 17 Data bus 50 Tracking model 110 Image acquisition unit 120 Target position detection unit 130 Feature conversion unit 140 Feature update unit 150 Position information restoration unit 155 Storage unit 160 Learning unit 200 Intersection attention mechanism 210 Query 220 Key 230 Value 240 Matrix multiplication unit 250 Normalization unit 260 Matrix multiplication unit 270 Residual processing unit 280 Memory update unit

Claims (7)

An information processing device comprising:
an acquisition means for acquiring images from a video;
a detection means for detecting a position of a tracked object included in the images;
a conversion means for converting position information relating to the position of the tracked object into a feature amount indicating a feature of the position information;
an updating means for updating the feature amount using a cross-attention mechanism capable of matching the tracked object between a plurality of the images; and
a restoration means for restoring the updated feature amount to the position information.
The information processing device according to claim 1, wherein the cross-attention mechanism matches the tracked object using a weight calculated from a first feature amount, which is the feature amount relating to a first image, and a second feature amount, which is the feature amount relating to a second image captured before the first image.
The information processing device according to claim 2, wherein the cross-attention mechanism matches the tracked object by using, as the weight, a similarity matrix obtained by calculating a matrix product of the first feature amount and the second feature amount.
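As a minimal, non-authoritative sketch of the weighting described in claims 2 and 3: the first feature amount is used as the query, the second feature amount as the key and value, their matrix product yields a similarity matrix, and a normalized form of that matrix weights the values. The softmax normalization and the residual connection below mirror the components named in the reference numerals (240 to 270), but the exact operations, shapes, and function names are assumptions.

```python
# Illustrative cross-attention sketch: the similarity matrix Q @ K.T acts as the
# matching weight between objects in the current image and the earlier image.
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(first_features, second_features):
    """first_features: (N1, D) from the first image (query);
    second_features: (N2, D) from the earlier second image (key and value)."""
    q, k, v = first_features, second_features, second_features
    similarity = q @ k.T                     # (N1, N2) similarity matrix (matrix product)
    weights = softmax(similarity, axis=-1)   # normalization step (softmax is an assumption)
    attended = weights @ v                   # weighted sum of earlier features
    return first_features + attended         # residual connection (an assumption)
```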
The information processing device according to any one of claims 1 to 3, further comprising:
a storage means capable of storing the feature amount; and
a storage update means for updating the feature amount stored in the storage means based on the feature amount updated by the updating means.
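A toy sketch of the storage update in claim 4 follows. The disclosure does not specify the update rule, so the exponential moving average used here is purely an assumption intended to show where the stored feature amounts would be refreshed.

```python
# Illustrative storage update sketch: refresh stored feature amounts with the
# feature amounts produced by the updating means (the EMA rule is an assumption).
import numpy as np

class FeatureStorage:
    def __init__(self, momentum=0.9):
        self.features = None       # stored feature amounts, assumed shape (N, D)
        self.momentum = momentum

    def update(self, updated_features):
        if self.features is None or self.features.shape != updated_features.shape:
            # First frame, or the number of tracked objects changed: store as-is.
            self.features = updated_features.copy()
        else:
            # Blend the previous stored features with the newly updated ones.
            self.features = (self.momentum * self.features
                             + (1.0 - self.momentum) * updated_features)
        return self.features
```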
The information processing device according to any one of claims 1 to 3, further comprising a learning unit that converts a plurality of frames included in one video into one mini-batch as training data and performs learning related to matching of the tracked object.
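One way to read claim 5 is sketched below: consecutive frames drawn from a single video are stacked into one mini-batch before being fed to training. The frame count, the contiguous sampling strategy, and the function name are assumptions, not part of the disclosure.

```python
# Illustrative sketch: build one mini-batch from consecutive frames of one video
# (frames_per_batch and contiguous sampling are assumptions).
import numpy as np

def video_to_minibatch(video_frames, frames_per_batch=8, start=0):
    """video_frames: sequence of H x W x C arrays taken from a single video."""
    clip = video_frames[start:start + frames_per_batch]
    return np.stack(clip, axis=0)   # shape: (frames_per_batch, H, W, C)
```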
An information processing method comprising, by at least one computer:
acquiring images from a video;
detecting a position of a tracked object included in the images;
converting position information relating to the position of the tracked object into a feature amount indicating a feature of the position information;
updating the feature amount using a cross-attention mechanism capable of matching the tracked object between a plurality of the images; and
restoring the updated feature amount to the position information.
A recording medium having recorded thereon a computer program for causing at least one computer to execute an information processing method comprising:
acquiring images from a video;
detecting a position of a tracked object included in the images;
converting position information relating to the position of the tracked object into a feature amount indicating a feature of the position information;
updating the feature amount using a cross-attention mechanism capable of matching the tracked object between a plurality of the images; and
restoring the updated feature amount to the position information.
PCT/JP2024/008288 2024-03-05 2024-03-05 Information processing device, information processing method, and recording medium Pending WO2025186903A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2024/008288 WO2025186903A1 (en) 2024-03-05 2024-03-05 Information processing device, information processing method, and recording medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2024/008288 WO2025186903A1 (en) 2024-03-05 2024-03-05 Information processing device, information processing method, and recording medium

Publications (2)

Publication Number Publication Date
WO2025186903A1 true WO2025186903A1 (en) 2025-09-12
WO2025186903A8 WO2025186903A8 (en) 2025-10-02

Family

ID=96990125

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2024/008288 Pending WO2025186903A1 (en) 2024-03-05 2024-03-05 Information processing device, information processing method, and recording medium

Country Status (1)

Country Link
WO (1) WO2025186903A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021130951A1 * 2019-12-26 2021-07-01 NEC Corporation Object-tracking device, object-tracking method, and recording medium
JP2021189857A * 2020-06-01 2021-12-13 Canon Inc. Information processing equipment, information processing methods, and programs
WO2022102083A1 * 2020-11-13 2022-05-19 NEC Corporation Information processing device, information processing method, and computer program

Also Published As

Publication number Publication date
WO2025186903A8 (en) 2025-10-02

Similar Documents

Publication Publication Date Title
CN111091176A (en) Data recognition device and method and training device and method
US20220004904A1 (en) Deepfake detection models utilizing subject-specific libraries
KR20210077464A (en) Method and system for detecting duplicated document using vector quantization
Peng et al. BDNN: Binary convolution neural networks for fast object detection
JPWO2019220620A1 (en) Anomaly detection device, anomaly detection method and program
KR20200076461A (en) Method and apparatus for processing neural network based on nested bit representation
CN118095359B (en) Large language model training method and device for privacy protection, medium and equipment
KR20200083119A (en) User verification device and method
KR102029860B1 (en) Method for tracking multi objects by real time and apparatus for executing the method
US20210232855A1 (en) Movement state recognition model training device, movement state recognition device, methods and programs therefor
WO2022190301A1 (en) Learning device, learning method, and computer-readable medium
WO2025186903A1 (en) Information processing device, information processing method, and recording medium
US20240371142A1 (en) Video conferencing device and image quality verifying method thereof
KR102533512B1 (en) Personal information object detection method and device
Karathanasis et al. A Comparative Analysis of Compression and Transfer Learning Techniques in DeepFake Detection Models.
JP7661981B2 (en) Information processing device, information processing method, and computer program
KR20250034834A (en) Method and system for enhancing image understanding
KR102829378B1 (en) Method and apparatus for re-registering pre-registered identity information in new identity recognition system
KR20240066032A (en) Method and apparatus for analyzing object from image using attention network
KR20230054182A (en) Person re-identification method using artificial neural network and computing apparatus for performing the same
CN116888665A (en) Electronic equipment and control methods
US20250014196A1 (en) Scene flow estimation apparatus and method
US20240013407A1 (en) Information processing apparatus, information processing method, and non-transitory computer-readable storage medium
JP7683812B2 (en) Learning device, learning method, tracking device, tracking method, and recording medium
CN119091470B (en) A video-based single-stage multi-person two-dimensional human posture estimation method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24928563

Country of ref document: EP

Kind code of ref document: A1