
WO2021095213A1 - Learning method, learning program, and learning device - Google Patents

Learning method, learning program, and learning device

Info

Publication number
WO2021095213A1
WO2021095213A1 (PCT/JP2019/044771)
Authority
WO
WIPO (PCT)
Prior art keywords
modal
information
output value
feature amount
processing model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/JP2019/044771
Other languages
French (fr)
Japanese (ja)
Inventor
Yuichi Kamata
Akira Nakagawa
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd filed Critical Fujitsu Ltd
Priority to PCT/JP2019/044771 priority Critical patent/WO2021095213A1/en
Publication of WO2021095213A1 publication Critical patent/WO2021095213A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0499 Feedforward networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning

Definitions

  • the present invention relates to a learning method, a learning program, and a learning device.
  • BERT stands for Bidirectional Encoder Representations from Transformers.
  • A modal is a concept indicating the style and type of information; specific examples include images, documents (texts), and sounds.
  • Machine learning using multiple modals is called multimodal learning. Further, among multimodal learning, learning the co-occurrence relationship between a plurality of modals is sometimes called cross-modal learning.
  • For example, there is a prior technique in which a gesture feature is input and a model is generated for classifying whether or not the input gesture feature corresponds to a word.
  • There is also a prior technique that uses a word co-occurrence vector, whose elements are the appearance frequencies of co-occurring words near a word, and an image co-occurrence vector, whose elements are the appearance frequencies of co-occurring words near an image.
  • Further, there is a technique that recognizes sign language elements, which are the individual components of sign language movements, from time-series data of hand movements, and then recognizes sign language words from the recognized sign language elements.
  • See, for example, Japanese Unexamined Patent Publication No. 2018-163400, Japanese Unexamined Patent Publication No. 2002-132823, and Japanese Unexamined Patent Publication No. 9-34863.
  • However, with the prior art, the model may not have an expression space that expresses features based on the relationship between modal information about images and modal information about language, and may therefore not be useful in solving a problem.
  • the present invention aims to learn a useful model for extracting features from modal information.
  • According to one aspect, a learning method is proposed in which a feature amount is extracted from the information of a first modal, and the extracted feature amount is converted based on a parameter to acquire a new feature amount; the acquired new feature amount is input to a first processing model relating to the first modal to acquire a first output value; another feature amount extracted from the information of a second modal different from the first modal is input to a second processing model relating to the second modal to acquire a second output value; the acquired first output value and second output value are input to a third processing model relating to the first modal and the second modal to acquire a third output value; and the parameter is updated based on the acquired third output value.
  • A learning program and a learning device corresponding to this learning method are also proposed. The overall flow is sketched below.
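  • As a minimal sketch of this flow, the following Python code (PyTorch assumed; all architectures and dimensions are placeholders, not part of the original disclosure) wires the extraction, conversion, and three processing models together up to the third output value:

```python
import torch

# Hypothetical stand-ins for the models named above; the patent does not
# prescribe these architectures or dimensions.
extract_model = torch.nn.Linear(2048, 512)  # extracts a feature from the first modal
conversion = torch.nn.Linear(512, 512)      # conversion model holding the parameters
first_model = torch.nn.Linear(512, 512)     # first processing model (first modal)
second_model = torch.nn.Linear(512, 512)    # second processing model (second modal)
third_model = torch.nn.Linear(1024, 512)    # third processing model (both modals)

def forward(first_modal_info, second_modal_feature):
    f = extract_model(first_modal_info)              # extract a feature amount
    f_new = f + conversion(f)                        # convert it based on the parameters
    out1 = first_model(f_new)                        # first output value
    out2 = second_model(second_modal_feature)        # second output value
    out3 = third_model(torch.cat([out1, out2], -1))  # third output value
    return out3  # the parameters are then updated based on this value
```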
  • FIG. 1 is an explanatory diagram showing an example of the learning method according to an embodiment.
  • FIG. 2 is an explanatory diagram showing an example of the information processing system 200.
  • FIG. 3 is a block diagram showing a hardware configuration example of the learning device 100.
  • FIG. 4 is a block diagram showing a functional configuration example of the learning device 100.
  • FIG. 5 is a block diagram showing a specific functional configuration example of the learning device 100.
  • FIG. 6 is an explanatory diagram showing an example of learning the integrated model 600.
  • FIG. 7 is an explanatory diagram showing an example of converting the image feature quantity sequence F.
  • FIG. 8 is an explanatory diagram showing a specific example of the calculation for converting the image feature quantity sequence F.
  • FIG. 9 is an explanatory diagram showing another example of converting the image feature quantity sequence.
  • FIG. 10 is an explanatory diagram showing an example in which the learning device 100 uses the integrated model 1000.
  • FIG. 11 is a flowchart showing an example of the learning processing procedure.
  • FIG. 1 is an explanatory diagram showing an example of the learning method according to an embodiment.
  • The learning device 100 is a computer that learns a model useful for extracting a feature amount from the information of a predetermined modal, which can be used when solving a problem using the information of that modal.
  • For example, there is BERT as a pre-learning model. BERT is formed by stacking the encoder portions of the Transformer.
  • the following Non-Patent Document 1 can be referred to.
  • Non-Patent Document 1: Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL-HLT (2019).
  • However, BERT is intended to be applied to situations where a problem is solved using modal information about language, and cannot be applied to situations where a problem is solved using the information of a plurality of modals.
  • There is also VideoBERT, which extends BERT so that it can be applied to situations where problems are solved using modal information related to images as well as modal information related to language.
  • There is also CBT (Contrastive Bidirectional Transformer) for temporal representation learning.
  • CBT is formed by a language processing model that learns the co-occurrence relationships of language features, an image processing model that learns the co-occurrence relationships of image features, and a cross-modal processing model that integrates the outputs of the language processing model and the image processing model and learns the co-occurrence relationship between language and images.
  • Non-Patent Document 2 can be referred to.
  • Non-Patent Document 2: Sun, Chen, et al. "Contrastive Bidirectional Transformer for Temporal Representation Learning."
  • the various models mentioned above may not be useful models for solving problems.
  • the various models described above may not be useful models for extracting features from modal information when solving a problem using modal information.
  • For example, CBT does not have an expression space that expresses features based on the relationship between modal information about images and modal information about language, and thus may not be a useful model for solving a problem.
  • Specifically, the image feature amount has an expression space that reflects modal information about images, but not one that reflects modal information about language. Therefore, even if CBT is pre-learned, the image processing model included in CBT does not become a model that can effectively utilize modal information about language, and does not become a useful model for solving a problem. Further, the image feature amount has the property that, even when the same object is captured, it can differ between images with different appearances. For this reason, when reflecting the characteristics of modal information about language in the image feature amount, it is necessary to update not one image feature amount but various image feature amount expressions, which makes effective updating difficult and may adversely affect solving the problem.
  • the learning device 100 has, for example, a model 101.
  • the model 101 has an extraction model 111, a conversion model 112, a first processing model 121, a second processing model 122, and a third processing model 123.
  • the extraction model 111, the conversion model 112, and the first processing model 121 relate to the first modal.
  • the second processing model 122 relates to a second modal.
  • the third processing model 123 relates to a first modal and a second modal.
  • the learning device 100 acquires the information of the first modal and the information of the second modal.
  • Modal means a form of information.
  • the first modal and the second modal are different modals.
  • the first modal is, for example, an image modal. If the first modal is about an image, the information in the first modal is, for example, an image.
  • the second modal is, for example, a modal related to language. If the second modal is about language, the information in the second modal is, for example, a document.
  • the learning device 100 uses the extraction model 111 to extract the feature amount from the first modal information.
  • the learning device 100 extracts, for example, an image feature amount from an image.
  • the image feature amount is represented by, for example, a vector indicating an array.
  • the learning device 100 acquires a new feature amount by converting the extracted feature amount based on the parameter using the conversion model 112. There are a plurality of parameters, for example.
  • The learning device 100 calculates, for example, a correction amount for correcting the extracted image feature amount based on the extracted image feature amount and a plurality of parameters, and adds the correction amount to the extracted image feature amount to obtain a new image feature amount.
  • the learning device 100 acquires the first output value by inputting the acquired new feature amount into the first processing model 121.
  • the first processing model 121 is, for example, an image processing model.
  • the learning device 100 acquires the first output value by inputting a new image feature amount into the image processing model, for example.
  • the learning device 100 acquires a second output value by inputting another feature amount extracted from the second modal information into the second processing model 122.
  • the second processing model 122 is, for example, a language processing model.
  • the learning device 100 acquires a second output value by inputting, for example, a language feature amount extracted from a document into a language processing model.
  • Language features are represented, for example, by vectors representing arrays.
  • the learning device 100 acquires the third output value by inputting the acquired first output value and the second output value into the third processing model 123.
  • the third processing model 123 is, for example, a cross-modal processing model.
  • the cross-modal processing model integrates information from multiple modals and learns co-occurrence of information from multiple modals.
  • the learning device 100 acquires the third output value by inputting the first output value and the second output value into the cross-modal processing model, for example.
  • the learning device 100 updates a plurality of parameters based on the acquired third output value.
  • the learning device 100 updates a plurality of parameters by the error back propagation method, for example, based on the third output value.
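  • A sketch of this update step using autograd (the loss function is an assumption; the patent states only that the parameters are updated from the third output value by the error back propagation method):

```python
import torch

params = torch.randn(16, 512, requires_grad=True)  # the plurality of parameters
optimizer = torch.optim.SGD([params], lr=1e-3)

def update_parameters(third_output, target):
    # third_output must have been computed from params so that gradients flow back.
    loss = torch.nn.functional.mse_loss(third_output, target)  # assumed loss function
    optimizer.zero_grad()
    loss.backward()    # error back propagation from the third output value
    optimizer.step()   # update the plurality of parameters
```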
  • the learning device 100 may output a plurality of updated parameters.
  • the learning device 100 can obtain a plurality of parameters useful from the viewpoint of handling the first modal information and the second modal information in solving the problem.
  • For example, the learning device 100 explicitly prepares a plurality of parameters capable of reflecting the features of the information of the second modal, and can effectively utilize the characteristics of the information of the second modal when acquiring a new feature amount from the information of the first modal. Further, the learning device 100 does not directly reflect the features of the information of the second modal in the feature amount itself extracted from the information of the first modal, which can reduce adverse effects when solving the problem.
  • the learning device 100 may further update the first processing model 121 based on the third output value.
  • the learning device 100 can obtain a first processing model 121 that is useful from the viewpoint of handling the first modal information in solving the problem. Then, the learning device 100 can use the model 101 when solving a problem by using the information of the first modal and the information of the second modal, and can improve the accuracy of the obtained solution.
  • The learning device 100 may separate the extraction model 111, the conversion model 112, and the first processing model 121 from the model 101. According to this, the learning device 100 can obtain a useful combination of the extraction model 111, the conversion model 112, and the first processing model 121. Then, the learning device 100 may use the separated extraction model 111, conversion model 112, and first processing model 121 when solving a problem using the information of the first modal, to improve the accuracy of the obtained solution.
  • the learning device 100 may further update the second processing model 122 and the third processing model 123.
  • the learning device 100 can obtain a second processing model 122 that is useful from the viewpoint of handling the second modal information in solving the problem.
  • the learning device 100 can obtain a third processing model 123 that is useful from the viewpoint of integrating the information of the first modal and the information of the second modal in solving the problem.
  • As a result, the learning device 100 can obtain a useful model 101 in which the extraction model 111, the conversion model 112, the first processing model 121, the second processing model 122, and the third processing model 123 are combined. Then, the learning device 100 can use the model 101 when solving a problem using the information of the first modal and the information of the second modal, and can improve the accuracy of the obtained solution.
  • the learning device 100 may separate the conversion model 112 from the model 101. According to this, the learning device 100 can obtain a useful conversion model 112. Then, the learning device 100 may use the separated conversion model 112 when solving the problem using the information of the first modal to improve the accuracy of the obtained solution.
  • The learning device 100 may separate the extraction model 111, the conversion model 112, the first processing model 121, and the second processing model 122 from the model 101. According to this, the learning device 100 can obtain a useful combination of these four models. Then, the learning device 100 may use the separated extraction model 111, conversion model 112, first processing model 121, and second processing model 122 when solving a problem using the information of the first modal and the information of the second modal, to improve the accuracy of the obtained solution.
  • the learning device 100 can obtain a useful model.
  • The useful model is, for example, any one of the updated extraction model 111, conversion model 112, first processing model 121, second processing model 122, and third processing model 123, or a combination of two or more of them.
  • FIG. 2 is an explanatory diagram showing an example of the information processing system 200.
  • the information processing system 200 includes a learning device 100, a client device 201, and a terminal device 202.
  • the learning device 100 and the client device 201 are connected via a wired or wireless network 210.
  • the network 210 is, for example, a LAN (Local Area Network), a WAN (Wide Area Network), the Internet, or the like. Further, in the information processing system 200, the learning device 100 and the terminal device 202 are connected via a wired or wireless network 210.
  • the learning device 100 stores an integrated model that accepts input of the first modal information and the second modal information.
  • the stored integrated model corresponds to, for example, the model 101 shown in FIG.
  • the learning device 100 updates the integrated model based on the teacher data.
  • the teacher data is, for example, correspondence information in which the sample first modal information, the sample second modal information, and the correct answer data are associated with each other.
  • the teacher data is input to the learning device 100 by the user of the learning device 100, for example.
  • the correct answer data indicates, for example, the correct answer for the output value of the integrated model.
  • the correct answer data may indicate the correct answer for the solution obtained by solving the problem based on the output value of the integrated model. If the first modal is about an image, the information in the first modal is an image. If the second modal is about language, the information in the second modal is a document.
  • the update of the integrated model is realized by, for example, the error back propagation method.
  • the update of the integrated model may be realized by, for example, a learning method other than error back propagation.
  • the learning device 100 acquires the information of the first modal and the information of the second modal when solving the problem.
  • the learning device 100 acquires, for example, first modal information input to the learning device 100 by the user of the learning device 100. Further, the learning device 100 may acquire the first modal information by receiving the information from the client device 201 or the terminal device 202.
  • the learning device 100 acquires, for example, second modal information input to the learning device 100 by the user of the learning device 100. Further, the learning device 100 may acquire the second modal information by receiving the information from the client device 201 or the terminal device 202.
  • The learning device 100 solves the problem based on the acquired information of the first modal and the information of the second modal by using the updated integrated model, and transmits the obtained solution to the client device 201.
  • The learning device 100 may further fine-tune the updated integrated model before using it to solve the problem.
  • the learning device 100 is, for example, a server, a PC (Personal Computer), or the like.
  • the client device 201 is a computer capable of communicating with the learning device 100.
  • the client device 201 may, for example, transmit the first modal information to the learning device 100. Further, the client device 201 may transmit, for example, second modal information to the learning device 100.
  • the client device 201 receives and outputs the solution obtained by the learning device 100 solving the problem.
  • the output format is, for example, display on a display, print output to a printer, transmission to another computer, or storage in a storage area.
  • the client device 201 is, for example, a PC, a tablet terminal, a smartphone, or the like.
  • the terminal device 202 is a computer capable of communicating with the learning device 100.
  • the terminal device 202 may, for example, transmit the first modal information to the learning device 100.
  • the terminal device 202 may transmit, for example, second modal information to the learning device 100.
  • the terminal device 202 is, for example, a PC, a tablet terminal, a smartphone, an electronic device, an IoT (Internet of Things) device, a sensor device, or the like.
  • the terminal device 202 may be a surveillance camera.
  • Here, the case where the learning device 100 both updates the integrated model and solves the problem using the integrated model has been described, but the present invention is not limited to this.
  • For example, another computer may update the integrated model, and the learning device 100 may solve the problem using the integrated model received from the other computer.
  • Conversely, the learning device 100 may update the integrated model and provide it to another computer, and the problem may be solved on the other computer using the integrated model.
  • Here, the case where the learning device 100 is a device different from the client device 201 and the terminal device 202 has been described, but the present invention is not limited to this.
  • the learning device 100 may be integrated with the client device 201.
  • the learning device 100 may be integrated with the terminal device 202.
  • Here, the case where the learning device 100 realizes the integrated model in software has been described, but the present invention is not limited to this.
  • For example, the learning device 100 may realize the integrated model by electronic circuitry.
  • the terminal device 202 is a surveillance camera, and transmits an image of the target to the learning device 100.
  • the object is specifically the appearance of the fitting room.
  • the learning device 100 stores a document that serves as an explanatory text about the target.
  • Specifically, the explanatory text is a document describing tendencies such as that the curtain of the fitting room tends to be closed while a person is using the fitting room, and that shoes tend to be placed in front of the fitting room while a person is using it.
  • the learning device 100 solves the problem of determining the degree of risk based on the image and the document by using the model.
  • The degree of risk is, for example, an index value indicating how likely it is that a person who has not completed evacuation remains in the fitting room.
  • The degree of risk may also be, for example, a binary value indicating whether or not any person who has not completed evacuation remains in the fitting room.
  • FIG. 3 is a block diagram showing a hardware configuration example of the learning device 100.
  • the learning device 100 includes a CPU (Central Processing Unit) 301, a memory 302, a network I / F (Interface) 303, a recording medium I / F 304, and a recording medium 305. Further, each component is connected by a bus 300.
  • the CPU 301 controls the entire learning device 100.
  • the memory 302 includes, for example, a ROM (Read Only Memory), a RAM (Random Access Memory), a flash ROM, and the like. Specifically, for example, a flash ROM or ROM stores various programs, and RAM is used as a work area of CPU 301. The program stored in the memory 302 is loaded into the CPU 301 to cause the CPU 301 to execute the coded process.
  • the network I / F 303 is connected to the network 210 through a communication line, and is connected to another computer via the network 210. Then, the network I / F 303 controls the internal interface with the network 210 and controls the input / output of data from another computer.
  • the network I / F 303 is, for example, a modem or a LAN adapter.
  • the recording medium I / F 304 controls data read / write to the recording medium 305 according to the control of the CPU 301.
  • the recording medium I / F 304 is, for example, a disk drive, an SSD (Solid State Drive), a USB (Universal Serial Bus) port, or the like.
  • the recording medium 305 is a non-volatile memory that stores data written under the control of the recording medium I / F 304.
  • the recording medium 305 is, for example, a disk, a semiconductor memory, a USB memory, or the like.
  • the recording medium 305 may be detachable from the learning device 100.
  • the learning device 100 may include, for example, a keyboard, a mouse, a display, a printer, a scanner, a microphone, a speaker, and the like, in addition to the above-described components. Further, the learning device 100 may have a plurality of recording media I / F 304 and recording media 305. Further, the learning device 100 does not have to have the recording medium I / F 304 or the recording medium 305.
  • the hardware configuration example of the client device 201 is specifically the same as the hardware configuration example of the learning device 100 shown in FIG. 3, so the description thereof will be omitted.
  • FIG. 4 is a block diagram showing a functional configuration example of the learning device 100.
  • The learning device 100 includes a storage unit 400, an acquisition unit 401, a first extraction unit 402, a conversion unit 403, a first processing unit 404, a second extraction unit 405, a second processing unit 406, a third processing unit 407, an update unit 408, a utilization unit 409, and an output unit 410.
  • the storage unit 400 is realized by, for example, a storage area such as the memory 302 or the recording medium 305 shown in FIG.
  • the storage unit 400 may be included in a device different from the learning device 100, and the stored contents of the storage unit 400 may be referred to by the learning device 100.
  • The acquisition unit 401 to the output unit 410 function as an example of a control unit. Specifically, the functions of the acquisition unit 401 to the output unit 410 are realized by causing the CPU 301 to execute a program stored in a storage area such as the memory 302 or the recording medium 305 shown in FIG. 3, or by the network I/F 303.
  • the processing result of each functional unit is stored in a storage area such as the memory 302 or the recording medium 305 shown in FIG. 3, for example.
  • the storage unit 400 stores various information referred to or updated in the processing of each functional unit.
  • the storage unit 400 stores an integrated model that accepts input of the first modal information and the second modal information.
  • the integrated model to be stored has, for example, a first extraction model, a transformation model, a first processing model, a second extraction model, a second processing model, and a third processing model.
  • the first extraction model, the transformation model, and the first processing model relate to the first modal.
  • the second extraction model and the second processing model relate to a second modal.
  • the third processing model relates to a first modal and a second modal.
  • the second modal is different from the first modal.
  • the first modal is an image modal and the second modal is a language modal.
  • the first modal is an image modal and the second modal is an audio modal.
  • the first modal is a modal for a first language and the second modal is a modal for a second language.
  • the acquisition unit 401 acquires various information used for processing of each functional unit.
  • the acquisition unit 401 stores various acquired information in the storage unit 400 or outputs it to each function unit. Further, the acquisition unit 401 may output various information stored in the storage unit 400 to each function unit.
  • the acquisition unit 401 acquires various information based on, for example, a user's operation input.
  • the acquisition unit 401 may receive various information from a device different from the learning device 100, for example.
  • the acquisition unit 401 acquires the information of the first modal and the information of the second modal.
  • The acquisition unit 401 acquires the information of the first modal and the information of the second modal by, for example, accepting input of the information of the first modal and the information of the second modal by the user.
  • The acquisition unit 401 may acquire, for example, the information of the first modal and the information of the second modal by receiving them from the client device 201 or the terminal device 202. The acquisition unit 401 may also acquire the information of the first modal and the information of the second modal by acquiring teacher data that includes them.
  • the acquisition unit 401 may accept a start trigger to start processing of any of the functional units.
  • the start trigger is, for example, that there is a predetermined operation input by the user.
  • the start trigger may be, for example, the receipt of predetermined information from another computer.
  • the start trigger may be, for example, that any functional unit outputs predetermined information.
  • the acquisition unit 401 receives, for example, the acquisition of the first modal information and the second modal information as a start trigger for starting the processing of each functional unit.
  • the first extraction unit 402 extracts the feature amount from the first modal information.
  • the first extraction unit 402 extracts, for example, an image feature amount from an image.
  • the extracted image feature amount is, for example, an image feature amount indicating an object appearing in the image.
  • the first extraction unit 402 can change the information of the first modal into a format that can be input to the conversion model.
  • the first extraction unit 402 can extract useful information for solving the problem from the first modal information.
  • the conversion unit 403 acquires a new feature amount by converting the extracted feature amount based on a plurality of parameters using the conversion model.
  • the plurality of parameters include, for example, a first feature amount and a second feature amount.
  • For example, the conversion unit 403 calculates the degree of coincidence between the extracted feature amount and the first feature amount, and acquires a new feature amount by converting the extracted feature amount based on an index value obtained by weighting the second feature amount according to the calculated degree of coincidence.
  • As a result, the conversion unit 403 prepares a plurality of parameters capable of reflecting the features of the information of the second modal, and can reflect the characteristics of the information of the second modal in the feature amount extracted from the information of the first modal via the plurality of parameters. A minimal sketch follows.
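  • A minimal sketch of this conversion, treating the first feature amounts as match targets and the second feature amounts as correction values (NumPy; names and shapes are assumptions, not from the original):

```python
import numpy as np

def convert_feature(f, first_feats, second_feats):
    """Convert one extracted feature f (D,) into a new feature amount.

    first_feats:  (N, D) parameters matched against f (degree of coincidence)
    second_feats: (N, D) parameters weighted by the degree of coincidence
    """
    scores = first_feats @ f              # degree of coincidence with each parameter
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax weighting
    index_value = weights @ second_feats  # weighted index value
    return f + index_value                # new feature amount
```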
  • The plurality of parameters may be, for example, the parameters of a neural network in which the number of nodes in the input layer and the number of nodes in the output layer are equal, and the number of nodes in the intermediate layer is larger than both.
  • This neural network corresponds to the conversion model.
  • the conversion unit 403 acquires a new feature amount by inputting the extracted feature amount into the neural network, for example.
  • In this case as well, the conversion unit 403 prepares a plurality of parameters capable of reflecting the features of the information of the second modal, and can reflect the characteristics of the information of the second modal in the feature amount extracted from the information of the first modal via the plurality of parameters.
  • the first processing unit 404 acquires the first output value by inputting the acquired new feature amount into the first processing model.
  • the first processing model is, for example, an image processing model.
  • the first processing unit 404 acquires the first output value, for example, by inputting a new image feature amount into the image processing model.
  • the first processing unit 404 can change the acquired new feature quantity into a format that can be input to the third processing model.
  • the first processing unit 404 can extract useful information for solving the problem from the acquired new features.
  • the second extraction unit 405 extracts other feature quantities from the second modal information.
  • the second extraction unit 405 extracts, for example, a language feature from a document.
  • the extracted linguistic feature is, for example, a linguistic feature indicating a word contained in a document.
  • the second extraction unit 405 can change the information of the second modal into a format that can be input to the second processing model.
  • the second extraction unit 405 can extract useful information for solving the problem from the second modal information.
  • the second processing unit 406 acquires the second output value by inputting the extracted other feature quantities into the second processing model.
  • the second processing model is, for example, a language processing model.
  • the second processing unit 406 acquires the second output value by inputting the language feature amount into the language processing model, for example.
  • the second processing unit 406 can change the extracted other features into a format that can be input to the third processing model.
  • the second processing unit 406 can extract useful information for solving the problem from the extracted other features.
  • the third processing unit 407 acquires the third output value by inputting the acquired first output value and the second output value into the third processing model.
  • the third processing model is, for example, a cross-modal processing model.
  • the third processing unit 407 acquires the third output value by inputting the first output value and the second output value into the cross-modal processing model, for example.
  • An example of acquiring the third output value will be specifically described later with reference to FIGS. 5 to 8.
  • the third processing unit 407 can obtain a third output value that integrates the features of the language and the image.
  • the update unit 408 updates a plurality of parameters based on the acquired third output value.
  • the update unit 408 updates a plurality of parameters by the error back propagation method, for example, based on the third output value.
  • the update unit 408 calculates the loss value based on the third output value by, for example, the loss function, and updates a plurality of parameters based on the loss value.
  • As a result, the update unit 408 can obtain a plurality of parameters useful from the viewpoint of handling the information of the first modal and the information of the second modal in solving the problem. For example, the update unit 408 can optimize the plurality of explicitly prepared parameters so that the features of the information of the second modal can be effectively utilized when acquiring a new feature amount from the information of the first modal.
  • the update unit 408 updates the first processing model based on the third output value.
  • the update unit 408 updates the first processing model by the error back propagation method, for example, based on the third output value.
  • the update unit 408 calculates the loss value based on the third output value by, for example, the loss function, and updates the first processing model based on the loss value.
  • the update unit 408 can obtain a first processing model that is useful from the viewpoint of handling the first modal information in solving the problem.
  • the update unit 408 updates the second processing model and the third processing model based on the third output value.
  • the update unit 408 updates the second processing model and the third processing model by the error back propagation method, for example, based on the third output value.
  • the update unit 408 calculates the loss value based on the third output value by the loss function, and updates the second processing model and the third processing model based on the loss value.
  • the update unit 408 can obtain a second processing model that is useful from the viewpoint of handling the second modal information in solving the problem.
  • the learning device 100 can obtain a third processing model that is useful from the viewpoint of integrating the information of the first modal and the information of the second modal in solving the problem.
  • The utilization unit 409 solves a predetermined problem.
  • the utilization unit 409 solves a predetermined problem in response to input of other information of the first modal by using, for example, a plurality of updated parameters and an unupdated first processing model.
  • the utilization unit 409 can use the updated plurality of parameters when solving the problem based on at least other information of the first modal, and can improve the accuracy of the obtained solution.
  • The utilization unit 409 solves a predetermined problem in response to input of other information of the first modal by using, for example, the updated plurality of parameters and the updated first processing model. As a result, the utilization unit 409 can use the updated plurality of parameters and the updated first processing model when solving a problem based on at least other information of the first modal, and can improve the accuracy of the obtained solution.
  • The utilization unit 409 solves the predetermined problem based on other information of the first modal and other information of the second modal by using, for example, the updated plurality of parameters, the updated first processing model, the updated second processing model, and the updated third processing model. As a result, the utilization unit 409 can use all of these updated components when solving the problem, and can improve the accuracy of the obtained solution.
  • the output unit 410 outputs the processing result of any of the functional units.
  • the output format is, for example, display on a display, print output to a printer, transmission to an external device by the network I / F 303, or storage in a storage area such as a memory 302 or a recording medium 305.
  • the output unit 410 can notify the user of the processing result of each functional unit, and can improve the convenience of the learning device 100.
  • the output unit 410 outputs, for example, a plurality of updated parameters.
  • the output unit 410 can refer to a plurality of updated parameters that are useful from the viewpoint of handling the first modal information and the second modal information in solving the problem.
  • the output unit 410 can make the updated plurality of parameters available to other computers, for example. Therefore, the output unit 410 can improve the accuracy of the solution obtained by solving the problem by using at least a plurality of parameters.
  • the output unit 410 may output, for example, the first processing model.
  • the output unit 410 can refer to the updated first processing model, which is useful from the viewpoint of handling the first modal information in solving the problem.
  • the output unit 410 can make the updated first processing model available to other computers, for example. Therefore, the output unit 410 can improve the accuracy of the solution obtained by solving the problem based on the information of the first modal by using at least the first processing model.
  • the output unit 410 may output, for example, a second processing model and a third processing model.
  • the output unit 410 can refer to the updated second processing model.
  • The output unit 410 can make the updated second processing model and third processing model available to other computers, for example. Therefore, the output unit 410 can improve the accuracy of the solution obtained by solving a problem based on the information of the first modal and the information of the second modal using at least the second processing model and the third processing model.
  • FIG. 5 is a block diagram showing a specific functional configuration example of the learning device 100.
  • the learning device 100 has an integrated model 500.
  • The integrated model 500 includes an image feature amount generation unit 501, a query generation unit 502, a table search unit 503, an image processing unit 504, a language processing unit 505, a cross-modal processing unit 506, and a loss function calculation unit 507.
  • the learning device 100 has a data table 510.
  • The data table 510 is a table that stores a searched key sequence and a cross-modal feature amount sequence.
  • the data table 510 is a table in which linguistic information is reflected by pre-learning.
  • The data table 510 is, for example, a table in which cross-modal feature amounts reflecting linguistic information are associated one-to-one with searched keys that quantize image feature amounts. Based on a search query generated from an image feature amount, the data table 510 is used to extract the linguistic information reflected in its cross-modal feature amounts, weighted according to the degree of relevance to the image feature amount.
  • the image feature amount generation unit 501 generates an image feature amount sequence from the input image and outputs it.
  • the image feature amount generation unit 501 detects, for example, an object appearing in the input image, and generates and outputs an image feature amount sequence including an image feature amount indicating each of the detected objects.
  • the image feature sequence is represented by, for example, a vector.
  • the image feature amount indicates visual feature information of an object including, for example, a color and a shape.
  • the query generation unit 502 generates and outputs a search query sequence based on the input image feature sequence.
  • The query generation unit 502 generates the search query sequence by, for example, multiplying the image feature amount sequence by a transformation matrix W_q.
  • The search query sequence is represented by, for example, a vector.
  • The table search unit 503 weights the cross-modal feature amounts of the data table 510 based on the input search query sequence, and calculates and outputs the result. For example, the table search unit 503 generates and outputs a new cross-modal feature amount sequence as a weighted average of the cross-modal feature amounts of the data table 510, with weights based on the inner products of the input search query sequence and the searched key sequence of the data table 510.
  • The image processing unit 504 converts the image feature amount sequence based on the input new cross-modal feature amount sequence, and outputs the result. For example, the image processing unit 504 adds the new cross-modal feature amount sequence to the image feature amount sequence, converts the resulting image feature amount sequence using the image processing model, and outputs the converted image feature amount sequence.
  • the language processing unit 505 generates and outputs a word embedding string based on the input word string.
  • the language processing unit 505 generates and outputs a word embedding string based on the input word string, for example, using a language processing model.
  • The cross-modal processing unit 506 integrates the input word embedding sequence and the input image feature amount sequence, and generates and outputs a new word embedding sequence and a new image feature amount sequence. For example, the cross-modal processing unit 506 performs this integration using the cross-modal processing model.
  • the loss function calculation unit 507 calculates the loss value using the loss function based on the input word embedding sequence and the input image feature amount sequence. Then, the learning device 100 updates the data table 510, the image processing model, the language processing model, and the cross-modal processing model by pre-learning the integrated model 500 based on the loss value.
  • the learning device 100 can reflect the language information in the data table 510.
  • the learning device 100 can reflect, for example, linguistic information in the cross-modal features of the data table 510.
  • the learning device 100 can reflect the loss value based on the word string in the cross-modal feature amount of the data table 510.
  • The learning device 100 explicitly prepares the data table 510 reflecting the linguistic information, and, when reflecting the linguistic information, can update the cross-modal feature amounts effectively based on the quantization by the searched keys.
  • the learning device 100 can prepare an expression space in consideration of linguistic information when handling an image, and the image processing unit 504 can generate a useful image feature quantity sequence.
  • FIG. 6 is an explanatory diagram showing an example of learning the integrated model 600.
  • the learning device 100 has an integrated model 600 that embodies the integrated model 500.
  • The integrated model 600 includes an Add unit 601, an image Transformer 602, a language Transformer 603, and a cross-modal Transformer 604.
  • the learning device 100 has a language information data table 610 that embodies the data table 510.
  • The language information data table 610 includes an array Query 611 in which the search query sequence is set, an array Key 621 in which the searched key sequence is set, and an array Value 622 in which the cross-modal feature amount sequence is set.
  • the learning device 100 performs pre-learning of the integrated model 600 by masking a part of the input and predicting the masked part.
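  • A small sketch of this input masking for an image feature sequence (the ratio and the random replacement are assumptions; the flowchart of FIG. 11 states only that features are replaced with irrelevant values at a constant ratio):

```python
import numpy as np

def mask_image_features(F, mask_ratio=0.15, rng=np.random.default_rng(0)):
    """Mask part of an (I, D) image feature sequence for pre-learning.

    Returns the masked sequence and the masked indices to be predicted.
    """
    F_masked = F.copy()
    n_mask = max(1, int(len(F) * mask_ratio))
    idx = rng.choice(len(F), size=n_mask, replace=False)
    F_masked[idx] = rng.normal(size=(n_mask, F.shape[1]))  # irrelevant values
    return F_masked, idx
```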
  • the learning device 100 converts the masked image feature amount sequence F by the Add unit 601 and outputs the converted image feature amount sequence F'.
  • The learning device 100 further converts the converted image feature amount sequence F' by the image Transformer 602, and outputs the converted image feature amount sequence F''.
  • The learning device 100 converts the masked language embedding sequence E by the language Transformer 603, and outputs the converted language embedding sequence E'.
  • The learning device 100 then inputs the converted image feature amount sequence F'' and the converted language embedding sequence E' into the cross-modal Transformer 604 to generate a new image feature amount sequence and a new language embedding sequence.
  • FIG. 7 is an explanatory diagram showing an example of converting the image feature quantity sequence F.
  • The learning device 100 sets, in the array Key 621, the initial value of the searched key sequence obtained by multiplying the N-dimensional identity matrix I by a transformation matrix W_k, and sets, in the array Value 622, the cross-modal feature matrix obtained by multiplying the N-dimensional identity matrix I by a transformation matrix W_v, as sketched below.
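  • A sketch of this initialization (N and the feature dimension are hypothetical):

```python
import numpy as np

N, D = 16, 512                      # table size and feature dimension (assumed)
I_N = np.eye(N)                     # N-dimensional identity matrix I
W_k = np.random.randn(N, D) * 0.02  # transformation matrix W_k
W_v = np.random.randn(N, D) * 0.02  # transformation matrix W_v
Key = I_N @ W_k                     # initial searched key sequence (array Key 621)
Value = I_N @ W_v                   # initial cross-modal feature matrix (array Value 622)
```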
  • The learning device 100 calculates the inner product of the array Query 611 and the array Key 621, and calculates the softmax of the calculated inner product.
  • The learning device 100 calculates the inner product of the calculated softmax and the array Value 622 as correction information.
  • the learning device 100 converts the image feature quantity sequence F by adding the correction information to the image feature quantity sequence F. Next, the description shifts to FIG. 8, and a specific example of the calculation for converting the image feature quantity sequence F will be described.
  • FIG. 8 is an explanatory diagram showing a specific example of the calculation for converting the image feature quantity sequence F.
  • (8-1) When converting the image feature amount sequence F, the learning device 100 multiplies each of the image feature amounts f_1, f_2, ..., f_i, ..., f_I by each of the transformation vectors W_q1, ..., W_qh, ..., W_qH of the transformation matrix W_q.
  • For example, by multiplying the image feature amount f_i by the transformation vectors W_q1, ..., W_qh, ..., W_qH, the learning device 100 acquires H partial feature queries q_{i,1}, ..., q_{i,h}, ..., q_{i,H}.
  • Although the case of acquiring the partial feature queries q_{i,1}, ..., q_{i,h}, ..., q_{i,H} based on the image feature amount f_i has been described here, the learning device 100 performs similar calculations for the image feature amounts other than f_i.
  • The subsequent calculation is described taking the partial feature query q_{i,h} as an example, but the same calculation is performed for the partial feature queries other than q_{i,h}.
  • The learning device 100 calculates the degree of coincidence between the partial feature query q_{i,h} and the N partial feature keys k_{1,h}, ..., k_{n,h}, ..., k_{N,h} set in the array Key 621, and calculates the softmax of the degrees of coincidence.
  • For example, the learning device 100 calculates the inner products of the partial feature query q_{i,h} and the N partial feature keys k_{1,h}, ..., k_{n,h}, ..., k_{N,h}, and calculates the softmax of the inner products. Specifically, the learning device 100 calculates the softmax of the inner products by the following formula (1).
  • Based on the softmax, the learning device 100 calculates the weighted average df_i of the N cross-modal feature amounts v_{1,h}, ..., v_{n,h}, ..., v_{N,h} associated with the N partial feature keys k_{1,h}, ..., k_{n,h}, ..., k_{N,h}.
  • The learning device 100 calculates the I weighted averages df_1, ..., df_i, ..., df_I corresponding to the I image feature amounts f_1, ..., f_i, ..., f_I as new cross-modal feature amounts.
  • The learning device 100 converts the I image feature amounts f_1, ..., f_i, ..., f_I by adding the I weighted averages df_1, ..., df_i, ..., df_I to them, respectively.
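  • Formula (1) itself is not reproduced in this text; from the description above it is presumably the standard softmax over the inner products:

$$\mathrm{softmax}(q_{i,h} \cdot k_{n,h}) = \frac{\exp(q_{i,h} \cdot k_{n,h})}{\sum_{m=1}^{N} \exp(q_{i,h} \cdot k_{m,h})} \qquad (1)$$

  • The following sketch condenses the calculation above (NumPy; the concatenation of the H heads back to the feature dimension is an assumption, as is every shape):

```python
import numpy as np

def convert_image_features(F, W_q, Key, Value):
    """Convert an (I, D) image feature sequence F; a sketch, not the patented code.

    W_q:   (H, D, Dh) transformation vectors W_q1, ..., W_qH (Dh = D // H assumed)
    Key:   (N, H, Dh) partial feature keys k_{n,h}   (array Key 621)
    Value: (N, H, Dh) cross-modal features v_{n,h}   (array Value 622)
    """
    num, D = F.shape
    H = W_q.shape[0]
    F_new = np.empty_like(F)
    for i in range(num):
        parts = []
        for h in range(H):
            q_ih = F[i] @ W_q[h]              # (8-1) partial feature query q_{i,h}
            scores = Key[:, h, :] @ q_ih      # (8-2) inner products with the N keys
            w = np.exp(scores - scores.max())
            w /= w.sum()                      # softmax of the inner products, formula (1)
            parts.append(w @ Value[:, h, :])  # (8-3) weighted average over v_{n,h}
        df_i = np.concatenate(parts)          # assumed: heads concatenated back to D
        F_new[i] = F[i] + df_i                # add the correction df_i to f_i
    return F_new
```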
  • the learning device 100 can reflect the language information in the language information data table 610.
  • For example, the learning device 100 can reflect the language information in the transformation matrix W_q, the transformation matrix W_k, and the transformation matrix W_v.
  • As a result, the learning device 100 can prepare an expression space that takes linguistic information into consideration when handling an image, can generate a useful image feature amount sequence, and can learn a useful integrated model 600.
  • the learning device 100 can efficiently perform fine tuning of the integrated model 600 after learning. After that, the learning device 100 can improve the accuracy of the solution obtained by solving the problem by using the integrated model 600 in solving the problem.
  • For example, the learning device 100 can appropriately complement the image feature amounts with the cross-modal feature amounts obtained from the linguistic information, and can generate a useful image feature amount sequence. Therefore, the learning device 100 can separate the image Transformer 602 and efficiently fine-tune the image Transformer 602 alone. After that, the learning device 100 can improve the accuracy of the solution obtained by solving the problem even when using the image Transformer 602 alone.
  • Here, an object captured in an image has a different meaning depending on its relationship with surrounding objects, but unlike a language feature amount, an image feature amount representing an object captured in an image is not quantized and is expressed as continuous values, so it tends to be difficult to give it an appropriate meaning. Specifically, it is desirable to give different meanings to image feature amounts showing various chairs in consideration of their arrangement in a room, but it is difficult to give them appropriate meanings. Likewise, it is desirable to give different meanings to image feature amounts showing various red lights in consideration of their positional relationship with a stopping vehicle, but it is difficult to give them appropriate meanings.
  • In contrast, the learning device 100 can easily give an appropriate meaning to an image feature amount by means of the cross-modal feature amounts obtained from the linguistic information, and can easily obtain a useful integrated model 600.
  • FIG. 9 is an explanatory diagram showing another example of converting the image feature quantity sequence.
  • the neural network 900 is a two-layer fully connected network, which has an input layer 901, an intermediate layer 902, and an output layer 903.
  • the dimension of the input layer 901 and the dimension of the output layer 903 are formed to be the same.
  • the dimension of the intermediate layer 902 is formed to be larger than the dimension of the input layer 901.
  • The parameters of the connection between the input layer 901 and the intermediate layer 902 form the transformation matrix W_q→k.
  • The parameters of the connection between the intermediate layer 902 and the output layer 903 form the transformation matrix W_k→v.
  • The learning device 100 generates N degrees of correspondence from the i-th image feature amount f_i by the transformation matrix W_q→k, and generates the cross-modal feature amount by the transformation matrix W_k→v based on the generated degrees of correspondence.
  • The learning device 100 can realize the same function as the data table 510 by the neural network 900 by updating the transformation matrix W_q→k and the transformation matrix W_k→v through pre-learning, as sketched below.
  • As a result, the learning device 100 can reflect the linguistic information in the neural network 900.
  • For example, the learning device 100 can reflect the language information in the transformation matrix W_q→k and the transformation matrix W_k→v.
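  • A sketch of this two-layer fully connected alternative (the dimensions and the ReLU activation are assumptions; the patent does not specify the activation function):

```python
import numpy as np

D, N = 512, 2048                     # input/output dimension and larger intermediate dimension
W_qk = np.random.randn(D, N) * 0.02  # connection: input layer 901 -> intermediate layer 902
W_kv = np.random.randn(N, D) * 0.02  # connection: intermediate layer 902 -> output layer 903

def convert_feature_mlp(f):
    """Play the role of the data table 510 for one image feature f (D,)."""
    degrees = np.maximum(0.0, f @ W_qk)  # N degrees of correspondence (ReLU assumed)
    correction = degrees @ W_kv          # cross-modal feature from the degrees
    return f + correction                # converted image feature
```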
  • As a result, the learning device 100 can prepare an expression space that takes linguistic information into consideration when handling an image, can generate a useful image feature amount sequence, and can learn a useful integrated model.
  • the learning device 100 can efficiently perform fine tuning of the integrated model after learning. After that, the learning device 100 can improve the accuracy of the solution obtained by solving the problem by using the integrated model in solving the problem.
  • FIG. 10 is an explanatory diagram showing an example in which the learning device 100 uses the integrated model 1000.
  • the learning device 100 uses, for example, the integrated model 1000 to solve the problem of video monitoring in view of the surrounding situation.
  • The integrated model 1000 includes an image Transformer 1010, a language Transformer 1020, and a cross-modal Transformer 1030.
  • the integrated model 1000 includes a linguistic information data table 610, a neural network 900, and the like.
  • the learning device 100 solves the problem of detecting whether or not a person who has not yet evacuated remains in the room when a disaster occurs. At this time, the learning device 100 acquires the image information 1001 of a surveillance camera that captures the appearance of the fitting room. The learning device 100 acquires an image feature amount sequence from the image information 1001, converts it using the language information data table 610 or the neural network 900, and then inputs it to the image Transformer 1010.
  • the learning device 100 acquires linguistic information 1002 stating that "shoes are taken off when the fitting room is used" and "the curtain is closed when the fitting room is used".
  • the learning device 100 acquires a language feature sequence from the language information 1002 and inputs it to the language Transformer 1020. Then, the learning device 100 inputs the output value of the image Transformer 1010 and the output value of the language Transformer 1020 into the cross-modal Transformer 1030.
  • the learning device 100 determines whether or not a person who has not yet evacuated remains in the fitting room based on the output value of the cross-modal Transformer 1030. As a result, the learning device 100 can solve the problem by appropriately matching the situation presented by the linguistic information with the situation presented by the image information, and can improve the accuracy of the obtained solution. For example, even when the curtain of the fitting room is closed and a person is not directly photographed, so that the image information alone would suggest that no person is present, the learning device 100 can correctly determine that a person is in the fitting room by appropriately matching the situation presented by the linguistic information, "a person is using the fitting room", with the arrangement of the objects in the image.
  • the learning process is realized, for example, by the CPU 301 shown in FIG. 3, a storage area such as a memory 302 or a recording medium 305, and a network I / F 303.
  • FIG. 11 is a flowchart showing an example of the learning processing procedure.
  • the learning device 100 initializes the searched key values and the feature correction values in the table with random values (step S1101).
  • the learning device 100 generates a set of language feature values and a set of image feature values from a mutually related document and image (step S1102). Then, the learning device 100 masks image feature values by replacing a constant ratio of the image feature values in the set with irrelevant values (step S1103).
  • the learning device 100 calculates the degree of relationship between each image feature value in the set of image feature values and the searched key values in the table (step S1104). Then, the learning device 100 corrects each image feature value based on the feature correction values in the table according to the degree of relationship (step S1105).
  • the learning device 100 restores the masked image feature values and acquires predicted values, using the language processing model, the image processing model, and the cross-modal processing model, based on the set of language feature values and the set of corrected image feature values (step S1106).
  • the learning device 100 updates the parameter values, including the searched key values and the feature correction values in the table, in the direction that reduces the loss value of the predicted values (step S1107).
  • the learning device 100 determines whether or not the end condition is satisfied (step S1108).
  • the end condition is, for example, that the loop of steps S1102 to S1108 is repeated a certain number of times or more.
  • the end condition is, for example, that the difference between the previous loss value and the current loss value is less than a certain value.
  • if the end condition is not satisfied (step S1108: No), the learning device 100 returns to the process of step S1102. On the other hand, when the end condition is satisfied (step S1108: Yes), the learning device 100 ends the learning process.
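  • a hedged Python sketch of this learning loop (steps S1101 to S1108), assuming PyTorch; the linear placeholder models, the concatenation used to combine the two output values, the mean-squared-error loss, and all sizes are assumptions standing in for the Transformer-based processing models described above:

```python
import torch
import torch.nn as nn

D, T, MASK_RATIO, STEPS = 64, 128, 0.15, 100  # assumed sizes, ratio, loop count

# Placeholder processing models; the embodiment uses Transformer-based models.
image_model = nn.Linear(D, D)           # image processing model
language_model = nn.Linear(D, D)        # language processing model
crossmodal_model = nn.Linear(2 * D, D)  # cross-modal processing model

# S1101: initialize the searched key values and feature correction values
# in the table with random values.
keys = nn.Parameter(torch.randn(T, D))
values = nn.Parameter(torch.randn(T, D))
params = [keys, values, *image_model.parameters(),
          *language_model.parameters(), *crossmodal_model.parameters()]
optimizer = torch.optim.Adam(params, lr=1e-4)

for step in range(STEPS):  # S1108: end after a fixed number of iterations
    # S1102: sets of language/image feature values from a mutually related
    # document and image (random stand-ins here).
    lang_feats = torch.randn(16, D)
    img_feats = torch.randn(16, D)

    # S1103: mask image feature values by replacing a constant ratio of them
    # with irrelevant values.
    mask = torch.rand(img_feats.size(0)) < MASK_RATIO
    if not mask.any():
        continue
    masked = img_feats.clone()
    masked[mask] = torch.randn(int(mask.sum()), D)

    # S1104-S1105: degree of relationship between each image feature value
    # and the searched key values, then correction with the correction values.
    relation = torch.softmax(masked @ keys.T, dim=-1)
    corrected = masked + relation @ values

    # S1106: restore the masked image feature values and acquire predictions.
    fused = torch.cat([image_model(corrected), language_model(lang_feats)], dim=-1)
    predicted = crossmodal_model(fused)

    # S1107: update the parameter values, including the keys and the
    # correction values, in the direction that reduces the loss.
    loss = nn.functional.mse_loss(predicted[mask], img_feats[mask])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```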
  • the information processing apparatus can obtain a useful model including parameter values that reflect the characteristics of the linguistic information.
  • the feature amount can be extracted from the first modal information.
  • a new feature amount can be acquired by converting the extracted feature amount based on the parameter.
  • the first output value can be acquired by inputting the acquired new feature quantity into the first processing model related to the first modal.
  • a second output value can be acquired by inputting other feature amounts, extracted from information of a second modal different from the first modal, into the second processing model relating to the second modal.
  • a third output value can be acquired by inputting the acquired first output value and second output value into the third processing model relating to the first modal and the second modal.
  • the parameters can be updated based on the acquired third output value. This allows the information processing device to obtain a useful model that includes parameters that reflect the characteristics of the second modal information.
  • the first feature amount and the second feature amount can be adopted as parameters.
  • the degree of coincidence between the extracted feature amount and the first feature amount is calculated, the second feature amount is weighted based on the calculated degree of coincidence, and a new feature amount can be obtained by converting the extracted feature amount based on the index value thus obtained.
  • the information processing apparatus can reflect the features of the information of the second modal in the first feature amount and the second feature amount, and can realize the conversion of the feature amount extracted from the information of the first modal.
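  • expressed as formulas (a hedged reading of the two items above; the dot-product form of the degree of coincidence and the softmax normalization are assumptions), with the first feature amounts $k_j$ acting as keys and the second feature amounts $v_j$ as values:

$$a_j = \mathrm{softmax}_j\left(f \cdot k_j\right), \qquad f' = f + \sum_j a_j\, v_j$$

  here, $f$ is the extracted feature amount, $a_j$ is the degree of coincidence with the $j$-th first feature amount, $\sum_j a_j v_j$ is the index value, and $f'$ is the new feature amount.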
  • it is possible to adopt, as the parameters, the parameters of a neural network in which the number of nodes in the intermediate layer is larger than the number of nodes in the input layer and the number of nodes in the output layer.
  • a new feature amount can be acquired by inputting the extracted feature amount into the neural network.
  • the information processing apparatus can reflect the characteristics of the information of the second modal in the neural network and realize the conversion of the feature amount extracted from the information of the first modal.
  • the first processing model can be updated based on the third output value, and a predetermined problem can be solved according to the input of information of the first modal by using the updated parameters and the updated first processing model.
  • the information processing apparatus can obtain a useful first processing model, and can improve the accuracy of the solution obtained by solving a predetermined problem using the first processing model.
  • the first processing model, the second processing model, and the third processing model can be updated based on the third output value.
  • a predetermined problem can be solved by using the updated parameters, the updated first processing model, the updated second processing model, and the updated third processing model.
  • the information processing apparatus can obtain a useful first processing model, second processing model, and third processing model, and can improve the accuracy of the solution obtained by solving a predetermined problem.
  • a modal related to an image can be adopted as the first modal.
  • a modal related to language can be adopted as the second modal.
  • the information processing apparatus can obtain a model useful for solving the problem based on the image information and the linguistic information.
  • a modal related to an image can be adopted as the first modal.
  • a modal related to voice can be adopted as the second modal.
  • the information processing apparatus can obtain a model useful for solving the problem based on the image information and the audio information.
  • a modal related to the first language can be adopted as the first modal.
  • a modal related to the second language can be adopted as the second modal. This allows the information processing device to obtain a useful model for solving a problem based on linguistic information in different languages.
  • the parameters can be updated by the error back propagation method based on the third output value.
  • the information processing apparatus can update the parameters with high accuracy.
  • the learning method described in this embodiment can be realized by executing a program prepared in advance on a computer such as a PC or a workstation.
  • the learning program described in this embodiment is executed by being recorded on a computer-readable recording medium and being read from the recording medium by the computer.
  • the recording medium is a hard disk, a flexible disk, a CD-ROM (Compact Disc-Read Only Memory), an MO, a DVD (Digital Versatile Disc), or the like.
  • the learning program described in this embodiment may be distributed via a network such as the Internet.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

This learning device (100) extracts a feature quantity from first modal information using an extraction model (111). The learning device (100) converts the extracted feature quantity using a conversion model (112) on the basis of a plurality of parameters and thereby acquires a new feature quantity. The learning device (100) inputs the acquired new feature quantity to a first processing model (121) and thereby acquires a first output value. The learning device (100) inputs other feature quantities extracted from second modal information to a second processing model (122) and thereby acquires a second output value. The learning device (100) inputs the acquired first and second output values to a third processing model (123) and thereby acquires a third output value. The learning device (100) updates the plurality of parameters on the basis of the acquired third output value.

Description

Learning method, learning program, and learning device

 The present invention relates to a learning method, a learning program, and a learning device.

 Conventionally, there are technologies for solving problems such as document translation, question answering, object detection, and situation judgment using information of a predetermined modal. For example, there is a model called BERT (Bidirectional Encoder Representations from Transformers) for solving problems using modal information about documents. There is also a model that extends BERT to solve problems using modal information about images as well. Here, a modal is a concept indicating the style and type of information; specific examples include images, documents (text), and sounds. Machine learning using a plurality of modals is called multimodal learning. Among multimodal learning, learning that captures the co-occurrence relationships between a plurality of modals is sometimes called cross-modal learning.

 As prior art, there is, for example, a technique that takes gesture features as input and generates a model for classifying whether or not an input gesture feature corresponds to a word. There is also, for example, a technique for determining the similarity between a specified word and image using a word co-occurrence vector whose elements are the appearance frequencies of co-occurring words appearing near the word and an image co-occurrence vector whose elements are the appearance frequencies of co-occurring words appearing near the image. Further, there is, for example, a technique of recognizing sign language elements, which are the individual components of hand movements, from time-series data of the hand movements, and performing sign language word recognition from the recognized sign language elements.

Japanese Unexamined Patent Publication No. 2018-163400
Japanese Unexamined Patent Publication No. 2002-132823
Japanese Unexamined Patent Publication No. H9-34863

 However, with the conventional techniques, it is difficult to obtain a useful model for extracting a feature amount from modal information when solving a problem using the modal information. For example, when handling modal information about images, a model may not have an expression space that expresses features based on the relationship between the modal information about images and modal information about language, and may therefore not be useful in solving problems.

 In one aspect, an object of the present invention is to learn a useful model for extracting a feature amount from modal information.

 According to one embodiment, a learning method, a learning program, and a learning device are proposed that extract a feature amount from information of a first modal; acquire a new feature amount by converting the extracted feature amount based on parameters; acquire a first output value by inputting the acquired new feature amount into a first processing model relating to the first modal; acquire a second output value by inputting another feature amount, extracted from information of a second modal different from the first modal, into a second processing model relating to the second modal; acquire a third output value by inputting the acquired first output value and second output value into a third processing model relating to the first modal and the second modal; and update the parameters based on the acquired third output value.

 According to one aspect, it becomes possible to learn a useful model for extracting a feature amount from modal information.

FIG. 1 is an explanatory diagram showing an embodiment of a learning method according to an embodiment.
FIG. 2 is an explanatory diagram showing an example of the information processing system 200.
FIG. 3 is a block diagram showing a hardware configuration example of the learning device 100.
FIG. 4 is a block diagram showing a functional configuration example of the learning device 100.
FIG. 5 is a block diagram showing a specific functional configuration example of the learning device 100.
FIG. 6 is an explanatory diagram showing an example of learning the integrated model 600.
FIG. 7 is an explanatory diagram showing an example of converting the image feature quantity sequence F.
FIG. 8 is an explanatory diagram showing a specific example of the calculation for converting the image feature quantity sequence F.
FIG. 9 is an explanatory diagram showing another example of converting the image feature quantity sequence.
FIG. 10 is an explanatory diagram showing an example in which the learning device 100 uses the integrated model 1000.
FIG. 11 is a flowchart showing an example of the learning processing procedure.

 Hereinafter, embodiments of the learning method, the learning program, and the learning device according to the present invention will be described in detail with reference to the drawings.

(An example of a learning method according to an embodiment)
 FIG. 1 is an explanatory diagram showing an embodiment of a learning method according to an embodiment. The learning device 100 is a computer for learning a useful model that extracts a feature amount from information of a predetermined modal and that can be used when solving a problem using the information of the predetermined modal.

 Conventionally, there is, for example, a pre-trained model (pre-train model) called BERT for solving problems. Specifically, BERT is formed by stacking the Encoder portions of the Transformer. For BERT, for example, the following Non-Patent Document 1 can be referred to.

 Non-Patent Document 1: Devlin, Jacob et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL-HLT (2019).

 Here, BERT is assumed to be applied to situations where a problem is solved using modal information about language, and cannot be applied to situations where a problem is solved using information of a plurality of modals.

 On the other hand, there is, for example, an extended model called VideoBERT, which extends BERT so that it can be applied to situations where problems are solved using modal information about images in addition to modal information about language. There is also an extended model called CBT (Contrastive Bidirectional Transformer for Temporal Representation Learning), which improves on the performance of VideoBERT.

 Specifically, CBT is formed of a language processing model that learns the co-occurrence relationships of language features, an image processing model that learns the co-occurrence relationships of image features, and a cross-modal processing model that integrates the outputs of the language processing model and the image processing model and learns the co-occurrence relationship between language and images. For CBT, for example, the following Non-Patent Document 2 can be referred to.

 Non-Patent Document 2: Sun, Chen, et al. "Contrastive Bidirectional Transformer for Temporal Representation Learning." arXiv preprint arXiv:1906.05743 (2019).

 However, the various models described above may not be useful models for solving problems. For example, they may not be useful models for extracting a feature amount from modal information when solving a problem using the modal information. For example, when handling modal information about images, CBT does not have an expression space that expresses features based on the relationship between the modal information about images and modal information about language, and may not be a useful model for solving problems.

 Specifically, an image feature amount is an expression space that reflects modal information about an image and, by its nature, does not have an expression space that reflects modal information about language. Therefore, even if CBT is pre-trained, the image processing model included in CBT does not become a model that can effectively utilize modal information about language, and does not become a useful model for solving problems. Furthermore, an image feature amount has the property that, even when the same object is captured, images captured in different ways can yield different feature amounts. For this reason, when reflecting the features of modal information about language in image feature amounts, not one image feature amount but the representations of various image feature amounts would have to be updated, which makes effective updating difficult or may adversely affect solving the problem.

 Therefore, in the present embodiment, a learning method will be described that can learn a useful model for extracting a feature amount from modal information by setting parameters to be used when converting the modal information and providing an expression space based on the parameters.

 In FIG. 1, the learning device 100 has, for example, a model 101. The model 101 has an extraction model 111, a conversion model 112, a first processing model 121, a second processing model 122, and a third processing model 123. The extraction model 111, the conversion model 112, and the first processing model 121 relate to the first modal. The second processing model 122 relates to the second modal. The third processing model 123 relates to the first modal and the second modal.

 The learning device 100 acquires information of the first modal and information of the second modal. A modal means a form of information. The first modal and the second modal are different modals. The first modal is, for example, a modal related to images. When the first modal relates to images, the information of the first modal is, for example, an image. The second modal is, for example, a modal related to language. When the second modal relates to language, the information of the second modal is, for example, a document.

 (1-1) The learning device 100 uses the extraction model 111 to extract a feature amount from the information of the first modal. The learning device 100 extracts, for example, an image feature amount from an image. The image feature amount is represented by, for example, a vector indicating an array.

 (1-2) The learning device 100 acquires a new feature amount by using the conversion model 112 to convert the extracted feature amount based on the parameters. There are, for example, a plurality of parameters. The learning device 100, for example, calculates a correction amount for correcting the extracted image feature amount based on the extracted image feature amount and the plurality of parameters, and acquires a new image feature amount by adding the correction amount to the extracted image feature amount.

 (1-3) The learning device 100 acquires a first output value by inputting the acquired new feature amount into the first processing model 121. The first processing model 121 is, for example, an image processing model. The learning device 100 acquires the first output value by, for example, inputting the new image feature amount into the image processing model.

 (1-4) The learning device 100 acquires a second output value by inputting another feature amount, extracted from the information of the second modal, into the second processing model 122. The second processing model 122 is, for example, a language processing model. The learning device 100 acquires the second output value by, for example, inputting a language feature amount extracted from a document into the language processing model. The language feature amount is represented by, for example, a vector indicating an array.

 (1-5) The learning device 100 acquires a third output value by inputting the acquired first output value and second output value into the third processing model 123. The third processing model 123 is, for example, a cross-modal processing model. The cross-modal processing model integrates the information of a plurality of modals and learns the co-occurrence of the information of the plurality of modals. The learning device 100 acquires the third output value by, for example, inputting the first output value and the second output value into the cross-modal processing model.

 (1-6) The learning device 100 updates the plurality of parameters based on the acquired third output value. The learning device 100 updates the plurality of parameters by, for example, the error backpropagation method based on the third output value. The learning device 100 may output the updated plurality of parameters. As a result, the learning device 100 can obtain a plurality of parameters useful from the viewpoint of handling the information of the first modal and the information of the second modal in solving a problem.

 In the learning device 100, for example, a plurality of parameters capable of reflecting the features of the information of the second modal are explicitly prepared, so that the features of the information of the second modal can be effectively utilized when acquiring a new feature amount from the information of the first modal. In addition, the learning device 100 does not directly reflect the features of the information of the second modal in the feature amount itself extracted from the information of the first modal, and can therefore reduce adverse effects when solving a problem.

 Here, the case where the learning device 100 updates the plurality of parameters based on the third output value has been described, but the present invention is not limited to this. For example, the learning device 100 may further update the first processing model 121 based on the third output value. As a result, the learning device 100 can obtain a first processing model 121 that is useful from the viewpoint of handling the information of the first modal in solving a problem. Then, the learning device 100 can use the model 101 when solving a problem using the information of the first modal and the information of the second modal, and can improve the accuracy of the obtained solution.

 Here, the learning device 100 may separate the extraction model 111, the conversion model 112, and the first processing model 121 from the model 101. In this way, the learning device 100 can obtain a useful combined model of the extraction model 111, the conversion model 112, and the first processing model 121. Then, the learning device 100 may use the separated extraction model 111, conversion model 112, and first processing model 121 when solving a problem using the information of the first modal, to improve the accuracy of the obtained solution.

 Further, for example, the learning device 100 may further update the second processing model 122 and the third processing model 123. As a result, the learning device 100 can obtain a second processing model 122 that is useful from the viewpoint of handling the information of the second modal in solving a problem, and a third processing model 123 that is useful from the viewpoint of integrating the information of the first modal and the information of the second modal. In this way, the learning device 100 can obtain a useful model 101 that combines the extraction model 111, the conversion model 112, the first processing model 121, the second processing model 122, and the third processing model 123. Then, the learning device 100 can use the model 101 when solving a problem using the information of the first modal and the information of the second modal, and can improve the accuracy of the obtained solution.

 Further, for example, the learning device 100 may separate the conversion model 112 from the model 101. In this way, the learning device 100 can obtain a useful conversion model 112. Then, the learning device 100 may use the separated conversion model 112 when solving a problem using the information of the first modal, to improve the accuracy of the obtained solution.

 Further, for example, the learning device 100 may separate the extraction model 111, the conversion model 112, the first processing model 121, and the second processing model 122 from the model 101. In this way, the learning device 100 can obtain a useful combined model of these four models. Then, the learning device 100 may use the separated models when solving a problem using the information of the first modal and the information of the second modal, to improve the accuracy of the obtained solution.

 As described above, the learning device 100 can obtain a useful model. The useful model is, for example, any one of the updated extraction model 111, conversion model 112, first processing model 121, second processing model 122, and third processing model 123, or a combination of two or more of them.
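 As a structural sketch only (assuming PyTorch-style modules; the composition and interfaces below are assumptions, not the embodiment itself), the separability described above can be preserved by holding each model as an independent submodule of model 101:

```python
import torch.nn as nn

class Model101(nn.Module):
    """A hypothetical composition of model 101.

    The submodules are kept separate so that, after learning, any subset
    (for example, extraction model 111 + conversion model 112 + first
    processing model 121) can be detached and used or fine-tuned alone.
    """
    def __init__(self, extract, convert, first, second, third):
        super().__init__()
        self.extract = extract  # extraction model 111 (first modal)
        self.convert = convert  # conversion model 112 (holds the parameters)
        self.first = first      # first processing model 121
        self.second = second    # second processing model 122
        self.third = third      # third processing model 123

    def forward(self, modal1_info, modal2_features):
        f = self.extract(modal1_info)        # (1-1) extract feature amount
        f_new = self.convert(f)              # (1-2) convert based on parameters
        out1 = self.first(f_new)             # (1-3) first output value
        out2 = self.second(modal2_features)  # (1-4) second output value
        return self.third(out1, out2)        # (1-5) third output value
```

 Because each submodule is independent, detaching, for example, extract, convert, and first after learning yields the combined model discussed above.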

(Example of information processing system 200)
 Next, an example of the information processing system 200 to which the learning device 100 shown in FIG. 1 is applied will be described with reference to FIG. 2.

 FIG. 2 is an explanatory diagram showing an example of the information processing system 200. In FIG. 2, the information processing system 200 includes the learning device 100, a client device 201, and a terminal device 202.

 In the information processing system 200, the learning device 100 and the client device 201 are connected via a wired or wireless network 210. The network 210 is, for example, a LAN (Local Area Network), a WAN (Wide Area Network), the Internet, or the like. Similarly, the learning device 100 and the terminal device 202 are connected via the wired or wireless network 210.

 The learning device 100 stores an integrated model that accepts input of the information of the first modal and the information of the second modal. The stored integrated model corresponds to, for example, the model 101 shown in FIG. 1. The learning device 100 updates the integrated model based on teacher data.

 The teacher data is, for example, correspondence information in which sample information of the first modal, sample information of the second modal, and correct answer data are associated with each other. The teacher data is input to the learning device 100 by, for example, the user of the learning device 100. The correct answer data indicates, for example, the correct answer for the output value of the integrated model. The correct answer data may also indicate the correct answer for the solution obtained by solving a problem based on the output value of the integrated model. When the first modal relates to images, the information of the first modal is an image. When the second modal relates to language, the information of the second modal is a document. The integrated model is updated by, for example, the error backpropagation method, or by a learning method other than error backpropagation.
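 A minimal illustration of one such teacher-data record (a sketch; the field names, the file name, and the label encoding are hypothetical):

```python
# One teacher-data record: sample first-modal information (an image),
# sample second-modal information (a document), and correct answer data.
teacher_record = {
    "first_modal_image": "fitting_room_0001.png",  # hypothetical file name
    "second_modal_document": "The curtain of the fitting room tends to be "
                             "closed while a person is using it.",
    "correct_answer": 1,  # hypothetical encoding of the correct output
}
```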

 Further, the learning device 100 acquires the information of the first modal and the information of the second modal when solving a problem. The learning device 100 acquires, for example, the information of the first modal input to the learning device 100 by the user of the learning device 100, or may acquire it by receiving it from the client device 201 or the terminal device 202. Likewise, the learning device 100 acquires the information of the second modal input by the user, or may acquire it by receiving it from the client device 201 or the terminal device 202.

 Then, the learning device 100 uses the updated integrated model to solve the problem based on the acquired information of the first modal and information of the second modal, and transmits the obtained solution to the client device 201. The learning device 100 may further fine-tune the updated integrated model before using it to solve the problem. The learning device 100 is, for example, a server, a PC (Personal Computer), or the like.

 The client device 201 is a computer capable of communicating with the learning device 100. The client device 201 may, for example, transmit the information of the first modal or the information of the second modal to the learning device 100. The client device 201 receives and outputs the solution obtained by the learning device 100 solving the problem. The output format is, for example, display on a display, print output to a printer, transmission to another computer, or storage in a storage area. The client device 201 is, for example, a PC, a tablet terminal, a smartphone, or the like.

 The terminal device 202 is a computer capable of communicating with the learning device 100. The terminal device 202 may, for example, transmit the information of the first modal or the information of the second modal to the learning device 100. The terminal device 202 is, for example, a PC, a tablet terminal, a smartphone, an electronic device, an IoT (Internet of Things) device, a sensor device, or the like. Specifically, the terminal device 202 may be a surveillance camera.

 Here, the case where the learning device 100 updates the integrated model and solves the problem using the integrated model has been described, but the present invention is not limited to this. For example, another computer may update the integrated model, and the learning device 100 may solve the problem using the integrated model received from the other computer. Further, for example, the learning device 100 may update the integrated model and provide it to another computer, and the other computer may solve the problem using the integrated model.

 Here, the case where the learning device 100 is a device different from the client device 201 and the terminal device 202 has been described, but the present invention is not limited to this. For example, the learning device 100 may be integrated with the client device 201, or may be integrated with the terminal device 202.

 Here, the case where the learning device 100 realizes the integrated model in software has been described, but the present invention is not limited to this. For example, the learning device 100 may realize the integrated model as an electronic circuit.

(Application example of information processing system 200)
 For example, the terminal device 202 is a surveillance camera and transmits an image of a target to the learning device 100. The target is, specifically, the appearance of a fitting room. The learning device 100 also stores a document that serves as an explanatory text about the target. Specifically, the document describes that the curtain of the fitting room tends to be closed while a person is using the fitting room, and that shoes tend to be placed in front of the fitting room while a person is using it. Then, the learning device 100 uses the model to solve the problem of determining a degree of risk based on the image and the document. The degree of risk is, for example, an index value indicating how likely it is that a person whose evacuation is incomplete remains in the fitting room. The degree of risk may also be, for example, a binary value indicating whether or not a person whose evacuation is incomplete remains in the fitting room.

(Example of hardware configuration of learning device 100)
 Next, a hardware configuration example of the learning device 100 will be described with reference to FIG. 3.

 FIG. 3 is a block diagram showing a hardware configuration example of the learning device 100. In FIG. 3, the learning device 100 includes a CPU (Central Processing Unit) 301, a memory 302, a network I/F (Interface) 303, a recording medium I/F 304, and a recording medium 305. The components are connected to one another by a bus 300.

 Here, the CPU 301 controls the entire learning device 100. The memory 302 includes, for example, a ROM (Read Only Memory), a RAM (Random Access Memory), a flash ROM, and the like. Specifically, for example, the flash ROM or the ROM stores various programs, and the RAM is used as a work area of the CPU 301. A program stored in the memory 302 is loaded into the CPU 301 to cause the CPU 301 to execute the coded processing.

 The network I/F 303 is connected to the network 210 through a communication line and is connected to other computers via the network 210. The network I/F 303 serves as an internal interface with the network 210 and controls the input and output of data to and from other computers. The network I/F 303 is, for example, a modem or a LAN adapter.

 The recording medium I/F 304 controls reading and writing of data to and from the recording medium 305 under the control of the CPU 301. The recording medium I/F 304 is, for example, a disk drive, an SSD (Solid State Drive), a USB (Universal Serial Bus) port, or the like. The recording medium 305 is a non-volatile memory that stores data written under the control of the recording medium I/F 304. The recording medium 305 is, for example, a disk, a semiconductor memory, a USB memory, or the like. The recording medium 305 may be detachable from the learning device 100.

 The learning device 100 may include, for example, a keyboard, a mouse, a display, a printer, a scanner, a microphone, a speaker, and the like, in addition to the components described above. The learning device 100 may also include a plurality of recording medium I/Fs 304 and recording media 305, or may include no recording medium I/F 304 or recording medium 305.

(Example of hardware configuration of client device 201)
 The hardware configuration example of the client device 201 is the same as the hardware configuration example of the learning device 100 shown in FIG. 3, and a description thereof will therefore be omitted.

(Example of hardware configuration of terminal device 202)
 The hardware configuration example of the terminal device 202 is the same as the hardware configuration example of the learning device 100 shown in FIG. 3, and a description thereof will therefore be omitted.

(Example of functional configuration of learning device 100)
 Next, an example of the functional configuration of the learning device 100 will be described with reference to FIG. 4.

 FIG. 4 is a block diagram showing a functional configuration example of the learning device 100. The learning device 100 includes a storage unit 400, an acquisition unit 401, a first extraction unit 402, a conversion unit 403, a first processing unit 404, a second extraction unit 405, a second processing unit 406, a third processing unit 407, an update unit 408, a utilization unit 409, and an output unit 410.

 The storage unit 400 is realized by, for example, a storage area such as the memory 302 or the recording medium 305 shown in FIG. 3. Hereinafter, the case where the storage unit 400 is included in the learning device 100 will be described, but the present invention is not limited to this. For example, the storage unit 400 may be included in a device different from the learning device 100, and the stored contents of the storage unit 400 may be referable from the learning device 100.

 The acquisition unit 401 to the output unit 410 function as an example of a control unit. Specifically, the acquisition unit 401 to the output unit 410 realize their functions by, for example, causing the CPU 301 to execute a program stored in a storage area such as the memory 302 or the recording medium 305 shown in FIG. 3, or by the network I/F 303. The processing result of each functional unit is stored in, for example, a storage area such as the memory 302 or the recording medium 305 shown in FIG. 3.

 The storage unit 400 stores various information referred to or updated in the processing of each functional unit. The storage unit 400 stores an integrated model that accepts input of the information of the first modal and the information of the second modal. The stored integrated model has, for example, a first extraction model, a conversion model, a first processing model, a second extraction model, a second processing model, and a third processing model. The first extraction model, the conversion model, and the first processing model relate to the first modal. The second extraction model and the second processing model relate to the second modal. The third processing model relates to the first modal and the second modal.

 The second modal is different from the first modal. For example, the first modal is a modal related to images, and the second modal is a modal related to language. Alternatively, the first modal is a modal related to images, and the second modal is a modal related to voice. Alternatively, the first modal is a modal related to a first language, and the second modal is a modal related to a second language.

 The acquisition unit 401 acquires various information used for the processing of each functional unit. The acquisition unit 401 stores the acquired information in the storage unit 400 or outputs it to each functional unit. The acquisition unit 401 may also output information stored in the storage unit 400 to each functional unit. The acquisition unit 401 acquires various information based on, for example, a user's operation input, or may receive various information from a device different from the learning device 100.

 The acquisition unit 401 acquires the information of the first modal and the information of the second modal. The acquisition unit 401 acquires them by, for example, accepting input of the information of the first modal and the information of the second modal by the user, or by receiving them from the client device 201 or the terminal device 202. The acquisition unit 401 may also acquire the information of the first modal and the information of the second modal by acquiring teacher data containing them.

 The acquisition unit 401 may accept a start trigger for starting the processing of any of the functional units. The start trigger is, for example, a predetermined operation input by the user, reception of predetermined information from another computer, or output of predetermined information by any of the functional units. The acquisition unit 401 accepts, for example, the acquisition of the information of the first modal and the information of the second modal as a start trigger for starting the processing of each functional unit.

 The first extraction unit 402 extracts a feature amount from the information of the first modal. The first extraction unit 402 extracts, for example, an image feature amount from an image. The extracted image feature amount is, for example, an image feature amount indicating an object appearing in the image. As a result, the first extraction unit 402 can change the information of the first modal into a format that can be input to the conversion model, and can extract information useful for solving a problem from the information of the first modal.

 The conversion unit 403 acquires a new feature amount by using the conversion model to convert the extracted feature amount based on the plurality of parameters. Here, the plurality of parameters include, for example, a first feature amount and a second feature amount. The conversion unit 403, for example, calculates the degree of coincidence between the extracted feature amount and the first feature amount, weights the second feature amount based on the calculated degree of coincidence, and acquires a new feature amount by converting the extracted feature amount based on the index value thus obtained.

 An example of acquiring a new feature amount when the plurality of parameters are the first feature amount and the second feature amount will be specifically described later with reference to FIGS. 5 to 8. In this way, the conversion unit 403 prepares a plurality of parameters capable of reflecting the features of the information of the second modal, and, via the plurality of parameters, makes it possible to reflect the features of the information of the second modal in the feature amount extracted from the information of the first modal.

The plurality of parameters may also be the parameters of a neural network whose intermediate layer has more nodes than its input layer and its output layer. The neural network corresponds to the conversion model. The conversion unit 403 acquires the new feature amount by, for example, inputting the extracted feature amount into the neural network.

A concrete example of acquiring a new feature amount when the plurality of parameters are neural network parameters is described later with reference to FIG. 9; the shape constraint is sketched below. Here too, the conversion unit 403 prepares a plurality of parameters capable of holding the characteristics of the second modal information and makes it possible to reflect those characteristics in the feature amount extracted from the first modal information.
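A minimal sketch of such a network, assuming hypothetical sizes; the text only constrains the layer widths, so the choice of activation is an assumption:

```python
import torch.nn as nn

d, m = 64, 256  # hypothetical: intermediate layer (m) wider than input/output (d)

conversion_net = nn.Sequential(
    nn.Linear(d, m),
    nn.ReLU(),        # activation is an assumption; the text fixes only layer sizes
    nn.Linear(m, d),
)
# new_feature = conversion_net(extracted_feature)  # maps (..., d) -> (..., d)
```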

The first processing unit 404 acquires a first output value by inputting the acquired new feature amount into the first processing model. The first processing model is, for example, an image processing model; in that case, the first processing unit 404 acquires the first output value by inputting the new image feature amount into the image processing model.

An example of acquiring the first output value is described later with reference to FIGS. 5 to 8. This allows the first processing unit 404 to convert the acquired new feature amount into a format that can be input to the third processing model, and to extract from it information that is useful for solving the problem.

The second extraction unit 405 extracts another feature amount from the second modal information. For example, the second extraction unit 405 extracts a language feature amount from a document, such as a feature amount representing the words contained in the document. This allows the second extraction unit 405 to convert the second modal information into a format that can be input to the second processing model, and to extract from it information that is useful for solving the problem.

The second processing unit 406 acquires a second output value by inputting the extracted other feature amount into the second processing model. The second processing model is, for example, a language processing model; in that case, the second processing unit 406 acquires the second output value by inputting the language feature amount into the language processing model.

An example of acquiring the second output value is described later with reference to FIGS. 5 to 8. This allows the second processing unit 406 to convert the extracted other feature amount into a format that can be input to the third processing model, and to extract from it information that is useful for solving the problem.

The third processing unit 407 acquires a third output value by inputting the acquired first and second output values into the third processing model. The third processing model is, for example, a cross-modal processing model; in that case, the third processing unit 407 acquires the third output value by inputting the first and second output values into the cross-modal processing model. An example of acquiring the third output value is described later with reference to FIGS. 5 to 8, and a sketch follows below. This allows the third processing unit 407 to obtain a third output value that integrates the features of the language and the image.
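A minimal sketch of feeding the two output sequences into a shared third model. Concatenation along the sequence axis is an assumption (the text only says the two outputs are input together), and nn.TransformerEncoder stands in for the cross-modal processing model:

```python
import torch
import torch.nn as nn

d = 64  # hypothetical shared hidden size
cross_modal_model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True),
    num_layers=2,
)

def third_output(first_out: torch.Tensor, second_out: torch.Tensor) -> torch.Tensor:
    # first_out: (B, I, d) image-side tokens; second_out: (B, L, d) language-side tokens
    joint = torch.cat([first_out, second_out], dim=1)  # (B, I + L, d)
    return cross_modal_model(joint)                    # third output value
```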

The update unit 408 updates the plurality of parameters based on the acquired third output value. For example, the update unit 408 updates them by error backpropagation: it calculates a loss value from the third output value using a loss function, and updates the plurality of parameters based on that loss value.

This allows the update unit 408 to obtain a plurality of parameters that are useful, in solving the problem, for handling the first modal information together with the second modal information. For example, the update unit 408 can optimize the explicitly prepared parameters so that the characteristics of the second modal information are effectively exploited when acquiring a new feature amount from the first modal information.

The update unit 408 also updates the first processing model based on the third output value, for example by error backpropagation: it calculates a loss value from the third output value using a loss function and updates the first processing model based on that loss value. This allows the update unit 408 to obtain a first processing model that is useful for handling the first modal information when solving the problem.

The update unit 408 likewise updates the second processing model and the third processing model based on the third output value, for example by error backpropagation: it calculates a loss value from the third output value using a loss function and updates both models based on that loss value.

This allows the update unit 408 to obtain a second processing model that is useful for handling the second modal information when solving the problem, and allows the learning device 100 to obtain a third processing model that is useful for integrating the first modal information with the second modal information. A schematic of the joint update follows below.
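Schematically, the joint update might look as follows; every component here is a hypothetical stand-in for the corresponding element described in this section, and the use of mean-squared error and Adam is an assumption:

```python
import torch
import torch.nn as nn

d = 64
table_keys = nn.Parameter(torch.randn(128, d))    # plural parameters (searched keys)
table_values = nn.Parameter(torch.randn(128, d))  # plural parameters (cross-modal values)
first_model = nn.Linear(d, d)                     # stand-in for the first processing model
second_model = nn.Linear(d, d)                    # stand-in for the second processing model
third_model = nn.Linear(2 * d, d)                 # stand-in for the third processing model

optimizer = torch.optim.Adam(
    [table_keys, table_values,
     *first_model.parameters(), *second_model.parameters(), *third_model.parameters()],
    lr=1e-4,
)

def update(third_output_value: torch.Tensor, target: torch.Tensor) -> None:
    loss = nn.functional.mse_loss(third_output_value, target)  # loss function (assumption)
    optimizer.zero_grad()
    loss.backward()   # error backpropagation through all components
    optimizer.step()  # update the parameters and the three processing models
```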

The utilization unit 409 solves a predetermined problem. For example, the utilization unit 409 solves the problem in response to the input of other first modal information, using the updated plurality of parameters and the not-yet-updated first processing model. In this way, the utilization unit 409 uses the updated parameters when solving a problem based on at least other first modal information, improving the accuracy of the obtained solution.

The utilization unit 409 may also solve the predetermined problem using the updated plurality of parameters and the updated first processing model, again in response to the input of other first modal information, which likewise improves the accuracy of the obtained solution.

The utilization unit 409 may further solve the predetermined problem based on other first modal information and other second modal information, using the updated plurality of parameters and the updated first, second, and third processing models. Using all of the updated components when solving the problem improves the accuracy of the obtained solution.

The output unit 410 outputs the processing result of any of the functional units. The output format is, for example, display on a display, print output to a printer, transmission to an external device via the network I/F 303, or storage in a storage area such as the memory 302 or the recording medium 305. This allows the output unit 410 to notify the user of the processing results of each functional unit and improves the convenience of the learning device 100.

The output unit 410 outputs, for example, the updated plurality of parameters. This makes the updated parameters, which are useful for handling the first and second modal information when solving the problem, available for reference and usable by other computers. As a result, the accuracy of solutions obtained by solving a problem with at least these parameters can be improved.

The output unit 410 may also output the first processing model. This makes the updated first processing model, which is useful for handling the first modal information when solving the problem, available for reference and usable by other computers, so that the accuracy of solutions obtained using at least the first processing model can be improved.

The output unit 410 may further output the second processing model and the third processing model, making the updated models available for reference and usable by other computers. As a result, the accuracy of solutions obtained by solving a problem based on the first and second modal information, using at least the second and third processing models, can be improved.

(Specific functional configuration example of the learning device 100)
 Next, a specific functional configuration example of the learning device 100 is described with reference to FIG. 5, for the case where the first modal relates to images and the second modal relates to language.

FIG. 5 is a block diagram showing a specific functional configuration example of the learning device 100. In FIG. 5, the learning device 100 has an integrated model 500. The integrated model 500 includes an image feature amount generation unit 501, a query generation unit 502, a table search unit 503, an image processing unit 504, a language processing unit 505, a cross-modal processing unit 506, and a loss function calculation unit 507.

The learning device 100 also has a data table 510. The data table 510 stores a searched-key sequence and a cross-modal feature amount sequence, and is a table in which language information is reflected through pre-training.

The data table 510 is, for example, a table that associates each searched key, which quantizes an image feature amount, one-to-one with a cross-modal feature amount in which language information is reflected. Based on a search query generated from an image feature amount, the data table 510 is used to extract the language information reflected in its cross-modal feature amounts, weighted according to the degree of relevance to that image feature amount.

The image feature amount generation unit 501 generates and outputs an image feature amount sequence from an input image. For example, it detects the objects appearing in the input image and generates an image feature amount sequence containing a feature amount for each detected object. The image feature amount sequence is represented, for example, by vectors, and each image feature amount represents visual feature information of an object, such as its color and shape.

The query generation unit 502 generates and outputs a search query sequence based on the input image feature amount sequence, for example by multiplying the image feature amount sequence by a transformation matrix W_q. The search query sequence is represented, for example, by vectors.

The table search unit 503 calculates and outputs an index value sequence by weighting the cross-modal feature amounts in the data table 510 based on the input search query sequence. For example, the table search unit 503 generates and outputs a new cross-modal feature amount sequence by computing a weighted average of the cross-modal feature amounts of the data table 510, with weights based on the inner products between the input search query sequence and the searched-key sequence of the data table 510.

The image processing unit 504 converts and outputs the image feature amount sequence based on the input new cross-modal feature amount sequence. For example, the image processing unit 504 adds the new cross-modal feature amount sequence to the image feature amount sequence, converts the result using the image processing model, and outputs the converted image feature amount sequence.

The language processing unit 505 generates and outputs a word embedding sequence based on the input word sequence, for example by using the language processing model. A minimal sketch of the embedding step follows below.
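A sketch of generating a word embedding sequence; the vocabulary size, embedding dimension, and token ids are hypothetical, and nn.Embedding stands in for the embedding step of the language processing model:

```python
import torch
import torch.nn as nn

vocab_size, d = 30000, 64        # hypothetical vocabulary and embedding sizes
embedding = nn.Embedding(vocab_size, d)

word_ids = torch.tensor([[12, 845, 7, 3021]])  # an input word sequence as token ids
word_embeddings = embedding(word_ids)          # word embedding sequence, (1, 4, d)
```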

The cross-modal processing unit 506 integrates the input word embedding sequence with the input image feature amount sequence and generates and outputs a new word embedding sequence and a new image feature amount sequence, for example by using the cross-modal processing model.

The loss function calculation unit 507 calculates a loss value using a loss function, based on the input word embedding sequence and the input image feature amount sequence. The learning device 100 then pre-trains the integrated model 500 based on the loss value, thereby updating the data table 510, the image processing model, the language processing model, and the cross-modal processing model.

This allows the learning device 100 to reflect language information in the data table 510, specifically in its cross-modal feature amounts: the loss value computed from the word sequence is propagated into the cross-modal feature amounts of the data table 510.

In this way, the learning device 100 explicitly prepares the data table 510 in which language information is reflected, and when reflecting that information, updates the cross-modal feature amounts effectively on the basis of the quantization by the searched keys. As a result, the learning device 100 can provide a representation space that takes language information into account when handling images, enabling the image processing unit 504 to generate a useful image feature amount sequence.

(An example in which the learning device 100 learns the integrated model 600)
 Next, an example in which the learning device 100 learns the integrated model 600 is described with reference to FIG. 6.

FIG. 6 is an explanatory diagram showing an example of learning the integrated model 600. In FIG. 6, the learning device 100 has an integrated model 600 that instantiates the integrated model 500. The integrated model 600 includes an Add unit 601, an image Transformer 602, a language Transformer 603, and a cross-modal Transformer 604.

The learning device 100 also has a language information data table 610 that instantiates the data table 510. The language information data table 610 includes an array Query 611 that holds the search query sequence, an array Key 621 that holds the searched-key sequence, and an array Value 622 that holds the cross-modal feature amount sequence.

The learning device 100 pre-trains the integrated model 600 by masking part of the input and having the model predict the masked part. For example, the learning device 100 receives as input an image feature amount sequence F = f_1, f_2, ..., f_i, ..., f_I and a language embedding sequence E = e_1, e_2, ..., e_j, ..., e_L. It then masks the image feature amount f_i in F and the language embedding e_j in E.

Next, the learning device 100 converts the masked image feature amount sequence F with the Add unit 601 and outputs the converted sequence F'. The image Transformer 602 further converts F' and outputs the converted sequence F''. The language Transformer 603 converts the masked language embedding sequence E and outputs the converted sequence E'. The cross-modal Transformer 604 then integrates F'' with E' to generate a new image feature amount sequence F^ and a new language embedding sequence E^.

The learning device 100 predicts the masked parts based on the image feature amount f_i^ in F^ and the language embedding e_j^ in E^ that correspond to the masked positions, and pre-trains the integrated model 600 based on the prediction results; a schematic of this mask-and-predict step follows below. The description now turns to FIG. 7 and an example in which the learning device 100 converts the image feature amount sequence F with the Add unit 601.
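A schematic of the mask-and-predict pre-training on the image side. The masking ratio, zero replacement, and mean-squared reconstruction loss are assumptions where the text does not pin them down, and integrated_model is a hypothetical callable standing for the chain Add unit 601 → Transformers 602/603 → cross-modal Transformer 604:

```python
import torch
import torch.nn.functional as F

def pretrain_step(integrated_model, feats: torch.Tensor, mask_ratio: float = 0.15):
    """feats: (B, I, d) image feature amount sequence F."""
    mask = torch.rand(feats.shape[:2]) < mask_ratio  # positions to mask
    corrupted = feats.clone()
    corrupted[mask] = 0.0                            # replacement value is an assumption

    predicted = integrated_model(corrupted)          # F^ : (B, I, d)
    # train only on the masked positions, i.e. restore the masked features
    return F.mse_loss(predicted[mask], feats[mask])
```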

FIG. 7 is an explanatory diagram showing an example of converting the image feature amount sequence F. In FIG. 7, the learning device 100 first sets, in the array Key 621, the initial searched-key sequence obtained by multiplying an N-dimensional identity matrix I by a transformation matrix W_k, and sets, in the array Value 622, the cross-modal feature amount sequence obtained by multiplying the N-dimensional identity matrix I by a transformation matrix W_v.

The learning device 100 then multiplies the masked image feature amount sequence F by a transformation matrix W_q to calculate a search query sequence Q = q_1, ..., q_I and sets it in the array Query 611. The learning device 100 calculates the inner products between the array Query 611 and the array Key 621, computes the softmax of those inner products, and calculates the inner product of the softmax with the array Value 622 as correction information. It converts the image feature amount sequence F by adding the correction information to F. The description now turns to FIG. 8 and a concrete example of this calculation.

FIG. 8 is an explanatory diagram showing a concrete example of the calculation that converts the image feature amount sequence F. As shown in FIG. 8, (8-1) to convert the sequence F, the learning device 100 multiplies each image feature amount f_1, f_2, ..., f_i, ..., f_I by each of the transformation vectors W_q1, ..., W_qh, ..., W_qH of the transformation matrix W_q. For example, multiplying the image feature amount f_i by W_q1, ..., W_qh, ..., W_qH yields H partial feature queries q_{i,1}, ..., q_{i,h}, ..., q_{i,H}.

The case of acquiring the partial feature queries q_{i,1}, ..., q_{i,H} from the image feature amount f_i is described here, but the same calculation is performed for the other image feature amounts. In the following, the subsequent calculations are explained using the partial feature query q_{i,h} as an example; the same calculations are performed for the other partial feature queries.

(8-2) The learning device 100 calculates the degree of coincidence between the partial feature query q_{i,h} and the N partial feature keys k_{1,h}, ..., k_{n,h}, ..., k_{N,h} set in the array Key 621, and computes the softmax of those degrees. For example, the learning device 100 calculates the inner products between q_{i,h} and the N partial feature keys and computes their softmax, specifically by the following formula (1):

  exp(q_{i,h}·k_{1,h}) / Σ_x exp(q_{i,h}·k_{x,h}), ..., exp(q_{i,h}·k_{n,h}) / Σ_x exp(q_{i,h}·k_{x,h}), ..., exp(q_{i,h}·k_{N,h}) / Σ_x exp(q_{i,h}·k_{x,h})    (1)

(8-3) Based on the softmax, the learning device 100 calculates the weighted average df_i of the N cross-modal feature amounts v_{1,h}, ..., v_{n,h}, ..., v_{N,h} associated with the N partial feature keys k_{1,h}, ..., k_{n,h}, ..., k_{N,h}. In this way, the learning device 100 calculates I weighted averages df_1, ..., df_i, ..., df_I, corresponding to the I image feature amounts f_1, ..., f_i, ..., f_I, as new cross-modal feature amounts. It then adds df_1, ..., df_I to f_1, ..., f_I respectively, thereby converting the I image feature amounts. A sketch of steps (8-1) to (8-3) follows below.
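A sketch of steps (8-1) to (8-3) with H heads. The per-head dimension p = d/H and the concatenation of the H head outputs back into a d-dimensional correction are assumptions; the Key and Value arrays correspond to I·W_k and I·W_v from FIG. 7, so they are simply learnable matrices here:

```python
import torch
import torch.nn.functional as F

d, H, N = 64, 8, 128   # hypothetical: feature dim, number of heads, table entries
p = d // H             # per-head dimension (assumption)

W_q = torch.nn.Parameter(torch.randn(d, H, p))  # H transformation vectors per head
W_k = torch.nn.Parameter(torch.randn(N, H, p))  # array Key 621 (I @ W_k in Fig. 7)
W_v = torch.nn.Parameter(torch.randn(N, H, p))  # array Value 622 (I @ W_v in Fig. 7)

def correct(feats: torch.Tensor) -> torch.Tensor:
    """feats: (I, d) image feature amount sequence F; returns F plus the correction."""
    q = torch.einsum('id,dhp->ihp', feats, W_q)   # partial feature queries q_{i,h}
    match = torch.einsum('ihp,nhp->ihn', q, W_k)  # inner products q_{i,h}·k_{n,h}
    w = F.softmax(match, dim=-1)                  # Equation (1)
    df = torch.einsum('ihn,nhp->ihp', w, W_v)     # weighted average of v_{n,h}
    return feats + df.reshape(feats.shape[0], d)  # add df_i to f_i
```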

This allows the learning device 100 to reflect language information in the language information data table 610, for example in the transformation matrices W_q, W_k, and W_v. As a result, the learning device 100 can provide a representation space that takes language information into account when handling images, can generate useful image feature amount sequences, and can learn a useful integrated model 600. The learning device 100 can also fine-tune the trained integrated model 600 efficiently. Thereafter, using the integrated model 600 to solve a problem improves the accuracy of the obtained solution.

The learning device 100 also ensures, for example, that image feature amounts are appropriately complemented by cross-modal feature amounts obtained from language information, so that useful image feature amount sequences can be generated. The learning device 100 can therefore separate the image Transformer 602 and efficiently fine-tune it on its own; even when the image Transformer 602 is used alone to solve a problem, the accuracy of the obtained solution is improved.

Conventionally, an object in an image can carry different meanings depending on its relationship to the surrounding objects, yet the image feature amount representing the object is, unlike a language feature amount, not quantized but expressed as continuous values, which makes it difficult to assign appropriate meanings. Specifically, it is desirable to assign different meanings to image feature amounts representing various chairs according to their arrangement in a room, or to image feature amounts representing various red traffic lights according to their positional relationship with stopping cars, but assigning such meanings appropriately is difficult. In contrast, as described above, the learning device 100 makes it easier to assign appropriate meanings to image feature amounts through the cross-modal feature amounts obtained from language information, and thus makes it easier to obtain a useful integrated model 600.

(Another example in which the learning device 100 converts an image feature amount sequence)
 Next, another example in which the learning device 100 converts an image feature amount sequence is described with reference to FIG. 9. In the examples of FIGS. 5 to 8, the learning device 100 uses the data table 510 containing a searched-key sequence and a cross-modal feature amount sequence; in the example of FIG. 9, the learning device 100 uses a neural network 900 instead of the data table 510.

FIG. 9 is an explanatory diagram showing another example of converting an image feature amount sequence. In FIG. 9, the neural network 900 is a two-layer fully connected network having an input layer 901, an intermediate layer 902, and an output layer 903. The input layer 901 and the output layer 903 have the same dimension, and the intermediate layer 902 has a larger dimension than the input layer 901. The parameter of the connection between the input layer 901 and the intermediate layer 902 is a transformation matrix W_qk, and the parameter of the connection between the intermediate layer 902 and the output layer 903 is a transformation matrix W_kv.

The learning device 100 generates N degrees of correspondence from the i-th image feature amount f_i via the transformation matrix W_qk, and generates a cross-modal feature amount from those correspondences via the transformation matrix W_kv. By updating W_qk and W_kv through pre-training, the learning device 100 can realize the same function as the data table 510 with the neural network 900; one way to read this correspondence is sketched below.
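A sketch of this reading: the hidden layer of width N plays the role of the N correspondence degrees, W_qk combines the query and key roles, and W_kv the value role. The softmax between the layers is an assumption introduced here to make the correspondence with the table search explicit; the text itself does not fix the activation:

```python
import torch
import torch.nn.functional as F

d, N = 64, 128  # hypothetical: input/output dimension d, hidden width N
W_qk = torch.nn.Parameter(torch.randn(d, N))  # input -> intermediate (query-key role)
W_kv = torch.nn.Parameter(torch.randn(N, d))  # intermediate -> output (value role)

def convert(f_i: torch.Tensor) -> torch.Tensor:
    corr = f_i @ W_qk                  # N correspondence degrees for feature f_i
    weights = F.softmax(corr, dim=-1)  # assumption: mirrors the table search weighting
    return weights @ W_kv              # cross-modal feature built from N value rows
```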

This allows the learning device 100 to reflect language information in the neural network 900, for example in the transformation matrices W_qk and W_kv. As a result, the learning device 100 can provide a representation space that takes language information into account when handling images, can generate useful image feature amount sequences, and can learn a useful integrated model. The learning device 100 can also fine-tune the trained integrated model efficiently, and using the integrated model to solve a problem improves the accuracy of the obtained solution.

(An example in which the learning device 100 uses the integrated model 1000)
 Next, an example in which the learning device 100 uses the integrated model 1000 is described with reference to FIG. 10.

FIG. 10 is an explanatory diagram showing an example in which the learning device 100 uses the integrated model 1000, for example to solve a video monitoring problem that takes the surrounding situation into account. The integrated model 1000 includes an image Transformer 1010, a language Transformer 1020, and a cross-modal Transformer 1030, together with the language information data table 610 or the neural network 900.

Specifically, the learning device 100 solves the problem of detecting whether anyone has not yet evacuated from a room when a disaster occurs. The learning device 100 acquires image information 1001 from a surveillance camera that captures the exterior of a fitting room, obtains an image feature amount sequence from the image information 1001, converts it using the language information data table 610 or the neural network 900, and then inputs it to the image Transformer 1010.

The learning device 100 also acquires language information 1002 containing statements such as "shoes are taken off before using the fitting room" and "the curtain is closed while the fitting room is in use". The learning device 100 obtains a language feature amount sequence from the language information 1002 and inputs it to the language Transformer 1020. It then inputs the output values of the image Transformer 1010 and the language Transformer 1020 to the cross-modal Transformer 1030.

Based on the output value of the cross-modal Transformer 1030, the learning device 100 determines whether anyone has not yet evacuated from the fitting room. In this way, the learning device 100 can solve the problem by appropriately matching the situation presented in the language information with the situation presented in the image information, improving the accuracy of the obtained solution. For example, even when the fitting room curtain is closed and no person is directly visible, so that the image information alone would suggest that no one is present, the learning device 100 can correctly determine that someone is in the fitting room by appropriately matching the situation "a person is using the fitting room", presented in the language information, with the arrangement of the objects in the image. A sketch of this inference path follows below.
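A sketch of the inference path just described; every name here (image_transformer, convert, embedding, language_transformer, cross_modal_transformer, classifier_head) is a hypothetical handle for the corresponding component, and pooling the first joint token into a binary decision head is an assumption:

```python
import torch

def detect_remaining_person(image_feats: torch.Tensor, word_ids: torch.Tensor):
    """image_feats: (1, I, d) from the surveillance image 1001;
    word_ids: token ids for the language information 1002."""
    img_tokens = image_transformer(convert(image_feats))     # table/NN conversion + 1010
    lang_tokens = language_transformer(embedding(word_ids))  # 1020
    joint = cross_modal_transformer(
        torch.cat([img_tokens, lang_tokens], dim=1))         # 1030
    score = classifier_head(joint[:, 0])                     # decision head (assumption)
    return torch.sigmoid(score) > 0.5                        # someone still in the room?
```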

(Learning processing procedure)
 Next, an example of the learning processing procedure executed by the learning device 100 is described with reference to FIG. 11. The learning processing is realized, for example, by the CPU 301 shown in FIG. 3, a storage area such as the memory 302 or the recording medium 305, and the network I/F 303.

FIG. 11 is a flowchart showing an example of the learning processing procedure. In FIG. 11, the learning device 100 initializes the searched-key values and the feature-value correction values in the table with random values (step S1101).

Next, the learning device 100 generates a set of language feature values and a set of image feature values from a mutually related document and image (step S1102). The learning device 100 then masks image feature values by replacing a fixed proportion of the set of image feature values with unrelated values (step S1103).

Next, for each image feature value in the set, the learning device 100 calculates its degree of relationship to the searched-key values in the table (step S1104), and corrects each image feature value based on the feature-value correction values in the table corresponding to those degrees of relationship (step S1105).

Next, using the language processing model, the image processing model, and the cross-modal processing model, the learning device 100 restores the masked image feature values from the set of language feature values and the set of corrected image feature values, obtaining predicted values (step S1106). The learning device 100 then updates the parameter values, including the searched-key values and the feature-value correction values in the table, in the direction that reduces the loss value of the predicted values (step S1107).

Next, the learning device 100 determines whether a termination condition is satisfied (step S1108). The termination condition is, for example, that the loop of steps S1102 to S1108 has been repeated at least a fixed number of times, or that the difference between the previous loss value and the current loss value has fallen below a threshold.

If the termination condition is not satisfied (step S1108: No), the learning device 100 returns to the processing of step S1102. If the termination condition is satisfied (step S1108: Yes), the learning device 100 ends the learning processing. In this way, the information processing apparatus can obtain a useful model whose parameter values reflect the characteristics of the language information. A schematic loop corresponding to this flowchart follows below.
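Putting the flowchart steps together, a schematic training loop might look as follows. All callables are hypothetical placeholders, and the random initialization of step S1101 is assumed to have been done when the model and optimizer were constructed:

```python
import torch

def pretrain(integrated_model, optimizer, data_iter,
             extract_features, mask_features, masked_loss,
             max_iters=10_000, tol=1e-4):
    """Schematic loop for steps S1102-S1108; all callables are hypothetical."""
    prev_loss = float('inf')
    for step, (document, image) in enumerate(data_iter):     # S1102: paired inputs
        lang_feats, img_feats = extract_features(document, image)
        corrupted, mask = mask_features(img_feats)           # S1103: mask a fixed ratio
        predicted = integrated_model(lang_feats, corrupted)  # S1104-S1106: correct, restore
        loss = masked_loss(predicted, img_feats, mask)
        optimizer.zero_grad()
        loss.backward()                                      # S1107: reduce the loss
        optimizer.step()
        if step + 1 >= max_iters or abs(prev_loss - loss.item()) < tol:
            break                                            # S1108: end condition met
        prev_loss = loss.item()
```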

As described above, according to the information processing apparatus, a feature amount can be extracted from the first modal information, and a new feature amount can be acquired by converting the extracted feature amount based on parameters. A first output value can be acquired by inputting the new feature amount into the first processing model relating to the first modal, and a second output value can be acquired by inputting another feature amount, extracted from information of a second modal different from the first modal, into the second processing model relating to the second modal. A third output value can be acquired by inputting the first and second output values into the third processing model relating to the first and second modals, and the parameters can be updated based on the third output value. The information processing apparatus can thereby obtain a useful model containing parameters that reflect the characteristics of the second modal information.

According to the information processing apparatus, a first feature amount and a second feature amount can be adopted as the parameters. The degree of coincidence between the extracted feature amount and the first feature amount is calculated, and the new feature amount is acquired by converting the extracted feature amount based on an index value obtained by weighting the second feature amount by the calculated degree of coincidence. The information processing apparatus can thereby reflect the characteristics of the second modal information in the first and second feature amounts, and realize the conversion of the feature amount extracted from the first modal information.

According to the information processing apparatus, the parameters of a neural network whose intermediate layer has more nodes than its input and output layers can be adopted as the parameters, and the new feature amount can be acquired by inputting the extracted feature amount into the neural network. The information processing apparatus can thereby reflect the characteristics of the second modal information in the neural network and realize the conversion of the feature amount extracted from the first modal information.

According to the information processing apparatus, the first processing model can be updated based on the third output value, and a predetermined problem can be solved, in response to the input of other first modal information, using the updated parameters and the updated first processing model. The information processing apparatus can thereby obtain a useful first processing model and improve the accuracy of solutions obtained with it.

According to the information processing apparatus, the first, second, and third processing models can all be updated based on the third output value, and a predetermined problem can be solved using the updated parameters and the updated first, second, and third processing models. The information processing apparatus can thereby obtain useful first, second, and third processing models and improve the accuracy of the solutions obtained.

According to the information processing apparatus, a modal relating to images can be adopted as the first modal and a modal relating to language as the second modal, yielding a model useful for solving problems based on image information and language information.

According to the information processing apparatus, a modal relating to images can be adopted as the first modal and a modal relating to audio as the second modal, yielding a model useful for solving problems based on image information and audio information.

According to the information processing apparatus, a modal relating to a first language can be adopted as the first modal and a modal relating to a second language as the second modal, yielding a model useful for solving problems based on language information in different languages.

According to the information processing apparatus, the parameters can be updated by error backpropagation based on the third output value, which allows them to be updated accurately.

The learning method described in this embodiment can be realized by executing a program prepared in advance on a computer such as a PC or a workstation. The learning program described in this embodiment is recorded on a computer-readable recording medium, such as a hard disk, a flexible disk, a CD (Compact Disc)-ROM, an MO, or a DVD (Digital Versatile Disc), and is executed by being read from the recording medium by the computer. The learning program may also be distributed via a network such as the Internet.

100 Learning device
101 Model
111 Extraction model
112 Conversion model
121 First processing model
122 Second processing model
123 Third processing model
200 Information processing system
201 Client device
202 Terminal device
210 Network
300 Bus
301 CPU
302 Memory
303 Network I/F
304 Recording medium I/F
305 Recording medium
400 Storage unit
401 Acquisition unit
402 First extraction unit
403 Conversion unit
404 First processing unit
405 Second extraction unit
406 Second processing unit
407 Third processing unit
408 Update unit
409 Utilization unit
410 Output unit
500, 600, 1000 Integrated model
501 Image feature amount generation unit
502 Query generation unit
503 Table search unit
504 Image processing unit
505 Language processing unit
506 Cross-modal processing unit
507 Loss function calculation unit
510 Data table
601 Add unit
602, 1010 Image Transformer
603, 1020 Language Transformer
604 Cross-modal Transformer
610 Language information data table
611 Array Query
621 Array Key
622 Array Value
900 Neural network
901 Input layer
902 Intermediate layer
903 Output layer
1001 Image information
1002 Language information
1030 Cross-modal Transformer

Claims (9)

A learning method characterized in that a computer executes processing of:
extracting a feature amount from first modal information;
acquiring a new feature amount by converting the extracted feature amount based on parameters;
acquiring a first output value by inputting the acquired new feature amount into a first processing model relating to the first modal;
acquiring a second output value by inputting another feature amount, extracted from information of a second modal different from the first modal, into a second processing model relating to the second modal;
acquiring a third output value by inputting the acquired first output value and the second output value into a third processing model relating to the first modal and the second modal; and
updating the parameters based on the acquired third output value.
The learning method according to claim 1, wherein the parameters include a first feature amount and a second feature amount, and the processing of acquiring the new feature amount calculates a degree of coincidence between the extracted feature amount and the first feature amount, and acquires the new feature amount by converting the extracted feature amount based on an index value obtained by weighting the second feature amount based on the calculated degree of coincidence.
The learning method according to claim 1, wherein the parameters are parameters of a neural network whose intermediate layer has more nodes than its input layer and its output layer, and the processing of acquiring the new feature amount acquires the new feature amount by inputting the extracted feature amount into the neural network.
The learning method according to any one of claims 1 to 3, wherein the computer further executes processing of:
updating the first processing model based on the third output value; and
solving a predetermined problem, in response to input of other information of the first modal, using the updated parameters and the updated first processing model.
 The learning method according to any one of claims 1 to 3, wherein the computer further executes a process comprising:
 updating the first processing model, the second processing model, and the third processing model based on the third output value; and
 solving a predetermined problem, in response to input of other information of the first modal and other information of the second modal, using the updated parameter, the updated first processing model, the updated second processing model, and the updated third processing model.
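Claims 4 and 5 cover the downstream phase: after pretraining, the updated converter (and, in claim 5, all three processing models) is reused to solve a concrete problem. A hypothetical fine-tuning step in the claim-5 style, reusing the CrossModalLearner sketch above; task_head, the squared classifier size, and the placeholder data are all assumptions.

    import torch
    import torch.nn as nn

    first_info, second_feats = torch.randn(8, 5, 2048), torch.randn(8, 7, 256)  # placeholder inputs
    labels = torch.randint(0, 10, (8,))                                         # placeholder targets
    task_head = nn.Linear(1, 10)   # e.g. a 10-way classifier stacked on the third output value
    opt = torch.optim.Adam(list(model.parameters()) + list(task_head.parameters()), lr=1e-5)
    logits = task_head(model(first_info, second_feats))
    loss = nn.functional.cross_entropy(logits, labels)
    loss.backward()   # updates converter and all three processing models end to end
    opt.step()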
 The learning method according to any one of claims 1 to 5, wherein the pair of the first modal and the second modal is one of: a pair of a modal relating to images and a modal relating to language; a pair of the modal relating to images and a modal relating to speech; and a pair of a modal relating to a first language and a modal relating to a second language.
 The learning method according to any one of claims 1 to 6, wherein the process of updating the parameter updates the parameter by error backpropagation based on the third output value.
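Claim 7 fixes only the update rule, error backpropagation. In the sketch above, restricting the update to the conversion parameter would look as follows; the squared loss stands in for whatever task-dependent loss is derived from the third output value.

    import torch

    opt = torch.optim.SGD(model.converter.parameters(), lr=1e-3)  # step only the conversion parameter
    loss = model(first_info, second_feats).pow(2).mean()          # placeholder loss on the third output value
    loss.backward()   # error backpropagation through all three processing models
    opt.step()        # only the converter's parameters are updated here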
 A learning program that causes a computer to execute a process comprising:
 extracting a feature amount from information of a first modal;
 acquiring a new feature amount by converting the extracted feature amount based on a parameter;
 acquiring a first output value by inputting the acquired new feature amount into a first processing model relating to the first modal;
 acquiring a second output value by inputting another feature amount, extracted from information of a second modal different from the first modal, into a second processing model relating to the second modal;
 acquiring a third output value by inputting the acquired first output value and the acquired second output value into a third processing model relating to the first modal and the second modal; and
 updating the parameter based on the acquired third output value.
 A learning device comprising a control unit configured to:
 extract a feature amount from information of a first modal;
 acquire a new feature amount by converting the extracted feature amount based on a parameter;
 acquire a first output value by inputting the acquired new feature amount into a first processing model relating to the first modal;
 acquire a second output value by inputting another feature amount, extracted from information of a second modal different from the first modal, into a second processing model relating to the second modal;
 acquire a third output value by inputting the acquired first output value and the acquired second output value into a third processing model relating to the first modal and the second modal; and
 update the parameter based on the acquired third output value.
PCT/JP2019/044771 2019-11-14 2019-11-14 Learning method, learning program, and learning device Ceased WO2021095213A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2019/044771 WO2021095213A1 (en) 2019-11-14 2019-11-14 Learning method, learning program, and learning device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2019/044771 WO2021095213A1 (en) 2019-11-14 2019-11-14 Learning method, learning program, and learning device

Publications (1)

Publication Number Publication Date
WO2021095213A1

Family

ID=75913025

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/044771 Ceased WO2021095213A1 (en) 2019-11-14 2019-11-14 Learning method, learning program, and learning device

Country Status (1)

Country Link
WO (1) WO2021095213A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160314789A1 (en) * 2015-04-27 2016-10-27 Nuance Communications, Inc. Methods and apparatus for speech recognition using visual information
CN110377710A (en) * 2019-06-17 2019-10-25 杭州电子科技大学 A kind of vision question and answer fusion Enhancement Method based on multi-modal fusion

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Lu, Jiasen et al., "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks", arXiv.org, 6 August 2019, pp. 1-11, XP081456681, Retrieved from the Internet <URL:https://arxiv.org/pdf/1908.02265v1.pdf> [retrieved on 2019-12-13] *
Nguyen, Duy-Kien et al., "Improved Fusion of Visual and Language Representations by Dense Symmetric Co-Attention for Visual Question Answering", 2018, pp. 6087-6096, XP033473524, Retrieved from the Internet <URL:http://openaccess.thecvf.com/content_cvpr_2018/html/Nguyen_Improved_Fusion_of_CVPR_2018_paper.html> [retrieved on 2019-12-13] *
Yu, Jianfei et al., "Adapting BERT for Target-Oriented Multimodal Sentiment Classification", August 2019, pp. 5408-5414, XP055823561, Retrieved from the Internet <URL:https://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=5444&context=sisresearch> [retrieved on 2019-12-13] *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023013043A1 (en) * 2021-08-06 2023-02-09 Nippon Telegraph and Telephone Corporation Estimation method, device, and program
JPWO2023013043A1 (en) * 2021-08-06 2023-02-09
JP7544280B2 (en) 2021-08-06 2024-09-03 Nippon Telegraph and Telephone Corporation Estimation method, device and program
JP2023042973A (en) 2021-09-15 2023-03-28 Canon Inc. Image processing apparatus, method of controlling the same, and program
JP7741670B2 (en) 2021-09-15 2025-09-18 Canon Inc. Image processing device, its control method, and program
JP2023017910A (en) 2021-11-05 2023-02-07 Beijing Baidu Netcom Science Technology Co., Ltd. Pre-training method, device and electronic device for semantic representation model

Similar Documents

Publication Publication Date Title
CN108416065B (en) Image-sentence description generation system and method based on hierarchical neural network
WO2020108165A1 (en) Image description information generation method and device, and electronic device
CN111133453A (en) Artificial neural network
CN118246537B (en) Question and answer method, device, equipment and storage medium based on large model
KR102274581B1 (en) Method for generating personalized hrtf
WO2021095213A1 (en) Learning method, learning program, and learning device
CN115101050A (en) Speech recognition model training method and device, speech recognition method and medium
CN118194238B (en) Multilingual multi-mode emotion recognition method, system and equipment
KR20220065209A (en) Method and apparatus for recognizing image of various quality
CN110175338B (en) Data processing method and device
CN115101075A (en) Voice recognition method and related device
CN112765998A (en) Machine translation method, machine translation model training method, device and storage medium
CN117173269A (en) Facial image generation method, device, electronic device and storage medium
CN110570877A (en) Sign language video generation method, electronic device and computer-readable storage medium
CN115995225A (en) Model training method and device, speech synthesis method, device and storage medium
CN115881101B (en) Training method and device for speech recognition model and processing equipment
CN117116289B (en) Ward medical intercom management system and method thereof
CN119918010A (en) A multimodal sentiment analysis method and system based on multi-dimensional perception
CN119129733A (en) An adaptive multimodal relation extraction method based on mutual attention mechanism
CN114550159B (en) Image subtitle generation method, device, equipment and readable storage medium
JP7623619B2 (en) Learning device, estimation device, learning method, estimation method, and program
JP6703964B2 (en) Learning device, text generating device, method, and program
CN116975696A (en) Task processing methods, devices, electronic equipment and storage media
JP7207568B2 (en) Output method, output program, and output device
CN110188367B (en) Data processing method and device

Legal Events

121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19952446; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 19952446; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: JP)