
WO2021095213A1 - Learning method, learning program, and learning device - Google Patents

Learning method, learning program, and learning device

Info

Publication number
WO2021095213A1
WO2021095213A1 (PCT/JP2019/044771)
Authority
WO
WIPO (PCT)
Prior art keywords
modal
information
output value
feature amount
processing model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/JP2019/044771
Other languages
French (fr)
Japanese (ja)
Inventor
Yuichi Kamata
Akira Nakagawa
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd filed Critical Fujitsu Ltd
Priority to PCT/JP2019/044771 priority Critical patent/WO2021095213A1/en
Publication of WO2021095213A1 publication Critical patent/WO2021095213A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0499 Feedforward networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning

Definitions

  • the present invention relates to a learning method, a learning program, and a learning device.
  • BERT stands for Bidirectional Encoder Representations from Transformers.
  • A modal is a concept indicating the style and type of information; specific examples include images, documents (texts), and sounds.
  • Machine learning using multiple modals is called multimodal learning. Further, among multimodal learning, learning the co-occurrence relationship between a plurality of modals is sometimes called cross-modal learning.
  • For example, there is a prior technique in which a gesture feature is input and a model is generated for classifying whether or not the input gesture feature corresponds to a word.
  • There is also a prior technique that uses a word co-occurrence vector, whose elements are the appearance frequencies of co-occurring words near a word, and an image co-occurrence vector, whose elements are the appearance frequencies of co-occurring words near an image.
  • Further, there is a technique that recognizes sign language elements, which are the individual components of sign language movements, from time-series data of hand movements, and then recognizes sign language words from the recognized sign language elements.
  • See, for example, Japanese Unexamined Patent Publication No. 2018-163400, Japanese Unexamined Patent Publication No. 2002-132823, and Japanese Unexamined Patent Publication No. 9-34863.
  • However, with the prior art, the model may not have an expression space that expresses features based on the relationship between modal information about images and modal information about language, and may therefore not be useful in solving a problem.
  • the present invention aims to learn a useful model for extracting features from modal information.
  • According to one aspect, a learning method is proposed in which a feature amount is extracted from the information of a first modal, and the extracted feature amount is converted based on a parameter to acquire a new feature amount; the acquired new feature amount is input to a first processing model relating to the first modal to acquire a first output value; another feature amount extracted from the information of a second modal different from the first modal is input to a second processing model relating to the second modal to acquire a second output value; the acquired first output value and second output value are input to a third processing model relating to the first modal and the second modal to acquire a third output value; and the parameter is updated based on the acquired third output value.
  • A learning program and a learning device corresponding to this learning method are also proposed. The overall flow is sketched below.
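  • As a minimal sketch of this flow, the following Python code (PyTorch assumed; all architectures and dimensions are placeholders, not part of the original disclosure) wires the extraction, conversion, and three processing models together up to the third output value:

```python
import torch

# Hypothetical stand-ins for the models named above; the patent does not
# prescribe these architectures or dimensions.
extract_model = torch.nn.Linear(2048, 512)  # extracts a feature from the first modal
conversion = torch.nn.Linear(512, 512)      # conversion model holding the parameters
first_model = torch.nn.Linear(512, 512)     # first processing model (first modal)
second_model = torch.nn.Linear(512, 512)    # second processing model (second modal)
third_model = torch.nn.Linear(1024, 512)    # third processing model (both modals)

def forward(first_modal_info, second_modal_feature):
    f = extract_model(first_modal_info)              # extract a feature amount
    f_new = f + conversion(f)                        # convert it based on the parameters
    out1 = first_model(f_new)                        # first output value
    out2 = second_model(second_modal_feature)        # second output value
    out3 = third_model(torch.cat([out1, out2], -1))  # third output value
    return out3  # the parameters are then updated based on this value
```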
  • FIG. 1 is an explanatory diagram showing an example of the learning method according to an embodiment.
  • FIG. 2 is an explanatory diagram showing an example of the information processing system 200.
  • FIG. 3 is a block diagram showing a hardware configuration example of the learning device 100.
  • FIG. 4 is a block diagram showing a functional configuration example of the learning device 100.
  • FIG. 5 is a block diagram showing a specific functional configuration example of the learning device 100.
  • FIG. 6 is an explanatory diagram showing an example of learning the integrated model 600.
  • FIG. 7 is an explanatory diagram showing an example of converting the image feature quantity sequence F.
  • FIG. 8 is an explanatory diagram showing a specific example of the calculation for converting the image feature quantity sequence F.
  • FIG. 9 is an explanatory diagram showing another example of converting the image feature quantity sequence.
  • FIG. 10 is an explanatory diagram showing an example in which the learning device 100 uses the integrated model 1000.
  • FIG. 11 is a flowchart showing an example of the learning processing procedure.
  • FIG. 1 is an explanatory diagram showing an example of the learning method according to an embodiment.
  • The learning device 100 is a computer that learns a model useful for extracting a feature amount from the information of a predetermined modal, which can be used when solving a problem using the information of that modal.
  • For example, there is BERT as a pre-learning model. BERT is formed by stacking the encoder portions of the Transformer.
  • the following Non-Patent Document 1 can be referred to.
  • Non-Patent Document 1: Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL-HLT (2019).
  • However, BERT is intended to be applied to situations where a problem is solved using modal information about language, and cannot be applied to situations where a problem is solved using the information of a plurality of modals.
  • There is also VideoBERT, which extends BERT so that it can be applied to situations where problems are solved using modal information related to images as well as modal information related to language.
  • There is also CBT (Contrastive Bidirectional Transformer) for temporal representation learning.
  • CBT is formed by a language processing model that learns the co-occurrence relationships of language features, an image processing model that learns the co-occurrence relationships of image features, and a cross-modal processing model that integrates the outputs of the language processing model and the image processing model and learns the co-occurrence relationship between language and images.
  • Non-Patent Document 2 can be referred to.
  • Non-Patent Document 2: Sun, Chen, et al. "Contrastive Bidirectional Transformer for Temporal Representation Learning."
  • the various models mentioned above may not be useful models for solving problems.
  • the various models described above may not be useful models for extracting features from modal information when solving a problem using modal information.
  • For example, CBT does not have an expression space that expresses features based on the relationship between modal information about images and modal information about language, and thus may not be a useful model for solving a problem.
  • Specifically, the image feature amount has an expression space that reflects modal information about images, but not one that reflects modal information about language. Therefore, even if CBT is pre-learned, the image processing model included in CBT does not become a model that can effectively utilize modal information about language, and does not become a useful model for solving a problem. Further, the image feature amount has the property that, even when the same object is captured, it can differ between images with different appearances. For this reason, when reflecting the characteristics of modal information about language in the image feature amount, it is necessary to update not one image feature amount but various image feature amount expressions, which makes effective updating difficult and may adversely affect solving the problem.
  • the learning device 100 has, for example, a model 101.
  • the model 101 has an extraction model 111, a conversion model 112, a first processing model 121, a second processing model 122, and a third processing model 123.
  • the extraction model 111, the conversion model 112, and the first processing model 121 relate to the first modal.
  • the second processing model 122 relates to a second modal.
  • the third processing model 123 relates to a first modal and a second modal.
  • the learning device 100 acquires the information of the first modal and the information of the second modal.
  • Modal means a form of information.
  • the first modal and the second modal are different modals.
  • the first modal is, for example, an image modal. If the first modal is about an image, the information in the first modal is, for example, an image.
  • the second modal is, for example, a modal related to language. If the second modal is about language, the information in the second modal is, for example, a document.
  • the learning device 100 uses the extraction model 111 to extract the feature amount from the first modal information.
  • the learning device 100 extracts, for example, an image feature amount from an image.
  • the image feature amount is represented by, for example, a vector indicating an array.
  • the learning device 100 acquires a new feature amount by converting the extracted feature amount based on the parameter using the conversion model 112. There are a plurality of parameters, for example.
  • The learning device 100 calculates, for example, a correction amount for correcting the extracted image feature amount based on the extracted image feature amount and a plurality of parameters, and adds the correction amount to the extracted image feature amount to obtain a new image feature amount.
  • the learning device 100 acquires the first output value by inputting the acquired new feature amount into the first processing model 121.
  • the first processing model 121 is, for example, an image processing model.
  • the learning device 100 acquires the first output value by inputting a new image feature amount into the image processing model, for example.
  • the learning device 100 acquires a second output value by inputting another feature amount extracted from the second modal information into the second processing model 122.
  • the second processing model 122 is, for example, a language processing model.
  • the learning device 100 acquires a second output value by inputting, for example, a language feature amount extracted from a document into a language processing model.
  • Language features are represented, for example, by vectors representing arrays.
  • the learning device 100 acquires the third output value by inputting the acquired first output value and the second output value into the third processing model 123.
  • the third processing model 123 is, for example, a cross-modal processing model.
  • the cross-modal processing model integrates information from multiple modals and learns co-occurrence of information from multiple modals.
  • the learning device 100 acquires the third output value by inputting the first output value and the second output value into the cross-modal processing model, for example.
  • the learning device 100 updates a plurality of parameters based on the acquired third output value.
  • the learning device 100 updates a plurality of parameters by the error back propagation method, for example, based on the third output value.
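  • A sketch of this update step using autograd (the loss function is an assumption; the patent states only that the parameters are updated from the third output value by the error back propagation method):

```python
import torch

params = torch.randn(16, 512, requires_grad=True)  # the plurality of parameters
optimizer = torch.optim.SGD([params], lr=1e-3)

def update_parameters(third_output, target):
    # third_output must have been computed from params so that gradients flow back.
    loss = torch.nn.functional.mse_loss(third_output, target)  # assumed loss function
    optimizer.zero_grad()
    loss.backward()    # error back propagation from the third output value
    optimizer.step()   # update the plurality of parameters
```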
  • the learning device 100 may output a plurality of updated parameters.
  • the learning device 100 can obtain a plurality of parameters useful from the viewpoint of handling the first modal information and the second modal information in solving the problem.
  • For example, the learning device 100 explicitly prepares a plurality of parameters capable of reflecting the features of the information of the second modal, and can effectively utilize the characteristics of the information of the second modal when acquiring a new feature amount from the information of the first modal. Further, the learning device 100 does not directly reflect the features of the information of the second modal in the feature amount itself extracted from the information of the first modal, which can reduce adverse effects when solving the problem.
  • the learning device 100 may further update the first processing model 121 based on the third output value.
  • the learning device 100 can obtain a first processing model 121 that is useful from the viewpoint of handling the first modal information in solving the problem. Then, the learning device 100 can use the model 101 when solving a problem by using the information of the first modal and the information of the second modal, and can improve the accuracy of the obtained solution.
  • The learning device 100 may separate the extraction model 111, the conversion model 112, and the first processing model 121 from the model 101. According to this, the learning device 100 can obtain a useful combination of the extraction model 111, the conversion model 112, and the first processing model 121. Then, the learning device 100 may use the separated extraction model 111, conversion model 112, and first processing model 121 when solving a problem using the information of the first modal, to improve the accuracy of the obtained solution.
  • the learning device 100 may further update the second processing model 122 and the third processing model 123.
  • the learning device 100 can obtain a second processing model 122 that is useful from the viewpoint of handling the second modal information in solving the problem.
  • the learning device 100 can obtain a third processing model 123 that is useful from the viewpoint of integrating the information of the first modal and the information of the second modal in solving the problem.
  • As a result, the learning device 100 can obtain a useful model 101 in which the extraction model 111, the conversion model 112, the first processing model 121, the second processing model 122, and the third processing model 123 are combined. Then, the learning device 100 can use the model 101 when solving a problem using the information of the first modal and the information of the second modal, and can improve the accuracy of the obtained solution.
  • the learning device 100 may separate the conversion model 112 from the model 101. According to this, the learning device 100 can obtain a useful conversion model 112. Then, the learning device 100 may use the separated conversion model 112 when solving the problem using the information of the first modal to improve the accuracy of the obtained solution.
  • The learning device 100 may separate the extraction model 111, the conversion model 112, the first processing model 121, and the second processing model 122 from the model 101. According to this, the learning device 100 can obtain a useful combination of these four models. Then, the learning device 100 may use the separated extraction model 111, conversion model 112, first processing model 121, and second processing model 122 when solving a problem using the information of the first modal and the information of the second modal, to improve the accuracy of the obtained solution.
  • the learning device 100 can obtain a useful model.
  • The useful model is, for example, any one of the updated extraction model 111, conversion model 112, first processing model 121, second processing model 122, and third processing model 123, or a combination of two or more of them.
  • FIG. 2 is an explanatory diagram showing an example of the information processing system 200.
  • the information processing system 200 includes a learning device 100, a client device 201, and a terminal device 202.
  • the learning device 100 and the client device 201 are connected via a wired or wireless network 210.
  • the network 210 is, for example, a LAN (Local Area Network), a WAN (Wide Area Network), the Internet, or the like. Further, in the information processing system 200, the learning device 100 and the terminal device 202 are connected via a wired or wireless network 210.
  • the learning device 100 stores an integrated model that accepts input of the first modal information and the second modal information.
  • the stored integrated model corresponds to, for example, the model 101 shown in FIG.
  • the learning device 100 updates the integrated model based on the teacher data.
  • the teacher data is, for example, correspondence information in which the sample first modal information, the sample second modal information, and the correct answer data are associated with each other.
  • the teacher data is input to the learning device 100 by the user of the learning device 100, for example.
  • the correct answer data indicates, for example, the correct answer for the output value of the integrated model.
  • the correct answer data may indicate the correct answer for the solution obtained by solving the problem based on the output value of the integrated model. If the first modal is about an image, the information in the first modal is an image. If the second modal is about language, the information in the second modal is a document.
  • the update of the integrated model is realized by, for example, the error back propagation method.
  • the update of the integrated model may be realized by, for example, a learning method other than error back propagation.
  • the learning device 100 acquires the information of the first modal and the information of the second modal when solving the problem.
  • the learning device 100 acquires, for example, first modal information input to the learning device 100 by the user of the learning device 100. Further, the learning device 100 may acquire the first modal information by receiving the information from the client device 201 or the terminal device 202.
  • the learning device 100 acquires, for example, second modal information input to the learning device 100 by the user of the learning device 100. Further, the learning device 100 may acquire the second modal information by receiving the information from the client device 201 or the terminal device 202.
  • The learning device 100 solves the problem based on the acquired information of the first modal and the information of the second modal by using the updated integrated model, and transmits the obtained solution to the client device 201.
  • The learning device 100 may further fine-tune the updated integrated model before using it to solve the problem.
  • the learning device 100 is, for example, a server, a PC (Personal Computer), or the like.
  • the client device 201 is a computer capable of communicating with the learning device 100.
  • the client device 201 may, for example, transmit the first modal information to the learning device 100. Further, the client device 201 may transmit, for example, second modal information to the learning device 100.
  • the client device 201 receives and outputs the solution obtained by the learning device 100 solving the problem.
  • the output format is, for example, display on a display, print output to a printer, transmission to another computer, or storage in a storage area.
  • the client device 201 is, for example, a PC, a tablet terminal, a smartphone, or the like.
  • the terminal device 202 is a computer capable of communicating with the learning device 100.
  • the terminal device 202 may, for example, transmit the first modal information to the learning device 100.
  • the terminal device 202 may transmit, for example, second modal information to the learning device 100.
  • the terminal device 202 is, for example, a PC, a tablet terminal, a smartphone, an electronic device, an IoT (Internet of Things) device, a sensor device, or the like.
  • the terminal device 202 may be a surveillance camera.
  • Here, the case where the learning device 100 both updates the integrated model and solves the problem using the integrated model has been described, but the present invention is not limited to this.
  • For example, another computer may update the integrated model, and the learning device 100 may solve the problem using the integrated model received from the other computer.
  • Conversely, the learning device 100 may update the integrated model and provide it to another computer, and the problem may be solved on the other computer using the integrated model.
  • Here, the case where the learning device 100 is a device different from the client device 201 and the terminal device 202 has been described, but the present invention is not limited to this.
  • the learning device 100 may be integrated with the client device 201.
  • the learning device 100 may be integrated with the terminal device 202.
  • Here, the case where the learning device 100 realizes the integrated model in software has been described, but the present invention is not limited to this.
  • For example, the learning device 100 may realize the integrated model by electronic circuitry.
  • the terminal device 202 is a surveillance camera, and transmits an image of the target to the learning device 100.
  • the object is specifically the appearance of the fitting room.
  • the learning device 100 stores a document that serves as an explanatory text about the target.
  • Specifically, the explanatory text is a document describing tendencies such as that the curtain of the fitting room tends to be closed while a person is using the fitting room, and that shoes tend to be placed in front of the fitting room while a person is using it.
  • the learning device 100 solves the problem of determining the degree of risk based on the image and the document by using the model.
  • The degree of risk is, for example, an index value indicating how likely it is that a person who has not completed evacuation remains in the fitting room.
  • The degree of risk may also be, for example, a binary value indicating whether or not any person who has not completed evacuation remains in the fitting room.
  • FIG. 3 is a block diagram showing a hardware configuration example of the learning device 100.
  • the learning device 100 includes a CPU (Central Processing Unit) 301, a memory 302, a network I / F (Interface) 303, a recording medium I / F 304, and a recording medium 305. Further, each component is connected by a bus 300.
  • the CPU 301 controls the entire learning device 100.
  • the memory 302 includes, for example, a ROM (Read Only Memory), a RAM (Random Access Memory), a flash ROM, and the like. Specifically, for example, a flash ROM or ROM stores various programs, and RAM is used as a work area of CPU 301. The program stored in the memory 302 is loaded into the CPU 301 to cause the CPU 301 to execute the coded process.
  • the network I / F 303 is connected to the network 210 through a communication line, and is connected to another computer via the network 210. Then, the network I / F 303 controls the internal interface with the network 210 and controls the input / output of data from another computer.
  • the network I / F 303 is, for example, a modem or a LAN adapter.
  • the recording medium I / F 304 controls data read / write to the recording medium 305 according to the control of the CPU 301.
  • the recording medium I / F 304 is, for example, a disk drive, an SSD (Solid State Drive), a USB (Universal Serial Bus) port, or the like.
  • the recording medium 305 is a non-volatile memory that stores data written under the control of the recording medium I / F 304.
  • the recording medium 305 is, for example, a disk, a semiconductor memory, a USB memory, or the like.
  • the recording medium 305 may be detachable from the learning device 100.
  • the learning device 100 may include, for example, a keyboard, a mouse, a display, a printer, a scanner, a microphone, a speaker, and the like, in addition to the above-described components. Further, the learning device 100 may have a plurality of recording media I / F 304 and recording media 305. Further, the learning device 100 does not have to have the recording medium I / F 304 or the recording medium 305.
  • the hardware configuration example of the client device 201 is specifically the same as the hardware configuration example of the learning device 100 shown in FIG. 3, so the description thereof will be omitted.
  • FIG. 4 is a block diagram showing a functional configuration example of the learning device 100.
  • The learning device 100 includes a storage unit 400, an acquisition unit 401, a first extraction unit 402, a conversion unit 403, a first processing unit 404, a second extraction unit 405, a second processing unit 406, a third processing unit 407, an update unit 408, a utilization unit 409, and an output unit 410.
  • the storage unit 400 is realized by, for example, a storage area such as the memory 302 or the recording medium 305 shown in FIG.
  • the storage unit 400 may be included in a device different from the learning device 100, and the stored contents of the storage unit 400 may be referred to by the learning device 100.
  • The acquisition unit 401 to the output unit 410 function as an example of a control unit. Specifically, the functions of the acquisition unit 401 to the output unit 410 are realized by causing the CPU 301 to execute a program stored in a storage area such as the memory 302 or the recording medium 305 shown in FIG. 3, or by the network I/F 303.
  • the processing result of each functional unit is stored in a storage area such as the memory 302 or the recording medium 305 shown in FIG. 3, for example.
  • the storage unit 400 stores various information referred to or updated in the processing of each functional unit.
  • the storage unit 400 stores an integrated model that accepts input of the first modal information and the second modal information.
  • the integrated model to be stored has, for example, a first extraction model, a transformation model, a first processing model, a second extraction model, a second processing model, and a third processing model.
  • the first extraction model, the transformation model, and the first processing model relate to the first modal.
  • the second extraction model and the second processing model relate to a second modal.
  • the third processing model relates to a first modal and a second modal.
  • the second modal is different from the first modal.
  • the first modal is an image modal and the second modal is a language modal.
  • the first modal is an image modal and the second modal is an audio modal.
  • the first modal is a modal for a first language and the second modal is a modal for a second language.
  • the acquisition unit 401 acquires various information used for processing of each functional unit.
  • the acquisition unit 401 stores various acquired information in the storage unit 400 or outputs it to each function unit. Further, the acquisition unit 401 may output various information stored in the storage unit 400 to each function unit.
  • the acquisition unit 401 acquires various information based on, for example, a user's operation input.
  • the acquisition unit 401 may receive various information from a device different from the learning device 100, for example.
  • the acquisition unit 401 acquires the information of the first modal and the information of the second modal.
  • The acquisition unit 401 acquires the information of the first modal and the information of the second modal by, for example, accepting input of the information of the first modal and the information of the second modal by the user.
  • The acquisition unit 401 may acquire, for example, the information of the first modal and the information of the second modal by receiving them from the client device 201 or the terminal device 202. The acquisition unit 401 may also acquire the information of the first modal and the information of the second modal by acquiring teacher data that includes them.
  • the acquisition unit 401 may accept a start trigger to start processing of any of the functional units.
  • the start trigger is, for example, that there is a predetermined operation input by the user.
  • the start trigger may be, for example, the receipt of predetermined information from another computer.
  • the start trigger may be, for example, that any functional unit outputs predetermined information.
  • the acquisition unit 401 receives, for example, the acquisition of the first modal information and the second modal information as a start trigger for starting the processing of each functional unit.
  • the first extraction unit 402 extracts the feature amount from the first modal information.
  • the first extraction unit 402 extracts, for example, an image feature amount from an image.
  • the extracted image feature amount is, for example, an image feature amount indicating an object appearing in the image.
  • the first extraction unit 402 can change the information of the first modal into a format that can be input to the conversion model.
  • the first extraction unit 402 can extract useful information for solving the problem from the first modal information.
  • the conversion unit 403 acquires a new feature amount by converting the extracted feature amount based on a plurality of parameters using the conversion model.
  • the plurality of parameters include, for example, a first feature amount and a second feature amount.
  • For example, the conversion unit 403 calculates the degree of coincidence between the extracted feature amount and the first feature amount, and acquires a new feature amount by converting the extracted feature amount based on an index value obtained by weighting the second feature amount according to the calculated degree of coincidence.
  • As a result, the conversion unit 403 prepares a plurality of parameters capable of reflecting the features of the information of the second modal, and can reflect the characteristics of the information of the second modal in the feature amount extracted from the information of the first modal via the plurality of parameters. A minimal sketch follows.
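  • A minimal sketch of this conversion, treating the first feature amounts as match targets and the second feature amounts as correction values (NumPy; names and shapes are assumptions, not from the original):

```python
import numpy as np

def convert_feature(f, first_feats, second_feats):
    """Convert one extracted feature f (D,) into a new feature amount.

    first_feats:  (N, D) parameters matched against f (degree of coincidence)
    second_feats: (N, D) parameters weighted by the degree of coincidence
    """
    scores = first_feats @ f              # degree of coincidence with each parameter
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax weighting
    index_value = weights @ second_feats  # weighted index value
    return f + index_value                # new feature amount
```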
  • The plurality of parameters may be, for example, the parameters of a neural network in which the number of nodes in the input layer and the number of nodes in the output layer are equal, and the number of nodes in the intermediate layer is larger than both.
  • This neural network corresponds to the conversion model.
  • the conversion unit 403 acquires a new feature amount by inputting the extracted feature amount into the neural network, for example.
  • In this case as well, the conversion unit 403 prepares a plurality of parameters capable of reflecting the features of the information of the second modal, and can reflect the characteristics of the information of the second modal in the feature amount extracted from the information of the first modal via the plurality of parameters.
  • the first processing unit 404 acquires the first output value by inputting the acquired new feature amount into the first processing model.
  • the first processing model is, for example, an image processing model.
  • the first processing unit 404 acquires the first output value, for example, by inputting a new image feature amount into the image processing model.
  • the first processing unit 404 can change the acquired new feature quantity into a format that can be input to the third processing model.
  • the first processing unit 404 can extract useful information for solving the problem from the acquired new features.
  • the second extraction unit 405 extracts other feature quantities from the second modal information.
  • the second extraction unit 405 extracts, for example, a language feature from a document.
  • the extracted linguistic feature is, for example, a linguistic feature indicating a word contained in a document.
  • the second extraction unit 405 can change the information of the second modal into a format that can be input to the second processing model.
  • the second extraction unit 405 can extract useful information for solving the problem from the second modal information.
  • the second processing unit 406 acquires the second output value by inputting the extracted other feature quantities into the second processing model.
  • the second processing model is, for example, a language processing model.
  • the second processing unit 406 acquires the second output value by inputting the language feature amount into the language processing model, for example.
  • the second processing unit 406 can change the extracted other features into a format that can be input to the third processing model.
  • the second processing unit 406 can extract useful information for solving the problem from the extracted other features.
  • the third processing unit 407 acquires the third output value by inputting the acquired first output value and the second output value into the third processing model.
  • the third processing model is, for example, a cross-modal processing model.
  • the third processing unit 407 acquires the third output value by inputting the first output value and the second output value into the cross-modal processing model, for example.
  • An example of acquiring the third output value will be specifically described later with reference to FIGS. 5 to 8.
  • the third processing unit 407 can obtain a third output value that integrates the features of the language and the image.
  • the update unit 408 updates a plurality of parameters based on the acquired third output value.
  • the update unit 408 updates a plurality of parameters by the error back propagation method, for example, based on the third output value.
  • the update unit 408 calculates the loss value based on the third output value by, for example, the loss function, and updates a plurality of parameters based on the loss value.
  • As a result, the update unit 408 can obtain a plurality of parameters useful from the viewpoint of handling the information of the first modal and the information of the second modal in solving the problem. For example, the update unit 408 can optimize the plurality of explicitly prepared parameters so that the features of the information of the second modal can be effectively utilized when acquiring a new feature amount from the information of the first modal.
  • the update unit 408 updates the first processing model based on the third output value.
  • the update unit 408 updates the first processing model by the error back propagation method, for example, based on the third output value.
  • the update unit 408 calculates the loss value based on the third output value by, for example, the loss function, and updates the first processing model based on the loss value.
  • the update unit 408 can obtain a first processing model that is useful from the viewpoint of handling the first modal information in solving the problem.
  • the update unit 408 updates the second processing model and the third processing model based on the third output value.
  • the update unit 408 updates the second processing model and the third processing model by the error back propagation method, for example, based on the third output value.
  • the update unit 408 calculates the loss value based on the third output value by the loss function, and updates the second processing model and the third processing model based on the loss value.
  • the update unit 408 can obtain a second processing model that is useful from the viewpoint of handling the second modal information in solving the problem.
  • the learning device 100 can obtain a third processing model that is useful from the viewpoint of integrating the information of the first modal and the information of the second modal in solving the problem.
  • The utilization unit 409 solves a predetermined problem.
  • the utilization unit 409 solves a predetermined problem in response to input of other information of the first modal by using, for example, a plurality of updated parameters and an unupdated first processing model.
  • the utilization unit 409 can use the updated plurality of parameters when solving the problem based on at least other information of the first modal, and can improve the accuracy of the obtained solution.
  • The utilization unit 409 solves a predetermined problem in response to input of other information of the first modal by using, for example, the updated plurality of parameters and the updated first processing model. As a result, the utilization unit 409 can use the updated plurality of parameters and the updated first processing model when solving a problem based on at least other information of the first modal, and can improve the accuracy of the obtained solution.
  • The utilization unit 409 solves the predetermined problem based on other information of the first modal and other information of the second modal by using, for example, the updated plurality of parameters, the updated first processing model, the updated second processing model, and the updated third processing model. As a result, the utilization unit 409 can use all of these updated components when solving the problem, and can improve the accuracy of the obtained solution.
  • the output unit 410 outputs the processing result of any of the functional units.
  • the output format is, for example, display on a display, print output to a printer, transmission to an external device by the network I / F 303, or storage in a storage area such as a memory 302 or a recording medium 305.
  • the output unit 410 can notify the user of the processing result of each functional unit, and can improve the convenience of the learning device 100.
  • the output unit 410 outputs, for example, a plurality of updated parameters.
  • the output unit 410 can refer to a plurality of updated parameters that are useful from the viewpoint of handling the first modal information and the second modal information in solving the problem.
  • the output unit 410 can make the updated plurality of parameters available to other computers, for example. Therefore, the output unit 410 can improve the accuracy of the solution obtained by solving the problem by using at least a plurality of parameters.
  • the output unit 410 may output, for example, the first processing model.
  • the output unit 410 can refer to the updated first processing model, which is useful from the viewpoint of handling the first modal information in solving the problem.
  • the output unit 410 can make the updated first processing model available to other computers, for example. Therefore, the output unit 410 can improve the accuracy of the solution obtained by solving the problem based on the information of the first modal by using at least the first processing model.
  • the output unit 410 may output, for example, a second processing model and a third processing model.
  • the output unit 410 can refer to the updated second processing model.
  • The output unit 410 can make the updated second processing model and third processing model available to other computers, for example. Therefore, the output unit 410 can improve the accuracy of the solution obtained by solving a problem based on the information of the first modal and the information of the second modal using at least the second processing model and the third processing model.
  • FIG. 5 is a block diagram showing a specific functional configuration example of the learning device 100.
  • the learning device 100 has an integrated model 500.
  • The integrated model 500 includes an image feature amount generation unit 501, a query generation unit 502, a table search unit 503, an image processing unit 504, a language processing unit 505, a cross-modal processing unit 506, and a loss function calculation unit 507.
  • the learning device 100 has a data table 510.
  • The data table 510 is a table that stores a searched key sequence and a cross-modal feature amount sequence.
  • the data table 510 is a table in which linguistic information is reflected by pre-learning.
  • The data table 510 is, for example, a table in which cross-modal feature amounts reflecting linguistic information are associated one-to-one with searched keys that quantize image feature amounts. Based on a search query generated from an image feature amount, the data table 510 is used to extract the linguistic information reflected in its cross-modal feature amounts, weighted according to the degree of relevance to the image feature amount.
  • the image feature amount generation unit 501 generates an image feature amount sequence from the input image and outputs it.
  • the image feature amount generation unit 501 detects, for example, an object appearing in the input image, and generates and outputs an image feature amount sequence including an image feature amount indicating each of the detected objects.
  • the image feature sequence is represented by, for example, a vector.
  • the image feature amount indicates visual feature information of an object including, for example, a color and a shape.
  • the query generation unit 502 generates and outputs a search query sequence based on the input image feature sequence.
  • The query generation unit 502 generates the search query sequence by, for example, multiplying the image feature amount sequence by a transformation matrix W_q.
  • The search query sequence is represented by, for example, a vector.
  • The table search unit 503 weights the cross-modal feature amounts of the data table 510 based on the input search query sequence, and calculates and outputs the result. For example, the table search unit 503 generates and outputs a new cross-modal feature amount sequence as a weighted average of the cross-modal feature amounts of the data table 510, with weights based on the inner products of the input search query sequence and the searched key sequence of the data table 510.
  • The image processing unit 504 converts the image feature amount sequence based on the input new cross-modal feature amount sequence, and outputs the result. For example, the image processing unit 504 adds the new cross-modal feature amount sequence to the image feature amount sequence, converts the resulting image feature amount sequence using the image processing model, and outputs the converted image feature amount sequence.
  • the language processing unit 505 generates and outputs a word embedding string based on the input word string.
  • the language processing unit 505 generates and outputs a word embedding string based on the input word string, for example, using a language processing model.
  • The cross-modal processing unit 506 integrates the input word embedding sequence and the input image feature amount sequence, and generates and outputs a new word embedding sequence and a new image feature amount sequence. For example, the cross-modal processing unit 506 performs this integration using the cross-modal processing model.
  • the loss function calculation unit 507 calculates the loss value using the loss function based on the input word embedding sequence and the input image feature amount sequence. Then, the learning device 100 updates the data table 510, the image processing model, the language processing model, and the cross-modal processing model by pre-learning the integrated model 500 based on the loss value.
  • the learning device 100 can reflect the language information in the data table 510.
  • the learning device 100 can reflect, for example, linguistic information in the cross-modal features of the data table 510.
  • the learning device 100 can reflect the loss value based on the word string in the cross-modal feature amount of the data table 510.
  • The learning device 100 explicitly prepares the data table 510 reflecting the linguistic information, and, when reflecting the linguistic information, can update the cross-modal feature amounts effectively based on the quantization by the searched keys.
  • the learning device 100 can prepare an expression space in consideration of linguistic information when handling an image, and the image processing unit 504 can generate a useful image feature quantity sequence.
  • FIG. 6 is an explanatory diagram showing an example of learning the integrated model 600.
  • the learning device 100 has an integrated model 600 that embodies the integrated model 500.
  • The integrated model 600 includes an Add unit 601, an image Transformer 602, a language Transformer 603, and a cross-modal Transformer 604.
  • the learning device 100 has a language information data table 610 that embodies the data table 510.
  • The language information data table 610 includes an array Query 611 in which the search query sequence is set, an array Key 621 in which the searched key sequence is set, and an array Value 622 in which the cross-modal feature amount sequence is set.
  • the learning device 100 performs pre-learning of the integrated model 600 by masking a part of the input and predicting the masked part.
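  • A small sketch of this input masking for an image feature sequence (the ratio and the random replacement are assumptions; the flowchart of FIG. 11 states only that features are replaced with irrelevant values at a constant ratio):

```python
import numpy as np

def mask_image_features(F, mask_ratio=0.15, rng=np.random.default_rng(0)):
    """Mask part of an (I, D) image feature sequence for pre-learning.

    Returns the masked sequence and the masked indices to be predicted.
    """
    F_masked = F.copy()
    n_mask = max(1, int(len(F) * mask_ratio))
    idx = rng.choice(len(F), size=n_mask, replace=False)
    F_masked[idx] = rng.normal(size=(n_mask, F.shape[1]))  # irrelevant values
    return F_masked, idx
```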
  • the learning device 100 converts the masked image feature amount sequence F by the Add unit 601 and outputs the converted image feature amount sequence F'.
  • The learning device 100 further converts the converted image feature amount sequence F' by the image Transformer 602, and outputs the converted image feature amount sequence F''.
  • The learning device 100 converts the masked language embedding sequence E by the language Transformer 603, and outputs the converted language embedding sequence E'.
  • The learning device 100 then inputs the converted image feature amount sequence F'' and the converted language embedding sequence E' into the cross-modal Transformer 604 to generate a new image feature amount sequence and a new language embedding sequence.
  • FIG. 7 is an explanatory diagram showing an example of converting the image feature quantity sequence F.
  • The learning device 100 sets, in the array Key 621, the initial value of the searched key sequence obtained by multiplying the N-dimensional identity matrix I by a transformation matrix W_k, and sets, in the array Value 622, the cross-modal feature matrix obtained by multiplying the N-dimensional identity matrix I by a transformation matrix W_v, as sketched below.
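  • A sketch of this initialization (N and the feature dimension are hypothetical):

```python
import numpy as np

N, D = 16, 512                      # table size and feature dimension (assumed)
I_N = np.eye(N)                     # N-dimensional identity matrix I
W_k = np.random.randn(N, D) * 0.02  # transformation matrix W_k
W_v = np.random.randn(N, D) * 0.02  # transformation matrix W_v
Key = I_N @ W_k                     # initial searched key sequence (array Key 621)
Value = I_N @ W_v                   # initial cross-modal feature matrix (array Value 622)
```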
  • The learning device 100 calculates the inner product of the array Query 611 and the array Key 621, and calculates the softmax of the calculated inner product.
  • The learning device 100 calculates the inner product of the calculated softmax and the array Value 622 as correction information.
  • the learning device 100 converts the image feature quantity sequence F by adding the correction information to the image feature quantity sequence F. Next, the description shifts to FIG. 8, and a specific example of the calculation for converting the image feature quantity sequence F will be described.
  • FIG. 8 is an explanatory diagram showing a specific example of the calculation for converting the image feature quantity sequence F.
  • (8-1) When converting the image feature amount sequence F, the learning device 100 multiplies each of the image feature amounts f_1, f_2, ..., f_i, ..., f_I by each of the transformation vectors W_q1, ..., W_qh, ..., W_qH of the transformation matrix W_q.
  • For example, by multiplying the image feature amount f_i by the transformation vectors W_q1, ..., W_qh, ..., W_qH, the learning device 100 acquires H partial feature queries q_{i,1}, ..., q_{i,h}, ..., q_{i,H}.
  • Although the case of acquiring the partial feature queries q_{i,1}, ..., q_{i,h}, ..., q_{i,H} based on the image feature amount f_i has been described here, the learning device 100 performs similar calculations for the image feature amounts other than f_i.
  • The subsequent calculation is described taking the partial feature query q_{i,h} as an example, but the same calculation is performed for the partial feature queries other than q_{i,h}.
  • The learning device 100 calculates the degree of coincidence between the partial feature query q_{i,h} and the N partial feature keys k_{1,h}, ..., k_{n,h}, ..., k_{N,h} set in the array Key 621, and calculates the softmax of the degrees of coincidence.
  • For example, the learning device 100 calculates the inner products of the partial feature query q_{i,h} and the N partial feature keys k_{1,h}, ..., k_{n,h}, ..., k_{N,h}, and calculates the softmax of the inner products. Specifically, the learning device 100 calculates the softmax of the inner products by the following formula (1).
  • Based on the softmax, the learning device 100 calculates the weighted average df_i of the N cross-modal feature amounts v_{1,h}, ..., v_{n,h}, ..., v_{N,h} associated with the N partial feature keys k_{1,h}, ..., k_{n,h}, ..., k_{N,h}.
  • The learning device 100 calculates the I weighted averages df_1, ..., df_i, ..., df_I corresponding to the I image feature amounts f_1, ..., f_i, ..., f_I as new cross-modal feature amounts.
  • The learning device 100 converts the I image feature amounts f_1, ..., f_i, ..., f_I by adding the I weighted averages df_1, ..., df_i, ..., df_I to them, respectively.
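  • Formula (1) itself is not reproduced in this text; from the description above it is presumably the standard softmax over the inner products:

$$\mathrm{softmax}(q_{i,h} \cdot k_{n,h}) = \frac{\exp(q_{i,h} \cdot k_{n,h})}{\sum_{m=1}^{N} \exp(q_{i,h} \cdot k_{m,h})} \qquad (1)$$

  • The following sketch condenses the calculation above (NumPy; the concatenation of the H heads back to the feature dimension is an assumption, as is every shape):

```python
import numpy as np

def convert_image_features(F, W_q, Key, Value):
    """Convert an (I, D) image feature sequence F; a sketch, not the patented code.

    W_q:   (H, D, Dh) transformation vectors W_q1, ..., W_qH (Dh = D // H assumed)
    Key:   (N, H, Dh) partial feature keys k_{n,h}   (array Key 621)
    Value: (N, H, Dh) cross-modal features v_{n,h}   (array Value 622)
    """
    num, D = F.shape
    H = W_q.shape[0]
    F_new = np.empty_like(F)
    for i in range(num):
        parts = []
        for h in range(H):
            q_ih = F[i] @ W_q[h]              # (8-1) partial feature query q_{i,h}
            scores = Key[:, h, :] @ q_ih      # (8-2) inner products with the N keys
            w = np.exp(scores - scores.max())
            w /= w.sum()                      # softmax of the inner products, formula (1)
            parts.append(w @ Value[:, h, :])  # (8-3) weighted average over v_{n,h}
        df_i = np.concatenate(parts)          # assumed: heads concatenated back to D
        F_new[i] = F[i] + df_i                # add the correction df_i to f_i
    return F_new
```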
  • the learning device 100 can reflect the language information in the language information data table 610.
  • For example, the learning device 100 can reflect the language information in the transformation matrix W_q, the transformation matrix W_k, and the transformation matrix W_v.
  • As a result, the learning device 100 can prepare an expression space that takes linguistic information into consideration when handling an image, can generate a useful image feature amount sequence, and can learn a useful integrated model 600.
  • the learning device 100 can efficiently perform fine tuning of the integrated model 600 after learning. After that, the learning device 100 can improve the accuracy of the solution obtained by solving the problem by using the integrated model 600 in solving the problem.
  • For example, the learning device 100 can appropriately complement the image feature amounts with the cross-modal feature amounts obtained from the linguistic information, and can generate a useful image feature amount sequence. Therefore, the learning device 100 can separate the image Transformer 602 and efficiently fine-tune the image Transformer 602 alone. After that, the learning device 100 can improve the accuracy of the solution obtained by solving the problem even when using the image Transformer 602 alone.
  • Here, an object captured in an image has a different meaning depending on its relationship with surrounding objects, but unlike a language feature amount, an image feature amount representing an object captured in an image is not quantized and is expressed as continuous values, so it tends to be difficult to give it an appropriate meaning. Specifically, it is desirable to give different meanings to image feature amounts showing various chairs in consideration of their arrangement in a room, but it is difficult to give them appropriate meanings. Likewise, it is desirable to give different meanings to image feature amounts showing various red lights in consideration of their positional relationship with a stopping vehicle, but it is difficult to give them appropriate meanings.
  • In contrast, the learning device 100 can easily give an appropriate meaning to an image feature amount by means of the cross-modal feature amounts obtained from the linguistic information, and can easily obtain a useful integrated model 600.
  • FIG. 9 is an explanatory diagram showing another example of converting the image feature quantity sequence.
  • the neural network 900 is a two-layer fully connected network, which has an input layer 901, an intermediate layer 902, and an output layer 903.
  • the dimension of the input layer 901 and the dimension of the output layer 903 are formed to be the same.
  • the dimension of the intermediate layer 902 is formed to be larger than the dimension of the input layer 901.
  • The parameters of the connection between the input layer 901 and the intermediate layer 902 form the transformation matrix W_q→k.
  • The parameters of the connection between the intermediate layer 902 and the output layer 903 form the transformation matrix W_k→v.
  • The learning device 100 generates N degrees of correspondence from the i-th image feature amount f_i by the transformation matrix W_q→k, and generates the cross-modal feature amount by the transformation matrix W_k→v based on the generated degrees of correspondence.
  • The learning device 100 can realize the same function as the data table 510 by the neural network 900 by updating the transformation matrix W_q→k and the transformation matrix W_k→v through pre-learning, as sketched below.
  • As a result, the learning device 100 can reflect the linguistic information in the neural network 900.
  • For example, the learning device 100 can reflect the language information in the transformation matrix W_q→k and the transformation matrix W_k→v.
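  • A sketch of this two-layer fully connected alternative (the dimensions and the ReLU activation are assumptions; the patent does not specify the activation function):

```python
import numpy as np

D, N = 512, 2048                     # input/output dimension and larger intermediate dimension
W_qk = np.random.randn(D, N) * 0.02  # connection: input layer 901 -> intermediate layer 902
W_kv = np.random.randn(N, D) * 0.02  # connection: intermediate layer 902 -> output layer 903

def convert_feature_mlp(f):
    """Play the role of the data table 510 for one image feature f (D,)."""
    degrees = np.maximum(0.0, f @ W_qk)  # N degrees of correspondence (ReLU assumed)
    correction = degrees @ W_kv          # cross-modal feature from the degrees
    return f + correction                # converted image feature
```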
  • As a result, the learning device 100 can prepare an expression space that takes linguistic information into consideration when handling an image, can generate a useful image feature amount sequence, and can learn a useful integrated model.
  • the learning device 100 can efficiently perform fine tuning of the integrated model after learning. After that, the learning device 100 can improve the accuracy of the solution obtained by solving the problem by using the integrated model in solving the problem.
  • FIG. 10 is an explanatory diagram showing an example in which the learning device 100 uses the integrated model 1000.
  • the learning device 100 uses, for example, the integrated model 1000 to solve the problem of video monitoring in view of the surrounding situation.
  • The integrated model 1000 includes an image Transformer 1010, a language Transformer 1020, and a cross-modal Transformer 1030.
  • the integrated model 1000 includes a linguistic information data table 610, a neural network 900, and the like.
  • the learning device 100 solves the problem of detecting whether or not a person who has not yet evacuated remains in the room when a disaster occurs. At this time, the learning device 100 acquires the image information 1001 of a surveillance camera that captures the appearance of the fitting room. The learning device 100 acquires an image feature amount sequence from the image information 1001, converts it using the language information data table 610 or the neural network 900, and then inputs it to the image Transformer 1010.
  • the learning device 100 acquires linguistic information 1002 stating that "shoes are taken off when the fitting room is used" and "the curtain is closed when the fitting room is used".
  • the learning device 100 acquires a language feature sequence from the language information 1002 and inputs it to the language Transformer 1020. Then, the learning device 100 inputs the output value of the image Transformer 1010 and the output value of the language Transformer 1020 into the cross-modal Transformer 1030.
  • the learning device 100 determines whether or not a person who has not yet evacuated remains in the fitting room based on the output value of the cross-modal Transformer 1030. As a result, the learning device 100 can solve the problem by appropriately matching the situation presented by the linguistic information with the situation presented by the image information, and can improve the accuracy of the obtained solution. For example, even when the curtain of the fitting room is closed and a person is not directly photographed, so that the image information alone would suggest that no person is present, the learning device 100 can correctly determine that a person is in the fitting room by appropriately matching the situation presented by the linguistic information, "a person is using the fitting room", with the arrangement of the objects in the image.
  • the learning process is realized, for example, by the CPU 301 shown in FIG. 3, a storage area such as a memory 302 or a recording medium 305, and a network I / F 303.
  • FIG. 11 is a flowchart showing an example of the learning processing procedure.
  • the learning device 100 initializes the searched key values and the feature correction values in the table with random values (step S1101).
  • the learning device 100 generates a set of language feature values and a set of image feature values from a mutually related document and image (step S1102). Then, the learning device 100 masks image feature values by replacing a constant ratio of the image feature values in the set with irrelevant values (step S1103).
  • the learning device 100 calculates the degree of relationship between each image feature value in the set of image feature values and the searched key values in the table (step S1104). Then, the learning device 100 corrects each image feature value based on the feature correction values in the table according to the degree of relationship (step S1105).
  • the learning device 100 restores the masked image feature values and acquires predicted values, using the language processing model, the image processing model, and the cross-modal processing model, based on the set of language feature values and the set of corrected image feature values (step S1106).
  • the learning device 100 updates the parameter values, including the searched key values and the feature correction values in the table, in the direction that reduces the loss value of the predicted values (step S1107).
  • the learning device 100 determines whether or not the end condition is satisfied (step S1108).
  • the end condition is, for example, that the loop of steps S1102 to S1108 is repeated a certain number of times or more.
  • the end condition is, for example, that the difference between the previous loss value and the current loss value is less than a certain value.
  • if the end condition is not satisfied (step S1108: No), the learning device 100 returns to the process of step S1102. On the other hand, when the end condition is satisfied (step S1108: Yes), the learning device 100 ends the learning process.
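  • a hedged Python sketch of this learning loop (steps S1101 to S1108), assuming PyTorch; the linear placeholder models, the concatenation used to combine the two output values, the mean-squared-error loss, and all sizes are assumptions standing in for the Transformer-based processing models described above:

```python
import torch
import torch.nn as nn

D, T, MASK_RATIO, STEPS = 64, 128, 0.15, 100  # assumed sizes, ratio, loop count

# Placeholder processing models; the embodiment uses Transformer-based models.
image_model = nn.Linear(D, D)           # image processing model
language_model = nn.Linear(D, D)        # language processing model
crossmodal_model = nn.Linear(2 * D, D)  # cross-modal processing model

# S1101: initialize the searched key values and feature correction values
# in the table with random values.
keys = nn.Parameter(torch.randn(T, D))
values = nn.Parameter(torch.randn(T, D))
params = [keys, values, *image_model.parameters(),
          *language_model.parameters(), *crossmodal_model.parameters()]
optimizer = torch.optim.Adam(params, lr=1e-4)

for step in range(STEPS):  # S1108: end after a fixed number of iterations
    # S1102: sets of language/image feature values from a mutually related
    # document and image (random stand-ins here).
    lang_feats = torch.randn(16, D)
    img_feats = torch.randn(16, D)

    # S1103: mask image feature values by replacing a constant ratio of them
    # with irrelevant values.
    mask = torch.rand(img_feats.size(0)) < MASK_RATIO
    if not mask.any():
        continue
    masked = img_feats.clone()
    masked[mask] = torch.randn(int(mask.sum()), D)

    # S1104-S1105: degree of relationship between each image feature value
    # and the searched key values, then correction with the correction values.
    relation = torch.softmax(masked @ keys.T, dim=-1)
    corrected = masked + relation @ values

    # S1106: restore the masked image feature values and acquire predictions.
    fused = torch.cat([image_model(corrected), language_model(lang_feats)], dim=-1)
    predicted = crossmodal_model(fused)

    # S1107: update the parameter values, including the keys and the
    # correction values, in the direction that reduces the loss.
    loss = nn.functional.mse_loss(predicted[mask], img_feats[mask])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```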
  • the information processing apparatus can obtain a useful model including parameter values that reflect the characteristics of the linguistic information.
  • the feature amount can be extracted from the first modal information.
  • a new feature amount can be acquired by converting the extracted feature amount based on the parameter.
  • the first output value can be acquired by inputting the acquired new feature quantity into the first processing model related to the first modal.
  • a second output value can be acquired by inputting other feature amounts, extracted from information of a second modal different from the first modal, into the second processing model relating to the second modal.
  • a third output value can be acquired by inputting the acquired first output value and second output value into the third processing model relating to the first modal and the second modal.
  • the parameters can be updated based on the acquired third output value. This allows the information processing device to obtain a useful model that includes parameters that reflect the characteristics of the second modal information.
  • the first feature amount and the second feature amount can be adopted as parameters.
  • the degree of coincidence between the extracted feature amount and the first feature amount is calculated, the second feature amount is weighted based on the calculated degree of coincidence, and a new feature amount can be obtained by converting the extracted feature amount based on the index value thus obtained.
  • the information processing apparatus can reflect the features of the information of the second modal in the first feature amount and the second feature amount, and can realize the conversion of the feature amount extracted from the information of the first modal.
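  • expressed as formulas (a hedged reading of the two items above; the dot-product form of the degree of coincidence and the softmax normalization are assumptions), with the first feature amounts $k_j$ acting as keys and the second feature amounts $v_j$ as values:

$$a_j = \mathrm{softmax}_j\left(f \cdot k_j\right), \qquad f' = f + \sum_j a_j\, v_j$$

  here, $f$ is the extracted feature amount, $a_j$ is the degree of coincidence with the $j$-th first feature amount, $\sum_j a_j v_j$ is the index value, and $f'$ is the new feature amount.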
  • it is possible to adopt, as the parameters, the parameters of a neural network in which the number of nodes in the intermediate layer is larger than the number of nodes in the input layer and the number of nodes in the output layer.
  • a new feature amount can be acquired by inputting the extracted feature amount into the neural network.
  • the information processing apparatus can reflect the characteristics of the information of the second modal in the neural network and realize the conversion of the feature amount extracted from the information of the first modal.
  • the first processing model can be updated based on the third output value, and a predetermined problem can be solved according to the input of information of the first modal by using the updated parameters and the updated first processing model.
  • the information processing apparatus can obtain a useful first processing model, and can improve the accuracy of the solution obtained by solving a predetermined problem using the first processing model.
  • the first processing model, the second processing model, and the third processing model can be updated based on the third output value.
  • a predetermined problem can be solved by using the updated parameters, the updated first processing model, the updated second processing model, and the updated third processing model.
  • the information processing apparatus can obtain a useful first processing model, second processing model, and third processing model, and can improve the accuracy of the solution obtained by solving a predetermined problem.
  • a modal related to an image can be adopted as the first modal.
  • a modal related to language can be adopted as the second modal.
  • the information processing apparatus can obtain a model useful for solving the problem based on the image information and the linguistic information.
  • a modal related to an image can be adopted as the first modal.
  • a modal related to voice can be adopted as the second modal.
  • the information processing apparatus can obtain a model useful for solving the problem based on the image information and the audio information.
  • a modal related to the first language can be adopted as the first modal.
  • a modal related to the second language can be adopted as the second modal. This allows the information processing device to obtain a useful model for solving a problem based on linguistic information in different languages.
  • the parameters can be updated by the error back propagation method based on the third output value.
  • the information processing apparatus can update the parameters with high accuracy.
  • the learning method described in this embodiment can be realized by executing a program prepared in advance on a computer such as a PC or a workstation.
  • the learning program described in this embodiment is executed by being recorded on a computer-readable recording medium and being read from the recording medium by the computer.
  • the recording medium is a hard disk, a flexible disk, a CD-ROM (Compact Disc-Read Only Memory), an MO, a DVD (Digital Versatile Disc), or the like.
  • the learning program described in this embodiment may be distributed via a network such as the Internet.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

This learning device (100) extracts a feature quantity from first modal information using an extraction model (111). The learning device (100) converts the extracted feature quantity using a conversion model (112) on the basis of a plurality of parameters and thereby acquires a new feature quantity. The learning device (100) inputs the acquired new feature quantity to a first processing model (121) and thereby acquires a first output value. The learning device (100) inputs other feature quantities extracted from second modal information to a second processing model (122) and thereby acquires a second output value. The learning device (100) inputs the acquired first and second output values to a third processing model (123) and thereby acquires a third output value. The learning device (100) updates the plurality of parameters on the basis of the acquired third output value.

Description

Learning method, learning program, and learning device

 The present invention relates to a learning method, a learning program, and a learning device.

 Conventionally, there are technologies for solving problems such as document translation, question answering, object detection, and situation judgment using information of a predetermined modal. For example, there is a model called BERT (Bidirectional Encoder Representations from Transformers) for solving problems using modal information about documents. There is also a model that extends BERT to solve problems using modal information about images as well. Here, a modal is a concept indicating the style and type of information; specific examples include images, documents (text), and sounds. Machine learning using a plurality of modals is called multimodal learning. Among multimodal learning, learning that captures the co-occurrence relationships between a plurality of modals is sometimes called cross-modal learning.

 As prior art, there is, for example, a technique that takes gesture features as input and generates a model for classifying whether or not an input gesture feature corresponds to a word. There is also, for example, a technique for determining the similarity between a specified word and image using a word co-occurrence vector whose elements are the appearance frequencies of co-occurring words appearing near the word and an image co-occurrence vector whose elements are the appearance frequencies of co-occurring words appearing near the image. Further, there is, for example, a technique of recognizing sign language elements, which are the individual components of hand movements, from time-series data of the hand movements, and performing sign language word recognition from the recognized sign language elements.

Japanese Unexamined Patent Publication No. 2018-163400
Japanese Unexamined Patent Publication No. 2002-132823
Japanese Unexamined Patent Publication No. H9-34863

 However, with the conventional techniques, it is difficult to obtain a useful model for extracting a feature amount from modal information when solving a problem using the modal information. For example, when handling modal information about images, a model may not have an expression space that expresses features based on the relationship between the modal information about images and modal information about language, and may therefore not be useful in solving problems.

 In one aspect, an object of the present invention is to learn a useful model for extracting a feature amount from modal information.

 According to one embodiment, a learning method, a learning program, and a learning device are proposed that extract a feature amount from information of a first modal; acquire a new feature amount by converting the extracted feature amount based on parameters; acquire a first output value by inputting the acquired new feature amount into a first processing model relating to the first modal; acquire a second output value by inputting another feature amount, extracted from information of a second modal different from the first modal, into a second processing model relating to the second modal; acquire a third output value by inputting the acquired first output value and second output value into a third processing model relating to the first modal and the second modal; and update the parameters based on the acquired third output value.

 According to one aspect, it becomes possible to learn a useful model for extracting a feature amount from modal information.

FIG. 1 is an explanatory diagram showing an embodiment of a learning method according to an embodiment.
FIG. 2 is an explanatory diagram showing an example of the information processing system 200.
FIG. 3 is a block diagram showing a hardware configuration example of the learning device 100.
FIG. 4 is a block diagram showing a functional configuration example of the learning device 100.
FIG. 5 is a block diagram showing a specific functional configuration example of the learning device 100.
FIG. 6 is an explanatory diagram showing an example of learning the integrated model 600.
FIG. 7 is an explanatory diagram showing an example of converting the image feature quantity sequence F.
FIG. 8 is an explanatory diagram showing a specific example of the calculation for converting the image feature quantity sequence F.
FIG. 9 is an explanatory diagram showing another example of converting the image feature quantity sequence.
FIG. 10 is an explanatory diagram showing an example in which the learning device 100 uses the integrated model 1000.
FIG. 11 is a flowchart showing an example of the learning processing procedure.

 Hereinafter, embodiments of the learning method, the learning program, and the learning device according to the present invention will be described in detail with reference to the drawings.

(An example of a learning method according to an embodiment)
 FIG. 1 is an explanatory diagram showing an embodiment of a learning method according to an embodiment. The learning device 100 is a computer for learning a useful model that extracts a feature amount from information of a predetermined modal and that can be used when solving a problem using the information of the predetermined modal.

 Conventionally, there is, for example, a pre-trained model (pre-train model) called BERT for solving problems. Specifically, BERT is formed by stacking the Encoder portions of the Transformer. For BERT, for example, the following Non-Patent Document 1 can be referred to.

 Non-Patent Document 1: Devlin, Jacob et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL-HLT (2019).

 Here, BERT is assumed to be applied to situations where a problem is solved using modal information about language, and cannot be applied to situations where a problem is solved using information of a plurality of modals.

 On the other hand, there is, for example, an extended model called VideoBERT, which extends BERT so that it can be applied to situations where problems are solved using modal information about images in addition to modal information about language. There is also an extended model called CBT (Contrastive Bidirectional Transformer for Temporal Representation Learning), which improves on the performance of VideoBERT.

 Specifically, CBT is formed of a language processing model that learns the co-occurrence relationships of language features, an image processing model that learns the co-occurrence relationships of image features, and a cross-modal processing model that integrates the outputs of the language processing model and the image processing model and learns the co-occurrence relationship between language and images. For CBT, for example, the following Non-Patent Document 2 can be referred to.

 Non-Patent Document 2: Sun, Chen, et al. "Contrastive Bidirectional Transformer for Temporal Representation Learning." arXiv preprint arXiv:1906.05743 (2019).

 However, the various models described above may not be useful models for solving problems. For example, they may not be useful models for extracting a feature amount from modal information when solving a problem using the modal information. For example, when handling modal information about images, CBT does not have an expression space that expresses features based on the relationship between the modal information about images and modal information about language, and may not be a useful model for solving problems.

 Specifically, an image feature amount is an expression space that reflects modal information about an image and, by its nature, does not have an expression space that reflects modal information about language. Therefore, even if CBT is pre-trained, the image processing model included in CBT does not become a model that can effectively utilize modal information about language, and does not become a useful model for solving problems. Furthermore, an image feature amount has the property that, even when the same object is captured, images captured in different ways can yield different feature amounts. For this reason, when reflecting the features of modal information about language in image feature amounts, not one image feature amount but the representations of various image feature amounts would have to be updated, which makes effective updating difficult or may adversely affect solving the problem.

 Therefore, in the present embodiment, a learning method will be described that can learn a useful model for extracting a feature amount from modal information by setting parameters to be used when converting the modal information and providing an expression space based on the parameters.

 In FIG. 1, the learning device 100 has, for example, a model 101. The model 101 has an extraction model 111, a conversion model 112, a first processing model 121, a second processing model 122, and a third processing model 123. The extraction model 111, the conversion model 112, and the first processing model 121 relate to the first modal. The second processing model 122 relates to the second modal. The third processing model 123 relates to the first modal and the second modal.

 The learning device 100 acquires information of the first modal and information of the second modal. A modal means a form of information. The first modal and the second modal are different modals. The first modal is, for example, a modal related to images. When the first modal relates to images, the information of the first modal is, for example, an image. The second modal is, for example, a modal related to language. When the second modal relates to language, the information of the second modal is, for example, a document.

 (1-1) The learning device 100 uses the extraction model 111 to extract a feature amount from the information of the first modal. The learning device 100 extracts, for example, an image feature amount from an image. The image feature amount is represented by, for example, a vector indicating an array.

 (1-2) The learning device 100 acquires a new feature amount by using the conversion model 112 to convert the extracted feature amount based on the parameters. There are, for example, a plurality of parameters. The learning device 100, for example, calculates a correction amount for correcting the extracted image feature amount based on the extracted image feature amount and the plurality of parameters, and acquires a new image feature amount by adding the correction amount to the extracted image feature amount.

 (1-3) The learning device 100 acquires a first output value by inputting the acquired new feature amount into the first processing model 121. The first processing model 121 is, for example, an image processing model. The learning device 100 acquires the first output value by, for example, inputting the new image feature amount into the image processing model.

 (1-4) The learning device 100 acquires a second output value by inputting another feature amount, extracted from the information of the second modal, into the second processing model 122. The second processing model 122 is, for example, a language processing model. The learning device 100 acquires the second output value by, for example, inputting a language feature amount extracted from a document into the language processing model. The language feature amount is represented by, for example, a vector indicating an array.

 (1-5) The learning device 100 acquires a third output value by inputting the acquired first output value and second output value into the third processing model 123. The third processing model 123 is, for example, a cross-modal processing model. The cross-modal processing model integrates the information of a plurality of modals and learns the co-occurrence of the information of the plurality of modals. The learning device 100 acquires the third output value by, for example, inputting the first output value and the second output value into the cross-modal processing model.

 (1-6) The learning device 100 updates the plurality of parameters based on the acquired third output value. The learning device 100 updates the plurality of parameters by, for example, the error backpropagation method based on the third output value. The learning device 100 may output the updated plurality of parameters. As a result, the learning device 100 can obtain a plurality of parameters useful from the viewpoint of handling the information of the first modal and the information of the second modal in solving a problem.

 In the learning device 100, for example, a plurality of parameters capable of reflecting the features of the information of the second modal are explicitly prepared, so that the features of the information of the second modal can be effectively utilized when acquiring a new feature amount from the information of the first modal. In addition, the learning device 100 does not directly reflect the features of the information of the second modal in the feature amount itself extracted from the information of the first modal, and can therefore reduce adverse effects when solving a problem.

 Here, the case where the learning device 100 updates the plurality of parameters based on the third output value has been described, but the present invention is not limited to this. For example, the learning device 100 may further update the first processing model 121 based on the third output value. As a result, the learning device 100 can obtain a first processing model 121 that is useful from the viewpoint of handling the information of the first modal in solving a problem. Then, the learning device 100 can use the model 101 when solving a problem using the information of the first modal and the information of the second modal, and can improve the accuracy of the obtained solution.

 Here, the learning device 100 may separate the extraction model 111, the conversion model 112, and the first processing model 121 from the model 101. In this way, the learning device 100 can obtain a useful combined model of the extraction model 111, the conversion model 112, and the first processing model 121. Then, the learning device 100 may use the separated extraction model 111, conversion model 112, and first processing model 121 when solving a problem using the information of the first modal, to improve the accuracy of the obtained solution.

 Further, for example, the learning device 100 may further update the second processing model 122 and the third processing model 123. As a result, the learning device 100 can obtain a second processing model 122 that is useful from the viewpoint of handling the information of the second modal in solving a problem, and a third processing model 123 that is useful from the viewpoint of integrating the information of the first modal and the information of the second modal. In this way, the learning device 100 can obtain a useful model 101 that combines the extraction model 111, the conversion model 112, the first processing model 121, the second processing model 122, and the third processing model 123. Then, the learning device 100 can use the model 101 when solving a problem using the information of the first modal and the information of the second modal, and can improve the accuracy of the obtained solution.

 Further, for example, the learning device 100 may separate the conversion model 112 from the model 101. In this way, the learning device 100 can obtain a useful conversion model 112. Then, the learning device 100 may use the separated conversion model 112 when solving a problem using the information of the first modal, to improve the accuracy of the obtained solution.

 Further, for example, the learning device 100 may separate the extraction model 111, the conversion model 112, the first processing model 121, and the second processing model 122 from the model 101. In this way, the learning device 100 can obtain a useful combined model of these four models. Then, the learning device 100 may use the separated models when solving a problem using the information of the first modal and the information of the second modal, to improve the accuracy of the obtained solution.

 As described above, the learning device 100 can obtain a useful model. The useful model is, for example, any one of the updated extraction model 111, conversion model 112, first processing model 121, second processing model 122, and third processing model 123, or a combination of two or more of them.
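 As a structural sketch only (assuming PyTorch-style modules; the composition and interfaces below are assumptions, not the embodiment itself), the separability described above can be preserved by holding each model as an independent submodule of model 101:

```python
import torch.nn as nn

class Model101(nn.Module):
    """A hypothetical composition of model 101.

    The submodules are kept separate so that, after learning, any subset
    (for example, extraction model 111 + conversion model 112 + first
    processing model 121) can be detached and used or fine-tuned alone.
    """
    def __init__(self, extract, convert, first, second, third):
        super().__init__()
        self.extract = extract  # extraction model 111 (first modal)
        self.convert = convert  # conversion model 112 (holds the parameters)
        self.first = first      # first processing model 121
        self.second = second    # second processing model 122
        self.third = third      # third processing model 123

    def forward(self, modal1_info, modal2_features):
        f = self.extract(modal1_info)        # (1-1) extract feature amount
        f_new = self.convert(f)              # (1-2) convert based on parameters
        out1 = self.first(f_new)             # (1-3) first output value
        out2 = self.second(modal2_features)  # (1-4) second output value
        return self.third(out1, out2)        # (1-5) third output value
```

 Because each submodule is independent, detaching, for example, extract, convert, and first after learning yields the combined model discussed above.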

(Example of information processing system 200)
 Next, an example of the information processing system 200 to which the learning device 100 shown in FIG. 1 is applied will be described with reference to FIG. 2.

 FIG. 2 is an explanatory diagram showing an example of the information processing system 200. In FIG. 2, the information processing system 200 includes the learning device 100, a client device 201, and a terminal device 202.

 In the information processing system 200, the learning device 100 and the client device 201 are connected via a wired or wireless network 210. The network 210 is, for example, a LAN (Local Area Network), a WAN (Wide Area Network), the Internet, or the like. Similarly, the learning device 100 and the terminal device 202 are connected via the wired or wireless network 210.

 The learning device 100 stores an integrated model that accepts input of the information of the first modal and the information of the second modal. The stored integrated model corresponds to, for example, the model 101 shown in FIG. 1. The learning device 100 updates the integrated model based on teacher data.

 The teacher data is, for example, correspondence information in which sample information of the first modal, sample information of the second modal, and correct answer data are associated with each other. The teacher data is input to the learning device 100 by, for example, the user of the learning device 100. The correct answer data indicates, for example, the correct answer for the output value of the integrated model. The correct answer data may also indicate the correct answer for the solution obtained by solving a problem based on the output value of the integrated model. When the first modal relates to images, the information of the first modal is an image. When the second modal relates to language, the information of the second modal is a document. The integrated model is updated by, for example, the error backpropagation method, or by a learning method other than error backpropagation.
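 A minimal illustration of one such teacher-data record (a sketch; the field names, the file name, and the label encoding are hypothetical):

```python
# One teacher-data record: sample first-modal information (an image),
# sample second-modal information (a document), and correct answer data.
teacher_record = {
    "first_modal_image": "fitting_room_0001.png",  # hypothetical file name
    "second_modal_document": "The curtain of the fitting room tends to be "
                             "closed while a person is using it.",
    "correct_answer": 1,  # hypothetical encoding of the correct output
}
```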

 Further, the learning device 100 acquires the information of the first modal and the information of the second modal when solving a problem. The learning device 100 acquires, for example, the information of the first modal input to the learning device 100 by the user of the learning device 100, or may acquire it by receiving it from the client device 201 or the terminal device 202. Likewise, the learning device 100 acquires the information of the second modal input by the user, or may acquire it by receiving it from the client device 201 or the terminal device 202.

 Then, the learning device 100 uses the updated integrated model to solve the problem based on the acquired information of the first modal and information of the second modal, and transmits the obtained solution to the client device 201. The learning device 100 may further fine-tune the updated integrated model before using it to solve the problem. The learning device 100 is, for example, a server, a PC (Personal Computer), or the like.

 The client device 201 is a computer capable of communicating with the learning device 100. The client device 201 may, for example, transmit the information of the first modal or the information of the second modal to the learning device 100. The client device 201 receives and outputs the solution obtained by the learning device 100 solving the problem. The output format is, for example, display on a display, print output to a printer, transmission to another computer, or storage in a storage area. The client device 201 is, for example, a PC, a tablet terminal, a smartphone, or the like.

 The terminal device 202 is a computer capable of communicating with the learning device 100. The terminal device 202 may, for example, transmit the information of the first modal or the information of the second modal to the learning device 100. The terminal device 202 is, for example, a PC, a tablet terminal, a smartphone, an electronic device, an IoT (Internet of Things) device, a sensor device, or the like. Specifically, the terminal device 202 may be a surveillance camera.

 Here, the case where the learning device 100 updates the integrated model and solves the problem using the integrated model has been described, but the present invention is not limited to this. For example, another computer may update the integrated model, and the learning device 100 may solve the problem using the integrated model received from the other computer. Further, for example, the learning device 100 may update the integrated model and provide it to another computer, and the other computer may solve the problem using the integrated model.

 Here, the case where the learning device 100 is a device different from the client device 201 and the terminal device 202 has been described, but the present invention is not limited to this. For example, the learning device 100 may be integrated with the client device 201, or may be integrated with the terminal device 202.

 Here, the case where the learning device 100 realizes the integrated model in software has been described, but the present invention is not limited to this. For example, the learning device 100 may realize the integrated model as an electronic circuit.

(Application example of information processing system 200)
 For example, the terminal device 202 is a surveillance camera and transmits an image of a target to the learning device 100. The target is, specifically, the appearance of a fitting room. The learning device 100 also stores a document that serves as an explanatory text about the target. Specifically, the document describes that the curtain of the fitting room tends to be closed while a person is using the fitting room, and that shoes tend to be placed in front of the fitting room while a person is using it. Then, the learning device 100 uses the model to solve the problem of determining a degree of risk based on the image and the document. The degree of risk is, for example, an index value indicating how likely it is that a person whose evacuation is incomplete remains in the fitting room. The degree of risk may also be, for example, a binary value indicating whether or not a person whose evacuation is incomplete remains in the fitting room.

(Example of hardware configuration of learning device 100)
 Next, a hardware configuration example of the learning device 100 will be described with reference to FIG. 3.

 FIG. 3 is a block diagram showing a hardware configuration example of the learning device 100. In FIG. 3, the learning device 100 includes a CPU (Central Processing Unit) 301, a memory 302, a network I/F (Interface) 303, a recording medium I/F 304, and a recording medium 305. The components are connected to one another by a bus 300.

 Here, the CPU 301 controls the entire learning device 100. The memory 302 includes, for example, a ROM (Read Only Memory), a RAM (Random Access Memory), a flash ROM, and the like. Specifically, for example, the flash ROM or the ROM stores various programs, and the RAM is used as a work area of the CPU 301. A program stored in the memory 302 is loaded into the CPU 301 to cause the CPU 301 to execute the coded processing.

 The network I/F 303 is connected to the network 210 through a communication line and is connected to other computers via the network 210. The network I/F 303 serves as an internal interface with the network 210 and controls the input and output of data to and from other computers. The network I/F 303 is, for example, a modem or a LAN adapter.

 The recording medium I/F 304 controls reading and writing of data to and from the recording medium 305 under the control of the CPU 301. The recording medium I/F 304 is, for example, a disk drive, an SSD (Solid State Drive), a USB (Universal Serial Bus) port, or the like. The recording medium 305 is a non-volatile memory that stores data written under the control of the recording medium I/F 304. The recording medium 305 is, for example, a disk, a semiconductor memory, a USB memory, or the like. The recording medium 305 may be detachable from the learning device 100.

 The learning device 100 may include, for example, a keyboard, a mouse, a display, a printer, a scanner, a microphone, a speaker, and the like, in addition to the components described above. The learning device 100 may also include a plurality of recording medium I/Fs 304 and recording media 305, or may include no recording medium I/F 304 or recording medium 305.

(Example of hardware configuration of client device 201)
 The hardware configuration example of the client device 201 is the same as the hardware configuration example of the learning device 100 shown in FIG. 3, and a description thereof will therefore be omitted.

(Example of hardware configuration of terminal device 202)
 The hardware configuration example of the terminal device 202 is the same as the hardware configuration example of the learning device 100 shown in FIG. 3, and a description thereof will therefore be omitted.

(Example of functional configuration of learning device 100)
 Next, an example of the functional configuration of the learning device 100 will be described with reference to FIG. 4.

 FIG. 4 is a block diagram showing a functional configuration example of the learning device 100. The learning device 100 includes a storage unit 400, an acquisition unit 401, a first extraction unit 402, a conversion unit 403, a first processing unit 404, a second extraction unit 405, a second processing unit 406, a third processing unit 407, an update unit 408, a utilization unit 409, and an output unit 410.

 The storage unit 400 is realized by, for example, a storage area such as the memory 302 or the recording medium 305 shown in FIG. 3. Hereinafter, the case where the storage unit 400 is included in the learning device 100 will be described, but the present invention is not limited to this. For example, the storage unit 400 may be included in a device different from the learning device 100, and the stored contents of the storage unit 400 may be referable from the learning device 100.

 The acquisition unit 401 to the output unit 410 function as an example of a control unit. Specifically, the acquisition unit 401 to the output unit 410 realize their functions by, for example, causing the CPU 301 to execute a program stored in a storage area such as the memory 302 or the recording medium 305 shown in FIG. 3, or by the network I/F 303. The processing result of each functional unit is stored in, for example, a storage area such as the memory 302 or the recording medium 305 shown in FIG. 3.

 The storage unit 400 stores various information referred to or updated in the processing of each functional unit. The storage unit 400 stores an integrated model that accepts input of the information of the first modal and the information of the second modal. The stored integrated model has, for example, a first extraction model, a conversion model, a first processing model, a second extraction model, a second processing model, and a third processing model. The first extraction model, the conversion model, and the first processing model relate to the first modal. The second extraction model and the second processing model relate to the second modal. The third processing model relates to the first modal and the second modal.

 The second modal is different from the first modal. For example, the first modal is a modal related to images, and the second modal is a modal related to language. Alternatively, the first modal is a modal related to images, and the second modal is a modal related to voice. Alternatively, the first modal is a modal related to a first language, and the second modal is a modal related to a second language.

 The acquisition unit 401 acquires various information used for the processing of each functional unit. The acquisition unit 401 stores the acquired information in the storage unit 400 or outputs it to each functional unit. The acquisition unit 401 may also output information stored in the storage unit 400 to each functional unit. The acquisition unit 401 acquires various information based on, for example, a user's operation input, or may receive various information from a device different from the learning device 100.

 The acquisition unit 401 acquires the information of the first modal and the information of the second modal. The acquisition unit 401 acquires them by, for example, accepting input of the information of the first modal and the information of the second modal by the user, or by receiving them from the client device 201 or the terminal device 202. The acquisition unit 401 may also acquire the information of the first modal and the information of the second modal by acquiring teacher data containing them.

 The acquisition unit 401 may accept a start trigger for starting the processing of any of the functional units. The start trigger is, for example, a predetermined operation input by the user, reception of predetermined information from another computer, or output of predetermined information by any of the functional units. The acquisition unit 401 accepts, for example, the acquisition of the information of the first modal and the information of the second modal as a start trigger for starting the processing of each functional unit.

 The first extraction unit 402 extracts a feature amount from the information of the first modal. The first extraction unit 402 extracts, for example, an image feature amount from an image. The extracted image feature amount is, for example, an image feature amount indicating an object appearing in the image. As a result, the first extraction unit 402 can change the information of the first modal into a format that can be input to the conversion model, and can extract information useful for solving a problem from the information of the first modal.

 The conversion unit 403 acquires a new feature amount by using the conversion model to convert the extracted feature amount based on the plurality of parameters. Here, the plurality of parameters include, for example, a first feature amount and a second feature amount. The conversion unit 403, for example, calculates the degree of coincidence between the extracted feature amount and the first feature amount, weights the second feature amount based on the calculated degree of coincidence, and acquires a new feature amount by converting the extracted feature amount based on the index value thus obtained.

 An example of acquiring a new feature amount when the plurality of parameters are the first feature amount and the second feature amount will be specifically described later with reference to FIGS. 5 to 8. In this way, the conversion unit 403 prepares a plurality of parameters capable of reflecting the features of the information of the second modal, and, via the plurality of parameters, makes it possible to reflect the features of the information of the second modal in the feature amount extracted from the information of the first modal.

The plurality of parameters may also be the parameters of a neural network whose intermediate layer has more nodes than its input layer and its output layer. The neural network corresponds to the conversion model. The conversion unit 403 acquires the new feature amount by, for example, inputting the extracted feature amount into the neural network.

A concrete example of acquiring a new feature amount when the plurality of parameters are neural network parameters is described later with reference to FIG. 9; the shape constraint is sketched below. Here too, the conversion unit 403 prepares a plurality of parameters capable of holding the characteristics of the second modal information and makes it possible to reflect those characteristics in the feature amount extracted from the first modal information.
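A minimal sketch of such a network, assuming hypothetical sizes; the text only constrains the layer widths, so the choice of activation is an assumption:

```python
import torch.nn as nn

d, m = 64, 256  # hypothetical: intermediate layer (m) wider than input/output (d)

conversion_net = nn.Sequential(
    nn.Linear(d, m),
    nn.ReLU(),        # activation is an assumption; the text fixes only layer sizes
    nn.Linear(m, d),
)
# new_feature = conversion_net(extracted_feature)  # maps (..., d) -> (..., d)
```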

The first processing unit 404 acquires a first output value by inputting the acquired new feature amount into the first processing model. The first processing model is, for example, an image processing model; in that case, the first processing unit 404 acquires the first output value by inputting the new image feature amount into the image processing model.

An example of acquiring the first output value is described later with reference to FIGS. 5 to 8. This allows the first processing unit 404 to convert the acquired new feature amount into a format that can be input to the third processing model, and to extract from it information that is useful for solving the problem.

The second extraction unit 405 extracts another feature amount from the second modal information. For example, the second extraction unit 405 extracts a language feature amount from a document, such as a feature amount representing the words contained in the document. This allows the second extraction unit 405 to convert the second modal information into a format that can be input to the second processing model, and to extract from it information that is useful for solving the problem.

The second processing unit 406 acquires a second output value by inputting the extracted other feature amount into the second processing model. The second processing model is, for example, a language processing model; in that case, the second processing unit 406 acquires the second output value by inputting the language feature amount into the language processing model.

An example of acquiring the second output value is described later with reference to FIGS. 5 to 8. This allows the second processing unit 406 to convert the extracted other feature amount into a format that can be input to the third processing model, and to extract from it information that is useful for solving the problem.

The third processing unit 407 acquires a third output value by inputting the acquired first and second output values into the third processing model. The third processing model is, for example, a cross-modal processing model; in that case, the third processing unit 407 acquires the third output value by inputting the first and second output values into the cross-modal processing model. An example of acquiring the third output value is described later with reference to FIGS. 5 to 8, and a sketch follows below. This allows the third processing unit 407 to obtain a third output value that integrates the features of the language and the image.
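A minimal sketch of feeding the two output sequences into a shared third model. Concatenation along the sequence axis is an assumption (the text only says the two outputs are input together), and nn.TransformerEncoder stands in for the cross-modal processing model:

```python
import torch
import torch.nn as nn

d = 64  # hypothetical shared hidden size
cross_modal_model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True),
    num_layers=2,
)

def third_output(first_out: torch.Tensor, second_out: torch.Tensor) -> torch.Tensor:
    # first_out: (B, I, d) image-side tokens; second_out: (B, L, d) language-side tokens
    joint = torch.cat([first_out, second_out], dim=1)  # (B, I + L, d)
    return cross_modal_model(joint)                    # third output value
```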

The update unit 408 updates the plurality of parameters based on the acquired third output value. For example, the update unit 408 updates them by error backpropagation: it calculates a loss value from the third output value using a loss function, and updates the plurality of parameters based on that loss value.

This allows the update unit 408 to obtain a plurality of parameters that are useful, in solving the problem, for handling the first modal information together with the second modal information. For example, the update unit 408 can optimize the explicitly prepared parameters so that the characteristics of the second modal information are effectively exploited when acquiring a new feature amount from the first modal information.

The update unit 408 also updates the first processing model based on the third output value, for example by error backpropagation: it calculates a loss value from the third output value using a loss function and updates the first processing model based on that loss value. This allows the update unit 408 to obtain a first processing model that is useful for handling the first modal information when solving the problem.

The update unit 408 likewise updates the second processing model and the third processing model based on the third output value, for example by error backpropagation: it calculates a loss value from the third output value using a loss function and updates both models based on that loss value.

This allows the update unit 408 to obtain a second processing model that is useful for handling the second modal information when solving the problem, and allows the learning device 100 to obtain a third processing model that is useful for integrating the first modal information with the second modal information. A schematic of the joint update follows below.
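Schematically, the joint update might look as follows; every component here is a hypothetical stand-in for the corresponding element described in this section, and the use of mean-squared error and Adam is an assumption:

```python
import torch
import torch.nn as nn

d = 64
table_keys = nn.Parameter(torch.randn(128, d))    # plural parameters (searched keys)
table_values = nn.Parameter(torch.randn(128, d))  # plural parameters (cross-modal values)
first_model = nn.Linear(d, d)                     # stand-in for the first processing model
second_model = nn.Linear(d, d)                    # stand-in for the second processing model
third_model = nn.Linear(2 * d, d)                 # stand-in for the third processing model

optimizer = torch.optim.Adam(
    [table_keys, table_values,
     *first_model.parameters(), *second_model.parameters(), *third_model.parameters()],
    lr=1e-4,
)

def update(third_output_value: torch.Tensor, target: torch.Tensor) -> None:
    loss = nn.functional.mse_loss(third_output_value, target)  # loss function (assumption)
    optimizer.zero_grad()
    loss.backward()   # error backpropagation through all components
    optimizer.step()  # update the parameters and the three processing models
```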

The utilization unit 409 solves a predetermined problem. For example, the utilization unit 409 solves the problem in response to the input of other first modal information, using the updated plurality of parameters and the not-yet-updated first processing model. In this way, the utilization unit 409 uses the updated parameters when solving a problem based on at least other first modal information, improving the accuracy of the obtained solution.

The utilization unit 409 may also solve the predetermined problem using the updated plurality of parameters and the updated first processing model, again in response to the input of other first modal information, which likewise improves the accuracy of the obtained solution.

The utilization unit 409 may further solve the predetermined problem based on other first modal information and other second modal information, using the updated plurality of parameters and the updated first, second, and third processing models. Using all of the updated components when solving the problem improves the accuracy of the obtained solution.

The output unit 410 outputs the processing result of any of the functional units. The output format is, for example, display on a display, print output to a printer, transmission to an external device via the network I/F 303, or storage in a storage area such as the memory 302 or the recording medium 305. This allows the output unit 410 to notify the user of the processing results of each functional unit and improves the convenience of the learning device 100.

The output unit 410 outputs, for example, the updated plurality of parameters. This makes the updated parameters, which are useful for handling the first and second modal information when solving the problem, available for reference and usable by other computers. As a result, the accuracy of solutions obtained by solving a problem with at least these parameters can be improved.

The output unit 410 may also output the first processing model. This makes the updated first processing model, which is useful for handling the first modal information when solving the problem, available for reference and usable by other computers, so that the accuracy of solutions obtained using at least the first processing model can be improved.

The output unit 410 may further output the second processing model and the third processing model, making the updated models available for reference and usable by other computers. As a result, the accuracy of solutions obtained by solving a problem based on the first and second modal information, using at least the second and third processing models, can be improved.

(Specific functional configuration example of the learning device 100)
 Next, a specific functional configuration example of the learning device 100 is described with reference to FIG. 5, for the case where the first modal relates to images and the second modal relates to language.

FIG. 5 is a block diagram showing a specific functional configuration example of the learning device 100. In FIG. 5, the learning device 100 has an integrated model 500. The integrated model 500 includes an image feature amount generation unit 501, a query generation unit 502, a table search unit 503, an image processing unit 504, a language processing unit 505, a cross-modal processing unit 506, and a loss function calculation unit 507.

The learning device 100 also has a data table 510. The data table 510 stores a searched-key sequence and a cross-modal feature amount sequence, and is a table in which language information is reflected through pre-training.

The data table 510 is, for example, a table that associates each searched key, which quantizes an image feature amount, one-to-one with a cross-modal feature amount in which language information is reflected. Based on a search query generated from an image feature amount, the data table 510 is used to extract the language information reflected in its cross-modal feature amounts, weighted according to the degree of relevance to that image feature amount.

The image feature amount generation unit 501 generates and outputs an image feature amount sequence from an input image. For example, it detects the objects appearing in the input image and generates an image feature amount sequence containing a feature amount for each detected object. The image feature amount sequence is represented, for example, by vectors, and each image feature amount represents visual feature information of an object, such as its color and shape.

The query generation unit 502 generates and outputs a search query sequence based on the input image feature amount sequence, for example by multiplying the image feature amount sequence by a transformation matrix W_q. The search query sequence is represented, for example, by vectors.

The table search unit 503 calculates and outputs an index value sequence by weighting the cross-modal feature amounts in the data table 510 based on the input search query sequence. For example, the table search unit 503 generates and outputs a new cross-modal feature amount sequence by computing a weighted average of the cross-modal feature amounts of the data table 510, with weights based on the inner products between the input search query sequence and the searched-key sequence of the data table 510.

The image processing unit 504 converts and outputs the image feature amount sequence based on the input new cross-modal feature amount sequence. For example, the image processing unit 504 adds the new cross-modal feature amount sequence to the image feature amount sequence, converts the result using the image processing model, and outputs the converted image feature amount sequence.

The language processing unit 505 generates and outputs a word embedding sequence based on the input word sequence, for example by using the language processing model. A minimal sketch of the embedding step follows below.
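A sketch of generating a word embedding sequence; the vocabulary size, embedding dimension, and token ids are hypothetical, and nn.Embedding stands in for the embedding step of the language processing model:

```python
import torch
import torch.nn as nn

vocab_size, d = 30000, 64        # hypothetical vocabulary and embedding sizes
embedding = nn.Embedding(vocab_size, d)

word_ids = torch.tensor([[12, 845, 7, 3021]])  # an input word sequence as token ids
word_embeddings = embedding(word_ids)          # word embedding sequence, (1, 4, d)
```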

The cross-modal processing unit 506 integrates the input word embedding sequence with the input image feature amount sequence and generates and outputs a new word embedding sequence and a new image feature amount sequence, for example by using the cross-modal processing model.

The loss function calculation unit 507 calculates a loss value using a loss function, based on the input word embedding sequence and the input image feature amount sequence. The learning device 100 then pre-trains the integrated model 500 based on the loss value, thereby updating the data table 510, the image processing model, the language processing model, and the cross-modal processing model.

This allows the learning device 100 to reflect language information in the data table 510, specifically in its cross-modal feature amounts: the loss value computed from the word sequence is propagated into the cross-modal feature amounts of the data table 510.

In this way, the learning device 100 explicitly prepares the data table 510 in which language information is reflected, and when reflecting that information, updates the cross-modal feature amounts effectively on the basis of the quantization by the searched keys. As a result, the learning device 100 can provide a representation space that takes language information into account when handling images, enabling the image processing unit 504 to generate a useful image feature amount sequence.

(An example in which the learning device 100 learns the integrated model 600)
 Next, an example in which the learning device 100 learns the integrated model 600 is described with reference to FIG. 6.

FIG. 6 is an explanatory diagram showing an example of learning the integrated model 600. In FIG. 6, the learning device 100 has an integrated model 600 that instantiates the integrated model 500. The integrated model 600 includes an Add unit 601, an image Transformer 602, a language Transformer 603, and a cross-modal Transformer 604.

The learning device 100 also has a language information data table 610 that instantiates the data table 510. The language information data table 610 includes an array Query 611 that holds the search query sequence, an array Key 621 that holds the searched-key sequence, and an array Value 622 that holds the cross-modal feature amount sequence.

The learning device 100 pre-trains the integrated model 600 by masking part of the input and having the model predict the masked part. For example, the learning device 100 receives as input an image feature amount sequence F = f_1, f_2, ..., f_i, ..., f_I and a language embedding sequence E = e_1, e_2, ..., e_j, ..., e_L. It then masks the image feature amount f_i in F and the language embedding e_j in E.

Next, the learning device 100 converts the masked image feature amount sequence F with the Add unit 601 and outputs the converted sequence F'. The image Transformer 602 further converts F' and outputs the converted sequence F''. The language Transformer 603 converts the masked language embedding sequence E and outputs the converted sequence E'. The cross-modal Transformer 604 then integrates F'' with E' to generate a new image feature amount sequence F^ and a new language embedding sequence E^.

The learning device 100 predicts the masked parts based on the image feature amount f_i^ in F^ and the language embedding e_j^ in E^ that correspond to the masked positions, and pre-trains the integrated model 600 based on the prediction results; a schematic of this mask-and-predict step follows below. The description now turns to FIG. 7 and an example in which the learning device 100 converts the image feature amount sequence F with the Add unit 601.
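A schematic of the mask-and-predict pre-training on the image side. The masking ratio, zero replacement, and mean-squared reconstruction loss are assumptions where the text does not pin them down, and integrated_model is a hypothetical callable standing for the chain Add unit 601 → Transformers 602/603 → cross-modal Transformer 604:

```python
import torch
import torch.nn.functional as F

def pretrain_step(integrated_model, feats: torch.Tensor, mask_ratio: float = 0.15):
    """feats: (B, I, d) image feature amount sequence F."""
    mask = torch.rand(feats.shape[:2]) < mask_ratio  # positions to mask
    corrupted = feats.clone()
    corrupted[mask] = 0.0                            # replacement value is an assumption

    predicted = integrated_model(corrupted)          # F^ : (B, I, d)
    # train only on the masked positions, i.e. restore the masked features
    return F.mse_loss(predicted[mask], feats[mask])
```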

FIG. 7 is an explanatory diagram showing an example of converting the image feature amount sequence F. In FIG. 7, the learning device 100 first sets, in the array Key 621, the initial searched-key sequence obtained by multiplying an N-dimensional identity matrix I by a transformation matrix W_k, and sets, in the array Value 622, the cross-modal feature amount sequence obtained by multiplying the N-dimensional identity matrix I by a transformation matrix W_v.

The learning device 100 then multiplies the masked image feature amount sequence F by a transformation matrix W_q to calculate a search query sequence Q = q_1, ..., q_I and sets it in the array Query 611. The learning device 100 calculates the inner products between the array Query 611 and the array Key 621, computes the softmax of those inner products, and calculates the inner product of the softmax with the array Value 622 as correction information. It converts the image feature amount sequence F by adding the correction information to F. The description now turns to FIG. 8 and a concrete example of this calculation.

FIG. 8 is an explanatory diagram showing a concrete example of the calculation that converts the image feature amount sequence F. As shown in FIG. 8, (8-1) to convert the sequence F, the learning device 100 multiplies each image feature amount f_1, f_2, ..., f_i, ..., f_I by each of the transformation vectors W_q1, ..., W_qh, ..., W_qH of the transformation matrix W_q. For example, multiplying the image feature amount f_i by W_q1, ..., W_qh, ..., W_qH yields H partial feature queries q_{i,1}, ..., q_{i,h}, ..., q_{i,H}.

The case of acquiring the partial feature queries q_{i,1}, ..., q_{i,H} from the image feature amount f_i is described here, but the same calculation is performed for the other image feature amounts. In the following, the subsequent calculations are explained using the partial feature query q_{i,h} as an example; the same calculations are performed for the other partial feature queries.

(8-2) The learning device 100 calculates the degree of coincidence between the partial feature query q_{i,h} and the N partial feature keys k_{1,h}, ..., k_{n,h}, ..., k_{N,h} set in the array Key 621, and computes the softmax of those degrees. For example, the learning device 100 calculates the inner products between q_{i,h} and the N partial feature keys and computes their softmax, specifically by the following formula (1):

  exp(q_{i,h}·k_{1,h}) / Σ_x exp(q_{i,h}·k_{x,h}), ..., exp(q_{i,h}·k_{n,h}) / Σ_x exp(q_{i,h}·k_{x,h}), ..., exp(q_{i,h}·k_{N,h}) / Σ_x exp(q_{i,h}·k_{x,h})    (1)

(8-3) Based on the softmax, the learning device 100 calculates the weighted average df_i of the N cross-modal feature amounts v_{1,h}, ..., v_{n,h}, ..., v_{N,h} associated with the N partial feature keys k_{1,h}, ..., k_{n,h}, ..., k_{N,h}. In this way, the learning device 100 calculates I weighted averages df_1, ..., df_i, ..., df_I, corresponding to the I image feature amounts f_1, ..., f_i, ..., f_I, as new cross-modal feature amounts. It then adds df_1, ..., df_I to f_1, ..., f_I respectively, thereby converting the I image feature amounts. A sketch of steps (8-1) to (8-3) follows below.
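A sketch of steps (8-1) to (8-3) with H heads. The per-head dimension p = d/H and the concatenation of the H head outputs back into a d-dimensional correction are assumptions; the Key and Value arrays correspond to I·W_k and I·W_v from FIG. 7, so they are simply learnable matrices here:

```python
import torch
import torch.nn.functional as F

d, H, N = 64, 8, 128   # hypothetical: feature dim, number of heads, table entries
p = d // H             # per-head dimension (assumption)

W_q = torch.nn.Parameter(torch.randn(d, H, p))  # H transformation vectors per head
W_k = torch.nn.Parameter(torch.randn(N, H, p))  # array Key 621 (I @ W_k in Fig. 7)
W_v = torch.nn.Parameter(torch.randn(N, H, p))  # array Value 622 (I @ W_v in Fig. 7)

def correct(feats: torch.Tensor) -> torch.Tensor:
    """feats: (I, d) image feature amount sequence F; returns F plus the correction."""
    q = torch.einsum('id,dhp->ihp', feats, W_q)   # partial feature queries q_{i,h}
    match = torch.einsum('ihp,nhp->ihn', q, W_k)  # inner products q_{i,h}·k_{n,h}
    w = F.softmax(match, dim=-1)                  # Equation (1)
    df = torch.einsum('ihn,nhp->ihp', w, W_v)     # weighted average of v_{n,h}
    return feats + df.reshape(feats.shape[0], d)  # add df_i to f_i
```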

This allows the learning device 100 to reflect language information in the language information data table 610, for example in the transformation matrices W_q, W_k, and W_v. As a result, the learning device 100 can provide a representation space that takes language information into account when handling images, can generate useful image feature amount sequences, and can learn a useful integrated model 600. The learning device 100 can also fine-tune the trained integrated model 600 efficiently. Thereafter, using the integrated model 600 to solve a problem improves the accuracy of the obtained solution.

The learning device 100 also ensures, for example, that image feature amounts are appropriately complemented by cross-modal feature amounts obtained from language information, so that useful image feature amount sequences can be generated. The learning device 100 can therefore separate the image Transformer 602 and efficiently fine-tune it on its own; even when the image Transformer 602 is used alone to solve a problem, the accuracy of the obtained solution is improved.

Conventionally, an object in an image can carry different meanings depending on its relationship to the surrounding objects, yet the image feature amount representing the object is, unlike a language feature amount, not quantized but expressed as continuous values, which makes it difficult to assign appropriate meanings. Specifically, it is desirable to assign different meanings to image feature amounts representing various chairs according to their arrangement in a room, or to image feature amounts representing various red traffic lights according to their positional relationship with stopping cars, but assigning such meanings appropriately is difficult. In contrast, as described above, the learning device 100 makes it easier to assign appropriate meanings to image feature amounts through the cross-modal feature amounts obtained from language information, and thus makes it easier to obtain a useful integrated model 600.

(Another example in which the learning device 100 converts an image feature amount sequence)
 Next, another example in which the learning device 100 converts an image feature amount sequence is described with reference to FIG. 9. In the examples of FIGS. 5 to 8, the learning device 100 uses the data table 510 containing a searched-key sequence and a cross-modal feature amount sequence; in the example of FIG. 9, the learning device 100 uses a neural network 900 instead of the data table 510.

FIG. 9 is an explanatory diagram showing another example of converting an image feature amount sequence. In FIG. 9, the neural network 900 is a two-layer fully connected network having an input layer 901, an intermediate layer 902, and an output layer 903. The input layer 901 and the output layer 903 have the same dimension, and the intermediate layer 902 has a larger dimension than the input layer 901. The parameter of the connection between the input layer 901 and the intermediate layer 902 is a transformation matrix W_qk, and the parameter of the connection between the intermediate layer 902 and the output layer 903 is a transformation matrix W_kv.

The learning device 100 generates N degrees of correspondence from the i-th image feature amount f_i via the transformation matrix W_qk, and generates a cross-modal feature amount from those correspondences via the transformation matrix W_kv. By updating W_qk and W_kv through pre-training, the learning device 100 can realize the same function as the data table 510 with the neural network 900; one way to read this correspondence is sketched below.
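A sketch of this reading: the hidden layer of width N plays the role of the N correspondence degrees, W_qk combines the query and key roles, and W_kv the value role. The softmax between the layers is an assumption introduced here to make the correspondence with the table search explicit; the text itself does not fix the activation:

```python
import torch
import torch.nn.functional as F

d, N = 64, 128  # hypothetical: input/output dimension d, hidden width N
W_qk = torch.nn.Parameter(torch.randn(d, N))  # input -> intermediate (query-key role)
W_kv = torch.nn.Parameter(torch.randn(N, d))  # intermediate -> output (value role)

def convert(f_i: torch.Tensor) -> torch.Tensor:
    corr = f_i @ W_qk                  # N correspondence degrees for feature f_i
    weights = F.softmax(corr, dim=-1)  # assumption: mirrors the table search weighting
    return weights @ W_kv              # cross-modal feature built from N value rows
```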

This allows the learning device 100 to reflect language information in the neural network 900, for example in the transformation matrices W_qk and W_kv. As a result, the learning device 100 can provide a representation space that takes language information into account when handling images, can generate useful image feature amount sequences, and can learn a useful integrated model. The learning device 100 can also fine-tune the trained integrated model efficiently, and using the integrated model to solve a problem improves the accuracy of the obtained solution.

(An example in which the learning device 100 uses the integrated model 1000)
 Next, an example in which the learning device 100 uses the integrated model 1000 is described with reference to FIG. 10.

FIG. 10 is an explanatory diagram showing an example in which the learning device 100 uses the integrated model 1000, for example to solve a video monitoring problem that takes the surrounding situation into account. The integrated model 1000 includes an image Transformer 1010, a language Transformer 1020, and a cross-modal Transformer 1030, together with the language information data table 610 or the neural network 900.

Specifically, the learning device 100 solves the problem of detecting whether anyone has not yet evacuated from a room when a disaster occurs. The learning device 100 acquires image information 1001 from a surveillance camera that captures the exterior of a fitting room, obtains an image feature amount sequence from the image information 1001, converts it using the language information data table 610 or the neural network 900, and then inputs it to the image Transformer 1010.

The learning device 100 also acquires language information 1002 containing statements such as "shoes are taken off before using the fitting room" and "the curtain is closed while the fitting room is in use". The learning device 100 obtains a language feature amount sequence from the language information 1002 and inputs it to the language Transformer 1020. It then inputs the output values of the image Transformer 1010 and the language Transformer 1020 to the cross-modal Transformer 1030.

Based on the output value of the cross-modal Transformer 1030, the learning device 100 determines whether anyone has not yet evacuated from the fitting room. In this way, the learning device 100 can solve the problem by appropriately matching the situation presented in the language information with the situation presented in the image information, improving the accuracy of the obtained solution. For example, even when the fitting room curtain is closed and no person is directly visible, so that the image information alone would suggest that no one is present, the learning device 100 can correctly determine that someone is in the fitting room by appropriately matching the situation "a person is using the fitting room", presented in the language information, with the arrangement of the objects in the image. A sketch of this inference path follows below.
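A sketch of the inference path just described; every name here (image_transformer, convert, embedding, language_transformer, cross_modal_transformer, classifier_head) is a hypothetical handle for the corresponding component, and pooling the first joint token into a binary decision head is an assumption:

```python
import torch

def detect_remaining_person(image_feats: torch.Tensor, word_ids: torch.Tensor):
    """image_feats: (1, I, d) from the surveillance image 1001;
    word_ids: token ids for the language information 1002."""
    img_tokens = image_transformer(convert(image_feats))     # table/NN conversion + 1010
    lang_tokens = language_transformer(embedding(word_ids))  # 1020
    joint = cross_modal_transformer(
        torch.cat([img_tokens, lang_tokens], dim=1))         # 1030
    score = classifier_head(joint[:, 0])                     # decision head (assumption)
    return torch.sigmoid(score) > 0.5                        # someone still in the room?
```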

(Learning processing procedure)
 Next, an example of the learning processing procedure executed by the learning device 100 is described with reference to FIG. 11. The learning processing is realized, for example, by the CPU 301 shown in FIG. 3, a storage area such as the memory 302 or the recording medium 305, and the network I/F 303.

FIG. 11 is a flowchart showing an example of the learning processing procedure. In FIG. 11, the learning device 100 initializes the searched-key values and the feature-value correction values in the table with random values (step S1101).

Next, the learning device 100 generates a set of language feature values and a set of image feature values from a mutually related document and image (step S1102). The learning device 100 then masks image feature values by replacing a fixed proportion of the set of image feature values with unrelated values (step S1103).

Next, for each image feature value in the set, the learning device 100 calculates its degree of relationship to the searched-key values in the table (step S1104), and corrects each image feature value based on the feature-value correction values in the table corresponding to those degrees of relationship (step S1105).

Next, using the language processing model, the image processing model, and the cross-modal processing model, the learning device 100 restores the masked image feature values from the set of language feature values and the set of corrected image feature values, obtaining predicted values (step S1106). The learning device 100 then updates the parameter values, including the searched-key values and the feature-value correction values in the table, in the direction that reduces the loss value of the predicted values (step S1107).

Next, the learning device 100 determines whether a termination condition is satisfied (step S1108). The termination condition is, for example, that the loop of steps S1102 to S1108 has been repeated at least a fixed number of times, or that the difference between the previous loss value and the current loss value has fallen below a threshold.

If the termination condition is not satisfied (step S1108: No), the learning device 100 returns to the processing of step S1102. If the termination condition is satisfied (step S1108: Yes), the learning device 100 ends the learning processing. In this way, the information processing apparatus can obtain a useful model whose parameter values reflect the characteristics of the language information. A schematic loop corresponding to this flowchart follows below.
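Putting the flowchart steps together, a schematic training loop might look as follows. All callables are hypothetical placeholders, and the random initialization of step S1101 is assumed to have been done when the model and optimizer were constructed:

```python
import torch

def pretrain(integrated_model, optimizer, data_iter,
             extract_features, mask_features, masked_loss,
             max_iters=10_000, tol=1e-4):
    """Schematic loop for steps S1102-S1108; all callables are hypothetical."""
    prev_loss = float('inf')
    for step, (document, image) in enumerate(data_iter):     # S1102: paired inputs
        lang_feats, img_feats = extract_features(document, image)
        corrupted, mask = mask_features(img_feats)           # S1103: mask a fixed ratio
        predicted = integrated_model(lang_feats, corrupted)  # S1104-S1106: correct, restore
        loss = masked_loss(predicted, img_feats, mask)
        optimizer.zero_grad()
        loss.backward()                                      # S1107: reduce the loss
        optimizer.step()
        if step + 1 >= max_iters or abs(prev_loss - loss.item()) < tol:
            break                                            # S1108: end condition met
        prev_loss = loss.item()
```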

As described above, according to the information processing apparatus, a feature amount can be extracted from the first modal information, and a new feature amount can be acquired by converting the extracted feature amount based on parameters. A first output value can be acquired by inputting the new feature amount into the first processing model relating to the first modal, and a second output value can be acquired by inputting another feature amount, extracted from information of a second modal different from the first modal, into the second processing model relating to the second modal. A third output value can be acquired by inputting the first and second output values into the third processing model relating to the first and second modals, and the parameters can be updated based on the third output value. The information processing apparatus can thereby obtain a useful model containing parameters that reflect the characteristics of the second modal information.

According to the information processing apparatus, a first feature amount and a second feature amount can be adopted as the parameters. The degree of coincidence between the extracted feature amount and the first feature amount is calculated, and the new feature amount is acquired by converting the extracted feature amount based on an index value obtained by weighting the second feature amount by the calculated degree of coincidence. The information processing apparatus can thereby reflect the characteristics of the second modal information in the first and second feature amounts, and realize the conversion of the feature amount extracted from the first modal information.

According to the information processing apparatus, the parameters of a neural network whose intermediate layer has more nodes than its input and output layers can be adopted as the parameters, and the new feature amount can be acquired by inputting the extracted feature amount into the neural network. The information processing apparatus can thereby reflect the characteristics of the second modal information in the neural network and realize the conversion of the feature amount extracted from the first modal information.

According to the information processing apparatus, the first processing model can be updated based on the third output value, and a predetermined problem can be solved, in response to the input of other first modal information, using the updated parameters and the updated first processing model. The information processing apparatus can thereby obtain a useful first processing model and improve the accuracy of solutions obtained with it.

According to the information processing apparatus, the first, second, and third processing models can all be updated based on the third output value, and a predetermined problem can be solved using the updated parameters and the updated first, second, and third processing models. The information processing apparatus can thereby obtain useful first, second, and third processing models and improve the accuracy of the solutions obtained.

According to the information processing apparatus, a modal relating to images can be adopted as the first modal and a modal relating to language as the second modal, yielding a model useful for solving problems based on image information and language information.

According to the information processing apparatus, a modal relating to images can be adopted as the first modal and a modal relating to audio as the second modal, yielding a model useful for solving problems based on image information and audio information.

According to the information processing apparatus, a modal relating to a first language can be adopted as the first modal and a modal relating to a second language as the second modal, yielding a model useful for solving problems based on language information in different languages.

According to the information processing apparatus, the parameters can be updated by error backpropagation based on the third output value, which allows them to be updated accurately.

The learning method described in this embodiment can be realized by executing a program prepared in advance on a computer such as a PC or a workstation. The learning program described in this embodiment is recorded on a computer-readable recording medium, such as a hard disk, a flexible disk, a CD (Compact Disc)-ROM, an MO, or a DVD (Digital Versatile Disc), and is executed by being read from the recording medium by the computer. The learning program may also be distributed via a network such as the Internet.

100 Learning device
101 Model
111 Extraction model
112 Conversion model
121 First processing model
122 Second processing model
123 Third processing model
200 Information processing system
201 Client device
202 Terminal device
210 Network
300 Bus
301 CPU
302 Memory
303 Network I/F
304 Recording medium I/F
305 Recording medium
400 Storage unit
401 Acquisition unit
402 First extraction unit
403 Conversion unit
404 First processing unit
405 Second extraction unit
406 Second processing unit
407 Third processing unit
408 Update unit
409 Utilization unit
410 Output unit
500, 600, 1000 Integrated model
501 Image feature amount generation unit
502 Query generation unit
503 Table search unit
504 Image processing unit
505 Language processing unit
506 Cross-modal processing unit
507 Loss function calculation unit
510 Data table
601 Add unit
602, 1010 Image Transformer
603, 1020 Language Transformer
604 Cross-modal Transformer
610 Language information data table
611 Array Query
621 Array Key
622 Array Value
900 Neural network
901 Input layer
902 Intermediate layer
903 Output layer
1001 Image information
1002 Language information
1030 Cross-modal Transformer

Claims (9)

A learning method characterized in that a computer executes processing of:
extracting a feature amount from first modal information;
acquiring a new feature amount by converting the extracted feature amount based on parameters;
acquiring a first output value by inputting the acquired new feature amount into a first processing model relating to the first modal;
acquiring a second output value by inputting another feature amount, extracted from information of a second modal different from the first modal, into a second processing model relating to the second modal;
acquiring a third output value by inputting the acquired first output value and the second output value into a third processing model relating to the first modal and the second modal; and
updating the parameters based on the acquired third output value.
The learning method according to claim 1, wherein the parameters include a first feature amount and a second feature amount, and the processing of acquiring the new feature amount calculates a degree of coincidence between the extracted feature amount and the first feature amount, and acquires the new feature amount by converting the extracted feature amount based on an index value obtained by weighting the second feature amount based on the calculated degree of coincidence.
The learning method according to claim 1, wherein the parameters are parameters of a neural network whose intermediate layer has more nodes than its input layer and its output layer, and the processing of acquiring the new feature amount acquires the new feature amount by inputting the extracted feature amount into the neural network.
The learning method according to any one of claims 1 to 3, wherein the computer further executes processing of:
updating the first processing model based on the third output value; and
solving a predetermined problem, in response to input of other information of the first modal, using the updated parameters and the updated first processing model.
 The learning method according to any one of claims 1 to 3, wherein the computer further executes a process comprising:
 updating the first processing model, the second processing model, and the third processing model based on the third output value; and
 solving a predetermined problem, in response to input of other information of the first modal and other information of the second modal, using the updated parameter, the updated first processing model, the updated second processing model, and the updated third processing model.
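Claims 4 and 5 cover the downstream phase: after pretraining, the updated converter (and, in claim 5, all three processing models) is reused to solve a concrete problem. A hypothetical fine-tuning step in the claim-5 style, reusing the CrossModalLearner sketch above; task_head, the squared classifier size, and the placeholder data are all assumptions.

    import torch
    import torch.nn as nn

    first_info, second_feats = torch.randn(8, 5, 2048), torch.randn(8, 7, 256)  # placeholder inputs
    labels = torch.randint(0, 10, (8,))                                         # placeholder targets
    task_head = nn.Linear(1, 10)   # e.g. a 10-way classifier stacked on the third output value
    opt = torch.optim.Adam(list(model.parameters()) + list(task_head.parameters()), lr=1e-5)
    logits = task_head(model(first_info, second_feats))
    loss = nn.functional.cross_entropy(logits, labels)
    loss.backward()   # updates converter and all three processing models end to end
    opt.step()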
 The learning method according to any one of claims 1 to 5, wherein the pair of the first modal and the second modal is one of: a pair of a modal relating to images and a modal relating to language; a pair of the modal relating to images and a modal relating to speech; and a pair of a modal relating to a first language and a modal relating to a second language.
 The learning method according to any one of claims 1 to 6, wherein the process of updating the parameter updates the parameter by error backpropagation based on the third output value.
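Claim 7 fixes only the update rule, error backpropagation. In the sketch above, restricting the update to the conversion parameter would look as follows; the squared loss stands in for whatever task-dependent loss is derived from the third output value.

    import torch

    opt = torch.optim.SGD(model.converter.parameters(), lr=1e-3)  # step only the conversion parameter
    loss = model(first_info, second_feats).pow(2).mean()          # placeholder loss on the third output value
    loss.backward()   # error backpropagation through all three processing models
    opt.step()        # only the converter's parameters are updated here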
 A learning program that causes a computer to execute a process comprising:
 extracting a feature amount from information of a first modal;
 acquiring a new feature amount by converting the extracted feature amount based on a parameter;
 acquiring a first output value by inputting the acquired new feature amount into a first processing model relating to the first modal;
 acquiring a second output value by inputting another feature amount, extracted from information of a second modal different from the first modal, into a second processing model relating to the second modal;
 acquiring a third output value by inputting the acquired first output value and the acquired second output value into a third processing model relating to the first modal and the second modal; and
 updating the parameter based on the acquired third output value.
 A learning device comprising a control unit configured to:
 extract a feature amount from information of a first modal;
 acquire a new feature amount by converting the extracted feature amount based on a parameter;
 acquire a first output value by inputting the acquired new feature amount into a first processing model relating to the first modal;
 acquire a second output value by inputting another feature amount, extracted from information of a second modal different from the first modal, into a second processing model relating to the second modal;
 acquire a third output value by inputting the acquired first output value and the acquired second output value into a third processing model relating to the first modal and the second modal; and
 update the parameter based on the acquired third output value.
PCT/JP2019/044771 2019-11-14 2019-11-14 Learning method, learning program, and learning device Ceased WO2021095213A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2019/044771 WO2021095213A1 (en) 2019-11-14 2019-11-14 Learning method, learning program, and learning device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2019/044771 WO2021095213A1 (en) 2019-11-14 2019-11-14 Learning method, learning program, and learning device

Publications (1)

Publication Number Publication Date
WO2021095213A1

Family

ID=75913025

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/044771 Ceased WO2021095213A1 (en) 2019-11-14 2019-11-14 Learning method, learning program, and learning device

Country Status (1)

Country Link
WO (1) WO2021095213A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160314789A1 (en) * 2015-04-27 2016-10-27 Nuance Communications, Inc. Methods and apparatus for speech recognition using visual information
CN110377710A (en) * 2019-06-17 2019-10-25 杭州电子科技大学 A kind of vision question and answer fusion Enhancement Method based on multi-modal fusion

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Lu, Jiasen et al., "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks", arXiv.org, 6 August 2019, pp. 1-11, XP081456681, Retrieved from the Internet <URL:https://arxiv.org/pdf/1908.02265v1.pdf> [retrieved on 2019-12-13] *
Nguyen, Duy-Kien et al., "Improved Fusion of Visual and Language Representations by Dense Symmetric Co-Attention for Visual Question Answering", 2018, pp. 6087-6096, XP033473524, Retrieved from the Internet <URL:http://openaccess.thecvf.com/content_cvpr_2018/html/Nguyen_Improved_Fusion_of_CVPR_2018_paper.html> [retrieved on 2019-12-13] *
Yu, Jianfei et al., "Adapting BERT for Target-Oriented Multimodal Sentiment Classification", August 2019, pp. 5408-5414, XP055823561, Retrieved from the Internet <URL:https://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=5444&context=sisresearch> [retrieved on 2019-12-13] *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023013043A1 (en) * 2021-08-06 2023-02-09 Nippon Telegraph and Telephone Corporation Estimation method, device, and program
JPWO2023013043A1 (en) * 2021-08-06 2023-02-09
JP7544280B2 (en) 2021-08-06 2024-09-03 Nippon Telegraph and Telephone Corporation Estimation method, device and program
JP2023042973A (en) 2021-09-15 2023-03-28 Canon Inc. Image processing apparatus, method of controlling the same, and program
JP7741670B2 (en) 2021-09-15 2025-09-18 Canon Inc. Image processing device, its control method, and program
JP2023017910A (en) 2021-11-05 2023-02-07 Beijing Baidu Netcom Science Technology Co., Ltd. Pre-training method, device and electronic device for semantic representation model

Similar Documents

Publication Publication Date Title
CN108416065B (en) Image-sentence description generation system and method based on hierarchical neural network
WO2020108165A1 (en) Image description information generation method and device, and electronic device
CN111133453A (en) Artificial neural network
CN118246537B (en) Question and answer method, device, equipment and storage medium based on large model
KR102274581B1 (en) Method for generating personalized hrtf
WO2021095213A1 (en) Learning method, learning program, and learning device
CN115101050A (en) Speech recognition model training method and device, speech recognition method and medium
CN118194238B (en) Multilingual multi-mode emotion recognition method, system and equipment
KR20220065209A (en) Method and apparatus for recognizing image of various quality
CN110175338B (en) Data processing method and device
CN115101075A (en) Voice recognition method and related device
CN112765998A (en) Machine translation method, machine translation model training method, device and storage medium
CN117173269A (en) Facial image generation method, device, electronic device and storage medium
CN110570877A (en) Sign language video generation method, electronic device and computer-readable storage medium
CN115995225A (en) Model training method and device, speech synthesis method, device and storage medium
CN115881101B (en) Training method and device for speech recognition model and processing equipment
CN117116289B (en) Ward medical intercom management system and method thereof
CN119918010A (en) A multimodal sentiment analysis method and system based on multi-dimensional perception
CN119129733A (en) An adaptive multimodal relation extraction method based on mutual attention mechanism
CN114550159B (en) Image subtitle generation method, device, equipment and readable storage medium
JP7623619B2 (en) Learning device, estimation device, learning method, estimation method, and program
JP6703964B2 (en) Learning device, text generating device, method, and program
CN116975696A (en) Task processing methods, devices, electronic equipment and storage media
JP7207568B2 (en) Output method, output program, and output device
CN110188367B (en) Data processing method and device

Legal Events

121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19952446; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 19952446; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: JP)