WO2025158547A1

WO2025158547A1 - Learning device, inference device, learning method, inference method, and program

Info

Publication number: WO2025158547A1
Application number: PCT/JP2024/001906
Authority: WO
Inventors: 厚志安藤; 岳至森
Original assignee: NTT Inc
Current assignee: NTT Inc
Priority date: 2024-01-23
Filing date: 2024-01-23
Publication date: 2025-07-31
Anticipated expiration: 2026-07-23

Abstract

This learning device comprises: an input unit for inputting learning data including speech, a first sentence regarding the speech, and a second sentence corresponding to the first sentence; a speech feature generation unit that generates information representing features of the speech for each prescribed time interval, on the basis of a speech feature extractor composed of a plurality of layers; a first integration unit that, on the basis of a first parameter, generates first integrated information for each time interval by integrating information representing features generated individually in a prescribed plurality of layers of the speech feature extractor; a second integration unit that, on the basis of a second parameter, generates second integrated information by integrating the first integrated information for each time interval in the time direction; a calculation unit that calculates the generation probability of a third sentence corresponding to the first sentence on the basis of the first sentence, the second integrated information, and a language model; and a learning unit that learns parameters to be learned, including the first parameter and the second parameter, on the basis of the generation probability of the third sentence, and the second sentence.

Description

Learning device, inference device, learning method, inference method, and program

　本開示は、学習装置、推論装置、学習方法、推論方法、及びプログラムに関する。 This disclosure relates to a learning device, an inference device, a learning method, an inference method, and a program.

　音声には、言語情報・非言語情報・パラ言語情報という３種類の情報（以下、これら３種類の情報をまとめて「音声情報」ともいう。）が含まれていることが知られている（非特許文献１）。また、音声から非言語情報・パラ言語情報を認識する技術が知られている（非特許文献２）。 It is known that speech contains three types of information: linguistic information, non-linguistic information, and paralinguistic information (hereinafter, these three types of information will be collectively referred to as "speech information") (Non-Patent Document 1). Furthermore, technology for recognizing non-linguistic and paralinguistic information from speech is known (Non-Patent Document 2).

　他方、画像処理分野においては、画像に含まれる様々な情報を自然文で出力する技術として、画像理解技術等と呼ばれる技術が知られている（非特許文献３）。 On the other hand, in the field of image processing, there is a technology known as image understanding technology that outputs various information contained in images in natural language (Non-Patent Document 3).

H. Fujisaki, "Prosody, Models, and Spontaneous Speech," in Computing Prosody, Y. Sagisaka, N. Campbell, and N. Higuchi, Springer, pp.27-42, 1996.H. Fujisaki, “Prosody, Models, and Spontaneous Speech,” in Computing Prosody, Y. Sagisaka, N. Campbell, and N. Higuchi, Springer, pp.27-42, 1996. A. Ando, S. Kobashikawa, H. Kamiyama, R. Masumura, Y. Ijima and Y. Aono, "SOFT-TARGET TRAINING WITH AMBIGUOUS EMOTIONAL UTTERANCESFOR DNN-BASED SPEECH EMOTION CLASSIFICATION," in Proc. of ICASSP, 2018, pp. 4964-4968.A. Ando, S. Kobashikawa, H. Kamiyama, R. Masumura, Y. Ijima and Y. Aono, "SOFT-TARGET TRAINING WITH AMBIGUOU S EMOTIONAL UTTERANCESFOR DNN-BASED SPEECH EMOTION CLASSIFICATION," in Proc. of ICASSP, 2018, pp. 4964-4968. D. Zhu, J. Chen, X. Shen, X. Li, M. Elhoseiny, "MINIGPT-4: ENHANCING VISION-LANGUAGE UNDERSTANDING WITH ADVANCED LARGE LANGUAGE MODELS," in arXiv preprint arXiv:2304.10592, 2023.D. Zhu, J. Chen, X. Shen, X. Li, M. Elhoseiny, "MINIGPT-4: ENHANCED VISION-LANGUAGE UNDERSTANDING WITH ADVANCED LARGE LANGUAGE MODELS," in arXiv preprint arXiv:2304.10592, 2023.

　画像理解技術で利用されている画像エンコーダの代わりに音声エンコーダを利用することにより、音声に含まれる音声情報を自然文で出力する技術（以下、「音声理解技術」ともいう。）を実現することが可能であると考えられる。しかしながら、画像エンコーダと音声エンコーダの構成の違いや画像と音声の性質の違い等により、単純に画像エンコーダの代わりに音声エンコーダを利用するだけでは音声理解技術の実現は困難である。 By using a speech encoder instead of the image encoder used in image understanding technology, it is thought possible to realize a technology that outputs speech information contained in speech in natural language (hereinafter also referred to as "speech understanding technology"). However, due to differences in the configurations of image and speech encoders and differences in the properties of images and speech, it is difficult to realize speech understanding technology simply by using a speech encoder instead of an image encoder.

　本開示は、上記の点に鑑みてなされたもので、音声理解技術を実現することを目的とする。 This disclosure has been made in light of the above points and aims to realize speech understanding technology.

　本開示の一態様による学習装置は、音声と、前記音声に関する第１の文と、前記第１の文に対応する第２の文とが含まれる学習データを入力する入力部と、複数の層で構成された音声特徴抽出器に基づいて、所定の時間区間毎に、前記音声の特徴を表す情報を生成する音声特徴生成部と、第１のパラメータに基づいて、前記時間区間毎に、前記音声特徴抽出器の所定の複数の層でそれぞれ生成された前記特徴を表す情報を統合した第１の統合情報を生成する第１の統合部と、第２のパラメータに基づいて、前記時間区間毎の前記第１の統合情報を時間方向に統合した第２の統合情報を生成する第２の統合部と、前記第１の文と、前記第２の統合情報と、言語モデルとに基づいて、前記第１の文に対応する第３の文の生成確率を算出する算出部と、前記第３の文の生成確率と、前記第２の文とに基づいて、前記第１のパラメータと前記第２のパラメータとを含む学習対象パラメータを学習する学習部と、を有する。 A learning device according to one aspect of the present disclosure includes an input unit that inputs learning data including speech, a first sentence related to the speech, and a second sentence corresponding to the first sentence; a speech feature generation unit that generates information representing features of the speech for each predetermined time interval based on a speech feature extractor configured with multiple layers; a first integration unit that generates first integrated information for each time interval based on first parameters, integrating the information representing the features generated by each of the predetermined multiple layers of the speech feature extractor; a second integration unit that generates second integrated information for each time interval based on second parameters, integrating the first integrated information in the time direction; a calculation unit that calculates the generation probability of a third sentence corresponding to the first sentence based on the first sentence, the second integrated information, and a language model; and a learning unit that learns learning target parameters including the first parameter and the second parameter based on the generation probability of the third sentence and the second sentence.

　音声理解技術を実現できる。 Speech understanding technology can be realized.

音声理解モデルの一例を示す図である。FIG. 2 is a diagram illustrating an example of a speech understanding model. 音声エンコーダ及び音声エンコーダ出力統合ブロックの一例を示す図である。FIG. 10 illustrates an example of an audio encoder and an audio encoder output integrated block. 時間情報統合ブロックの一例を示す図である。FIG. 10 is a diagram illustrating an example of a time information integration block. 学習時における音声理解装置のハードウェア構成の一例を示す図である。FIG. 2 is a diagram illustrating an example of a hardware configuration of a speech understanding device during learning. 学習時における音声理解装置の機能構成の一例を示す図である。FIG. 2 is a diagram illustrating an example of a functional configuration of a speech understanding device during learning. 学習データセットの一例を示す図である。FIG. 10 is a diagram illustrating an example of a training data set. モデル学習部の詳細な機能構成の一例を示す図である。FIG. 2 is a diagram illustrating an example of a detailed functional configuration of a model learning unit. モデル構築処理の一例を示すフローチャートである。10 is a flowchart illustrating an example of a model construction process. モデル学習処理の一例を示すフローチャートである。10 is a flowchart illustrating an example of a model learning process. 推論時における音声理解装置の機能構成の一例を示す図である。FIG. 2 is a diagram illustrating an example of a functional configuration of a speech understanding device during inference. 出力文生成部の詳細な機能構成の一例を示す図である。FIG. 2 is a diagram illustrating an example of a detailed functional configuration of an output statement generation unit. 出力文生成処理の一例を示すフローチャートである。10 is a flowchart illustrating an example of an output sentence generation process.

　以下、本発明の一実施形態について、図面を参照しながら詳細に説明する。 Below, one embodiment of the present invention will be described in detail with reference to the drawings.

　＜非言語情報・パラ言語情報の認識技術＞
　音声には、音声情報（つまり、言語情報・非言語情報・パラ言語情報）が含まれていることが知られている（非特許文献１）。ここで、言語情報とは、話者が話した言葉の情報のことである。非言語情報とは、言語情報でない情報のうち、随意的に変化させることができない情報（例えば、話者性、性別、感情等を表す情報）のことである。パラ言語情報とは、言語情報でない情報のうち、随意的に変化させることができる情報（例えば、意図、態度等を表す情報）のことである。 <Non-verbal and para-linguistic information recognition technology>
It is known that speech contains speech information (i.e., linguistic information, non-linguistic information, and paralinguistic information) (Non-Patent Document 1). Here, linguistic information refers to information about the words spoken by a speaker. Non-linguistic information refers to information that is not linguistic but cannot be changed at will (e.g., information that represents the speaker's identity, gender, emotions, etc.). Paralinguistic information refers to information that is not linguistic but can be changed at will (e.g., information that represents intentions, attitudes, etc.).

　音声から非言語情報・パラ言語情報を認識する従来技術では、有限個の状態を事前に定義した上で、それらの状態のうちのいずれに最も当てはまるかを推定する場合が多い。例えば、非特許文献２に記載されている技術では、深層学習に基づく統計モデルを用いて、怒り・喜び・悲しみ等の感情状態のいずれに最も近いかを推定する。しかしながら、このような従来技術では、きめ細やかな非言語情報・パラ言語情報を認識することはできない。例えば、事前に定義していない感情状態「いらいらしている」や複数の感情状態にまたがる感情状態「怒りつつ悲しんでいる」等といった感情状態を推定することはできない。このため、非言語情報・パラ言語情報の認識結果を利用する後段のシステムにおける処理精度（例えば、コンタクトセンタシステムにおける通話分析の精度、音声対話システムにおける対話制御の精度や分析精度等）が低下するという問題がある。 In conventional technologies for recognizing non-verbal and paralinguistic information from speech, a finite number of states are defined in advance, and then an estimation is made of which of these states the speech most closely matches. For example, the technology described in Non-Patent Document 2 uses a statistical model based on deep learning to estimate which emotional state, such as anger, joy, or sadness, the speech most closely matches. However, such conventional technologies are unable to recognize fine-grained non-verbal and paralinguistic information. For example, they are unable to estimate emotional states such as "irritated," which are not predefined, or "angry and sad," which spans multiple emotional states. This results in a problem of reduced processing accuracy in downstream systems that use the results of non-verbal and paralinguistic information recognition (for example, the accuracy of call analysis in contact center systems, the accuracy of dialogue control and analysis in voice dialogue systems, etc.).

　＜画像理解技術＞
　画像処理の分野においては、画像に含まれる様々な情報を自然文で出力する画像理解技術等と呼ばれる技術が知られている（非特許文献３）。なお、自然文とは、自然言語（例えば、日本語、英語、中国語等といった人間が意思疎通等のために用いる言語）で記述された文のことである。画像理解技術は、大量のテキストデータを用いて単語間の関係性や共起性を獲得した大規模言語モデルと、画像からその画像中の物体の情報等を抽出する画像エンコーダとを組み合わせた深層学習モデルで構成される。そして、この深層学習モデルに対して入力画像とその入力画像に関する自然文の質問（例えば、「この画像に含まれるロゴについてどう思いますか？」等といった文）とを与えると、その質問に対応する出力文（例えば、「このログはシンプルかつ記号的なロゴです」等といった文）が出力される。 <Image understanding technology>
In the field of image processing, there is known a technology called image understanding, which outputs various information contained in an image in natural language (Non-Patent Document 3). Note that natural language refers to a sentence written in a natural language (e.g., a language used by humans for communication, such as Japanese, English, or Chinese). Image understanding technology is composed of a deep learning model that combines a large-scale language model that acquires relationships and co-occurrences between words using a large amount of text data with an image encoder that extracts information about objects in the image from the image. When this deep learning model is given an input image and a natural language question about the input image (e.g., a sentence such as "What do you think about the logo in this image?"), it outputs an output sentence corresponding to the question (e.g., a sentence such as "This log is a simple and symbolic logo").

　画像理解技術を実現する深層学習モデルは、入力画像とその入力画像に関する自然文の質問とその質問に対応する正解出力文との組を与えることにより、画像に含まれる様々な情報を自然文で推定することができるようになる。例えば、「この画像に含まれるロゴの色は？」との質問に対して、「ピンク色です。」等といった文が出力文として出力されるようにすることができる。 Deep learning models that realize image understanding technology can infer various pieces of information contained in an image in natural language by providing a pair of an input image, a natural language question about that input image, and a correct output sentence corresponding to that question. For example, in response to the question, "What color is the logo in this image?", a sentence such as "It's pink" can be output as the output sentence.

　＜音声理解技術＞
　音声に含まれる音声情報を自然文で出力する音声理解技術を実現することにより、例えば、きめ細やかな非言語情報・パラ言語情報を含む多様な音声情報を認識し、その音声情報を自然文で出力することが可能となる。 <Speech understanding technology>
By realizing speech understanding technology that can output speech information contained in speech in natural sentences, it will be possible to recognize a variety of speech information, including detailed non-linguistic and paralinguistic information, and output that speech information in natural sentences.

　音声理解技術を実現する単純な方法として、画像理解技術で利用されている画像エンコーダの代わりに、音声からその音声の特徴を表す情報を抽出する音声エンコーダを利用することが考えられる。しかしながら、実際には、この方法では音声理解技術の実現は困難である。これには２つの原因がある。 A simple way to realize speech understanding technology would be to use a speech encoder that extracts information representing the characteristics of speech from speech, instead of the image encoder used in image understanding technology. However, in practice, it is difficult to realize speech understanding technology using this method. There are two reasons for this.

　１つ目の原因は、画像エンコーダと音声エンコーダの構成の違いである。すなわち、既存の音声エンコーダ（例えば、ｗａｖ２ｖｅｃ２．０（参考文献１）、ＷａｖＬＭ（参考文献２）等）は各層で異なる情報が抽出されるため、画像理解技術と同様に音声エンコーダから最終的に出力される情報のみを用いるだけでは多様な音声情報を理解することができないためである。例えば、参考文献２では、話者情報等といった物理的性質に近い情報は音声エンコーダの低層で抽出される一方で、音韻性等の抽象的性質に近い情報は音声エンコーダの高層で抽出されることが示唆されている。このため、例えば、話し方を認識する際には音声エンコーダの低層で抽出される情報を利用する一方で、音声認識の際には音声エンコーダの高層で抽出される情報を利用しなければ、正確な自然文の出力は困難であると考えられる。 The first reason is the difference in the structure of image encoders and speech encoders. Existing speech encoders (e.g., wav2vec2.0 (Reference 1), WavLM (Reference 2), etc.) extract different information at each layer, making it impossible to understand diverse speech information using only the information ultimately output by the speech encoder, as is the case with image understanding technology. For example, Reference 2 suggests that information closer to physical properties, such as speaker information, is extracted in the lower layers of the speech encoder, while information closer to abstract properties, such as phonology, is extracted in the higher layers of the speech encoder. For this reason, for example, while speaking style recognition uses information extracted in the lower layers of the speech encoder, speech recognition must use information extracted in the higher layers of the speech encoder; otherwise, it is thought that accurate natural-sounding output will be difficult.

　２つ目の原因は、画像と音声の性質の違いである。すなわち、音声は画像と異なり長さが可変であるため、任意長の音声に含まれる音声情報を認識し、その音声情報を自然文で出力できるような時間方向の処理が必要であると考えられる。 The second reason is the difference in the nature of images and audio. In other words, unlike images, audio has a variable length, so it is thought that time-domain processing is required to recognize the audio information contained in audio of any length and output that audio information in natural language.

　＜音声理解モデル＞
　そこで、以下では、上記の２つの原因に起因する問題を解決することができる深層学習モデル（以下、「音声理解モデル」と呼ぶ。）を提案する。この音声理解モデルにより、きめ細やかな非言語情報・パラ言語情報を含む多様な音声情報を任意長の音声から認識し、その音声情報を自然文で出力することが可能となる。すなわち、任意長の音声に含まれる多様な音声情報（より具体的には、音声の物理的性質及び抽象的性質の少なくとも一方に関連する音声情報）を自然文で出力する音声理解技術を実現することができる。このため、例えば、非言語情報・パラ言語情報の認識結果を利用する後段のシステムにおける処理精度（例えば、コンタクトセンタシステムにおける通話分析の精度、音声対話システムにおける対話制御の精度や分析精度等）の向上も期待できる。 <Speech understanding model>
Therefore, we propose a deep learning model (hereinafter referred to as a "speech understanding model") that can solve the problems caused by the above two causes. This speech understanding model makes it possible to recognize diverse speech information, including detailed non-verbal and paralinguistic information, from speech of any length and output the speech information in natural sentences. In other words, it is possible to realize a speech understanding technology that outputs diverse speech information contained in speech of any length (more specifically, speech information related to at least one of the physical and abstract properties of speech) in natural sentences. This can be expected to improve the processing accuracy of downstream systems that use the recognition results of non-verbal and paralinguistic information (e.g., the accuracy of call analysis in contact center systems, the accuracy of dialogue control and analysis in voice dialogue systems, etc.).

　本実施形態で提案する音声理解モデル１０００の一例について、図１を参照しながら説明する。図１は、音声理解モデル１０００の一例を示す図である。 An example of the speech understanding model 1000 proposed in this embodiment will be described with reference to Figure 1. Figure 1 is a diagram showing an example of the speech understanding model 1000.

　図１に示すように、音声理解モデル１０００は、音声エンコーダ１１００と、音声エンコーダ出力統合ブロック１２００と、時間情報統合ブロック１３００と、線形変換層１４００と、大規模言語モデル１５００とで構成される。 As shown in Figure 1, the speech understanding model 1000 is composed of a speech encoder 1100, a speech encoder output integration block 1200, a temporal information integration block 1300, a linear transformation layer 1400, and a large-scale language model 1500.

　音声エンコーダ１１００は、既存の任意の音声エンコーダ（例えば、ｗａｖ２ｖｅｃ２．０、ＷａｖＬＭ等）である。音声エンコーダ１１００は、音声（以下、「入力音声」ともいう。）を入力として、その音声の特徴を表す情報を出力する。このとき、音声エンコーダ１１００は、所定の時間区間毎に、その時間区間における入力音声を入力として、その時間区間における音声の特徴を表す情報を出力する。 The audio encoder 1100 is any existing audio encoder (for example, wav2vec2.0, WavLM, etc.). The audio encoder 1100 receives audio (hereinafter also referred to as "input audio") as input and outputs information representing the characteristics of that audio. At this time, for each predetermined time interval, the audio encoder 1100 receives the input audio for that time interval and outputs information representing the characteristics of the audio for that time interval.

　音声エンコーダ出力統合ブロック１２００は、各時間区間において、音声エンコーダ１１００の各層の出力のうち、予め指定された複数の層の出力を統合する。以下、予め指定された複数の層の各々の層のことを「統合対象層」と呼ぶことにする。ただし、各統合対象層の出力次元数は同一であるものとする。なお、統合対象層は、音声エンコーダ１１００の各層のうちその出力次元数が同一である層の中からユーザ等によって指定される。 The audio encoder output merging block 1200 merges the outputs of multiple pre-specified layers from among the outputs of each layer of the audio encoder 1100 in each time interval. Hereinafter, each of the multiple pre-specified layers will be referred to as the "layer to be merged." However, it is assumed that each layer to be merged has the same number of output dimensions. The layer to be merged is specified by the user, etc., from among the layers of the audio encoder 1100 that have the same number of output dimensions.

　ここで、以下、時間区間を表すインデックスをｔとして、ｔ＝１，・・・，Ｔであるものとする。Ｔは入力音声の最後の時間区間を表すインデックスであり、入力音声の長さによってその値は異なり得る。また、以下、統合対象層の層数をＮとして、時間区間ｔにおけるｎ（ただし、ｎ＝１，・・・，Ｎ）番目の統合対象層の出力をｈ_ｎ（ｔ）とする。なお、各ｎ＝１，・・・，Ｎに対してｈ_ｎ（ｔ）は入力音声の何等かの特徴（例えば、物理的性質や抽象的性質）を表しており、また各ｈ_ｎ（ｔ）は予め決められた次元数のベクトルで記述できるため、以下、各ｈ_ｎ（ｔ）を「第１の音声特徴ベクトル」と呼ぶことにする。 Hereinafter, an index representing a time interval is defined as t, where t = 1, ..., T. T is an index representing the last time interval of the input speech, and its value may vary depending on the length of the input speech. Furthermore, hereinafter, the number of layers to be integrated is defined as N, and the output of the nth layer to be integrated (where n = 1, ..., N) in time interval t is defined as h _n (t). Note that for each n = 1, ..., N, h _n (t) represents some feature of the input speech (e.g., a physical property or an abstract property), and each h _n (t) can be described by a vector with a predetermined number of dimensions, so hereinafter each h _n (t) will be referred to as a "first speech feature vector."

　このとき、音声エンコーダ出力統合ブロック１２００は、各時間区間ｔにおいて、その時間区間ｔにおける各第１の音声特徴ベクトルｈ_ｎ（ｔ）を入力として、各第１の音声特徴ベクトルｈ_ｎ（ｔ）を統合したベクトルを出力する。以下、各第１の音声特徴ベクトルｈ_ｎ（ｔ）を統合したベクトルを「第１の統合ベクトル」と呼び、ｅ（ｔ）で表すことにする。 In this case, the speech encoder output integration block 1200 receives, in each time interval t, each first speech feature vector h _n (t) in that time interval t as input, and outputs a vector obtained by integrating each first speech feature vector h _n (t). Hereinafter, the vector obtained by integrating each first speech feature vector h _n (t) will be referred to as the "first integrated vector" and represented by e(t).

　例えば、図２に示すように、音声エンコーダ１１００が１つの畳み込み層とＮ個のＴｒａｎｓｆｏｒｍｅｒ層とで構成されており、これらＮ個のＴｒａｎｓｆｏｒｍｅｒ層が統合対象層であるものとする。この場合、時間区間ｔにおけるｎ番目のＴｒａｎｓｆｏｒｍｅｒ層の出力が第１の音声特徴ベクトルｈ_ｎ（ｔ）であり、音声エンコーダ出力統合ブロック１２００は、これらの第１の音声特徴ベクトルｈ_ｎ（ｔ）を統合することにより第１の統合ベクトルｅ（ｔ）を作成する。 For example, as shown in Figure 2, suppose the speech encoder 1100 is composed of one convolutional layer and N transformer layers, and these N transformer layers are layers to be integrated. In this case, the output of the nth transformer layer in time interval t is a first speech feature vector h _n (t), and the speech encoder output integration block 1200 integrates these first speech feature vectors h _n (t) to create a first integrated vector e(t).

　各第１の音声特徴ベクトルｈ_ｎ（ｔ）の統合方法としては、例えば、重み付け和や線形変換和を用いることができる。重み付け和を用いる場合、ｅ（ｔ）＝α_１ｈ_１（ｔ）＋・・・＋α_Ｎｈ_Ｎ（ｔ）により第１の統合ベクトルｅ（ｔ）が作成される。ここで、α_１，・・・，α_Ｎは重み係数とも呼ばれ、α_１＋・・・＋α_Ｎ＝１を満たす学習対象パラメータである。一方で、線形変換和を用いる場合、ｅ（ｔ）＝（（Ｗ_１ｈ_１（ｔ）＋ｂ_１）＋・・・＋（Ｗ_Ｎｈ_Ｎ（ｔ）＋ｂ_Ｎ））／Ｎにより第１の統合ベクトルｅ（ｔ）が作成される。ここで、Ｗ_１，・・・，Ｗ_Ｎ，ｂ_１，・・・，ｂ_Ｎは線形変換係数とも呼ばれ、学習対象パラメータである。 The first speech feature vectors h _n (t) can be integrated by, for example, weighted sum or linear transformation sum. When weighted sum is used, the first integrated vector e(t) is created by e(t) = α ₁ h ₁ (t) + ... + α _N h _N (t). Here, α ₁ , ..., α _N are also called weighting coefficients and are training target parameters that satisfy α ₁ + ... + α _N = 1. On the other hand, when linear transformation sum is used, the first integrated vector e(t) is created by e(t) = ((W ₁ h ₁ (t) + b ₁ ) + ... + (W _N h _N (t) + b _N ))/N. Here, W ₁ , ..., W _N , b ₁ , ..., b _N are also called linear transformation coefficients and are training target parameters.

　時間情報統合ブロック１３００は、第１の統合ベクトルｅ（ｔ）を時間方向に統合する。すなわち、時間情報統合ブロック１３００は、各時間区間ｔにおける第１の統合ベクトルｅ（ｔ）を入力として、各第１の統合ベクトルｅ（ｔ）を時間方向に統合したベクトルを出力する。このとき、すべての時間区間ｔにおける第１の統合ベクトルｅ（ｔ）を単純に統合すればよいわけではなく、時間的に重視すべき部分とそうでない部分とを考慮する必要がある。例えば、音声から発話者の感情や話者情報を認識する場合、音声中の短い間を表す時間区間や息継ぎが生じている時間区間等は無視し、発話している時間区間を重視する必要がある。このため、時間情報統合ブロック１３００は、各第１の統合ベクトルｅ（ｔ）を重み付け和により統合する。以下、各第１の統合ベクトルｅ（ｔ）を時間方向に統合したベクトルを「第２の統合ベクトル」と呼び、ｖで表すことにする。 The temporal information integration block 1300 integrates the first integrated vector e(t) in the time direction. That is, the temporal information integration block 1300 receives the first integrated vector e(t) for each time interval t as input and outputs a vector obtained by integrating each first integrated vector e(t) in the time direction. In this case, it is not sufficient to simply integrate the first integrated vector e(t) for all time intervals t; it is necessary to consider which parts of the time period should be emphasized and which parts should not. For example, when recognizing a speaker's emotions or speaker information from speech, it is necessary to ignore time periods representing short pauses in the speech or time periods where breathing occurs, and to emphasize the time periods during which the speaker is speaking. For this reason, the temporal information integration block 1300 integrates each first integrated vector e(t) using a weighted sum. Hereinafter, the vector obtained by integrating each first integrated vector e(t) in the time direction will be referred to as the "second integrated vector" and represented by v.

　例えば、図３に示すように、第１の統合ベクトルｅ（１），・・・，ｅ（Ｔ）が入力された場合、時間情報統合ブロック１３００は、これらの第１の統合ベクトルｅ（１），・・・，ｅ（Ｔ）を時間方向に統合することにより第２の統合ベクトルｖを作成する。 For example, as shown in Figure 3, when first integrated vectors e(1), ..., e(T) are input, the time information integration block 1300 creates a second integrated vector v by integrating these first integrated vectors e(1), ..., e(T) in the time direction.

　各第１の統合ベクトルｅ（１），・・・，ｅ（Ｔ）の統合方法としては、例えば、自己注意プーリング層（Self-Attentive Pooling）を用いることができる。Ｅ＝［ｅ（１），・・・，ｅ（Ｔ）］^τとする。このとき、自己注意プーリング層を用いる場合、ｖ＝ａＥにより第２の統合ベクトルｖが作成される。ここで、ａ＝ｓｏｆｔｍａｘ（ＲｅＬＵ（Ｗ_１'Ｅ）Ｗ_２'）である。また、Ｗ_１'及びＷ_２'は学習対象パラメータ、τは転置を表す記号である。これにより、各第１の統合ベクトルｅ（１），・・・，ｅ（Ｔ）が重み付け和により統合された第２の統合ベクトルｖが得られる。 As a method for integrating each of the first integrated vectors e(1), ..., e(T), for example, a self-attentive pooling layer can be used. Let E = [e(1), ..., e(T)] ^τ . In this case, when a self-attention pooling layer is used, the second integrated vector v is created by v = aE. Here, a = softmax(ReLU(W ₁ 'E)W ₂ '). Furthermore, W ₁ ' and W ₂ ' are training parameters, and τ is a symbol representing transposition. As a result, the second integrated vector v is obtained by integrating each of the first integrated vectors e(1), ..., e(T) by weighted sum.

　なお、時間情報統合ブロック１３００は、畳み込みニューラルネットワークを用いて、各第１の統合ベクトルｅ（１），・・・，ｅ（Ｔ）を時間方向に統合してもよい。この場合、Ｖ＝ｃｏｎｖ１Ｄ（Ｅ）によりＫ個の第２の統合ベクトルｖ（１），・・・，ｖ（Ｋ）により構成される行列Ｖ＝［ｖ（１），・・・，ｖ（Ｋ）］^τが得られる。このとき、１次元畳み込みニューラルネットワークの学習可能パラメータが学習対象パラメータである。また、Ｋは、１次元畳み込みニューラルネットワークのウインドウサイズと第１の統合ベクトルｅ（１），・・・，ｅ（Ｔ）の系列長Ｔによって決定される１以上の整数である。 The time information integration block 1300 may integrate each of the first integrated vectors e(1), ..., e(T) in the time direction using a convolutional neural network. In this case, a matrix V = [v(1), ..., v(K)] ^τ composed of K second integrated vectors v(1), ..., v(K) is obtained by V = conv1D(E). In this case, the learnable parameters of the one-dimensional convolutional neural network are the parameters to be learned. Furthermore, K is an integer equal to or greater than 1 determined by the window size of the one-dimensional convolutional neural network and the sequence length T of the first integrated vectors e(1), ..., e(T).

　線形変換層１４００は、第２の統合ベクトルｖを線形変換する。すなわち、線形変換層１４００は、第２の統合ベクトルｖを入力として、この第２の統合ベクトルｖを線形変換したベクトルを出力する。以下、第２の統合ベクトルｖを線形変換したベクトルを「第２の音声特徴ベクトル」と呼び、ｗで表すことにする。第２の音声特徴ベクトルｗは、ｗ＝Ｗｖ＋ｂにより作成される。ここで、Ｗ，ｂは線形変換係数とも呼ばれ、学習対象パラメータである。なお、大規模言語モデル１５００におけるトークンの埋め込み空間の次元数がＭである場合、第２の音声特徴ベクトルｗの次元数もＭである。このため、線形変換層１４００では、長さが１、次元数がＭの第２の音声特徴ベクトル系列が作成されることを意味している。 The linear transformation layer 1400 linearly transforms the second integrated vector v. That is, the linear transformation layer 1400 takes the second integrated vector v as input and outputs a vector obtained by linearly transforming this second integrated vector v. Hereinafter, the vector obtained by linearly transforming the second integrated vector v will be referred to as the "second speech feature vector" and represented by w. The second speech feature vector w is created by w = Wv + b. Here, W and b are also called linear transformation coefficients and are training target parameters. Note that if the number of dimensions of the token embedding space in the large-scale language model 1500 is M, then the number of dimensions of the second speech feature vector w is also M. This means that the linear transformation layer 1400 creates a second speech feature vector sequence with a length of 1 and M dimensions.

　なお、時間情報統合ブロック１３００から行列Ｖ＝［ｖ（１），・・・，ｖ（Ｋ）］^τが出力された場合、例えば、ｖ（１），・・・，ｖ（Ｋ）をそれぞれ線形変換したＫ個のベクトルを第２の音声特徴ベクトルｗ（１），・・・，ｗ（Ｋ）とすればよい。これは、長さがＫ、次元数がＭの第２の音声特徴ベクトル系列が作成されることを意味している。 When the matrix V = [v(1), ..., v(K)] ^τ is output from the time information integration block 1300, for example, K vectors obtained by linearly transforming v(1), ..., v(K) can be set as second speech feature vectors w(1), ..., w(K). This means that a second speech feature vector sequence with a length of K and a number of dimensions of M is created.

　大規模言語モデル１５００は、既存の任意の大規模言語モデル（ＬＬＭ：Large Language Models）である。大規模言語モデル１５００は、入力音声に関する自然文の質問である入力文と、第２の音声特徴ベクトルｗ（又は、第２の音声特徴ベクトルｗ（１），・・・，ｗ（Ｋ））とを入力として、その質問に対応する出力文を出力する。このような出力文は、入力文と第２の音声特徴ベクトルｗ（又は、第２の音声特徴ベクトルｗ（１），・・・，ｗ（Ｋ））とが与えられたときの事後確率に従って生成される。なお、大規模言語モデル１５００では、入力文を構成するトークンの埋め込み表現を表す埋め込みベクトル系列と第２の音声特徴ベクトルｗ（又は、第２の音声特徴ベクトルｗ（１），・・・，ｗ（Ｋ））とを結合したベクトル系列が処理されることにより事後確率が算出される。 Large-scale language model 1500 is any existing large-scale language model (LLM). Large-scale language model 1500 takes as input an input sentence, which is a natural language question about the input speech, and a second speech feature vector w (or second speech feature vector w(1), ..., w(K)), and outputs an output sentence corresponding to the question. Such an output sentence is generated according to the posterior probability given the input sentence and the second speech feature vector w (or second speech feature vector w(1), ..., w(K)). Note that in large-scale language model 1500, the posterior probability is calculated by processing a vector sequence combining an embedding vector sequence representing the embedded representation of the tokens that make up the input sentence and the second speech feature vector w (or second speech feature vector w(1), ..., w(K)).

　図１に示す例では、「次の音声に含まれる人物の感情状態を教えて」との入力文が与えられたときに、「この人物は男性で、少しいらいらしています」との出力文が生成された場合を示している。なお、大規模言語モデル１５００としては、例えば、Ｌｌａｍａ　２　７Ｂ（参考文献３）等を用いることができる。ただし、入力文と第２の音声特徴ベクトルｗ（又は、第２の音声特徴ベクトルｗ（１），・・・，ｗ（Ｋ））とが与えられたときの事後確率に従って出力文を生成可能な言語モデルであれば、大規模言語モデル以外の言語モデルが大規模言語モデル１５００として用いられてもよい。 The example shown in Figure 1 shows a case where, when an input sentence of "Please tell me the emotional state of the person in the following speech," is given, an output sentence of "This person is male and is slightly irritated" is generated. Note that, for example, Llama 2 7B (Reference 3) or the like can be used as the large-scale language model 1500. However, a language model other than a large-scale language model may also be used as the large-scale language model 1500, as long as it is a language model that can generate an output sentence according to the posterior probability when given an input sentence and a second speech feature vector w (or second speech feature vectors w(1), ..., w(K)).

　以下、簡単のため、時間情報統合ブロック１３００で自己注意プーリング層が用いられるものとして、時間情報統合ブロック１３００では第２の統合ベクトルｖが得られ、線形変換層１４００では第２の音声特徴ベクトルｗが作成されるものとして説明する。ただし、時間情報統合ブロック１３００で畳み込みニューラルネットワークが用いられる場合、「第２の統合ベクトルｖ」を「第２の統合ベクトルｖ（１），・・・，ｖ（Ｋ）」、「第２の音声特徴ベクトルｗ」を「第２の音声特徴ベクトルｗ（１），・・・，ｗ（Ｋ）」と読み替えることにより、以下の実施形態を同様に適用することが可能である。 For simplicity, the following description will be given assuming that a self-attention pooling layer is used in the temporal information integration block 1300, that a second integrated vector v is obtained in the temporal information integration block 1300, and that a second audio feature vector w is created in the linear transformation layer 1400. However, if a convolutional neural network is used in the temporal information integration block 1300, the following embodiment can be similarly applied by replacing "second integrated vector v" with "second integrated vector v(1), ..., v(K)" and "second audio feature vector w" with "second audio feature vector w(1), ..., w(K)."

　なお、音声エンコーダ１１００は、例えば、「音声特徴抽出器」、「音声符号化器」等と呼ばれてもよい。また、大規模言語モデル１５００は、例えば、「言語モデル」、「自然言語モデル」、「自然言語処理モデル」等と呼ばれてもよい。更に、音声理解モデル１０００を構成する構成要素（音声エンコーダ１１００、音声エンコーダ出力統合ブロック１２００、時間情報統合ブロック１３００、線形変換層１４００、大規模言語モデル１５００）は、例えば、「モジュール」等と呼ばれてもよい。 Note that the speech encoder 1100 may be referred to as, for example, a "speech feature extractor" or "speech coder." Furthermore, the large-scale language model 1500 may be referred to as, for example, a "language model," a "natural language model," or a "natural language processing model." Furthermore, the components that make up the speech understanding model 1000 (speech encoder 1100, speech encoder output integration block 1200, temporal information integration block 1300, linear transformation layer 1400, and large-scale language model 1500) may be referred to as, for example, a "module."

　以下、図１に示す音声理解モデル１０００により音声理解技術を実現する音声理解装置１０について説明する。ここで、音声理解装置１０には、音声理解モデル１０００を構築する「モデル構築時」と、音声理解モデル１０００の学習対象パラメータを学習する「学習時」と、学習済みパラメータを用いて音声理解モデル１０００により出力文を生成する「推論時」とが存在する。学習時における音声理解装置１０には、入力音声とその入力音声に関する自然文の質問である入力文とその質問に対応する正解出力文との組で表される学習データの集合（以下、「学習データセット」ともいう。）が与えられる。一方で、推論時における音声理解装置１０には、入力音声とその入力音声に関する自然文の質問である入力文との組で表されるテストデータが与えられる。 The following describes a speech understanding device 10 that realizes speech understanding technology using the speech understanding model 1000 shown in Figure 1. Here, the speech understanding device 10 undergoes a "model construction" phase during which the speech understanding model 1000 is constructed, a "learning" phase during which parameters to be learned for the speech understanding model 1000 are learned, and an "inference" phase during which output sentences are generated by the speech understanding model 1000 using the learned parameters. During learning, the speech understanding device 10 is provided with a collection of training data (hereinafter also referred to as a "training dataset") represented by pairs of input speech, input sentences which are questions in natural language related to the input speech, and correct output sentences corresponding to the questions. On the other hand, during inference, the speech understanding device 10 is provided with test data represented by pairs of input speech and input sentences which are questions in natural language related to the input speech.

　なお、モデル構築時における音声理解装置１０は、例えば、「モデル構築装置」、「モデル作成装置」等と呼ばれてもよい。学習時における音声理解装置１０は、例えば、「学習装置」、「パラメータ推定装置」、「パラメータ最適化装置」等と呼ばれてもよい。推論時における音声理解装置１０は、例えば、「推論装置」、「推定装置」、「自然文生成装置」等と呼ばれてもよい。 Note that the speech understanding device 10 during model construction may be referred to as, for example, a "model construction device" or a "model creation device." The speech understanding device 10 during learning may be referred to as, for example, a "learning device," a "parameter estimation device," a "parameter optimization device," etc. The speech understanding device 10 during inference may be referred to as, for example, an "inference device," an "estimation device," a "natural sentence generation device," etc.

　以下では、簡単のため、モデル構築時は学習時に含まれるものとして、学習時における音声理解装置１０が音声理解モデル１０００の構築も行う場合について説明する。 In the following, for simplicity's sake, model construction is considered to be included in learning, and a case will be described in which the speech understanding device 10 also constructs the speech understanding model 1000 during learning.

　［学習時］
　以下、学習時における音声理解装置１０について説明する。 [Study]
The speech understanding device 10 during learning will be described below.

　＜学習時における音声理解装置１０のハードウェア構成例＞
　学習時における音声理解装置１０のハードウェア構成例について、図４を参照しながら説明する。図４は、学習時における音声理解装置１０のハードウェア構成の一例を示す図である。 <Example of hardware configuration of the speech understanding device 10 during learning>
An example of the hardware configuration of the speech understanding device 10 during learning will be described with reference to Fig. 4. Fig. 4 is a diagram showing an example of the hardware configuration of the speech understanding device 10 during learning.

　図４に示すように、学習時における音声理解装置１０は、入力装置１０１と、表示装置１０２と、外部Ｉ／Ｆ１０３と、通信Ｉ／Ｆ１０４と、ＲＡＭ（Random Access Memory）１０５と、ＲＯＭ（Read Only Memory）１０６と、補助記憶装置１０７と、プロセッサ１０８とを有する。これらの各ハードウェアは、それぞれがバス１０９を介して通信可能に接続される。 As shown in FIG. 4, the speech understanding device 10 during learning has an input device 101, a display device 102, an external I/F 103, a communication I/F 104, a RAM (Random Access Memory) 105, a ROM (Read Only Memory) 106, an auxiliary storage device 107, and a processor 108. Each of these pieces of hardware is connected to each other so that they can communicate with each other via a bus 109.

　入力装置１０１は、例えば、キーボード、マウス、タッチパネル、物理ボタン等である。表示装置１０２は、例えば、ディスプレイ、表示パネル等である。なお、音声理解装置１０は、例えば、入力装置１０１及び表示装置１０２のうちの少なくとも一方を有していなくてもよい。 The input device 101 is, for example, a keyboard, a mouse, a touch panel, a physical button, etc. The display device 102 is, for example, a display, a display panel, etc. Note that the speech understanding device 10 does not necessarily have to have at least one of the input device 101 and the display device 102, for example.

　外部Ｉ／Ｆ１０３は、記録媒体１０３ａ等の外部装置とのインタフェースである。記録媒体１０３ａとしては、例えば、ＣＤ（Compact Disc）、ＤＶＤ（Digital Versatile Disk）、ＳＤメモリカード（Secure Digital memory card）、ＵＳＢ（Universal Serial Bus）メモリカード等が挙げられる。 The external I/F 103 is an interface with external devices such as a recording medium 103a. Examples of recording media 103a include a CD (Compact Disc), a DVD (Digital Versatile Disk), an SD memory card (Secure Digital memory card), and a USB (Universal Serial Bus) memory card.

　通信Ｉ／Ｆ１０４は、通信ネットワークに接続するためのインタフェースである。ＲＡＭ１０５は、プログラムやデータを一時保持する揮発性の半導体メモリ（記憶装置）である。ＲＯＭ１０６は、電源を切ってもプログラムやデータを保持することができる不揮発性の半導体メモリ（記憶装置）である。補助記憶装置１０７は、例えば、ＨＤＤ（Hard Disk Drive）、ＳＳＤ（Solid State Drive）、フラッシュメモリ等の不揮発性の記憶装置である。プロセッサ１０８は、例えば、ＣＰＵ（Central Processing Unit）やＧＰＵ（Graphic Processing Unit）等の各種演算装置である。 The communication I/F 104 is an interface for connecting to a communication network. The RAM 105 is a volatile semiconductor memory (storage device) that temporarily stores programs and data. The ROM 106 is a non-volatile semiconductor memory (storage device) that can store programs and data even when the power is turned off. The auxiliary storage device 107 is a non-volatile storage device such as an HDD (Hard Disk Drive), SSD (Solid State Drive), or flash memory. The processor 108 is a variety of arithmetic devices such as a CPU (Central Processing Unit) or GPU (Graphics Processing Unit).

　なお、図４に示すハードウェア構成は一例であって、音声理解装置１０のハードウェア構成はこれに限られるものではない。例えば、音声理解装置１０は、複数の補助記憶装置１０７や複数のプロセッサ１０８を有していてもよいし、図示したハードウェアの一部を有していなくてもよいし、図示したハードウェア以外の種々のハードウェアを有していてもよい。 Note that the hardware configuration shown in FIG. 4 is an example, and the hardware configuration of the speech understanding device 10 is not limited to this. For example, the speech understanding device 10 may have multiple auxiliary storage devices 107 or multiple processors 108, may not have some of the hardware shown in the figure, or may have various hardware other than the hardware shown in the figure.

　＜学習時における音声理解装置１０の機能構成例＞
　学習時における音声理解装置１０の機能構成例について、図５を参照しながら説明する。図５は、学習時における音声理解装置１０の機能構成の一例を示す図である。 <Example of functional configuration of the speech understanding device 10 during learning>
An example of the functional configuration of the speech understanding device 10 during learning will be described with reference to Fig. 5. Fig. 5 is a diagram showing an example of the functional configuration of the speech understanding device 10 during learning.

　図５に示すように、学習時における音声理解装置１０は、モデル構築部２０１と、モデル学習部２０２とを有する。これら各部は、例えば、音声理解装置１０にインストールされた１以上のプログラムが、プロセッサ１０８等に実行させる処理により実現される。また、学習時における音声理解装置１０は、学習済み音声エンコーダ記憶部２０３と、学習済み大規模言語モデル記憶部２０４と、音声理解モデル記憶部２０５と、学習データセット記憶部２０６とを有する。これら各記憶部は、例えば、補助記憶装置１０７等の記憶領域により実現される。ただし、これら各記憶部のうちの少なくとも１つの記憶部が、音声理解装置１０と通信可能に接続される記憶装置（例えば、データベースサーバ等が有する記憶装置等）の記憶領域により実現されてもよい。 As shown in FIG. 5, the speech understanding device 10 during learning has a model construction unit 201 and a model learning unit 202. These units are realized, for example, by processing in which one or more programs installed in the speech understanding device 10 are executed by the processor 108 or the like. The speech understanding device 10 during learning also has a trained speech encoder storage unit 203, a trained large-scale language model storage unit 204, a speech understanding model storage unit 205, and a training dataset storage unit 206. Each of these storage units is realized, for example, by a storage area such as the auxiliary storage device 107. However, at least one of these storage units may also be realized by a storage area of a storage device (for example, a storage device included in a database server, etc.) that is communicatively connected to the speech understanding device 10.

　モデル構築部２０１は、学習済み音声エンコーダ記憶部２０３に記憶されている学習済み音声エンコーダと、学習済み大規模言語モデル記憶部２０４に記憶されている学習済み大規模言語モデルとを用いて、図１に示す音声理解モデル１０００を構築する。すなわち、モデル構築部２０１は、学習済み音声エンコーダを音声エンコーダ１１００、学習済み大規模言語モデルを大規模言語モデル１５００として、図１に示す音声理解モデル１０００を構築する。このとき、モデル構築部２０１は、音声エンコーダ出力統合ブロック１２００の学習対象パラメータと、時間情報統合ブロック１３００の学習対象パラメータと、線形変換層１４００の学習対象パラメータとを初期化する。学習対象パラメータは任意の方法により初期化すればよいが、例えば、ランダムに初期化する、所定の分布からサンプリングする、等により初期化すればよい。なお、学習済み音声エンコーダとは、パラメータが学習済みの音声エンコーダのことである。同様に、学習済み大規模言語モデルとは、パラメータが学習済みの大規模言語モデルのことである。 The model construction unit 201 constructs the speech understanding model 1000 shown in FIG. 1 using the trained speech encoder stored in the trained speech encoder storage unit 203 and the trained large-scale language model stored in the trained large-scale language model storage unit 204. That is, the model construction unit 201 constructs the speech understanding model 1000 shown in FIG. 1 using the trained speech encoder as the speech encoder 1100 and the trained large-scale language model as the large-scale language model 1500. At this time, the model construction unit 201 initializes the training target parameters of the speech encoder output integration block 1200, the training target parameters of the time information integration block 1300, and the training target parameters of the linear transformation layer 1400. The training target parameters may be initialized using any method, such as random initialization or sampling from a predetermined distribution. Note that a trained speech encoder is a speech encoder whose parameters have been trained. Similarly, a trained large-scale language model is a large-scale language model whose parameters have been trained.

　また、モデル構築部２０１は、音声理解モデル１０００を音声理解モデル記憶部２０５に保存する。 In addition, the model construction unit 201 stores the speech understanding model 1000 in the speech understanding model storage unit 205.

　モデル学習部２０２は、学習データセット記憶部２０６に記憶されている学習データセットを用いて、音声理解モデル記憶部２０５に記憶されている音声理解モデル１０００を学習する。このとき、モデル学習部２０２は、音声エンコーダ１１００及び大規模言語モデル１５００のパラメータは固定したままで、音声エンコーダ出力統合ブロック１２００の学習対象パラメータと、時間情報統合ブロック１３００の学習対象パラメータと、線形変換層１４００の学習対象パラメータとを学習する。より具体的には、モデル学習部２０２は、学習データに含まれる入力音声及び入力文を与えたときに音声理解モデル１０００によって生成される出力文と当該学習データに含まれる正解出力文との交差エントロピー等を損失関数として用いて、その損失関数を最小化するように、既存の最適化手法により学習対象パラメータを学習する。なお、モデル学習部２０２の詳細な機能構成例については後述する。 The model training unit 202 uses the training dataset stored in the training dataset storage unit 206 to train the speech understanding model 1000 stored in the speech understanding model storage unit 205. At this time, the model training unit 202 trains the training target parameters of the speech encoder output integration block 1200, the training target parameters of the temporal information integration block 1300, and the training target parameters of the linear transformation layer 1400, while keeping the parameters of the speech encoder 1100 and large-scale language model 1500 fixed. More specifically, the model training unit 202 uses, as a loss function, the cross entropy between the output sentence generated by the speech understanding model 1000 when given the input speech and input sentence contained in the training data and the correct output sentence contained in the training data, and trains the training target parameters using an existing optimization method so as to minimize the loss function. A detailed example of the functional configuration of the model training unit 202 will be described later.

　学習済み音声エンコーダ記憶部２０３は、学習済み音声エンコーダを記憶する。学習済み大規模言語モデル記憶部２０４は、学習済み大規模言語モデルを記憶する。音声理解モデル記憶部２０５は、モデル構築部２０１によって構築された音声理解モデル１０００を記憶する。学習データセット記憶部２０６は、与えられた学習データセットを記憶する。 The trained speech encoder storage unit 203 stores a trained speech encoder. The trained large-scale language model storage unit 204 stores a trained large-scale language model. The speech understanding model storage unit 205 stores the speech understanding model 1000 constructed by the model construction unit 201. The training dataset storage unit 206 stores a given training dataset.

　　≪学習データセット≫
　学習データセット記憶部２０６に記憶されている学習データセットの一例について、図６を参照しながら説明する。図６は、学習データセットの一例を示す図である。 <Learning dataset>
An example of the training data set stored in the training data set storage unit 206 will be described with reference to Fig. 6. Fig. 6 is a diagram showing an example of the training data set.

　図６に示すように、学習データセットは１以上の学習データで構成されており、各学習データには、入力音声と、入力文と、正解出力文とが含まれる。なお、一般に、学習データセットは多数の学習データで構成される。 As shown in Figure 6, a training dataset is made up of one or more training data, each of which includes input speech, an input sentence, and a correct output sentence. Generally, a training dataset is made up of a large number of training data.

　入力音声は、音声理解モデル１０００に入力される音声データである。入力文は、入力音声に関する自然文の質問を表すテキストデータである。正解出力文は、入力文が表す質問に対して正解となる自然文の応答又は回答を表すテキストデータである。なお、入力音声は、必ずしも人の声を収録した音声データである必要はなく、任意の音を収録した音声データであってもよい。また、正解出力文は、例えば、「教師データ」等と呼ばれてもよい。 The input speech is speech data input to the speech understanding model 1000. The input sentence is text data that represents a question in natural language related to the input speech. The correct output sentence is text data that represents a response or answer in natural language that is the correct answer to the question represented by the input sentence. Note that the input speech does not necessarily have to be audio data that records a human voice, but may also be audio data that records any sound. The correct output sentence may also be called, for example, "teaching data."

　例えば、図６に示す例の１行目の学習データには、入力音声「音声Ａ」と、入力文「この音声を書き起こして」と、正解出力文「どうなっているか説明して」とが含まれている。同様に、図６に示す例の２行目の学習データには、入力音声「音声Ａ」と、入力文「この音声の話し方を教えて」と、正解出力文「女性が早口で大きな声で話しています」とが含まれている。同様に、図６に示す例の３行目の学習データには、入力音声「音声Ｂ」と、入力文「この音声の話し方を教えて」と、正解出力文「ゆっくりと落ち着いた男性の話し方です」とが含まれている。同様に、図６に示す例の４行目の学習データには、入力音声「音声Ｂ」と、入力文「この音声の話者の性別を教えて」と、正解出力文「男性です」とが含まれている。同様に、図６に示す例の５行目の学習データには、入力音声「音声Ｃ」と、入力文「この音声の話者の感情はなに」と、正解出力文「この話者はややいらついています」とが含まれている。 6 includes the input speech "Speech A," the input sentence "Please transcribe this speech," and the correct output sentence "Please explain what is going on." Similarly, the training data for the second line of the example shown in FIG. 6 includes the input speech "Speech A," the input sentence "Please tell me how this speech is spoken," and the correct output sentence "A woman is speaking quickly and loudly." Similarly, the training data for the third line of the example shown in FIG. 6 includes the input speech "Speech B," the input sentence "Please tell me how this speech is spoken," and the correct output sentence "A man speaks slowly and calmly." Similarly, the training data for the fourth line of the example shown in FIG. 6 includes the input speech "Speech B," the input sentence "Please tell me the gender of the speaker of this speech," and the correct output sentence "It is a man." Similarly, the training data on the fifth line of the example shown in Figure 6 includes the input speech "Speech C," the input sentence "What is the emotion of the speaker of this speech?", and the correct output sentence "This speaker seems a little irritated."

　このように、学習データセットは、入力音声と入力文と正解出力文との組で表される学習データで構成されている。なお、図６に示すように、学習データセットには、同一の入力音声に対して異なる入力文及び正解出力文が含まれる複数の学習データが存在してもよい。 In this way, the training dataset is composed of training data represented by pairs of input speech, input sentences, and correct output sentences. Note that, as shown in Figure 6, the training dataset may contain multiple training data sets that contain different input sentences and correct output sentences for the same input speech.

　　≪モデル学習部２０２の詳細な機能構成例≫
　モデル学習部２０２の詳細な機能構成例について、図７を参照しながら説明する。図７は、モデル学習部２０２の詳細な機能構成の一例を示す図である。 <<Detailed Functional Configuration Example of Model Learning Unit 202>>
An example of a detailed functional configuration of the model learning unit 202 will be described with reference to Fig. 7. Fig. 7 is a diagram showing an example of a detailed functional configuration of the model learning unit 202.

　図７に示すように、モデル学習部２０２には、学習データ入力部２１１と、音声エンコード部２１２と、第１の統合部２１３と、第２の統合部２１４と、線形変換部２１５と、事後確率算出部２１６と、パラメータ更新部２１７と、終了判定部２１８とが含まれる。 As shown in FIG. 7, the model learning unit 202 includes a learning data input unit 211, an audio encoding unit 212, a first integration unit 213, a second integration unit 214, a linear transformation unit 215, a posterior probability calculation unit 216, a parameter update unit 217, and an end determination unit 218.

　学習データ入力部２１１は、学習データセット記憶部２０６に記憶されている学習データセットから１件の学習データを入力する。 The learning data input unit 211 inputs one piece of learning data from the learning dataset stored in the learning dataset storage unit 206.

　音声エンコード部２１２は、音声理解モデル１０００に含まれる音声エンコーダ１１００により実現される。音声エンコード部２１２は、学習データ入力部２１１によって入力された学習データに含まれる入力音声を入力として、各時間区間ｔ（ｔ＝１，・・・，Ｔ）でＮ個の統合対象層からＮ個の第１の音声特徴ベクトルｈ_ｎ（ｔ）（ｎ＝１，・・・，Ｎ）をそれぞれ出力する。 The speech encoding unit 212 is realized by the speech encoder 1100 included in the speech understanding model 1000. The speech encoding unit 212 receives input speech included in the training data input by the training data input unit 211, and outputs N first speech feature vectors h _n (t) (n=1, ..., N) from the N integration target layers in each time interval t (t=1, ..., T).

　第１の統合部２１３は、音声理解モデル１０００に含まれる音声エンコーダ出力統合ブロック１２００により実現される。第１の統合部２１３は、各時間区間ｔ（ｔ＝１，・・・，Ｔ）において、その時間区間ｔにおける第１の音声特徴ベクトルｈ_ｎ（ｔ）（ｎ＝１，・・・，Ｎ）を入力として、これらの各第１の音声特徴ベクトルｈ_ｎ（ｔ）（ｎ＝１，・・・，Ｎ）を統合した第１の統合ベクトルｅ（ｔ）をそれぞれ出力する。 The first integration unit 213 is realized by the speech encoder output integration block 1200 included in the speech understanding model 1000. In each time interval t (t=1, ..., T), the first integration unit 213 receives the first speech feature vectors h _n (t) (n=1, ..., N) in that time interval t as input, and outputs a first integrated vector e(t) obtained by integrating each of these first speech feature vectors h _n (t) (n=1, ..., N).

　第２の統合部２１４は、音声理解モデル１０００に含まれる時間情報統合ブロック１３００により実現される。第２の統合部２１４は、各時間区間ｔにおける第１の統合ベクトルｅ（ｔ）を入力として、これらの各第１の統合ベクトルｅ（ｔ）（ｔ＝１，・・・，Ｔ）を時間方向に統合した第２の統合ベクトルｖを出力する。 The second integration unit 214 is realized by the time information integration block 1300 included in the speech understanding model 1000. The second integration unit 214 receives the first integrated vector e(t) for each time interval t as input, and outputs the second integrated vector v obtained by integrating each of these first integrated vectors e(t) (t = 1, ..., T) in the time direction.

　線形変換部２１５は、音声理解モデル１０００に含まれる線形変換層１４００により実現される。線形変換部２１５は、第２の統合ベクトルｖを入力として、第２の統合ベクトルｖを線形変換した第２の音声特徴ベクトルｗを出力する。 The linear transformation unit 215 is realized by the linear transformation layer 1400 included in the speech understanding model 1000. The linear transformation unit 215 receives the second integrated vector v as input and outputs the second speech feature vector w obtained by linearly transforming the second integrated vector v.

　事後確率算出部２１６は、音声理解モデル１０００に含まれる大規模言語モデル１５００により実現される。事後確率算出部２１６は、学習データ入力部２１１によって入力された学習データに含まれる入力文と、第２の音声特徴ベクトルｗとを入力として、当該入力文と当該第２の音声特徴ベクトルｗとが与えられたときの出力文の事後確率を算出する。 The posterior probability calculation unit 216 is realized by the large-scale language model 1500 included in the speech understanding model 1000. The posterior probability calculation unit 216 receives as input an input sentence included in the training data input by the training data input unit 211 and a second speech feature vector w, and calculates the posterior probability of an output sentence when the input sentence and the second speech feature vector w are given.

　より具体的には、出力文を構成するｉ番目のトークンをｓ_ｉとする（ただし、ｓ_１は文頭を表すトークンであるものとする。）。また、当該入力文と当該第２の音声特徴ベクトルｗとが与えられたときにトークンｓ_１が生成される事後確率をｐ（ｓ_１）、当該入力文と当該第２の音声特徴ベクトルｗとｓ_１，・・・，ｓ_ｉ－１とが与えられたときにトークンｓ_ｉが生成される事後確率をｐ（ｓ_ｉ）（ただし、ｉ≧２）とする。このとき、事後確率算出部２１６は、ｉ＝１，・・・，Ｉに対して、事後確率ｐ（ｓ_ｉ）を算出する。Ｉは出力文に含まれるトークン数（つまり、出力文の長さ）であり、例えば、学習データ入力部２１１によって入力された学習データに含まれる正解出力文の長さとすればよい。ここで、事後確率ｐ（ｓ_ｉ）は、例えば、トークンの埋め込み空間の次元数をＭとしたとき、ｍ種類目のトークンが生成される確率をｍ番目の要素に持ち、かつ、すべての要素の値の和が１となるＭ次元ベクトルで表される。 More specifically, the i-th token constituting the output sentence is defined as s _i (where s ₁ is the token representing the beginning of the sentence). Furthermore, the posterior probability that token s ₁ is generated when the input sentence and the second speech feature vector w are given is defined as p(s ₁ ), and the posterior probability that token s _i is generated when the input sentence, the second speech feature vector w, and s ₁ , ..., s _i-1 are given is defined as p(s _i ) (where i≧2). In this case, the posterior probability calculation unit 216 calculates the posterior probability p(s _i ) for i=1, ..., I. I is the number of tokens included in the output sentence (i.e., the length of the output sentence), and may be, for example, the length of the correct output sentence included in the training data input by the training data input unit 211. Here, the posterior probability p(s _i ) is expressed as an M-dimensional vector in which, for example, when the number of dimensions of the token embedding space is M, the m-th element represents the probability that the m-th type of token is generated, and the sum of the values of all elements is 1.

　なお、トークンとは、大規模言語モデル等の言語モデルが文字列を処理する際に基本となる処理単位のことである。トークンの典型例は単語であるが、トークンは単語に限られるものではなく、例えば、文字、形態素、サブワード、或るまとまりのある文字列等でもよい。 Note that a token is the basic processing unit used when a language model, such as a large-scale language model, processes a string of characters. A typical example of a token is a word, but tokens are not limited to words and can also be, for example, characters, morphemes, subwords, or coherent strings of characters.

　パラメータ更新部２１７は、事後確率算出部２１６によって算出された事後確率と、学習データ入力部２１１によって入力された学習データに含まれる正解出力文とを用いて、既存の最適化手法により音声理解モデル１０００の学習対象パラメータを学習する。 The parameter update unit 217 uses the posterior probabilities calculated by the posterior probability calculation unit 216 and the correct output sentences included in the learning data input by the learning data input unit 211 to learn the learning target parameters of the speech understanding model 1000 using an existing optimization method.

　より具体的には、正解出力文を構成するｉ番目のトークンをｓ_ｉ'とする（ただし、ｓ_１'は文頭を表すトークン、ｓ_Ｉ'は文末を表すトークンであるものとする。）。また、トークンｓ_ｉ'が生成される確率をｐ（ｓ_ｉ'）とする。確率ｐ（ｓ_ｉ'）は、例えば、トークンの埋め込み空間の次元数をＭとしたとき、トークンｓ_ｉ'に対応する要素の値のみ１、それ以外の要素の値は０であるＭ次元ベクトルで表される。このとき、パラメータ更新部２１７は、ｉ＝１，・・・，Ｉに関する－ｐ（ｓ_ｉ'）ｌｏｇｐ（ｓ_ｉ）の和（つまり、交差エントロピー）を損失関数として用いて、その損失関数を最小化するように、既存の最適化手法により学習対象パラメータを更新する。なお、学習対象パラメータを更新する際に利用可能な最適化手法は特定の手法に限定されるものではないが、例えば、確率的勾配降下法に基づくオンライン最適化手法を用いることが可能である。 More specifically, the i-th token constituting the correct output sentence is denoted by s _i ' (where s ₁ ' is a token representing the beginning of the sentence, and s _I ' is a token representing the end of the sentence). Furthermore, the probability that token s _i ' is generated is denoted by p(s _i '). For example, when the number of dimensions of the token embedding space is M, the probability p(s _i ') is expressed as an M-dimensional vector in which only the element corresponding to token s _i ' has a value of 1 and the other elements have a value of 0. In this case, the parameter update unit 217 uses the sum of -p(s _i ')logp(s _i ) for i = 1, ..., I (i.e., cross entropy) as a loss function, and updates the training parameters using an existing optimization method so as to minimize the loss function. The optimization method that can be used to update the training parameters is not limited to a specific method, but an online optimization method based on stochastic gradient descent, for example, can be used.

　終了判定部２１８は、学習対象パラメータの更新を終了するか否かを判定する。このとき、終了判定部２１８は、予め決められた所定の終了条件を満たす場合は学習対象パラメータの更新を終了すると判定し、そうでない場合は学習対象パラメータの更新を終了しないと判定する。これにより、所定の終了条件を満たすまで、音声エンコーダ出力統合ブロック１２００の学習対象パラメータと、時間情報統合ブロック１３００の学習対象パラメータと、線形変換層１４００の学習対象パラメータとが繰り返し更新される。ここで、所定の終了条件としては、例えば、学習対象パラメータの更新回数が所定の回数以上となったこと、エポック数が所定のエポック数以上となったこと、損失関数の値が所定の値未満となったこと、損失関数が収束したこと、等が挙げられる。 The termination determination unit 218 determines whether to terminate the updating of the training parameters. At this time, the termination determination unit 218 determines to terminate the updating of the training parameters if a predetermined termination condition is met, and determines not to terminate the updating of the training parameters if not. As a result, the training parameters of the audio encoder output integrated block 1200, the training parameters of the temporal information integrated block 1300, and the training parameters of the linear transformation layer 1400 are repeatedly updated until the predetermined termination condition is met. Here, examples of the predetermined termination condition include the training parameters being updated a predetermined number of times or more, the number of epochs being a predetermined number of epochs or more, the value of the loss function being less than a predetermined value, the loss function converging, etc.

　＜モデル構築処理＞
　以下、モデル構築処理について、図８を参照しながら説明する。図８は、モデル構築処理の一例を示すフローチャートである。 <Model Building Process>
The model construction process will be described below with reference to Fig. 8. Fig. 8 is a flowchart showing an example of the model construction process.

　モデル構築部２０１は、学習済み音声エンコーダ記憶部２０３に記憶されている学習済み音声エンコーダと、学習済み大規模言語モデル記憶部２０４に記憶されている学習済み大規模言語モデルとを用いて、図１に示す音声理解モデル１０００を構築する（ステップＳ１０１）。なお、このとき、モデル構築部２０１は、音声エンコーダ出力統合ブロック１２００の学習対象パラメータと、時間情報統合ブロック１３００の学習対象パラメータと、線形変換層１４００の学習対象パラメータとを任意の手法により初期化する。これにより、学習済みでない音声理解モデル１０００が構築される。 The model construction unit 201 constructs the speech understanding model 1000 shown in FIG. 1 using the trained speech encoder stored in the trained speech encoder storage unit 203 and the trained large-scale language model stored in the trained large-scale language model storage unit 204 (step S101). At this time, the model construction unit 201 initializes the training parameters of the speech encoder output integration block 1200, the training parameters of the time information integration block 1300, and the training parameters of the linear transformation layer 1400 using any method. This constructs an untrained speech understanding model 1000.

　そして、モデル構築部２０１は、上記のステップＳ１０１で構築された音声理解モデル１０００に音声理解モデル記憶部２０５に保存する（ステップＳ１０２）。 Then, the model construction unit 201 stores the speech understanding model 1000 constructed in step S101 above in the speech understanding model storage unit 205 (step S102).

　＜モデル学習処理＞
　以下、モデル学習処理について、図９を参照しながら説明する。図９は、モデル学習処理の一例を示すフローチャートである。 <Model learning process>
The model learning process will be described below with reference to Fig. 9. Fig. 9 is a flowchart showing an example of the model learning process.

　モデル学習部２０２の学習データ入力部２１１は、学習データセット記憶部２０６に記憶されている学習データセットから１件の学習データを入力する（ステップＳ２０１）。学習データ入力部２１１は、例えば、学習データセットを構成する学習データのうち、現在のエポック数で未だ入力されていない１件の学習データを入力する。なお、エポック数は０から開始し、学習データセットを構成するすべての学習データが入力される毎に１ずつ加算される。 The learning data input unit 211 of the model learning unit 202 inputs one piece of learning data from the learning dataset stored in the learning dataset storage unit 206 (step S201). The learning data input unit 211 inputs, for example, one piece of learning data that has not yet been input for the current number of epochs from the learning data that make up the learning dataset. The epoch number starts from 0 and is incremented by 1 each time all of the learning data that make up the learning dataset is input.

　モデル学習部２０２の音声エンコード部２１２は、上記のステップＳ２０１で入力された学習データに含まれる入力音声を入力として、各時間区間ｔ（ｔ＝１，・・・，Ｔ）でＮ個の統合対象層からＮ個の第１の音声特徴ベクトルｈ_ｎ（ｔ）（ｎ＝１，・・・，Ｎ）をそれぞれ出力する（ステップＳ２０２）。 The speech encoding unit 212 of the model training unit 202 takes the input speech included in the training data input in step S201 above as input, and outputs N first speech feature vectors h _n (t) (n = 1, ..., N) from the N integration target layers in each time interval t (t = 1, ..., T) (step S202).

　モデル学習部２０２の第１の統合部２１３は、各時間区間ｔ（ｔ＝１，・・・，Ｔ）において、その時間区間ｔにおける第１の音声特徴ベクトルｈ_ｎ（ｔ）（ｎ＝１，・・・，Ｎ）を入力として、これらの各第１の音声特徴ベクトルｈ_ｎ（ｔ）（ｎ＝１，・・・，Ｎ）を統合した第１の統合ベクトルｅ（ｔ）をそれぞれ出力する（ステップＳ２０３）。 The first integration unit 213 of the model learning unit 202 takes as input the first speech feature vector h _n (t) (n = 1, ..., N) for each time interval t (t = 1, ..., T) and outputs a first integrated vector e(t) by integrating each of these first speech feature vectors h _n (t) (n = 1, ..., N) (step S203).

　モデル学習部２０２の第２の統合部２１４は、各時間区間ｔにおける第１の統合ベクトルｅ（ｔ）を入力として、これらの各第１の統合ベクトルｅ（ｔ）（ｔ＝１，・・・，Ｔ）を時間方向に統合した第２の統合ベクトルｖを出力する（ステップＳ２０４）。 The second integration unit 214 of the model learning unit 202 receives the first integrated vector e(t) for each time interval t as input, and outputs a second integrated vector v obtained by integrating each of these first integrated vectors e(t) (t = 1, ..., T) in the time direction (step S204).

　モデル学習部２０２の線形変換部２１５は、第２の統合ベクトルｖを入力として、第２の統合ベクトルｖを線形変換した第２の音声特徴ベクトルｗを出力する（ステップＳ２０５）。 The linear transformation unit 215 of the model learning unit 202 receives the second integrated vector v as input and outputs the second speech feature vector w obtained by linearly transforming the second integrated vector v (step S205).

　モデル学習部２０２の事後確率算出部２１６は、上記のステップＳ２０１で入力された学習データに含まれる入力文と、上記のステップＳ２０５で出力された第２の音声特徴ベクトルｗとを入力として、当該入力文と当該第２の音声特徴ベクトルｗとが与えられたときの出力文の事後確率を算出する（ステップＳ２０６）。 The posterior probability calculation unit 216 of the model training unit 202 receives as input the input sentence included in the training data input in step S201 above and the second speech feature vector w output in step S205 above, and calculates the posterior probability of the output sentence when the input sentence and the second speech feature vector w are given (step S206).

　モデル学習部２０２のパラメータ更新部２１７は、上記のステップＳ２０６で算出された事後確率と、上記のステップＳ２０１で入力された学習データに含まれる正解出力文とを用いて、既存の最適化手法により音声理解モデル１０００の学習対象パラメータを学習する（ステップＳ２０７）。すなわち、パラメータ更新部２１７は、上記のステップＳ２０１で入力された学習データに含まれる入力音声及び入力文を与えたときに音声理解モデル１０００によって生成される出力文と当該学習データに含まれる正解出力文との交差エントロピー（具体的には、ｉ＝１，・・・，Ｉに関する－ｐ（ｓ_ｉ'）ｌｏｇｐ（ｓ_ｉ）の和）を損失関数として用いて、その損失関数を最小化するように、既存の最適化手法により学習対象パラメータを更新する。 The parameter update unit 217 of the model training unit 202 uses the posterior probability calculated in step S206 and the correct output sentence included in the training data input in step S201 to train the training parameters of the speech understanding model 1000 by an existing optimization method (step S207). That is, the parameter update unit 217 uses the cross entropy (specifically, the sum of -p(s i ')logp(s i ) for i = 1, ..., _I ) between the output sentence generated by the speech understanding model 1000 when the input speech and input sentence included in the training data input in step _S201 are given and the correct output sentence included in the training data as a loss function, and updates the training parameters by an existing optimization method so as to minimize the loss function.

　モデル学習部２０２の終了判定部２１８は、学習対象パラメータの更新を終了するか否かを判定する（ステップＳ２０８）。すなわち、終了判定部２１８は、予め決められた所定の終了条件を満たす場合は学習対象パラメータの更新を終了すると判定し、そうでない場合は学習対象パラメータの更新を終了しないと判定する。 The termination determination unit 218 of the model learning unit 202 determines whether to terminate the update of the learning parameter (step S208). That is, if a predetermined termination condition is met, the termination determination unit 218 determines to terminate the update of the learning parameter, and if not, determines not to terminate the update of the learning parameter.

　上記のステップＳ２０８で学習対象パラメータの更新を終了しないと判定された場合、モデル学習部２０２は、上記のステップＳ２０１に戻る。これにより、所定の終了条件を満たすまで、上記のステップＳ２０１～ステップＳ２０７が繰り返し実行される。 If it is determined in step S208 above that the update of the learning target parameters should not be terminated, the model learning unit 202 returns to step S201 above. As a result, steps S201 to S207 above are repeatedly executed until a predetermined termination condition is met.

　一方で、上記のステップＳ２０８で学習対象パラメータの更新を終了すると判定された場合、モデル学習部２０２は、モデル学習処理を終了する。これにより、学習対象パラメータが学習され、学習済み音声理解モデル１０００が得られる。 On the other hand, if it is determined in step S208 above that the update of the learning target parameters is to be terminated, the model learning unit 202 terminates the model learning process. As a result, the learning target parameters are learned and the trained speech understanding model 1000 is obtained.

　［推論時］
　以下、推論時における音声理解装置１０について説明する。なお、以下では、主に、学習時との相違点について説明し、学習時と同様としてよい箇所については、適宜、その説明を省略する。 [At the time of inference]
The following describes the speech understanding device 10 during inference. Note that the following mainly describes differences from the learning period, and omits explanations of points that may be the same as the learning period, as appropriate.

　＜推論時における音声理解装置１０のハードウェア構成例＞
　推論時における音声理解装置１０のハードウェア構成は、学習時と同様としてよいため、その説明を省略する。 <Example of hardware configuration of the speech understanding device 10 during inference>
The hardware configuration of the speech understanding device 10 during inference may be the same as during learning, and therefore a description thereof will be omitted.

　＜推論時における音声理解装置１０の機能構成例＞
　推論時における音声理解装置１０の機能構成例について、図１０を参照しながら説明する。図１０は、推論時における音声理解装置１０の機能構成の一例を示す図である。 <Example of functional configuration of the speech understanding device 10 during inference>
An example of the functional configuration of the speech understanding device 10 at the time of inference will be described with reference to Fig. 10. Fig. 10 is a diagram showing an example of the functional configuration of the speech understanding device 10 at the time of inference.

　図１０に示すように、推論時における音声理解装置１０は、出力文生成部２０７を有する。出力文生成部２０７は、例えば、音声理解装置１０にインストールされた１以上のプログラムが、プロセッサ１０８等に実行させる処理により実現される。また、推論時における音声理解装置１０は、学習済み音声理解モデル記憶部２０８と、テストデータ記憶部２０９とを有する。これら各記憶部は、例えば、補助記憶装置１０７等の記憶領域により実現される。ただし、これら各記憶部のうちの少なくとも１つの記憶部が、音声理解装置１０と通信可能に接続される記憶装置（例えば、データベースサーバ等が有する記憶装置等）の記憶領域により実現されてもよい。 As shown in FIG. 10, the speech understanding device 10 during inference has an output sentence generation unit 207. The output sentence generation unit 207 is realized, for example, by processing in which one or more programs installed in the speech understanding device 10 are executed by the processor 108 or the like. The speech understanding device 10 during inference also has a trained speech understanding model storage unit 208 and a test data storage unit 209. Each of these storage units is realized, for example, by a storage area of the auxiliary storage device 107 or the like. However, at least one of these storage units may also be realized by a storage area of a storage device (for example, a storage device included in a database server or the like) that is communicatively connected to the speech understanding device 10.

　出力文生成部２０７は、テストデータ記憶部２０９に記憶されているテストデータと、学習済み音声理解モデル記憶部２０８に記憶されている学習済み音声理解モデル１０００とを用いて、そのテストデータに含まれる入力文が表す質問に対応する出力文（つまり、その質問に対する自然文の応答又は回答を表すテキストデータ）を生成及び出力する。ここで、テストデータとは、入力音声とその入力音声に関する自然文の質問である入力文との組で表されるデータのことである。また、学習済み音声理解モデル１０００とは、学習対象パラメータが学習済みの音声理解モデル１０００のことである。なお、出力文生成部２０７の詳細な機能構成例については後述する。 The output sentence generation unit 207 uses the test data stored in the test data storage unit 209 and the trained speech understanding model 1000 stored in the trained speech understanding model storage unit 208 to generate and output an output sentence corresponding to the question expressed by the input sentence included in the test data (i.e., text data representing a natural language response or answer to the question). Here, test data refers to data represented by a pair of input speech and an input sentence that is a natural language question related to the input speech. Furthermore, the trained speech understanding model 1000 refers to a speech understanding model 1000 whose learning target parameters have been trained. A detailed example of the functional configuration of the output sentence generation unit 207 will be described later.

　学習済み音声理解モデル記憶部２０８は、学習済み音声理解モデル１０００を記憶する。テストデータ記憶部２０９は、与えられたテストデータを記憶する。 The trained speech understanding model storage unit 208 stores the trained speech understanding model 1000. The test data storage unit 209 stores the given test data.

　　≪出力文生成部２０７の詳細な機能構成例≫
　出力文生成部２０７の詳細な機能構成例について、図１１を参照しながら説明する。図１１は、出力文生成部２０７の詳細な機能構成の一例を示す図である。 <<Detailed Functional Configuration Example of Output Sentence Generation Unit 207>>
An example of a detailed functional configuration of the output sentence generation unit 207 will be described with reference to Fig. 11. Fig. 11 is a diagram showing an example of a detailed functional configuration of the output sentence generation unit 207.

　図１１に示すように、出力文生成部２０７には、テストデータ入力部２２１と、音声エンコード部２２２と、第１の統合部２２３と、第２の統合部２２４と、線形変換部２２５と、生成部２２６と、出力部２２７とが含まれる。 As shown in FIG. 11, the output sentence generation unit 207 includes a test data input unit 221, a speech encoding unit 222, a first integration unit 223, a second integration unit 224, a linear conversion unit 225, a generation unit 226, and an output unit 227.

　テストデータ入力部２２１は、テストデータ記憶部２０９に記憶されている１件のテストデータを入力する。 The test data input unit 221 inputs one piece of test data stored in the test data storage unit 209.

　音声エンコード部２２２は、学習済み音声理解モデル１０００に含まれる音声エンコーダ１１００により実現される。音声エンコード部２２２は、テストデータ入力部２２１によって入力されたテストデータに含まれる入力音声を入力として、各時間区間ｔ（ｔ＝１，・・・，Ｔ）でＮ個の統合対象層からＮ個の第１の音声特徴ベクトルｈ_ｎ（ｔ）（ｎ＝１，・・・，Ｎ）をそれぞれ出力する。 The speech encoding unit 222 is realized by the speech encoder 1100 included in the trained speech understanding model 1000. The speech encoding unit 222 receives input speech included in the test data input by the test data input unit 221, and outputs N first speech feature vectors h _n (t) (n=1, ..., N) from the N integration target layers in each time interval t (t=1, ..., T).

　第１の統合部２２３は、学習済み音声理解モデル１０００に含まれる音声エンコーダ出力統合ブロック１２００により実現される。第１の統合部２２３は、各時間区間ｔ（ｔ＝１，・・・，Ｔ）において、その時間区間ｔにおける第１の音声特徴ベクトルｈ_ｎ（ｔ）（ｎ＝１，・・・，Ｎ）を入力として、これらの各第１の音声特徴ベクトルｈ_ｎ（ｔ）（ｎ＝１，・・・，Ｎ）を統合した第１の統合ベクトルｅ（ｔ）をそれぞれ出力する。 The first integration unit 223 is realized by the speech encoder output integration block 1200 included in the trained speech understanding model 1000. In each time interval t (t=1, ..., T), the first integration unit 223 receives the first speech feature vectors h _n (t) (n=1, ..., N) in that time interval t as input, and outputs a first integrated vector e(t) obtained by integrating each of these first speech feature vectors h _n (t) (n=1, ..., N).

　第２の統合部２２４は、学習済み音声理解モデル１０００に含まれる時間情報統合ブロック１３００により実現される。第２の統合部２２４は、各時間区間ｔにおける第１の統合ベクトルｅ（ｔ）を入力として、これらの各第１の統合ベクトルｅ（ｔ）（ｔ＝１，・・・，Ｔ）を時間方向に統合した第２の統合ベクトルｖを出力する。 The second integration unit 224 is realized by the time information integration block 1300 included in the trained speech understanding model 1000. The second integration unit 224 receives the first integrated vector e(t) for each time interval t as input, and outputs the second integrated vector v obtained by integrating each of these first integrated vectors e(t) (t = 1, ..., T) in the time direction.

　線形変換部２２５は、学習済み音声理解モデル１０００に含まれる線形変換層１４００により実現される。線形変換部２２５は、第２の統合ベクトルｖを入力として、第２の統合ベクトルｖを線形変換した第２の音声特徴ベクトルｗを出力する。 The linear transformation unit 225 is realized by the linear transformation layer 1400 included in the trained speech understanding model 1000. The linear transformation unit 225 receives the second integrated vector v as input and outputs the second speech feature vector w obtained by linearly transforming the second integrated vector v.

　生成部２２６は、学習済み音声理解モデル１０００に含まれる大規模言語モデル１５００により実現される。生成部２２６は、テストデータ入力部２２１によって入力されたテストデータに含まれる入力文と、第２の音声特徴ベクトルｗとを入力として、当該入力文と当該第２の音声特徴ベクトルｗとが与えられたときの出力文を生成する。 The generation unit 226 is realized by the large-scale language model 1500 included in the trained speech understanding model 1000. The generation unit 226 receives as input an input sentence included in the test data input by the test data input unit 221 and a second speech feature vector w, and generates an output sentence when the input sentence and the second speech feature vector w are given.

　より具体的には、出力文を構成するｉ番目のトークンをｓ_ｉとする。また、当該入力文と当該第２の音声特徴ベクトルｗとが与えられたときにトークンｓ_１が生成される事後確率をｐ（ｓ_１）、当該入力文と当該第２の音声特徴ベクトルｗとｓ_１，・・・，ｓ_ｉ－１とが与えられたときにトークンｓ_ｉが生成される事後確率をｐ（ｓ_ｉ）（ただし、ｉ≧２）とする。このとき、生成部２２６は、例えば、文末を表すトークンが生成されるまで、事後確率ｐ（ｓ_ｉ）に従ってトークンｓ_ｉを生成することにより出力文を生成する。 More specifically, the i-th token constituting the output sentence is denoted by s _i . Furthermore, the posterior probability that token s _i will be generated when the input sentence and the second speech feature vector w are given is denoted by p(s _i ), and the posterior probability that token s _i will be generated when the input sentence, the second speech feature vector w, and s ₁ , ..., s _i-1 are given is denoted by p(s _i ) (where i≧2). In this case, the generation unit 226 generates the output sentence by generating token s _i in accordance with the posterior probability p(s _i ) until, for example, a token representing the end of the sentence is generated.

　出力部２２７は、生成部２２６によって生成された出力文を予め決められた所定の出力先に出力する。ここで、所定の出力先としては、例えば、補助記憶装置１０７等の記憶領域、ディスプレイ等の表示装置１０２、通信可能に接続される他の装置や機器、等が挙げられる。 The output unit 227 outputs the output sentence generated by the generation unit 226 to a predetermined output destination. Here, examples of the predetermined output destination include a storage area such as the auxiliary storage device 107, a display device 102 such as a display, other devices or equipment that are communicatively connected, etc.

　＜出力文生成処理＞
　以下、出力文生成処理について、図１２を参照しながら説明する。図１２は、出力文生成処理の一例を示すフローチャートである。 <Output sentence generation process>
The output sentence generation process will be described below with reference to Fig. 12. Fig. 12 is a flowchart showing an example of the output sentence generation process.

　出力文生成部２０７のテストデータ入力部２２１は、テストデータ記憶部２０９に記憶されている１件のテストデータを入力する（ステップＳ３０１）。 The test data input unit 221 of the output statement generation unit 207 inputs one piece of test data stored in the test data storage unit 209 (step S301).

　出力文生成部２０７の音声エンコード部２２２は、上記のステップＳ３０１で入力されたテストデータに含まれる入力音声を入力として、各時間区間ｔ（ｔ＝１，・・・，Ｔ）でＮ個の統合対象層からＮ個の第１の音声特徴ベクトルｈ_ｎ（ｔ）（ｎ＝１，・・・，Ｎ）をそれぞれ出力する（ステップＳ３０２）。 The speech encoding unit 222 of the output sentence generation unit 207 uses the input speech contained in the test data input in step S301 above as input, and outputs N first speech feature vectors h _n (t) (n = 1, ..., N) from the N integration target layers in each time interval t (t = 1, ..., T) (step S302).

　出力文生成部２０７の第１の統合部２２３は、各時間区間ｔ（ｔ＝１，・・・，Ｔ）において、その時間区間ｔにおける第１の音声特徴ベクトルｈ_ｎ（ｔ）（ｎ＝１，・・・，Ｎ）を入力として、これらの各第１の音声特徴ベクトルｈ_ｎ（ｔ）（ｎ＝１，・・・，Ｎ）を統合した第１の統合ベクトルｅ（ｔ）をそれぞれ出力する（ステップＳ３０３）。 The first integration unit 223 of the output sentence generation unit 207 takes the first speech feature vector h _n (t) (n = 1, ..., N) for each time interval t (t = 1, ..., T) as input, and outputs a first integrated vector e(t) by integrating each of these first speech feature vectors h _n (t) (n = 1, ..., N) (step S303).

　出力文生成部２０７の第２の統合部２２４は、各時間区間ｔにおける第１の統合ベクトルｅ（ｔ）を入力として、これらの各第１の統合ベクトルｅ（ｔ）（ｔ＝１，・・・，Ｔ）を時間方向に統合した第２の統合ベクトルｖを出力する（ステップＳ３０４）。 The second integration unit 224 of the output sentence generation unit 207 receives the first integrated vector e(t) for each time interval t as input, and outputs a second integrated vector v obtained by integrating each of these first integrated vectors e(t) (t = 1, ..., T) in the time direction (step S304).

　出力文生成部２０７の線形変換部２２５は、第２の統合ベクトルｖを入力として、第２の統合ベクトルｖを線形変換した第２の音声特徴ベクトルｗを出力する（ステップＳ３０５）。 The linear transformation unit 225 of the output sentence generation unit 207 receives the second integrated vector v as input and outputs the second speech feature vector w obtained by linearly transforming the second integrated vector v (step S305).

　出力文生成部２０７の生成部２２６は、上記のステップＳ３０１で入力されたテストデータに含まれる入力文と、上記のステップＳ３０５で出力された第２の音声特徴ベクトルｗとを入力として、当該入力文と当該第２の音声特徴ベクトルｗとが与えられたときの出力文を生成する（ステップＳ３０６）。 The generation unit 226 of the output sentence generation unit 207 receives as input the input sentence included in the test data input in step S301 above and the second speech feature vector w output in step S305 above, and generates an output sentence when the input sentence and the second speech feature vector w are given (step S306).

　出力文生成部２０７の出力部２２７は、上記のステップＳ３０６で生成された出力文を予め決められた所定の出力先に出力する（ステップＳ３０７）。これにより、入力音声に関する質問の応答又は回答となる出力文が得られる。 The output unit 227 of the output sentence generation unit 207 outputs the output sentence generated in step S306 above to a predetermined output destination (step S307). This results in an output sentence that is a response or answer to the question related to the input voice.

　＜まとめ＞
　以上のように、本実施形態に係る音声理解装置１０は、音声エンコーダ１１００と大規模言語モデル１５００との間に音声エンコーダ出力統合ブロック１２００及び時間情報統合ブロック１３００が存在する音声理解モデル１０００により、音声理解技術を実現することができる。このため、本実施形態に係る音声理解装置１０を用いることにより、例えば、非言語情報・パラ言語情報の認識結果を利用する後段のシステムにおける処理精度の向上が期待できる。 <Summary>
As described above, the speech understanding device 10 according to this embodiment can realize speech understanding technology using the speech understanding model 1000 in which the speech encoder output integration block 1200 and the time information integration block 1300 are present between the speech encoder 1100 and the large-scale language model 1500. For this reason, by using the speech understanding device 10 according to this embodiment, it is possible to expect an improvement in the processing accuracy of a downstream system that uses the recognition results of non-linguistic information and paralinguistic information, for example.

　以上の実施形態に関して、更に以下の付記を開示する。
（付記１）
　メモリと、
　前記メモリに接続された少なくとも１つのプロセッサと、
　を含み、
　前記プロセッサは、
　音声と、前記音声に関する第１の文と、前記第１の文に対応する第２の文とが含まれる学習データを入力し、
　複数の層で構成された音声特徴抽出器に基づいて、所定の時間区間毎に、前記音声の特徴を表す情報を生成し、
　第１のパラメータに基づいて、前記時間区間毎に、前記音声特徴抽出器の所定の複数の層でそれぞれ生成された前記特徴を表す情報を統合した第１の統合情報を生成し、
　第２のパラメータに基づいて、前記時間区間毎の前記第１の統合情報を時間方向に統合した第２の統合情報を生成し、
　前記第１の文と、前記第２の統合情報と、言語モデルとに基づいて、前記第１の文に対応する第３の文の生成確率を算出し、
　前記第３の文の生成確率と、前記第２の文とに基づいて、前記第１のパラメータと前記第２のパラメータとを含む学習対象パラメータを学習する、学習装置。
（付記２）
　前記メモリに接続された少なくとも１つのプロセッサと、
　を含み、
　前記プロセッサは、
　音声と、前記音声に関する第１の文とが含まれるテストデータを入力し、
　複数の層で構成された音声特徴抽出器に基づいて、所定の時間区間毎に、前記音声の特徴を表す情報を生成し、
　学習済みの第１のパラメータに基づいて、前記時間区間毎に、前記音声特徴抽出器の所定の複数の層でそれぞれ生成された前記特徴を表す情報を統合した第１の統合情報を生成し、
　学習済みの第２のパラメータに基づいて、前記時間区間毎の前記第１の統合情報を時間方向に統合した第２の統合情報を生成し、
　前記第１の文と、前記第２の統合情報と、言語モデルとに基づいて、前記第１の文に対応する第２の文を生成する、推論装置。
（付記３）
　前記プロセッサは、
　第３のパラメータに基づいて、前記第２の統合情報を線形変換し、
　前記第１の文と、前記線形変換後の第２の統合情報と、前記言語モデルとに基づいて、前記第３の文の生成確率を算出する、付記１に記載の学習装置。
（付記４）
　前記第１のパラメータは、重み付け和に用いられる重み、又は、線形変換和に用いられる線形変換係数であり、
　前記プロセッサは、
　前記特徴を前記重み付け和又は前記線形変換和により統合した第１の統合情報を生成する、付記１又は３に記載の学習装置。
（付記５）
　前記第２のパラメータは、自己注意プーリング層の重み、又は、１次元畳み込みニューラルネットワークのパラメータであり、
　前記プロセッサは、
　前記第１の統合情報を前記自己注意プーリング層又は前記１次元畳み込みニューラルネットワークにより時間方向に統合した第２の統合情報を生成する、付記１又は３に記載の学習装置。
（付記６）
　学習処理を実行するようにコンピュータによって実行可能なプログラムを記憶した非一時的記憶媒体であって、
　前記学習処理は、
　音声と、前記音声に関する第１の文と、前記第１の文に対応する第２の文とが含まれる学習データを入力し、
　複数の層で構成された音声特徴抽出器に基づいて、所定の時間区間毎に、前記音声の特徴を表す情報を生成し、
　第１のパラメータに基づいて、前記時間区間毎に、前記音声特徴抽出器の所定の複数の層でそれぞれ生成された前記特徴を表す情報を統合した第１の統合情報を生成し、
　第２のパラメータに基づいて、前記時間区間毎の前記第１の統合情報を時間方向に統合した第２の統合情報を生成し、
　前記第１の文と、前記第２の統合情報と、言語モデルとに基づいて、前記第１の文に対応する第３の文の生成確率を算出し、
　前記第３の文の生成確率と、前記第２の文とに基づいて、前記第１のパラメータと前記第２のパラメータとを含む学習対象パラメータを学習する、非一時的記憶媒体。
（付記７）
　推論処理を実行するようにコンピュータによって実行可能なプログラムを記憶した非一時的記憶媒体であって、
　前記推論処理は、
　音声と、前記音声に関する第１の文とが含まれるテストデータを入力し、
　複数の層で構成された音声特徴抽出器に基づいて、所定の時間区間毎に、前記音声の特徴を表す情報を生成し、
　学習済みの第１のパラメータに基づいて、前記時間区間毎に、前記音声特徴抽出器の所定の複数の層でそれぞれ生成された前記特徴を表す情報を統合した第１の統合情報を生成し、
　学習済みの第２のパラメータに基づいて、前記時間区間毎の前記第１の統合情報を時間方向に統合した第２の統合情報を生成し、
　前記第１の文と、前記第２の統合情報と、言語モデルとに基づいて、前記第１の文に対応する第２の文を生成する、非一時的記憶媒体。 The following additional notes are further disclosed regarding the above-described embodiments.
(Appendix 1)
Memory and
at least one processor coupled to said memory;
Including,
The processor:
inputting training data including a speech, a first sentence related to the speech, and a second sentence corresponding to the first sentence;
generating information representing the features of the speech for each predetermined time interval based on a speech feature extractor configured with multiple layers;
generating first integrated information for each time interval based on a first parameter, the first integrated information being obtained by integrating information representing the features generated in a plurality of predetermined layers of the audio feature extractor;
generating second integrated information by integrating the first integrated information for each time section in a time direction based on a second parameter;
calculating a generation probability of a third sentence corresponding to the first sentence based on the first sentence, the second integrated information, and a language model;
A learning device that learns learning target parameters including the first parameter and the second parameter based on the generation probability of the third sentence and the second sentence.
(Appendix 2)
at least one processor coupled to said memory;
Including,
The processor:
inputting test data including a speech and a first sentence related to the speech;
generating information representing the features of the speech for each predetermined time interval based on a speech feature extractor configured with multiple layers;
generating first integrated information for each time interval by integrating information representing the features generated in a plurality of predetermined layers of the speech feature extractor based on the trained first parameters;
generating second integrated information by integrating the first integrated information for each time interval in a time direction based on the learned second parameter;
a reasoning device that generates a second sentence corresponding to the first sentence based on the first sentence, the second integrated information, and a language model;
(Appendix 3)
The processor:
linearly transforming the second integrated information based on a third parameter;
2. The learning device according to claim 1, wherein the learning device calculates the probability of generating the third sentence based on the first sentence, the second integrated information after the linear transformation, and the language model.
(Appendix 4)
the first parameter is a weight used in a weighted sum or a linear transform coefficient used in a linear transform sum,
The processor:
The learning device according to claim 1 or 3, wherein first integrated information is generated by integrating the features using the weighted sum or the linear transformation sum.
(Appendix 5)
The second parameter is a weight of a self-attention pooling layer or a parameter of a one-dimensional convolutional neural network;
The processor:
The learning device according to claim 1 or 3, wherein second integrated information is generated by integrating the first integrated information in a time direction using the self-attention pooling layer or the one-dimensional convolutional neural network.
(Appendix 6)
A non-transitory storage medium storing a program executable by a computer to perform a learning process,
The learning process includes:
inputting training data including a speech, a first sentence related to the speech, and a second sentence corresponding to the first sentence;
generating information representing the features of the speech for each predetermined time interval based on a speech feature extractor configured with multiple layers;
generating first integrated information for each time interval based on a first parameter, the first integrated information being obtained by integrating information representing the features generated in a plurality of predetermined layers of the audio feature extractor;
generating second integrated information by integrating the first integrated information for each time section in a time direction based on a second parameter;
calculating a generation probability of a third sentence corresponding to the first sentence based on the first sentence, the second integrated information, and a language model;
A non-transitory storage medium for learning learning target parameters including the first parameter and the second parameter based on the generation probability of the third sentence and the second sentence.
(Appendix 7)
A non-transitory storage medium storing a program executable by a computer to perform an inference process,
The inference process includes:
inputting test data including a speech and a first sentence related to the speech;
generating information representing the features of the speech for each predetermined time interval based on a speech feature extractor configured with multiple layers;
generating first integrated information for each time interval by integrating information representing the features generated in a plurality of predetermined layers of the speech feature extractor based on the trained first parameters;
generating second integrated information by integrating the first integrated information for each time interval in a time direction based on the learned second parameter;
a non-transitory storage medium for generating a second sentence corresponding to the first sentence based on the first sentence, the second integrated information, and a language model;

　本発明は、具体的に開示された上記の実施形態に限定されるものではなく、請求の範囲の記載から逸脱することなく、種々の変形や変更、既知の技術との組み合わせ等が可能である。 The present invention is not limited to the specifically disclosed embodiments above, and various modifications, alterations, and combinations with known technologies are possible without departing from the scope of the claims.

　［参考文献］
　参考文献１：Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli, "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations," arXiv preprint arXiv:2006.11477, 2020.
　参考文献２：S. Chen et al., "WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing," in IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505-1518, Oct. 2022, doi: 10.1109/JSTSP.2022.3188113.
　参考文献３：H. Touvron et al., "Llama 2: Open Foundation and Fine-Tuned Chat Models," arXiv preprint arXiv:2307.09288, 2023. [References]
Reference 1: Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli, "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations," arXiv preprint arXiv:2006.11477, 2020.
Reference 2: S. Chen et al., "WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing," in IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505-1518, Oct. 2022, doi: 10.1109/JSTSP.2022.3188113.
Reference 3: H. Touvron et al., "Llama 2: Open Foundation and Fine-Tuned Chat Models," arXiv preprint arXiv:2307.09288, 2023.

　１０　　　　音声理解装置
　１０１　　　入力装置
　１０２　　　表示装置
　１０３　　　外部Ｉ／Ｆ
　１０３ａ　　記録媒体
　１０４　　　通信Ｉ／Ｆ
　１０５　　　ＲＡＭ
　１０６　　　ＲＯＭ
　１０７　　　補助記憶装置
　１０８　　　プロセッサ
　１０９　　　バス
　２０１　　　モデル構築部
　２０２　　　モデル学習部
　２０３　　　学習済み音声エンコーダ記憶部
　２０４　　　学習済み大規模言語モデル記憶部
　２０５　　　音声理解モデル記憶部
　２０６　　　学習データセット記憶部
　２０７　　　出力文生成部
　２０８　　　学習済み音声理解モデル記憶部
　２０９　　　テストデータ記憶部
　２１１　　　学習データ入力部
　２１２　　　音声エンコード部
　２１３　　　第１の統合部
　２１４　　　第２の統合部
　２１５　　　線形変換部
　２１６　　　事後確率算出部
　２１７　　　パラメータ更新部
　２１８　　　終了判定部
　２２１　　　テストデータ入力部
　２２２　　　音声エンコード部
　２２３　　　第１の統合部
　２２４　　　第２の統合部
　２２５　　　線形変換部
　２２６　　　生成部
　２２７　　　出力部 10 Speech understanding device 101 Input device 102 Display device 103 External I/F
103a Recording medium 104 Communication I/F
105 RAM
106 ROM
107 Auxiliary storage device 108 Processor 109 Bus 201 Model construction unit 202 Model learning unit 203 Trained speech encoder storage unit 204 Trained large-scale language model storage unit 205 Speech understanding model storage unit 206 Training dataset storage unit 207 Output sentence generation unit 208 Trained speech understanding model storage unit 209 Test data storage unit 211 Training data input unit 212 Speech encoding unit 213 First integration unit 214 Second integration unit 215 Linear transformation unit 216 Posterior probability calculation unit 217 Parameter update unit 218 End determination unit 221 Test data input unit 222 Speech encoding unit 223 First integration unit 224 Second integration unit 225 Linear transformation unit 226 Generation unit 227 Output unit

Claims

an input unit for inputting training data including a speech, a first sentence related to the speech, and a second sentence corresponding to the first sentence;
a speech feature generation unit that generates information representing the speech features for each predetermined time interval based on a speech feature extractor configured with multiple layers;
a first integration unit that generates, for each time interval, first integrated information by integrating information representing the features generated in a plurality of predetermined layers of the speech feature extractor based on a first parameter;
a second integration unit that generates second integrated information by integrating the first integrated information for each time section in a time direction based on a second parameter;
a calculation unit that calculates a generation probability of a third sentence corresponding to the first sentence based on the first sentence, the second integrated information, and a language model;
a learning unit that learns learning target parameters including the first parameter and the second parameter based on the generation probability of the third sentence and the second sentence;
A learning device having the above configuration.

an input unit for inputting test data including a speech and a first sentence related to the speech;
a speech feature generation unit that generates information representing the speech features for each predetermined time interval based on a speech feature extractor configured with multiple layers;
a first integration unit that generates first integrated information by integrating information representing the features generated in a plurality of predetermined layers of the speech feature extractor for each time interval based on the trained first parameters;
a second integration unit that generates second integrated information by integrating the first integrated information for each time interval in a time direction based on a learned second parameter;
a sentence generation unit that generates a second sentence corresponding to the first sentence based on the first sentence, the second integrated information, and a language model;
An inference device having:

a linear transformation unit that linearly transforms the second integrated information based on a third parameter;
The calculation unit
The learning device according to claim 1 , wherein the generation probability of the third sentence is calculated based on the first sentence, the second integrated information after the linear transformation, and the language model.

the first parameter is a weight used in a weighted sum or a linear transform coefficient used in a linear transform sum,
The first integration unit
The learning device according to claim 1 or 3, wherein first integrated information is generated by integrating the features using the weighted sum or the linear transformation sum.

The second parameter is a weight of a self-attention pooling layer or a parameter of a one-dimensional convolutional neural network;
The second integration unit
The learning device according to claim 1 or 3, wherein the first integrated information is integrated in a time direction by the self-attention pooling layer or the one-dimensional convolutional neural network to generate second integrated information.

an input step of inputting training data including a speech, a first sentence related to the speech, and a second sentence corresponding to the first sentence;
a speech feature generation step for generating information representing the speech features for each predetermined time interval based on a speech feature extractor configured with multiple layers;
a first integration step for generating, for each time interval, first integrated information by integrating information representing the features generated in a plurality of predetermined layers of the speech feature extractor based on a first parameter;
a second integration step of generating second integrated information by integrating the first integrated information for each time section in a time direction based on a second parameter;
a calculation step of calculating a generation probability of a third sentence corresponding to the first sentence based on the first sentence, the second integrated information, and a language model;
a learning procedure for learning learning target parameters including the first parameter and the second parameter based on the generation probability of the third sentence and the second sentence;
A learning method that computers perform.

an input step of inputting test data including a speech and a first sentence related to the speech;
a speech feature generation step for generating information representing the speech features for each predetermined time interval based on a speech feature extractor configured with multiple layers;
a first integration step for generating, for each time interval, first integrated information by integrating information representing the features generated in a plurality of predetermined layers of the speech feature extractor based on the trained first parameters;
a second integration step of generating second integrated information by integrating the first integrated information for each time interval in a time direction based on a learned second parameter;
a sentence generation step of generating a second sentence corresponding to the first sentence based on the first sentence, the second integrated information, and a language model;
A method of inference performed by a computer.

A program that causes a computer to function as the learning device described in claim 1 or the inference device described in claim 2.