
WO2025099939A1 - Learning device, generation device, learning method, generation method, and program - Google Patents


Info

Publication number
WO2025099939A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
information
unit
input
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/JP2023/040613
Other languages
French (fr)
Japanese (ja)
Inventor
勇祐 井島
孝平 松浦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NTT Inc
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp filed Critical Nippon Telegraph and Telephone Corp
Priority to PCT/JP2023/040613 priority Critical patent/WO2025099939A1/en
Publication of WO2025099939A1 publication Critical patent/WO2025099939A1/en
Pending legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Definitions

  • This disclosure relates to a learning device, a generating device, a learning method, a generating method, and a program.
  • In recent years, in the field of speech processing, many technologies have been proposed for recognizing or detecting paralinguistic and non-linguistic information contained in speech (for example, Non-Patent Document 1).
  • Here, paralinguistic information and non-linguistic information both refer to information contained in speech that is not linguistic information.
  • Hereinafter, paralinguistic information and non-linguistic information will be collectively referred to as "paralinguistic and non-linguistic information."
  • However, conventional technologies that recognize or detect paralinguistic and non-linguistic information contained in speech often output only the recognition or detection result itself, which is not highly interpretable.
  • The present disclosure has been made in consideration of the above points, and provides technology that can output paralinguistic and non-linguistic information contained in speech in a highly interpretable format.
  • A learning device according to one aspect of the present disclosure includes an input unit that inputs learning data including voice data and teacher data representing a natural language description of paralinguistic information or non-linguistic information contained in the voice represented by the voice data, a calculation unit that uses as input a voice sequence obtained by converting the voice data into a predetermined format for each predetermined unit, and calculates the generation probability of a natural language description of paralinguistic information or non-linguistic information contained in the voice represented by the voice data from which the input voice sequence is converted using a machine learning model that calculates the generation probability of a natural language description of paralinguistic information or non-linguistic information contained in the voice, and an update unit that updates the learnable parameters of the machine learning model based on the error between the natural language description generated according to the generation probability and the teacher data included in the learning data.
  • Technology is thus provided that can output paralinguistic and non-linguistic information contained in speech in a highly interpretable format.
  • FIG. 1 is a diagram illustrating an example of the hardware configuration of a caption generation device according to a first embodiment.
  • FIG. 2 is a diagram showing an example of the functional configuration of the caption generation device according to the first embodiment during learning.
  • FIG. 3 is a diagram illustrating an example of learning data stored in a learning data storage unit.
  • FIG. 4 is a diagram illustrating an example of a detailed functional configuration of a caption generation unit according to the first embodiment.
  • FIG. 5 is a flowchart illustrating an example of a learning process according to the first embodiment.
  • FIG. 6 is a flowchart illustrating an example of a generation process of an output token probability sequence according to the first embodiment.
  • FIG. 7 is a diagram illustrating an example of a functional configuration at the time of inference of the caption generation device according to the first embodiment.
  • FIG. 8 is a flowchart showing an example of a caption generation process according to the first embodiment.
  • FIG. 9 is a flowchart illustrating an example of a generation process of an output token sequence according to the first embodiment.
  • FIG. 10 is a diagram showing an example of the functional configuration of the caption generation device according to a second embodiment during learning.
  • FIG. 11 is a diagram illustrating an example of a detailed functional configuration of a caption generation unit according to the second embodiment.
  • FIG. 12 is a flowchart illustrating an example of a learning process according to the second embodiment.
  • FIG. 13 is a flowchart illustrating an example of a generation process of an output token probability sequence according to the second embodiment.
  • FIG. 14 is a diagram showing an example of the functional configuration at the time of inference of the caption generation device according to the second embodiment.
  • FIG. 15 is a flowchart showing an example of a caption generation process according to the second embodiment.
  • FIG. 16 is a flowchart illustrating an example of a generation process of an output token sequence according to the second embodiment.
  • In each of the following embodiments, as an example, a caption generation device 10 is described that can output paralinguistic and non-linguistic information contained in audio in the form of a caption written in natural language.
  • a caption is generally a term that refers to an explanatory text about the contents of, for example, a video or a photo, but in the following embodiments, the term is used to refer to an explanatory text that describes the paralinguistic and non-linguistic information contained in audio in natural language. This makes it possible to output the paralinguistic and non-linguistic information contained in audio in the highly interpretable form of a caption.
  • terms such as "natural language sentence”, “explanatory text”, “sentence”, and "character string” may be used instead of the term caption.
  • However, outputting paralinguistic and non-linguistic information contained in audio in the form of captions is just one example, and the highly interpretable format for expressing paralinguistic and non-linguistic information is not limited to captions.
  • Each of the embodiments described below can be similarly applied to cases where paralinguistic and non-linguistic information is expressed in a highly interpretable format other than captions.
  • Examples of highly interpretable formats other than captions include figures, tables, images, and the like that include explanatory text describing the paralinguistic and non-linguistic information in natural language.
  • Note that paralinguistic information and non-linguistic information both refer to information contained in speech that is not linguistic information; generally, however, paralinguistic information refers to information that the speaker can change at will, while non-linguistic information refers to information that cannot be changed at will.
  • Examples of paralinguistic information include information that indicates the speaker's intentions and attitudes.
  • Examples of non-linguistic information include information that indicates the speaker's gender and emotions, the presence or absence of a certain illness, and information that identifies the speaker.
  • The caption generation device 10 according to the first embodiment has a "learning time" during which a machine learning model that generates captions (hereinafter referred to as a "captioning model") is trained, and an "inference time" during which a caption is generated from a voice using the trained captioning model.
  • the caption generation device 10 is realized by the same device during learning and inference, but for example, the caption generation device 10 may be realized by different devices during learning and inference.
  • caption generation device 10 during learning may be called, for example, a "learning device.”
  • caption generation device 10 during inference may be called, for example, simply a “generation device” or an “inference device.”
  • Fig. 1 is a diagram showing an example of the hardware configuration of the caption generation device 10 according to the first embodiment.
  • the caption generating device 10 has an input device 101, a display device 102, an external I/F 103, a communication I/F 104, a RAM (Random Access Memory) 105, a ROM (Read Only Memory) 106, an auxiliary storage device 107, and a processor 108.
  • Each of these pieces of hardware is communicably connected to the others via a bus 109.
  • the input device 101 is, for example, a keyboard, a mouse, a touch panel, a physical button, etc.
  • the display device 102 is, for example, a display, a display panel, etc. Note that the caption generation device 10 does not have to have at least one of the input device 101 and the display device 102, for example.
  • the external I/F 103 is an interface with external devices such as a recording medium 103a.
  • recording media 103a include a CD (Compact Disc), a DVD (Digital Versatile Disk), an SD memory card (Secure Digital memory card), and a USB (Universal Serial Bus) memory card.
  • the hardware configuration shown in FIG. 1 is an example, and the hardware configuration of the caption generation device 10 is not limited to this.
  • the caption generation device 10 may have multiple auxiliary storage devices 107 and multiple processors 108, may not have some of the hardware shown in the figure, or may have various hardware other than the hardware shown in the figure.
  • Fig. 2 is a diagram showing an example of a functional configuration of the caption generation device 10 according to the first embodiment during learning.
  • the caption generation device 10 has an input unit 201, a voice conversion unit 202, a caption generation unit 203, a correct caption conversion unit 204, and a parameter update unit 205. Each of these units is realized, for example, by a process in which one or more programs installed in the caption generation device 10 are executed by the processor 108. Furthermore, during learning, the caption generation device 10 according to the first embodiment has a learning data storage unit 206 and a model parameter storage unit 207. Each of these storage units is realized, for example, by a storage area such as the auxiliary storage device 107. However, for example, at least one of the storage units of the learning data storage unit 206 and the model parameter storage unit 207 may be realized by a storage area of a storage device or the like that is communicably connected to the caption generation device 10.
  • the input unit 201 inputs the learning data stored in the learning data storage unit 206.
  • the learning data is data for training a captioning model, and is represented as a pair of audio data representing the voice of a certain speaker and a correct answer caption that describes the paralinguistic and non-linguistic information contained in the voice in natural language.
  • the voice conversion unit 202 converts the voice data included in the training data input by the input unit 201 into an input voice sequence.
  • the input voice sequence is the voice sequence input to the captioning model.
  • A voice sequence is sequence data obtained by converting voice data into a format that can be handled on a frame-by-frame basis by signal processing or the like. Specific examples of voice sequences include sequence data converted into formats such as a spectrum, mel-frequency cepstral coefficients (MFCC), and a mel spectrogram.
  • However, depending on the model that realizes the speech encoding unit 301 described later, the voice conversion unit 202 may use data obtained by dividing the audio waveform of the audio data into frames as the input voice sequence.
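  • The following is a minimal editorial sketch (not taken from the patent; the sampling rate, frame settings, and use of torchaudio are assumptions) of the kind of frame-by-frame conversion the voice conversion unit 202 might perform, here producing a mel-spectrogram input voice sequence from a waveform.

```python
# A minimal sketch of converting voice data into a frame-level input voice sequence.
# Settings (16 kHz, 80 mel bands, hop of 256 samples) are illustrative assumptions.
import torch
import torchaudio

sample_rate = 16000                      # assumed sampling rate
waveform = torch.randn(1, sample_rate)   # stand-in for 1 second of speech

to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=1024, hop_length=256, n_mels=80)

# Input voice sequence: one 80-dimensional feature vector per frame.
input_voice_sequence = to_mel(waveform).squeeze(0).transpose(0, 1)  # (frames, 80)
print(input_voice_sequence.shape)
```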
  • the caption generation unit 203 is realized by a captioning model, and generates an output token probability sequence from the input speech sequence converted by the speech conversion unit 202.
  • the output token probability sequence is a sequence of generation probabilities of the output token that constitutes the caption.
  • a token is a sequence of one or more characters that represents a certain unit, and a typical example is a word.
  • the generation probability of the output token is expressed as an N-dimensional vector in which the sum of the elements of each dimension is 1, and the value of each element of the dimension is the probability that the token corresponding to that element will be generated, when the number of types of tokens that can be generated is N. An example of a detailed functional configuration of the caption generation unit 203 will be described later.
  • the correct caption conversion unit 204 converts the correct caption contained in the learning data input by the input unit 201 into a correct token probability sequence.
  • a correct token probability sequence is a sequence of probabilities corresponding to the correct tokens that make up the correct caption. For example, when the number of types of tokens to be generated is N, the probability corresponding to the correct token is expressed as an N-dimensional vector in which only the value of the element corresponding to the correct token is 1 and the values of the other elements are 0.
  • the parameter update unit 205 updates the learnable parameters of the captioning model (hereinafter referred to as "model parameters") using the error between the output token probability sequence generated by the caption generation unit 203 and the correct token probability sequence converted by the correct caption conversion unit 204.
  • cross-entropy error can be used as the error.
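  • As an illustration of the error computation described above, the following sketch (toy dimensions; not the patent's code) builds an output token probability sequence, the corresponding one-hot correct token probability sequence, and their cross-entropy error.

```python
# Cross-entropy error between an output token probability sequence and the
# one-hot correct token probability sequence. N and the ids are toy values.
import torch

N = 5                                                      # number of token types
# Output token probability sequence: one N-dim probability vector per position.
output_probs = torch.softmax(torch.randn(3, N), dim=-1)    # 3 output positions
# Correct token probability sequence: one-hot vectors for the correct tokens.
correct_ids = torch.tensor([2, 0, 4])
correct_probs = torch.nn.functional.one_hot(correct_ids, num_classes=N).float()

# Cross-entropy averaged over positions: -sum_i p_correct_i * log p_output_i
error = -(correct_probs * output_probs.log()).sum(dim=-1).mean()
print(error)
```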
  • the learning data storage unit 206 stores learning data that has been created in advance. Specific examples of learning data will be described later.
  • the model parameter storage unit 207 stores the model parameters of the captioning model.
  • Fig. 3 is a diagram showing an example of the learning data stored in the learning data storage unit 206.
  • the learning data storage unit 206 stores one or more learning data.
  • Each learning data includes audio data representing a voice of a certain speaker speaking a certain sentence, and a correct answer caption representing an explanatory text in natural language describing the paralinguistic and non-linguistic information perceived by a person who hears the voice.
  • the correct answer caption may be called, for example, teacher data.
  • the learning data in the first row contains audio data 1 and the correct caption "An elderly man speaking slowly.”
  • the learning data in the second row contains audio data 2 and the correct caption "The voice of a young woman who seems to be suffering from a cold.”
  • the learning data in the third row contains audio data 3 and the correct caption "A little girl playing suddenly with her friends.”
  • the learning data in the fourth row contains audio data 4 and the correct caption "A male university student whispering to the person next to him.”
  • the training data storage unit 206 stores multiple training data, each represented by a pair of audio data and its correct caption.
  • the number of training data stored in the training data storage unit 206 is preferably several hundred to several thousand or more. It is also preferable that the number of speakers of the audio data included in the training data is several hundred or more, and the number of audio documents represented by the audio data is several to several tens of sentences or more.
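  • Purely for illustration, the learning data of Fig. 3 could be held as pairs of audio data and a correct caption, as in the following sketch (file paths and structure are hypothetical).

```python
# Hypothetical in-memory representation of the learning data storage unit:
# each entry pairs audio data with its correct caption.
learning_data = [
    {"audio": "data/audio_0001.wav",
     "caption": "An elderly man speaking slowly."},
    {"audio": "data/audio_0002.wav",
     "caption": "The voice of a young woman who seems to be suffering from a cold."},
]

for item in learning_data:
    print(item["audio"], "->", item["caption"])
```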
  • FIG. 4 is a diagram showing an example of a detailed functional configuration of the caption generation unit 203 according to the first embodiment.
  • In the following, as an example, a case will be described in which the captioning model is composed of three machine learning models, each including a neural network: a speech encoder, a conversion network, and a text decoder.
  • However, the captioning model does not necessarily have to be configured with these three machine learning models (a speech encoder, a conversion network, and a text decoder), and may have other configurations.
  • the captioning model may be configured with one machine learning model including a neural network.
  • the caption generation unit 203 is composed of an audio encoding unit 301 realized by an audio encoder, a vector conversion unit 302 realized by a conversion network, and a text decoding unit 303 realized by a text decoder.
  • the speech encoding unit 301 takes an input speech sequence as input and generates a fixed-length vector (hereinafter referred to as an "audio fixed-length vector") that expresses the features of the input speech sequence (features that express paralinguistic and non-linguistic information).
  • As the speech encoder that realizes the speech encoding unit 301, for example, a Global Style Token (Reference 1) or the like can be used.
  • Alternatively, as the speech encoder that realizes the speech encoding unit 301, for example, a combination of a speech representation model based on self-supervised learning (hereinafter, a "speech SSL model") such as HuBERT (Reference 2) or WavLM (Reference 3) and a Global Style Token or the like may be used.
  • In this case, the voice conversion unit 202 described above uses data obtained by dividing the audio waveform of the audio data into frames as the input voice sequence.
  • The input voice sequence is then fed to the speech SSL model, and the sequence of frame-by-frame features output from the speech SSL model is input to the Global Style Token or the like to generate the fixed-length audio vector.
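  • The sketch below is a highly simplified stand-in (an editorial assumption, not the Global Style Token or HuBERT implementation; the class name ToySpeechEncoder and all dimensions are invented) for the speech encoding unit 301: frame-level features are pooled through attention over a small set of learnable style-token-like vectors to yield one fixed-length audio vector.

```python
# Simplified stand-in for the speech encoding unit 301: attention pooling over
# learnable style tokens produces a single fixed-length audio vector.
import torch
import torch.nn as nn

class ToySpeechEncoder(nn.Module):
    def __init__(self, feat_dim=80, vec_dim=256, num_style_tokens=10):
        super().__init__()
        self.proj = nn.Linear(feat_dim, vec_dim)
        self.style_tokens = nn.Parameter(torch.randn(num_style_tokens, vec_dim))
        self.attn = nn.MultiheadAttention(vec_dim, num_heads=4, batch_first=True)

    def forward(self, speech_seq):                 # (batch, frames, feat_dim)
        frames = self.proj(speech_seq)             # (batch, frames, vec_dim)
        query = frames.mean(dim=1, keepdim=True)   # crude utterance-level query
        tokens = self.style_tokens.unsqueeze(0).expand(frames.size(0), -1, -1)
        pooled, _ = self.attn(query, tokens, tokens)
        return pooled.squeeze(1)                   # fixed-length audio vector

encoder = ToySpeechEncoder()
print(encoder(torch.randn(2, 100, 80)).shape)      # torch.Size([2, 256])
```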
  • the vector conversion unit 302 receives the fixed-length audio vector generated by the audio encoding unit 301 as input and generates a sequence of P text information vectors (hereinafter referred to as a "text information vector sequence") that can be handled by the text decoding unit 303.
  • the vector conversion unit 302 converts the input fixed-length audio vector into a text information vector sequence.
  • P is a predetermined fixed value.
  • The text information vector sequence is a vector sequence representing word-embedding-like expressions obtained by reflecting the features expressed by the input fixed-length audio vector onto the P vectors included in the model parameters.
  • the P vectors included in the model parameters are also one of the learnable parameters.
  • As the conversion network that realizes the vector conversion unit 302, for example, an ordinary MLP (Multilayer Perceptron) can be used, similar to the Mapping Network described in Reference 4.
  • Alternatively, a neural network such as a Transformer encoder that can take preceding and following sequence information into account, or a neural network that combines an MLP and a Transformer encoder, can also be used.
  • As the text decoder that realizes the text decoding unit 303, for example, a decoder-type pre-trained model (Reference 5) trained with large-scale text data can be used.
  • pre-trained models include large-scale language models (LLMs) such as the Generative Pre-trained Transformer (GPT).
  • The text decoding unit 303 takes the text information vector sequence as input and generates an output token probability sequence. The text decoding unit 303 can also generate an output token sequence that constitutes a caption, for example, by sampling the output tokens according to each generation probability included in the output token probability sequence.
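  • The following rough sketch (an editorial assumption, not the patent's implementation; the class name ToyCaptioningModel, all dimensions, and the use of a small Transformer in place of a pre-trained LLM decoder are invented, and a real text decoder such as GPT would generate tokens autoregressively) shows the three-part structure described above: a speech encoder producing a fixed-length vector, a conversion network producing P text information vectors, and a text decoder producing an output token probability sequence.

```python
# Rough structural sketch of the captioning model:
# speech encoder (unit 301) -> conversion network (unit 302) -> text decoder (unit 303).
import torch
import torch.nn as nn

class ToyCaptioningModel(nn.Module):
    def __init__(self, feat_dim=80, vec_dim=256, P=8, emb_dim=512, vocab_size=1000):
        super().__init__()
        self.speech_encoder = nn.Sequential(             # stand-in for unit 301
            nn.Linear(feat_dim, vec_dim), nn.ReLU())
        self.conversion_network = nn.Sequential(         # stand-in for unit 302
            nn.Linear(vec_dim, P * emb_dim), nn.Tanh())
        decoder_layer = nn.TransformerEncoderLayer(      # stand-in for unit 303
            d_model=emb_dim, nhead=8, batch_first=True)
        self.text_decoder = nn.TransformerEncoder(decoder_layer, num_layers=2)
        self.to_vocab = nn.Linear(emb_dim, vocab_size)
        self.P, self.emb_dim = P, emb_dim

    def forward(self, speech_seq):                       # (batch, frames, feat_dim)
        fixed_vec = self.speech_encoder(speech_seq).mean(dim=1)   # cf. step S201
        prefix = self.conversion_network(fixed_vec)               # cf. step S202
        prefix = prefix.view(-1, self.P, self.emb_dim)   # P text information vectors
        hidden = self.text_decoder(prefix)                        # cf. step S203
        return torch.softmax(self.to_vocab(hidden), dim=-1)       # token probabilities

model = ToyCaptioningModel()
probs = model(torch.randn(2, 100, 80))
print(probs.shape)   # torch.Size([2, 8, 1000])
```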
  • Fig. 5 is a flowchart showing an example of the learning process according to the first embodiment.
  • online learning is assumed as an example, and steps S101 to S105 in Fig. 5 are repeatedly executed for each piece of learning data.
  • the learning process may be executed by mini-batch learning, batch learning, or the like.
  • the input unit 201 inputs one piece of learning data stored in the learning data storage unit 206 (step S101).
  • the voice conversion unit 202 converts the voice data contained in the training data input in step S101 above into an input voice sequence (step S102).
  • the caption generation unit 203 generates an output token probability sequence from the input speech sequence converted in step S102 (step S103). Details of the processing in this step will be described later.
  • the correct caption conversion unit 204 converts the correct caption contained in the learning data input in step S101 above into a correct token probability sequence (step S104).
  • The parameter update unit 205 updates the model parameters using the error between the output token probability sequence generated in step S103 and the correct token probability sequence converted in step S104 (step S105). That is, the parameter update unit 205 updates the model parameters by a known optimization method so as to minimize the error. However, at this time, the parameter update unit 205 may exclude the model parameters of the text decoder that realizes the text decoding unit 303 from the update target. Also, if a speech SSL model is used for the speech encoder that realizes the speech encoding unit 301, the parameter update unit 205 may exclude the model parameters of the speech SSL model from the update target. Note that, when updating the model parameters of the text decoder that realizes the text decoding unit 303 or the model parameters of the speech SSL model, the parameter update unit 205 may update the entire set of those model parameters, or may use an adapter such as LoRA (Reference 6).
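  • A hedged sketch of one online learning iteration (steps S101 to S105) is shown below, reusing the toy model from the previous sketch; the optimizer, learning rate, and the choice of which parameters to freeze are assumptions for illustration only.

```python
# One toy learning step: forward pass, cross-entropy error against the correct
# tokens, and an update of only the parameters that are not frozen.
import torch

model = ToyCaptioningModel()                       # toy model from the sketch above
# As described in the text, the text decoder parameters may be excluded from the update.
for p in model.text_decoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4)

speech_seq = torch.randn(1, 100, 80)               # input voice sequence (cf. S102)
correct_ids = torch.randint(0, 1000, (1, 8))       # correct token ids (cf. S104)

probs = model(speech_seq)                          # output token probabilities (cf. S103)
loss = torch.nn.functional.nll_loss(               # cross-entropy error (cf. S105)
    probs.log().flatten(0, 1), correct_ids.flatten())
optimizer.zero_grad()
loss.backward()
optimizer.step()
```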
  • Through the above, a captioning model in which the trained model parameters are set (that is, a trained captioning model) is obtained.
  • Fig. 6 is a flowchart showing an example of the process of generating an output token probability sequence according to the first embodiment.
  • the audio encoding unit 301 of the caption generation unit 203 generates a fixed-length audio vector from the input audio sequence (step S201).
  • the vector conversion unit 302 of the caption generation unit 203 converts the fixed-length audio vector generated in step S201 above into a text information vector sequence (step S202).
  • the text decoding unit 303 of the caption generation unit 203 generates an output token probability sequence from the text information vector sequence converted in step S202 above (step S203).
  • The inference time of the caption generation device 10 according to the first embodiment will be described below.
  • the model parameters of the captioning model have already been learned at the time of inference. Note that, at the time of inference, differences from the time of learning will be mainly described, and descriptions of components similar to those at the time of learning will be omitted.
  • Fig. 7 is a diagram showing an example of a functional configuration at the time of inference of the caption generation device 10 according to the first embodiment.
  • the caption generation device 10 has an input unit 201, a voice conversion unit 202, a caption generation unit 203, and an output unit 208.
  • the output unit 208 is realized, for example, by a process in which one or more programs installed in the caption generation device 10 are executed by the processor 108.
  • the caption generation device 10 according to the first embodiment has a model parameter storage unit 207.
  • When audio data for which a caption is to be generated is given, the input unit 201 inputs this audio data.
  • the voice conversion unit 202 converts the voice data input by the input unit 201 into an input voice sequence.
  • the caption generation unit 203 is realized by a trained captioning model, and generates an output token sequence from the input speech sequence converted by the speech conversion unit 202.
  • the output unit 208 outputs a caption composed of an output token sequence generated by the caption generation unit 203 to a predetermined output destination.
  • the output destination is not limited to a specific output destination, and any output destination can be targeted.
  • the output destination can be a storage area of the auxiliary storage device 107, a display device 102 such as a display, other devices connected in a communicable manner, etc.
  • the model parameter storage unit 207 stores the learned model parameters of the captioning model.
  • Fig. 8 is a flowchart showing an example of the caption generation process according to the first embodiment. Note that, in the following, it is assumed that audio data for which a caption is to be generated is provided to the caption generation device 10.
  • the input unit 201 inputs the given voice data (step S301).
  • the voice conversion unit 202 converts the voice data input in step S301 above into an input voice sequence (step S302).
  • the caption generation unit 203 generates an output token sequence from the input speech sequence converted in step S302 (step S303). Details of the processing in this step will be described later.
  • the output unit 208 outputs the caption composed of the output token sequence generated in step S303 above to a predetermined output destination (step S304). This results in a caption that describes in natural language the paralinguistic and non-linguistic information contained in the speech represented by the given audio data.
  • Fig. 9 is a flowchart showing an example of the process of generating an output token sequence according to the first embodiment.
  • the audio encoding unit 301 of the caption generation unit 203 generates an audio fixed-length vector from the input audio sequence (step S401).
  • The vector conversion unit 302 of the caption generation unit 203 converts the fixed-length audio vector generated in step S401 above into a text information vector sequence (step S402).
  • The text decoding unit 303 of the caption generation unit 203 then generates an output token sequence from the text information vector sequence converted in step S402 above (step S403).
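  • The following minimal decoding sketch (an assumption for illustration, reusing the toy model defined earlier; a real pre-trained text decoder would decode autoregressively) corresponds to the generation of an output token sequence at inference time: the forward pass of steps S401 to S403 followed by selecting, or sampling, output tokens from the probability sequence.

```python
# Toy inference: run the forward pass and pick output tokens from the
# output token probability sequence (argmax here; sampling shown in a comment).
import torch

model = ToyCaptioningModel()                         # toy model from the sketch above
model.eval()
with torch.no_grad():
    probs = model(torch.randn(1, 100, 80))           # forward pass (cf. S401-S403)
    token_ids = probs.argmax(dim=-1)                 # most probable token per position
    # token_ids = torch.multinomial(probs.squeeze(0), num_samples=1).squeeze(-1)
print(token_ids)   # would be mapped back to words/characters to form the caption
```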
  • Fig. 10 is a diagram showing an example of a functional configuration of the caption generation device 10 according to the second embodiment during learning.
  • During learning, the caption generation device 10 according to the second embodiment has, in addition to the various units described in the first embodiment, an identification unit 209 that is realized by an identification model that identifies paralinguistic and non-linguistic information contained in speech.
  • The identification unit 209 is realized, for example, by a process in which one or more programs installed in the caption generation device 10 are executed by the processor 108.
  • the identification unit 209 generates an identification result that is a result of identifying paralinguistic and non-linguistic information from the input speech sequence converted by the speech conversion unit 202.
  • the identification model that realizes the identification unit 209 is not limited to a specific identification model, and any identification model that can identify any paralinguistic and non-linguistic information such as illness, gender, emotion, etc. from the input speech sequence can be used.
  • the identification unit 209 may be realized by one identification model or by multiple identification models. When the identification unit 209 is realized by multiple identification models, the identification unit 209 generates an identification result that is a result of identifying each of the multiple pieces of paralinguistic and non-linguistic information from the input speech sequence.
  • The one or more identification models that realize the identification unit 209 may be pre-trained, or may be trained together with the captioning model.
  • The training data includes, in addition to the audio data and the correct caption, correct paralinguistic and non-linguistic information that indicates the correct identification result.
  • The caption generation unit 203 generates an output token probability sequence from the input speech sequence converted by the speech conversion unit 202 and the identification result generated by the identification unit 209. An example of a detailed functional configuration of the caption generation unit 203 will be described later.
  • Fig. 11 is a diagram showing an example of a detailed functional configuration of the caption generation unit 203 according to the second embodiment.
  • The following describes, as an example, a case in which the captioning model is composed of three machine learning models: a speech encoder, a conversion network, and a text decoder.
  • the caption generation unit 203 is composed of an audio encoding unit 301 realized by an audio encoder, a vector conversion unit 302 realized by a conversion network, and a text decoding unit 303 realized by a text decoder.
  • The vector conversion unit 302 generates a text information vector sequence using the fixed-length audio vector generated by the audio encoding unit 301 and the identification result generated by the identification unit 209 as input. In other words, the vector conversion unit 302 converts the input fixed-length audio vector and the input identification result into a text information vector sequence. Specifically, the vector conversion unit 302 combines the fixed-length audio vector generated by the audio encoding unit 301 and the identification result generated by the identification unit 209 to create a new fixed-length audio vector, and then generates a text information vector sequence from this new fixed-length audio vector, similar to the vector conversion unit 302 according to the first embodiment. This makes it possible to generate a text information vector sequence that takes into account not only the fixed-length audio vector but also the identification result of the identification model, and as a result, it becomes possible to generate a caption composed of an output token sequence that takes these into account.
  • For example, in the case where the fixed-length audio vector is an M-dimensional vector, there are L identification results, and every identification result is a scalar value taking one of the two values 0 or 1, the vector combining the fixed-length audio vector and the identification results is an (M+L)-dimensional vector whose elements are the elements of the fixed-length audio vector and the L identification results.
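  • As a small illustration of the combination described above (dimensions and labels are toy values), an M-dimensional fixed-length audio vector and L scalar identification results can be concatenated into an (M+L)-dimensional vector:

```python
# Concatenating an M-dimensional fixed-length audio vector with L scalar
# identification results into an (M+L)-dimensional vector.
import torch

M, L = 256, 3
fixed_audio_vector = torch.randn(M)                      # from the speech encoder
identification_results = torch.tensor([1.0, 0.0, 1.0])   # e.g. illness / gender / emotion flags (toy)
combined = torch.cat([fixed_audio_vector, identification_results])
print(combined.shape)                                    # torch.Size([259]) = (M+L,)
```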
  • Fig. 12 is a flowchart showing an example of the learning process according to the second embodiment.
  • online learning is assumed as an example, and steps S501 to S506 in Fig. 12 are repeatedly executed for each piece of learning data.
  • Steps S501 and S502 in FIG. 12 are similar to steps S101 and S102 in FIG. 5, respectively, and therefore will not be described.
  • the identification unit 209 generates an identification result that identifies paralinguistic and non-linguistic information from the input speech sequence converted in step S502 (step S503).
  • The caption generation unit 203 generates an output token probability sequence from the input speech sequence converted in step S502 and the identification result generated in step S503 (step S504). Details of the processing in this step will be described later.
  • the correct caption conversion unit 204 converts the correct caption included in the learning data input in step S501 into a correct token probability sequence (step S505), similar to step S104 in FIG. 5.
  • the parameter update unit 205 updates the model parameters using the error between the output token probability sequence generated in the above step S504 and the correct token probability sequence converted in the above step S505, similar to step S105 in FIG. 5 (step S506).
  • The parameter update unit 205 also uses the error between the correct paralinguistic and non-linguistic information included in the training data input in step S501 and the paralinguistic and non-linguistic information represented by the identification result generated in the above step S503 to update the learnable parameters of the one or more identification models realizing the identification unit 209, in addition to the model parameters.
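  • The following hedged sketch (loss choices, shapes, and equal weighting are assumptions, not the patent's specification) illustrates how the caption error and the identification error could be combined so that both the captioning model parameters and the identification model parameters are updated.

```python
# Toy combination of the caption cross-entropy error with an identification
# error computed against the correct paralinguistic / non-linguistic information.
import torch
import torch.nn.functional as F

caption_log_probs = torch.randn(8, 1000).log_softmax(dim=-1)   # from the captioning model
correct_token_ids = torch.randint(0, 1000, (8,))
caption_loss = F.nll_loss(caption_log_probs, correct_token_ids)

identification_logits = torch.randn(1, 3)          # from the identification model(s)
correct_labels = torch.tensor([[1.0, 0.0, 1.0]])   # correct paralinguistic / non-linguistic info (toy)
identification_loss = F.binary_cross_entropy_with_logits(identification_logits, correct_labels)

total_loss = caption_loss + identification_loss    # both sets of parameters can then be updated
print(total_loss)
```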
  • Through the above, a captioning model in which the trained model parameters are set (that is, a trained captioning model) is obtained.
  • Fig. 13 is a flowchart showing an example of the process of generating an output token probability sequence according to the second embodiment.
  • the audio encoding unit 301 of the caption generation unit 203 generates an audio fixed-length vector from the input audio sequence (step S601), similar to step S201 in FIG. 6.
  • The vector conversion unit 302 of the caption generation unit 203 converts the fixed-length audio vector generated in step S601 above and the identification result generated in step S503 of FIG. 12 into a text information vector sequence (step S602).
  • the text decoding unit 303 of the caption generation unit 203 generates an output token probability sequence from the text information vector sequence converted in step S602 above (step S603), similar to step S203 in FIG. 6.
  • The inference time of the caption generation device 10 according to the second embodiment will be described below.
  • The model parameters of the captioning model and the learnable parameters of the identification model have both been learned at the time of inference. Note that, at the time of inference, differences from the time of learning will be mainly described, and descriptions of components similar to those at the time of learning will be omitted.
  • Fig. 14 is a diagram showing an example of a functional configuration at the time of inference of the caption generation device 10 according to the second embodiment.
  • As shown in FIG. 14, during inference, the caption generation device 10 according to the second embodiment has an identification unit 209 in addition to the units described in the first embodiment.
  • the identification unit 209 generates an identification result by identifying paralinguistic and non-linguistic information from the input speech sequence converted by the speech conversion unit 202.
  • The caption generation unit 203 is realized by a trained captioning model, and generates an output token sequence from the input speech sequence converted by the speech conversion unit 202 and the identification result generated by the identification unit 209.
  • Fig. 15 is a flowchart showing an example of the caption generation process according to the second embodiment. Note that, in the following, it is assumed that audio data for which a caption is to be generated is provided to the caption generation device 10.
  • Steps S701 and S702 in FIG. 15 are similar to steps S301 and S302 in FIG. 8, respectively, and therefore will not be described.
  • the identification unit 209 generates an identification result that identifies paralinguistic and non-linguistic information from the input speech sequence converted in step S702 (step S703).
  • the caption generation unit 203 generates an output token sequence from the input speech sequence converted in step S702 above and the identification result generated in step S703 above (step S704).
  • the output unit 208 outputs the caption composed of the output token sequence generated in step S704 above to a predetermined output destination (step S705), similar to step S304 in FIG. 8. This allows for the acquisition of a caption that describes in natural language the paralinguistic and non-linguistic information contained in the speech represented by the given speech data, taking into account the identification results obtained from the speech data using the identification model.
  • Fig. 16 is a flowchart showing an example of the output token sequence generation process according to the second embodiment.
  • the audio encoding unit 301 of the caption generation unit 203 generates an audio fixed-length vector from the input audio sequence (step S801), similar to step S401 in FIG. 9.
  • The vector conversion unit 302 of the caption generation unit 203 converts the fixed-length audio vector generated in step S801 above and the identification result generated in step S703 of FIG. 15 into a text information vector sequence (step S802).
  • the text decoding unit 303 of the caption generation unit 203 generates an output token sequence from the text information vector sequence converted in step S802 above (step S803), similar to step S403 in FIG. 9.
  • the caption generation device 10 learns a captioning model from learning data including voice data and correct captions that describe in natural language the paralinguistic and non-linguistic information contained in the voice represented by the voice data.
  • the caption generation device 10 according to the first embodiment can then generate captions that describe in natural language the paralinguistic and non-linguistic information contained in the voice represented by the given voice data, using the learned captioning model. This makes it possible to output the paralinguistic and non-linguistic information contained in the voice in the form of a caption that has high interpretability.
  • The caption generation device 10 according to the second embodiment also trains the captioning model using the results of an identification model that identifies paralinguistic and non-linguistic information from the speech represented by the audio data.
  • The caption generation device 10 according to the second embodiment can then use the trained captioning model and identification model to generate a caption for given audio data and to generate paralinguistic and non-linguistic information as identification results. This makes it possible to obtain both the paralinguistic and non-linguistic information contained in the speech and its caption.
  • Reference 1 Y. Wang, D. Stanton, Y. Zhang, R.-S. Ryan, E. Battenberg, J. Shor, et al., "Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis," in ICML, 2018.
  • Reference 2 Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451-3460, 2021.
  • Reference 3 Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al., "WavLM: Large-scale self-supervised pre-training for full stack speech processing," IEEE J-STSP, vol. 16, no. 6, pp. 1505-1518, 2022.
  • Reference 4 R. Mokady, A. Hertz, and A. H. Bermano, "ClipCap: CLIP prefix for image captioning," arXiv preprint arXiv: 2111.09734, 2021.
  • Reference 5 T. Brown, B. Mann, N. Ryder, et al., "Language models are few-shot learners," in Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • 10 Caption generation device, 101 Input device, 102 Display device, 103 External I/F, 103a Recording medium, 104 Communication I/F, 105 RAM, 106 ROM, 107 Auxiliary storage device, 108 Processor, 109 Bus, 201 Input unit, 202 Voice conversion unit, 203 Caption generation unit, 204 Correct caption conversion unit, 205 Parameter update unit, 206 Learning data storage unit, 207 Model parameter storage unit, 208 Output unit, 209 Identification unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

A learning device according to one aspect includes: an input unit for inputting learning data including voice data and training data representing a natural language description of paralinguistic information or non-language information included in the voice represented by the voice data; a calculation unit for calculating, by a machine learning model that takes a voice sequence, obtained by converting the voice data into a predetermined format for each predetermined unit, as input and calculates the generation probability of a natural language description of paralinguistic information or non-language information included in the voice, the generation probability of the natural language description of paralinguistic information or non-language information included in the voice represented by the voice data of the conversion source of the input voice sequence; and an update unit for updating learnable parameters of the machine learning model on the basis of the error between the natural language description generated according to the generation probability and the training data included in the learning data.

Description

Learning device, generation device, learning method, generation method, and program

This disclosure relates to a learning device, a generating device, a learning method, a generating method, and a program.

In recent years, in the field of speech processing, many technologies have been proposed for recognizing or detecting paralinguistic and non-linguistic information contained in speech (for example, Non-Patent Document 1). Here, paralinguistic information and non-linguistic information both refer to information contained in speech that is not linguistic information. Hereinafter, paralinguistic information and non-linguistic information will be collectively referred to as "paralinguistic and non-linguistic information."

Non-Patent Document 1: M. B. Akcay and K. Oguz, "Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers," Speech Communication, vol. 116, pp. 56-76, 2020.

However, conventional technologies that recognize or detect paralinguistic and non-linguistic information contained in speech often only output the results of the recognition or detection, and are not highly interpretable.

The present disclosure has been made in consideration of the above points, and provides technology that can output paralinguistic and non-linguistic information contained in speech in a highly interpretable format.

A learning device according to one aspect of the present disclosure includes an input unit that inputs learning data including voice data and teacher data representing a natural language description of paralinguistic information or non-linguistic information contained in the voice represented by the voice data, a calculation unit that uses as input a voice sequence obtained by converting the voice data into a predetermined format for each predetermined unit, and calculates the generation probability of a natural language description of paralinguistic information or non-linguistic information contained in the voice represented by the voice data from which the input voice sequence is converted using a machine learning model that calculates the generation probability of a natural language description of paralinguistic information or non-linguistic information contained in the voice, and an update unit that updates the learnable parameters of the machine learning model based on the error between the natural language description generated according to the generation probability and the teacher data included in the learning data.

Technology is provided that can output paralinguistic and non-linguistic information contained in speech in a highly interpretable format.

FIG. 1 is a diagram illustrating an example of the hardware configuration of a caption generation device according to a first embodiment.
FIG. 2 is a diagram showing an example of the functional configuration of the caption generation device according to the first embodiment during learning.
FIG. 3 is a diagram illustrating an example of learning data stored in a learning data storage unit.
FIG. 4 is a diagram illustrating an example of a detailed functional configuration of a caption generation unit according to the first embodiment.
FIG. 5 is a flowchart illustrating an example of a learning process according to the first embodiment.
FIG. 6 is a flowchart illustrating an example of a generation process of an output token probability sequence according to the first embodiment.
FIG. 7 is a diagram illustrating an example of a functional configuration at the time of inference of the caption generation device according to the first embodiment.
FIG. 8 is a flowchart showing an example of a caption generation process according to the first embodiment.
FIG. 9 is a flowchart illustrating an example of a generation process of an output token sequence according to the first embodiment.
FIG. 10 is a diagram showing an example of the functional configuration of the caption generation device according to a second embodiment during learning.
FIG. 11 is a diagram illustrating an example of a detailed functional configuration of a caption generation unit according to the second embodiment.
FIG. 12 is a flowchart illustrating an example of a learning process according to the second embodiment.
FIG. 13 is a flowchart illustrating an example of a generation process of an output token probability sequence according to the second embodiment.
FIG. 14 is a diagram showing an example of the functional configuration at the time of inference of the caption generation device according to the second embodiment.
FIG. 15 is a flowchart showing an example of a caption generation process according to the second embodiment.
FIG. 16 is a flowchart illustrating an example of a generation process of an output token sequence according to the second embodiment.

Each embodiment of the present invention will be described below with reference to the drawings. In each of the following embodiments, as an example, a caption generation device 10 will be described that can output paralinguistic and non-linguistic information contained in audio in the form of a caption written in natural language. A caption is generally a term that refers to an explanatory text about the contents of, for example, a video or a photo, but in the following embodiments, the term is used to refer to an explanatory text that describes the paralinguistic and non-linguistic information contained in audio in natural language. This makes it possible to output the paralinguistic and non-linguistic information contained in audio in the highly interpretable form of a caption. Note that instead of the term caption, terms such as "natural language sentence", "explanatory text", "sentence", and "character string" may be used.

However, outputting paralinguistic and non-linguistic information contained in audio in the form of captions is just one example, and the highly interpretable format for expressing paralinguistic and non-linguistic information is not limited to captions. Each of the embodiments described below can be similarly applied to cases where paralinguistic and non-linguistic information is expressed in a highly interpretable format other than captions. Examples of highly interpretable formats other than captions include figures, tables, images, etc. that include explanatory text that describes the paralinguistic and non-linguistic information in natural language.

Note that paralinguistic information and non-linguistic information both refer to information contained in speech that is not linguistic information, but generally, paralinguistic information refers to information that can be changed at will, while non-linguistic information refers to information that cannot be changed at will. Examples of paralinguistic information include information that indicates the speaker's intentions and attitudes. On the other hand, examples of non-linguistic information include information that indicates the speaker's gender and emotions, the presence or absence of a certain illness, and information that identifies the speaker.

[First embodiment]
A first embodiment will be described below. Here, the caption generation device 10 according to the first embodiment has a "learning time" during which a machine learning model that generates captions (hereinafter referred to as a "captioning model") is trained, and an "inference time" during which a caption is generated from a voice using the trained captioning model. In the following, it is assumed that the caption generation device 10 is realized by the same device during learning and inference, but for example, the caption generation device 10 may be realized by different devices during learning and inference.

Note that the caption generation device 10 during learning may be called, for example, a "learning device." Furthermore, the caption generation device 10 during inference may be called, for example, simply a "generation device" or an "inference device."

<Hardware configuration example of the caption generation device 10 according to the first embodiment>
An example of the hardware configuration of the caption generation device 10 according to the first embodiment will be described with reference to Fig. 1. Fig. 1 is a diagram showing an example of the hardware configuration of the caption generation device 10 according to the first embodiment.

As shown in FIG. 1, the caption generating device 10 according to the first embodiment has an input device 101, a display device 102, an external I/F 103, a communication I/F 104, a RAM (Random Access Memory) 105, a ROM (Read Only Memory) 106, an auxiliary storage device 107, and a processor 108. Each of these pieces of hardware is connected to each other so as to be able to communicate with each other via a bus 109.

The input device 101 is, for example, a keyboard, a mouse, a touch panel, a physical button, etc. The display device 102 is, for example, a display, a display panel, etc. Note that the caption generation device 10 does not have to have at least one of the input device 101 and the display device 102, for example.

The external I/F 103 is an interface with external devices such as a recording medium 103a. Examples of the recording medium 103a include a CD (Compact Disc), a DVD (Digital Versatile Disk), an SD memory card (Secure Digital memory card), and a USB (Universal Serial Bus) memory card.

The communication I/F 104 is an interface for connecting to a communication network. The RAM 105 is a volatile semiconductor memory (storage device) that temporarily stores programs and data. The ROM 106 is a non-volatile semiconductor memory (storage device) that can store programs and data even when the power is turned off. The auxiliary storage device 107 is a non-volatile storage device such as an HDD (Hard Disk Drive), an SSD (Solid State Drive), or a flash memory. The processor 108 is one of various types of arithmetic devices such as a CPU (Central Processing Unit) or a GPU (Graphic Processing Unit).

Note that the hardware configuration shown in FIG. 1 is an example, and the hardware configuration of the caption generation device 10 is not limited to this. For example, the caption generation device 10 may have multiple auxiliary storage devices 107 and multiple processors 108, may not have some of the hardware shown in the figure, or may have various hardware other than the hardware shown in the figure.

- Learning time
Hereinafter, the learning time of the caption generation device 10 according to the first embodiment will be described.

<Example of functional configuration during learning of the caption generation device 10 according to the first embodiment>
An example of a functional configuration of the caption generation device 10 according to the first embodiment during learning will be described with reference to Fig. 2. Fig. 2 is a diagram showing an example of a functional configuration of the caption generation device 10 according to the first embodiment during learning.

As shown in FIG. 2, during learning, the caption generation device 10 according to the first embodiment has an input unit 201, a voice conversion unit 202, a caption generation unit 203, a correct caption conversion unit 204, and a parameter update unit 205. Each of these units is realized, for example, by a process in which one or more programs installed in the caption generation device 10 are executed by the processor 108. Furthermore, during learning, the caption generation device 10 according to the first embodiment has a learning data storage unit 206 and a model parameter storage unit 207. Each of these storage units is realized, for example, by a storage area such as the auxiliary storage device 107. However, for example, at least one of the storage units of the learning data storage unit 206 and the model parameter storage unit 207 may be realized by a storage area of a storage device or the like that is communicably connected to the caption generation device 10.

 The input unit 201 inputs the learning data stored in the learning data storage unit 206. Here, the learning data is data for training the captioning model and is represented as a pair of voice data representing the voice of a certain speaker and a correct caption that describes, in natural language, the paralinguistic and non-linguistic information contained in that voice.

 The voice conversion unit 202 converts the voice data included in the learning data input by the input unit 201 into an input voice sequence. The input voice sequence is the voice sequence that is input to the captioning model. A voice sequence is sequence data obtained by converting voice data, through signal processing or the like, into a format that can be handled frame by frame. Specific examples of voice sequences include sequence data converted into formats such as a spectrum, mel-frequency cepstral coefficients (MFCC), or a mel spectrogram.

 However, depending on the model that realizes the voice encoding unit 301 described later, the voice conversion unit 202 may instead use, as the input voice sequence, data obtained by dividing the voice waveform of the voice data into frames.
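 For illustration only, the following sketch (in Python, using the torchaudio library; the parameter values such as the sampling rate, FFT size, and frame length are assumptions and not part of this disclosure) shows one way the voice conversion unit 202 might produce either a mel-spectrogram sequence or a frame-split waveform as the input voice sequence.

    # Illustrative sketch only; parameter values are assumptions, not part of this disclosure.
    import torch
    import torchaudio

    def to_mel_sequence(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
        # Convert a mono waveform of shape (1, T) into a mel-spectrogram sequence (frames, 80).
        mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=1024, hop_length=256, n_mels=80
        )(waveform)
        return mel.squeeze(0).transpose(0, 1)

    def to_frame_sequence(waveform: torch.Tensor, frame_length: int = 320) -> torch.Tensor:
        # Alternatively, split the raw waveform into fixed-length frames for an SSL-based encoder.
        num_samples = waveform.shape[-1]
        usable = num_samples - (num_samples % frame_length)
        return waveform[..., :usable].reshape(-1, frame_length)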

 The caption generation unit 203 is realized by the captioning model and generates an output token probability sequence from the input voice sequence converted by the voice conversion unit 202. The output token probability sequence is a sequence of generation probabilities of the output tokens that constitute a caption. A token is a string of one or more characters representing a certain unit; a typical example is a word. The generation probability of an output token is expressed, for example, as an N-dimensional vector, where N is the number of token types that can be generated, whose elements sum to 1 and where the value of each element is the probability that the token corresponding to that element is generated. A detailed example of the functional configuration of the caption generation unit 203 will be described later.

 The correct-caption conversion unit 204 converts the correct caption included in the learning data input by the input unit 201 into a correct-token probability sequence. The correct-token probability sequence is a sequence of probabilities corresponding to the correct tokens that constitute the correct caption. The probability corresponding to a correct token is expressed, for example, as an N-dimensional vector, where N is the number of token types to be generated, in which only the element corresponding to the correct token has the value 1 and all other elements have the value 0.
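 As a small numerical illustration of the two representations above (the vocabulary size N = 5 is an assumption chosen only for readability), an output-token probability vector and a one-hot correct-token vector might look as follows.

    # Illustrative sketch only; N = 5 is an assumed vocabulary size.
    import torch

    N = 5
    output_probs = torch.softmax(torch.randn(N), dim=-1)  # generation probabilities; elements sum to 1
    correct_index = 2                                      # index of the correct token
    correct_probs = torch.zeros(N)
    correct_probs[correct_index] = 1.0                     # one-hot: 1 for the correct token, 0 elsewhere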

 The parameter update unit 205 updates the learnable parameters of the captioning model (hereinafter, "model parameters") using the error between the output token probability sequence generated by the caption generation unit 203 and the correct-token probability sequence converted by the correct-caption conversion unit 204. As the error, for example, the cross-entropy error can be used.

 The learning data storage unit 206 stores learning data created in advance. Specific examples of the learning data will be described later.

 The model parameter storage unit 207 stores the model parameters of the captioning model.

 <<Example of the learning data stored in the learning data storage unit 206>>
 An example of the learning data stored in the learning data storage unit 206 will be described with reference to FIG. 3. FIG. 3 is a diagram showing an example of the learning data stored in the learning data storage unit 206.

 As shown in FIG. 3, the learning data storage unit 206 stores one or more pieces of learning data. Each piece of learning data includes voice data representing the voice of a certain speaker uttering a certain sentence and a correct caption, which is an explanatory text describing in natural language the paralinguistic and non-linguistic information perceived by a person who hears that voice. The correct caption may also be called, for example, teacher data.

 For example, in the example shown in FIG. 3, the learning data in the first row includes voice data 1 and the correct caption "An elderly man speaking slowly." Similarly, the learning data in the second row includes voice data 2 and the correct caption "The voice of a young woman who seems to have a cold." The learning data in the third row includes voice data 3 and the correct caption "A little girl playing happily with her friends." The learning data in the fourth row includes voice data 4 and the correct caption "A male university student whispering to the person next to him."

 In this way, the learning data storage unit 206 stores a plurality of pieces of learning data, each represented as a pair of voice data and its correct caption. The number of pieces of learning data stored in the learning data storage unit 206 is preferably several hundred to several thousand or more. It is also preferable that the voice data included in the learning data cover several hundred or more speakers and that the utterances represented by the voice data cover several to several tens of distinct sentences or more.

 <<Detailed example of the functional configuration of the caption generation unit 203 according to the first embodiment>>
 A detailed example of the functional configuration of the caption generation unit 203 according to the first embodiment will be described with reference to FIG. 4. FIG. 4 is a diagram showing an example of the detailed functional configuration of the caption generation unit 203 according to the first embodiment. In the following, as an example, a case is described in which the captioning model is composed of three machine learning models including neural networks: a voice encoder, a conversion network, and a text decoder. However, this is only an example; the captioning model does not necessarily have to be composed of these three machine learning models and may have another configuration. For example, the captioning model may be composed of a single machine learning model including a neural network.

 As shown in FIG. 4, the caption generation unit 203 according to the first embodiment is composed of a voice encoding unit 301 realized by the voice encoder, a vector conversion unit 302 realized by the conversion network, and a text decoding unit 303 realized by the text decoder.

 The voice encoding unit 301 receives the input voice sequence as input and generates a fixed-length vector (hereinafter, "voice fixed-length vector") expressing the features of the input voice sequence (features representing paralinguistic and non-linguistic information). As the voice encoder realizing the voice encoding unit 301, for example, Global Style Tokens (Reference 1) or the like can be used. Alternatively, a speech representation model based on self-supervised learning (a speech SSL (Self-Supervised Learning) model), such as HuBERT (Reference 2) or WavLM (Reference 3), may be used in combination with Global Style Tokens or the like.

 When a speech SSL model is used as the voice encoder realizing the voice encoding unit 301, the above-described voice conversion unit 202 uses, as the input voice sequence, data obtained by dividing the voice waveform of the voice data into frames. In this case, the input voice sequence is fed to the speech SSL model, and the resulting sequence of frame-level features output by the speech SSL model is input to Global Style Tokens or the like to generate the voice fixed-length vector.
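 A minimal sketch of such a voice encoder is shown below, assuming a HuBERT base model from the transformers library and a single learned attention query in place of the full Global Style Token module; the model name, the hidden size of 768, and the pooling scheme are simplifying assumptions rather than a configuration required by this disclosure.

    # Minimal sketch under stated assumptions; not the actual Global Style Token implementation.
    import torch
    import torch.nn as nn
    from transformers import HubertModel

    class VoiceEncoder(nn.Module):
        def __init__(self, ssl_name: str = "facebook/hubert-base-ls960", dim: int = 768):
            super().__init__()
            self.ssl = HubertModel.from_pretrained(ssl_name)   # speech SSL model (may be frozen; see step S105)
            self.query = nn.Parameter(torch.randn(1, 1, dim))  # learned pooling query
            self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

        def forward(self, waveform: torch.Tensor) -> torch.Tensor:
            # waveform: (batch, samples); HuBERT performs its own framing internally
            feats = self.ssl(waveform).last_hidden_state       # (batch, frames, dim) frame-level features
            pooled, _ = self.attn(self.query.expand(feats.size(0), -1, -1), feats, feats)
            return pooled.squeeze(1)                           # (batch, dim) voice fixed-length vector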

 The vector conversion unit 302 receives the voice fixed-length vector generated by the voice encoding unit 301 as input and generates a sequence of P text information vectors (hereinafter, "text information vector sequence") that can be handled by the text decoding unit 303. In other words, the vector conversion unit 302 converts the input voice fixed-length vector into a text information vector sequence. Here, P is a fixed value determined in advance. The text information vector sequence is a vector sequence representing word embedding expressions obtained by reflecting, in P vectors included in the model parameters, the features expressed by the input voice fixed-length vector. These P vectors included in the model parameters are themselves learnable parameters.

 As the conversion network realizing the vector conversion unit 302, for example, an ordinary MLP (multilayer perceptron) can be used, similarly to the Mapping Network described in Reference 4. Alternatively, a neural network capable of taking preceding and following sequence information into account, such as a Transformer encoder, or a neural network combining an MLP, a Transformer encoder, and the like can be used.
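 A minimal sketch of an MLP-based conversion network is shown below; the dimensions and the prefix length P = 10 are assumptions for illustration only.

    # Minimal sketch under stated assumptions (speech_dim, text_dim, and P are illustrative).
    import torch
    import torch.nn as nn

    class MappingNetwork(nn.Module):
        def __init__(self, speech_dim: int = 768, text_dim: int = 768, prefix_len: int = 10):
            super().__init__()
            self.prefix_len = prefix_len
            self.text_dim = text_dim
            self.mlp = nn.Sequential(
                nn.Linear(speech_dim, text_dim * prefix_len),
                nn.Tanh(),
                nn.Linear(text_dim * prefix_len, text_dim * prefix_len),
            )

        def forward(self, voice_vec: torch.Tensor) -> torch.Tensor:
            # voice_vec: (batch, speech_dim) -> (batch, P, text_dim) text information vector sequence
            return self.mlp(voice_vec).view(-1, self.prefix_len, self.text_dim)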

 The text decoding unit 303 receives the text information vector sequence converted by the vector conversion unit 302 as input and generates an output token probability sequence.

 As the text decoder realizing the text decoding unit 303, for example, a decoder-type pre-trained model trained on large-scale text data (Reference 5) can be used. Examples of such pre-trained models include large language models (LLMs) such as GPT (Generative Pre-trained Transformer).
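 The following sketch illustrates how the text information vector sequence could be fed to a pretrained decoder as prefix embeddings during learning (teacher forcing); the use of GPT-2 from the transformers library is an assumption, and the prefix dimension must match the decoder's embedding size (768 for the "gpt2" checkpoint).

    # Minimal sketch under stated assumptions; GPT-2 stands in for an arbitrary decoder-type LLM.
    import torch
    from transformers import GPT2LMHeadModel

    decoder = GPT2LMHeadModel.from_pretrained("gpt2")

    def decode_probs(prefix: torch.Tensor, caption_ids: torch.Tensor) -> torch.Tensor:
        # prefix: (batch, P, 768) text information vectors; caption_ids: (batch, L) correct caption tokens
        token_embeds = decoder.transformer.wte(caption_ids)       # embed the caption tokens
        inputs_embeds = torch.cat([prefix, token_embeds], dim=1)  # prepend the prefix vectors
        logits = decoder(inputs_embeds=inputs_embeds).logits
        # probabilities for each caption position (position P-1 predicts the first caption token)
        return torch.softmax(logits[:, prefix.size(1) - 1 : -1, :], dim=-1)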

 The text decoding unit 303 can also generate an output token sequence constituting a caption, for example, by sampling output tokens according to the generation probabilities included in the output token probability sequence.

 <Learning process according to the first embodiment>
 The learning process according to the first embodiment will be described below with reference to FIG. 5. FIG. 5 is a flowchart showing an example of the learning process according to the first embodiment. In the following, online learning is assumed as an example, and steps S101 to S105 in FIG. 5 are repeatedly executed for each piece of learning data. However, this is only an example, and the learning process may be executed by mini-batch learning, batch learning, or the like.

 First, the input unit 201 inputs one piece of learning data stored in the learning data storage unit 206 (step S101).

 Next, the voice conversion unit 202 converts the voice data included in the learning data input in step S101 into an input voice sequence (step S102).

 Next, the caption generation unit 203 generates an output token probability sequence from the input voice sequence converted in step S102 (step S103). Details of the processing in this step will be described later.

 Next, the correct-caption conversion unit 204 converts the correct caption included in the learning data input in step S101 into a correct-token probability sequence (step S104).

 Then, the parameter update unit 205 updates the model parameters using the error between the output token probability sequence generated in step S103 and the correct-token probability sequence converted in step S104 (step S105). That is, the parameter update unit 205 updates the model parameters by a known optimization method so as to minimize the error. At this time, however, the parameter update unit 205 may exclude the model parameters of the text decoder realizing the text decoding unit 303 from the update targets. Likewise, when a speech SSL model is used as the voice encoder realizing the voice encoding unit 301, the parameter update unit 205 may exclude the model parameters of the speech SSL model from the update targets. When the parameter update unit 205 does update the model parameters of the text decoder or of the speech SSL model, it may update all of those model parameters or may use an adapter method such as LoRA (Reference 6).

 For example, for i = 1, ..., I, let t(i) be the i-th output token and p(t(i)) be the generation probability (posterior probability) of the i-th output token given that the output tokens t(1) to t(i-1) have been obtained. On the other hand, let s(i) be the i-th correct token and p(s(i)) be the probability corresponding to the i-th correct token. In this case, when the cross-entropy error is used, the parameter update unit 205 updates the model parameters so as to minimize the sum of -p(s(i)) log p(t(i)) over i = 1, ..., I.
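 A direct transcription of this criterion into code might look as follows (a sketch only; in practice a library routine computing cross-entropy over logits would typically be used instead).

    # Sketch of the cross-entropy error: the sum over i of -p(s(i)) * log p(t(i)).
    import torch

    def caption_loss(output_probs: torch.Tensor, correct_probs: torch.Tensor) -> torch.Tensor:
        # output_probs, correct_probs: (I, N) for I output tokens and N token types
        return -(correct_probs * torch.log(output_probs + 1e-9)).sum()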

 By repeatedly executing steps S101 to S105 described above, a captioning model in which learned model parameters are set (that is, a trained captioning model) is obtained.

 <<Process of generating the output token probability sequence>>
 Details of the process of generating the output token probability sequence in step S103 will be described below with reference to FIG. 6. FIG. 6 is a flowchart showing an example of the process of generating the output token probability sequence according to the first embodiment.

 First, the voice encoding unit 301 of the caption generation unit 203 generates a voice fixed-length vector from the input voice sequence (step S201).

 Next, the vector conversion unit 302 of the caption generation unit 203 converts the voice fixed-length vector generated in step S201 into a text information vector sequence (step S202).

 Then, the text decoding unit 303 of the caption generation unit 203 generates an output token probability sequence from the text information vector sequence converted in step S202 (step S203).

 - During inference
 The following describes the caption generation device 10 according to the first embodiment during inference. It is assumed that the model parameters of the captioning model have already been learned at the time of inference. The description of inference mainly covers the differences from learning, and descriptions of components similar to those during learning are omitted.

 <Example functional configuration of the caption generation device 10 according to the first embodiment during inference>
 An example functional configuration of the caption generation device 10 according to the first embodiment during inference will be described with reference to FIG. 7. FIG. 7 is a diagram showing an example of the functional configuration of the caption generation device 10 according to the first embodiment during inference.

 As shown in FIG. 7, during inference, the caption generation device 10 according to the first embodiment has the input unit 201, the voice conversion unit 202, the caption generation unit 203, and an output unit 208. The output unit 208 is realized, for example, by processing that one or more programs installed in the caption generation device 10 cause the processor 108 to execute. During inference, the caption generation device 10 according to the first embodiment also has the model parameter storage unit 207.

 When voice data for which a caption is to be generated is given, the input unit 201 inputs this voice data.

 The voice conversion unit 202 converts the voice data input by the input unit 201 into an input voice sequence.

 The caption generation unit 203 is realized by the trained captioning model and generates an output token sequence from the input voice sequence converted by the voice conversion unit 202.

 The output unit 208 outputs a caption composed of the output token sequence generated by the caption generation unit 203 to a predetermined output destination. The output destination is not limited to any specific destination, and any output destination may be targeted; for example, a storage area of the auxiliary storage device 107, the display device 102 such as a display, or another device communicably connected to the caption generation device 10 may be used.

 The model parameter storage unit 207 stores the learned model parameters of the captioning model.

 <Caption generation process according to the first embodiment>
 The caption generation process according to the first embodiment will be described below with reference to FIG. 8. FIG. 8 is a flowchart showing an example of the caption generation process according to the first embodiment. In the following, it is assumed that voice data for which a caption is to be generated has been given to the caption generation device 10.

 First, the input unit 201 inputs the given voice data (step S301).

 Next, the voice conversion unit 202 converts the voice data input in step S301 into an input voice sequence (step S302).

 Next, the caption generation unit 203 generates an output token sequence from the input voice sequence converted in step S302 (step S303). Details of the processing in this step will be described later.

 Then, the output unit 208 outputs the caption composed of the output token sequence generated in step S303 to a predetermined output destination (step S304). As a result, a caption describing in natural language the paralinguistic and non-linguistic information contained in the voice represented by the given voice data is obtained.

 <<Process of generating the output token sequence>>
 Details of the process of generating the output token sequence in step S303 will be described below with reference to FIG. 9. FIG. 9 is a flowchart showing an example of the process of generating the output token sequence according to the first embodiment.

 First, the voice encoding unit 301 of the caption generation unit 203 generates a voice fixed-length vector from the input voice sequence (step S401).

 Next, the vector conversion unit 302 of the caption generation unit 203 converts the voice fixed-length vector generated in step S401 into a text information vector sequence (step S402).

 Then, the text decoding unit 303 of the caption generation unit 203 generates an output token sequence from the text information vector sequence converted in step S402 (step S403). The output token sequence can be generated, for example, by sampling output tokens according to the generation probabilities included in the output token probability sequence.
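 A minimal sketch of this sampling step is shown below; greedy selection of the highest-probability token at each position would be an equally valid alternative.

    # Sketch of sampling an output token sequence from the output token probability sequence.
    import torch

    def sample_tokens(prob_sequence: torch.Tensor) -> torch.Tensor:
        # prob_sequence: (I, N) generation probabilities for I output positions
        return torch.multinomial(prob_sequence, num_samples=1).squeeze(-1)  # (I,) token indices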

 [Second embodiment]
 The second embodiment will be described below. For example, when a classification model that classifies paralinguistic and non-linguistic information contained in voice is applied to the medical field to detect an illness or its signs from voice, the classification result of the classification model is expected to require a justification, and the natural language description expressing that justification is expected to change depending on the classification result.

 Therefore, the second embodiment describes a case in which the model parameters are learned and captions are generated while also taking into account the classification result of a classification model that classifies paralinguistic and non-linguistic information contained in voice.

 The second embodiment is described mainly in terms of its differences from the first embodiment, and descriptions of components similar to those of the first embodiment are omitted.

 - During learning
 The following describes the caption generation device 10 according to the second embodiment during learning.

 <Example functional configuration of the caption generation device 10 according to the second embodiment during learning>
 An example functional configuration of the caption generation device 10 according to the second embodiment during learning will be described with reference to FIG. 10. FIG. 10 is a diagram showing an example of the functional configuration of the caption generation device 10 according to the second embodiment during learning.

 As shown in FIG. 10, during learning, the caption generation device 10 according to the second embodiment has, in addition to the units described in the first embodiment, a classification unit 209 realized by a classification model that classifies paralinguistic and non-linguistic information contained in voice. The classification unit 209 is realized, for example, by processing that one or more programs installed in the caption generation device 10 cause the processor 108 to execute.

 The classification unit 209 generates, as a classification result, the result of classifying paralinguistic and non-linguistic information from the input voice sequence converted by the voice conversion unit 202. The classification model realizing the classification unit 209 is not limited to any specific model; any classification model capable of classifying arbitrary paralinguistic and non-linguistic information, such as illness, gender, or emotion, from the input voice sequence can be used. The classification unit 209 may be realized by a single classification model or by a plurality of classification models. When the classification unit 209 is realized by a plurality of classification models, it generates, as the classification result, the results of classifying a plurality of pieces of paralinguistic and non-linguistic information from the input voice sequence.

 The one or more classification models realizing the classification unit 209 may be trained in advance or may be trained together with the captioning model. When they are trained together with the captioning model, the learning data includes, in addition to the voice data and the correct caption, correct paralinguistic and non-linguistic information representing the correct classification result.
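 For illustration only, the following sketch shows one possible form of such a classification model: a recurrent network over the input voice sequence followed by one sigmoid output per attribute. The attributes, feature dimension, and network size are assumptions, and any classification model may be substituted.

    # Minimal sketch under stated assumptions; any classifier over the input voice sequence may be used.
    import torch
    import torch.nn as nn

    class ParalinguisticClassifier(nn.Module):
        def __init__(self, feat_dim: int = 80, num_attributes: int = 3):
            super().__init__()
            self.rnn = nn.GRU(feat_dim, 128, batch_first=True)
            self.head = nn.Linear(128, num_attributes)

        def forward(self, voice_seq: torch.Tensor) -> torch.Tensor:
            # voice_seq: (batch, frames, feat_dim) -> (batch, num_attributes) classification results in [0, 1]
            _, h = self.rnn(voice_seq)
            return torch.sigmoid(self.head(h[-1]))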

 The caption generation unit 203 generates an output token probability sequence from the input voice sequence converted by the voice conversion unit 202 and the classification result generated by the classification unit 209. A detailed example of the functional configuration of the caption generation unit 203 will be described later.

 <<Detailed example of the functional configuration of the caption generation unit 203 according to the second embodiment>>
 A detailed example of the functional configuration of the caption generation unit 203 according to the second embodiment will be described with reference to FIG. 11. FIG. 11 is a diagram showing an example of the detailed functional configuration of the caption generation unit 203 according to the second embodiment. As in the first embodiment, the following describes, as an example, a case in which the captioning model is composed of three machine learning models: a voice encoder, a conversion network, and a text decoder.

 As shown in FIG. 11, the caption generation unit 203 according to the second embodiment is composed of the voice encoding unit 301 realized by the voice encoder, the vector conversion unit 302 realized by the conversion network, and the text decoding unit 303 realized by the text decoder.

 The vector conversion unit 302 receives, as input, the voice fixed-length vector generated by the voice encoding unit 301 and the classification result generated by the classification unit 209, and generates a text information vector sequence. In other words, the vector conversion unit 302 converts the input voice fixed-length vector and the input classification result into a text information vector sequence. Specifically, the vector conversion unit 302 takes a vector obtained by concatenating the voice fixed-length vector generated by the voice encoding unit 301 and the classification result generated by the classification unit 209 as a new voice fixed-length vector, and then generates a text information vector sequence from this new vector in the same manner as the vector conversion unit 302 according to the first embodiment. This makes it possible to generate a text information vector sequence that takes into account not only the voice fixed-length vector but also the classification result of the classification model, and consequently to generate a caption composed of an output token sequence that takes both into account.

 The vector obtained by concatenating the voice fixed-length vector and the classification result is, for example, when the voice fixed-length vector is an M-dimensional vector and there are L classification results each of which is a scalar taking the binary value 0 or 1, an (M+L)-dimensional vector whose elements are the elements of the voice fixed-length vector and the L classification results.
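 As a small sketch of this concatenation (the dimensions M = 768 and L = 3 are assumptions chosen for illustration):

    # Sketch of forming the new voice fixed-length vector from an M-dimensional vector and L scalar results.
    import torch

    voice_vec = torch.randn(1, 768)                  # M = 768 (assumed)
    class_results = torch.tensor([[1.0, 0.0, 1.0]])  # L = 3 binary classification results (assumed)
    combined = torch.cat([voice_vec, class_results], dim=-1)  # shape (1, 771), i.e. M + L dimensions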

 <Learning process according to the second embodiment>
 The learning process according to the second embodiment will be described below with reference to FIG. 12. FIG. 12 is a flowchart showing an example of the learning process according to the second embodiment. In the following, online learning is assumed as an example, and steps S501 to S506 in FIG. 12 are repeatedly executed for each piece of learning data.

 Steps S501 and S502 in FIG. 12 are the same as steps S101 and S102 in FIG. 5, respectively, and their description is therefore omitted.

 Following step S502, the classification unit 209 generates, as a classification result, the result of classifying paralinguistic and non-linguistic information from the input voice sequence converted in step S502 (step S503).

 Next, the caption generation unit 203 generates an output token probability sequence from the input voice sequence converted in step S502 and the classification result generated in step S503 (step S504). Details of the processing in this step will be described later.

 Next, as in step S104 of FIG. 5, the correct-caption conversion unit 204 converts the correct caption included in the learning data input in step S501 into a correct-token probability sequence (step S505).

 Then, as in step S105 of FIG. 5, the parameter update unit 205 updates the model parameters using the error between the output token probability sequence generated in step S504 and the correct-token probability sequence converted in step S505 (step S506). However, when the one or more classification models realizing the classification unit 209 are also trained, the parameter update unit 205 additionally uses the error between the correct paralinguistic and non-linguistic information included in the learning data input in step S501 and the paralinguistic and non-linguistic information represented by the classification result generated in step S503, and updates the learnable parameters of the one or more classification models realizing the classification unit 209 in addition to the model parameters.
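 One way this joint update could be written is sketched below; the equal weighting of the two error terms is an assumption, as this disclosure does not fix a particular way of combining them.

    # Sketch of a joint loss: caption error plus classification error (weighting is an assumption).
    import torch
    import torch.nn.functional as F

    def joint_loss(output_probs, correct_probs, predicted_labels, correct_labels, weight: float = 1.0):
        caption_err = -(correct_probs * torch.log(output_probs + 1e-9)).sum()
        class_err = F.binary_cross_entropy(predicted_labels, correct_labels)
        return caption_err + weight * class_err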

 By repeatedly executing steps S501 to S506 described above, a captioning model in which learned model parameters are set (that is, a trained captioning model) is obtained.

 <<Process of generating the output token probability sequence>>
 Details of the process of generating the output token probability sequence in step S504 will be described below with reference to FIG. 13. FIG. 13 is a flowchart showing an example of the process of generating the output token probability sequence according to the second embodiment.

 First, as in step S201 of FIG. 6, the voice encoding unit 301 of the caption generation unit 203 generates a voice fixed-length vector from the input voice sequence (step S601).

 Next, the vector conversion unit 302 of the caption generation unit 203 converts the voice fixed-length vector generated in step S601 and the classification result generated in step S503 of FIG. 12 into a text information vector sequence (step S602).

 Then, as in step S203 of FIG. 6, the text decoding unit 303 of the caption generation unit 203 generates an output token probability sequence from the text information vector sequence converted in step S602 (step S603).

 - During inference
 The following describes the caption generation device 10 according to the second embodiment during inference. It is assumed that both the model parameters of the captioning model and the learnable parameters of the classification model have already been learned at the time of inference. The description of inference mainly covers the differences from learning, and descriptions of components similar to those during learning are omitted.

 <Example functional configuration of the caption generation device 10 according to the second embodiment during inference>
 An example functional configuration of the caption generation device 10 according to the second embodiment during inference will be described with reference to FIG. 14. FIG. 14 is a diagram showing an example of the functional configuration of the caption generation device 10 according to the second embodiment during inference.

 As shown in FIG. 14, during inference, the caption generation device 10 according to the second embodiment has the classification unit 209 in addition to the units described in the first embodiment.

 The classification unit 209 generates, as a classification result, the result of classifying paralinguistic and non-linguistic information from the input voice sequence converted by the voice conversion unit 202.

 The caption generation unit 203 is realized by the trained captioning model and generates an output token sequence from the input voice sequence converted by the voice conversion unit 202 and the classification result generated by the classification unit 209.

 <Caption generation process according to the second embodiment>
 The caption generation process according to the second embodiment will be described below with reference to FIG. 15. FIG. 15 is a flowchart showing an example of the caption generation process according to the second embodiment. In the following, it is assumed that voice data for which a caption is to be generated has been given to the caption generation device 10.

 Steps S701 and S702 in FIG. 15 are the same as steps S301 and S302 in FIG. 8, respectively, and their description is therefore omitted.

 Following step S702, the classification unit 209 generates, as a classification result, the result of classifying paralinguistic and non-linguistic information from the input voice sequence converted in step S702 (step S703).

 Next, the caption generation unit 203 generates an output token sequence from the input voice sequence converted in step S702 and the classification result generated in step S703 (step S704).

 Then, as in step S304 of FIG. 8, the output unit 208 outputs the caption composed of the output token sequence generated in step S704 to a predetermined output destination (step S705). As a result, a caption describing in natural language the paralinguistic and non-linguistic information contained in the voice represented by the given voice data is obtained, while also taking into account the classification result obtained from that voice data by the classification model.

 <<Process of generating the output token sequence>>
 Details of the process of generating the output token sequence in step S704 will be described below with reference to FIG. 16. FIG. 16 is a flowchart showing an example of the process of generating the output token sequence according to the second embodiment.

 First, as in step S401 of FIG. 9, the voice encoding unit 301 of the caption generation unit 203 generates a voice fixed-length vector from the input voice sequence (step S801).

 Next, the vector conversion unit 302 of the caption generation unit 203 converts the voice fixed-length vector generated in step S801 and the classification result generated in step S703 of FIG. 15 into a text information vector sequence (step S802).

 Then, as in step S403 of FIG. 9, the text decoding unit 303 of the caption generation unit 203 generates an output token sequence from the text information vector sequence converted in step S802 (step S803).

 [Summary]
 As described above, the caption generation device 10 according to the first embodiment learns a captioning model from learning data that includes voice data and a correct caption describing in natural language the paralinguistic and non-linguistic information contained in the voice represented by the voice data. The caption generation device 10 according to the first embodiment can then use the trained captioning model to generate a caption describing in natural language the paralinguistic and non-linguistic information contained in the voice represented by given voice data. This makes it possible to output the paralinguistic and non-linguistic information contained in voice in the highly interpretable form of a caption.

 The caption generation device 10 according to the second embodiment also learns the captioning model using the classification result of a classification model that classifies paralinguistic and non-linguistic information from the voice represented by the voice data. The caption generation device 10 according to the second embodiment can then use the trained captioning model and the classification model to generate a caption for the given voice data and to output the paralinguistic and non-linguistic information as a classification result. This makes it possible to obtain both the paralinguistic and non-linguistic information contained in the voice and its caption.

 The present invention is not limited to the specifically disclosed embodiments described above, and various modifications, changes, combinations with known techniques, and the like are possible without departing from the scope of the claims.

 [References]
 Reference 1: Y. Wang, D. Stanton, Y. Zhang, R.-S. Ryan, E. Battenberg, J. Shor, et al., "Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis," in Proc. ICML, 2018.
 Reference 2: Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed, "HuBERT: Self-supervised speech representation learning by masked prediction of hidden units," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451-3460, 2021.
 Reference 3: Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al., "WavLM: Large-scale self-supervised pre-training for full stack speech processing," IEEE J-STSP, vol. 16, no. 6, pp. 1505-1518, 2022.
 Reference 4: R. Mokady, A. Hertz, and A. H. Bermano, "ClipCap: CLIP prefix for image captioning," arXiv preprint arXiv:2111.09734, 2021.
 Reference 5: T. Brown, B. Mann, N. Ryder, et al., "Language models are few-shot learners," in Proc. NeurIPS, 2020.
 Reference 6: E. J. Hu, Y. Shen, P. Wallis, et al., "LoRA: Low-rank adaptation of large language models," in Proc. ICLR, 2022.

 10    Caption generation device
 101   Input device
 102   Display device
 103   External I/F
 103a  Recording medium
 104   Communication I/F
 105   RAM
 106   ROM
 107   Auxiliary storage device
 108   Processor
 109   Bus
 201   Input unit
 202   Voice conversion unit
 203   Caption generation unit
 204   Correct-caption conversion unit
 205   Parameter update unit
 206   Learning data storage unit
 207   Model parameter storage unit
 208   Output unit
 209   Classification unit

Claims (8)

 1. A learning device comprising:
 an input unit that inputs learning data including speech data and teacher data representing a natural language description of paralinguistic information or non-linguistic information contained in the speech represented by the speech data;
 a calculation unit that receives, as input, a speech sequence obtained by converting the speech data into a predetermined format for each predetermined unit and calculates, using a machine learning model that calculates the generation probability of a natural language description of paralinguistic information or non-linguistic information contained in speech, the generation probability of a natural language description of the paralinguistic information or non-linguistic information contained in the speech represented by the speech data from which the input speech sequence was converted; and
 an update unit that updates a learnable parameter of the machine learning model based on an error between a natural language description generated according to the generation probability and the teacher data included in the learning data.
 2. The learning device according to claim 1, wherein the calculation unit
 converts the input speech sequence into fixed-length first information representing features of the speech sequence,
 converts the first information into a sequence of second information representing word embedding expressions that reflect the features of the first information, and
 calculates, from the sequence of second information, the generation probability of the sequence of characters for each predetermined unit included in the natural language description.
 3. The learning device according to claim 2, further comprising a discrimination unit that receives the speech sequence as input and discriminates, using a discrimination model that discriminates paralinguistic information or non-linguistic information contained in speech, the paralinguistic information or non-linguistic information contained in the speech represented by the speech data from which the input speech sequence was converted,
 wherein the calculation unit converts third information, composed of the first information and information representing the paralinguistic information or non-linguistic information discriminated by the discrimination unit, into a sequence of second information representing word embedding expressions that reflect features of the third information.
 4. A generation device comprising a generation unit that receives, as input, a speech sequence obtained by converting given speech data into a predetermined format for each predetermined unit and generates, using a trained machine learning model that generates a natural language description of paralinguistic information or non-linguistic information contained in speech, a natural language description of the paralinguistic information or non-linguistic information contained in the speech represented by the speech data from which the input speech sequence was converted.
 5. The generation device according to claim 4, further comprising a discrimination unit that receives the speech sequence as input and discriminates, using a discrimination model that discriminates paralinguistic information or non-linguistic information contained in speech, the paralinguistic information or non-linguistic information contained in the speech represented by the speech data from which the input speech sequence was converted,
 wherein the generation unit generates the natural language description of the paralinguistic information or non-linguistic information based on the paralinguistic information or non-linguistic information discriminated by the discrimination unit.
 6. A learning method executed by a computer, comprising:
 an input step of inputting learning data including speech data and teacher data representing a natural language description of paralinguistic information or non-linguistic information contained in the speech represented by the speech data;
 a calculation step of receiving, as input, a speech sequence obtained by converting the speech data into a predetermined format for each predetermined unit and calculating, using a machine learning model that calculates the generation probability of a natural language description of paralinguistic information or non-linguistic information contained in speech, the generation probability of a natural language description of the paralinguistic information or non-linguistic information contained in the speech represented by the speech data from which the input speech sequence was converted; and
 an update step of updating a learnable parameter of the machine learning model based on an error between a natural language description generated according to the generation probability and the teacher data included in the learning data.
 7. A generation method executed by a computer, comprising a generation step of receiving, as input, a speech sequence obtained by converting given speech data into a predetermined format for each predetermined unit and generating, using a trained machine learning model that generates a natural language description of paralinguistic information or non-linguistic information contained in speech, a natural language description of the paralinguistic information or non-linguistic information contained in the speech represented by the speech data from which the input speech sequence was converted.
 8. A program that causes a computer to function as the learning device according to any one of claims 1 to 3 or as the generation device according to claim 4 or 5.
PCT/JP2023/040613 2023-11-10 2023-11-10 Learning device, generation device, learning method, generation method, and program Pending WO2025099939A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2023/040613 WO2025099939A1 (en) 2023-11-10 2023-11-10 Learning device, generation device, learning method, generation method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2023/040613 WO2025099939A1 (en) 2023-11-10 2023-11-10 Learning device, generation device, learning method, generation method, and program

Publications (1)

Publication Number Publication Date
WO2025099939A1 true WO2025099939A1 (en) 2025-05-15

Family

ID=95695273

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/040613 Pending WO2025099939A1 (en) 2023-11-10 2023-11-10 Learning device, generation device, learning method, generation method, and program

Country Status (1)

Country Link
WO (1) WO2025099939A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020085070A1 (en) * 2018-10-22 2020-04-30 日本電信電話株式会社 Paralanguage information estimation device, method for estimating paralanguage information, and program
US10878840B1 (en) * 2019-10-15 2020-12-29 Audio Analytic Ltd Method of recognising a sound event
JP2021124642A (en) * 2020-02-06 2021-08-30 本田技研工業株式会社 Information processing device, vehicle, program, and information processing method
JP2022507189A (en) * 2019-04-17 2022-01-18 ▲騰▼▲訊▼科技(深▲セン▼)有限公司 Hidden state generation method and device in recurrent neural network for language processing
JP2023539397A (en) * 2020-06-22 2023-09-14 エスアールアイ インターナショナル Controllable natural paralanguage for text-to-speech synthesis

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020085070A1 (en) * 2018-10-22 2020-04-30 日本電信電話株式会社 Paralanguage information estimation device, method for estimating paralanguage information, and program
JP2022507189A (en) * 2019-04-17 2022-01-18 ▲騰▼▲訊▼科技(深▲セン▼)有限公司 Hidden state generation method and device in recurrent neural network for language processing
US10878840B1 (en) * 2019-10-15 2020-12-29 Audio Analytic Ltd Method of recognising a sound event
JP2021124642A (en) * 2020-02-06 2021-08-30 本田技研工業株式会社 Information processing device, vehicle, program, and information processing method
JP2023539397A (en) * 2020-06-22 2023-09-14 エスアールアイ インターナショナル Controllable natural paralanguage for text-to-speech synthesis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DROSSOS, KONSTANTINOS ET AL.: "Automated audio captioning with recurrent neural networks", IEEE WORKSHOP ON APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS (WASPAA), 2017, pages 374-378, XP033264965, Retrieved from the Internet <URL:https://ieeexplore.ieee.org/abstract/document/8170058> [retrieved on 20240117], DOI: 10.1109/WASPAA.2017.8170058 *

Similar Documents

Publication Publication Date Title
Vashisht et al. Speech recognition using machine learning
CN113692616B (en) Phoneme-based contextualization for cross-language speech recognition in end-to-end models
US10796105B2 (en) Device and method for converting dialect into standard language
CN105679317B (en) Method and apparatus for training language models and recognizing speech
EP3826007B1 (en) Method and apparatus with speech processing
US10963819B1 (en) Goal-oriented dialog systems and methods
JP7678227B2 (en) Joint Unsupervised and Supervised Training (JUST) for Multilingual Automatic Speech Recognition
JP7418991B2 (en) Speech recognition method and device
Dubey et al. Deep speech based end-to-end automated speech recognition (asr) for indian-english accents
US20240304178A1 (en) Using text-injection to recognize speech without transcription
Sen et al. Speech processing and recognition system
Liang et al. A hybrid CTC+ Attention model based on end-to-end framework for multilingual speech recognition
Nigar et al. An intelligent framework based on deep learning for online Quran learning during pandemic
US12073825B2 (en) Method and apparatus for speech recognition
US20250279089A1 (en) Using Synthetic Data to Improve Word Error Rate of Differentially Private ASR Models
US20250217638A1 (en) Methods and systems for speech emotion retrieval via natural language prompts
Santos et al. Automatic Speech Recognition: Comparisons Between Convolutional Neural Networks, Hidden Markov Model and Hybrid Architecture
WO2025099939A1 (en) Learning device, generation device, learning method, generation method, and program
Hsu et al. Niesr: Nuisance invariant end-to-end speech recognition
Baranwal et al. Improved Mispronunciation detection system using a hybrid CTC-ATT based approach for L2 English speakers
Kim End-to-end speech recognition on conversations
Soundarya et al. Analysis of mispronunciation detection and diagnosis based on conventional deep learning techniques
Tornay et al. Subunits inference and lexicon development based on pairwise comparison of utterances and signs
Rudrappa et al. KHiTE: Multilingual Speech Acquisition to Monolingual Text Translation
Nagano et al. Unsupervised phoneme and word acquisition from continuous speech based on a hierarchical probabilistic generative model

Legal Events

Date Code Title Description
121 Ep: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23958338

Country of ref document: EP

Kind code of ref document: A1