
WO2024127472A1 - Emotion recognition learning method, emotion recognition method, emotion recognition learning device, emotion recognition device, and program - Google Patents

Emotion recognition learning method, emotion recognition method, emotion recognition learning device, emotion recognition device, and program Download PDF

Info

Publication number
WO2024127472A1
WO2024127472A1 (PCT application PCT/JP2022/045706)
Authority
WO
WIPO (PCT)
Prior art keywords
emotion
utterance
emotion recognition
input
learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/JP2022/045706
Other languages
French (fr)
Japanese (ja)
Inventor
厚志 安藤
有実子 村田
岳至 森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NTT Inc
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp filed Critical Nippon Telegraph and Telephone Corp
Priority to JP2024563791A priority Critical patent/JPWO2024127472A1/ja
Priority to PCT/JP2022/045706 priority patent/WO2024127472A1/en
Publication of WO2024127472A1 publication Critical patent/WO2024127472A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/10: Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L15/16: Speech classification or search using artificial neural networks
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Definitions

  • The present invention relates to an emotion recognition learning method, an emotion recognition method, an emotion recognition learning device, an emotion recognition device, and a program.
  • Recognizing a speaker's emotions from speech is an important technology. For example, by recognizing a speaker's emotions during counseling, it is possible to visualize the patient's feelings of anxiety or sadness, which is expected to deepen the counselor's understanding and improve the quality of guidance. Furthermore, by recognizing human emotions in human-machine dialogue, it is possible to build a more friendly dialogue system that can share in the human's happiness if the human is happy, or encourage the human if he or she is sad.
  • Hereinafter, the technology that takes an utterance as input and estimates which emotion class (e.g., neutral, anger, joy, sadness) the speaker's emotion contained in the utterance belongs to is referred to as "emotion recognition".
  • Patent Document 1 and Non-Patent Document 1 propose a technology (hereinafter referred to as the "conventional technology") for improving the accuracy of emotion recognition by using speech spoken with a "normal" emotion (an emotion that is neither a positive emotion such as joy nor a negative emotion such as anger or sadness, but the ordinary emotion of a normal state) (hereinafter referred to as "normal emotion speech").
  • Figure 1 is a diagram for explaining the conventional technology.
  • In the conventional technology, in addition to the input utterance to be recognized, a normal emotional utterance of the same speaker as the input utterance is required at estimation time, and the emotion recognition result is obtained by inputting both into an emotion recognition model.
  • Inside the emotion recognition model, an "emotion expression vector extraction block" is first applied to each of the input utterance and the normal emotional utterance to extract an emotion expression vector that represents the nature of the emotional expression of the entire utterance. Then, based on the emotion expression vectors of the input utterance and the normal emotional utterance, an "emotion estimation block for the input utterance using the emotion expression vector of the normal emotional utterance" is used to estimate the emotion of the input utterance.
  • A statistical model based on deep learning is used for the emotion recognition model, and each block in the emotion recognition model is trained jointly, prior to estimation, using a set of labeled data each consisting of an input utterance, a normal emotional utterance, and the correct emotion label of the input utterance (the correct value for emotion recognition of the input utterance).
  • The present invention was made in consideration of the above points, and aims to contribute to improving the accuracy of recognizing a speaker's emotions from speech.
  • To that end, a computer executes a learning procedure for learning the model based on training data including, for multiple speakers, multiple input utterances by a speaker, multiple normal emotion utterances corresponding to each of the input utterances, and correct emotion labels corresponding to each of the input utterances, and based on a first loss function for minimizing an error in the emotion recognition result output by a model to which any of the input utterances and the normal emotion utterance corresponding to that input utterance are input, relative to the correct emotion label corresponding to that input utterance, and a second loss function for making constant for each speaker a vector representing the nature of the emotional expression calculated for the normal emotion utterance in the process in which the model outputs the recognition result.
  • FIG. 1 is a diagram for explaining a conventional technique.
  • FIG. 2 is a diagram illustrating an example of a hardware configuration of an emotion recognition device 10 according to an embodiment of the present invention.
  • FIG. 3 is a diagram illustrating an example of a functional configuration during learning of the emotion recognition device 10 according to an embodiment of the present invention.
  • FIG. 4 is a diagram showing an example of a configuration of an emotion recognition model m1 according to an embodiment of the present invention.
  • FIG. 5 is a diagram for explaining an outline of a learning method for the emotion recognition model m1.
  • FIG. 6 is a diagram illustrating an example of a functional configuration of the emotion recognition device 10 according to the embodiment of the present invention when recognizing emotions.
  • FIG. 7 is a flowchart illustrating an example of a process performed by the emotion recognition device 10 during learning.
  • FIG. 8 is a diagram for explaining calculation of an average value of emotion expression vectors of normal emotion utterances for each speaker.
  • FIG. 9 is a diagram for explaining calculation of the distance S_ji,k between each emotion expression vector e_ji and each speaker average c_k.
  • In the present embodiment, when training the emotion recognition model, in addition to the loss function used in the conventional technology for minimizing the error between the correct emotion label and the emotion recognition result, a new loss function (a speaker identity loss function for normal emotion utterances) is introduced so that the emotion expression vector of a normal emotion utterance takes the same vector value whenever the speaker is the same (in other words, even if the normal emotion utterance changes, it shows a constant vector value as long as the speaker is the same).
  • This optimizes the emotion expression vector extraction block so that a constant emotion expression vector is obtained for different normal emotion utterances of the same speaker, solving the problem of specialization to particular combinations of input utterance and normal emotion utterance.
  • FIG. 2 is a diagram showing an example of the hardware configuration of an emotion recognition device 10 according to an embodiment of the present invention.
  • the emotion recognition device 10 in FIG. 2 includes a drive device 100, an auxiliary storage device 102, a memory device 103, a processor 104, and an interface device 105, all of which are interconnected via a bus B.
  • the program that realizes the processing in the emotion recognition device 10 is provided by a recording medium 101 such as a CD-ROM.
  • the program is installed from the recording medium 101 via the drive device 100 into the auxiliary storage device 102.
  • the program does not necessarily have to be installed from the recording medium 101, but may be downloaded from another computer via a network.
  • the auxiliary storage device 102 stores the installed program as well as necessary files, data, etc.
  • When an instruction to start a program is received, the memory device 103 reads out the program from the auxiliary storage device 102 and stores it.
  • the processor 104 is a CPU or a GPU (Graphics Processing Unit), or a CPU and a GPU, and executes functions related to the emotion recognition device 10 in accordance with the program stored in the memory device 103.
  • the interface device 105 is used as an interface for connecting to a network.
  • FIG. 3 is a diagram showing an example of the functional configuration of the emotion recognition device 10 during learning in an embodiment of the present invention.
  • the emotion recognition device 10 (emotion recognition learning device) during learning has two acoustic feature extraction units 11 (acoustic feature extraction unit 11-1, acoustic feature extraction unit 11-2) and one learning unit 12. These units are realized by processing in which one or more programs installed in the emotion recognition device 10 are executed by the processor 104.
  • the emotion recognition device 10 also uses a learning data storage unit 121.
  • the learning data storage unit 121 can be realized, for example, using the auxiliary storage device 102, or a storage device that can be connected to the emotion recognition device 10 via a network.
  • The learning data storage unit 121 stores in advance a large amount of learning data, each item of which is a set of an input utterance, the correct emotion label of the input utterance, a normal emotion utterance of the same speaker as the input utterance, and a speaker label that identifies the speaker of the normal emotion utterance.
  • the input utterance and the normal emotion utterance of each learning data are different utterances.
  • the correct emotion label refers to a correct value of emotion recognition of the input utterance.
  • Emotion recognition refers to estimating, using a certain utterance as input, which emotion class (e.g., normal, anger, joy, sadness, etc.) the speaker's emotion contained in the utterance belongs to.
  • The normal emotion utterance refers to an utterance spoken with a "normal" emotion.
  • "utterance” refers to a voice (voice data) generated by an action that expresses a linguistic expression.
  • the input utterance and the normal emotion utterance of each learning data have different emotions and utterance contents (text to be spoken).
  • the learning data storage unit 121 stores such learning data for multiple speakers.
  • The contents of speech by each speaker may be different or the same. It is also preferable that multiple pieces of training data with different utterance contents are included for the same speaker. This avoids the possibility that, if each speaker always spoke fixed content, an emotion recognition model might be constructed that returns a specific emotion recognition result from the utterance contents (phonological bias). In this case, it is also preferable that the normal emotion utterances differ across the training data of the same speaker.
  • the training data storage unit 121 stores, as training data for multiple speakers, multiple input utterances by the speaker, multiple normal emotion utterances corresponding to each input utterance, and correct emotion labels corresponding to each input utterance.
  • the acoustic feature extraction unit 11 receives an utterance (audio data of the utterance) as input, extracts an acoustic feature sequence from the utterance, and outputs the acoustic feature sequence.
  • the acoustic feature extraction unit 11-1 extracts an acoustic feature sequence from the input utterance of each training data
  • the acoustic feature extraction unit 11-2 extracts an acoustic feature sequence from normal emotion utterance of each training data.
  • An acoustic feature series is data in which an input utterance is divided into short-time windows, acoustic features are obtained for each short-time window, and the acoustic feature vectors are arranged in chronological order.
  • the acoustic features include one or more of the power spectrum, Mel filter bank output, MFCC, fundamental frequency, logarithmic power, HNR (Harmonics-to-Noise Ratio), speech probability, number of zero-crossings, and their first or second derivatives.
  • the speech probability is obtained, for example, from the likelihood ratio of a pre-trained speech/non-speech GMM model.
  • HNR can be calculated, for example, using a cepstrum-based method (Peter Murphy, Olatunji Akande, "Cepstrum-Based Harmonics-to-Noise Ratio Measurement in Voiced Speech", Lecture Notes in Artificial Intelligence, Nonlinear Speech Modeling and Applications, Vol. 3445, Springer-Verlag, 2005).
  • The learning unit 12 takes, for each piece of learning data, the acoustic feature sequences output from the acoustic feature extraction units 11 for the input utterance and the normal emotional utterance, together with the correct emotion label and speaker label of that learning data, inputs the two acoustic feature sequences into an emotion recognition model based on comparison with the normal emotional utterance (hereinafter simply referred to as "emotion recognition model m1"), and trains the emotion recognition model m1 using the correct emotion label and speaker label of the input utterance as teacher data.
  • FIG. 4 is a diagram showing an example of the configuration of emotion recognition model m1 in an embodiment of the present invention.
  • emotion recognition model m1 includes two emotion expression vector extraction blocks m11 (emotion expression vector extraction block m11-1, emotion expression vector extraction block m11-2) and one emotion probability estimation block m12.
  • the emotional expression vector extraction block m11 converts the acoustic feature sequence into a fixed-length vector (hereinafter referred to as "emotion expression vector") that represents the nature (or features) of the emotional expression of the entire utterance during the process in which the emotion recognition model m1 recognizes the speaker's emotion.
  • the emotional expression vector extraction block m11 uses a deep learning model (for example, a model composed of a Transformer and Self Attentive Pooling) that extracts a fixed-length expression from an input vector sequence of any length.
  • the emotional expression vector extraction block m11-1 converts the acoustic feature sequence extracted from the input utterance into an emotional expression vector.
  • the emotional expression vector extraction block m11-2 converts the acoustic feature sequence extracted from normal emotional utterance into an emotional expression vector.
  • the emotion probability estimation block m12 uses a deep learning model (e.g., a model with one or more layers of fully connected layers and activation functions) that projects a vector representing the posterior probability of each emotion (a vector in which each dimension indicates the posterior probability of a different emotion) from a fixed-length vector (emotion expression vector).
  • As shown in FIG. 5, the learning unit 12 updates the parameters of all blocks based on the weighted sum of the speaker identity loss function L_p for normal emotion utterances and the loss function L_a based on the error between the output of the emotion recognition model m1 and the correct emotion label. The learning unit 12 updates the model parameters using stochastic gradient descent, as in the conventional technology. After updating the model parameters a certain number of times, the learning unit 12 outputs the finally obtained emotion recognition model m1.
  • FIG. 6 is a diagram showing an example of the functional configuration of emotion recognition device 10 during emotion recognition in an embodiment of the present invention.
  • the same parts as in FIG. 3 are given the same reference numerals, and their explanations are omitted.
  • the emotion recognition device 10 when recognizing emotions, has an emotion recognition unit 13 instead of a learning unit 12.
  • the emotion recognition unit 13 is realized by a process in which one or more programs installed in the emotion recognition device 10 are executed by the processor 104. Furthermore, when recognizing emotions, the emotion recognition device 10 does not use the learning data storage unit 121. Note that different computers may be used for learning and emotion recognition.
  • The emotion recognition unit 13 receives as input an acoustic feature sequence extracted by the acoustic feature extraction unit 11-1 from the input utterance of the person whose emotion is to be recognized and an acoustic feature sequence extracted by the acoustic feature extraction unit 11-2 from that person's normal emotional utterance, and outputs an emotion recognition result based on comparison with the normal emotional utterance (hereinafter simply referred to as the "emotion recognition result") by forward-propagating the two acoustic feature sequences through the trained emotion recognition model m1.
  • The emotion recognition result, which is the output of the emotion recognition unit 13, includes the posterior probability vector of each emotion (the output of the forward propagation of the emotion recognition model m1) and the emotion class with the maximum posterior probability in that vector.
  • The emotion class with the maximum posterior probability is used as the final emotion recognition result.
  • FIG. 7 is a flowchart illustrating an example of a process performed by the emotion recognition device 10 during learning.
  • In step S101, the learning unit 12 randomly selects N pieces of learning data from the group of learning data stored in the learning data storage unit 121 to obtain one mini-batch (hereinafter referred to as the "target mini-batch"). More precisely, the learning unit 12 randomly reorders the learning data and then generates the mini-batch.
  • the emotion recognition device 10 then executes a loop process including steps S102 to S104 for each of the N pieces of training data included in the target mini-batch.
  • the training data being processed in this loop process is hereinafter referred to as the "target training data.”
  • In step S102, the acoustic feature extraction unit 11-1 extracts an acoustic feature sequence from the input utterance of the target learning data, and the acoustic feature extraction unit 11-2 extracts an acoustic feature sequence from the normal emotion utterance of the target learning data.
  • The learning unit 12 then inputs the two extracted acoustic feature sequences into the emotion recognition model m1 and acquires the recognition result (the posterior probability vector of each emotion) and the emotion expression vector of the normal emotion utterance output by the model (S103). That is, in the process in which the emotion recognition model m1 outputs the recognition result, the emotion expression vector extraction block m11-2 calculates the emotion expression vector of the normal emotion utterance, and the learning unit 12 acquires this emotion expression vector. The learning unit 12 stores the acquired emotion expression vector in association with the speaker label of the target learning data.
  • Next, the learning unit 12 calculates the loss function L_a (the cross-entropy between the correct emotion label of the input utterance and the posterior probability vector of each emotion) from the correct emotion label of the target learning data and the recognition result, thereby obtaining a loss based on the error of the recognition result with respect to the correct emotion label (hereinafter referred to as "loss L_a") (S104).
  • Once steps S102 to S104 have been executed for all learning data in the target mini-batch, the learning unit 12 calculates, for each speaker, the average value of the emotion expression vectors (of the normal emotion utterances) obtained for the learning data in the target mini-batch in step S103 (S105).
  • Fig. 8 is a diagram for explaining calculation of the average value of emotion expression vectors of normal emotion utterances for each speaker.
  • In FIG. 8, e_ji denotes the emotion expression vector of the i-th normal emotion utterance of speaker j in the target mini-batch.
  • the speaker of a certain emotion expression vector can be identified based on the speaker label associated with the emotion expression vector in step S103.
  • Fig. 8 shows an example in which emotion expression vectors of normal emotion utterances are obtained for each speaker whose speaker label is 1, 2, or 3.
  • the learning unit 12 calculates the average value of each set of emotion expression vectors associated with the same (common) speaker label.
  • The average value for each speaker is referred to as the "speaker average c_k" (k is the speaker label).
  • In the example of FIG. 8, the speaker labels are 1, 2, and 3, so the speaker averages c_1, c_2, and c_3 are calculated.
  • Next, the learning unit 12 calculates the distance S_ji,k between each emotion expression vector e_ji and each speaker average c_k (S106).
  • FIG. 9 is a diagram for explaining the calculation of the distance S_ji,k between each emotion expression vector e_ji and each speaker average c_k.
  • In FIG. 9, each row corresponds to an e_ji and each column corresponds to a c_k.
  • Each element S_ji,k of the matrix corresponds to the distance between e_ji and c_k.
  • The learning unit 12 calculates this distance using, for example, a formula with learnable parameters.
  • w and b are learnable parameters in this formula, and are updated based on the loss function simultaneously with the parameters of each block (each model) constituting the emotion recognition model m1.
  • The learning unit 12 then calculates a loss based on these distances (hereinafter referred to as "loss L_p") using the loss function L_p (S107).
  • The loss function L_p is a loss function for making the emotion expression vector calculated for normal emotion utterances constant for each speaker in the process in which the emotion recognition model m1 outputs the recognition result (that is, for minimizing the within-speaker variation of this vector).
  • The learning unit 12 then calculates the weighted sum of the loss L_p based on the loss function L_p and the loss L_a based on the loss function L_a as the loss L of the entire emotion recognition model m1 (S108).
  • α is a hyperparameter (weighting coefficient) for adjusting the priority between the role of L_a in "reducing emotion recognition errors" and the role of L_p in "making the normal emotion vectors of the same speaker take similar values."
  • N is the number of training data in the target mini-batch; that is, Σ(L_a/N) is the average value of L_a over the target mini-batch.
  • The learning unit 12 simultaneously updates the parameters of each block in the emotion recognition model m1 and of the loss function L_p using the backpropagation algorithm based on L (S109). By repeating steps S101 to S109, the parameters of the blocks are jointly optimized.
  • the emotion recognition device 10 receives as input an utterance of a speaker to be subjected to emotion recognition and the speaker's normal emotional utterance.
  • the acoustic feature extraction unit 11-1 extracts an acoustic feature series from the utterance, and the acoustic feature extraction unit 11-2 extracts an acoustic feature series from the normal emotional utterance.
  • the emotion recognition unit 13 inputs these two acoustic feature series to a trained emotion recognition model m1, and recognizes the speaker's emotion based on the probability for each emotion label output by the emotion recognition model m1.
  • Because of the weighted contribution of L_p to L, the emotion expression vector extraction block m11-2 is trained so as to output similar (small-distance) emotion expression vectors for multiple normal emotion utterances by the same speaker.
  • This alleviates the problem of the conventional technology that there is no guarantee that normal emotion utterances by the same speaker will have similar emotion expression vectors (so that the recognition result often changes when the normal emotion utterance changes).
  • Note that the loss function L_a is an example of a first loss function, and the loss function L_p is an example of a second loss function.
  • 10 Emotion recognition device; 11-1 Acoustic feature extraction unit; 11-2 Acoustic feature extraction unit; 12 Learning unit; 13 Emotion recognition unit; 100 Drive device; 101 Recording medium; 102 Auxiliary storage device; 103 Memory device; 104 Processor; 105 Interface device; 121 Learning data storage unit; B Bus; m1 Emotion recognition model; m11 Emotion expression vector extraction block; m11-1 Emotion expression vector extraction block; m11-2 Emotion expression vector extraction block; m12 Emotion probability estimation block

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The present invention involves: learning data that, for a plurality of speakers, includes a plurality of ways of input utterances by the speakers, a plurality of ways of ordinary-state emotion utterances corresponding to the input utterances, and correct-answer labels for emotion corresponding to the input utterances; and a model that receives inputs of an input utterance and a corresponding ordinary-state emotion utterance to output an emotion recognition result. The invention causes a computer to execute a learning procedure for training the model on the basis of a first loss function for minimizing a difference (error) between a correct-answer label corresponding to the input utterance input to the model and the recognition result output by the model on the basis of the learning data, and a second loss function for achieving, for every speaker, a constant vector representing the quality of emotional expression calculated for the ordinary-state emotion utterance in the course of the model outputting the recognition result. In this way, the invention contributes to an increase in the accuracy of recognizing the emotion of a speaker from an utterance.

Description

Emotion recognition learning method, emotion recognition method, emotion recognition learning device, emotion recognition device, and program

The present invention relates to an emotion recognition learning method, an emotion recognition method, an emotion recognition learning device, an emotion recognition device, and a program.

Recognizing a speaker's emotions from speech is an important technology. For example, by recognizing a speaker's emotions during counseling, it is possible to visualize the patient's feelings of anxiety or sadness, which is expected to deepen the counselor's understanding and improve the quality of guidance. Furthermore, by recognizing human emotions in human-machine dialogue, it is possible to build a more friendly dialogue system that can share in the human's happiness if the human is happy, or encourage the human if he or she is sad. Hereafter, the technology that takes an utterance as input and estimates which emotion class (e.g., neutral, anger, joy, sadness) the speaker's emotion contained in the utterance belongs to will be referred to as "emotion recognition".

Patent Document 1 and Non-Patent Document 1 propose a technology (hereinafter referred to as the "conventional technology") for improving the accuracy of emotion recognition by using speech spoken with a "normal" emotion (an emotion that is neither a positive emotion such as joy nor a negative emotion such as anger or sadness, but the ordinary emotion of a normal state) (hereinafter referred to as "normal emotion speech"). The conventional technology is based on the hypothesis that if you know how a person normally speaks (i.e., their normal emotion speech), the accuracy of emotion recognition for that person will improve.

FIG. 1 is a diagram for explaining the conventional technology. In the conventional technology, in addition to the input utterance to be recognized, a normal emotional utterance of the same speaker as the input utterance is required at estimation time, and the emotion recognition result is obtained by inputting these into an emotion recognition model. Inside the emotion recognition model, an "emotion expression vector extraction block" is first applied to each of the input utterance and the normal emotional utterance to extract an emotion expression vector that represents the nature of the emotional expression of the entire utterance. Then, based on the emotion expression vectors of the input utterance and the normal emotional utterance, an "emotion estimation block for the input utterance using the emotion expression vector of the normal emotional utterance" is used to estimate the emotion of the input utterance. A statistical model based on deep learning is used for the emotion recognition model, and each block in the emotion recognition model is trained jointly, prior to estimation, using a set of labeled data each consisting of an input utterance, a normal emotional utterance, and the correct emotion label of the input utterance (the correct value for emotion recognition of the input utterance).

International Publication No. 2021/171552

Andreas Triantafyllopoulos, Shuo Liu, Bjoern W. Schuller, "DEEP SPEAKER CONDITIONING FOR SPEECH EMOTION RECOGNITION", Proc. of ICME, pp. 1-6, 2021

In the conventional technology, there is a problem that the recognition result changes when a different normal emotion utterance is used during recognition (for example, when input utterance X and normal emotion utterance A are used, the recognition result of input utterance X may be "joy", whereas when X and normal emotion utterance B are used, the recognition result of input utterance X may be "sadness"). This is because a set consisting of the input utterance, a normal emotion utterance, and the correct emotion label of the input utterance is used when training the emotion recognition model, so each block in the emotion recognition model is optimized to output recognition results specialized to the particular combination of input utterance and normal emotion utterance. In the conventional technology, combinations with various normal emotion utterances are included in the training data in order to prevent over-specialization to particular combinations, but this does not explicitly reduce the specialization and therefore does not adequately address the problem.

The present invention was made in consideration of the above points, and aims to contribute to improving the accuracy of recognizing a speaker's emotions from speech.

In order to solve the above problem, a computer executes a learning procedure for learning the model based on training data including, for multiple speakers, multiple input utterances by a speaker, multiple normal emotion utterances corresponding to each of the input utterances, and correct emotion labels corresponding to each of the input utterances, and based on a first loss function for minimizing an error in the emotion recognition result output by a model to which any of the input utterances and the normal emotion utterance corresponding to that input utterance are input, relative to the correct emotion label corresponding to that input utterance, and a second loss function for making constant for each speaker a vector representing the nature of the emotional expression calculated for the normal emotion utterance in the process in which the model outputs the recognition result.

This can contribute to improving the accuracy of recognizing a speaker's emotions from speech.

FIG. 1 is a diagram for explaining a conventional technique.
FIG. 2 is a diagram illustrating an example of a hardware configuration of an emotion recognition device 10 according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating an example of a functional configuration of the emotion recognition device 10 during learning.
FIG. 4 is a diagram showing an example of a configuration of an emotion recognition model m1.
FIG. 5 is a diagram for explaining an outline of a learning method for the emotion recognition model m1.
FIG. 6 is a diagram illustrating an example of a functional configuration of the emotion recognition device 10 during emotion recognition.
FIG. 7 is a flowchart illustrating an example of a process performed by the emotion recognition device 10 during learning.
FIG. 8 is a diagram for explaining calculation of the average value of emotion expression vectors of normal emotion utterances for each speaker.
FIG. 9 is a diagram for explaining calculation of the distance S_ji,k between each emotion expression vector e_ji and each speaker average c_k.

In the present embodiment, when training an emotion recognition model that uses normal emotion utterances (utterances spoken with the ordinary emotion of a normal state, which is neither a positive emotion such as joy nor a negative emotion such as anger or sadness), a new loss function (a speaker identity loss function for normal emotion utterances) is introduced in addition to the loss function used in the conventional method for minimizing the error between the correct emotion label and the emotion recognition result. The new loss function is designed so that the emotion expression vector of a normal emotion utterance takes the same vector value whenever the speaker is the same (in other words, even if the normal emotion utterance changes, it shows a constant vector value as long as the speaker is the same). This optimizes the emotion expression vector extraction block so that a constant emotion expression vector is obtained for different normal emotion utterances of the same speaker, solving the problem of specialization to particular combinations of input utterance and normal emotion utterance.

The following describes an embodiment of the present invention with reference to the drawings.

FIG. 2 is a diagram showing an example of the hardware configuration of an emotion recognition device 10 according to an embodiment of the present invention. The emotion recognition device 10 in FIG. 2 includes a drive device 100, an auxiliary storage device 102, a memory device 103, a processor 104, and an interface device 105, all of which are interconnected via a bus B.

The program that realizes the processing in the emotion recognition device 10 is provided by a recording medium 101 such as a CD-ROM. When the recording medium 101 storing the program is set in the drive device 100, the program is installed from the recording medium 101 via the drive device 100 into the auxiliary storage device 102. However, the program does not necessarily have to be installed from the recording medium 101, and may instead be downloaded from another computer via a network. The auxiliary storage device 102 stores the installed program as well as necessary files, data, and the like.

When an instruction to start a program is received, the memory device 103 reads out the program from the auxiliary storage device 102 and stores it. The processor 104 is a CPU or a GPU (Graphics Processing Unit), or a CPU and a GPU, and executes functions related to the emotion recognition device 10 in accordance with the program stored in the memory device 103. The interface device 105 is used as an interface for connecting to a network.

FIG. 3 is a diagram showing an example of the functional configuration of the emotion recognition device 10 during learning in an embodiment of the present invention. As shown in FIG. 3, the emotion recognition device 10 during learning (the emotion recognition learning device) has two acoustic feature extraction units 11 (acoustic feature extraction units 11-1 and 11-2) and one learning unit 12. These units are realized by processing in which one or more programs installed in the emotion recognition device 10 are executed by the processor 104. The emotion recognition device 10 also uses a learning data storage unit 121. The learning data storage unit 121 can be realized using, for example, the auxiliary storage device 102 or a storage device that can be connected to the emotion recognition device 10 via a network.

[Learning Data Storage Unit 121]
The learning data storage unit 121 stores in advance a large amount of learning data, each item of which is a set of an input utterance, the correct emotion label of the input utterance, a normal emotion utterance of the same speaker as the input utterance, and a speaker label that identifies the speaker of the normal emotion utterance. The input utterance and the normal emotion utterance of each piece of learning data are assumed to be different utterances. The correct emotion label is the correct value of emotion recognition for the input utterance. Emotion recognition means estimating, with a certain utterance as input, which emotion class (e.g., normal, anger, joy, sadness) the speaker's emotion contained in that utterance belongs to. A normal emotion utterance is an utterance spoken with a "normal" emotion. An "utterance" is a voice (voice data) generated by an action that expresses a linguistic expression. The input utterance and the normal emotion utterance of each piece of learning data differ in emotion and in utterance content (the text being spoken). The learning data storage unit 121 stores such learning data for multiple speakers. The contents of speech by each speaker may be different or the same. It is also preferable that multiple pieces of training data with different utterance contents are included for the same speaker. This avoids the possibility that, if each speaker always spoke fixed content, an emotion recognition model might be constructed that returns a specific emotion recognition result from the utterance contents (phonological bias). In this case, it is also preferable that the normal emotion utterances differ across the training data of the same speaker. Therefore, it can be said that the learning data storage unit 121 stores in advance, as training data for multiple speakers, multiple input utterances by each speaker, multiple normal emotion utterances corresponding to each input utterance, and correct emotion labels corresponding to each input utterance.
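As a concrete illustration, one such learning-data record could be represented as sketched below. This is a minimal sketch only; the class and field names are illustrative assumptions, not a data format defined by this publication.

```python
from dataclasses import dataclass

@dataclass
class LearningDataRecord:
    """One learning-data record as described above (field names are illustrative)."""
    input_utterance_wav: str    # path to the input utterance to be recognized
    correct_emotion_label: int  # correct emotion class of the input utterance (e.g. 0=normal, 1=anger, ...)
    normal_emotion_wav: str     # a different, normal-emotion utterance by the same speaker
    speaker_label: int          # identifier of the speaker of the normal-emotion utterance
```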

[Acoustic Feature Extraction Unit 11]
The acoustic feature extraction unit 11 receives an utterance (its audio data) as input, extracts an acoustic feature sequence from the utterance, and outputs the acoustic feature sequence. During learning, the acoustic feature extraction unit 11-1 extracts an acoustic feature sequence from the input utterance of each piece of learning data, and the acoustic feature extraction unit 11-2 extracts an acoustic feature sequence from the normal emotion utterance of each piece of learning data.

An acoustic feature sequence is data in which the input utterance is divided into short-time windows, acoustic features are obtained for each short-time window, and the resulting acoustic feature vectors are arranged in chronological order. The acoustic features include one or more of the power spectrum, Mel filter bank outputs, MFCCs, fundamental frequency, logarithmic power, HNR (Harmonics-to-Noise Ratio), speech probability, number of zero crossings, and their first or second derivatives. The speech probability can be obtained, for example, from the likelihood ratio of a pre-trained speech/non-speech GMM model. The HNR can be calculated, for example, using a cepstrum-based method (Peter Murphy, Olatunji Akande, "Cepstrum-Based Harmonics-to-Noise Ratio Measurement in Voiced Speech", Lecture Notes in Artificial Intelligence, Nonlinear Speech Modeling and Applications, Vol. 3445, Springer-Verlag, 2005). Using more acoustic features makes it possible to express more of the various characteristics contained in speech, which tends to improve emotion recognition accuracy.
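A minimal sketch of extracting such an acoustic feature sequence is shown below, assuming the librosa library and only an illustrative subset of the features listed above (log-Mel filter bank outputs, MFCCs, zero-crossing rate, and their first derivatives); the window settings and the exact feature set of the embodiment are assumptions, not values specified by this publication.

```python
import numpy as np
import librosa

def extract_acoustic_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Return an acoustic feature sequence of shape (num_frames, feature_dim)."""
    y, sr = librosa.load(wav_path, sr=sr)

    # Short-time windows: 25 ms frame length, 10 ms hop (illustrative values).
    n_fft, hop = int(0.025 * sr), int(0.010 * sr)

    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=40)
    log_mel = librosa.power_to_db(mel)                                # (40, T) log-Mel filter bank
    mfcc = librosa.feature.mfcc(S=log_mel, n_mfcc=13)                 # (13, T) MFCCs
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=n_fft,
                                             hop_length=hop)         # (1, T) zero crossings

    feats = np.vstack([log_mel, mfcc, zcr])
    feats = np.vstack([feats, librosa.feature.delta(feats)])          # append first derivatives
    return feats.T                                                    # (T, feature_dim)
```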

[Learning Unit 12]
The learning unit 12 takes, for each piece of learning data, the acoustic feature sequences output from the acoustic feature extraction units 11 for the input utterance and the normal emotion utterance, together with the correct emotion label and speaker label of that learning data, inputs the two acoustic feature sequences into an emotion recognition model based on comparison with the normal emotion utterance (hereinafter simply referred to as "emotion recognition model m1"), and trains the emotion recognition model m1 using the correct emotion label and the speaker label of the input utterance as teacher data.

FIG. 4 is a diagram showing an example of the configuration of the emotion recognition model m1 in an embodiment of the present invention. As shown in FIG. 4, the emotion recognition model m1 includes two emotion expression vector extraction blocks m11 (emotion expression vector extraction blocks m11-1 and m11-2) and one emotion probability estimation block m12.

The emotion expression vector extraction block m11 converts an acoustic feature sequence into a fixed-length vector representing the nature (or features) of the emotional expression of the entire utterance (hereinafter referred to as an "emotion expression vector") during the process in which the emotion recognition model m1 recognizes the speaker's emotion. The emotion expression vector extraction block m11 uses a deep learning model that extracts a fixed-length representation from an input vector sequence of arbitrary length (for example, a model composed of a Transformer and Self-Attentive Pooling). In FIG. 4, the emotion expression vector extraction block m11-1 converts the acoustic feature sequence extracted from the input utterance into an emotion expression vector, and the emotion expression vector extraction block m11-2 converts the acoustic feature sequence extracted from the normal emotion utterance into an emotion expression vector.

The emotion probability estimation block m12 uses a deep learning model that projects a fixed-length vector (an emotion expression vector) onto a vector representing the posterior probability of each emotion (a vector in which each dimension indicates the posterior probability of a different emotion), for example a model in which one or more fully connected layers and activation functions are stacked.
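The sketch below is a minimal PyTorch rendering of this structure: two emotion expression vector extraction blocks (a Transformer encoder followed by simplified self-attentive pooling) and an emotion probability estimation block built from fully connected layers. The layer sizes, the ReLU activation, and the simple concatenation of the two emotion expression vectors inside the estimation block are illustrative assumptions; the publication does not specify them here. The forward pass also returns the emotion expression vector of the normal emotion utterance, since the learning procedure uses it for the speaker identity loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionExpressionVectorExtractor(nn.Module):
    """Emotion expression vector extraction block m11:
    variable-length acoustic feature sequence -> fixed-length vector."""
    def __init__(self, feat_dim: int, d_model: int = 256, n_layers: int = 4, n_heads: int = 4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                               dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.attention = nn.Linear(d_model, 1)    # simplified self-attentive pooling weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (batch, frames, feat_dim)
        h = self.encoder(self.proj(x))                      # (batch, frames, d_model)
        w = torch.softmax(self.attention(h), dim=1)         # (batch, frames, 1)
        return (w * h).sum(dim=1)                           # (batch, d_model)

class EmotionRecognitionModel(nn.Module):
    """Emotion recognition model m1: blocks m11-1, m11-2 and m12 (illustrative layout)."""
    def __init__(self, feat_dim: int, n_emotions: int = 4, d_model: int = 256):
        super().__init__()
        self.extract_input = EmotionExpressionVectorExtractor(feat_dim, d_model)    # m11-1
        self.extract_normal = EmotionExpressionVectorExtractor(feat_dim, d_model)   # m11-2
        self.estimator = nn.Sequential(                                             # m12
            nn.Linear(2 * d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, n_emotions),
        )

    def forward(self, input_feats, normal_feats):
        e_input = self.extract_input(input_feats)
        e_normal = self.extract_normal(normal_feats)
        logits = self.estimator(torch.cat([e_input, e_normal], dim=-1))
        posterior = F.softmax(logits, dim=-1)   # posterior probability vector of each emotion
        return posterior, e_normal
```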

As shown in FIG. 5, the learning unit 12 updates the model parameters of all blocks based on a loss function L (L = αL_a + (1 - α)L_p) obtained as the weighted sum of the speaker identity loss function L_p for normal emotion utterances and the loss function L_a based on the error between the output of the emotion recognition model m1 and the correct emotion label (a loss function for minimizing the error between the correct emotion label of the input utterance and the emotion recognition result). Here, α is a manually determined weighting coefficient. The learning unit 12 updates the model parameters using stochastic gradient descent, as in the conventional technology. After updating the model parameters a certain number of times, the learning unit 12 outputs the finally obtained emotion recognition model m1.
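Written out for one mini-batch of N training examples, and consistent with the weighted sum above and with the mini-batch averaging of L_a described in the flowchart below, the loss can be expressed as follows, where L_a is the cross-entropy between the correct emotion label and the estimated posteriors:

```latex
L = \alpha \cdot \frac{1}{N} \sum_{n=1}^{N} L_a^{(n)} + (1 - \alpha)\, L_p ,
\qquad
L_a^{(n)} = - \sum_{c} y_c^{(n)} \log p_c^{(n)}
```

Here y^(n) is the one-hot correct emotion label of the n-th input utterance, p^(n) is the posterior probability vector output by the emotion recognition model m1, and α is the manually determined weighting coefficient.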

FIG. 6 is a diagram showing an example of the functional configuration of the emotion recognition device 10 during emotion recognition in an embodiment of the present invention. In FIG. 6, the same parts as in FIG. 3 are given the same reference numerals, and their explanations are omitted.

As shown in FIG. 6, during emotion recognition the emotion recognition device 10 has an emotion recognition unit 13 instead of the learning unit 12. The emotion recognition unit 13 is realized by processing in which one or more programs installed in the emotion recognition device 10 are executed by the processor 104. Furthermore, during emotion recognition the emotion recognition device 10 does not use the learning data storage unit 121. Note that different computers may be used for learning and for emotion recognition.

[Emotion Recognition Unit 13]
The emotion recognition unit 13 receives as input an acoustic feature sequence extracted by the acoustic feature extraction unit 11-1 from the input utterance of the person whose emotion is to be recognized and an acoustic feature sequence extracted by the acoustic feature extraction unit 11-2 from that person's normal emotion utterance, and outputs an emotion recognition result based on comparison with the normal emotion utterance (hereinafter simply referred to as the "emotion recognition result") by forward-propagating the two acoustic feature sequences through the trained emotion recognition model m1. The emotion recognition result, which is the output of the emotion recognition unit 13, includes the posterior probability vector of each emotion (the output of the forward propagation of the emotion recognition model m1) and the emotion class with the maximum posterior probability in that vector. The emotion class with the maximum posterior probability is used as the final emotion recognition result.
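A minimal sketch of this recognition step is shown below. It reuses the illustrative extract_acoustic_features function and model class from the earlier sketches, and the emotion class names are assumptions.

```python
import torch

def recognize_emotion(model, input_wav: str, normal_wav: str,
                      emotion_names=("normal", "anger", "joy", "sadness")):
    """Forward-propagate the two acoustic feature sequences through the trained
    model m1 and return the posterior vector and the maximum-posterior class."""
    model.eval()
    x_in = torch.tensor(extract_acoustic_features(input_wav), dtype=torch.float32).unsqueeze(0)
    x_norm = torch.tensor(extract_acoustic_features(normal_wav), dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():
        posterior, _ = model(x_in, x_norm)          # (1, n_emotions)
    best = int(posterior.argmax(dim=-1))            # emotion class with maximum posterior
    return posterior.squeeze(0).tolist(), emotion_names[best]
```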

The processing procedures executed by the emotion recognition device 10 are described below.

[During Learning]
FIG. 7 is a flowchart illustrating an example of the processing procedure that the emotion recognition device 10 executes during learning.

In step S101, the learning unit 12 randomly selects N pieces of learning data from the group of learning data stored in the learning data storage unit 121 to obtain one mini-batch (hereinafter referred to as the "target mini-batch"). More precisely, the learning unit 12 randomly reorders the learning data and then generates the mini-batch.
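A minimal sketch of this mini-batch construction, assuming the learning-data records are held in a Python list:

```python
import random

def make_minibatches(learning_data: list, batch_size: int):
    """Randomly reorder the learning data and split it into mini-batches of N records."""
    shuffled = learning_data[:]          # copy so the stored data is left untouched
    random.shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]
```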

The emotion recognition device 10 then executes a loop process including steps S102 to S104 for each of the N pieces of learning data included in the target mini-batch. The learning data being processed in this loop is hereinafter referred to as the "target learning data".

In step S102, the acoustic feature extraction unit 11-1 extracts an acoustic feature sequence from the input utterance of the target learning data, and the acoustic feature extraction unit 11-2 extracts an acoustic feature sequence from the normal emotion utterance of the target learning data.

 Next, the learning unit 12 inputs the two extracted acoustic feature sequences into the emotion recognition model m1 and obtains the recognition result output by the model (the posterior probability vector over emotions) as well as the emotion expression vector of the normal emotion utterance (S103). That is, in the process in which the emotion recognition model m1 outputs the recognition result, the emotion expression vector extraction block m11-2 computes the emotion expression vector of the normal emotion utterance, and the learning unit 12 obtains that vector. The learning unit 12 stores the obtained emotion expression vector in association with the speaker label of the target training data.
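 The publication gives no programming-level definition of the emotion recognition model m1, so the following PyTorch sketch is only an assumption-laden illustration: the GRU encoders, the concatenation of the two emotion expression vectors, and all dimensions and the four-class emotion set are assumptions. It merely shows how a forward pass could expose both the posterior probability vector and the emotion expression vector that block m11-2 computes for the normal emotion utterance.

```python
import torch
import torch.nn as nn

class EmotionRecognitionModel(nn.Module):
    """Hypothetical sketch of model m1: two emotion expression vector extraction
    blocks (m11-1 for the input utterance, m11-2 for the normal emotion
    utterance) followed by an emotion probability estimation block (m12)."""

    def __init__(self, feat_dim=80, emb_dim=128, n_emotions=4):
        super().__init__()
        self.block_m11_1 = nn.GRU(feat_dim, emb_dim, batch_first=True)  # input utterance
        self.block_m11_2 = nn.GRU(feat_dim, emb_dim, batch_first=True)  # neutral utterance
        self.block_m12 = nn.Sequential(                                  # probability estimation
            nn.Linear(2 * emb_dim, emb_dim), nn.ReLU(),
            nn.Linear(emb_dim, n_emotions))

    def forward(self, input_feats, neutral_feats):
        _, h_in = self.block_m11_1(input_feats)       # final hidden states
        _, h_neu = self.block_m11_2(neutral_feats)
        e_in, e_neu = h_in[-1], h_neu[-1]             # emotion expression vectors
        logits = self.block_m12(torch.cat([e_in, e_neu], dim=-1))
        posteriors = torch.softmax(logits, dim=-1)    # posterior probability vector
        return posteriors, e_neu                      # also expose m11-2's vector
```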

 Next, the learning unit 12 computes the loss function L_a (the cross-entropy between the correct emotion label of the input utterance and the posterior probability vector over emotions) from the correct emotion label of the target training data and the recognition result (the posterior probability vector over emotions), thereby obtaining a loss based on the error of the recognition result with respect to the correct emotion label (hereinafter, "loss L_a") (S104).
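 As a minimal sketch of L_a for one training example (the integer label encoding and the PyTorch API choice are assumptions):

```python
import torch
import torch.nn.functional as F

def loss_a(posteriors, label_index):
    """Cross-entropy between the correct emotion label and the posterior vector.

    `posteriors` is the model's posterior probability vector for one utterance;
    `label_index` is the integer index of the correct emotion class.
    """
    # F.nll_loss expects log-probabilities, so take the log of the posteriors.
    log_probs = torch.log(posteriors + 1e-12).unsqueeze(0)
    target = torch.tensor([label_index])
    return F.nll_loss(log_probs, target)
```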

 Once steps S102 to S104 have been executed for all the training data in the target mini-batch, the learning unit 12 computes the per-speaker average of the emotion expression vectors (of the normal emotion utterances) obtained for each item of training data in the target mini-batch in step S103 (S105).

 FIG. 8 is a diagram for explaining the computation of the per-speaker average of the emotion expression vectors of normal emotion utterances. In FIG. 8, e_ji denotes the emotion expression vector of the i-th normal emotion utterance of speaker j in the target mini-batch. The speaker of a given emotion expression vector can be identified from the speaker label associated with that vector in step S103. FIG. 8 shows an example in which emotion expression vectors of normal emotion utterances have been obtained for speakers whose speaker labels are 1, 2, and 3.

 For each set of emotion expression vectors that share the same (common) speaker label, the learning unit 12 computes the average of that set. The per-speaker average is referred to below as the "speaker average c_k" (where k is the speaker label). In the example of FIG. 8, the speaker labels are 1, 2, and 3, so the speaker averages c_1, c_2, and c_3 are computed.
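 A minimal sketch of step S105 (tensor shapes are assumptions): group the collected emotion expression vectors by speaker label and average each group.

```python
from collections import defaultdict
import torch

def speaker_averages(emotion_vectors, speaker_labels):
    """Return {speaker_label: mean emotion expression vector} for one mini-batch.

    `emotion_vectors` is a list of 1-D tensors e_ji and `speaker_labels` holds
    the speaker label j stored with each vector in step S103.
    """
    groups = defaultdict(list)
    for vec, spk in zip(emotion_vectors, speaker_labels):
        groups[spk].append(vec)
    return {spk: torch.stack(vecs).mean(dim=0)   # speaker average c_k
            for spk, vecs in groups.items()}
```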

 Next, the learning unit 12 computes, for each emotion expression vector e_ji, its distance S_ji,k to each speaker average c_k (S106).

 FIG. 9 is a diagram for explaining the computation of the distance S_ji,k between each emotion expression vector e_ji and each speaker average c_k.

 FIG. 9 shows a matrix in which each row corresponds to an e_ji and each column corresponds to a c_k. Each element S_ji,k of the matrix corresponds to the distance between e_ji and c_k. The learning unit 12 computes this distance based on, for example, the following formula.

 [Math. 1] (the formula defining S_ji,k is rendered as an image in the original publication)
 Here, w and b are learnable parameters, and they are updated based on the loss function at the same time as the parameters of the blocks (models) constituting the emotion recognition model m1.
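 Because the exact formula is available only as an image in the original publication, the sketch below assumes one common choice that is consistent with the surrounding description, namely a cosine similarity scaled by the learnable parameters w and b (as in generalized end-to-end speaker-verification losses); this is an assumption, not the disclosed formula.

```python
import torch
import torch.nn as nn

class ScaledSimilarity(nn.Module):
    """Distance/similarity S_ji,k with learnable scale w and bias b."""

    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.tensor(1.0))   # learnable, updated with the model
        self.b = nn.Parameter(torch.tensor(0.0))

    def forward(self, e, c):
        # e: (num_vectors, dim) emotion expression vectors e_ji
        # c: (num_speakers, dim) speaker averages c_k
        cos = torch.nn.functional.cosine_similarity(
            e.unsqueeze(1), c.unsqueeze(0), dim=-1)   # (num_vectors, num_speakers)
        return self.w * cos + self.b                   # matrix of S_ji,k values
```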

 Next, the learning unit 12 computes a loss based on these distances (hereinafter, "loss L_p") using the following loss function L_p (S107).

 [Math. 2] (the formula defining L_p is rendered as an image in the original publication)
 Here, t_ji,k = 1 when j = k (when e_ji and c_k belong to the same speaker) and t_ji,k = 0 when j ≠ k (when e_ji and c_k belong to different speakers), and learning is performed so that S_ji,k matches t_ji,k (so that L_p is minimized). In the matrix shown in FIG. 9, the shaded elements correspond to the elements for which j = k. In other words, L_p is a loss function that drives S_ji,k toward 1 when j = k (when the emotion expression vector and the speaker average belong to the same speaker) and toward 0 otherwise.
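 Since the formula for L_p is likewise given only as an image, the sketch below assumes one form consistent with the text: a binary cross-entropy that pushes each S_ji,k toward its target t_ji,k. The function signature and label bookkeeping are hypothetical.

```python
import torch
import torch.nn.functional as F

def loss_p(similarity_matrix, vector_speakers, average_speakers):
    """Loss driving S_ji,k toward 1 for same-speaker pairs and toward 0 otherwise.

    `similarity_matrix` is the (num_vectors, num_speakers) matrix of S_ji,k,
    `vector_speakers[i]` is the speaker label j of row i, and
    `average_speakers[k]` is the speaker label of column k.
    """
    targets = torch.tensor([[1.0 if j == k else 0.0 for k in average_speakers]
                            for j in vector_speakers])              # t_ji,k
    # Binary cross-entropy between sigmoid(S_ji,k) and t_ji,k.
    return F.binary_cross_entropy_with_logits(similarity_matrix, targets)
```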

 In other words, the loss function L_p is a loss function for keeping the emotion expression vector that the emotion recognition model m1 computes for the normal emotion utterance, in the course of outputting the recognition result, constant for each speaker (i.e., for minimizing the per-speaker error of the emotion expression vectors).

 Next, the learning unit 12 computes, as the loss L of the entire emotion recognition model m1, the weighted sum of the loss L_p based on the loss function L_p and the loss L_a based on the loss function L_a (S108). The loss L is computed, for example, as follows:
 L = αΣ(L_a/N) + (1 - α)L_p
 Here, α is a hyperparameter (weighting coefficient) for adjusting which is prioritized: the role of L_a ("reducing emotion recognition errors") or the role of L_p ("making the normal emotion vectors of the same speaker take similar values"). The larger the weight on L_a, the weaker the effect of L_p; the larger the weight on L_p, the more emotion recognition errors are tolerated to some extent.
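 Putting the pieces together, the weighted sum could be computed as in the sketch below; the default value of α is an assumption, since the publication leaves it as a tunable hyperparameter.

```python
def total_loss(losses_a, loss_p_value, alpha=0.5):
    """Weighted sum L = alpha * mean(L_a) + (1 - alpha) * L_p for one mini-batch.

    `losses_a` is the list of per-example losses L_a (length N) and
    `loss_p_value` is the mini-batch-level loss L_p.
    """
    mean_a = sum(losses_a) / len(losses_a)          # Σ(L_a / N)
    return alpha * mean_a + (1.0 - alpha) * loss_p_value
```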

 N is the number of training data items in the target mini-batch; that is, Σ(L_a/N) is the average of L_a over the target mini-batch.

 Next, the learning unit 12 uses error backpropagation based on L to simultaneously update the parameters of each block in the emotion recognition model m1 and the parameters of the loss function L_p (S109). By repeating steps S101 to S109, the parameters of these blocks are jointly optimized.
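 A minimal sketch of step S109, assuming PyTorch and reusing the hypothetical classes sketched above: the learnable parameters w and b used by L_p are registered in the same optimizer as the model, so that one backward pass updates everything at once. The optimizer choice and learning rate are assumptions.

```python
import itertools
import torch

model = EmotionRecognitionModel()          # hypothetical model sketched above
similarity = ScaledSimilarity()            # holds the learnable w and b
optimizer = torch.optim.Adam(
    itertools.chain(model.parameters(), similarity.parameters()), lr=1e-4)

def training_step(loss):
    """Backpropagate the total loss L and update all parameters at once (S109)."""
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```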

 Once steps S101 to S109 have been repeated a predetermined number of times (Yes in S110), the learning unit 12 ends the processing procedure of FIG. 7.

 [During Recognition]
 The processing that the emotion recognition device 10 executes at recognition time is as described with reference to FIG. 6. That is, the emotion recognition device 10 receives as input an utterance of the speaker whose emotion is to be recognized and that speaker's normal emotion utterance. The acoustic feature extraction unit 11-1 extracts an acoustic feature sequence from the utterance, and the acoustic feature extraction unit 11-2 extracts an acoustic feature sequence from the normal emotion utterance. The emotion recognition unit 13 inputs these two acoustic feature sequences into the trained emotion recognition model m1 and recognizes the speaker's emotion based on the per-emotion-label probabilities output by the emotion recognition model m1.
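 The recognition path could be sketched as follows, reusing the hypothetical helpers above; the emotion class names are illustrative assumptions.

```python
import torch

EMOTIONS = ["neutral", "happy", "sad", "angry"]   # illustrative label set

def recognize_emotion(model, input_wav, neutral_wav):
    """Recognize the speaker's emotion from an input utterance and a normal
    emotion utterance of the same speaker, using the trained model m1."""
    x_in = torch.tensor(extract_acoustic_features(input_wav)).float().unsqueeze(0)
    x_neu = torch.tensor(extract_acoustic_features(neutral_wav)).float().unsqueeze(0)
    with torch.no_grad():
        posteriors, _ = model(x_in, x_neu)        # posterior probability vector
    best = int(posteriors.argmax(dim=-1))         # class with maximum posterior
    return EMOTIONS[best], posteriors.squeeze(0)
```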

 As described above, according to the present embodiment, the emotion expression vector extraction block m11-2 is trained, based on L_p, so that it becomes more likely to output similar (relatively close) emotion expression vectors for multiple normal emotion utterances of the same speaker. Including the weighted L_p term in L alleviates the problem of the conventional technique, in which there is no guarantee that normal emotion utterances of the same speaker yield similar emotion expression vectors (i.e., the recognition result often changes when the normal emotion utterance changes). That is, in emotion recognition that uses normal emotion utterances, the phenomenon of the estimation result being specialized to a particular combination of input utterance and normal emotion utterance can be eliminated. This increases the likelihood that the same recognition result is obtained even for different normal emotion utterances, which contributes to improving the accuracy of recognizing the speaker's emotion from an utterance.

 In the present embodiment, the loss function L_a is an example of a first loss function, and the loss function L_p is an example of a second loss function.

 Although an embodiment of the present invention has been described in detail above, the present invention is not limited to this specific embodiment, and various modifications and changes are possible within the scope of the gist of the present invention as set forth in the claims.

10      Emotion recognition device
11-1    Acoustic feature extraction unit
11-2    Acoustic feature extraction unit
12      Learning unit
13      Emotion recognition unit
100     Drive device
101     Recording medium
102     Auxiliary storage device
103     Memory device
104     Processor
105     Interface device
121     Learning data storage unit
B       Bus
m1      Emotion recognition model
m11     Emotion expression vector extraction block
m11-1   Emotion expression vector extraction block
m11-2   Emotion expression vector extraction block
m12     Emotion probability estimation block

Claims (8)

 An emotion recognition learning method characterized in that a computer executes:
 a learning procedure of learning a model based on training data that includes, for a plurality of speakers, a plurality of input utterances by a speaker, a plurality of normal emotion utterances corresponding to the respective input utterances, and a correct emotion label corresponding to each of the input utterances, the model being learned based on a first loss function for minimizing an error, with respect to the correct label corresponding to an input utterance, of the emotion recognition result output by the model into which that input utterance and the normal emotion utterance corresponding to it have been input, and a second loss function for making a vector representing the nature of the emotional expression, which is computed for the normal emotion utterance in the process in which the model outputs the recognition result, constant for each speaker.
 The emotion recognition learning method according to claim 1, characterized in that the second loss function is based on the distance between the average, over the normal emotion utterances corresponding to the same speaker, of the vectors computed by the model and each of those vectors.
 The emotion recognition learning method according to claim 1 or 2, characterized in that the learning procedure learns the model based on a weighted sum of the first loss function and the second loss function.
 An emotion recognition method characterized in that a computer executes:
 an emotion recognition procedure of recognizing the emotion of the speaker of an input utterance and a normal emotion utterance, using a model learned by the emotion recognition learning method according to claim 1.
 An emotion recognition learning device characterized by comprising:
 a learning unit configured to learn a model based on training data that includes, for a plurality of speakers, a plurality of input utterances by a speaker, a plurality of normal emotion utterances corresponding to the respective input utterances, and a correct emotion label corresponding to each of the input utterances, the model being learned based on a first loss function for minimizing an error, with respect to the correct label corresponding to an input utterance, of the emotion recognition result output by the model into which that input utterance and the normal emotion utterance corresponding to it have been input, and a second loss function for making a vector representing the nature of the emotional expression, which is computed for the normal emotion utterance in the process in which the model outputs the recognition result, constant for each speaker.
 An emotion recognition device characterized by comprising:
 an emotion recognition unit configured to recognize the emotion of the speaker of an input utterance and a normal emotion utterance, using a model learned by the emotion recognition learning method according to claim 1.
 A program characterized by causing a computer to execute the emotion recognition learning method according to claim 1.
 A program characterized by causing a computer to execute the emotion recognition method according to claim 4.
PCT/JP2022/045706 2022-12-12 2022-12-12 Emotion recognition learning method, emotion recognition method, emotion recognition learning device, emotion recognition device, and program Ceased WO2024127472A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2024563791A JPWO2024127472A1 (en) 2022-12-12 2022-12-12
PCT/JP2022/045706 WO2024127472A1 (en) 2022-12-12 2022-12-12 Emotion recognition learning method, emotion recognition method, emotion recognition learning device, emotion recognition device, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/045706 WO2024127472A1 (en) 2022-12-12 2022-12-12 Emotion recognition learning method, emotion recognition method, emotion recognition learning device, emotion recognition device, and program

Publications (1)

Publication Number Publication Date
WO2024127472A1 (en)

Family

ID=91484498

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/045706 Ceased WO2024127472A1 (en) 2022-12-12 2022-12-12 Emotion recognition learning method, emotion recognition method, emotion recognition learning device, emotion recognition device, and program

Country Status (2)

Country Link
JP (1) JPWO2024127472A1 (en)
WO (1) WO2024127472A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2020126125A (en) * 2019-02-04 2020-08-20 富士通株式会社 Voice processing program, voice processing method, and voice processing apparatus
WO2021171552A1 (en) * 2020-02-28 2021-09-02 日本電信電話株式会社 Emotion recognition device, emotion recognition model learning device, method for same, and program

Also Published As

Publication number Publication date
JPWO2024127472A1 (en) 2024-06-20

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22968384

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2024563791

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22968384

Country of ref document: EP

Kind code of ref document: A1