
WO2024127472A1 - Emotion recognition learning method, emotion recognition method, emotion recognition learning device, emotion recognition device, and program - Google Patents

Emotion recognition learning method, emotion recognition method, emotion recognition learning device, emotion recognition device, and program Download PDF

Info

Publication number
WO2024127472A1
WO2024127472A1 (PCT application PCT/JP2022/045706)
Authority
WO
WIPO (PCT)
Prior art keywords
emotion
utterance
emotion recognition
input
learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/JP2022/045706
Other languages
French (fr)
Japanese (ja)
Inventor
厚志 安藤
有実子 村田
岳至 森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NTT Inc
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp filed Critical Nippon Telegraph and Telephone Corp
Priority to JP2024563791A priority Critical patent/JPWO2024127472A1/ja
Priority to PCT/JP2022/045706 priority patent/WO2024127472A1/en
Publication of WO2024127472A1 publication Critical patent/WO2024127472A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/10: Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L15/16: Speech classification or search using artificial neural networks
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Definitions

  • The present invention relates to an emotion recognition learning method, an emotion recognition method, an emotion recognition learning device, an emotion recognition device, and a program.
  • Recognizing a speaker's emotions from speech is an important technology. For example, by recognizing a speaker's emotions during counseling, it is possible to visualize the patient's feelings of anxiety or sadness, which is expected to deepen the counselor's understanding and improve the quality of guidance. Furthermore, by recognizing human emotions in human-machine dialogue, it is possible to build a more friendly dialogue system that can share in the human's happiness if the human is happy, or encourage the human if he or she is sad.
  • Hereinafter, the technology that takes an utterance as input and estimates which emotion class (e.g., neutral, anger, joy, sadness) the speaker's emotion contained in the utterance belongs to is referred to as "emotion recognition".
  • Patent Document 1 and Non-Patent Document 1 propose a technology (hereinafter referred to as the "conventional technology") for improving the accuracy of emotion recognition by using speech spoken with a "normal" emotion (an emotion that is neither a positive emotion such as joy nor a negative emotion such as anger or sadness, but the ordinary emotion of a normal state) (hereinafter referred to as "normal emotion speech").
  • Figure 1 is a diagram for explaining the conventional technology.
  • In the conventional technology, in addition to the input utterance to be recognized, a normal emotional utterance of the same speaker as the input utterance is required at estimation time, and the emotion recognition result is obtained by inputting both into an emotion recognition model.
  • Inside the emotion recognition model, an "emotion expression vector extraction block" is first applied to each of the input utterance and the normal emotional utterance to extract an emotion expression vector that represents the nature of the emotional expression of the entire utterance. Then, based on the emotion expression vectors of the input utterance and the normal emotional utterance, an "emotion estimation block for the input utterance using the emotion expression vector of the normal emotional utterance" is used to estimate the emotion of the input utterance.
  • A statistical model based on deep learning is used for the emotion recognition model, and each block in the emotion recognition model is trained jointly, prior to estimation, using a set of labeled data each consisting of an input utterance, a normal emotional utterance, and the correct emotion label of the input utterance (the correct value for emotion recognition of the input utterance).
  • The present invention was made in consideration of the above points, and aims to contribute to improving the accuracy of recognizing a speaker's emotions from speech.
  • To that end, a computer executes a learning procedure for learning the model based on training data including, for multiple speakers, multiple input utterances by a speaker, multiple normal emotion utterances corresponding to each of the input utterances, and correct emotion labels corresponding to each of the input utterances, and based on a first loss function for minimizing an error in the emotion recognition result output by a model to which any of the input utterances and the normal emotion utterance corresponding to that input utterance are input, relative to the correct emotion label corresponding to that input utterance, and a second loss function for making constant for each speaker a vector representing the nature of the emotional expression calculated for the normal emotion utterance in the process in which the model outputs the recognition result.
  • FIG. 1 is a diagram for explaining a conventional technique.
  • FIG. 2 is a diagram illustrating an example of a hardware configuration of an emotion recognition device 10 according to an embodiment of the present invention.
  • FIG. 3 is a diagram illustrating an example of a functional configuration during learning of the emotion recognition device 10 according to an embodiment of the present invention.
  • FIG. 4 is a diagram showing an example of a configuration of an emotion recognition model m1 according to an embodiment of the present invention.
  • FIG. 5 is a diagram for explaining an outline of a learning method for the emotion recognition model m1.
  • FIG. 6 is a diagram illustrating an example of a functional configuration of the emotion recognition device 10 according to the embodiment of the present invention when recognizing emotions.
  • FIG. 7 is a flowchart illustrating an example of a process performed by the emotion recognition device 10 during learning.
  • FIG. 8 is a diagram for explaining calculation of an average value of emotion expression vectors of normal emotion utterances for each speaker.
  • FIG. 9 is a diagram for explaining calculation of the distance S_ji,k between each emotion expression vector e_ji and each speaker average c_k.
  • In the present embodiment, when training the emotion recognition model, in addition to the loss function used in the conventional technology for minimizing the error between the correct emotion label and the emotion recognition result, a new loss function (a speaker identity loss function for normal emotion utterances) is introduced so that the emotion expression vector of a normal emotion utterance takes the same vector value whenever the speaker is the same (in other words, even if the normal emotion utterance changes, it shows a constant vector value as long as the speaker is the same).
  • This optimizes the emotion expression vector extraction block so that a constant emotion expression vector is obtained for different normal emotion utterances of the same speaker, solving the problem of specialization to particular combinations of input utterance and normal emotion utterance.
  • FIG. 2 is a diagram showing an example of the hardware configuration of an emotion recognition device 10 according to an embodiment of the present invention.
  • the emotion recognition device 10 in FIG. 2 includes a drive device 100, an auxiliary storage device 102, a memory device 103, a processor 104, and an interface device 105, all of which are interconnected via a bus B.
  • the program that realizes the processing in the emotion recognition device 10 is provided by a recording medium 101 such as a CD-ROM.
  • the program is installed from the recording medium 101 via the drive device 100 into the auxiliary storage device 102.
  • the program does not necessarily have to be installed from the recording medium 101, but may be downloaded from another computer via a network.
  • the auxiliary storage device 102 stores the installed program as well as necessary files, data, etc.
  • When an instruction to start a program is received, the memory device 103 reads out the program from the auxiliary storage device 102 and stores it.
  • the processor 104 is a CPU or a GPU (Graphics Processing Unit), or a CPU and a GPU, and executes functions related to the emotion recognition device 10 in accordance with the program stored in the memory device 103.
  • the interface device 105 is used as an interface for connecting to a network.
  • FIG. 3 is a diagram showing an example of the functional configuration of the emotion recognition device 10 during learning in an embodiment of the present invention.
  • the emotion recognition device 10 (emotion recognition learning device) during learning has two acoustic feature extraction units 11 (acoustic feature extraction unit 11-1, acoustic feature extraction unit 11-2) and one learning unit 12. These units are realized by processing in which one or more programs installed in the emotion recognition device 10 are executed by the processor 104.
  • the emotion recognition device 10 also uses a learning data storage unit 121.
  • the learning data storage unit 121 can be realized, for example, using the auxiliary storage device 102, or a storage device that can be connected to the emotion recognition device 10 via a network.
  • The learning data storage unit 121 stores in advance a large amount of learning data, each item of which is a set of an input utterance, the correct emotion label of the input utterance, a normal emotion utterance of the same speaker as the input utterance, and a speaker label that identifies the speaker of the normal emotion utterance.
  • the input utterance and the normal emotion utterance of each learning data are different utterances.
  • the correct emotion label refers to a correct value of emotion recognition of the input utterance.
  • Emotion recognition refers to estimating, using a certain utterance as input, which emotion class (e.g., normal, anger, joy, sadness, etc.) the speaker's emotion contained in the utterance belongs to.
  • The normal emotion utterance refers to an utterance spoken with a "normal" emotion.
  • "utterance” refers to a voice (voice data) generated by an action that expresses a linguistic expression.
  • the input utterance and the normal emotion utterance of each learning data have different emotions and utterance contents (text to be spoken).
  • the learning data storage unit 121 stores such learning data for multiple speakers.
  • The contents of speech by each speaker may be different or the same. It is also preferable that multiple pieces of training data with different utterance contents are included for the same speaker. This avoids the possibility that, if each speaker always spoke fixed content, an emotion recognition model might be constructed that returns a specific emotion recognition result from the utterance contents (phonological bias). In this case, it is also preferable that the normal emotion utterances differ across the training data of the same speaker.
  • the training data storage unit 121 stores, as training data for multiple speakers, multiple input utterances by the speaker, multiple normal emotion utterances corresponding to each input utterance, and correct emotion labels corresponding to each input utterance.
  • the acoustic feature extraction unit 11 receives an utterance (audio data of the utterance) as input, extracts an acoustic feature sequence from the utterance, and outputs the acoustic feature sequence.
  • the acoustic feature extraction unit 11-1 extracts an acoustic feature sequence from the input utterance of each training data
  • the acoustic feature extraction unit 11-2 extracts an acoustic feature sequence from normal emotion utterance of each training data.
  • An acoustic feature series is data in which an input utterance is divided into short-time windows, acoustic features are obtained for each short-time window, and the acoustic feature vectors are arranged in chronological order.
  • the acoustic features include one or more of the power spectrum, Mel filter bank output, MFCC, fundamental frequency, logarithmic power, HNR (Harmonics-to-Noise Ratio), speech probability, number of zero-crossings, and their first or second derivatives.
  • the speech probability is obtained, for example, from the likelihood ratio of a pre-trained speech/non-speech GMM model.
  • HNR can be calculated, for example, using a cepstrum-based method (Peter Murphy, Olatunji Akande, "Cepstrum-Based Harmonics-to-Noise Ratio Measurement in Voiced Speech", Lecture Notes in Artificial Intelligence, Nonlinear Speech Modeling and Applications, Vol. 3445, Springer-Verlag, 2005).
  • The learning unit 12 takes, for each piece of learning data, the acoustic feature sequences output from the acoustic feature extraction units 11 for the input utterance and the normal emotional utterance, together with the correct emotion label and speaker label of that learning data, inputs the two acoustic feature sequences into an emotion recognition model based on comparison with the normal emotional utterance (hereinafter simply referred to as "emotion recognition model m1"), and trains the emotion recognition model m1 using the correct emotion label and speaker label of the input utterance as teacher data.
  • FIG. 4 is a diagram showing an example of the configuration of emotion recognition model m1 in an embodiment of the present invention.
  • emotion recognition model m1 includes two emotion expression vector extraction blocks m11 (emotion expression vector extraction block m11-1, emotion expression vector extraction block m11-2) and one emotion probability estimation block m12.
  • the emotional expression vector extraction block m11 converts the acoustic feature sequence into a fixed-length vector (hereinafter referred to as "emotion expression vector") that represents the nature (or features) of the emotional expression of the entire utterance during the process in which the emotion recognition model m1 recognizes the speaker's emotion.
  • the emotional expression vector extraction block m11 uses a deep learning model (for example, a model composed of a Transformer and Self Attentive Pooling) that extracts a fixed-length expression from an input vector sequence of any length.
  • the emotional expression vector extraction block m11-1 converts the acoustic feature sequence extracted from the input utterance into an emotional expression vector.
  • the emotional expression vector extraction block m11-2 converts the acoustic feature sequence extracted from normal emotional utterance into an emotional expression vector.
  • the emotion probability estimation block m12 uses a deep learning model (e.g., a model with one or more layers of fully connected layers and activation functions) that projects a vector representing the posterior probability of each emotion (a vector in which each dimension indicates the posterior probability of a different emotion) from a fixed-length vector (emotion expression vector).
  • As shown in FIG. 5, the learning unit 12 updates the parameters of all blocks based on the weighted sum of the speaker identity loss function L_p for normal emotion utterances and the loss function L_a based on the error between the output of the emotion recognition model m1 and the correct emotion label. The learning unit 12 updates the model parameters using stochastic gradient descent, as in the conventional technology. After updating the model parameters a certain number of times, the learning unit 12 outputs the finally obtained emotion recognition model m1.
  • FIG. 6 is a diagram showing an example of the functional configuration of emotion recognition device 10 during emotion recognition in an embodiment of the present invention.
  • the same parts as in FIG. 3 are given the same reference numerals, and their explanations are omitted.
  • the emotion recognition device 10 when recognizing emotions, has an emotion recognition unit 13 instead of a learning unit 12.
  • the emotion recognition unit 13 is realized by a process in which one or more programs installed in the emotion recognition device 10 are executed by the processor 104. Furthermore, when recognizing emotions, the emotion recognition device 10 does not use the learning data storage unit 121. Note that different computers may be used for learning and emotion recognition.
  • The emotion recognition unit 13 receives as input an acoustic feature sequence extracted by the acoustic feature extraction unit 11-1 from the input utterance of the person whose emotion is to be recognized and an acoustic feature sequence extracted by the acoustic feature extraction unit 11-2 from that person's normal emotional utterance, and outputs an emotion recognition result based on comparison with the normal emotional utterance (hereinafter simply referred to as the "emotion recognition result") by forward-propagating the two acoustic feature sequences through the trained emotion recognition model m1.
  • The emotion recognition result, which is the output of the emotion recognition unit 13, includes the posterior probability vector of each emotion (the output of the forward propagation of the emotion recognition model m1) and the emotion class with the maximum posterior probability in that vector.
  • The emotion class with the maximum posterior probability is used as the final emotion recognition result.
  • FIG. 7 is a flowchart illustrating an example of a process performed by the emotion recognition device 10 during learning.
  • In step S101, the learning unit 12 randomly selects N pieces of learning data from the group of learning data stored in the learning data storage unit 121 to obtain one mini-batch (hereinafter referred to as the "target mini-batch"). More precisely, the learning unit 12 randomly reorders the learning data and then generates the mini-batch.
  • the emotion recognition device 10 then executes a loop process including steps S102 to S104 for each of the N pieces of training data included in the target mini-batch.
  • the training data being processed in this loop process is hereinafter referred to as the "target training data.”
  • In step S102, the acoustic feature extraction unit 11-1 extracts an acoustic feature sequence from the input utterance of the target learning data, and the acoustic feature extraction unit 11-2 extracts an acoustic feature sequence from the normal emotion utterance of the target learning data.
  • The learning unit 12 then inputs the two extracted acoustic feature sequences into the emotion recognition model m1 and acquires the recognition result (the posterior probability vector of each emotion) and the emotion expression vector of the normal emotion utterance output by the model (S103). That is, in the process in which the emotion recognition model m1 outputs the recognition result, the emotion expression vector extraction block m11-2 calculates the emotion expression vector of the normal emotion utterance, and the learning unit 12 acquires this emotion expression vector. The learning unit 12 stores the acquired emotion expression vector in association with the speaker label of the target learning data.
  • Next, the learning unit 12 calculates the loss function L_a (the cross-entropy between the correct emotion label of the input utterance and the posterior probability vector of each emotion) from the correct emotion label of the target learning data and the recognition result, thereby obtaining a loss based on the error of the recognition result with respect to the correct emotion label (hereinafter referred to as "loss L_a") (S104).
  • Once steps S102 to S104 have been executed for all learning data in the target mini-batch, the learning unit 12 calculates, for each speaker, the average value of the emotion expression vectors (of the normal emotion utterances) obtained for the learning data in the target mini-batch in step S103 (S105).
  • Fig. 8 is a diagram for explaining calculation of the average value of emotion expression vectors of normal emotion utterances for each speaker.
  • In FIG. 8, e_ji denotes the emotion expression vector of the i-th normal emotion utterance of speaker j in the target mini-batch.
  • the speaker of a certain emotion expression vector can be identified based on the speaker label associated with the emotion expression vector in step S103.
  • Fig. 8 shows an example in which emotion expression vectors of normal emotion utterances are obtained for each speaker whose speaker label is 1, 2, or 3.
  • the learning unit 12 calculates the average value of each set of emotion expression vectors associated with the same (common) speaker label.
  • The average value for each speaker is referred to as the "speaker average c_k" (k is the speaker label).
  • In the example of FIG. 8, the speaker labels are 1, 2, and 3, so the speaker averages c_1, c_2, and c_3 are calculated.
  • Next, the learning unit 12 calculates the distance S_ji,k between each emotion expression vector e_ji and each speaker average c_k (S106).
  • FIG. 9 is a diagram for explaining the calculation of the distance S_ji,k between each emotion expression vector e_ji and each speaker average c_k.
  • In FIG. 9, each row corresponds to an e_ji and each column corresponds to a c_k.
  • Each element S_ji,k of the matrix corresponds to the distance between e_ji and c_k.
  • The learning unit 12 calculates this distance using, for example, a formula with learnable parameters.
  • w and b are learnable parameters in this formula, and are updated based on the loss function simultaneously with the parameters of each block (each model) constituting the emotion recognition model m1.
  • The learning unit 12 then calculates a loss based on these distances (hereinafter referred to as "loss L_p") using the loss function L_p (S107).
  • The loss function L_p is a loss function for making the emotion expression vector calculated for normal emotion utterances constant for each speaker in the process in which the emotion recognition model m1 outputs the recognition result (that is, for minimizing the within-speaker variation of this vector).
  • The learning unit 12 then calculates the weighted sum of the loss L_p based on the loss function L_p and the loss L_a based on the loss function L_a as the loss L of the entire emotion recognition model m1 (S108).
  • α is a hyperparameter (weighting coefficient) for adjusting the priority between the role of L_a in "reducing emotion recognition errors" and the role of L_p in "making the normal emotion vectors of the same speaker take similar values."
  • N is the number of training data in the target mini-batch; that is, Σ(L_a/N) is the average value of L_a over the target mini-batch.
  • The learning unit 12 simultaneously updates the parameters of each block in the emotion recognition model m1 and of the loss function L_p using the backpropagation algorithm based on L (S109). By repeating steps S101 to S109, the parameters of the blocks are jointly optimized.
  • the emotion recognition device 10 receives as input an utterance of a speaker to be subjected to emotion recognition and the speaker's normal emotional utterance.
  • the acoustic feature extraction unit 11-1 extracts an acoustic feature series from the utterance, and the acoustic feature extraction unit 11-2 extracts an acoustic feature series from the normal emotional utterance.
  • the emotion recognition unit 13 inputs these two acoustic feature series to a trained emotion recognition model m1, and recognizes the speaker's emotion based on the probability for each emotion label output by the emotion recognition model m1.
  • Because of the weighted contribution of L_p to L, the emotion expression vector extraction block m11-2 is trained so as to output similar (small-distance) emotion expression vectors for multiple normal emotion utterances by the same speaker.
  • This alleviates the problem of the conventional technology that there is no guarantee that normal emotion utterances by the same speaker will have similar emotion expression vectors (so that the recognition result often changes when the normal emotion utterance changes).
  • Note that the loss function L_a is an example of a first loss function, and the loss function L_p is an example of a second loss function.
  • 10 Emotion recognition device; 11-1 Acoustic feature extraction unit; 11-2 Acoustic feature extraction unit; 12 Learning unit; 13 Emotion recognition unit; 100 Drive device; 101 Recording medium; 102 Auxiliary storage device; 103 Memory device; 104 Processor; 105 Interface device; 121 Learning data storage unit; B Bus; m1 Emotion recognition model; m11 Emotion expression vector extraction block; m11-1 Emotion expression vector extraction block; m11-2 Emotion expression vector extraction block; m12 Emotion probability estimation block

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The present invention involves: learning data that, for a plurality of speakers, includes a plurality of ways of input utterances by the speakers, a plurality of ways of ordinary-state emotion utterances corresponding to the input utterances, and correct-answer labels for emotion corresponding to the input utterances; and a model that receives inputs of an input utterance and a corresponding ordinary-state emotion utterance to output an emotion recognition result. The invention causes a computer to execute a learning procedure for training the model on the basis of a first loss function for minimizing a difference (error) between a correct-answer label corresponding to the input utterance input to the model and the recognition result output by the model on the basis of the learning data, and a second loss function for achieving, for every speaker, a constant vector representing the quality of emotional expression calculated for the ordinary-state emotion utterance in the course of the model outputting the recognition result. In this way, the invention contributes to an increase in the accuracy of recognizing the emotion of a speaker from an utterance.

Description

Emotion recognition learning method, emotion recognition method, emotion recognition learning device, emotion recognition device, and program

The present invention relates to an emotion recognition learning method, an emotion recognition method, an emotion recognition learning device, an emotion recognition device, and a program.

Recognizing a speaker's emotions from speech is an important technology. For example, by recognizing a speaker's emotions during counseling, it is possible to visualize the patient's feelings of anxiety or sadness, which is expected to deepen the counselor's understanding and improve the quality of guidance. Furthermore, by recognizing human emotions in human-machine dialogue, it is possible to build a more friendly dialogue system that can share in the human's happiness if the human is happy, or encourage the human if he or she is sad. Hereafter, the technology that takes an utterance as input and estimates which emotion class (e.g., neutral, anger, joy, sadness) the speaker's emotion contained in the utterance belongs to will be referred to as "emotion recognition".

Patent Document 1 and Non-Patent Document 1 propose a technology (hereinafter referred to as the "conventional technology") for improving the accuracy of emotion recognition by using speech spoken with a "normal" emotion (an emotion that is neither a positive emotion such as joy nor a negative emotion such as anger or sadness, but the ordinary emotion of a normal state) (hereinafter referred to as "normal emotion speech"). The conventional technology is based on the hypothesis that if you know how a person normally speaks (i.e., their normal emotion speech), the accuracy of emotion recognition for that person will improve.

FIG. 1 is a diagram for explaining the conventional technology. In the conventional technology, in addition to the input utterance to be recognized, a normal emotional utterance of the same speaker as the input utterance is required at estimation time, and the emotion recognition result is obtained by inputting these into an emotion recognition model. Inside the emotion recognition model, an "emotion expression vector extraction block" is first applied to each of the input utterance and the normal emotional utterance to extract an emotion expression vector that represents the nature of the emotional expression of the entire utterance. Then, based on the emotion expression vectors of the input utterance and the normal emotional utterance, an "emotion estimation block for the input utterance using the emotion expression vector of the normal emotional utterance" is used to estimate the emotion of the input utterance. A statistical model based on deep learning is used for the emotion recognition model, and each block in the emotion recognition model is trained jointly, prior to estimation, using a set of labeled data each consisting of an input utterance, a normal emotional utterance, and the correct emotion label of the input utterance (the correct value for emotion recognition of the input utterance).

International Publication No. 2021/171552

Andreas Triantafyllopoulos, Shuo Liu, Bjoern W. Schuller, "DEEP SPEAKER CONDITIONING FOR SPEECH EMOTION RECOGNITION", Proc. of ICME, pp. 1-6, 2021

In the conventional technology, there is a problem that the recognition result changes when a different normal emotion utterance is used during recognition (for example, when input utterance X and normal emotion utterance A are used, the recognition result of input utterance X may be "joy", whereas when X and normal emotion utterance B are used, the recognition result of input utterance X may be "sadness"). This is because a set consisting of the input utterance, a normal emotion utterance, and the correct emotion label of the input utterance is used when training the emotion recognition model, so each block in the emotion recognition model is optimized to output recognition results specialized to the particular combination of input utterance and normal emotion utterance. In the conventional technology, combinations with various normal emotion utterances are included in the training data in order to prevent over-specialization to particular combinations, but this does not explicitly reduce the specialization and therefore does not adequately address the problem.

The present invention was made in consideration of the above points, and aims to contribute to improving the accuracy of recognizing a speaker's emotions from speech.

In order to solve the above problem, a computer executes a learning procedure for learning the model based on training data including, for multiple speakers, multiple input utterances by a speaker, multiple normal emotion utterances corresponding to each of the input utterances, and correct emotion labels corresponding to each of the input utterances, and based on a first loss function for minimizing an error in the emotion recognition result output by a model to which any of the input utterances and the normal emotion utterance corresponding to that input utterance are input, relative to the correct emotion label corresponding to that input utterance, and a second loss function for making constant for each speaker a vector representing the nature of the emotional expression calculated for the normal emotion utterance in the process in which the model outputs the recognition result.

This can contribute to improving the accuracy of recognizing a speaker's emotions from speech.

FIG. 1 is a diagram for explaining a conventional technique.
FIG. 2 is a diagram illustrating an example of a hardware configuration of an emotion recognition device 10 according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating an example of a functional configuration of the emotion recognition device 10 during learning.
FIG. 4 is a diagram showing an example of a configuration of an emotion recognition model m1.
FIG. 5 is a diagram for explaining an outline of a learning method for the emotion recognition model m1.
FIG. 6 is a diagram illustrating an example of a functional configuration of the emotion recognition device 10 during emotion recognition.
FIG. 7 is a flowchart illustrating an example of a process performed by the emotion recognition device 10 during learning.
FIG. 8 is a diagram for explaining calculation of the average value of emotion expression vectors of normal emotion utterances for each speaker.
FIG. 9 is a diagram for explaining calculation of the distance S_ji,k between each emotion expression vector e_ji and each speaker average c_k.

In the present embodiment, when training an emotion recognition model that uses normal emotion utterances (utterances spoken with the ordinary emotion of a normal state, which is neither a positive emotion such as joy nor a negative emotion such as anger or sadness), a new loss function (a speaker identity loss function for normal emotion utterances) is introduced in addition to the loss function used in the conventional method for minimizing the error between the correct emotion label and the emotion recognition result. The new loss function is designed so that the emotion expression vector of a normal emotion utterance takes the same vector value whenever the speaker is the same (in other words, even if the normal emotion utterance changes, it shows a constant vector value as long as the speaker is the same). This optimizes the emotion expression vector extraction block so that a constant emotion expression vector is obtained for different normal emotion utterances of the same speaker, solving the problem of specialization to particular combinations of input utterance and normal emotion utterance.

The following describes an embodiment of the present invention with reference to the drawings.

FIG. 2 is a diagram showing an example of the hardware configuration of an emotion recognition device 10 according to an embodiment of the present invention. The emotion recognition device 10 in FIG. 2 includes a drive device 100, an auxiliary storage device 102, a memory device 103, a processor 104, and an interface device 105, all of which are interconnected via a bus B.

The program that realizes the processing in the emotion recognition device 10 is provided by a recording medium 101 such as a CD-ROM. When the recording medium 101 storing the program is set in the drive device 100, the program is installed from the recording medium 101 via the drive device 100 into the auxiliary storage device 102. However, the program does not necessarily have to be installed from the recording medium 101, and may instead be downloaded from another computer via a network. The auxiliary storage device 102 stores the installed program as well as necessary files, data, and the like.

When an instruction to start a program is received, the memory device 103 reads out the program from the auxiliary storage device 102 and stores it. The processor 104 is a CPU or a GPU (Graphics Processing Unit), or a CPU and a GPU, and executes functions related to the emotion recognition device 10 in accordance with the program stored in the memory device 103. The interface device 105 is used as an interface for connecting to a network.

FIG. 3 is a diagram showing an example of the functional configuration of the emotion recognition device 10 during learning in an embodiment of the present invention. As shown in FIG. 3, the emotion recognition device 10 during learning (the emotion recognition learning device) has two acoustic feature extraction units 11 (acoustic feature extraction units 11-1 and 11-2) and one learning unit 12. These units are realized by processing in which one or more programs installed in the emotion recognition device 10 are executed by the processor 104. The emotion recognition device 10 also uses a learning data storage unit 121. The learning data storage unit 121 can be realized using, for example, the auxiliary storage device 102 or a storage device that can be connected to the emotion recognition device 10 via a network.

[Learning Data Storage Unit 121]
The learning data storage unit 121 stores in advance a large amount of learning data, each item of which is a set of an input utterance, the correct emotion label of the input utterance, a normal emotion utterance of the same speaker as the input utterance, and a speaker label that identifies the speaker of the normal emotion utterance. The input utterance and the normal emotion utterance of each piece of learning data are assumed to be different utterances. The correct emotion label is the correct value of emotion recognition for the input utterance. Emotion recognition means estimating, with a certain utterance as input, which emotion class (e.g., normal, anger, joy, sadness) the speaker's emotion contained in that utterance belongs to. A normal emotion utterance is an utterance spoken with a "normal" emotion. An "utterance" is a voice (voice data) generated by an action that expresses a linguistic expression. The input utterance and the normal emotion utterance of each piece of learning data differ in emotion and in utterance content (the text being spoken). The learning data storage unit 121 stores such learning data for multiple speakers. The contents of speech by each speaker may be different or the same. It is also preferable that multiple pieces of training data with different utterance contents are included for the same speaker. This avoids the possibility that, if each speaker always spoke fixed content, an emotion recognition model might be constructed that returns a specific emotion recognition result from the utterance contents (phonological bias). In this case, it is also preferable that the normal emotion utterances differ across the training data of the same speaker. Therefore, it can be said that the learning data storage unit 121 stores in advance, as training data for multiple speakers, multiple input utterances by each speaker, multiple normal emotion utterances corresponding to each input utterance, and correct emotion labels corresponding to each input utterance.
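As a concrete illustration, one such learning-data record could be represented as sketched below. This is a minimal sketch only; the class and field names are illustrative assumptions, not a data format defined by this publication.

```python
from dataclasses import dataclass

@dataclass
class LearningDataRecord:
    """One learning-data record as described above (field names are illustrative)."""
    input_utterance_wav: str    # path to the input utterance to be recognized
    correct_emotion_label: int  # correct emotion class of the input utterance (e.g. 0=normal, 1=anger, ...)
    normal_emotion_wav: str     # a different, normal-emotion utterance by the same speaker
    speaker_label: int          # identifier of the speaker of the normal-emotion utterance
```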

[Acoustic Feature Extraction Unit 11]
The acoustic feature extraction unit 11 receives an utterance (its audio data) as input, extracts an acoustic feature sequence from the utterance, and outputs the acoustic feature sequence. During learning, the acoustic feature extraction unit 11-1 extracts an acoustic feature sequence from the input utterance of each piece of learning data, and the acoustic feature extraction unit 11-2 extracts an acoustic feature sequence from the normal emotion utterance of each piece of learning data.

An acoustic feature sequence is data in which the input utterance is divided into short-time windows, acoustic features are obtained for each short-time window, and the resulting acoustic feature vectors are arranged in chronological order. The acoustic features include one or more of the power spectrum, Mel filter bank outputs, MFCCs, fundamental frequency, logarithmic power, HNR (Harmonics-to-Noise Ratio), speech probability, number of zero crossings, and their first or second derivatives. The speech probability can be obtained, for example, from the likelihood ratio of a pre-trained speech/non-speech GMM model. The HNR can be calculated, for example, using a cepstrum-based method (Peter Murphy, Olatunji Akande, "Cepstrum-Based Harmonics-to-Noise Ratio Measurement in Voiced Speech", Lecture Notes in Artificial Intelligence, Nonlinear Speech Modeling and Applications, Vol. 3445, Springer-Verlag, 2005). Using more acoustic features makes it possible to express more of the various characteristics contained in speech, which tends to improve emotion recognition accuracy.
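A minimal sketch of extracting such an acoustic feature sequence is shown below, assuming the librosa library and only an illustrative subset of the features listed above (log-Mel filter bank outputs, MFCCs, zero-crossing rate, and their first derivatives); the window settings and the exact feature set of the embodiment are assumptions, not values specified by this publication.

```python
import numpy as np
import librosa

def extract_acoustic_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Return an acoustic feature sequence of shape (num_frames, feature_dim)."""
    y, sr = librosa.load(wav_path, sr=sr)

    # Short-time windows: 25 ms frame length, 10 ms hop (illustrative values).
    n_fft, hop = int(0.025 * sr), int(0.010 * sr)

    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=40)
    log_mel = librosa.power_to_db(mel)                                # (40, T) log-Mel filter bank
    mfcc = librosa.feature.mfcc(S=log_mel, n_mfcc=13)                 # (13, T) MFCCs
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=n_fft,
                                             hop_length=hop)         # (1, T) zero crossings

    feats = np.vstack([log_mel, mfcc, zcr])
    feats = np.vstack([feats, librosa.feature.delta(feats)])          # append first derivatives
    return feats.T                                                    # (T, feature_dim)
```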

[Learning Unit 12]
The learning unit 12 takes, for each piece of learning data, the acoustic feature sequences output from the acoustic feature extraction units 11 for the input utterance and the normal emotion utterance, together with the correct emotion label and speaker label of that learning data, inputs the two acoustic feature sequences into an emotion recognition model based on comparison with the normal emotion utterance (hereinafter simply referred to as "emotion recognition model m1"), and trains the emotion recognition model m1 using the correct emotion label and the speaker label of the input utterance as teacher data.

FIG. 4 is a diagram showing an example of the configuration of the emotion recognition model m1 in an embodiment of the present invention. As shown in FIG. 4, the emotion recognition model m1 includes two emotion expression vector extraction blocks m11 (emotion expression vector extraction blocks m11-1 and m11-2) and one emotion probability estimation block m12.

The emotion expression vector extraction block m11 converts an acoustic feature sequence into a fixed-length vector representing the nature (or features) of the emotional expression of the entire utterance (hereinafter referred to as an "emotion expression vector") during the process in which the emotion recognition model m1 recognizes the speaker's emotion. The emotion expression vector extraction block m11 uses a deep learning model that extracts a fixed-length representation from an input vector sequence of arbitrary length (for example, a model composed of a Transformer and Self-Attentive Pooling). In FIG. 4, the emotion expression vector extraction block m11-1 converts the acoustic feature sequence extracted from the input utterance into an emotion expression vector, and the emotion expression vector extraction block m11-2 converts the acoustic feature sequence extracted from the normal emotion utterance into an emotion expression vector.

The emotion probability estimation block m12 uses a deep learning model that projects a fixed-length vector (an emotion expression vector) onto a vector representing the posterior probability of each emotion (a vector in which each dimension indicates the posterior probability of a different emotion), for example a model in which one or more fully connected layers and activation functions are stacked.
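The sketch below is a minimal PyTorch rendering of this structure: two emotion expression vector extraction blocks (a Transformer encoder followed by simplified self-attentive pooling) and an emotion probability estimation block built from fully connected layers. The layer sizes, the ReLU activation, and the simple concatenation of the two emotion expression vectors inside the estimation block are illustrative assumptions; the publication does not specify them here. The forward pass also returns the emotion expression vector of the normal emotion utterance, since the learning procedure uses it for the speaker identity loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionExpressionVectorExtractor(nn.Module):
    """Emotion expression vector extraction block m11:
    variable-length acoustic feature sequence -> fixed-length vector."""
    def __init__(self, feat_dim: int, d_model: int = 256, n_layers: int = 4, n_heads: int = 4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                               dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.attention = nn.Linear(d_model, 1)    # simplified self-attentive pooling weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (batch, frames, feat_dim)
        h = self.encoder(self.proj(x))                      # (batch, frames, d_model)
        w = torch.softmax(self.attention(h), dim=1)         # (batch, frames, 1)
        return (w * h).sum(dim=1)                           # (batch, d_model)

class EmotionRecognitionModel(nn.Module):
    """Emotion recognition model m1: blocks m11-1, m11-2 and m12 (illustrative layout)."""
    def __init__(self, feat_dim: int, n_emotions: int = 4, d_model: int = 256):
        super().__init__()
        self.extract_input = EmotionExpressionVectorExtractor(feat_dim, d_model)    # m11-1
        self.extract_normal = EmotionExpressionVectorExtractor(feat_dim, d_model)   # m11-2
        self.estimator = nn.Sequential(                                             # m12
            nn.Linear(2 * d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, n_emotions),
        )

    def forward(self, input_feats, normal_feats):
        e_input = self.extract_input(input_feats)
        e_normal = self.extract_normal(normal_feats)
        logits = self.estimator(torch.cat([e_input, e_normal], dim=-1))
        posterior = F.softmax(logits, dim=-1)   # posterior probability vector of each emotion
        return posterior, e_normal
```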

As shown in FIG. 5, the learning unit 12 updates the model parameters of all blocks based on a loss function L (L = αL_a + (1 - α)L_p) obtained as the weighted sum of the speaker identity loss function L_p for normal emotion utterances and the loss function L_a based on the error between the output of the emotion recognition model m1 and the correct emotion label (a loss function for minimizing the error between the correct emotion label of the input utterance and the emotion recognition result). Here, α is a manually determined weighting coefficient. The learning unit 12 updates the model parameters using stochastic gradient descent, as in the conventional technology. After updating the model parameters a certain number of times, the learning unit 12 outputs the finally obtained emotion recognition model m1.
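Written out for one mini-batch of N training examples, and consistent with the weighted sum above and with the mini-batch averaging of L_a described in the flowchart below, the loss can be expressed as follows, where L_a is the cross-entropy between the correct emotion label and the estimated posteriors:

```latex
L = \alpha \cdot \frac{1}{N} \sum_{n=1}^{N} L_a^{(n)} + (1 - \alpha)\, L_p ,
\qquad
L_a^{(n)} = - \sum_{c} y_c^{(n)} \log p_c^{(n)}
```

Here y^(n) is the one-hot correct emotion label of the n-th input utterance, p^(n) is the posterior probability vector output by the emotion recognition model m1, and α is the manually determined weighting coefficient.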

FIG. 6 is a diagram showing an example of the functional configuration of the emotion recognition device 10 during emotion recognition in an embodiment of the present invention. In FIG. 6, the same parts as in FIG. 3 are given the same reference numerals, and their explanations are omitted.

As shown in FIG. 6, during emotion recognition the emotion recognition device 10 has an emotion recognition unit 13 instead of the learning unit 12. The emotion recognition unit 13 is realized by processing in which one or more programs installed in the emotion recognition device 10 are executed by the processor 104. Furthermore, during emotion recognition the emotion recognition device 10 does not use the learning data storage unit 121. Note that different computers may be used for learning and for emotion recognition.

[Emotion Recognition Unit 13]
The emotion recognition unit 13 receives as input an acoustic feature sequence extracted by the acoustic feature extraction unit 11-1 from the input utterance of the person whose emotion is to be recognized and an acoustic feature sequence extracted by the acoustic feature extraction unit 11-2 from that person's normal emotion utterance, and outputs an emotion recognition result based on comparison with the normal emotion utterance (hereinafter simply referred to as the "emotion recognition result") by forward-propagating the two acoustic feature sequences through the trained emotion recognition model m1. The emotion recognition result, which is the output of the emotion recognition unit 13, includes the posterior probability vector of each emotion (the output of the forward propagation of the emotion recognition model m1) and the emotion class with the maximum posterior probability in that vector. The emotion class with the maximum posterior probability is used as the final emotion recognition result.
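A minimal sketch of this recognition step is shown below. It reuses the illustrative extract_acoustic_features function and model class from the earlier sketches, and the emotion class names are assumptions.

```python
import torch

def recognize_emotion(model, input_wav: str, normal_wav: str,
                      emotion_names=("normal", "anger", "joy", "sadness")):
    """Forward-propagate the two acoustic feature sequences through the trained
    model m1 and return the posterior vector and the maximum-posterior class."""
    model.eval()
    x_in = torch.tensor(extract_acoustic_features(input_wav), dtype=torch.float32).unsqueeze(0)
    x_norm = torch.tensor(extract_acoustic_features(normal_wav), dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():
        posterior, _ = model(x_in, x_norm)          # (1, n_emotions)
    best = int(posterior.argmax(dim=-1))            # emotion class with maximum posterior
    return posterior.squeeze(0).tolist(), emotion_names[best]
```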

The processing procedures executed by the emotion recognition device 10 are described below.

[During Learning]
FIG. 7 is a flowchart illustrating an example of the processing procedure that the emotion recognition device 10 executes during learning.

In step S101, the learning unit 12 randomly selects N pieces of learning data from the group of learning data stored in the learning data storage unit 121 to obtain one mini-batch (hereinafter referred to as the "target mini-batch"). More precisely, the learning unit 12 randomly reorders the learning data and then generates the mini-batch.
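A minimal sketch of this mini-batch construction, assuming the learning-data records are held in a Python list:

```python
import random

def make_minibatches(learning_data: list, batch_size: int):
    """Randomly reorder the learning data and split it into mini-batches of N records."""
    shuffled = learning_data[:]          # copy so the stored data is left untouched
    random.shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]
```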

The emotion recognition device 10 then executes a loop process including steps S102 to S104 for each of the N pieces of learning data included in the target mini-batch. The learning data being processed in this loop is hereinafter referred to as the "target learning data".

In step S102, the acoustic feature extraction unit 11-1 extracts an acoustic feature sequence from the input utterance of the target learning data, and the acoustic feature extraction unit 11-2 extracts an acoustic feature sequence from the normal emotion utterance of the target learning data.

 Next, the learning unit 12 inputs the two extracted acoustic feature sequences into the emotion recognition model m1 and obtains the recognition result output by the model (the posterior probability vector over emotions) as well as the emotion expression vector of the normal emotion utterance (S103). That is, in the process in which the emotion recognition model m1 outputs the recognition result, the emotion expression vector extraction block m11-2 computes the emotion expression vector of the normal emotion utterance, and the learning unit 12 obtains that vector. The learning unit 12 stores the obtained emotion expression vector in association with the speaker label of the target training data.
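 The publication gives no programming-level definition of the emotion recognition model m1, so the following PyTorch sketch is only an assumption-laden illustration: the GRU encoders, the concatenation of the two emotion expression vectors, and all dimensions and the four-class emotion set are assumptions. It merely shows how a forward pass could expose both the posterior probability vector and the emotion expression vector that block m11-2 computes for the normal emotion utterance.

```python
import torch
import torch.nn as nn

class EmotionRecognitionModel(nn.Module):
    """Hypothetical sketch of model m1: two emotion expression vector extraction
    blocks (m11-1 for the input utterance, m11-2 for the normal emotion
    utterance) followed by an emotion probability estimation block (m12)."""

    def __init__(self, feat_dim=80, emb_dim=128, n_emotions=4):
        super().__init__()
        self.block_m11_1 = nn.GRU(feat_dim, emb_dim, batch_first=True)  # input utterance
        self.block_m11_2 = nn.GRU(feat_dim, emb_dim, batch_first=True)  # neutral utterance
        self.block_m12 = nn.Sequential(                                  # probability estimation
            nn.Linear(2 * emb_dim, emb_dim), nn.ReLU(),
            nn.Linear(emb_dim, n_emotions))

    def forward(self, input_feats, neutral_feats):
        _, h_in = self.block_m11_1(input_feats)       # final hidden states
        _, h_neu = self.block_m11_2(neutral_feats)
        e_in, e_neu = h_in[-1], h_neu[-1]             # emotion expression vectors
        logits = self.block_m12(torch.cat([e_in, e_neu], dim=-1))
        posteriors = torch.softmax(logits, dim=-1)    # posterior probability vector
        return posteriors, e_neu                      # also expose m11-2's vector
```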

 Next, the learning unit 12 computes the loss function L_a (the cross-entropy between the correct emotion label of the input utterance and the posterior probability vector over emotions) from the correct emotion label of the target training data and the recognition result (the posterior probability vector over emotions), thereby obtaining a loss based on the error of the recognition result with respect to the correct emotion label (hereinafter, "loss L_a") (S104).
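 As a minimal sketch of L_a for one training example (the integer label encoding and the PyTorch API choice are assumptions):

```python
import torch
import torch.nn.functional as F

def loss_a(posteriors, label_index):
    """Cross-entropy between the correct emotion label and the posterior vector.

    `posteriors` is the model's posterior probability vector for one utterance;
    `label_index` is the integer index of the correct emotion class.
    """
    # F.nll_loss expects log-probabilities, so take the log of the posteriors.
    log_probs = torch.log(posteriors + 1e-12).unsqueeze(0)
    target = torch.tensor([label_index])
    return F.nll_loss(log_probs, target)
```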

 Once steps S102 to S104 have been executed for all the training data in the target mini-batch, the learning unit 12 computes the per-speaker average of the emotion expression vectors (of the normal emotion utterances) obtained for each item of training data in the target mini-batch in step S103 (S105).

 FIG. 8 is a diagram for explaining the computation of the per-speaker average of the emotion expression vectors of normal emotion utterances. In FIG. 8, e_ji denotes the emotion expression vector of the i-th normal emotion utterance of speaker j in the target mini-batch. The speaker of a given emotion expression vector can be identified from the speaker label associated with that vector in step S103. FIG. 8 shows an example in which emotion expression vectors of normal emotion utterances have been obtained for speakers whose speaker labels are 1, 2, and 3.

 For each set of emotion expression vectors that share the same (common) speaker label, the learning unit 12 computes the average of that set. The per-speaker average is referred to below as the "speaker average c_k" (where k is the speaker label). In the example of FIG. 8, the speaker labels are 1, 2, and 3, so the speaker averages c_1, c_2, and c_3 are computed.
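 A minimal sketch of step S105 (tensor shapes are assumptions): group the collected emotion expression vectors by speaker label and average each group.

```python
from collections import defaultdict
import torch

def speaker_averages(emotion_vectors, speaker_labels):
    """Return {speaker_label: mean emotion expression vector} for one mini-batch.

    `emotion_vectors` is a list of 1-D tensors e_ji and `speaker_labels` holds
    the speaker label j stored with each vector in step S103.
    """
    groups = defaultdict(list)
    for vec, spk in zip(emotion_vectors, speaker_labels):
        groups[spk].append(vec)
    return {spk: torch.stack(vecs).mean(dim=0)   # speaker average c_k
            for spk, vecs in groups.items()}
```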

 Next, the learning unit 12 computes, for each emotion expression vector e_ji, its distance S_ji,k to each speaker average c_k (S106).

 FIG. 9 is a diagram for explaining the computation of the distance S_ji,k between each emotion expression vector e_ji and each speaker average c_k.

 FIG. 9 shows a matrix in which each row corresponds to an e_ji and each column corresponds to a c_k. Each element S_ji,k of the matrix corresponds to the distance between e_ji and c_k. The learning unit 12 computes this distance based on, for example, the following formula.

 [Math. 1] (the formula defining S_ji,k is rendered as an image in the original publication)
 Here, w and b are learnable parameters, and they are updated based on the loss function at the same time as the parameters of the blocks (models) constituting the emotion recognition model m1.
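 Because the exact formula is available only as an image in the original publication, the sketch below assumes one common choice that is consistent with the surrounding description, namely a cosine similarity scaled by the learnable parameters w and b (as in generalized end-to-end speaker-verification losses); this is an assumption, not the disclosed formula.

```python
import torch
import torch.nn as nn

class ScaledSimilarity(nn.Module):
    """Distance/similarity S_ji,k with learnable scale w and bias b."""

    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.tensor(1.0))   # learnable, updated with the model
        self.b = nn.Parameter(torch.tensor(0.0))

    def forward(self, e, c):
        # e: (num_vectors, dim) emotion expression vectors e_ji
        # c: (num_speakers, dim) speaker averages c_k
        cos = torch.nn.functional.cosine_similarity(
            e.unsqueeze(1), c.unsqueeze(0), dim=-1)   # (num_vectors, num_speakers)
        return self.w * cos + self.b                   # matrix of S_ji,k values
```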

 Next, the learning unit 12 computes a loss based on these distances (hereinafter, "loss L_p") using the following loss function L_p (S107).

 [Math. 2] (the formula defining L_p is rendered as an image in the original publication)
 Here, t_ji,k = 1 when j = k (when e_ji and c_k belong to the same speaker) and t_ji,k = 0 when j ≠ k (when e_ji and c_k belong to different speakers), and learning is performed so that S_ji,k matches t_ji,k (so that L_p is minimized). In the matrix shown in FIG. 9, the shaded elements correspond to the elements for which j = k. In other words, L_p is a loss function that drives S_ji,k toward 1 when j = k (when the emotion expression vector and the speaker average belong to the same speaker) and toward 0 otherwise.
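 Since the formula for L_p is likewise given only as an image, the sketch below assumes one form consistent with the text: a binary cross-entropy that pushes each S_ji,k toward its target t_ji,k. The function signature and label bookkeeping are hypothetical.

```python
import torch
import torch.nn.functional as F

def loss_p(similarity_matrix, vector_speakers, average_speakers):
    """Loss driving S_ji,k toward 1 for same-speaker pairs and toward 0 otherwise.

    `similarity_matrix` is the (num_vectors, num_speakers) matrix of S_ji,k,
    `vector_speakers[i]` is the speaker label j of row i, and
    `average_speakers[k]` is the speaker label of column k.
    """
    targets = torch.tensor([[1.0 if j == k else 0.0 for k in average_speakers]
                            for j in vector_speakers])              # t_ji,k
    # Binary cross-entropy between sigmoid(S_ji,k) and t_ji,k.
    return F.binary_cross_entropy_with_logits(similarity_matrix, targets)
```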

 In other words, the loss function L_p is a loss function for keeping the emotion expression vector that the emotion recognition model m1 computes for the normal emotion utterance, in the course of outputting the recognition result, constant for each speaker (i.e., for minimizing the per-speaker error of the emotion expression vectors).

 Next, the learning unit 12 computes, as the loss L of the entire emotion recognition model m1, the weighted sum of the loss L_p based on the loss function L_p and the loss L_a based on the loss function L_a (S108). The loss L is computed, for example, as follows:
 L = αΣ(L_a/N) + (1 - α)L_p
 Here, α is a hyperparameter (weighting coefficient) for adjusting which is prioritized: the role of L_a ("reducing emotion recognition errors") or the role of L_p ("making the normal emotion vectors of the same speaker take similar values"). The larger the weight on L_a, the weaker the effect of L_p; the larger the weight on L_p, the more emotion recognition errors are tolerated to some extent.
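 Putting the pieces together, the weighted sum could be computed as in the sketch below; the default value of α is an assumption, since the publication leaves it as a tunable hyperparameter.

```python
def total_loss(losses_a, loss_p_value, alpha=0.5):
    """Weighted sum L = alpha * mean(L_a) + (1 - alpha) * L_p for one mini-batch.

    `losses_a` is the list of per-example losses L_a (length N) and
    `loss_p_value` is the mini-batch-level loss L_p.
    """
    mean_a = sum(losses_a) / len(losses_a)          # Σ(L_a / N)
    return alpha * mean_a + (1.0 - alpha) * loss_p_value
```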

 N is the number of training data items in the target mini-batch; that is, Σ(L_a/N) is the average of L_a over the target mini-batch.

 Next, the learning unit 12 uses error backpropagation based on L to simultaneously update the parameters of each block in the emotion recognition model m1 and the parameters of the loss function L_p (S109). By repeating steps S101 to S109, the parameters of these blocks are jointly optimized.
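 A minimal sketch of step S109, assuming PyTorch and reusing the hypothetical classes sketched above: the learnable parameters w and b used by L_p are registered in the same optimizer as the model, so that one backward pass updates everything at once. The optimizer choice and learning rate are assumptions.

```python
import itertools
import torch

model = EmotionRecognitionModel()          # hypothetical model sketched above
similarity = ScaledSimilarity()            # holds the learnable w and b
optimizer = torch.optim.Adam(
    itertools.chain(model.parameters(), similarity.parameters()), lr=1e-4)

def training_step(loss):
    """Backpropagate the total loss L and update all parameters at once (S109)."""
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```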

 Once steps S101 to S109 have been repeated a predetermined number of times (Yes in S110), the learning unit 12 ends the processing procedure of FIG. 7.

 [During Recognition]
 The processing that the emotion recognition device 10 executes at recognition time is as described with reference to FIG. 6. That is, the emotion recognition device 10 receives as input an utterance of the speaker whose emotion is to be recognized and that speaker's normal emotion utterance. The acoustic feature extraction unit 11-1 extracts an acoustic feature sequence from the utterance, and the acoustic feature extraction unit 11-2 extracts an acoustic feature sequence from the normal emotion utterance. The emotion recognition unit 13 inputs these two acoustic feature sequences into the trained emotion recognition model m1 and recognizes the speaker's emotion based on the per-emotion-label probabilities output by the emotion recognition model m1.
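 The recognition path could be sketched as follows, reusing the hypothetical helpers above; the emotion class names are illustrative assumptions.

```python
import torch

EMOTIONS = ["neutral", "happy", "sad", "angry"]   # illustrative label set

def recognize_emotion(model, input_wav, neutral_wav):
    """Recognize the speaker's emotion from an input utterance and a normal
    emotion utterance of the same speaker, using the trained model m1."""
    x_in = torch.tensor(extract_acoustic_features(input_wav)).float().unsqueeze(0)
    x_neu = torch.tensor(extract_acoustic_features(neutral_wav)).float().unsqueeze(0)
    with torch.no_grad():
        posteriors, _ = model(x_in, x_neu)        # posterior probability vector
    best = int(posteriors.argmax(dim=-1))         # class with maximum posterior
    return EMOTIONS[best], posteriors.squeeze(0)
```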

 As described above, according to the present embodiment, the emotion expression vector extraction block m11-2 is trained, based on L_p, so that it becomes more likely to output similar (relatively close) emotion expression vectors for multiple normal emotion utterances of the same speaker. Including the weighted L_p term in L alleviates the problem of the conventional technique, in which there is no guarantee that normal emotion utterances of the same speaker yield similar emotion expression vectors (i.e., the recognition result often changes when the normal emotion utterance changes). That is, in emotion recognition that uses normal emotion utterances, the phenomenon of the estimation result being specialized to a particular combination of input utterance and normal emotion utterance can be eliminated. This increases the likelihood that the same recognition result is obtained even for different normal emotion utterances, which contributes to improving the accuracy of recognizing the speaker's emotion from an utterance.

 In the present embodiment, the loss function L_a is an example of a first loss function, and the loss function L_p is an example of a second loss function.

 Although an embodiment of the present invention has been described in detail above, the present invention is not limited to this specific embodiment, and various modifications and changes are possible within the scope of the gist of the present invention as set forth in the claims.

10      Emotion recognition device
11-1    Acoustic feature extraction unit
11-2    Acoustic feature extraction unit
12      Learning unit
13      Emotion recognition unit
100     Drive device
101     Recording medium
102     Auxiliary storage device
103     Memory device
104     Processor
105     Interface device
121     Learning data storage unit
B       Bus
m1      Emotion recognition model
m11     Emotion expression vector extraction block
m11-1   Emotion expression vector extraction block
m11-2   Emotion expression vector extraction block
m12     Emotion probability estimation block

Claims (8)

 An emotion recognition learning method characterized in that a computer executes:
 a learning procedure of learning a model based on training data that includes, for a plurality of speakers, a plurality of input utterances by a speaker, a plurality of normal emotion utterances corresponding to the respective input utterances, and a correct emotion label corresponding to each of the input utterances, the model being learned based on a first loss function for minimizing an error, with respect to the correct label corresponding to an input utterance, of the emotion recognition result output by the model into which that input utterance and the normal emotion utterance corresponding to it have been input, and a second loss function for making a vector representing the nature of the emotional expression, which is computed for the normal emotion utterance in the process in which the model outputs the recognition result, constant for each speaker.
 The emotion recognition learning method according to claim 1, characterized in that the second loss function is based on the distance between the average, over the normal emotion utterances corresponding to the same speaker, of the vectors computed by the model and each of those vectors.
 The emotion recognition learning method according to claim 1 or 2, characterized in that the learning procedure learns the model based on a weighted sum of the first loss function and the second loss function.
 An emotion recognition method characterized in that a computer executes:
 an emotion recognition procedure of recognizing the emotion of the speaker of an input utterance and a normal emotion utterance, using a model learned by the emotion recognition learning method according to claim 1.
 An emotion recognition learning device characterized by comprising:
 a learning unit configured to learn a model based on training data that includes, for a plurality of speakers, a plurality of input utterances by a speaker, a plurality of normal emotion utterances corresponding to the respective input utterances, and a correct emotion label corresponding to each of the input utterances, the model being learned based on a first loss function for minimizing an error, with respect to the correct label corresponding to an input utterance, of the emotion recognition result output by the model into which that input utterance and the normal emotion utterance corresponding to it have been input, and a second loss function for making a vector representing the nature of the emotional expression, which is computed for the normal emotion utterance in the process in which the model outputs the recognition result, constant for each speaker.
 An emotion recognition device characterized by comprising:
 an emotion recognition unit configured to recognize the emotion of the speaker of an input utterance and a normal emotion utterance, using a model learned by the emotion recognition learning method according to claim 1.
 A program characterized by causing a computer to execute the emotion recognition learning method according to claim 1.
 A program characterized by causing a computer to execute the emotion recognition method according to claim 4.
PCT/JP2022/045706 2022-12-12 2022-12-12 Emotion recognition learning method, emotion recognition method, emotion recognition learning device, emotion recognition device, and program Ceased WO2024127472A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2024563791A JPWO2024127472A1 (en) 2022-12-12 2022-12-12
PCT/JP2022/045706 WO2024127472A1 (en) 2022-12-12 2022-12-12 Emotion recognition learning method, emotion recognition method, emotion recognition learning device, emotion recognition device, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/045706 WO2024127472A1 (en) 2022-12-12 2022-12-12 Emotion recognition learning method, emotion recognition method, emotion recognition learning device, emotion recognition device, and program

Publications (1)

Publication Number Publication Date
WO2024127472A1 (en)

Family

ID=91484498

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/045706 Ceased WO2024127472A1 (en) 2022-12-12 2022-12-12 Emotion recognition learning method, emotion recognition method, emotion recognition learning device, emotion recognition device, and program

Country Status (2)

Country Link
JP (1) JPWO2024127472A1 (en)
WO (1) WO2024127472A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2020126125A (en) * 2019-02-04 2020-08-20 富士通株式会社 Voice processing program, voice processing method, and voice processing apparatus
WO2021171552A1 (en) * 2020-02-28 2021-09-02 日本電信電話株式会社 Emotion recognition device, emotion recognition model learning device, method for same, and program

Also Published As

Publication number Publication date
JPWO2024127472A1 (en) 2024-06-20

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22968384

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2024563791

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22968384

Country of ref document: EP

Kind code of ref document: A1