
WO2025158547A1 - Learning device, inference device, learning method, inference method, and program - Google Patents

Learning device, inference device, learning method, inference method, and program

Info

Publication number
WO2025158547A1
WO2025158547A1 (application PCT/JP2024/001906)
Authority
WO
WIPO (PCT)
Prior art keywords
speech
sentence
unit
learning
parameter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/JP2024/001906
Other languages
English (en)
Japanese (ja)
Inventor
厚志 安藤
岳至 森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NTT Inc
NTT Inc USA
Original Assignee
Nippon Telegraph and Telephone Corp
NTT Inc USA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp, NTT Inc USA filed Critical Nippon Telegraph and Telephone Corp
Priority to PCT/JP2024/001906 priority Critical patent/WO2025158547A1/fr
Publication of WO2025158547A1 publication Critical patent/WO2025158547A1/fr
Pending legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Definitions

  • This disclosure relates to a learning device, an inference device, a learning method, an inference method, and a program.
  • Speech contains three types of information: linguistic information, non-linguistic information, and paralinguistic information (hereinafter, these three types of information will be collectively referred to as "speech information") (Non-Patent Document 1). Furthermore, technology for recognizing non-linguistic and paralinguistic information from speech is known (Non-Patent Document 2).
  • As described in Non-Patent Document 3, there is a known technology called image understanding technology that outputs, in natural language, various information contained in images.
  • By using a speech encoder instead of the image encoder used in image understanding technology, it is thought possible to realize a technology that outputs the speech information contained in speech in natural language (hereinafter also referred to as "speech understanding technology").
  • However, due to differences in the configurations of image encoders and speech encoders, and differences in the properties of images and speech, it is difficult to realize speech understanding technology simply by using a speech encoder in place of an image encoder.
  • a learning device includes an input unit that inputs learning data including speech, a first sentence related to the speech, and a second sentence corresponding to the first sentence; a speech feature generation unit that generates information representing features of the speech for each predetermined time interval based on a speech feature extractor configured with multiple layers; a first integration unit that generates first integrated information for each time interval based on first parameters, integrating the information representing the features generated by each of the predetermined multiple layers of the speech feature extractor; a second integration unit that generates second integrated information for each time interval based on second parameters, integrating the first integrated information in the time direction; a calculation unit that calculates the generation probability of a third sentence corresponding to the first sentence based on the first sentence, the second integrated information, and a language model; and a learning unit that learns learning target parameters including the first parameter and the second parameter based on the generation probability of the third sentence and the second sentence.
  • FIG. 1 is a diagram illustrating an example of a speech understanding model.
  • FIG. 2 is a diagram illustrating an example of a speech encoder and a speech encoder output integration block.
  • FIG. 3 is a diagram illustrating an example of a temporal information integration block.
  • FIG. 4 is a diagram illustrating an example of a hardware configuration of a speech understanding device during learning.
  • FIG. 5 is a diagram illustrating an example of a functional configuration of a speech understanding device during learning.
  • FIG. 6 is a diagram illustrating an example of a training data set.
  • FIG. 7 is a diagram illustrating an example of a detailed functional configuration of a model learning unit.
  • FIG. 8 is a flowchart illustrating an example of a model construction process.
  • FIG. 9 is a flowchart illustrating an example of a model learning process.
  • FIG. 10 is a diagram illustrating an example of a functional configuration of a speech understanding device during inference.
  • FIG. 11 is a diagram illustrating an example of a detailed functional configuration of an output sentence generation unit.
  • FIG. 12 is a flowchart illustrating an example of an output sentence generation process.
  • <Non-verbal and paralinguistic information recognition technology> It is known that speech contains speech information (i.e., linguistic information, non-linguistic information, and paralinguistic information) (Non-Patent Document 1).
  • linguistic information refers to information about the words spoken by a speaker.
  • Non-linguistic information refers to information that is not linguistic but cannot be changed at will (e.g., information that represents the speaker's identity, gender, emotions, etc.).
  • Paralinguistic information refers to information that is not linguistic but can be changed at will (e.g., information that represents intentions, attitudes, etc.).
  • Non-Patent Document 2 uses a statistical model based on deep learning to estimate which emotional state, such as anger, joy, or sadness, the speech most closely matches.
  • However, conventional technologies are unable to recognize fine-grained non-verbal and paralinguistic information. For example, they are unable to estimate emotional states such as "irritated," which are not predefined, or "angry and sad," which span multiple emotional states. This results in reduced processing accuracy in downstream systems that use the results of non-verbal and paralinguistic information recognition (for example, the accuracy of call analysis in contact center systems, the accuracy of dialogue control and analysis in voice dialogue systems, etc.).
  • <Image understanding technology> There is a known image understanding technology (Non-Patent Document 3) that outputs various information contained in images in natural language, that is, a language used by humans for communication, such as Japanese, English, or Chinese.
  • Image understanding technology is composed of a deep learning model that combines a large-scale language model that acquires relationships and co-occurrences between words using a large amount of text data with an image encoder that extracts information about objects in the image from the image.
  • When this deep learning model is given an input image and a natural language question about the input image (e.g., a sentence such as "What do you think about the logo in this image?"), it outputs an output sentence corresponding to the question (e.g., a sentence such as "This logo is a simple and symbolic logo").
  • Deep learning models that realize image understanding technology are trained on sets of an input image, a natural language question about that input image, and a correct output sentence corresponding to that question, and can thereby infer various pieces of information contained in an image in natural language. For example, in response to the question "What color is the logo in this image?", a sentence such as "It's pink" can be output as the output sentence.
  • By realizing speech understanding technology that can output the speech information contained in speech in natural sentences, it becomes possible to recognize a variety of speech information, including detailed non-linguistic and paralinguistic information, and to output that speech information in natural sentences.
  • a simple way to realize speech understanding technology would be to use a speech encoder that extracts information representing the characteristics of speech from speech, instead of the image encoder used in image understanding technology.
  • the first reason is the difference in the structure of image encoders and speech encoders.
  • Existing speech encoders (e.g., wav2vec2.0 (Reference 1), WavLM (Reference 2), etc.) are composed of multiple layers, and the information extracted differs from layer to layer.
  • Reference 2 suggests that information closer to physical properties, such as speaker information, is extracted in the lower layers of the speech encoder, while information closer to abstract properties, such as phonology, is extracted in the higher layers of the speech encoder.
  • For speech recognition, for example, the information extracted in the higher layers of the speech encoder must be used; otherwise, it is thought that accurate output in natural sentences will be difficult.
  • the second reason is the difference in the nature of images and audio.
  • speech has a variable length, so it is thought that processing in the time direction is required in order to recognize the speech information contained in speech of any length and output that speech information in natural language.
  • <Speech understanding model> Therefore, we propose a deep learning model (hereinafter referred to as a "speech understanding model") that can solve the problems caused by the above two causes.
  • This speech understanding model makes it possible to recognize diverse speech information, including detailed non-verbal and paralinguistic information, from speech of any length and output the speech information in natural sentences.
  • That is, it becomes possible to realize a speech understanding technology that outputs, in natural sentences, diverse speech information contained in speech of any length (more specifically, speech information related to at least one of the physical and abstract properties of speech). This can be expected to improve the processing accuracy of downstream systems that use the recognition results of non-verbal and paralinguistic information (e.g., the accuracy of call analysis in contact center systems, the accuracy of dialogue control and analysis in voice dialogue systems, etc.).
  • FIG. 1 is a diagram showing an example of the speech understanding model 1000.
  • the speech understanding model 1000 is composed of a speech encoder 1100, a speech encoder output integration block 1200, a temporal information integration block 1300, a linear transformation layer 1400, and a large-scale language model 1500.
  • the speech encoder 1100 is any existing speech encoder (for example, wav2vec2.0, WavLM, etc.).
  • the speech encoder 1100 receives speech (hereinafter also referred to as "input speech") as input and outputs information representing the characteristics of that speech.
  • in each predetermined time interval, the speech encoder 1100 receives the input speech for that time interval and outputs information representing the characteristics of the speech for that time interval.
  • the speech encoder output integration block 1200 integrates, in each time interval, the outputs of multiple pre-specified layers from among the outputs of each layer of the speech encoder 1100.
  • hereinafter, each of the multiple pre-specified layers will be referred to as a "layer to be integrated."
  • each layer to be integrated has the same number of output dimensions.
  • the layers to be integrated are specified by the user, etc., from among the layers of the speech encoder 1100 that have the same number of output dimensions.
  • T is an index representing the last time interval of the input speech, and its value may vary depending on the length of the input speech.
  • each h_n(t) represents some feature of the input speech (e.g., a physical property or an abstract property), and each h_n(t) can be described by a vector with a predetermined number of dimensions, so hereinafter each h_n(t) will be referred to as a "first speech feature vector."
  • the speech encoder output integration block 1200 receives, in each time interval t, each first speech feature vector h_n(t) in that time interval t as input, and outputs a vector obtained by integrating each first speech feature vector h_n(t).
  • the vector obtained by integrating each first speech feature vector h_n(t) will be referred to as the "first integrated vector" and represented by e(t).
  • the speech encoder 1100 is composed of one convolutional layer and N transformer layers, and these N transformer layers are the layers to be integrated.
  • the output of the n-th transformer layer in time interval t is the first speech feature vector h_n(t).
  • the speech encoder output integration block 1200 integrates these first speech feature vectors h_1(t), ..., h_N(t) to create the first integrated vector e(t).
  • the first speech feature vectors h_n(t) can be integrated by, for example, a weighted sum or a linear transformation sum.
  • W_1, ..., W_N, b_1, ..., b_N are also called linear transformation coefficients and are training target parameters.
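  • As an illustration only (module names, dimensions, and the softmax normalization of the layer weights are assumptions, not taken from this publication), the speech encoder output integration block 1200 could be sketched in PyTorch roughly as follows, with the weighted sum on one branch and a linear transformation sum of the form e(t) = (W_1 h_1(t) + b_1) + ... + (W_N h_N(t) + b_N) on the other:

        import torch
        import torch.nn as nn

        class LayerOutputIntegration(nn.Module):
            # Integrates the outputs h_1(t), ..., h_N(t) of the N layers to be integrated
            # into one first integrated vector e(t) per time interval.
            def __init__(self, num_layers, dim, mode="weighted_sum"):
                super().__init__()
                self.mode = mode
                if mode == "weighted_sum":
                    # one scalar weight per layer (training target parameters)
                    self.weights = nn.Parameter(torch.zeros(num_layers))
                else:  # "linear_sum": one linear transform (W_n, b_n) per layer
                    self.linears = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])

            def forward(self, h):  # h: (N, T, dim) stacked outputs of the layers to be integrated
                if self.mode == "weighted_sum":
                    a = torch.softmax(self.weights, dim=0)    # normalized layer weights (assumption)
                    return (a[:, None, None] * h).sum(dim=0)  # e(1), ..., e(T) as a (T, dim) tensor
                return sum(lin(h[n]) for n, lin in enumerate(self.linears))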
  • the temporal information integration block 1300 integrates the first integrated vector e(t) in the time direction. That is, the temporal information integration block 1300 receives the first integrated vector e(t) for each time interval t as input and outputs a vector obtained by integrating each first integrated vector e(t) in the time direction. In this case, it is not sufficient to simply integrate the first integrated vector e(t) for all time intervals t; it is necessary to consider which parts of the time period should be emphasized and which parts should not. For example, when recognizing a speaker's emotions or speaker information from speech, it is necessary to ignore time periods representing short pauses in the speech or time periods where breathing occurs, and to emphasize the time periods during which the speaker is speaking.
  • the temporal information integration block 1300 integrates each first integrated vector e(t) using a weighted sum.
  • the vector obtained by integrating each first integrated vector e(t) in the time direction will be referred to as the "second integrated vector" and represented by v.
  • the time information integration block 1300 creates a second integrated vector v by integrating these first integrated vectors e(1), ..., e(T) in the time direction.
  • a self-attentive pooling layer can be used as a method for integrating each of the first integrated vectors e(1), ..., e(T).
  • E = [e(1), ..., e(T)]^⊤.
  • a = softmax(ReLU(W_1'E)W_2').
  • W_1' and W_2' are training target parameters.
  • ⊤ is a symbol representing transposition.
  • using the weights a, the second integrated vector v is obtained by integrating each of the first integrated vectors e(1), ..., e(T) by a weighted sum.
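  • For illustration (the hidden dimension and the exact orientation of W_1' and W_2' are assumptions, since only the form of the formula is given above), a self-attentive pooling layer that computes v as a weighted sum of e(1), ..., e(T) might look like this:

        import torch
        import torch.nn as nn

        class SelfAttentivePooling(nn.Module):
            def __init__(self, dim, hidden=128):
                super().__init__()
                self.w1 = nn.Linear(dim, hidden)   # plays the role of W_1'
                self.w2 = nn.Linear(hidden, 1)     # plays the role of W_2'

            def forward(self, e):                  # e: (T, dim) = [e(1), ..., e(T)]
                scores = self.w2(torch.relu(self.w1(e)))  # one score per time interval
                a = torch.softmax(scores, dim=0)          # attention weight for each e(t)
                return (a * e).sum(dim=0)                 # second integrated vector v: (dim,)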
  • the time information integration block 1300 may integrate each of the first integrated vectors e(1), ..., e(T) in the time direction using a convolutional neural network.
  • the learnable parameters of the one-dimensional convolutional neural network are the parameters to be learned.
  • in this case, the time information integration block 1300 outputs second integrated vectors v(1), ..., v(K), where K is an integer equal to or greater than 1 determined by the window size of the one-dimensional convolutional neural network and the sequence length T of the first integrated vectors e(1), ..., e(T).
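  • A minimal sketch of the one-dimensional convolutional variant (the kernel size, stride, and 256-dimensional vectors are assumed values chosen only to make the shapes concrete):

        import torch
        import torch.nn as nn

        e = torch.randn(100, 256)                        # T = 100 first integrated vectors e(1), ..., e(T)
        conv = nn.Conv1d(in_channels=256, out_channels=256, kernel_size=5, stride=5)
        v_seq = conv(e.t().unsqueeze(0)).squeeze(0).t()  # second integrated vectors v(1), ..., v(K)
        print(v_seq.shape)                               # here K = (100 - 5) // 5 + 1 = 20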
  • the linear transformation layer 1400 linearly transforms the second integrated vector v. That is, the linear transformation layer 1400 takes the second integrated vector v as input and outputs a vector obtained by linearly transforming this second integrated vector v.
  • the vector obtained by linearly transforming the second integrated vector v will be referred to as the "second speech feature vector" and represented by w.
  • W and b are also called linear transformation coefficients and are training target parameters. Note that if the number of dimensions of the token embedding space in the large-scale language model 1500 is M, then the number of dimensions of the second speech feature vector w is also M. This means that the linear transformation layer 1400 creates a second speech feature vector sequence with a length of 1 and M dimensions.
  • when the time information integration block 1300 outputs second integrated vectors v(1), ..., v(K), the K vectors obtained by linearly transforming v(1), ..., v(K) can be set as second speech feature vectors w(1), ..., w(K). This means that a second speech feature vector sequence with a length of K and a number of dimensions of M is created.
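  • A brief sketch of the linear transformation layer 1400 (M = 4096 is an assumed token-embedding dimension, and the 256-dimensional second integrated vector is likewise illustrative):

        import torch
        import torch.nn as nn

        M = 4096                  # token-embedding dimension of the large-scale language model (assumed)
        proj = nn.Linear(256, M)  # W and b: training target parameters of the linear transformation layer
        v = torch.randn(256)      # a second integrated vector (illustrative)
        w = proj(v)               # second speech feature vector: a length-1, M-dimensional sequence
        # applying proj to v(1), ..., v(K) instead yields the sequence w(1), ..., w(K)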
  • Large-scale language model 1500 is any existing large-scale language model (LLM).
  • Large-scale language model 1500 takes as input an input sentence, which is a natural language question about the input speech, and a second speech feature vector w (or second speech feature vector w(1), ..., w(K)), and outputs an output sentence corresponding to the question.
  • Such an output sentence is generated according to the posterior probability given the input sentence and the second speech feature vector w (or second speech feature vector w(1), ..., w(K)).
  • the posterior probability is calculated by processing a vector sequence combining an embedding vector sequence representing the embedded representation of the tokens that make up the input sentence and the second speech feature vector w (or second speech feature vector w(1), ..., w(K)).
  • FIG. 1 shows a case where, when an input sentence of "Please tell me the emotional state of the person in the following speech," is given, an output sentence of "This person is male and is slightly irritated” is generated.
  • For example, Llama 2 7B (Reference 3) can be used as the large-scale language model 1500.
  • a language model other than a large-scale language model may also be used as the large-scale language model 1500, as long as it is a language model that can generate an output sentence according to the posterior probability when given an input sentence and a second speech feature vector w (or second speech feature vectors w(1), ..., w(K)).
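  • The following sketch shows, with Hugging Face Transformers, how an embedding vector sequence for the input sentence can be combined with the second speech feature vector w and fed to a causal language model to obtain the posterior over the next token. The checkpoint name, the random stand-in for w, and placing w after the prompt embeddings are assumptions for illustration, not requirements of this publication:

        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        name = "meta-llama/Llama-2-7b-hf"                 # assumed checkpoint (cf. Reference 3)
        tok = AutoTokenizer.from_pretrained(name)
        llm = AutoModelForCausalLM.from_pretrained(name)

        prompt = "Please tell me the emotional state of the person in the following speech."
        ids = tok(prompt, return_tensors="pt").input_ids  # (1, L) token ids of the input sentence
        tok_emb = llm.get_input_embeddings()(ids)         # (1, L, M) embedding vector sequence
        w = torch.randn(1, 1, llm.config.hidden_size)     # second speech feature vector (stand-in)

        inputs = torch.cat([tok_emb, w], dim=1)           # combined vector sequence
        logits = llm(inputs_embeds=inputs).logits         # scores for the next token
        p_s1 = torch.softmax(logits[0, -1], dim=-1)       # posterior p(s_1) over the token vocabulary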
  • the speech encoder 1100 may be referred to as, for example, a "speech feature extractor” or “speech coder.”
  • the large-scale language model 1500 may be referred to as, for example, a "language model,” a "natural language model,” or a “natural language processing model.”
  • the components that make up the speech understanding model 1000 may be referred to as, for example, a "module.”
  • the speech understanding device 10 undergoes a "model construction" phase during which the speech understanding model 1000 is constructed, a “learning” phase during which parameters to be learned for the speech understanding model 1000 are learned, and an “inference” phase during which output sentences are generated by the speech understanding model 1000 using the learned parameters.
  • the speech understanding device 10 is provided with a collection of training data (hereinafter also referred to as a "training dataset") represented by pairs of input speech, input sentences which are questions in natural language related to the input speech, and correct output sentences corresponding to the questions.
  • the speech understanding device 10 is provided with test data represented by pairs of input speech and input sentences which are questions in natural language related to the input speech.
  • the speech understanding device 10 during model construction may be referred to as, for example, a "model construction device” or a "model creation device.”
  • the speech understanding device 10 during learning may be referred to as, for example, a “learning device,” a “parameter estimation device,” a “parameter optimization device,” etc.
  • the speech understanding device 10 during inference may be referred to as, for example, an “inference device,” an “estimation device,” a "natural sentence generation device,” etc.
  • model construction is considered to be included in learning, and a case will be described in which the speech understanding device 10 also constructs the speech understanding model 1000 during learning.
  • Fig. 4 is a diagram showing an example of the hardware configuration of the speech understanding device 10 during learning.
  • the speech understanding device 10 during learning has an input device 101, a display device 102, an external I/F 103, a communication I/F 104, a RAM (Random Access Memory) 105, a ROM (Read Only Memory) 106, an auxiliary storage device 107, and a processor 108.
  • Each of these pieces of hardware is connected to the others via a bus 109 so that they can communicate with each other.
  • the input device 101 is, for example, a keyboard, a mouse, a touch panel, a physical button, etc.
  • the display device 102 is, for example, a display, a display panel, etc. Note that the speech understanding device 10 does not necessarily have to have at least one of the input device 101 and the display device 102, for example.
  • the external I/F 103 is an interface with external devices such as a recording medium 103a.
  • recording media 103a include a CD (Compact Disc), a DVD (Digital Versatile Disk), an SD memory card (Secure Digital memory card), and a USB (Universal Serial Bus) memory card.
  • the communication I/F 104 is an interface for connecting to a communication network.
  • the RAM 105 is a volatile semiconductor memory (storage device) that temporarily stores programs and data.
  • the ROM 106 is a non-volatile semiconductor memory (storage device) that can store programs and data even when the power is turned off.
  • the auxiliary storage device 107 is a non-volatile storage device such as an HDD (Hard Disk Drive), SSD (Solid State Drive), or flash memory.
  • the processor 108 is any of various arithmetic devices, such as a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit).
  • the hardware configuration shown in FIG. 4 is an example, and the hardware configuration of the speech understanding device 10 is not limited to this.
  • the speech understanding device 10 may have multiple auxiliary storage devices 107 or multiple processors 108, may not have some of the hardware shown in the figure, or may have various hardware other than the hardware shown in the figure.
  • Fig. 5 is a diagram showing an example of the functional configuration of the speech understanding device 10 during learning.
  • the speech understanding device 10 during learning has a model construction unit 201 and a model learning unit 202. These units are realized, for example, by processing in which one or more programs installed in the speech understanding device 10 are executed by the processor 108 or the like.
  • the speech understanding device 10 during learning also has a trained speech encoder storage unit 203, a trained large-scale language model storage unit 204, a speech understanding model storage unit 205, and a training dataset storage unit 206.
  • Each of these storage units is realized, for example, by a storage area such as the auxiliary storage device 107. However, at least one of these storage units may also be realized by a storage area of a storage device (for example, a storage device included in a database server, etc.) that is communicatively connected to the speech understanding device 10.
  • the model construction unit 201 constructs the speech understanding model 1000 shown in FIG. 1 using the trained speech encoder stored in the trained speech encoder storage unit 203 and the trained large-scale language model stored in the trained large-scale language model storage unit 204. That is, the model construction unit 201 constructs the speech understanding model 1000 shown in FIG. 1 using the trained speech encoder as the speech encoder 1100 and the trained large-scale language model as the large-scale language model 1500.
  • the model construction unit 201 initializes the training target parameters of the speech encoder output integration block 1200, the training target parameters of the time information integration block 1300, and the training target parameters of the linear transformation layer 1400.
  • the training target parameters may be initialized using any method, such as random initialization or sampling from a predetermined distribution.
  • a trained speech encoder is a speech encoder whose parameters have been trained.
  • a trained large-scale language model is a large-scale language model whose parameters have been trained.
  • model construction unit 201 stores the speech understanding model 1000 in the speech understanding model storage unit 205.
  • the model training unit 202 uses the training dataset stored in the training dataset storage unit 206 to train the speech understanding model 1000 stored in the speech understanding model storage unit 205. At this time, the model training unit 202 trains the training target parameters of the speech encoder output integration block 1200, the training target parameters of the temporal information integration block 1300, and the training target parameters of the linear transformation layer 1400, while keeping the parameters of the speech encoder 1100 and large-scale language model 1500 fixed. More specifically, the model training unit 202 uses, as a loss function, the cross entropy between the output sentence generated by the speech understanding model 1000 when given the input speech and input sentence contained in the training data and the correct output sentence contained in the training data, and trains the training target parameters using an existing optimization method so as to minimize the loss function. A detailed example of the functional configuration of the model training unit 202 will be described later.
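  • A condensed training-loop sketch of the above (speech_encoder, output_integration, temporal_integration, linear_layer, llm, and train_loader are placeholder names, and the exact alignment of prompt, speech feature vector, and teacher-forced target tokens is an assumption): only the two integration blocks and the linear layer are optimized, the speech encoder and the large-scale language model are kept fixed, and the loss is the cross entropy between the predicted token posteriors and the correct output sentence:

        import itertools
        import torch
        import torch.nn.functional as F

        for p in itertools.chain(speech_encoder.parameters(), llm.parameters()):
            p.requires_grad_(False)                                  # parameters kept fixed

        trainable = itertools.chain(output_integration.parameters(),
                                    temporal_integration.parameters(),
                                    linear_layer.parameters())       # training target parameters
        optimizer = torch.optim.SGD(trainable, lr=1e-3)              # online optimization based on SGD

        for speech, input_ids, target_ids in train_loader:           # one piece of training data
            h = speech_encoder(speech)                               # (N, T, dim) layer outputs
            e = output_integration(h)                                # (T, dim) first integrated vectors
            v = temporal_integration(e)                              # second integrated vector
            w = linear_layer(v).view(1, 1, -1)                       # second speech feature vector
            prompt_emb = llm.get_input_embeddings()(input_ids)       # input-sentence embeddings
            target_emb = llm.get_input_embeddings()(target_ids[:, :-1])  # teacher forcing
            logits = llm(inputs_embeds=torch.cat([prompt_emb, w, target_emb], dim=1)).logits
            num_targets = target_ids.size(1)
            loss = F.cross_entropy(logits[:, -num_targets:].transpose(1, 2), target_ids)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()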
  • the trained speech encoder storage unit 203 stores a trained speech encoder.
  • the trained large-scale language model storage unit 204 stores a trained large-scale language model.
  • the speech understanding model storage unit 205 stores the speech understanding model 1000 constructed by the model construction unit 201.
  • the training dataset storage unit 206 stores a given training dataset.
  • Fig. 6 is a diagram showing an example of the training data set.
  • a training dataset is made up of one or more training data, each of which includes input speech, an input sentence, and a correct output sentence.
  • a training dataset is made up of a large number of training data.
  • the input speech is speech data input to the speech understanding model 1000.
  • the input sentence is text data that represents a question in natural language related to the input speech.
  • the correct output sentence is text data that represents a response or answer in natural language that is the correct answer to the question represented by the input sentence.
  • the input speech does not necessarily have to be audio data that records a human voice, but may also be audio data that records any sound.
  • the correct output sentence may also be called, for example, "teaching data.”
  • the training data in the second line of the example shown in FIG. 6 includes the input speech "Speech A," the input sentence "Please tell me how this speech is spoken," and the correct output sentence "A woman is speaking quickly and loudly."
  • the training data in the third line of the example shown in FIG. 6 includes the input speech "Speech B," the input sentence "Please tell me how this speech is spoken," and the correct output sentence "A man speaks slowly and calmly."
  • the training data in the fourth line of the example shown in FIG. 6 includes the input speech "Speech B," the input sentence "Please tell me the gender of the speaker of this speech," and the correct output sentence "It is a man."
  • the training data in the fifth line of the example shown in FIG. 6 includes the input speech "Speech C," the input sentence "What is the emotion of the speaker of this speech?", and the correct output sentence "This speaker seems a little irritated."
  • the training dataset is composed of training data represented by pairs of input speech, input sentences, and correct output sentences.
  • the training dataset may contain multiple pieces of training data that have different input sentences and correct output sentences for the same input speech.
  • Fig. 7 is a diagram showing an example of a detailed functional configuration of the model learning unit 202.
  • the model learning unit 202 includes a learning data input unit 211, a speech encoding unit 212, a first integration unit 213, a second integration unit 214, a linear transformation unit 215, a posterior probability calculation unit 216, a parameter update unit 217, and an end determination unit 218.
  • the learning data input unit 211 inputs one piece of learning data from the learning dataset stored in the learning dataset storage unit 206.
  • the speech encoding unit 212 is realized by the speech encoder 1100 included in the speech understanding model 1000.
  • the second integration unit 214 is realized by the time information integration block 1300 included in the speech understanding model 1000.
  • the linear transformation unit 215 is realized by the linear transformation layer 1400 included in the speech understanding model 1000.
  • the linear transformation unit 215 receives the second integrated vector v as input and outputs the second speech feature vector w obtained by linearly transforming the second integrated vector v.
  • the posterior probability calculation unit 216 is realized by the large-scale language model 1500 included in the speech understanding model 1000.
  • the posterior probability calculation unit 216 receives as input an input sentence included in the training data input by the training data input unit 211 and a second speech feature vector w, and calculates the posterior probability of an output sentence when the input sentence and the second speech feature vector w are given.
  • I is the number of tokens included in the output sentence (i.e., the length of the output sentence), and may be, for example, the length of the correct output sentence included in the training data input by the training data input unit 211.
  • the posterior probability p(s_i) is expressed as an M-dimensional vector in which, for example, when the number of dimensions of the token embedding space is M, the m-th element represents the probability that the m-th type of token is generated, and the sum of the values of all elements is 1.
  • a token is the basic processing unit used when a language model, such as a large-scale language model, processes a string of characters.
  • a typical example of a token is a word, but tokens are not limited to words and can also be, for example, characters, morphemes, subwords, or coherent strings of characters.
  • the parameter update unit 217 uses the posterior probabilities calculated by the posterior probability calculation unit 216 and the correct output sentences included in the learning data input by the learning data input unit 211 to learn the learning target parameters of the speech understanding model 1000 using an existing optimization method.
  • the i-th token constituting the correct output sentence is denoted by s_i' (where s_1' is a token representing the beginning of the sentence, and s_I' is a token representing the end of the sentence).
  • the probability that token s_i' is generated is denoted by p(s_i').
  • the probability p(s_i') is expressed as an M-dimensional vector in which only the element corresponding to token s_i' has a value of 1 and the other elements have a value of 0.
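  • As a worked illustration (not spelled out explicitly in the text above), because each p(s_i') is a one-hot vector, the cross entropy between the correct output sentence and the calculated posteriors reduces to the negative log-likelihood of the correct tokens: L = -Σ_{i=1..I} Σ_{m=1..M} p(s_i')_m · log p(s_i)_m = -Σ_{i=1..I} log p(s_i)_{s_i'}; that is, at each token position only the element of p(s_i) corresponding to the correct token s_i' contributes to the loss.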
  • the optimization method that can be used to update the training parameters is not limited to a specific method, but an online optimization method based on stochastic gradient descent, for example, can be used.
  • the termination determination unit 218 determines whether to terminate the updating of the training parameters. At this time, the termination determination unit 218 determines to terminate the updating of the training parameters if a predetermined termination condition is met, and determines not to terminate the updating of the training parameters if not.
  • the training target parameters of the speech encoder output integration block 1200, the training target parameters of the temporal information integration block 1300, and the training target parameters of the linear transformation layer 1400 are repeatedly updated until the predetermined termination condition is met.
  • examples of the predetermined termination condition include the training parameters being updated a predetermined number of times or more, the number of epochs being a predetermined number of epochs or more, the value of the loss function being less than a predetermined value, the loss function converging, etc.
  • Fig. 8 is a flowchart showing an example of the model construction process.
  • the model construction unit 201 constructs the speech understanding model 1000 shown in FIG. 1 using the trained speech encoder stored in the trained speech encoder storage unit 203 and the trained large-scale language model stored in the trained large-scale language model storage unit 204 (step S101). At this time, the model construction unit 201 initializes the training parameters of the speech encoder output integration block 1200, the training parameters of the time information integration block 1300, and the training parameters of the linear transformation layer 1400 using any method. This constructs an untrained speech understanding model 1000.
  • the model construction unit 201 stores the speech understanding model 1000 constructed in step S101 above in the speech understanding model storage unit 205 (step S102).
  • Fig. 9 is a flowchart showing an example of the model learning process.
  • the learning data input unit 211 of the model learning unit 202 inputs one piece of learning data from the learning dataset stored in the learning dataset storage unit 206 (step S201).
  • the learning data input unit 211 inputs, for example, one piece of learning data that has not yet been input for the current number of epochs from the learning data that make up the learning dataset.
  • the epoch number starts from 0 and is incremented by 1 each time all of the learning data that make up the learning dataset is input.
  • the linear transformation unit 215 of the model learning unit 202 receives the second integrated vector v as input and outputs the second speech feature vector w obtained by linearly transforming the second integrated vector v (step S205).
  • the posterior probability calculation unit 216 of the model training unit 202 receives as input the input sentence included in the training data input in step S201 above and the second speech feature vector w output in step S205 above, and calculates the posterior probability of the output sentence when the input sentence and the second speech feature vector w are given (step S206).
  • the termination determination unit 218 of the model learning unit 202 determines whether to terminate the update of the learning parameter (step S208). That is, if a predetermined termination condition is met, the termination determination unit 218 determines to terminate the update of the learning parameter, and if not, determines not to terminate the update of the learning parameter.
  • if it is determined in step S208 above that the update of the learning target parameters should not be terminated, the model learning unit 202 returns to step S201 above. As a result, steps S201 to S207 above are repeatedly executed until the predetermined termination condition is met.
  • on the other hand, if it is determined in step S208 above that the update of the learning target parameters is to be terminated, the model learning unit 202 terminates the model learning process. As a result, the learning target parameters are learned and the trained speech understanding model 1000 is obtained.
  • <Example of hardware configuration of the speech understanding device 10 during inference> The hardware configuration of the speech understanding device 10 during inference may be the same as during learning, and therefore a description thereof will be omitted.
  • Fig. 10 is a diagram showing an example of the functional configuration of the speech understanding device 10 at the time of inference.
  • the speech understanding device 10 during inference has an output sentence generation unit 207.
  • the output sentence generation unit 207 is realized, for example, by processing in which one or more programs installed in the speech understanding device 10 are executed by the processor 108 or the like.
  • the speech understanding device 10 during inference also has a trained speech understanding model storage unit 208 and a test data storage unit 209.
  • Each of these storage units is realized, for example, by a storage area of the auxiliary storage device 107 or the like. However, at least one of these storage units may also be realized by a storage area of a storage device (for example, a storage device included in a database server or the like) that is communicatively connected to the speech understanding device 10.
  • the output sentence generation unit 207 uses the test data stored in the test data storage unit 209 and the trained speech understanding model 1000 stored in the trained speech understanding model storage unit 208 to generate and output an output sentence corresponding to the question expressed by the input sentence included in the test data (i.e., text data representing a natural language response or answer to the question).
  • test data refers to data represented by a pair of input speech and an input sentence that is a natural language question related to the input speech.
  • the trained speech understanding model 1000 refers to a speech understanding model 1000 whose learning target parameters have been trained. A detailed example of the functional configuration of the output sentence generation unit 207 will be described later.
  • the trained speech understanding model storage unit 208 stores the trained speech understanding model 1000.
  • the test data storage unit 209 stores the given test data.
  • Fig. 11 is a diagram showing an example of a detailed functional configuration of the output sentence generation unit 207.
  • the output sentence generation unit 207 includes a test data input unit 221, a speech encoding unit 222, a first integration unit 223, a second integration unit 224, a linear transformation unit 225, a generation unit 226, and an output unit 227.
  • the test data input unit 221 inputs one piece of test data stored in the test data storage unit 209.
  • the speech encoding unit 222 is realized by the speech encoder 1100 included in the trained speech understanding model 1000.
  • the second integration unit 224 is realized by the time information integration block 1300 included in the trained speech understanding model 1000.
  • the linear transformation unit 225 is realized by the linear transformation layer 1400 included in the trained speech understanding model 1000.
  • the linear transformation unit 225 receives the second integrated vector v as input and outputs the second speech feature vector w obtained by linearly transforming the second integrated vector v.
  • the generation unit 226 is realized by the large-scale language model 1500 included in the trained speech understanding model 1000.
  • the generation unit 226 receives as input an input sentence included in the test data input by the test data input unit 221 and a second speech feature vector w, and generates an output sentence when the input sentence and the second speech feature vector w are given.
  • the i-th token constituting the output sentence is denoted by s_i.
  • the posterior probability that token s_i will be generated when the input sentence and the second speech feature vector w are given is denoted by p(s_i).
  • the posterior probability that token s_i will be generated when the input sentence, the second speech feature vector w, and s_1, ..., s_(i-1) are given is denoted by p(s_i) (where i ≥ 2).
  • the generation unit 226 generates the output sentence by generating token s_i in accordance with the posterior probability p(s_i) until, for example, a token representing the end of the sentence is generated.
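  • A short greedy-decoding sketch of this procedure (the names llm, tok, and inputs_embeds follow the earlier illustrative snippets and are assumptions; sampling from p(s_i) instead of taking the argmax is equally possible):

        import torch

        generated = []
        emb = inputs_embeds                              # prompt embeddings + second speech feature vector
        for _ in range(128):                             # safety cap on the output length
            logits = llm(inputs_embeds=emb).logits[0, -1]
            p = torch.softmax(logits, dim=-1)            # posterior p(s_i) over the token vocabulary
            s_i = int(torch.argmax(p))
            if s_i == tok.eos_token_id:                  # token representing the end of the sentence
                break
            generated.append(s_i)
            next_emb = llm.get_input_embeddings()(torch.tensor([[s_i]]))
            emb = torch.cat([emb, next_emb], dim=1)      # condition on s_1, ..., s_i when generating s_(i+1)
        print(tok.decode(generated))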
  • the output unit 227 outputs the output sentence generated by the generation unit 226 to a predetermined output destination.
  • the predetermined output destination include a storage area such as the auxiliary storage device 107, a display device 102 such as a display, other devices or equipment that are communicatively connected, etc.
  • Fig. 12 is a flowchart showing an example of the output sentence generation process.
  • the test data input unit 221 of the output sentence generation unit 207 inputs one piece of test data stored in the test data storage unit 209 (step S301).
  • the linear transformation unit 225 of the output sentence generation unit 207 receives the second integrated vector v as input and outputs the second speech feature vector w obtained by linearly transforming the second integrated vector v (step S305).
  • the generation unit 226 of the output sentence generation unit 207 receives as input the input sentence included in the test data input in step S301 above and the second speech feature vector w output in step S305 above, and generates an output sentence when the input sentence and the second speech feature vector w are given (step S306).
  • the output unit 227 of the output sentence generation unit 207 outputs the output sentence generated in step S306 above to a predetermined output destination (step S307). This results in an output sentence that is a response or answer to the question related to the input speech.
  • the speech understanding device 10 can realize speech understanding technology using the speech understanding model 1000 in which the speech encoder output integration block 1200 and the time information integration block 1300 are present between the speech encoder 1100 and the large-scale language model 1500. For this reason, by using the speech understanding device 10 according to this embodiment, it is possible to expect an improvement in the processing accuracy of a downstream system that uses the recognition results of non-linguistic information and paralinguistic information, for example.
  • (Appendix 1) A learning device comprising a memory and at least one processor coupled to the memory, the processor: inputting training data including a speech, a first sentence related to the speech, and a second sentence corresponding to the first sentence; generating information representing the features of the speech for each predetermined time interval based on a speech feature extractor configured with multiple layers; generating first integrated information for each time interval based on a first parameter, the first integrated information being obtained by integrating the information representing the features generated in a plurality of predetermined layers of the speech feature extractor; generating second integrated information by integrating the first integrated information for each time interval in a time direction based on a second parameter; calculating a generation probability of a third sentence corresponding to the first sentence based on the first sentence, the second integrated information, and a language model; and learning learning target parameters including the first parameter and the second parameter based on the generation probability of the third sentence and the second sentence.
  • (Appendix 2) An inference device comprising a memory and at least one processor coupled to the memory, the processor: inputting test data including a speech and a first sentence related to the speech; generating information representing the features of the speech for each predetermined time interval based on a speech feature extractor configured with multiple layers; generating first integrated information for each time interval by integrating the information representing the features generated in a plurality of predetermined layers of the speech feature extractor based on a trained first parameter; generating second integrated information by integrating the first integrated information for each time interval in a time direction based on a trained second parameter; and generating a second sentence corresponding to the first sentence based on the first sentence, the second integrated information, and a language model.
  • (Appendix 3) The learning device according to Appendix 1, wherein the processor linearly transforms the second integrated information based on a third parameter, and calculates the generation probability of the third sentence based on the first sentence, the second integrated information after the linear transformation, and the language model.
  • (Appendix 4) The learning device according to Appendix 1 or 3, wherein the first parameter is a weight used in a weighted sum or a linear transformation coefficient used in a linear transformation sum, and the processor generates the first integrated information by integrating the features using the weighted sum or the linear transformation sum.
  • (Appendix 5) The learning device according to Appendix 1 or 3, wherein the second parameter is a weight of a self-attention pooling layer or a parameter of a one-dimensional convolutional neural network, and the processor generates the second integrated information by integrating the first integrated information in the time direction using the self-attention pooling layer or the one-dimensional convolutional neural network.
  • (Appendix 6) A non-transitory storage medium storing a program executable by a computer to perform a learning process, the learning process including: inputting training data including a speech, a first sentence related to the speech, and a second sentence corresponding to the first sentence; generating information representing the features of the speech for each predetermined time interval based on a speech feature extractor configured with multiple layers; generating first integrated information for each time interval based on a first parameter, the first integrated information being obtained by integrating the information representing the features generated in a plurality of predetermined layers of the speech feature extractor; generating second integrated information by integrating the first integrated information for each time interval in a time direction based on a second parameter; calculating a generation probability of a third sentence corresponding to the first sentence based on the first sentence, the second integrated information, and a language model; and learning learning target parameters including the first parameter and the second parameter based on the generation probability of the third sentence and the second sentence.
  • (Appendix 7) A non-transitory storage medium storing a program executable by a computer to perform an inference process, the inference process including: inputting test data including a speech and a first sentence related to the speech; generating information representing the features of the speech for each predetermined time interval based on a speech feature extractor configured with multiple layers; generating first integrated information for each time interval by integrating the information representing the features generated in a plurality of predetermined layers of the speech feature extractor based on a trained first parameter; generating second integrated information by integrating the first integrated information for each time interval in a time direction based on a trained second parameter; and generating a second sentence corresponding to the first sentence based on the first sentence, the second integrated information, and a language model.
  • Reference 1 Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli, "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations," arXiv preprint arXiv:2006.11477, 2020.
  • Reference 2 S. Chen et al., "WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing," in IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505-1518, Oct. 2022, doi: 10.1109/JSTSP.2022.3188113.
  • Reference 3 H. Touvron et al., "Llama 2: Open Foundation and Fine-Tuned Chat Models," arXiv preprint arXiv:2307.09288, 2023.
  • 10 Speech understanding device 101 Input device 102 Display device 103 External I/F 103a Recording medium 104 Communication I/F 105 RAM 106 ROM 107 Auxiliary storage device 108 Processor 109 Bus 201 Model construction unit 202 Model learning unit 203 Trained speech encoder storage unit 204 Trained large-scale language model storage unit 205 Speech understanding model storage unit 206 Training dataset storage unit 207 Output sentence generation unit 208 Trained speech understanding model storage unit 209 Test data storage unit 211 Training data input unit 212 Speech encoding unit 213 First integration unit 214 Second integration unit 215 Linear transformation unit 216 Posterior probability calculation unit 217 Parameter update unit 218 End determination unit 221 Test data input unit 222 Speech encoding unit 223 First integration unit 224 Second integration unit 225 Linear transformation unit 226 Generation unit 227 Output unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Machine Translation (AREA)

Abstract

This learning device comprises: an input unit that inputs learning data including speech, a first sentence related to the speech, and a second sentence corresponding to the first sentence; a speech feature generation unit that generates information representing features of the speech for each prescribed time interval on the basis of a speech feature extractor composed of a plurality of layers; a first integration unit that, on the basis of a first parameter, generates first integrated information for each time interval by integrating information representing features generated individually in a prescribed plurality of layers of the speech feature extractor; a second integration unit that, on the basis of a second parameter, generates second integrated information by integrating the first integrated information for each time interval in the time direction; a calculation unit that calculates the generation probability of a third sentence corresponding to the first sentence on the basis of the first sentence, the second integrated information, and a language model; and a learning unit that learns parameters to be learned, including the first parameter and the second parameter, on the basis of the generation probability of the third sentence and the second sentence.
PCT/JP2024/001906 2024-01-23 2024-01-23 Dispositif d'apprentissage, dispositif d'inférence, procédé d'apprentissage, procédé d'inférence et programme Pending WO2025158547A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2024/001906 WO2025158547A1 (fr) 2024-01-23 2024-01-23 Dispositif d'apprentissage, dispositif d'inférence, procédé d'apprentissage, procédé d'inférence et programme

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2024/001906 WO2025158547A1 (fr) 2024-01-23 2024-01-23 Dispositif d'apprentissage, dispositif d'inférence, procédé d'apprentissage, procédé d'inférence et programme

Publications (1)

Publication Number Publication Date
WO2025158547A1 true WO2025158547A1 (fr) 2025-07-31

Family

ID=96544586

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2024/001906 Pending WO2025158547A1 (fr) 2024-01-23 2024-01-23 Dispositif d'apprentissage, dispositif d'inférence, procédé d'apprentissage, procédé d'inférence et programme

Country Status (1)

Country Link
WO (1) WO2025158547A1 (fr)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021166207A1 (fr) * 2020-02-21 2021-08-26 日本電信電話株式会社 Dispositif de reconnaissance, dispositif d'apprentissage, procédé associé et programme
JP2023117248A (ja) * 2022-02-10 2023-08-23 株式会社東芝 機械学習装置、機械学習方法、機械学習プログラム及び推論装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GONG, YUAN ET AL.: "JOINT AUDIO AND SPEECH UNDERSTANDING.", ARXIV.ORG E-PRINT ARCHIVE, 10 December 2023 (2023-12-10), pages 1 - 8, XP034518593, Retrieved from the Internet <URL:https://arxiv.org/pdf/2309.14405> [retrieved on 20240311], DOI: 10.48550/arXiv.2309.14405 *

Similar Documents

Publication Publication Date Title
US12148444B2 (en) Synthesizing speech from text using neural networks
US20220068255A1 (en) Speech Recognition Using Unspoken Text and Speech Synthesis
US11929060B2 (en) Consistency prediction on streaming sequence models
US10789942B2 (en) Word embedding system
JP7502561B2 (ja) 言語間音声合成を改良するための音声認識の使用
EP4409568B1 (fr) Réseau siamois à contraste pour reconnaissance semi-supervisée de la parole
JP2022037862A (ja) テキスト基盤の事前学習モデルを活用した縦断型音声言語理解知識を蒸留するための方法、システム、およびコンピュータ読み取り可能な記録媒体
JP7678227B2 (ja) 多言語自動音声認識のための教師無しおよび教師有り共同トレーニング(just)
CN119054013A (zh) 使用非并行话音转换用于训练语音辨识模型
El‐Bialy et al. Developing phoneme‐based lip‐reading sentences system for silent speech recognition
CN115294962B (zh) 语音合成模型的训练方法、装置、设备及存储介质
CN113823259B (zh) 将文本数据转换为音素序列的方法及设备
JPWO2020240709A1 (ja) 対話処理装置、学習装置、対話処理方法、学習方法及びプログラム
WO2019167296A1 (fr) Dispositif, procédé et programme de traitement de langage naturel
JP2024530969A (ja) 音声合成ベースのモデル適応での音声認識の向上
CN120239884A (zh) 用于语音辨识的半监督训练方案
Radzikowski et al. Dual supervised learning for non-native speech recognition
US20240177706A1 (en) Monte Carlo Self-Training for Speech Recognition
US20250279089A1 (en) Using Synthetic Data to Improve Word Error Rate of Differentially Private ASR Models
Bai et al. Integrating knowledge into end-to-end speech recognition from external text-only data
CN115346510B (zh) 一种语音合成方法、装置、电子设备及存储介质
Do et al. Transferring Emphasis in Speech Translation Using Hard-Attentional Neural Network Models.
CN117727288B (zh) 一种语音合成方法、装置、设备及存储介质
JP2019159464A (ja) 言語モデルを利用する装置、方法及びプログラム
WO2025158547A1 (fr) Dispositif d'apprentissage, dispositif d'inférence, procédé d'apprentissage, procédé d'inférence et programme

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24920434

Country of ref document: EP

Kind code of ref document: A1