WO2025158547A1 - Learning device, inference device, learning method, inference method, and program - Google Patents
Learning device, inference device, learning method, inference method, and program
- Publication number
- WO2025158547A1 (Application No. PCT/JP2024/001906)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speech
- sentence
- unit
- learning
- parameter
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
Definitions
- This disclosure relates to a learning device, an inference device, a learning method, an inference method, and a program.
- Speech contains three types of information: linguistic information, non-linguistic information, and paralinguistic information (hereinafter, these three types of information are collectively referred to as "speech information") (Non-Patent Document 1). Furthermore, technology for recognizing non-linguistic and paralinguistic information from speech is known (Non-Patent Document 2).
- In the field of image processing, a technology known as image understanding technology, which outputs various information contained in images in natural sentences, is known (Non-Patent Document 3).
- By using a speech encoder instead of the image encoder used in image understanding technology, it is thought possible to realize a technology that outputs the speech information contained in speech in natural sentences (hereinafter also referred to as "speech understanding technology").
- However, due to differences in the configurations of image and speech encoders and differences in the properties of images and speech, it is difficult to realize speech understanding technology simply by using a speech encoder in place of an image encoder.
- A learning device according to one aspect of the present disclosure includes: an input unit that inputs learning data including speech, a first sentence related to the speech, and a second sentence corresponding to the first sentence; a speech feature generation unit that generates information representing features of the speech for each predetermined time interval based on a speech feature extractor configured with multiple layers; a first integration unit that generates, for each time interval and based on a first parameter, first integrated information obtained by integrating the information representing the features generated by each of predetermined multiple layers of the speech feature extractor; a second integration unit that generates, based on a second parameter, second integrated information obtained by integrating the first integrated information for each time interval in the time direction; a calculation unit that calculates the generation probability of a third sentence corresponding to the first sentence based on the first sentence, the second integrated information, and a language model; and a learning unit that learns learning target parameters including the first parameter and the second parameter based on the generation probability of the third sentence and the second sentence.
- FIG. 1 is a diagram illustrating an example of a speech understanding model.
- FIG. 2 is a diagram illustrating an example of a speech encoder and a speech encoder output integration block.
- FIG. 3 is a diagram illustrating an example of a time information integration block.
- FIG. 4 is a diagram illustrating an example of a hardware configuration of a speech understanding device during learning.
- FIG. 5 is a diagram illustrating an example of a functional configuration of a speech understanding device during learning.
- FIG. 6 is a diagram illustrating an example of a training data set.
- FIG. 7 is a diagram illustrating an example of a detailed functional configuration of a model learning unit.
- FIG. 8 is a flowchart illustrating an example of a model construction process.
- FIG. 9 is a flowchart illustrating an example of a model learning process.
- FIG. 10 is a diagram illustrating an example of a functional configuration of a speech understanding device during inference.
- FIG. 11 is a diagram illustrating an example of a detailed functional configuration of an output sentence generation unit.
- FIG. 12 is a flowchart illustrating an example of an output sentence generation process.
- <Non-verbal and para-linguistic information recognition technology> It is known that speech contains speech information (i.e., linguistic information, non-linguistic information, and paralinguistic information) (Non-Patent Document 1).
- Here, linguistic information refers to information about the words spoken by a speaker.
- Non-linguistic information refers to information that is not linguistic but cannot be changed at will (e.g., information that represents the speaker's identity, gender, emotions, etc.).
- Paralinguistic information refers to information that is not linguistic but can be changed at will (e.g., information that represents intentions, attitudes, etc.).
- Conventional technologies for recognizing non-linguistic and paralinguistic information from speech typically define a finite number of states in advance and then estimate which of these states the speech most closely matches. For example, the technology described in Non-Patent Document 2 uses a statistical model based on deep learning to estimate which emotional state, such as anger, joy, or sadness, the speech most closely matches.
- However, such conventional technologies are unable to recognize fine-grained non-linguistic and paralinguistic information. For example, they cannot estimate emotional states that are not predefined, such as "irritated," or states that span multiple emotions, such as "angry and sad." This reduces the processing accuracy of downstream systems that use the results of non-linguistic and paralinguistic information recognition (for example, the accuracy of call analysis in contact center systems, or the accuracy of dialogue control and analysis in voice dialogue systems).
- <Image understanding technology> In the field of image processing, image understanding technology, which outputs various information contained in an image in natural sentences, is known (Non-Patent Document 3). Here, a natural sentence is a sentence written in a natural language, that is, a language used by humans for communication, such as Japanese, English, or Chinese.
- Image understanding technology is composed of a deep learning model that combines a large-scale language model that acquires relationships and co-occurrences between words using a large amount of text data with an image encoder that extracts information about objects in the image from the image.
- When this deep learning model is given an input image and a natural language question about the input image (e.g., a sentence such as "What do you think about the logo in this image?"), it outputs an output sentence corresponding to the question (e.g., a sentence such as "This logo is a simple and symbolic logo").
- Deep learning models that realize image understanding technology can learn to infer various pieces of information contained in an image in natural sentences by being given sets of an input image, a natural language question about that input image, and a correct output sentence corresponding to that question. For example, in response to the question "What color is the logo in this image?", a sentence such as "It's pink." can be output as the output sentence.
- Speech understanding technology By realizing speech understanding technology that can output speech information contained in speech in natural sentences, it will be possible to recognize a variety of speech information, including detailed non-linguistic and paralinguistic information, and output that speech information in natural sentences.
- A simple way to realize speech understanding technology would be to use a speech encoder, which extracts information representing the characteristics of speech from speech, instead of the image encoder used in image understanding technology. In practice, however, it is difficult to realize speech understanding technology with this method, for two reasons.
- The first reason is the difference in the structure of image encoders and speech encoders. Existing speech encoders (e.g., wav2vec2.0 (Reference 1) and WavLM (Reference 2)) extract different information at each layer, so diverse speech information cannot be understood by using only the information ultimately output by the speech encoder, as is done in image understanding technology.
- Reference 2 suggests that information closer to physical properties, such as speaker information, is extracted in the lower layers of the speech encoder, while information closer to abstract properties, such as phonology, is extracted in the higher layers of the speech encoder.
- For this reason, it is thought that accurate output in natural sentences is difficult unless, for example, the information extracted in the lower layers of the speech encoder is used when recognizing speaking style, while the information extracted in the higher layers of the speech encoder is used for speech recognition.
- The second reason is the difference in the nature of images and audio.
- That is, audio has a variable length, so processing in the time direction is thought to be required in order to recognize the speech information contained in audio of any length and to output that speech information in natural sentences.
- <Speech understanding model> Therefore, we propose a deep learning model (hereinafter referred to as a "speech understanding model") that can solve the problems caused by the above two causes.
- This speech understanding model makes it possible to recognize diverse speech information, including detailed non-verbal and paralinguistic information, from speech of any length and output the speech information in natural sentences.
- This makes it possible to realize a speech understanding technology that outputs, in natural sentences, the diverse speech information contained in speech of any length (more specifically, speech information related to at least one of the physical and abstract properties of the speech). This can be expected to improve the processing accuracy of downstream systems that use the recognition results of non-verbal and paralinguistic information (e.g., the accuracy of call analysis in contact center systems, or the accuracy of dialogue control and analysis in voice dialogue systems).
- FIG. 1 is a diagram showing an example of the speech understanding model 1000.
- the speech understanding model 1000 is composed of a speech encoder 1100, a speech encoder output integration block 1200, a temporal information integration block 1300, a linear transformation layer 1400, and a large-scale language model 1500.
- the audio encoder 1100 is any existing audio encoder (for example, wav2vec2.0, WavLM, etc.).
- the audio encoder 1100 receives audio (hereinafter also referred to as "input audio") as input and outputs information representing the characteristics of that audio.
- the audio encoder 1100 receives the input audio for that time interval and outputs information representing the characteristics of the audio for that time interval.
- the audio encoder output merging block 1200 merges the outputs of multiple pre-specified layers from among the outputs of each layer of the audio encoder 1100 in each time interval.
- each of the multiple pre-specified layers will be referred to as the "layer to be merged.”
- each layer to be merged has the same number of output dimensions.
- the layer to be merged is specified by the user, etc., from among the layers of the audio encoder 1100 that have the same number of output dimensions.
- T is an index representing the last time interval of the input speech, and its value may vary depending on the length of the input speech.
- Each h_n(t) represents some feature of the input speech (e.g., a physical property or an abstract property), and each h_n(t) can be described by a vector with a predetermined number of dimensions, so hereinafter each h_n(t) will be referred to as a "first speech feature vector."
- The speech encoder output integration block 1200 receives, in each time interval t, each first speech feature vector h_n(t) in that time interval t as input, and outputs a vector obtained by integrating each first speech feature vector h_n(t).
- The vector obtained by integrating each first speech feature vector h_n(t) will be referred to as the "first integrated vector" and represented by e(t).
- The speech encoder 1100 is composed of one convolutional layer and N transformer layers, and these N transformer layers are the layers to be integrated.
- The output of the n-th transformer layer in time interval t is the first speech feature vector h_n(t).
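- As a concrete illustration of how the per-layer first speech feature vectors h_n(t) can be obtained in practice, the sketch below uses the Hugging Face transformers implementation of wav2vec 2.0; the checkpoint name, input length, and tensor shapes are illustrative assumptions and not part of this disclosure.

```python
import torch
from transformers import Wav2Vec2Model

# Illustrative only: any speech encoder that exposes its per-layer outputs can be used.
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
encoder.eval()

waveform = torch.randn(1, 16000)  # one second of 16 kHz audio, shape (batch, samples)
with torch.no_grad():
    out = encoder(waveform, output_hidden_states=True)

# out.hidden_states contains the encoder's initial embedding output followed by the
# output of each of the N transformer layers, each of shape (batch, T, D).
layer_outputs = out.hidden_states[1:]                  # h_1(t), ..., h_N(t)
print(len(layer_outputs), layer_outputs[0].shape)      # N layers, each (batch, T, D)
```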
- The speech encoder output integration block 1200 integrates these first speech feature vectors h_n(t) to create the first integrated vector e(t).
- The first speech feature vectors h_n(t) can be integrated by, for example, a weighted sum or a linear transformation sum, as sketched below.
- In the case of a linear transformation sum, W_1, ..., W_N, b_1, ..., b_N are also called linear transformation coefficients and are training target parameters.
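- The following is a minimal sketch of the two integration options, assuming each of the N layers to be merged outputs D-dimensional vectors per time interval; the module names, the softmax normalization of the weights, and the shapes are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class WeightedSumIntegration(nn.Module):
    """e(t) = sum_n a_n * h_n(t), with learnable weights a_1, ..., a_N (first parameter)."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(num_layers))

    def forward(self, h):                                # h: (N, batch, T, D)
        a = torch.softmax(self.alpha, dim=0)             # normalizing is one common choice (assumption)
        return torch.einsum("n,nbtd->btd", a, h)         # e(t): (batch, T, D)

class LinearSumIntegration(nn.Module):
    """e(t) = sum_n (W_n h_n(t) + b_n), with linear transformation coefficients W_n, b_n."""
    def __init__(self, num_layers: int, dim: int):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])

    def forward(self, h):                                # h: (N, batch, T, D)
        return sum(layer(h_n) for layer, h_n in zip(self.proj, h))

h = torch.randn(12, 1, 49, 768)                          # N = 12 layer outputs h_n(t)
e = WeightedSumIntegration(num_layers=12)(h)             # first integrated vectors e(t): (1, 49, 768)
```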
- the temporal information integration block 1300 integrates the first integrated vector e(t) in the time direction. That is, the temporal information integration block 1300 receives the first integrated vector e(t) for each time interval t as input and outputs a vector obtained by integrating each first integrated vector e(t) in the time direction. In this case, it is not sufficient to simply integrate the first integrated vector e(t) for all time intervals t; it is necessary to consider which parts of the time period should be emphasized and which parts should not. For example, when recognizing a speaker's emotions or speaker information from speech, it is necessary to ignore time periods representing short pauses in the speech or time periods where breathing occurs, and to emphasize the time periods during which the speaker is speaking.
- the temporal information integration block 1300 integrates each first integrated vector e(t) using a weighted sum.
- the vector obtained by integrating each first integrated vector e(t) in the time direction will be referred to as the "second integrated vector" and represented by v.
- the time information integration block 1300 creates a second integrated vector v by integrating these first integrated vectors e(1), ..., e(T) in the time direction.
- For example, a self-attentive pooling layer can be used as the method for integrating the first integrated vectors e(1), ..., e(T).
- Specifically, let E = [e(1), ..., e(T)]^⊤, where ⊤ denotes transposition, and compute attention weights a = softmax(ReLU(W_1'E)W_2'), where W_1' and W_2' are training target parameters.
- The second integrated vector v is then obtained by integrating the first integrated vectors e(1), ..., e(T) by a weighted sum using these attention weights (see the sketch below).
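- The sketch below shows one common formulation of a self-attentive pooling layer; the attention dimension and the exact placement of the transposes are assumptions, since the notation above leaves them open.

```python
import torch
import torch.nn as nn

class SelfAttentivePooling(nn.Module):
    """Collapses a variable-length sequence e(1), ..., e(T) into a single vector v
    using attention weights a = softmax(ReLU(E W1') W2') computed over time."""
    def __init__(self, dim: int, attn_dim: int = 128):
        super().__init__()
        self.w1 = nn.Linear(dim, attn_dim, bias=False)   # W_1' (second parameter)
        self.w2 = nn.Linear(attn_dim, 1, bias=False)     # W_2' (second parameter)

    def forward(self, e):                                # e: (batch, T, D)
        scores = self.w2(torch.relu(self.w1(e)))         # (batch, T, 1)
        a = torch.softmax(scores, dim=1)                 # attention weights over the T time intervals
        return (a * e).sum(dim=1)                        # v: (batch, D), the weighted sum

pool = SelfAttentivePooling(dim=768)
v = pool(torch.randn(1, 49, 768))                        # works for any sequence length T
```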
- the time information integration block 1300 may integrate each of the first integrated vectors e(1), ..., e(T) in the time direction using a convolutional neural network.
- the learnable parameters of the one-dimensional convolutional neural network are the parameters to be learned.
- In this case, the time information integration block 1300 outputs K second integrated vectors v(1), ..., v(K), where K is an integer equal to or greater than 1 determined by the window size of the one-dimensional convolutional neural network and the sequence length T of the first integrated vectors e(1), ..., e(T).
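- A minimal sketch of this variant follows; the kernel size, stride, and channel count of the one-dimensional convolution are illustrative assumptions.

```python
import torch
import torch.nn as nn

# A one-dimensional convolution over the time axis turns the T first integrated
# vectors e(1), ..., e(T) into K second integrated vectors v(1), ..., v(K).
conv = nn.Conv1d(in_channels=768, out_channels=768, kernel_size=16, stride=8)

e = torch.randn(1, 49, 768)              # (batch, T, D) first integrated vectors
v_seq = conv(e.transpose(1, 2))          # Conv1d expects (batch, D, T)
v_seq = v_seq.transpose(1, 2)            # (batch, K, D); here K = (49 - 16) // 8 + 1 = 5
```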
- the linear transformation layer 1400 linearly transforms the second integrated vector v. That is, the linear transformation layer 1400 takes the second integrated vector v as input and outputs a vector obtained by linearly transforming this second integrated vector v.
- the vector obtained by linearly transforming the second integrated vector v will be referred to as the "second speech feature vector" and represented by w.
- W and b are also called linear transformation coefficients and are training target parameters. Note that if the number of dimensions of the token embedding space in the large-scale language model 1500 is M, then the number of dimensions of the second speech feature vector w is also M. This means that the linear transformation layer 1400 creates a second speech feature vector sequence with a length of 1 and M dimensions.
- K vectors obtained by linearly transforming v(1), ..., v(K) can be set as second speech feature vectors w(1), ..., w(K). This means that a second speech feature vector sequence with a length of K and a number of dimensions of M is created.
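- This projection can be sketched as a single linear layer, shown below; the dimensions D and M are illustrative assumptions (M must match the token embedding dimension of the language model actually used).

```python
import torch
import torch.nn as nn

D, M = 768, 4096                 # M: token embedding dimension of the language model (assumed)
linear = nn.Linear(D, M)         # W and b: the linear transformation coefficients of layer 1400

v = torch.randn(1, D)            # a single second integrated vector
w = linear(v)                    # (1, M): second speech feature vector sequence of length 1

v_seq = torch.randn(1, 5, D)     # K = 5 second integrated vectors v(1), ..., v(K)
w_seq = linear(v_seq)            # (1, 5, M): second speech feature vector sequence of length K
```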
- Large-scale language model 1500 is any existing large-scale language model (LLM).
- Large-scale language model 1500 takes as input an input sentence, which is a natural language question about the input speech, and a second speech feature vector w (or second speech feature vector w(1), ..., w(K)), and outputs an output sentence corresponding to the question.
- Such an output sentence is generated according to the posterior probability given the input sentence and the second speech feature vector w (or second speech feature vector w(1), ..., w(K)).
- the posterior probability is calculated by processing a vector sequence combining an embedding vector sequence representing the embedded representation of the tokens that make up the input sentence and the second speech feature vector w (or second speech feature vector w(1), ..., w(K)).
- FIG. 1 shows a case where, when an input sentence of "Please tell me the emotional state of the person in the following speech," is given, an output sentence of "This person is male and is slightly irritated” is generated.
- For example, Llama 2 7B (Reference 3) can be used as the large-scale language model 1500.
- a language model other than a large-scale language model may also be used as the large-scale language model 1500, as long as it is a language model that can generate an output sentence according to the posterior probability when given an input sentence and a second speech feature vector w (or second speech feature vectors w(1), ..., w(K)).
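- The following sketch illustrates the posterior probability computation described above, assuming a Hugging Face causal language model that accepts pre-computed input embeddings; the checkpoint name, the concatenation order, and the random second speech feature vector are assumptions for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"       # assumed checkpoint, in line with Reference 3
tok = AutoTokenizer.from_pretrained(name)
llm = AutoModelForCausalLM.from_pretrained(name)

question = "Please tell me the emotional state of the person in the following speech."
ids = tok(question, return_tensors="pt").input_ids            # (1, L) input sentence tokens
text_emb = llm.get_input_embeddings()(ids)                    # (1, L, M) embedding vector sequence

w = torch.randn(1, 1, llm.config.hidden_size)                 # second speech feature vector(s)
inputs = torch.cat([text_emb, w], dim=1)                      # combine along the sequence axis

with torch.no_grad():
    logits = llm(inputs_embeds=inputs).logits                 # (1, L + 1, vocabulary size)
p_next = torch.softmax(logits[:, -1], dim=-1)                 # posterior p(s_1) over token types
```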
- the speech encoder 1100 may be referred to as, for example, a "speech feature extractor” or “speech coder.”
- the large-scale language model 1500 may be referred to as, for example, a "language model,” a "natural language model,” or a “natural language processing model.”
- the components that make up the speech understanding model 1000 may be referred to as, for example, a "module.”
- the speech understanding device 10 undergoes a "model construction" phase during which the speech understanding model 1000 is constructed, a “learning” phase during which parameters to be learned for the speech understanding model 1000 are learned, and an “inference” phase during which output sentences are generated by the speech understanding model 1000 using the learned parameters.
- the speech understanding device 10 is provided with a collection of training data (hereinafter also referred to as a "training dataset") represented by pairs of input speech, input sentences which are questions in natural language related to the input speech, and correct output sentences corresponding to the questions.
- the speech understanding device 10 is provided with test data represented by pairs of input speech and input sentences which are questions in natural language related to the input speech.
- the speech understanding device 10 during model construction may be referred to as, for example, a "model construction device” or a "model creation device.”
- the speech understanding device 10 during learning may be referred to as, for example, a “learning device,” a “parameter estimation device,” a “parameter optimization device,” etc.
- the speech understanding device 10 during inference may be referred to as, for example, an “inference device,” an “estimation device,” a "natural sentence generation device,” etc.
- model construction is considered to be included in learning, and a case will be described in which the speech understanding device 10 also constructs the speech understanding model 1000 during learning.
- Fig. 4 is a diagram showing an example of the hardware configuration of the speech understanding device 10 during learning.
- the speech understanding device 10 during learning has an input device 101, a display device 102, an external I/F 103, a communication I/F 104, a RAM (Random Access Memory) 105, a ROM (Read Only Memory) 106, an auxiliary storage device 107, and a processor 108.
- Each of these pieces of hardware is connected so as to be able to communicate with one another via a bus 109.
- the input device 101 is, for example, a keyboard, a mouse, a touch panel, a physical button, etc.
- the display device 102 is, for example, a display, a display panel, etc. Note that the speech understanding device 10 does not necessarily have to have at least one of the input device 101 and the display device 102, for example.
- the external I/F 103 is an interface with external devices such as a recording medium 103a.
- recording media 103a include a CD (Compact Disc), a DVD (Digital Versatile Disk), an SD memory card (Secure Digital memory card), and a USB (Universal Serial Bus) memory card.
- the communication I/F 104 is an interface for connecting to a communication network.
- the RAM 105 is a volatile semiconductor memory (storage device) that temporarily stores programs and data.
- the ROM 106 is a non-volatile semiconductor memory (storage device) that can store programs and data even when the power is turned off.
- the auxiliary storage device 107 is a non-volatile storage device such as an HDD (Hard Disk Drive), SSD (Solid State Drive), or flash memory.
- the processor 108 is a variety of arithmetic devices such as a CPU (Central Processing Unit) or GPU (Graphics Processing Unit).
- the hardware configuration shown in FIG. 4 is an example, and the hardware configuration of the speech understanding device 10 is not limited to this.
- the speech understanding device 10 may have multiple auxiliary storage devices 107 or multiple processors 108, may not have some of the hardware shown in the figure, or may have various hardware other than the hardware shown in the figure.
- Fig. 5 is a diagram showing an example of the functional configuration of the speech understanding device 10 during learning.
- the speech understanding device 10 during learning has a model construction unit 201 and a model learning unit 202. These units are realized, for example, by processing in which one or more programs installed in the speech understanding device 10 are executed by the processor 108 or the like.
- the speech understanding device 10 during learning also has a trained speech encoder storage unit 203, a trained large-scale language model storage unit 204, a speech understanding model storage unit 205, and a training dataset storage unit 206.
- Each of these storage units is realized, for example, by a storage area such as the auxiliary storage device 107. However, at least one of these storage units may also be realized by a storage area of a storage device (for example, a storage device included in a database server, etc.) that is communicatively connected to the speech understanding device 10.
- the model construction unit 201 constructs the speech understanding model 1000 shown in FIG. 1 using the trained speech encoder stored in the trained speech encoder storage unit 203 and the trained large-scale language model stored in the trained large-scale language model storage unit 204. That is, the model construction unit 201 constructs the speech understanding model 1000 shown in FIG. 1 using the trained speech encoder as the speech encoder 1100 and the trained large-scale language model as the large-scale language model 1500.
- the model construction unit 201 initializes the training target parameters of the speech encoder output integration block 1200, the training target parameters of the time information integration block 1300, and the training target parameters of the linear transformation layer 1400.
- the training target parameters may be initialized using any method, such as random initialization or sampling from a predetermined distribution.
- a trained speech encoder is a speech encoder whose parameters have been trained.
- a trained large-scale language model is a large-scale language model whose parameters have been trained.
- model construction unit 201 stores the speech understanding model 1000 in the speech understanding model storage unit 205.
- the model training unit 202 uses the training dataset stored in the training dataset storage unit 206 to train the speech understanding model 1000 stored in the speech understanding model storage unit 205. At this time, the model training unit 202 trains the training target parameters of the speech encoder output integration block 1200, the training target parameters of the temporal information integration block 1300, and the training target parameters of the linear transformation layer 1400, while keeping the parameters of the speech encoder 1100 and large-scale language model 1500 fixed. More specifically, the model training unit 202 uses, as a loss function, the cross entropy between the output sentence generated by the speech understanding model 1000 when given the input speech and input sentence contained in the training data and the correct output sentence contained in the training data, and trains the training target parameters using an existing optimization method so as to minimize the loss function. A detailed example of the functional configuration of the model training unit 202 will be described later.
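- A minimal sketch of this training step is shown below. It assumes a wrapper `model` that exposes the sub-modules under the names used here and returns per-token logits; these names, the optimizer, and the learning rate are illustrative assumptions, not part of this disclosure.

```python
import torch
import torch.nn.functional as F

# Keep the speech encoder 1100 and the large-scale language model 1500 fixed.
for p in model.speech_encoder.parameters():
    p.requires_grad = False
for p in model.llm.parameters():
    p.requires_grad = False

# Train only blocks 1200 and 1300 and linear transformation layer 1400.
trainable = (list(model.encoder_output_integration.parameters())
             + list(model.temporal_integration.parameters())
             + list(model.linear_layer.parameters()))
optimizer = torch.optim.Adam(trainable, lr=1e-4)

for speech, input_sentence, correct_ids in train_loader:      # hypothetical data loader
    logits = model(speech, input_sentence, correct_ids)       # (batch, I, vocabulary size)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           correct_ids.reshape(-1))           # cross entropy with the correct output sentence
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```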
- the trained speech encoder storage unit 203 stores a trained speech encoder.
- the trained large-scale language model storage unit 204 stores a trained large-scale language model.
- the speech understanding model storage unit 205 stores the speech understanding model 1000 constructed by the model construction unit 201.
- the training dataset storage unit 206 stores a given training dataset.
- Fig. 6 is a diagram showing an example of the training data set.
- a training dataset is made up of one or more training data, each of which includes input speech, an input sentence, and a correct output sentence.
- a training dataset is made up of a large number of training data.
- the input speech is speech data input to the speech understanding model 1000.
- the input sentence is text data that represents a question in natural language related to the input speech.
- the correct output sentence is text data that represents a response or answer in natural language that is the correct answer to the question represented by the input sentence.
- the input speech does not necessarily have to be audio data that records a human voice, but may also be audio data that records any sound.
- the correct output sentence may also be called, for example, "teaching data.”
- the training data for the second line of the example shown in FIG. 6 includes the input speech "Speech A,” the input sentence “Please tell me how this speech is spoken,” and the correct output sentence "A woman is speaking quickly and loudly.”
- the training data for the third line of the example shown in FIG. 6 includes the input speech "Speech B,” the input sentence “Please tell me how this speech is spoken,” and the correct output sentence "A man speaks slowly and calmly.”
- the training data on the fourth line of the example shown in FIG. 6 includes the input speech "Speech B," the input sentence "Please tell me the gender of the speaker of this speech," and the correct output sentence "It is a man."
- the training data on the fifth line of the example shown in FIG. 6 includes the input speech "Speech C," the input sentence "What is the emotion of the speaker of this speech?", and the correct output sentence "This speaker seems a little irritated."
- the training dataset is composed of training data represented by pairs of input speech, input sentences, and correct output sentences.
- the training dataset may contain multiple training data sets that contain different input sentences and correct output sentences for the same input speech.
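- One possible in-memory representation of such a dataset is sketched below; the file names and field names are assumptions made for illustration.

```python
# Each entry pairs an input speech recording with an input sentence (a natural language
# question about the speech) and the corresponding correct output sentence.
training_dataset = [
    {"speech": "speech_A.wav",
     "input_sentence": "Please tell me how this speech is spoken.",
     "correct_output_sentence": "A woman is speaking quickly and loudly."},
    {"speech": "speech_B.wav",
     "input_sentence": "Please tell me how this speech is spoken.",
     "correct_output_sentence": "A man speaks slowly and calmly."},
    {"speech": "speech_B.wav",   # the same input speech may appear with different questions
     "input_sentence": "Please tell me the gender of the speaker of this speech.",
     "correct_output_sentence": "It is a man."},
    {"speech": "speech_C.wav",
     "input_sentence": "What is the emotion of the speaker of this speech?",
     "correct_output_sentence": "This speaker seems a little irritated."},
]
```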
- Fig. 7 is a diagram showing an example of a detailed functional configuration of the model learning unit 202.
- the model learning unit 202 includes a learning data input unit 211, an audio encoding unit 212, a first integration unit 213, a second integration unit 214, a linear transformation unit 215, a posterior probability calculation unit 216, a parameter update unit 217, and an end determination unit 218.
- the learning data input unit 211 inputs one piece of learning data from the learning dataset stored in the learning dataset storage unit 206.
- the speech encoding unit 212 is realized by the speech encoder 1100 included in the speech understanding model 1000.
- the second integration unit 214 is realized by the time information integration block 1300 included in the speech understanding model 1000.
- the linear transformation unit 215 is realized by the linear transformation layer 1400 included in the speech understanding model 1000.
- the linear transformation unit 215 receives the second integrated vector v as input and outputs the second speech feature vector w obtained by linearly transforming the second integrated vector v.
- the posterior probability calculation unit 216 is realized by the large-scale language model 1500 included in the speech understanding model 1000.
- the posterior probability calculation unit 216 receives as input an input sentence included in the training data input by the training data input unit 211 and a second speech feature vector w, and calculates the posterior probability of an output sentence when the input sentence and the second speech feature vector w are given.
- I is the number of tokens included in the output sentence (i.e., the length of the output sentence), and may be, for example, the length of the correct output sentence included in the training data input by the training data input unit 211.
- the posterior probability p(s i ) is expressed as an M-dimensional vector in which, for example, when the number of dimensions of the token embedding space is M, the m-th element represents the probability that the m-th type of token is generated, and the sum of the values of all elements is 1.
- a token is the basic processing unit used when a language model, such as a large-scale language model, processes a string of characters.
- a typical example of a token is a word, but tokens are not limited to words and can also be, for example, characters, morphemes, subwords, or coherent strings of characters.
- the parameter update unit 217 uses the posterior probabilities calculated by the posterior probability calculation unit 216 and the correct output sentences included in the learning data input by the learning data input unit 211 to learn the learning target parameters of the speech understanding model 1000 using an existing optimization method.
- the i-th token constituting the correct output sentence is denoted by s i ' (where s 1 ' is a token representing the beginning of the sentence, and s I ' is a token representing the end of the sentence).
- the probability that token s i ' is generated is denoted by p(s i ').
- the probability p(s i ') is expressed as an M-dimensional vector in which only the element corresponding to token s i ' has a value of 1 and the other elements have a value of 0.
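- The cross entropy between the posteriors p(s_i) and the one-hot vectors p(s_i') can be computed as in the sketch below; the vocabulary size, sentence length, and averaging over tokens are illustrative assumptions.

```python
import torch

M, I = 32000, 7                                            # number of token types, output sentence length
posteriors = torch.softmax(torch.randn(I, M), dim=-1)      # p(s_1), ..., p(s_I), each summing to 1
correct_ids = torch.randint(0, M, (I,))                    # indices of the correct tokens s_1', ..., s_I'

one_hot = torch.zeros(I, M).scatter_(1, correct_ids.unsqueeze(1), 1.0)   # p(s_1'), ..., p(s_I')
loss = -(one_hot * posteriors.log()).sum(dim=1).mean()     # cross entropy to be minimized
```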
- the optimization method that can be used to update the training parameters is not limited to a specific method, but an online optimization method based on stochastic gradient descent, for example, can be used.
- the termination determination unit 218 determines whether to terminate the updating of the training parameters. At this time, the termination determination unit 218 determines to terminate the updating of the training parameters if a predetermined termination condition is met, and determines not to terminate the updating of the training parameters if not.
- the training parameters of the audio encoder output integrated block 1200, the training parameters of the temporal information integrated block 1300, and the training parameters of the linear transformation layer 1400 are repeatedly updated until the predetermined termination condition is met.
- examples of the predetermined termination condition include the training parameters being updated a predetermined number of times or more, the number of epochs being a predetermined number of epochs or more, the value of the loss function being less than a predetermined value, the loss function converging, etc.
- Fig. 8 is a flowchart showing an example of the model construction process.
- the model construction unit 201 constructs the speech understanding model 1000 shown in FIG. 1 using the trained speech encoder stored in the trained speech encoder storage unit 203 and the trained large-scale language model stored in the trained large-scale language model storage unit 204 (step S101). At this time, the model construction unit 201 initializes the training parameters of the speech encoder output integration block 1200, the training parameters of the time information integration block 1300, and the training parameters of the linear transformation layer 1400 using any method. This constructs an untrained speech understanding model 1000.
- the model construction unit 201 stores the speech understanding model 1000 constructed in step S101 above in the speech understanding model storage unit 205 (step S102).
- Fig. 9 is a flowchart showing an example of the model learning process.
- the learning data input unit 211 of the model learning unit 202 inputs one piece of learning data from the learning dataset stored in the learning dataset storage unit 206 (step S201).
- the learning data input unit 211 inputs, for example, one piece of learning data that has not yet been input for the current number of epochs from the learning data that make up the learning dataset.
- the epoch number starts from 0 and is incremented by 1 each time all of the learning data that make up the learning dataset is input.
- the linear transformation unit 215 of the model learning unit 202 receives the second integrated vector v as input and outputs the second speech feature vector w obtained by linearly transforming the second integrated vector v (step S205).
- the posterior probability calculation unit 216 of the model training unit 202 receives as input the input sentence included in the training data input in step S201 above and the second speech feature vector w output in step S205 above, and calculates the posterior probability of the output sentence when the input sentence and the second speech feature vector w are given (step S206).
- the termination determination unit 218 of the model learning unit 202 determines whether to terminate the update of the learning parameter (step S208). That is, if a predetermined termination condition is met, the termination determination unit 218 determines to terminate the update of the learning parameter, and if not, determines not to terminate the update of the learning parameter.
- If it is determined in step S208 above that the update of the learning target parameters should not be terminated, the model learning unit 202 returns to step S201 above. As a result, steps S201 to S207 above are repeatedly executed until the predetermined termination condition is met.
- On the other hand, if it is determined in step S208 above that the update of the learning target parameters is to be terminated, the model learning unit 202 terminates the model learning process. As a result, the learning target parameters are learned and the trained speech understanding model 1000 is obtained.
- <Example of hardware configuration of the speech understanding device 10 during inference> The hardware configuration of the speech understanding device 10 during inference may be the same as during learning, and therefore a description thereof will be omitted.
- Fig. 10 is a diagram showing an example of the functional configuration of the speech understanding device 10 at the time of inference.
- the speech understanding device 10 during inference has an output sentence generation unit 207.
- the output sentence generation unit 207 is realized, for example, by processing in which one or more programs installed in the speech understanding device 10 are executed by the processor 108 or the like.
- the speech understanding device 10 during inference also has a trained speech understanding model storage unit 208 and a test data storage unit 209.
- Each of these storage units is realized, for example, by a storage area of the auxiliary storage device 107 or the like. However, at least one of these storage units may also be realized by a storage area of a storage device (for example, a storage device included in a database server or the like) that is communicatively connected to the speech understanding device 10.
- the output sentence generation unit 207 uses the test data stored in the test data storage unit 209 and the trained speech understanding model 1000 stored in the trained speech understanding model storage unit 208 to generate and output an output sentence corresponding to the question expressed by the input sentence included in the test data (i.e., text data representing a natural language response or answer to the question).
- test data refers to data represented by a pair of input speech and an input sentence that is a natural language question related to the input speech.
- the trained speech understanding model 1000 refers to a speech understanding model 1000 whose learning target parameters have been trained. A detailed example of the functional configuration of the output sentence generation unit 207 will be described later.
- the trained speech understanding model storage unit 208 stores the trained speech understanding model 1000.
- the test data storage unit 209 stores the given test data.
- Fig. 11 is a diagram showing an example of a detailed functional configuration of the output sentence generation unit 207.
- the output sentence generation unit 207 includes a test data input unit 221, a speech encoding unit 222, a first integration unit 223, a second integration unit 224, a linear conversion unit 225, a generation unit 226, and an output unit 227.
- the test data input unit 221 inputs one piece of test data stored in the test data storage unit 209.
- the speech encoding unit 222 is realized by the speech encoder 1100 included in the trained speech understanding model 1000.
- the second integration unit 224 is realized by the time information integration block 1300 included in the trained speech understanding model 1000.
- the linear transformation unit 225 is realized by the linear transformation layer 1400 included in the trained speech understanding model 1000.
- the linear transformation unit 225 receives the second integrated vector v as input and outputs the second speech feature vector w obtained by linearly transforming the second integrated vector v.
- the generation unit 226 is realized by the large-scale language model 1500 included in the trained speech understanding model 1000.
- the generation unit 226 receives as input an input sentence included in the test data input by the test data input unit 221 and a second speech feature vector w, and generates an output sentence when the input sentence and the second speech feature vector w are given.
- the i-th token constituting the output sentence is denoted by s i .
- the posterior probability that token s i will be generated when the input sentence and the second speech feature vector w are given is denoted by p(s i )
- the posterior probability that token s i will be generated when the input sentence, the second speech feature vector w, and s 1 , ..., s i-1 are given is denoted by p(s i ) (where i ⁇ 2).
- the generation unit 226 generates the output sentence by generating token s i in accordance with the posterior probability p(s i ) until, for example, a token representing the end of the sentence is generated.
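- A minimal greedy-decoding sketch of this generation procedure is shown below; it assumes the language model `llm`, tokenizer `tok`, input sentence embeddings `text_emb`, and second speech feature vector `w` from the earlier sketch, and the maximum length is an illustrative safety cap.

```python
import torch

def generate_output_sentence(llm, tok, text_emb, w, max_len=128):
    """Emit the most probable token s_i at each step until the end-of-sentence token appears."""
    emb_layer = llm.get_input_embeddings()
    context = torch.cat([text_emb, w], dim=1)      # input sentence embeddings + speech feature(s)
    generated = []
    for _ in range(max_len):
        with torch.no_grad():
            logits = llm(inputs_embeds=context).logits
        next_id = logits[:, -1].argmax(dim=-1)     # token with the highest posterior p(s_i)
        if next_id.item() == tok.eos_token_id:     # stop at the token representing the end of the sentence
            break
        generated.append(next_id.item())
        context = torch.cat([context, emb_layer(next_id.unsqueeze(0))], dim=1)
    return tok.decode(generated)
```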
- the output unit 227 outputs the output sentence generated by the generation unit 226 to a predetermined output destination.
- the predetermined output destination include a storage area such as the auxiliary storage device 107, a display device 102 such as a display, other devices or equipment that are communicatively connected, etc.
- Fig. 12 is a flowchart showing an example of the output sentence generation process.
- the test data input unit 221 of the output sentence generation unit 207 inputs one piece of test data stored in the test data storage unit 209 (step S301).
- the linear transformation unit 225 of the output sentence generation unit 207 receives the second integrated vector v as input and outputs the second speech feature vector w obtained by linearly transforming the second integrated vector v (step S305).
- the generation unit 226 of the output sentence generation unit 207 receives as input the input sentence included in the test data input in step S301 above and the second speech feature vector w output in step S305 above, and generates an output sentence when the input sentence and the second speech feature vector w are given (step S306).
- the output unit 227 of the output sentence generation unit 207 outputs the output sentence generated in step S306 above to a predetermined output destination (step S307). This results in an output sentence that is a response or answer to the question related to the input voice.
- the speech understanding device 10 can realize speech understanding technology using the speech understanding model 1000 in which the speech encoder output integration block 1200 and the time information integration block 1300 are present between the speech encoder 1100 and the large-scale language model 1500. For this reason, by using the speech understanding device 10 according to this embodiment, it is possible to expect an improvement in the processing accuracy of a downstream system that uses the recognition results of non-linguistic information and paralinguistic information, for example.
- (Appendix 1) A learning device comprising a memory and at least one processor coupled to the memory, wherein the processor: inputs training data including a speech, a first sentence related to the speech, and a second sentence corresponding to the first sentence; generates information representing the features of the speech for each predetermined time interval based on a speech feature extractor configured with multiple layers; generates first integrated information for each time interval based on a first parameter, the first integrated information being obtained by integrating the information representing the features generated in a plurality of predetermined layers of the speech feature extractor; generates second integrated information by integrating the first integrated information for each time interval in a time direction based on a second parameter; calculates a generation probability of a third sentence corresponding to the first sentence based on the first sentence, the second integrated information, and a language model; and learns learning target parameters including the first parameter and the second parameter based on the generation probability of the third sentence and the second sentence.
- (Appendix 2) An inference device comprising a memory and at least one processor coupled to the memory, wherein the processor: inputs test data including a speech and a first sentence related to the speech; generates information representing the features of the speech for each predetermined time interval based on a speech feature extractor configured with multiple layers; generates first integrated information for each time interval by integrating the information representing the features generated in a plurality of predetermined layers of the speech feature extractor based on the trained first parameter; generates second integrated information by integrating the first integrated information for each time interval in a time direction based on the trained second parameter; and generates a second sentence corresponding to the first sentence based on the first sentence, the second integrated information, and a language model.
- (Appendix 3) The learning device according to Appendix 1, wherein the processor: linearly transforms the second integrated information based on a third parameter; and calculates the generation probability of the third sentence based on the first sentence, the linearly transformed second integrated information, and the language model.
- The learning device according to claim 1 or 3, wherein the first parameter is a weight used in a weighted sum or a linear transformation coefficient used in a linear transformation sum, and the processor generates the first integrated information by integrating the features using the weighted sum or the linear transformation sum.
- The learning device according to claim 1 or 3, wherein the second parameter is a weight of a self-attention pooling layer or a parameter of a one-dimensional convolutional neural network, and the processor generates the second integrated information by integrating the first integrated information in the time direction using the self-attention pooling layer or the one-dimensional convolutional neural network.
- A non-transitory storage medium storing a program executable by a computer to perform a learning process, the learning process including: inputting training data including a speech, a first sentence related to the speech, and a second sentence corresponding to the first sentence; generating information representing the features of the speech for each predetermined time interval based on a speech feature extractor configured with multiple layers; generating first integrated information for each time interval based on a first parameter, the first integrated information being obtained by integrating the information representing the features generated in a plurality of predetermined layers of the speech feature extractor; generating second integrated information by integrating the first integrated information for each time interval in a time direction based on a second parameter; calculating a generation probability of a third sentence corresponding to the first sentence based on the first sentence, the second integrated information, and a language model; and learning learning target parameters including the first parameter and the second parameter based on the generation probability of the third sentence and the second sentence.
- A non-transitory storage medium storing a program executable by a computer to perform an inference process, the inference process including: inputting test data including a speech and a first sentence related to the speech; generating information representing the features of the speech for each predetermined time interval based on a speech feature extractor configured with multiple layers; generating first integrated information for each time interval by integrating the information representing the features generated in a plurality of predetermined layers of the speech feature extractor based on the trained first parameter; generating second integrated information by integrating the first integrated information for each time interval in a time direction based on the trained second parameter; and generating a second sentence corresponding to the first sentence based on the first sentence, the second integrated information, and a language model.
- Reference 1 Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli, "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations," arXiv preprint arXiv:2006.11477, 2020.
- Reference 2 S. Chen et al., "WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing," in IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505-1518, Oct. 2022, doi: 10.1109/JSTSP.2022.3188113.
- Reference 3 H. Touvron et al., "Llama 2: Open Foundation and Fine-Tuned Chat Models," arXiv preprint arXiv:2307.09288, 2023.
- Speech understanding device 101 Input device 102 Display device 103 External I/F 103a Recording medium 104 Communication I/F 105 RAM 106 ROM 107 Auxiliary storage device 108 Processor 109 Bus 201 Model construction unit 202 Model learning unit 203 Trained speech encoder storage unit 204 Trained large-scale language model storage unit 205 Speech understanding model storage unit 206 Training dataset storage unit 207 Output sentence generation unit 208 Trained speech understanding model storage unit 209 Test data storage unit 211 Training data input unit 212 Speech encoding unit 213 First integration unit 214 Second integration unit 215 Linear transformation unit 216 Posterior probability calculation unit 217 Parameter update unit 218 End determination unit 221 Test data input unit 222 Speech encoding unit 223 First integration unit 224 Second integration unit 225 Linear transformation unit 226 Generation unit 227 Output unit
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- Multimedia (AREA)
- Software Systems (AREA)
- Acoustics & Sound (AREA)
- Human Computer Interaction (AREA)
- Evolutionary Computation (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Signal Processing (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- Child & Adolescent Psychology (AREA)
- General Health & Medical Sciences (AREA)
- Hospice & Palliative Care (AREA)
- Psychiatry (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Mathematical Physics (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Data Mining & Analysis (AREA)
- Medical Informatics (AREA)
- Machine Translation (AREA)
Abstract
Description
This disclosure relates to a learning device, an inference device, a learning method, an inference method, and a program.
It is known that speech contains three types of information: linguistic information, non-linguistic information, and paralinguistic information (hereinafter, these three types of information will be collectively referred to as "speech information") (Non-Patent Document 1). Furthermore, technology for recognizing non-linguistic and paralinguistic information from speech is known (Non-Patent Document 2).
On the other hand, in the field of image processing, there is a technology known as image understanding technology that outputs various information contained in images in natural language (Non-Patent Document 3).
By using a speech encoder instead of the image encoder used in image understanding technology, it is thought possible to realize a technology that outputs speech information contained in speech in natural language (hereinafter also referred to as "speech understanding technology"). However, due to differences in the configurations of image and speech encoders and differences in the properties of images and speech, it is difficult to realize speech understanding technology simply by using a speech encoder instead of an image encoder.
This disclosure has been made in light of the above points and aims to realize speech understanding technology.
A learning device according to one aspect of the present disclosure includes an input unit that inputs learning data including speech, a first sentence related to the speech, and a second sentence corresponding to the first sentence; a speech feature generation unit that generates information representing features of the speech for each predetermined time interval based on a speech feature extractor configured with multiple layers; a first integration unit that generates first integrated information for each time interval based on first parameters, integrating the information representing the features generated by each of the predetermined multiple layers of the speech feature extractor; a second integration unit that generates second integrated information for each time interval based on second parameters, integrating the first integrated information in the time direction; a calculation unit that calculates the generation probability of a third sentence corresponding to the first sentence based on the first sentence, the second integrated information, and a language model; and a learning unit that learns learning target parameters including the first parameter and the second parameter based on the generation probability of the third sentence and the second sentence.
Speech understanding technology can thereby be realized.
Below, one embodiment of the present invention will be described in detail with reference to the drawings.
<Non-verbal and para-linguistic information recognition technology>
It is known that speech contains speech information (i.e., linguistic information, non-linguistic information, and paralinguistic information) (Non-Patent Document 1). Here, linguistic information refers to information about the words spoken by a speaker. Non-linguistic information refers to information that is not linguistic but cannot be changed at will (e.g., information that represents the speaker's identity, gender, emotions, etc.). Paralinguistic information refers to information that is not linguistic but can be changed at will (e.g., information that represents intentions, attitudes, etc.).
In conventional technologies for recognizing non-verbal and paralinguistic information from speech, a finite number of states are defined in advance, and then an estimation is made of which of these states the speech most closely matches. For example, the technology described in Non-Patent Document 2 uses a statistical model based on deep learning to estimate which emotional state, such as anger, joy, or sadness, the speech most closely matches. However, such conventional technologies are unable to recognize fine-grained non-verbal and paralinguistic information. For example, they are unable to estimate emotional states such as "irritated," which are not predefined, or "angry and sad," which spans multiple emotional states. This results in a problem of reduced processing accuracy in downstream systems that use the results of non-verbal and paralinguistic information recognition (for example, the accuracy of call analysis in contact center systems, the accuracy of dialogue control and analysis in voice dialogue systems, etc.).
<Image understanding technology>
In the field of image processing, there is a known technology called image understanding, which outputs various information contained in an image in natural sentences (Non-Patent Document 3). Note that a natural sentence is a sentence written in a natural language (e.g., a language used by humans for communication, such as Japanese, English, or Chinese). Image understanding technology is composed of a deep learning model that combines a large-scale language model, which has acquired relationships and co-occurrences between words from a large amount of text data, with an image encoder, which extracts information such as objects in an image from that image. When this deep learning model is given an input image and a natural-sentence question about the input image (e.g., a sentence such as "What do you think about the logo in this image?"), it outputs an output sentence corresponding to the question (e.g., a sentence such as "This logo is a simple and symbolic logo").
A deep learning model that realizes image understanding technology becomes able to infer various information contained in an image in natural sentences by being given sets of an input image, a natural-sentence question about that input image, and the correct output sentence corresponding to that question. For example, in response to the question "What color is the logo in this image?", a sentence such as "It is pink." can be output as the output sentence.
<Speech understanding technology>
By realizing speech understanding technology that outputs the speech information contained in speech in natural sentences, it becomes possible, for example, to recognize diverse speech information, including fine-grained non-linguistic and paralinguistic information, and to output that speech information in natural sentences.
A simple way to realize speech understanding technology would be to use, instead of the image encoder used in image understanding technology, a speech encoder that extracts information representing the features of speech from that speech. In practice, however, it is difficult to realize speech understanding technology with this method. There are two reasons for this.
The first reason is the difference in configuration between image encoders and speech encoders. Existing speech encoders (e.g., wav2vec2.0 (Reference 1), WavLM (Reference 2), etc.) extract different information at each layer, so diverse speech information cannot be understood by using only the information finally output by the speech encoder, as is done with the image encoder in image understanding technology. For example, Reference 2 suggests that information close to physical properties, such as speaker information, is extracted in the lower layers of the speech encoder, while information close to abstract properties, such as phonology, is extracted in the higher layers. Therefore, it is considered difficult to output accurate natural sentences unless, for example, the information extracted in the lower layers is used when recognizing speaking style, while the information extracted in the higher layers is used for speech recognition.
The second reason is the difference in the nature of images and speech. Unlike images, speech has a variable length, so time-direction processing is considered necessary to recognize the speech information contained in speech of any length and output that speech information in natural sentences.
<Speech understanding model>
Therefore, the following proposes a deep learning model (hereinafter referred to as a "speech understanding model") that can solve the problems caused by the above two factors. This speech understanding model makes it possible to recognize diverse speech information, including fine-grained non-linguistic and paralinguistic information, from speech of any length and to output that speech information in natural sentences. In other words, it realizes a speech understanding technology that outputs, in natural sentences, the diverse speech information contained in speech of any length (more specifically, speech information related to at least one of the physical and abstract properties of the speech). This can also be expected to improve the processing accuracy of downstream systems that use the recognition results of non-linguistic and paralinguistic information (e.g., the accuracy of call analysis in contact center systems, or the accuracy of dialogue control and analysis in voice dialogue systems).
An example of the speech understanding model 1000 proposed in this embodiment will be described with reference to Fig. 1. Fig. 1 is a diagram showing an example of the speech understanding model 1000.
As shown in Fig. 1, the speech understanding model 1000 is composed of a speech encoder 1100, a speech encoder output integration block 1200, a temporal information integration block 1300, a linear transformation layer 1400, and a large-scale language model 1500.
The speech encoder 1100 is any existing speech encoder (for example, wav2vec2.0, WavLM, etc.). The speech encoder 1100 receives speech (hereinafter also referred to as "input speech") as input and outputs information representing the features of that speech. At this time, for each predetermined time interval, the speech encoder 1100 receives the input speech in that time interval and outputs information representing the features of the speech in that time interval.
The speech encoder output integration block 1200 integrates, in each time interval, the outputs of multiple pre-specified layers from among the outputs of the layers of the speech encoder 1100. Hereinafter, each of the multiple pre-specified layers will be referred to as an "integration target layer." It is assumed that the integration target layers all have the same number of output dimensions. The integration target layers are specified by the user or the like from among the layers of the speech encoder 1100 whose numbers of output dimensions are the same.
Hereinafter, the index representing a time interval is denoted by t, where t = 1, ..., T. T is the index of the last time interval of the input speech, and its value may vary depending on the length of the input speech. Furthermore, the number of integration target layers is denoted by N, and the output of the n-th integration target layer (where n = 1, ..., N) in time interval t is denoted by h_n(t). Note that, for each n = 1, ..., N, h_n(t) represents some feature of the input speech (e.g., a physical or abstract property), and each h_n(t) can be described as a vector with a predetermined number of dimensions; hereinafter, each h_n(t) will therefore be referred to as a "first speech feature vector."
In this case, in each time interval t, the speech encoder output integration block 1200 receives each first speech feature vector h_n(t) in that time interval t as input and outputs a vector that integrates the first speech feature vectors h_n(t). Hereinafter, the vector obtained by integrating the first speech feature vectors h_n(t) will be referred to as the "first integrated vector" and denoted by e(t).
For example, as shown in Fig. 2, suppose the speech encoder 1100 is composed of one convolutional layer and N Transformer layers, and that these N Transformer layers are the integration target layers. In this case, the output of the n-th Transformer layer in time interval t is the first speech feature vector h_n(t), and the speech encoder output integration block 1200 creates the first integrated vector e(t) by integrating these first speech feature vectors h_n(t).
The first speech feature vectors h_n(t) can be integrated by, for example, a weighted sum or a sum of linear transformations. When a weighted sum is used, the first integrated vector e(t) is created by e(t) = α_1 h_1(t) + ... + α_N h_N(t). Here, α_1, ..., α_N are also called weighting coefficients and are training target parameters that satisfy α_1 + ... + α_N = 1. On the other hand, when a sum of linear transformations is used, the first integrated vector e(t) is created by e(t) = ((W_1 h_1(t) + b_1) + ... + (W_N h_N(t) + b_N)) / N. Here, W_1, ..., W_N, b_1, ..., b_N are also called linear transformation coefficients and are training target parameters.
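As a concrete illustration only, the two integration methods above could be sketched roughly as follows in Python/PyTorch, under the assumption that the layer outputs h_n(t) are stacked into a tensor of shape (N, T, D); the class and variable names are illustrative and are not prescribed by this disclosure.

```python
# Minimal sketch (PyTorch assumed) of the speech encoder output integration block 1200.
import torch
import torch.nn as nn

class LayerWeightedSum(nn.Module):
    """e(t) = alpha_1 h_1(t) + ... + alpha_N h_N(t), with the alphas summing to 1."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_layers))  # training target parameters

    def forward(self, h):                                     # h: (N, T, D) stacked layer outputs
        alpha = torch.softmax(self.logits, dim=0)             # enforces alpha_1 + ... + alpha_N = 1
        return torch.einsum("n,ntd->td", alpha, h)            # (T, D): e(1), ..., e(T)

class LayerLinearSum(nn.Module):
    """e(t) = ((W_1 h_1(t) + b_1) + ... + (W_N h_N(t) + b_N)) / N."""
    def __init__(self, num_layers: int, dim: int):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])

    def forward(self, h):                                      # h: (N, T, D)
        return torch.stack([p(h[n]) for n, p in enumerate(self.proj)]).mean(dim=0)
```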
The temporal information integration block 1300 integrates the first integrated vectors e(t) in the time direction. That is, the temporal information integration block 1300 receives the first integrated vector e(t) of each time interval t as input and outputs a vector that integrates the first integrated vectors e(t) in the time direction. In doing so, it is not sufficient to simply integrate the first integrated vectors e(t) over all time intervals t; it is necessary to consider which parts should be emphasized in time and which should not. For example, when recognizing a speaker's emotions or speaker information from speech, time intervals corresponding to short pauses or breaths should be ignored, and the time intervals in which the speaker is actually speaking should be emphasized. For this reason, the temporal information integration block 1300 integrates the first integrated vectors e(t) by a weighted sum. Hereinafter, the vector obtained by integrating the first integrated vectors e(t) in the time direction will be referred to as the "second integrated vector" and denoted by v.
For example, as shown in Fig. 3, when the first integrated vectors e(1), ..., e(T) are input, the temporal information integration block 1300 creates the second integrated vector v by integrating these first integrated vectors e(1), ..., e(T) in the time direction.
As a method for integrating the first integrated vectors e(1), ..., e(T), a self-attentive pooling layer can be used, for example. Let E = [e(1), ..., e(T)]^τ. When a self-attentive pooling layer is used, the second integrated vector v is created by v = aE. Here, a = softmax(ReLU(W_1'E)W_2'), where W_1' and W_2' are training target parameters and τ is a symbol representing transposition. As a result, the second integrated vector v is obtained by integrating the first integrated vectors e(1), ..., e(T) with a weighted sum.
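For reference, such a self-attentive pooling layer might look like the following sketch, which continues the PyTorch assumptions of the previous example (E is a (T, D) tensor of first integrated vectors); the names and the hidden size are illustrative.

```python
class SelfAttentivePooling(nn.Module):
    """v = aE with a = softmax(ReLU(E W_1') W_2'): a learned weighted sum over time."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)   # W_1' (training target parameter)
        self.w2 = nn.Linear(hidden, 1, bias=False)     # W_2' (training target parameter)

    def forward(self, E):                              # E: (T, D) first integrated vectors
        scores = self.w2(torch.relu(self.w1(E)))       # (T, 1) per-interval scores
        a = torch.softmax(scores, dim=0)               # attention weights over time
        return (a * E).sum(dim=0)                      # v: (D,) second integrated vector
```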
Note that the temporal information integration block 1300 may also integrate the first integrated vectors e(1), ..., e(T) in the time direction using a convolutional neural network. In this case, a matrix V = [v(1), ..., v(K)]^τ composed of K second integrated vectors v(1), ..., v(K) is obtained by V = conv1D(E). In this case, the learnable parameters of the one-dimensional convolutional neural network are the training target parameters. K is an integer equal to or greater than 1 that is determined by the window size of the one-dimensional convolutional neural network and the sequence length T of the first integrated vectors e(1), ..., e(T).
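Under the same assumptions, the convolutional alternative could be sketched as follows; the kernel size and stride are arbitrary illustrative values.

```python
class ConvTemporalIntegration(nn.Module):
    """V = conv1D(E): K second integrated vectors, where K follows from T and the window size."""
    def __init__(self, dim: int, kernel_size: int = 16, stride: int = 8):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, stride=stride)  # learnable (training target) parameters

    def forward(self, E):                                # E: (T, D)
        x = E.transpose(0, 1).unsqueeze(0)               # (1, D, T) layout expected by Conv1d
        return self.conv(x).squeeze(0).transpose(0, 1)   # V: (K, D)
```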
The linear transformation layer 1400 linearly transforms the second integrated vector v. That is, the linear transformation layer 1400 receives the second integrated vector v as input and outputs a vector obtained by linearly transforming the second integrated vector v. Hereinafter, the vector obtained by linearly transforming the second integrated vector v will be referred to as the "second speech feature vector" and denoted by w. The second speech feature vector w is created by w = Wv + b. Here, W and b are also called linear transformation coefficients and are training target parameters. Note that if the number of dimensions of the token embedding space in the large-scale language model 1500 is M, the number of dimensions of the second speech feature vector w is also M. This means that the linear transformation layer 1400 creates a second speech feature vector sequence of length 1 and dimensionality M.
When the matrix V = [v(1), ..., v(K)]^τ is output from the temporal information integration block 1300, for example, the K vectors obtained by linearly transforming v(1), ..., v(K) can be used as the second speech feature vectors w(1), ..., w(K). This means that a second speech feature vector sequence of length K and dimensionality M is created.
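The linear transformation layer 1400 itself is a single affine map into the M-dimensional token embedding space; a minimal sketch under the same assumptions:

```python
class SpeechToLLMProjection(nn.Module):
    """w = Wv + b: maps the second integrated vector(s) into the LLM token embedding space."""
    def __init__(self, dim: int, llm_embed_dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, llm_embed_dim)  # W, b (training target parameters)

    def forward(self, v):                            # v: (D,) or (K, D)
        return self.linear(v)                        # w: (M,) or (K, M)
```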
The large-scale language model 1500 is any existing large-scale language model (LLM: Large Language Model). The large-scale language model 1500 receives as input an input sentence, which is a natural-sentence question about the input speech, and the second speech feature vector w (or the second speech feature vectors w(1), ..., w(K)), and outputs an output sentence corresponding to the question. Such an output sentence is generated according to the posterior probability given the input sentence and the second speech feature vector w (or the second speech feature vectors w(1), ..., w(K)). Note that in the large-scale language model 1500, the posterior probability is calculated by processing a vector sequence obtained by concatenating the embedding vector sequence representing the embedded representation of the tokens that make up the input sentence with the second speech feature vector w (or the second speech feature vectors w(1), ..., w(K)).
The example shown in Fig. 1 illustrates a case in which, given the input sentence "Tell me the emotional state of the person in the following speech," the output sentence "This person is male and is slightly irritated" is generated. Note that, for example, Llama 2 7B (Reference 3) or the like can be used as the large-scale language model 1500. However, a language model other than a large-scale language model may also be used as the large-scale language model 1500, as long as it can generate an output sentence according to the posterior probability given an input sentence and the second speech feature vector w (or the second speech feature vectors w(1), ..., w(K)).
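One way the concatenation of the input sentence's token embeddings with the second speech feature vector(s) might be realized is sketched below. The Hugging Face transformers interface and the Llama 2 checkpoint name are assumptions for illustration only; the disclosure does not prescribe a specific implementation.

```python
# Hedged sketch: feed prompt token embeddings, prefixed by the projected speech
# feature vector(s), to a causal LLM and obtain per-position token logits.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # example choice
llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

def posterior_logits(prompt: str, w: torch.Tensor) -> torch.Tensor:
    """w: (K, M) second speech feature vectors already projected to the LLM embedding size M."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids        # (1, L) prompt token ids
    text_emb = llm.get_input_embeddings()(ids)                    # (1, L, M) token embeddings
    inputs_embeds = torch.cat([w.unsqueeze(0), text_emb], dim=1)  # (1, K+L, M) combined sequence
    return llm(inputs_embeds=inputs_embeds).logits                # softmax over the vocabulary gives p(s_i)
```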
For simplicity, the following description assumes that a self-attentive pooling layer is used in the temporal information integration block 1300, so that the second integrated vector v is obtained in the temporal information integration block 1300 and the second speech feature vector w is created in the linear transformation layer 1400. However, if a convolutional neural network is used in the temporal information integration block 1300, the following embodiment can be applied in the same way by reading "second integrated vector v" as "second integrated vectors v(1), ..., v(K)" and "second speech feature vector w" as "second speech feature vectors w(1), ..., w(K)."
Note that the speech encoder 1100 may also be referred to as, for example, a "speech feature extractor" or a "speech coder." The large-scale language model 1500 may also be referred to as, for example, a "language model," a "natural language model," or a "natural language processing model." Furthermore, the components that make up the speech understanding model 1000 (the speech encoder 1100, the speech encoder output integration block 1200, the temporal information integration block 1300, the linear transformation layer 1400, and the large-scale language model 1500) may each be referred to as, for example, a "module."
The following describes a speech understanding device 10 that realizes speech understanding technology using the speech understanding model 1000 shown in Fig. 1. The speech understanding device 10 operates in three phases: "model construction," in which the speech understanding model 1000 is constructed; "learning," in which the training target parameters of the speech understanding model 1000 are learned; and "inference," in which output sentences are generated by the speech understanding model 1000 using the learned parameters. During learning, the speech understanding device 10 is given a set of learning data (hereinafter also referred to as a "training dataset"), each item of which consists of an input speech, an input sentence that is a natural-sentence question about the input speech, and the correct output sentence corresponding to that question. During inference, the speech understanding device 10 is given test data consisting of an input speech and an input sentence that is a natural-sentence question about the input speech.
Note that the speech understanding device 10 at the time of model construction may be referred to as, for example, a "model construction device" or a "model creation device." The speech understanding device 10 during learning may be referred to as, for example, a "learning device," a "parameter estimation device," or a "parameter optimization device." The speech understanding device 10 during inference may be referred to as, for example, an "inference device," an "estimation device," or a "natural sentence generation device."
In the following, for simplicity, model construction is treated as part of learning, and the case in which the speech understanding device 10 during learning also constructs the speech understanding model 1000 will be described.
[During learning]
The speech understanding device 10 during learning will be described below.
<Example hardware configuration of the speech understanding device 10 during learning>
An example of the hardware configuration of the speech understanding device 10 during learning will be described with reference to Fig. 4. Fig. 4 is a diagram showing an example of the hardware configuration of the speech understanding device 10 during learning.
As shown in Fig. 4, the speech understanding device 10 during learning has an input device 101, a display device 102, an external I/F 103, a communication I/F 104, a RAM (Random Access Memory) 105, a ROM (Read Only Memory) 106, an auxiliary storage device 107, and a processor 108. These pieces of hardware are communicably connected to one another via a bus 109.
The input device 101 is, for example, a keyboard, a mouse, a touch panel, physical buttons, or the like. The display device 102 is, for example, a display, a display panel, or the like. Note that the speech understanding device 10 does not need to have at least one of the input device 101 and the display device 102, for example.
The external I/F 103 is an interface with an external device such as a recording medium 103a. Examples of the recording medium 103a include a CD (Compact Disc), a DVD (Digital Versatile Disk), an SD memory card (Secure Digital memory card), and a USB (Universal Serial Bus) memory card.
The communication I/F 104 is an interface for connecting to a communication network. The RAM 105 is a volatile semiconductor memory (storage device) that temporarily holds programs and data. The ROM 106 is a non-volatile semiconductor memory (storage device) that can hold programs and data even when the power is turned off. The auxiliary storage device 107 is a non-volatile storage device such as an HDD (Hard Disk Drive), an SSD (Solid State Drive), or a flash memory. The processor 108 is one of various arithmetic devices such as a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit).
Note that the hardware configuration shown in Fig. 4 is an example, and the hardware configuration of the speech understanding device 10 is not limited to this. For example, the speech understanding device 10 may have multiple auxiliary storage devices 107 or multiple processors 108, may lack some of the illustrated hardware, or may have various hardware other than the illustrated hardware.
<Example functional configuration of the speech understanding device 10 during learning>
An example of the functional configuration of the speech understanding device 10 during learning will be described with reference to Fig. 5. Fig. 5 is a diagram showing an example of the functional configuration of the speech understanding device 10 during learning.
As shown in Fig. 5, the speech understanding device 10 during learning has a model construction unit 201 and a model learning unit 202. These units are realized, for example, by processing that one or more programs installed in the speech understanding device 10 cause the processor 108 or the like to execute. The speech understanding device 10 during learning also has a trained speech encoder storage unit 203, a trained large-scale language model storage unit 204, a speech understanding model storage unit 205, and a training dataset storage unit 206. Each of these storage units is realized, for example, by a storage area such as the auxiliary storage device 107. However, at least one of these storage units may be realized by a storage area of a storage device (for example, a storage device of a database server or the like) that is communicably connected to the speech understanding device 10.
The model construction unit 201 constructs the speech understanding model 1000 shown in Fig. 1 using the trained speech encoder stored in the trained speech encoder storage unit 203 and the trained large-scale language model stored in the trained large-scale language model storage unit 204. That is, the model construction unit 201 constructs the speech understanding model 1000 shown in Fig. 1 with the trained speech encoder as the speech encoder 1100 and the trained large-scale language model as the large-scale language model 1500. At this time, the model construction unit 201 initializes the training target parameters of the speech encoder output integration block 1200, the training target parameters of the temporal information integration block 1300, and the training target parameters of the linear transformation layer 1400. The training target parameters may be initialized by any method, for example, by random initialization or by sampling from a predetermined distribution. Note that a trained speech encoder is a speech encoder whose parameters have already been learned. Similarly, a trained large-scale language model is a large-scale language model whose parameters have already been learned.
The model construction unit 201 also stores the speech understanding model 1000 in the speech understanding model storage unit 205.
The model learning unit 202 trains the speech understanding model 1000 stored in the speech understanding model storage unit 205 using the training dataset stored in the training dataset storage unit 206. At this time, the model learning unit 202 learns the training target parameters of the speech encoder output integration block 1200, the training target parameters of the temporal information integration block 1300, and the training target parameters of the linear transformation layer 1400, while keeping the parameters of the speech encoder 1100 and the large-scale language model 1500 fixed. More specifically, the model learning unit 202 uses, as a loss function, the cross entropy between the output sentence generated by the speech understanding model 1000 when given the input speech and input sentence contained in a piece of training data and the correct output sentence contained in that training data, and learns the training target parameters with an existing optimization method so as to minimize this loss function. A detailed example of the functional configuration of the model learning unit 202 will be described later.
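A minimal sketch of this parameter handling is given below, under the assumption that the modules sketched earlier are instantiated as layer_integration (block 1200), temporal_integration (block 1300), and projection (layer 1400), and that speech_encoder and llm denote the pretrained encoder and language model; all names are illustrative.

```python
# Freeze the speech encoder 1100 and the large-scale language model 1500;
# only the two integration blocks and the linear transformation layer are updated.
for p in speech_encoder.parameters():
    p.requires_grad = False
for p in llm.parameters():
    p.requires_grad = False

trainable_params = (list(layer_integration.parameters())       # block 1200
                    + list(temporal_integration.parameters())  # block 1300
                    + list(projection.parameters()))           # layer 1400
optimizer = torch.optim.SGD(trainable_params, lr=1e-4)          # any SGD-based optimizer would do
```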
The trained speech encoder storage unit 203 stores the trained speech encoder. The trained large-scale language model storage unit 204 stores the trained large-scale language model. The speech understanding model storage unit 205 stores the speech understanding model 1000 constructed by the model construction unit 201. The training dataset storage unit 206 stores the given training dataset.
<<Training dataset>>
An example of the training dataset stored in the training dataset storage unit 206 will be described with reference to Fig. 6. Fig. 6 is a diagram showing an example of the training dataset.
As shown in Fig. 6, a training dataset is made up of one or more pieces of training data, each of which includes an input speech, an input sentence, and a correct output sentence. In general, a training dataset is made up of a large number of pieces of training data.
The input speech is speech data input to the speech understanding model 1000. The input sentence is text data representing a natural-sentence question about the input speech. The correct output sentence is text data representing the natural-sentence response or answer that is correct for the question represented by the input sentence. Note that the input speech does not necessarily have to be speech data in which a human voice is recorded; it may be audio data in which any sound is recorded. The correct output sentence may also be called, for example, "teacher data."
For example, the training data in the first row of the example shown in Fig. 6 includes the input speech "Speech A," the input sentence "Please transcribe this speech," and the correct output sentence "Explain what is going on." Similarly, the training data in the second row includes the input speech "Speech A," the input sentence "Tell me how this speech is spoken," and the correct output sentence "A woman is speaking quickly and loudly." Similarly, the training data in the third row includes the input speech "Speech B," the input sentence "Tell me how this speech is spoken," and the correct output sentence "A man is speaking slowly and calmly." Similarly, the training data in the fourth row includes the input speech "Speech B," the input sentence "Tell me the gender of the speaker of this speech," and the correct output sentence "The speaker is male." Similarly, the training data in the fifth row includes the input speech "Speech C," the input sentence "What is the emotion of the speaker of this speech?", and the correct output sentence "This speaker is somewhat irritated."
In this way, the training dataset is composed of training data, each represented by a set of an input speech, an input sentence, and a correct output sentence. Note that, as shown in Fig. 6, the training dataset may contain multiple pieces of training data with different input sentences and correct output sentences for the same input speech.
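Purely for illustration, such a training dataset could be held in a structure like the following; the file names and sentences are placeholders and do not reproduce the actual data of Fig. 6.

```python
# Illustrative in-memory representation of the training dataset; each item pairs
# an input speech with an input sentence (question) and a correct output sentence.
training_dataset = [
    {"speech": "speech_a.wav",
     "input_sentence": "Tell me how this speech is spoken.",
     "correct_output": "A woman is speaking quickly and loudly."},
    {"speech": "speech_b.wav",
     "input_sentence": "Tell me the gender of the speaker of this speech.",
     "correct_output": "The speaker is male."},
]
```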
<<Detailed functional configuration example of the model learning unit 202>>
A detailed example of the functional configuration of the model learning unit 202 will be described with reference to Fig. 7. Fig. 7 is a diagram showing an example of the detailed functional configuration of the model learning unit 202.
As shown in Fig. 7, the model learning unit 202 includes a learning data input unit 211, a speech encoding unit 212, a first integration unit 213, a second integration unit 214, a linear transformation unit 215, a posterior probability calculation unit 216, a parameter update unit 217, and an end determination unit 218.
The learning data input unit 211 inputs one piece of training data from the training dataset stored in the training dataset storage unit 206.
The speech encoding unit 212 is realized by the speech encoder 1100 included in the speech understanding model 1000. The speech encoding unit 212 receives the input speech contained in the training data input by the learning data input unit 211, and outputs N first speech feature vectors h_n(t) (n = 1, ..., N) from the N integration target layers in each time interval t (t = 1, ..., T).
The first integration unit 213 is realized by the speech encoder output integration block 1200 included in the speech understanding model 1000. In each time interval t (t = 1, ..., T), the first integration unit 213 receives the first speech feature vectors h_n(t) (n = 1, ..., N) of that time interval t as input and outputs the first integrated vector e(t) that integrates these first speech feature vectors h_n(t) (n = 1, ..., N).
The second integration unit 214 is realized by the temporal information integration block 1300 included in the speech understanding model 1000. The second integration unit 214 receives the first integrated vector e(t) of each time interval t as input and outputs the second integrated vector v that integrates these first integrated vectors e(t) (t = 1, ..., T) in the time direction.
The linear transformation unit 215 is realized by the linear transformation layer 1400 included in the speech understanding model 1000. The linear transformation unit 215 receives the second integrated vector v as input and outputs the second speech feature vector w obtained by linearly transforming the second integrated vector v.
The posterior probability calculation unit 216 is realized by the large-scale language model 1500 included in the speech understanding model 1000. The posterior probability calculation unit 216 receives as input the input sentence contained in the training data input by the learning data input unit 211 and the second speech feature vector w, and calculates the posterior probability of the output sentence given that input sentence and that second speech feature vector w.
More specifically, let s_i be the i-th token constituting the output sentence (where s_1 is the token representing the beginning of the sentence). Furthermore, let p(s_1) be the posterior probability that token s_1 is generated given the input sentence and the second speech feature vector w, and let p(s_i) (where i ≥ 2) be the posterior probability that token s_i is generated given the input sentence, the second speech feature vector w, and s_1, ..., s_{i-1}. The posterior probability calculation unit 216 then calculates the posterior probability p(s_i) for i = 1, ..., I. I is the number of tokens contained in the output sentence (that is, the length of the output sentence); for example, it may be set to the length of the correct output sentence contained in the training data input by the learning data input unit 211. Here, the posterior probability p(s_i) is represented, for example, as an M-dimensional vector whose m-th element is the probability that the m-th type of token is generated and whose element values sum to 1, where M is the number of dimensions of the token embedding space.
Note that a token is the basic processing unit used when a language model, such as a large-scale language model, processes a character string. A typical example of a token is a word, but tokens are not limited to words; they may be, for example, characters, morphemes, subwords, or other coherent character strings.
The parameter update unit 217 uses the posterior probabilities calculated by the posterior probability calculation unit 216 and the correct output sentence contained in the training data input by the learning data input unit 211 to learn the training target parameters of the speech understanding model 1000 with an existing optimization method.
More specifically, let s_i' be the i-th token constituting the correct output sentence (where s_1' is the token representing the beginning of the sentence and s_I' is the token representing the end of the sentence). Also, let p(s_i') be the probability that token s_i' is generated. The probability p(s_i') is represented, for example, as an M-dimensional vector in which only the element corresponding to token s_i' has the value 1 and all other elements have the value 0, where M is the number of dimensions of the token embedding space. The parameter update unit 217 then uses the sum of -p(s_i') log p(s_i) over i = 1, ..., I (that is, the cross entropy) as a loss function, and updates the training target parameters with an existing optimization method so as to minimize this loss function. The optimization method used to update the training target parameters is not limited to any particular method; for example, an online optimization method based on stochastic gradient descent can be used.
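In code, this token-level cross entropy reduces to a standard cross-entropy loss over the output positions; a sketch under the earlier PyTorch assumptions:

```python
import torch.nn.functional as F

def sentence_loss(output_logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """Sum over i of -p(s_i') log p(s_i), i.e., cross entropy between the one-hot
    reference tokens s_i' and the predicted distributions p(s_i).
    output_logits: (I, vocab_size) scores for the output positions; target_ids: (I,)."""
    return F.cross_entropy(output_logits, target_ids, reduction="sum")
```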
The end determination unit 218 determines whether or not to end the updating of the training target parameters. At this time, the end determination unit 218 determines to end the updating of the training target parameters if a predetermined end condition is satisfied, and determines not to end it otherwise. As a result, the training target parameters of the speech encoder output integration block 1200, the training target parameters of the temporal information integration block 1300, and the training target parameters of the linear transformation layer 1400 are repeatedly updated until the predetermined end condition is satisfied. Examples of the predetermined end condition include the number of updates of the training target parameters reaching a predetermined number, the number of epochs reaching a predetermined number, the value of the loss function falling below a predetermined value, and the loss function converging.
<Model construction processing>
The model construction processing will be described below with reference to Fig. 8. Fig. 8 is a flowchart showing an example of the model construction processing.
The model construction unit 201 constructs the speech understanding model 1000 shown in Fig. 1 using the trained speech encoder stored in the trained speech encoder storage unit 203 and the trained large-scale language model stored in the trained large-scale language model storage unit 204 (step S101). At this time, the model construction unit 201 initializes the training target parameters of the speech encoder output integration block 1200, the training target parameters of the temporal information integration block 1300, and the training target parameters of the linear transformation layer 1400 by any method. This constructs an untrained speech understanding model 1000.
The model construction unit 201 then stores the speech understanding model 1000 constructed in step S101 above in the speech understanding model storage unit 205 (step S102).
<Model learning processing>
The model learning processing will be described below with reference to Fig. 9. Fig. 9 is a flowchart showing an example of the model learning processing.
The learning data input unit 211 of the model learning unit 202 inputs one piece of training data from the training dataset stored in the training dataset storage unit 206 (step S201). The learning data input unit 211 inputs, for example, one piece of training data that has not yet been input in the current epoch from among the training data that make up the training dataset. The epoch count starts from 0 and is incremented by 1 each time all of the training data in the training dataset has been input.
The speech encoding unit 212 of the model learning unit 202 receives the input speech contained in the training data input in step S201 above, and outputs N first speech feature vectors h_n(t) (n = 1, ..., N) from the N integration target layers in each time interval t (t = 1, ..., T) (step S202).
In each time interval t (t = 1, ..., T), the first integration unit 213 of the model learning unit 202 receives the first speech feature vectors h_n(t) (n = 1, ..., N) of that time interval t as input and outputs the first integrated vector e(t) that integrates these first speech feature vectors h_n(t) (n = 1, ..., N) (step S203).
The second integration unit 214 of the model learning unit 202 receives the first integrated vector e(t) of each time interval t as input and outputs the second integrated vector v that integrates these first integrated vectors e(t) (t = 1, ..., T) in the time direction (step S204).
The linear transformation unit 215 of the model learning unit 202 receives the second integrated vector v as input and outputs the second speech feature vector w obtained by linearly transforming the second integrated vector v (step S205).
The posterior probability calculation unit 216 of the model learning unit 202 receives as input the input sentence contained in the training data input in step S201 above and the second speech feature vector w output in step S205 above, and calculates the posterior probability of the output sentence given that input sentence and that second speech feature vector w (step S206).
The parameter update unit 217 of the model learning unit 202 uses the posterior probabilities calculated in step S206 above and the correct output sentence contained in the training data input in step S201 above to learn the training target parameters of the speech understanding model 1000 with an existing optimization method (step S207). That is, the parameter update unit 217 uses, as a loss function, the cross entropy (specifically, the sum of -p(s_i') log p(s_i) over i = 1, ..., I) between the output sentence generated by the speech understanding model 1000 when given the input speech and input sentence contained in the training data input in step S201 above and the correct output sentence contained in that training data, and updates the training target parameters with an existing optimization method so as to minimize this loss function.
The end determination unit 218 of the model learning unit 202 determines whether or not to end the updating of the training target parameters (step S208). That is, the end determination unit 218 determines to end the updating of the training target parameters if a predetermined end condition is satisfied, and determines not to end it otherwise.
If it is determined in step S208 above that the updating of the training target parameters is not to be ended, the model learning unit 202 returns to step S201 above. As a result, steps S201 to S207 above are repeatedly executed until the predetermined end condition is satisfied.
On the other hand, if it is determined in step S208 above that the updating of the training target parameters is to be ended, the model learning unit 202 ends the model learning processing. As a result, the training target parameters are learned and the trained speech understanding model 1000 is obtained.
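Putting the pieces together, one pass of steps S201 to S207 might look like the following sketch. All names follow the earlier illustrative sketches; stacked_layer_outputs is an assumed helper that runs the speech encoder and stacks the outputs of the N integration target layers, and the alignment between LLM output positions and correct-output tokens is deliberately simplified.

```python
def training_step(example: dict) -> float:
    h = stacked_layer_outputs(example["speech"])                 # S202: (N, T, D), assumed helper
    E = layer_integration(h)                                     # S203: (T, D) first integrated vectors
    v = temporal_integration(E)                                  # S204: (D,) second integrated vector
    w = projection(v).unsqueeze(0)                               # S205: (1, M) second speech feature vector
    logits = posterior_logits(example["input_sentence"], w)      # S206: (1, K+L, vocab)
    target = tokenizer(example["correct_output"], return_tensors="pt").input_ids[0]
    loss = sentence_loss(logits[0, -len(target):], target)       # S207 (position alignment simplified)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```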
[During inference]
The speech understanding device 10 during inference will be described below. The following mainly describes the differences from learning; descriptions of points that may be the same as during learning are omitted as appropriate.
<Example hardware configuration of the speech understanding device 10 during inference>
The hardware configuration of the speech understanding device 10 during inference may be the same as during learning, and its description is therefore omitted.
<Example functional configuration of the speech understanding device 10 during inference>
An example of the functional configuration of the speech understanding device 10 during inference will be described with reference to Fig. 10. Fig. 10 is a diagram showing an example of the functional configuration of the speech understanding device 10 during inference.
As shown in Fig. 10, the speech understanding device 10 during inference has an output sentence generation unit 207. The output sentence generation unit 207 is realized, for example, by processing that one or more programs installed in the speech understanding device 10 cause the processor 108 or the like to execute. The speech understanding device 10 during inference also has a trained speech understanding model storage unit 208 and a test data storage unit 209. Each of these storage units is realized, for example, by a storage area such as the auxiliary storage device 107. However, at least one of these storage units may be realized by a storage area of a storage device (for example, a storage device of a database server or the like) that is communicably connected to the speech understanding device 10.
The output sentence generation unit 207 uses the test data stored in the test data storage unit 209 and the trained speech understanding model 1000 stored in the trained speech understanding model storage unit 208 to generate and output an output sentence corresponding to the question represented by the input sentence contained in that test data (that is, text data representing a natural-sentence response or answer to the question). Here, test data is data represented by a pair of an input speech and an input sentence that is a natural-sentence question about the input speech. The trained speech understanding model 1000 is the speech understanding model 1000 whose training target parameters have been learned. A detailed example of the functional configuration of the output sentence generation unit 207 will be described later.
The trained speech understanding model storage unit 208 stores the trained speech understanding model 1000. The test data storage unit 209 stores the given test data.
<<Detailed functional configuration example of the output sentence generation unit 207>>
A detailed example of the functional configuration of the output sentence generation unit 207 will be described with reference to Fig. 11. Fig. 11 is a diagram showing an example of the detailed functional configuration of the output sentence generation unit 207.
As shown in Fig. 11, the output sentence generation unit 207 includes a test data input unit 221, a speech encoding unit 222, a first integration unit 223, a second integration unit 224, a linear transformation unit 225, a generation unit 226, and an output unit 227.
The test data input unit 221 inputs one piece of test data stored in the test data storage unit 209.
音声エンコード部222は、学習済み音声理解モデル1000に含まれる音声エンコーダ1100により実現される。音声エンコード部222は、テストデータ入力部221によって入力されたテストデータに含まれる入力音声を入力として、各時間区間t(t=1,・・・,T)でN個の統合対象層からN個の第1の音声特徴ベクトルhn(t)(n=1,・・・,N)をそれぞれ出力する。 The speech encoding unit 222 is realized by the speech encoder 1100 included in the trained speech understanding model 1000. The speech encoding unit 222 receives input speech included in the test data input by the test data input unit 221, and outputs N first speech feature vectors h n (t) (n=1, ..., N) from the N integration target layers in each time interval t (t=1, ..., T).
The first integration unit 223 is realized by the speech encoder output integration block 1200 included in the trained speech understanding model 1000. In each time interval t (t = 1, ..., T), the first integration unit 223 receives the first speech feature vectors h_n(t) (n = 1, ..., N) in that time interval t as input, and outputs a first integrated vector e(t) obtained by integrating these first speech feature vectors h_n(t) (n = 1, ..., N).
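As a concrete illustration only (not the embodiment's actual code), the layer-wise integration performed by the first integration unit 223 can be sketched as follows. The sketch assumes a PyTorch-style implementation in which the per-layer outputs h_n(t) are stacked into one tensor, and shows the weighted-sum variant of the first parameter described later (Appendix 4), with learnable weights normalized by a softmax; the class and variable names are assumptions.

```python
# Minimal sketch (assumption): layer-wise weighted integration of encoder outputs.
# The N per-layer feature sequences h_n(t) are stacked as a tensor of shape (N, T, D).
import torch
import torch.nn as nn


class LayerWeightedSum(nn.Module):
    """Integrates N per-layer speech features into one vector e(t) per time interval."""

    def __init__(self, num_layers: int):
        super().__init__()
        # First parameter: one learnable weight per integration target layer.
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (N, T, D) -> e: (T, D)
        w = torch.softmax(self.layer_weights, dim=0)   # normalize the layer weights
        return torch.einsum("n,ntd->td", w, h)         # weighted sum over the N layers
```

The linear-transformation-sum variant mentioned for the first parameter would replace the scalar weights with a small linear map per layer; the weighted sum above is only one of the two options named in the disclosure.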
The second integration unit 224 is realized by the time information integration block 1300 included in the trained speech understanding model 1000. The second integration unit 224 receives the first integrated vectors e(t) for the respective time intervals t as input, and outputs a second integrated vector v obtained by integrating these first integrated vectors e(t) (t = 1, ..., T) in the time direction.
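For illustration, the temporal integration can be sketched with the self-attention pooling variant of the second parameter (Appendix 5); a one-dimensional convolutional variant would be equally possible. The class below is an assumed minimal implementation, not the embodiment's exact architecture.

```python
# Minimal sketch (assumption): self-attention pooling over the time direction.
import torch
import torch.nn as nn


class SelfAttentionPooling(nn.Module):
    """Integrates the T first integrated vectors e(t) into a single vector v."""

    def __init__(self, dim: int):
        super().__init__()
        # Second parameter: the weights of the self-attention pooling layer.
        self.score = nn.Linear(dim, 1)

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        # e: (T, D) -> v: (D,)
        alpha = torch.softmax(self.score(e), dim=0)  # attention weight per time interval
        return (alpha * e).sum(dim=0)                # attention-weighted sum over time
```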
The linear transformation unit 225 is realized by the linear transformation layer 1400 included in the trained speech understanding model 1000. The linear transformation unit 225 receives the second integrated vector v as input and outputs a second speech feature vector w obtained by linearly transforming the second integrated vector v.
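A linear projection of this kind might look as follows; the dimensionalities are hypothetical values chosen only so that w matches the embedding size expected by the language model.

```python
# Minimal sketch (assumption): project the pooled vector v into the LLM embedding space.
import torch
import torch.nn as nn

speech_dim, llm_dim = 1024, 4096                     # hypothetical dimensionalities
linear_transform = nn.Linear(speech_dim, llm_dim)    # linear transformation layer 1400

v = torch.randn(speech_dim)                          # second integrated vector from unit 224
w = linear_transform(v)                              # second speech feature vector, shape (llm_dim,)
```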
The generation unit 226 is realized by the large-scale language model 1500 included in the trained speech understanding model 1000. The generation unit 226 receives as input the input sentence included in the test data input by the test data input unit 221 and the second speech feature vector w, and generates an output sentence given the input sentence and the second speech feature vector w.
More specifically, let s_i denote the i-th token constituting the output sentence. Let p(s_1) denote the posterior probability that token s_1 is generated given the input sentence and the second speech feature vector w, and let p(s_i) (where i ≥ 2) denote the posterior probability that token s_i is generated given the input sentence, the second speech feature vector w, and s_1, ..., s_{i-1}. The generation unit 226 then generates the output sentence by generating tokens s_i according to the posterior probabilities p(s_i) until, for example, a token representing the end of the sentence is generated.
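To make this generation step concrete, the following sketch shows one way such token-by-token decoding could look. It assumes a decoder-style language model that accepts the speech feature vector w as a prefix embedding alongside the embedded input sentence; `lm`, `embed`, and `eos_id` are hypothetical names, and greedy selection of the most probable token is used purely for illustration (sampling from p(s_i) would follow the description above more literally).

```python
# Minimal sketch (assumption): autoregressive generation of the output sentence,
# conditioning the language model on the input sentence and the speech feature w.
import torch


def generate_output_sentence(lm, embed, input_sentence_ids, w, eos_id, max_len=128):
    # Prefix: speech feature w (as one pseudo-token embedding) + embedded input sentence.
    prefix = torch.cat([w.unsqueeze(0), embed(input_sentence_ids)], dim=0)
    generated = []
    for _ in range(max_len):
        if generated:
            token_ids = torch.tensor(generated, dtype=torch.long)
            inputs = torch.cat([prefix, embed(token_ids)], dim=0)
        else:
            inputs = prefix
        logits = lm(inputs.unsqueeze(0))[0, -1]   # scores for the next token s_i
        p_si = torch.softmax(logits, dim=-1)      # posterior probability p(s_i)
        s_i = int(torch.argmax(p_si))             # greedy choice, for illustration only
        if s_i == eos_id:                         # stop at the end-of-sentence token
            break
        generated.append(s_i)
    return generated
```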
The output unit 227 outputs the output sentence generated by the generation unit 226 to a predetermined output destination. Examples of the predetermined output destination include a storage area such as the auxiliary storage device 107, the display device 102 such as a display, and other devices or equipment that are communicatively connected.
<Output sentence generation process>
The output sentence generation process will be described below with reference to Fig. 12. Fig. 12 is a flowchart showing an example of the output sentence generation process.
The test data input unit 221 of the output sentence generation unit 207 inputs one piece of test data stored in the test data storage unit 209 (step S301).
The speech encoding unit 222 of the output sentence generation unit 207 receives as input the input speech included in the test data input in step S301 above, and outputs N first speech feature vectors h_n(t) (n = 1, ..., N) from the N integration target layers in each time interval t (t = 1, ..., T) (step S302).
In each time interval t (t = 1, ..., T), the first integration unit 223 of the output sentence generation unit 207 receives the first speech feature vectors h_n(t) (n = 1, ..., N) in that time interval t as input, and outputs a first integrated vector e(t) obtained by integrating these first speech feature vectors h_n(t) (n = 1, ..., N) (step S303).
The second integration unit 224 of the output sentence generation unit 207 receives the first integrated vectors e(t) for the respective time intervals t as input, and outputs a second integrated vector v obtained by integrating these first integrated vectors e(t) (t = 1, ..., T) in the time direction (step S304).
The linear transformation unit 225 of the output sentence generation unit 207 receives the second integrated vector v as input and outputs a second speech feature vector w obtained by linearly transforming the second integrated vector v (step S305).
The generation unit 226 of the output sentence generation unit 207 receives as input the input sentence included in the test data input in step S301 above and the second speech feature vector w output in step S305 above, and generates an output sentence given the input sentence and the second speech feature vector w (step S306).
The output unit 227 of the output sentence generation unit 207 outputs the output sentence generated in step S306 above to the predetermined output destination (step S307). As a result, an output sentence that is a response or answer to the question about the input speech is obtained.
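Putting steps S302 to S306 together, the inference path of the trained speech understanding model 1000 could be wired up roughly as below. This is a schematic sketch under the same assumptions as the earlier fragments (PyTorch-style modules, a hypothetical encoder, tokenizer, and language-model interface, and the `generate_output_sentence` helper sketched above), not the embodiment's actual code.

```python
# Minimal sketch (assumption): end-to-end inference for one piece of test data.
import torch


def answer_question(encoder, integrate_layers, pool_time, linear_transform,
                    lm, embed, tokenizer, input_speech, input_sentence, eos_id):
    # Step S302: N per-layer feature sequences h_n(t), stacked as (N, T, D).
    h = encoder(input_speech)
    # Step S303: layer-wise integration -> e(t), shape (T, D).
    e = integrate_layers(h)
    # Step S304: temporal integration -> v, shape (D,).
    v = pool_time(e)
    # Step S305: linear transformation -> second speech feature vector w.
    w = linear_transform(v)
    # Step S306: generate the output sentence token by token.
    question_ids = torch.tensor(tokenizer.encode(input_sentence), dtype=torch.long)
    output_ids = generate_output_sentence(lm, embed, question_ids, w, eos_id)
    return tokenizer.decode(output_ids)
```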
<Summary>
As described above, the speech understanding device 10 according to this embodiment can realize speech understanding technology using the speech understanding model 1000 in which the speech encoder output integration block 1200 and the time information integration block 1300 are present between the speech encoder 1100 and the large-scale language model 1500. For this reason, by using the speech understanding device 10 according to this embodiment, it is possible to expect an improvement in the processing accuracy of a downstream system that uses the recognition results of non-linguistic information and paralinguistic information, for example.
The following additional notes are further disclosed regarding the above-described embodiments.
(Appendix 1)
A learning device comprising:
a memory; and
at least one processor connected to the memory,
wherein the processor:
inputs training data including a speech, a first sentence related to the speech, and a second sentence corresponding to the first sentence;
generates information representing features of the speech for each predetermined time interval based on a speech feature extractor composed of a plurality of layers;
generates, for each time interval, first integrated information by integrating, based on a first parameter, the information representing the features generated in each of a plurality of predetermined layers of the speech feature extractor;
generates second integrated information by integrating the first integrated information for the respective time intervals in the time direction based on a second parameter;
calculates a generation probability of a third sentence corresponding to the first sentence based on the first sentence, the second integrated information, and a language model; and
learns learning target parameters including the first parameter and the second parameter based on the generation probability of the third sentence and the second sentence.
(Appendix 2)
An inference device comprising:
a memory; and
at least one processor connected to the memory,
wherein the processor:
inputs test data including a speech and a first sentence related to the speech;
generates information representing features of the speech for each predetermined time interval based on a speech feature extractor composed of a plurality of layers;
generates, for each time interval, first integrated information by integrating, based on a trained first parameter, the information representing the features generated in each of a plurality of predetermined layers of the speech feature extractor;
generates second integrated information by integrating the first integrated information for the respective time intervals in the time direction based on a trained second parameter; and
generates a second sentence corresponding to the first sentence based on the first sentence, the second integrated information, and a language model.
(Appendix 3)
The learning device according to Appendix 1, wherein the processor:
linearly transforms the second integrated information based on a third parameter; and
calculates the generation probability of the third sentence based on the first sentence, the second integrated information after the linear transformation, and the language model.
(Appendix 4)
The learning device according to Appendix 1 or 3, wherein
the first parameter is a weight used in a weighted sum or a linear transformation coefficient used in a linear transformation sum, and
the processor generates the first integrated information by integrating the features using the weighted sum or the linear transformation sum.
(Appendix 5)
The learning device according to Appendix 1 or 3, wherein
the second parameter is a weight of a self-attention pooling layer or a parameter of a one-dimensional convolutional neural network, and
the processor generates the second integrated information by integrating the first integrated information in the time direction using the self-attention pooling layer or the one-dimensional convolutional neural network.
(Appendix 6)
A non-transitory storage medium storing a program executable by a computer to perform a learning process, the learning process comprising:
inputting training data including a speech, a first sentence related to the speech, and a second sentence corresponding to the first sentence;
generating information representing features of the speech for each predetermined time interval based on a speech feature extractor composed of a plurality of layers;
generating, for each time interval, first integrated information by integrating, based on a first parameter, the information representing the features generated in each of a plurality of predetermined layers of the speech feature extractor;
generating second integrated information by integrating the first integrated information for the respective time intervals in the time direction based on a second parameter;
calculating a generation probability of a third sentence corresponding to the first sentence based on the first sentence, the second integrated information, and a language model; and
learning learning target parameters including the first parameter and the second parameter based on the generation probability of the third sentence and the second sentence.
(Appendix 7)
A non-transitory storage medium storing a program executable by a computer to perform an inference process, the inference process comprising:
inputting test data including a speech and a first sentence related to the speech;
generating information representing features of the speech for each predetermined time interval based on a speech feature extractor composed of a plurality of layers;
generating, for each time interval, first integrated information by integrating, based on a trained first parameter, the information representing the features generated in each of a plurality of predetermined layers of the speech feature extractor;
generating second integrated information by integrating the first integrated information for the respective time intervals in the time direction based on a trained second parameter; and
generating a second sentence corresponding to the first sentence based on the first sentence, the second integrated information, and a language model.
The present invention is not limited to the specifically disclosed embodiments above, and various modifications, alterations, and combinations with known technologies are possible without departing from the scope of the claims.
[References]
Reference 1: Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli, "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations," arXiv preprint arXiv:2006.11477, 2020.
Reference 2: S. Chen et al., "WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing," in IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505-1518, Oct. 2022, doi: 10.1109/JSTSP.2022.3188113.
Reference 3: H. Touvron et al., "Llama 2: Open Foundation and Fine-Tuned Chat Models," arXiv preprint arXiv:2307.09288, 2023.
10 Speech understanding device
101 Input device
102 Display device
103 External I/F
103a Recording medium
104 Communication I/F
105 RAM
106 ROM
107 Auxiliary storage device
108 Processor
109 Bus
201 Model construction unit
202 Model learning unit
203 Trained speech encoder storage unit
204 Trained large-scale language model storage unit
205 Speech understanding model storage unit
206 Training dataset storage unit
207 Output sentence generation unit
208 Trained speech understanding model storage unit
209 Test data storage unit
211 Training data input unit
212 Speech encoding unit
213 First integration unit
214 Second integration unit
215 Linear transformation unit
216 Posterior probability calculation unit
217 Parameter update unit
218 End determination unit
221 Test data input unit
222 Speech encoding unit
223 First integration unit
224 Second integration unit
225 Linear transformation unit
226 Generation unit
227 Output unit
Claims (8)
1. A learning device comprising:
an input unit that inputs training data including a speech, a first sentence related to the speech, and a second sentence corresponding to the first sentence;
a speech feature generation unit that generates information representing features of the speech for each predetermined time interval based on a speech feature extractor composed of a plurality of layers;
a first integration unit that generates, for each time interval, first integrated information by integrating, based on a first parameter, the information representing the features generated in each of a plurality of predetermined layers of the speech feature extractor;
a second integration unit that generates second integrated information by integrating the first integrated information for the respective time intervals in the time direction based on a second parameter;
a calculation unit that calculates a generation probability of a third sentence corresponding to the first sentence based on the first sentence, the second integrated information, and a language model; and
a learning unit that learns learning target parameters including the first parameter and the second parameter based on the generation probability of the third sentence and the second sentence.
2. An inference device comprising:
an input unit that inputs test data including a speech and a first sentence related to the speech;
a speech feature generation unit that generates information representing features of the speech for each predetermined time interval based on a speech feature extractor composed of a plurality of layers;
a first integration unit that generates, for each time interval, first integrated information by integrating, based on a trained first parameter, the information representing the features generated in each of a plurality of predetermined layers of the speech feature extractor;
a second integration unit that generates second integrated information by integrating the first integrated information for the respective time intervals in the time direction based on a trained second parameter; and
a sentence generation unit that generates a second sentence corresponding to the first sentence based on the first sentence, the second integrated information, and a language model.
3. The learning device according to claim 1, further comprising:
a linear transformation unit that linearly transforms the second integrated information based on a third parameter,
wherein the calculation unit calculates the generation probability of the third sentence based on the first sentence, the second integrated information after the linear transformation, and the language model.
4. The learning device according to claim 1 or 3, wherein
the first parameter is a weight used in a weighted sum or a linear transformation coefficient used in a linear transformation sum, and
the first integration unit generates the first integrated information by integrating the features using the weighted sum or the linear transformation sum.
5. The learning device according to claim 1 or 3, wherein
the second parameter is a weight of a self-attention pooling layer or a parameter of a one-dimensional convolutional neural network, and
the second integration unit generates the second integrated information by integrating the first integrated information in the time direction using the self-attention pooling layer or the one-dimensional convolutional neural network.
6. A learning method executed by a computer, the learning method comprising:
an input step of inputting training data including a speech, a first sentence related to the speech, and a second sentence corresponding to the first sentence;
a speech feature generation step of generating information representing features of the speech for each predetermined time interval based on a speech feature extractor composed of a plurality of layers;
a first integration step of generating, for each time interval, first integrated information by integrating, based on a first parameter, the information representing the features generated in each of a plurality of predetermined layers of the speech feature extractor;
a second integration step of generating second integrated information by integrating the first integrated information for the respective time intervals in the time direction based on a second parameter;
a calculation step of calculating a generation probability of a third sentence corresponding to the first sentence based on the first sentence, the second integrated information, and a language model; and
a learning step of learning learning target parameters including the first parameter and the second parameter based on the generation probability of the third sentence and the second sentence.
7. An inference method executed by a computer, the inference method comprising:
an input step of inputting test data including a speech and a first sentence related to the speech;
a speech feature generation step of generating information representing features of the speech for each predetermined time interval based on a speech feature extractor composed of a plurality of layers;
a first integration step of generating, for each time interval, first integrated information by integrating, based on a trained first parameter, the information representing the features generated in each of a plurality of predetermined layers of the speech feature extractor;
a second integration step of generating second integrated information by integrating the first integrated information for the respective time intervals in the time direction based on a trained second parameter; and
a sentence generation step of generating a second sentence corresponding to the first sentence based on the first sentence, the second integrated information, and a language model.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2024/001906 WO2025158547A1 (en) | 2024-01-23 | 2024-01-23 | Learning device, inference device, learning method, inference method, and program |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025158547A1 true WO2025158547A1 (en) | 2025-07-31 |
Patent Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2021166207A1 (en) * | 2020-02-21 | 2021-08-26 | 日本電信電話株式会社 | Recognition device, learning device, method for same, and program |
| JP2023117248A (en) * | 2022-02-10 | 2023-08-23 | 株式会社東芝 | Machine learning device, machine learning method, machine learning program and reasoning device |
Non-Patent Citations (1)
| Title |
|---|
| GONG, YUAN ET AL.: "JOINT AUDIO AND SPEECH UNDERSTANDING.", ARXIV.ORG E-PRINT ARCHIVE, 10 December 2023 (2023-12-10), pages 1 - 8, XP034518593, Retrieved from the Internet <URL:https://arxiv.org/pdf/2309.14405> [retrieved on 20240311], DOI: 10.48550/arXiv.2309.14405 * |