
WO2025099939A1 - Learning device, generation device, learning method, generation method, and program - Google Patents


Info

Publication number
WO2025099939A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
information
unit
input
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/JP2023/040613
Other languages
French (fr)
Japanese (ja)
Inventor
勇祐 井島
孝平 松浦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NTT Inc
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp filed Critical Nippon Telegraph and Telephone Corp
Priority to PCT/JP2023/040613 priority Critical patent/WO2025099939A1/en
Publication of WO2025099939A1 publication Critical patent/WO2025099939A1/en
Pending legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Definitions

  • This disclosure relates to a learning device, a generating device, a learning method, a generating method, and a program.
  • In recent years, in the field of speech processing, many technologies have been proposed for recognizing or detecting paralinguistic and non-linguistic information contained in speech (for example, Non-Patent Document 1).
  • Here, paralinguistic information and non-linguistic information both refer to information contained in speech that is not linguistic information.
  • Hereinafter, paralinguistic information and non-linguistic information will be collectively referred to as "paralinguistic and non-linguistic information."
  • However, conventional technologies that recognize or detect paralinguistic and non-linguistic information contained in speech often output only the recognition or detection result itself, which is not highly interpretable.
  • The present disclosure has been made in consideration of the above points, and provides technology that can output paralinguistic and non-linguistic information contained in speech in a highly interpretable format.
  • A learning device according to one aspect of the present disclosure includes an input unit that inputs learning data including voice data and teacher data representing a natural language description of paralinguistic information or non-linguistic information contained in the voice represented by the voice data, a calculation unit that uses as input a voice sequence obtained by converting the voice data into a predetermined format for each predetermined unit, and calculates the generation probability of a natural language description of paralinguistic information or non-linguistic information contained in the voice represented by the voice data from which the input voice sequence is converted using a machine learning model that calculates the generation probability of a natural language description of paralinguistic information or non-linguistic information contained in the voice, and an update unit that updates the learnable parameters of the machine learning model based on the error between the natural language description generated according to the generation probability and the teacher data included in the learning data.
  • Technology is thus provided that can output paralinguistic and non-linguistic information contained in speech in a highly interpretable format.
  • FIG. 1 is a diagram illustrating an example of the hardware configuration of a caption generation device according to a first embodiment.
  • FIG. 2 is a diagram showing an example of the functional configuration of the caption generation device according to the first embodiment during learning.
  • FIG. 3 is a diagram illustrating an example of learning data stored in a learning data storage unit.
  • FIG. 4 is a diagram illustrating an example of a detailed functional configuration of a caption generation unit according to the first embodiment.
  • FIG. 5 is a flowchart illustrating an example of a learning process according to the first embodiment.
  • FIG. 6 is a flowchart illustrating an example of a generation process of an output token probability sequence according to the first embodiment.
  • FIG. 7 is a diagram illustrating an example of a functional configuration at the time of inference of the caption generation device according to the first embodiment.
  • FIG. 8 is a flowchart showing an example of a caption generation process according to the first embodiment.
  • FIG. 9 is a flowchart illustrating an example of a generation process of an output token sequence according to the first embodiment.
  • FIG. 10 is a diagram showing an example of the functional configuration of the caption generation device according to a second embodiment during learning.
  • FIG. 11 is a diagram illustrating an example of a detailed functional configuration of a caption generation unit according to the second embodiment.
  • FIG. 12 is a flowchart illustrating an example of a learning process according to the second embodiment.
  • FIG. 13 is a flowchart illustrating an example of a generation process of an output token probability sequence according to the second embodiment.
  • FIG. 14 is a diagram showing an example of the functional configuration at the time of inference of the caption generation device according to the second embodiment.
  • FIG. 15 is a flowchart showing an example of a caption generation process according to the second embodiment.
  • FIG. 16 is a flowchart illustrating an example of a generation process of an output token sequence according to the second embodiment.
  • In each of the following embodiments, as an example, a caption generation device 10 is described that can output paralinguistic and non-linguistic information contained in audio in the form of a caption written in natural language.
  • a caption is generally a term that refers to an explanatory text about the contents of, for example, a video or a photo, but in the following embodiments, the term is used to refer to an explanatory text that describes the paralinguistic and non-linguistic information contained in audio in natural language. This makes it possible to output the paralinguistic and non-linguistic information contained in audio in the highly interpretable form of a caption.
  • terms such as "natural language sentence”, “explanatory text”, “sentence”, and "character string” may be used instead of the term caption.
  • However, outputting paralinguistic and non-linguistic information contained in audio in the form of captions is just one example, and the highly interpretable format for expressing paralinguistic and non-linguistic information is not limited to captions.
  • Each of the embodiments described below can be similarly applied to cases where paralinguistic and non-linguistic information is expressed in a highly interpretable format other than captions.
  • Examples of highly interpretable formats other than captions include figures, tables, images, and the like that include explanatory text describing the paralinguistic and non-linguistic information in natural language.
  • Note that paralinguistic information and non-linguistic information both refer to information contained in speech that is not linguistic information; generally, however, paralinguistic information refers to information that the speaker can change at will, while non-linguistic information refers to information that cannot be changed at will.
  • Examples of paralinguistic information include information that indicates the speaker's intentions and attitudes.
  • Examples of non-linguistic information include information that indicates the speaker's gender and emotions, the presence or absence of a certain illness, and information that identifies the speaker.
  • The caption generation device 10 according to the first embodiment has a "learning time" during which a machine learning model that generates captions (hereinafter referred to as a "captioning model") is trained, and an "inference time" during which a caption is generated from a voice using the trained captioning model.
  • the caption generation device 10 is realized by the same device during learning and inference, but for example, the caption generation device 10 may be realized by different devices during learning and inference.
  • caption generation device 10 during learning may be called, for example, a "learning device.”
  • caption generation device 10 during inference may be called, for example, simply a “generation device” or an “inference device.”
  • Fig. 1 is a diagram showing an example of the hardware configuration of the caption generation device 10 according to the first embodiment.
  • the caption generating device 10 has an input device 101, a display device 102, an external I/F 103, a communication I/F 104, a RAM (Random Access Memory) 105, a ROM (Read Only Memory) 106, an auxiliary storage device 107, and a processor 108.
  • Each of these pieces of hardware is communicably connected to the others via a bus 109.
  • the input device 101 is, for example, a keyboard, a mouse, a touch panel, a physical button, etc.
  • the display device 102 is, for example, a display, a display panel, etc. Note that the caption generation device 10 does not have to have at least one of the input device 101 and the display device 102, for example.
  • the external I/F 103 is an interface with external devices such as a recording medium 103a.
  • recording media 103a include a CD (Compact Disc), a DVD (Digital Versatile Disk), an SD memory card (Secure Digital memory card), and a USB (Universal Serial Bus) memory card.
  • the hardware configuration shown in FIG. 1 is an example, and the hardware configuration of the caption generation device 10 is not limited to this.
  • the caption generation device 10 may have multiple auxiliary storage devices 107 and multiple processors 108, may not have some of the hardware shown in the figure, or may have various hardware other than the hardware shown in the figure.
  • Fig. 2 is a diagram showing an example of a functional configuration of the caption generation device 10 according to the first embodiment during learning.
  • the caption generation device 10 has an input unit 201, a voice conversion unit 202, a caption generation unit 203, a correct caption conversion unit 204, and a parameter update unit 205. Each of these units is realized, for example, by a process in which one or more programs installed in the caption generation device 10 are executed by the processor 108. Furthermore, during learning, the caption generation device 10 according to the first embodiment has a learning data storage unit 206 and a model parameter storage unit 207. Each of these storage units is realized, for example, by a storage area such as the auxiliary storage device 107. However, for example, at least one of the storage units of the learning data storage unit 206 and the model parameter storage unit 207 may be realized by a storage area of a storage device or the like that is communicably connected to the caption generation device 10.
  • the input unit 201 inputs the learning data stored in the learning data storage unit 206.
  • the learning data is data for training a captioning model, and is represented as a pair of audio data representing the voice of a certain speaker and a correct answer caption that describes the paralinguistic and non-linguistic information contained in the voice in natural language.
  • the voice conversion unit 202 converts the voice data included in the training data input by the input unit 201 into an input voice sequence.
  • the input voice sequence is the voice sequence input to the captioning model.
  • A voice sequence is sequence data obtained by converting voice data into a format that can be handled on a frame-by-frame basis by signal processing or the like. Specific examples of voice sequences include sequence data converted into formats such as a spectrum, mel-frequency cepstral coefficients (MFCC), and a mel spectrogram.
  • However, depending on the model that realizes the speech encoding unit 301 described later, the voice conversion unit 202 may use data obtained by dividing the audio waveform of the audio data into frames as the input voice sequence.
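  • The following is a minimal editorial sketch (not taken from the patent; the sampling rate, frame settings, and use of torchaudio are assumptions) of the kind of frame-by-frame conversion the voice conversion unit 202 might perform, here producing a mel-spectrogram input voice sequence from a waveform.

```python
# A minimal sketch of converting voice data into a frame-level input voice sequence.
# Settings (16 kHz, 80 mel bands, hop of 256 samples) are illustrative assumptions.
import torch
import torchaudio

sample_rate = 16000                      # assumed sampling rate
waveform = torch.randn(1, sample_rate)   # stand-in for 1 second of speech

to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=1024, hop_length=256, n_mels=80)

# Input voice sequence: one 80-dimensional feature vector per frame.
input_voice_sequence = to_mel(waveform).squeeze(0).transpose(0, 1)  # (frames, 80)
print(input_voice_sequence.shape)
```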
  • the caption generation unit 203 is realized by a captioning model, and generates an output token probability sequence from the input speech sequence converted by the speech conversion unit 202.
  • the output token probability sequence is a sequence of generation probabilities of the output token that constitutes the caption.
  • a token is a sequence of one or more characters that represents a certain unit, and a typical example is a word.
  • the generation probability of the output token is expressed as an N-dimensional vector in which the sum of the elements of each dimension is 1, and the value of each element of the dimension is the probability that the token corresponding to that element will be generated, when the number of types of tokens that can be generated is N. An example of a detailed functional configuration of the caption generation unit 203 will be described later.
  • the correct caption conversion unit 204 converts the correct caption contained in the learning data input by the input unit 201 into a correct token probability sequence.
  • a correct token probability sequence is a sequence of probabilities corresponding to the correct tokens that make up the correct caption. For example, when the number of types of tokens to be generated is N, the probability corresponding to the correct token is expressed as an N-dimensional vector in which only the value of the element corresponding to the correct token is 1 and the values of the other elements are 0.
  • the parameter update unit 205 updates the learnable parameters of the captioning model (hereinafter referred to as "model parameters") using the error between the output token probability sequence generated by the caption generation unit 203 and the correct token probability sequence converted by the correct caption conversion unit 204.
  • cross-entropy error can be used as the error.
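  • As an illustration of the error computation described above, the following sketch (toy dimensions; not the patent's code) builds an output token probability sequence, the corresponding one-hot correct token probability sequence, and their cross-entropy error.

```python
# Cross-entropy error between an output token probability sequence and the
# one-hot correct token probability sequence. N and the ids are toy values.
import torch

N = 5                                                      # number of token types
# Output token probability sequence: one N-dim probability vector per position.
output_probs = torch.softmax(torch.randn(3, N), dim=-1)    # 3 output positions
# Correct token probability sequence: one-hot vectors for the correct tokens.
correct_ids = torch.tensor([2, 0, 4])
correct_probs = torch.nn.functional.one_hot(correct_ids, num_classes=N).float()

# Cross-entropy averaged over positions: -sum_i p_correct_i * log p_output_i
error = -(correct_probs * output_probs.log()).sum(dim=-1).mean()
print(error)
```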
  • the learning data storage unit 206 stores learning data that has been created in advance. Specific examples of learning data will be described later.
  • the model parameter storage unit 207 stores the model parameters of the captioning model.
  • Fig. 3 is a diagram showing an example of the learning data stored in the learning data storage unit 206.
  • the learning data storage unit 206 stores one or more learning data.
  • Each learning data includes audio data representing a voice of a certain speaker speaking a certain sentence, and a correct answer caption representing an explanatory text in natural language describing the paralinguistic and non-linguistic information perceived by a person who hears the voice.
  • the correct answer caption may be called, for example, teacher data.
  • the learning data in the first row contains audio data 1 and the correct caption "An elderly man speaking slowly.”
  • the learning data in the second row contains audio data 2 and the correct caption "The voice of a young woman who seems to be suffering from a cold.”
  • the learning data in the third row contains audio data 3 and the correct caption "A little girl playing suddenly with her friends.”
  • the learning data in the fourth row contains audio data 4 and the correct caption "A male university student whispering to the person next to him.”
  • the training data storage unit 206 stores multiple training data, each represented by a pair of audio data and its correct caption.
  • the number of training data stored in the training data storage unit 206 is preferably several hundred to several thousand or more. It is also preferable that the number of speakers of the audio data included in the training data is several hundred or more, and the number of audio documents represented by the audio data is several to several tens of sentences or more.
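  • Purely for illustration, the learning data of Fig. 3 could be held as pairs of audio data and a correct caption, as in the following sketch (file paths and structure are hypothetical).

```python
# Hypothetical in-memory representation of the learning data storage unit:
# each entry pairs audio data with its correct caption.
learning_data = [
    {"audio": "data/audio_0001.wav",
     "caption": "An elderly man speaking slowly."},
    {"audio": "data/audio_0002.wav",
     "caption": "The voice of a young woman who seems to be suffering from a cold."},
]

for item in learning_data:
    print(item["audio"], "->", item["caption"])
```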
  • FIG. 4 is a diagram showing an example of a detailed functional configuration of the caption generation unit 203 according to the first embodiment.
  • In the following, as an example, a case will be described in which the captioning model is composed of three machine learning models, each including a neural network: a speech encoder, a conversion network, and a text decoder.
  • However, the captioning model does not necessarily have to be configured with these three machine learning models (a speech encoder, a conversion network, and a text decoder), and may have other configurations.
  • the captioning model may be configured with one machine learning model including a neural network.
  • the caption generation unit 203 is composed of an audio encoding unit 301 realized by an audio encoder, a vector conversion unit 302 realized by a conversion network, and a text decoding unit 303 realized by a text decoder.
  • the speech encoding unit 301 takes an input speech sequence as input and generates a fixed-length vector (hereinafter referred to as an "audio fixed-length vector") that expresses the features of the input speech sequence (features that express paralinguistic and non-linguistic information).
  • As the speech encoder that realizes the speech encoding unit 301, for example, a Global Style Token (Reference 1) or the like can be used.
  • Alternatively, as the speech encoder that realizes the speech encoding unit 301, for example, a combination of a speech representation model based on self-supervised learning (hereinafter, a "speech SSL model") such as HuBERT (Reference 2) or WavLM (Reference 3) and a Global Style Token or the like may be used.
  • In this case, the voice conversion unit 202 described above uses data obtained by dividing the audio waveform of the audio data into frames as the input voice sequence.
  • The input voice sequence is then fed to the speech SSL model, and the sequence of frame-by-frame features output from the speech SSL model is input to the Global Style Token or the like to generate the fixed-length audio vector.
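  • The sketch below is a highly simplified stand-in (an editorial assumption, not the Global Style Token or HuBERT implementation; the class name ToySpeechEncoder and all dimensions are invented) for the speech encoding unit 301: frame-level features are pooled through attention over a small set of learnable style-token-like vectors to yield one fixed-length audio vector.

```python
# Simplified stand-in for the speech encoding unit 301: attention pooling over
# learnable style tokens produces a single fixed-length audio vector.
import torch
import torch.nn as nn

class ToySpeechEncoder(nn.Module):
    def __init__(self, feat_dim=80, vec_dim=256, num_style_tokens=10):
        super().__init__()
        self.proj = nn.Linear(feat_dim, vec_dim)
        self.style_tokens = nn.Parameter(torch.randn(num_style_tokens, vec_dim))
        self.attn = nn.MultiheadAttention(vec_dim, num_heads=4, batch_first=True)

    def forward(self, speech_seq):                 # (batch, frames, feat_dim)
        frames = self.proj(speech_seq)             # (batch, frames, vec_dim)
        query = frames.mean(dim=1, keepdim=True)   # crude utterance-level query
        tokens = self.style_tokens.unsqueeze(0).expand(frames.size(0), -1, -1)
        pooled, _ = self.attn(query, tokens, tokens)
        return pooled.squeeze(1)                   # fixed-length audio vector

encoder = ToySpeechEncoder()
print(encoder(torch.randn(2, 100, 80)).shape)      # torch.Size([2, 256])
```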
  • the vector conversion unit 302 receives the fixed-length audio vector generated by the audio encoding unit 301 as input and generates a sequence of P text information vectors (hereinafter referred to as a "text information vector sequence") that can be handled by the text decoding unit 303.
  • the vector conversion unit 302 converts the input fixed-length audio vector into a text information vector sequence.
  • P is a predetermined fixed value.
  • The text information vector sequence is a vector sequence representing word-embedding-like expressions obtained by reflecting the features expressed by the input fixed-length audio vector onto the P vectors included in the model parameters.
  • the P vectors included in the model parameters are also one of the learnable parameters.
  • As the conversion network that realizes the vector conversion unit 302, for example, an ordinary MLP (Multilayer Perceptron) can be used, similar to the Mapping Network described in Reference 4.
  • Alternatively, a neural network such as a Transformer encoder that can take preceding and following sequence information into account, or a neural network that combines an MLP and a Transformer encoder, can also be used.
  • As the text decoder that realizes the text decoding unit 303, for example, a decoder-type pre-trained model (Reference 5) trained with large-scale text data can be used.
  • pre-trained models include large-scale language models (LLMs) such as the Generative Pre-trained Transformer (GPT).
  • The text decoding unit 303 takes the text information vector sequence as input and generates an output token probability sequence. The text decoding unit 303 can also generate an output token sequence that constitutes a caption, for example, by sampling the output tokens according to each generation probability included in the output token probability sequence.
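  • The following rough sketch (an editorial assumption, not the patent's implementation; the class name ToyCaptioningModel, all dimensions, and the use of a small Transformer in place of a pre-trained LLM decoder are invented, and a real text decoder such as GPT would generate tokens autoregressively) shows the three-part structure described above: a speech encoder producing a fixed-length vector, a conversion network producing P text information vectors, and a text decoder producing an output token probability sequence.

```python
# Rough structural sketch of the captioning model:
# speech encoder (unit 301) -> conversion network (unit 302) -> text decoder (unit 303).
import torch
import torch.nn as nn

class ToyCaptioningModel(nn.Module):
    def __init__(self, feat_dim=80, vec_dim=256, P=8, emb_dim=512, vocab_size=1000):
        super().__init__()
        self.speech_encoder = nn.Sequential(             # stand-in for unit 301
            nn.Linear(feat_dim, vec_dim), nn.ReLU())
        self.conversion_network = nn.Sequential(         # stand-in for unit 302
            nn.Linear(vec_dim, P * emb_dim), nn.Tanh())
        decoder_layer = nn.TransformerEncoderLayer(      # stand-in for unit 303
            d_model=emb_dim, nhead=8, batch_first=True)
        self.text_decoder = nn.TransformerEncoder(decoder_layer, num_layers=2)
        self.to_vocab = nn.Linear(emb_dim, vocab_size)
        self.P, self.emb_dim = P, emb_dim

    def forward(self, speech_seq):                       # (batch, frames, feat_dim)
        fixed_vec = self.speech_encoder(speech_seq).mean(dim=1)   # cf. step S201
        prefix = self.conversion_network(fixed_vec)               # cf. step S202
        prefix = prefix.view(-1, self.P, self.emb_dim)   # P text information vectors
        hidden = self.text_decoder(prefix)                        # cf. step S203
        return torch.softmax(self.to_vocab(hidden), dim=-1)       # token probabilities

model = ToyCaptioningModel()
probs = model(torch.randn(2, 100, 80))
print(probs.shape)   # torch.Size([2, 8, 1000])
```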
  • Fig. 5 is a flowchart showing an example of the learning process according to the first embodiment.
  • online learning is assumed as an example, and steps S101 to S105 in Fig. 5 are repeatedly executed for each piece of learning data.
  • the learning process may be executed by mini-batch learning, batch learning, or the like.
  • the input unit 201 inputs one piece of learning data stored in the learning data storage unit 206 (step S101).
  • the voice conversion unit 202 converts the voice data contained in the training data input in step S101 above into an input voice sequence (step S102).
  • the caption generation unit 203 generates an output token probability sequence from the input speech sequence converted in step S102 (step S103). Details of the processing in this step will be described later.
  • the correct caption conversion unit 204 converts the correct caption contained in the learning data input in step S101 above into a correct token probability sequence (step S104).
  • The parameter update unit 205 updates the model parameters using the error between the output token probability sequence generated in step S103 and the correct token probability sequence converted in step S104 (step S105). That is, the parameter update unit 205 updates the model parameters by a known optimization method so as to minimize the error. However, at this time, the parameter update unit 205 may exclude the model parameters of the text decoder that realizes the text decoding unit 303 from the update target. Also, if a speech SSL model is used for the speech encoder that realizes the speech encoding unit 301, the parameter update unit 205 may exclude the model parameters of the speech SSL model from the update target. Note that, when updating the model parameters of the text decoder that realizes the text decoding unit 303 or the model parameters of the speech SSL model, the parameter update unit 205 may update the entire set of those model parameters, or may use an adapter such as LoRA (Reference 6).
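  • A hedged sketch of one online learning iteration (steps S101 to S105) is shown below, reusing the toy model from the previous sketch; the optimizer, learning rate, and the choice of which parameters to freeze are assumptions for illustration only.

```python
# One toy learning step: forward pass, cross-entropy error against the correct
# tokens, and an update of only the parameters that are not frozen.
import torch

model = ToyCaptioningModel()                       # toy model from the sketch above
# As described in the text, the text decoder parameters may be excluded from the update.
for p in model.text_decoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4)

speech_seq = torch.randn(1, 100, 80)               # input voice sequence (cf. S102)
correct_ids = torch.randint(0, 1000, (1, 8))       # correct token ids (cf. S104)

probs = model(speech_seq)                          # output token probabilities (cf. S103)
loss = torch.nn.functional.nll_loss(               # cross-entropy error (cf. S105)
    probs.log().flatten(0, 1), correct_ids.flatten())
optimizer.zero_grad()
loss.backward()
optimizer.step()
```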
  • Through the above, a captioning model in which the trained model parameters are set (that is, a trained captioning model) is obtained.
  • Fig. 6 is a flowchart showing an example of the process of generating an output token probability sequence according to the first embodiment.
  • the audio encoding unit 301 of the caption generation unit 203 generates a fixed-length audio vector from the input audio sequence (step S201).
  • the vector conversion unit 302 of the caption generation unit 203 converts the fixed-length audio vector generated in step S201 above into a text information vector sequence (step S202).
  • the text decoding unit 303 of the caption generation unit 203 generates an output token probability sequence from the text information vector sequence converted in step S202 above (step S203).
  • The inference time of the caption generation device 10 according to the first embodiment will be described below.
  • the model parameters of the captioning model have already been learned at the time of inference. Note that, at the time of inference, differences from the time of learning will be mainly described, and descriptions of components similar to those at the time of learning will be omitted.
  • Fig. 7 is a diagram showing an example of a functional configuration at the time of inference of the caption generation device 10 according to the first embodiment.
  • the caption generation device 10 has an input unit 201, a voice conversion unit 202, a caption generation unit 203, and an output unit 208.
  • the output unit 208 is realized, for example, by a process in which one or more programs installed in the caption generation device 10 are executed by the processor 108.
  • the caption generation device 10 according to the first embodiment has a model parameter storage unit 207.
  • When audio data for which a caption is to be generated is given, the input unit 201 inputs this audio data.
  • the voice conversion unit 202 converts the voice data input by the input unit 201 into an input voice sequence.
  • the caption generation unit 203 is realized by a trained captioning model, and generates an output token sequence from the input speech sequence converted by the speech conversion unit 202.
  • the output unit 208 outputs a caption composed of an output token sequence generated by the caption generation unit 203 to a predetermined output destination.
  • the output destination is not limited to a specific output destination, and any output destination can be targeted.
  • the output destination can be a storage area of the auxiliary storage device 107, a display device 102 such as a display, other devices connected in a communicable manner, etc.
  • the model parameter storage unit 207 stores the learned model parameters of the captioning model.
  • Fig. 8 is a flowchart showing an example of the caption generation process according to the first embodiment. Note that, in the following, it is assumed that audio data for which a caption is to be generated is provided to the caption generation device 10.
  • the input unit 201 inputs the given voice data (step S301).
  • the voice conversion unit 202 converts the voice data input in step S301 above into an input voice sequence (step S302).
  • the caption generation unit 203 generates an output token sequence from the input speech sequence converted in step S302 (step S303). Details of the processing in this step will be described later.
  • the output unit 208 outputs the caption composed of the output token sequence generated in step S303 above to a predetermined output destination (step S304). This results in a caption that describes in natural language the paralinguistic and non-linguistic information contained in the speech represented by the given audio data.
  • Fig. 9 is a flowchart showing an example of the process of generating an output token sequence according to the first embodiment.
  • the audio encoding unit 301 of the caption generation unit 203 generates an audio fixed-length vector from the input audio sequence (step S401).
  • The vector conversion unit 302 of the caption generation unit 203 converts the fixed-length audio vector generated in step S401 above into a text information vector sequence (step S402).
  • The text decoding unit 303 of the caption generation unit 203 then generates an output token sequence from the text information vector sequence converted in step S402 above (step S403).
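  • The following minimal decoding sketch (an assumption for illustration, reusing the toy model defined earlier; a real pre-trained text decoder would decode autoregressively) corresponds to the generation of an output token sequence at inference time: the forward pass of steps S401 to S403 followed by selecting, or sampling, output tokens from the probability sequence.

```python
# Toy inference: run the forward pass and pick output tokens from the
# output token probability sequence (argmax here; sampling shown in a comment).
import torch

model = ToyCaptioningModel()                         # toy model from the sketch above
model.eval()
with torch.no_grad():
    probs = model(torch.randn(1, 100, 80))           # forward pass (cf. S401-S403)
    token_ids = probs.argmax(dim=-1)                 # most probable token per position
    # token_ids = torch.multinomial(probs.squeeze(0), num_samples=1).squeeze(-1)
print(token_ids)   # would be mapped back to words/characters to form the caption
```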
  • Fig. 10 is a diagram showing an example of a functional configuration of the caption generation device 10 according to the second embodiment during learning.
  • During learning, the caption generation device 10 according to the second embodiment has, in addition to the various units described in the first embodiment, an identification unit 209 that is realized by an identification model that identifies paralinguistic and non-linguistic information contained in speech.
  • The identification unit 209 is realized, for example, by a process in which one or more programs installed in the caption generation device 10 are executed by the processor 108.
  • the identification unit 209 generates an identification result that is a result of identifying paralinguistic and non-linguistic information from the input speech sequence converted by the speech conversion unit 202.
  • the identification model that realizes the identification unit 209 is not limited to a specific identification model, and any identification model that can identify any paralinguistic and non-linguistic information such as illness, gender, emotion, etc. from the input speech sequence can be used.
  • the identification unit 209 may be realized by one identification model or by multiple identification models. When the identification unit 209 is realized by multiple identification models, the identification unit 209 generates an identification result that is a result of identifying each of the multiple pieces of paralinguistic and non-linguistic information from the input speech sequence.
  • The one or more identification models that realize the identification unit 209 may be pre-trained, or may be trained together with the captioning model.
  • The training data includes, in addition to the audio data and the correct caption, correct paralinguistic and non-linguistic information that indicates the correct identification result.
  • The caption generation unit 203 generates an output token probability sequence from the input speech sequence converted by the speech conversion unit 202 and the identification result generated by the identification unit 209. An example of a detailed functional configuration of the caption generation unit 203 will be described later.
  • Fig. 11 is a diagram showing an example of a detailed functional configuration of the caption generation unit 203 according to the second embodiment.
  • The following describes, as an example, a case in which the captioning model is composed of three machine learning models: a speech encoder, a conversion network, and a text decoder.
  • the caption generation unit 203 is composed of an audio encoding unit 301 realized by an audio encoder, a vector conversion unit 302 realized by a conversion network, and a text decoding unit 303 realized by a text decoder.
  • The vector conversion unit 302 generates a text information vector sequence using the fixed-length audio vector generated by the audio encoding unit 301 and the identification result generated by the identification unit 209 as input. In other words, the vector conversion unit 302 converts the input fixed-length audio vector and the input identification result into a text information vector sequence. Specifically, the vector conversion unit 302 combines the fixed-length audio vector generated by the audio encoding unit 301 and the identification result generated by the identification unit 209 to create a new fixed-length audio vector, and then generates a text information vector sequence from this new fixed-length audio vector, similar to the vector conversion unit 302 according to the first embodiment. This makes it possible to generate a text information vector sequence that takes into account not only the fixed-length audio vector but also the identification result of the identification model, and as a result, it becomes possible to generate a caption composed of an output token sequence that takes these into account.
  • For example, in the case where the fixed-length audio vector is an M-dimensional vector, there are L identification results, and every identification result is a scalar value taking one of the two values 0 or 1, the vector combining the fixed-length audio vector and the identification results is an (M+L)-dimensional vector whose elements are the elements of the fixed-length audio vector and the L identification results.
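  • As a small illustration of the combination described above (dimensions and labels are toy values), an M-dimensional fixed-length audio vector and L scalar identification results can be concatenated into an (M+L)-dimensional vector:

```python
# Concatenating an M-dimensional fixed-length audio vector with L scalar
# identification results into an (M+L)-dimensional vector.
import torch

M, L = 256, 3
fixed_audio_vector = torch.randn(M)                      # from the speech encoder
identification_results = torch.tensor([1.0, 0.0, 1.0])   # e.g. illness / gender / emotion flags (toy)
combined = torch.cat([fixed_audio_vector, identification_results])
print(combined.shape)                                    # torch.Size([259]) = (M+L,)
```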
  • Fig. 12 is a flowchart showing an example of the learning process according to the second embodiment.
  • online learning is assumed as an example, and steps S501 to S506 in Fig. 12 are repeatedly executed for each piece of learning data.
  • Steps S501 and S502 in FIG. 12 are similar to steps S101 and S102 in FIG. 5, respectively, and therefore will not be described.
  • the identification unit 209 generates an identification result that identifies paralinguistic and non-linguistic information from the input speech sequence converted in step S502 (step S503).
  • The caption generation unit 203 generates an output token probability sequence from the input speech sequence converted in step S502 and the identification result generated in step S503 (step S504). Details of the processing in this step will be described later.
  • the correct caption conversion unit 204 converts the correct caption included in the learning data input in step S501 into a correct token probability sequence (step S505), similar to step S104 in FIG. 5.
  • the parameter update unit 205 updates the model parameters using the error between the output token probability sequence generated in the above step S504 and the correct token probability sequence converted in the above step S505, similar to step S105 in FIG. 5 (step S506).
  • The parameter update unit 205 also uses the error between the correct paralinguistic and non-linguistic information included in the training data input in step S501 and the paralinguistic and non-linguistic information represented by the identification result generated in the above step S503 to update the learnable parameters of the one or more identification models realizing the identification unit 209, in addition to the model parameters.
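  • The following hedged sketch (loss choices, shapes, and equal weighting are assumptions, not the patent's specification) illustrates how the caption error and the identification error could be combined so that both the captioning model parameters and the identification model parameters are updated.

```python
# Toy combination of the caption cross-entropy error with an identification
# error computed against the correct paralinguistic / non-linguistic information.
import torch
import torch.nn.functional as F

caption_log_probs = torch.randn(8, 1000).log_softmax(dim=-1)   # from the captioning model
correct_token_ids = torch.randint(0, 1000, (8,))
caption_loss = F.nll_loss(caption_log_probs, correct_token_ids)

identification_logits = torch.randn(1, 3)          # from the identification model(s)
correct_labels = torch.tensor([[1.0, 0.0, 1.0]])   # correct paralinguistic / non-linguistic info (toy)
identification_loss = F.binary_cross_entropy_with_logits(identification_logits, correct_labels)

total_loss = caption_loss + identification_loss    # both sets of parameters can then be updated
print(total_loss)
```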
  • Through the above, a captioning model in which the trained model parameters are set (that is, a trained captioning model) is obtained.
  • Fig. 13 is a flowchart showing an example of the process of generating an output token probability sequence according to the second embodiment.
  • the audio encoding unit 301 of the caption generation unit 203 generates an audio fixed-length vector from the input audio sequence (step S601), similar to step S201 in FIG. 6.
  • The vector conversion unit 302 of the caption generation unit 203 converts the fixed-length audio vector generated in step S601 above and the identification result generated in step S503 of FIG. 12 into a text information vector sequence (step S602).
  • the text decoding unit 303 of the caption generation unit 203 generates an output token probability sequence from the text information vector sequence converted in step S602 above (step S603), similar to step S203 in FIG. 6.
  • The inference time of the caption generation device 10 according to the second embodiment will be described below.
  • The model parameters of the captioning model and the learnable parameters of the identification model have both been learned at the time of inference. Note that, at the time of inference, differences from the time of learning will be mainly described, and descriptions of components similar to those at the time of learning will be omitted.
  • Fig. 14 is a diagram showing an example of a functional configuration at the time of inference of the caption generation device 10 according to the second embodiment.
  • As shown in FIG. 14, during inference, the caption generation device 10 according to the second embodiment has an identification unit 209 in addition to the units described in the first embodiment.
  • the identification unit 209 generates an identification result by identifying paralinguistic and non-linguistic information from the input speech sequence converted by the speech conversion unit 202.
  • The caption generation unit 203 is realized by a trained captioning model, and generates an output token sequence from the input speech sequence converted by the speech conversion unit 202 and the identification result generated by the identification unit 209.
  • Fig. 15 is a flowchart showing an example of the caption generation process according to the second embodiment. Note that, in the following, it is assumed that audio data for which a caption is to be generated is provided to the caption generation device 10.
  • Steps S701 and S702 in FIG. 15 are similar to steps S301 and S302 in FIG. 8, respectively, and therefore will not be described.
  • the identification unit 209 generates an identification result that identifies paralinguistic and non-linguistic information from the input speech sequence converted in step S702 (step S703).
  • the caption generation unit 203 generates an output token sequence from the input speech sequence converted in step S702 above and the identification result generated in step S703 above (step S704).
  • the output unit 208 outputs the caption composed of the output token sequence generated in step S704 above to a predetermined output destination (step S705), similar to step S304 in FIG. 8. This allows for the acquisition of a caption that describes in natural language the paralinguistic and non-linguistic information contained in the speech represented by the given speech data, taking into account the identification results obtained from the speech data using the identification model.
  • Fig. 16 is a flowchart showing an example of the output token sequence generation process according to the second embodiment.
  • the audio encoding unit 301 of the caption generation unit 203 generates an audio fixed-length vector from the input audio sequence (step S801), similar to step S401 in FIG. 9.
  • The vector conversion unit 302 of the caption generation unit 203 converts the fixed-length audio vector generated in step S801 above and the identification result generated in step S703 of FIG. 15 into a text information vector sequence (step S802).
  • the text decoding unit 303 of the caption generation unit 203 generates an output token sequence from the text information vector sequence converted in step S802 above (step S803), similar to step S403 in FIG. 9.
  • the caption generation device 10 learns a captioning model from learning data including voice data and correct captions that describe in natural language the paralinguistic and non-linguistic information contained in the voice represented by the voice data.
  • the caption generation device 10 according to the first embodiment can then generate captions that describe in natural language the paralinguistic and non-linguistic information contained in the voice represented by the given voice data, using the learned captioning model. This makes it possible to output the paralinguistic and non-linguistic information contained in the voice in the form of a caption that has high interpretability.
  • The caption generation device 10 according to the second embodiment also trains the captioning model using the results of an identification model that identifies paralinguistic and non-linguistic information from the speech represented by the audio data.
  • The caption generation device 10 according to the second embodiment can then use the trained captioning model and identification model to generate a caption for given audio data and to generate paralinguistic and non-linguistic information as identification results. This makes it possible to obtain both the paralinguistic and non-linguistic information contained in the speech and its caption.
  • Reference 1 Y. Wang, D. Stanton, Y. Zhang, R.-S. Ryan, E. Battenberg, J. Shor, et al., "Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis," in ICML, 2018.
  • Reference 2 Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451-3460, 2021.
  • Reference 3 Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al., "WavLM: Large-scale self-supervised pre-training for full stack speech processing," IEEE J-STSP, vol. 16, no. 6, pp. 1505-1518, 2022.
  • Reference 4 R. Mokady, A. Hertz, and A. H. Bermano, "ClipCap: CLIP prefix for image captioning," arXiv preprint arXiv: 2111.09734, 2021.
  • Reference 5 T. Brown, B. Mann, N. Ryder, et al., "Language models are few-shot learners," in Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • 10 Caption generation device, 101 Input device, 102 Display device, 103 External I/F, 103a Recording medium, 104 Communication I/F, 105 RAM, 106 ROM, 107 Auxiliary storage device, 108 Processor, 109 Bus, 201 Input unit, 202 Voice conversion unit, 203 Caption generation unit, 204 Correct caption conversion unit, 205 Parameter update unit, 206 Learning data storage unit, 207 Model parameter storage unit, 208 Output unit, 209 Identification unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

A learning device according to one aspect includes: an input unit for inputting learning data including voice data and training data representing a natural language description of paralinguistic information or non-language information included in the voice represented by the voice data; a calculation unit for calculating, by a machine learning model that takes a voice sequence, obtained by converting the voice data into a predetermined format for each predetermined unit, as input and calculates the generation probability of a natural language description of paralinguistic information or non-language information included in the voice, the generation probability of the natural language description of paralinguistic information or non-language information included in the voice represented by the voice data of the conversion source of the input voice sequence; and an update unit for updating learnable parameters of the machine learning model on the basis of the error between the natural language description generated according to the generation probability and the training data included in the learning data.

Description

Learning device, generation device, learning method, generation method, and program

This disclosure relates to a learning device, a generating device, a learning method, a generating method, and a program.

In recent years, in the field of speech processing, many technologies have been proposed for recognizing or detecting paralinguistic and non-linguistic information contained in speech (for example, Non-Patent Document 1). Here, paralinguistic information and non-linguistic information both refer to information contained in speech that is not linguistic information. Hereinafter, paralinguistic information and non-linguistic information will be collectively referred to as "paralinguistic and non-linguistic information."

Non-Patent Document 1: M. B. Akcay and K. Oguz, "Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers," Speech Communication, vol. 116, pp. 56-76, 2020.

However, conventional technologies that recognize or detect paralinguistic and non-linguistic information contained in speech often only output the results of the recognition or detection, and are not highly interpretable.

The present disclosure has been made in consideration of the above points, and provides technology that can output paralinguistic and non-linguistic information contained in speech in a highly interpretable format.

A learning device according to one aspect of the present disclosure includes an input unit that inputs learning data including voice data and teacher data representing a natural language description of paralinguistic information or non-linguistic information contained in the voice represented by the voice data, a calculation unit that uses as input a voice sequence obtained by converting the voice data into a predetermined format for each predetermined unit, and calculates the generation probability of a natural language description of paralinguistic information or non-linguistic information contained in the voice represented by the voice data from which the input voice sequence is converted using a machine learning model that calculates the generation probability of a natural language description of paralinguistic information or non-linguistic information contained in the voice, and an update unit that updates the learnable parameters of the machine learning model based on the error between the natural language description generated according to the generation probability and the teacher data included in the learning data.

Technology is provided that can output paralinguistic and non-linguistic information contained in speech in a highly interpretable format.

FIG. 1 is a diagram illustrating an example of the hardware configuration of a caption generation device according to a first embodiment.
FIG. 2 is a diagram showing an example of the functional configuration of the caption generation device according to the first embodiment during learning.
FIG. 3 is a diagram illustrating an example of learning data stored in a learning data storage unit.
FIG. 4 is a diagram illustrating an example of a detailed functional configuration of a caption generation unit according to the first embodiment.
FIG. 5 is a flowchart illustrating an example of a learning process according to the first embodiment.
FIG. 6 is a flowchart illustrating an example of a generation process of an output token probability sequence according to the first embodiment.
FIG. 7 is a diagram illustrating an example of a functional configuration at the time of inference of the caption generation device according to the first embodiment.
FIG. 8 is a flowchart showing an example of a caption generation process according to the first embodiment.
FIG. 9 is a flowchart illustrating an example of a generation process of an output token sequence according to the first embodiment.
FIG. 10 is a diagram showing an example of the functional configuration of the caption generation device according to a second embodiment during learning.
FIG. 11 is a diagram illustrating an example of a detailed functional configuration of a caption generation unit according to the second embodiment.
FIG. 12 is a flowchart illustrating an example of a learning process according to the second embodiment.
FIG. 13 is a flowchart illustrating an example of a generation process of an output token probability sequence according to the second embodiment.
FIG. 14 is a diagram showing an example of the functional configuration at the time of inference of the caption generation device according to the second embodiment.
FIG. 15 is a flowchart showing an example of a caption generation process according to the second embodiment.
FIG. 16 is a flowchart illustrating an example of a generation process of an output token sequence according to the second embodiment.

Each embodiment of the present invention will be described below with reference to the drawings. In each of the following embodiments, as an example, a caption generation device 10 will be described that can output paralinguistic and non-linguistic information contained in audio in the form of a caption written in natural language. A caption is generally a term that refers to an explanatory text about the contents of, for example, a video or a photo, but in the following embodiments, the term is used to refer to an explanatory text that describes the paralinguistic and non-linguistic information contained in audio in natural language. This makes it possible to output the paralinguistic and non-linguistic information contained in audio in the highly interpretable form of a caption. Note that instead of the term caption, terms such as "natural language sentence", "explanatory text", "sentence", and "character string" may be used.

However, outputting paralinguistic and non-linguistic information contained in audio in the form of captions is just one example, and the highly interpretable format for expressing paralinguistic and non-linguistic information is not limited to captions. Each of the embodiments described below can be similarly applied to cases where paralinguistic and non-linguistic information is expressed in a highly interpretable format other than captions. Examples of highly interpretable formats other than captions include figures, tables, images, etc. that include explanatory text that describes the paralinguistic and non-linguistic information in natural language.

Note that paralinguistic information and non-linguistic information both refer to information contained in speech that is not linguistic information, but generally, paralinguistic information refers to information that can be changed at will, while non-linguistic information refers to information that cannot be changed at will. Examples of paralinguistic information include information that indicates the speaker's intentions and attitudes. On the other hand, examples of non-linguistic information include information that indicates the speaker's gender and emotions, the presence or absence of a certain illness, and information that identifies the speaker.

[First embodiment]
A first embodiment will be described below. Here, the caption generation device 10 according to the first embodiment has a "learning time" during which a machine learning model that generates captions (hereinafter referred to as a "captioning model") is trained, and an "inference time" during which a caption is generated from a voice using the trained captioning model. In the following, it is assumed that the caption generation device 10 is realized by the same device during learning and inference, but for example, the caption generation device 10 may be realized by different devices during learning and inference.

Note that the caption generation device 10 during learning may be called, for example, a "learning device." Furthermore, the caption generation device 10 during inference may be called, for example, simply a "generation device" or an "inference device."

<Hardware configuration example of the caption generation device 10 according to the first embodiment>
An example of the hardware configuration of the caption generation device 10 according to the first embodiment will be described with reference to Fig. 1. Fig. 1 is a diagram showing an example of the hardware configuration of the caption generation device 10 according to the first embodiment.

As shown in FIG. 1, the caption generating device 10 according to the first embodiment has an input device 101, a display device 102, an external I/F 103, a communication I/F 104, a RAM (Random Access Memory) 105, a ROM (Read Only Memory) 106, an auxiliary storage device 107, and a processor 108. Each of these pieces of hardware is connected to each other so as to be able to communicate with each other via a bus 109.

The input device 101 is, for example, a keyboard, a mouse, a touch panel, a physical button, etc. The display device 102 is, for example, a display, a display panel, etc. Note that the caption generation device 10 does not have to have at least one of the input device 101 and the display device 102, for example.

The external I/F 103 is an interface with external devices such as a recording medium 103a. Examples of the recording medium 103a include a CD (Compact Disc), a DVD (Digital Versatile Disk), an SD memory card (Secure Digital memory card), and a USB (Universal Serial Bus) memory card.

The communication I/F 104 is an interface for connecting to a communication network. The RAM 105 is a volatile semiconductor memory (storage device) that temporarily stores programs and data. The ROM 106 is a non-volatile semiconductor memory (storage device) that can store programs and data even when the power is turned off. The auxiliary storage device 107 is a non-volatile storage device such as an HDD (Hard Disk Drive), an SSD (Solid State Drive), or a flash memory. The processor 108 is one of various types of arithmetic devices such as a CPU (Central Processing Unit) or a GPU (Graphic Processing Unit).

Note that the hardware configuration shown in FIG. 1 is an example, and the hardware configuration of the caption generation device 10 is not limited to this. For example, the caption generation device 10 may have multiple auxiliary storage devices 107 and multiple processors 108, may not have some of the hardware shown in the figure, or may have various hardware other than the hardware shown in the figure.

- Learning time
Hereinafter, the learning time of the caption generation device 10 according to the first embodiment will be described.

<Example of functional configuration during learning of the caption generation device 10 according to the first embodiment>
An example of a functional configuration of the caption generation device 10 according to the first embodiment during learning will be described with reference to Fig. 2. Fig. 2 is a diagram showing an example of a functional configuration of the caption generation device 10 according to the first embodiment during learning.

As shown in FIG. 2, during learning, the caption generation device 10 according to the first embodiment has an input unit 201, a voice conversion unit 202, a caption generation unit 203, a correct caption conversion unit 204, and a parameter update unit 205. Each of these units is realized, for example, by a process in which one or more programs installed in the caption generation device 10 are executed by the processor 108. Furthermore, during learning, the caption generation device 10 according to the first embodiment has a learning data storage unit 206 and a model parameter storage unit 207. Each of these storage units is realized, for example, by a storage area such as the auxiliary storage device 107. However, for example, at least one of the storage units of the learning data storage unit 206 and the model parameter storage unit 207 may be realized by a storage area of a storage device or the like that is communicably connected to the caption generation device 10.

 The input unit 201 inputs the learning data stored in the learning data storage unit 206. Here, the learning data is data for training the captioning model and is represented as a pair of voice data representing the voice of a certain speaker and a correct caption that describes, in natural language, the paralinguistic and non-linguistic information contained in that voice.

 The voice conversion unit 202 converts the voice data included in the learning data input by the input unit 201 into an input voice sequence. The input voice sequence is the voice sequence that is input to the captioning model. A voice sequence is sequence data obtained by converting voice data, through signal processing or the like, into a format that can be handled frame by frame. Specific examples of voice sequences include sequence data converted into formats such as a spectrum, mel-frequency cepstral coefficients (MFCC), or a mel spectrogram.

 However, depending on the model that realizes the voice encoding unit 301 described later, the voice conversion unit 202 may instead use, as the input voice sequence, data obtained by dividing the voice waveform of the voice data into frames.
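 For illustration only, the following sketch (in Python, using the torchaudio library; the parameter values such as the sampling rate, FFT size, and frame length are assumptions and not part of this disclosure) shows one way the voice conversion unit 202 might produce either a mel-spectrogram sequence or a frame-split waveform as the input voice sequence.

    # Illustrative sketch only; parameter values are assumptions, not part of this disclosure.
    import torch
    import torchaudio

    def to_mel_sequence(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
        # Convert a mono waveform of shape (1, T) into a mel-spectrogram sequence (frames, 80).
        mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=1024, hop_length=256, n_mels=80
        )(waveform)
        return mel.squeeze(0).transpose(0, 1)

    def to_frame_sequence(waveform: torch.Tensor, frame_length: int = 320) -> torch.Tensor:
        # Alternatively, split the raw waveform into fixed-length frames for an SSL-based encoder.
        num_samples = waveform.shape[-1]
        usable = num_samples - (num_samples % frame_length)
        return waveform[..., :usable].reshape(-1, frame_length)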

 The caption generation unit 203 is realized by the captioning model and generates an output token probability sequence from the input voice sequence converted by the voice conversion unit 202. The output token probability sequence is a sequence of generation probabilities of the output tokens that constitute a caption. A token is a string of one or more characters representing a certain unit; a typical example is a word. The generation probability of an output token is expressed, for example, as an N-dimensional vector, where N is the number of token types that can be generated, whose elements sum to 1 and where the value of each element is the probability that the token corresponding to that element is generated. A detailed example of the functional configuration of the caption generation unit 203 will be described later.

 The correct-caption conversion unit 204 converts the correct caption included in the learning data input by the input unit 201 into a correct-token probability sequence. The correct-token probability sequence is a sequence of probabilities corresponding to the correct tokens that constitute the correct caption. The probability corresponding to a correct token is expressed, for example, as an N-dimensional vector, where N is the number of token types to be generated, in which only the element corresponding to the correct token has the value 1 and all other elements have the value 0.
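 As a small numerical illustration of the two representations above (the vocabulary size N = 5 is an assumption chosen only for readability), an output-token probability vector and a one-hot correct-token vector might look as follows.

    # Illustrative sketch only; N = 5 is an assumed vocabulary size.
    import torch

    N = 5
    output_probs = torch.softmax(torch.randn(N), dim=-1)  # generation probabilities; elements sum to 1
    correct_index = 2                                      # index of the correct token
    correct_probs = torch.zeros(N)
    correct_probs[correct_index] = 1.0                     # one-hot: 1 for the correct token, 0 elsewhere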

 The parameter update unit 205 updates the learnable parameters of the captioning model (hereinafter, "model parameters") using the error between the output token probability sequence generated by the caption generation unit 203 and the correct-token probability sequence converted by the correct-caption conversion unit 204. As the error, for example, the cross-entropy error can be used.

 The learning data storage unit 206 stores learning data created in advance. Specific examples of the learning data will be described later.

 The model parameter storage unit 207 stores the model parameters of the captioning model.

 <<Example of the learning data stored in the learning data storage unit 206>>
 An example of the learning data stored in the learning data storage unit 206 will be described with reference to FIG. 3. FIG. 3 is a diagram showing an example of the learning data stored in the learning data storage unit 206.

 As shown in FIG. 3, the learning data storage unit 206 stores one or more pieces of learning data. Each piece of learning data includes voice data representing the voice of a certain speaker uttering a certain sentence and a correct caption, which is an explanatory text describing in natural language the paralinguistic and non-linguistic information perceived by a person who hears that voice. The correct caption may also be called, for example, teacher data.

 For example, in the example shown in FIG. 3, the learning data in the first row includes voice data 1 and the correct caption "An elderly man speaking slowly." Similarly, the learning data in the second row includes voice data 2 and the correct caption "The voice of a young woman who seems to have a cold." The learning data in the third row includes voice data 3 and the correct caption "A little girl playing happily with her friends." The learning data in the fourth row includes voice data 4 and the correct caption "A male university student whispering to the person next to him."

 In this way, the learning data storage unit 206 stores a plurality of pieces of learning data, each represented as a pair of voice data and its correct caption. The number of pieces of learning data stored in the learning data storage unit 206 is preferably several hundred to several thousand or more. It is also preferable that the voice data included in the learning data cover several hundred or more speakers and that the utterances represented by the voice data cover several to several tens of distinct sentences or more.

 <<Detailed example of the functional configuration of the caption generation unit 203 according to the first embodiment>>
 A detailed example of the functional configuration of the caption generation unit 203 according to the first embodiment will be described with reference to FIG. 4. FIG. 4 is a diagram showing an example of the detailed functional configuration of the caption generation unit 203 according to the first embodiment. In the following, as an example, a case is described in which the captioning model is composed of three machine learning models including neural networks: a voice encoder, a conversion network, and a text decoder. However, this is only an example; the captioning model does not necessarily have to be composed of these three machine learning models and may have another configuration. For example, the captioning model may be composed of a single machine learning model including a neural network.

 As shown in FIG. 4, the caption generation unit 203 according to the first embodiment is composed of a voice encoding unit 301 realized by the voice encoder, a vector conversion unit 302 realized by the conversion network, and a text decoding unit 303 realized by the text decoder.

 The voice encoding unit 301 receives the input voice sequence as input and generates a fixed-length vector (hereinafter, "voice fixed-length vector") expressing the features of the input voice sequence (features representing paralinguistic and non-linguistic information). As the voice encoder realizing the voice encoding unit 301, for example, Global Style Tokens (Reference 1) or the like can be used. Alternatively, a speech representation model based on self-supervised learning (a speech SSL (Self-Supervised Learning) model), such as HuBERT (Reference 2) or WavLM (Reference 3), may be used in combination with Global Style Tokens or the like.

 When a speech SSL model is used as the voice encoder realizing the voice encoding unit 301, the above-described voice conversion unit 202 uses, as the input voice sequence, data obtained by dividing the voice waveform of the voice data into frames. In this case, the input voice sequence is fed to the speech SSL model, and the resulting sequence of frame-level features output by the speech SSL model is input to Global Style Tokens or the like to generate the voice fixed-length vector.
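 A minimal sketch of such a voice encoder is shown below, assuming a HuBERT base model from the transformers library and a single learned attention query in place of the full Global Style Token module; the model name, the hidden size of 768, and the pooling scheme are simplifying assumptions rather than a configuration required by this disclosure.

    # Minimal sketch under stated assumptions; not the actual Global Style Token implementation.
    import torch
    import torch.nn as nn
    from transformers import HubertModel

    class VoiceEncoder(nn.Module):
        def __init__(self, ssl_name: str = "facebook/hubert-base-ls960", dim: int = 768):
            super().__init__()
            self.ssl = HubertModel.from_pretrained(ssl_name)   # speech SSL model (may be frozen; see step S105)
            self.query = nn.Parameter(torch.randn(1, 1, dim))  # learned pooling query
            self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

        def forward(self, waveform: torch.Tensor) -> torch.Tensor:
            # waveform: (batch, samples); HuBERT performs its own framing internally
            feats = self.ssl(waveform).last_hidden_state       # (batch, frames, dim) frame-level features
            pooled, _ = self.attn(self.query.expand(feats.size(0), -1, -1), feats, feats)
            return pooled.squeeze(1)                           # (batch, dim) voice fixed-length vector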

 The vector conversion unit 302 receives the voice fixed-length vector generated by the voice encoding unit 301 as input and generates a sequence of P text information vectors (hereinafter, "text information vector sequence") that can be handled by the text decoding unit 303. In other words, the vector conversion unit 302 converts the input voice fixed-length vector into a text information vector sequence. Here, P is a fixed value determined in advance. The text information vector sequence is a vector sequence representing word embedding expressions obtained by reflecting, in P vectors included in the model parameters, the features expressed by the input voice fixed-length vector. These P vectors included in the model parameters are themselves learnable parameters.

 As the conversion network realizing the vector conversion unit 302, for example, an ordinary MLP (multilayer perceptron) can be used, similarly to the Mapping Network described in Reference 4. Alternatively, a neural network capable of taking preceding and following sequence information into account, such as a Transformer encoder, or a neural network combining an MLP, a Transformer encoder, and the like can be used.
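 A minimal sketch of an MLP-based conversion network is shown below; the dimensions and the prefix length P = 10 are assumptions for illustration only.

    # Minimal sketch under stated assumptions (speech_dim, text_dim, and P are illustrative).
    import torch
    import torch.nn as nn

    class MappingNetwork(nn.Module):
        def __init__(self, speech_dim: int = 768, text_dim: int = 768, prefix_len: int = 10):
            super().__init__()
            self.prefix_len = prefix_len
            self.text_dim = text_dim
            self.mlp = nn.Sequential(
                nn.Linear(speech_dim, text_dim * prefix_len),
                nn.Tanh(),
                nn.Linear(text_dim * prefix_len, text_dim * prefix_len),
            )

        def forward(self, voice_vec: torch.Tensor) -> torch.Tensor:
            # voice_vec: (batch, speech_dim) -> (batch, P, text_dim) text information vector sequence
            return self.mlp(voice_vec).view(-1, self.prefix_len, self.text_dim)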

 The text decoding unit 303 receives the text information vector sequence converted by the vector conversion unit 302 as input and generates an output token probability sequence.

 As the text decoder realizing the text decoding unit 303, for example, a decoder-type pre-trained model trained on large-scale text data (Reference 5) can be used. Examples of such pre-trained models include large language models (LLMs) such as GPT (Generative Pre-trained Transformer).
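 The following sketch illustrates how the text information vector sequence could be fed to a pretrained decoder as prefix embeddings during learning (teacher forcing); the use of GPT-2 from the transformers library is an assumption, and the prefix dimension must match the decoder's embedding size (768 for the "gpt2" checkpoint).

    # Minimal sketch under stated assumptions; GPT-2 stands in for an arbitrary decoder-type LLM.
    import torch
    from transformers import GPT2LMHeadModel

    decoder = GPT2LMHeadModel.from_pretrained("gpt2")

    def decode_probs(prefix: torch.Tensor, caption_ids: torch.Tensor) -> torch.Tensor:
        # prefix: (batch, P, 768) text information vectors; caption_ids: (batch, L) correct caption tokens
        token_embeds = decoder.transformer.wte(caption_ids)       # embed the caption tokens
        inputs_embeds = torch.cat([prefix, token_embeds], dim=1)  # prepend the prefix vectors
        logits = decoder(inputs_embeds=inputs_embeds).logits
        # probabilities for each caption position (position P-1 predicts the first caption token)
        return torch.softmax(logits[:, prefix.size(1) - 1 : -1, :], dim=-1)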

 The text decoding unit 303 can also generate an output token sequence constituting a caption, for example, by sampling output tokens according to the generation probabilities included in the output token probability sequence.

 <Learning process according to the first embodiment>
 The learning process according to the first embodiment will be described below with reference to FIG. 5. FIG. 5 is a flowchart showing an example of the learning process according to the first embodiment. In the following, online learning is assumed as an example, and steps S101 to S105 in FIG. 5 are repeatedly executed for each piece of learning data. However, this is only an example, and the learning process may be executed by mini-batch learning, batch learning, or the like.

 First, the input unit 201 inputs one piece of learning data stored in the learning data storage unit 206 (step S101).

 Next, the voice conversion unit 202 converts the voice data included in the learning data input in step S101 into an input voice sequence (step S102).

 Next, the caption generation unit 203 generates an output token probability sequence from the input voice sequence converted in step S102 (step S103). Details of the processing in this step will be described later.

 Next, the correct-caption conversion unit 204 converts the correct caption included in the learning data input in step S101 into a correct-token probability sequence (step S104).

 Then, the parameter update unit 205 updates the model parameters using the error between the output token probability sequence generated in step S103 and the correct-token probability sequence converted in step S104 (step S105). That is, the parameter update unit 205 updates the model parameters by a known optimization method so as to minimize the error. At this time, however, the parameter update unit 205 may exclude the model parameters of the text decoder realizing the text decoding unit 303 from the update targets. Likewise, when a speech SSL model is used as the voice encoder realizing the voice encoding unit 301, the parameter update unit 205 may exclude the model parameters of the speech SSL model from the update targets. When the parameter update unit 205 does update the model parameters of the text decoder or of the speech SSL model, it may update all of those model parameters or may use an adapter method such as LoRA (Reference 6).

 For example, for i = 1, ..., I, let t(i) be the i-th output token and p(t(i)) be the generation probability (posterior probability) of the i-th output token given that the output tokens t(1) to t(i-1) have been obtained. On the other hand, let s(i) be the i-th correct token and p(s(i)) be the probability corresponding to the i-th correct token. In this case, when the cross-entropy error is used, the parameter update unit 205 updates the model parameters so as to minimize the sum of -p(s(i)) log p(t(i)) over i = 1, ..., I.
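 A direct transcription of this criterion into code might look as follows (a sketch only; in practice a library routine computing cross-entropy over logits would typically be used instead).

    # Sketch of the cross-entropy error: the sum over i of -p(s(i)) * log p(t(i)).
    import torch

    def caption_loss(output_probs: torch.Tensor, correct_probs: torch.Tensor) -> torch.Tensor:
        # output_probs, correct_probs: (I, N) for I output tokens and N token types
        return -(correct_probs * torch.log(output_probs + 1e-9)).sum()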

 By repeatedly executing steps S101 to S105 described above, a captioning model in which learned model parameters are set (that is, a trained captioning model) is obtained.

 <<Process of generating the output token probability sequence>>
 Details of the process of generating the output token probability sequence in step S103 will be described below with reference to FIG. 6. FIG. 6 is a flowchart showing an example of the process of generating the output token probability sequence according to the first embodiment.

 First, the voice encoding unit 301 of the caption generation unit 203 generates a voice fixed-length vector from the input voice sequence (step S201).

 Next, the vector conversion unit 302 of the caption generation unit 203 converts the voice fixed-length vector generated in step S201 into a text information vector sequence (step S202).

 Then, the text decoding unit 303 of the caption generation unit 203 generates an output token probability sequence from the text information vector sequence converted in step S202 (step S203).

 - During inference
 The following describes the caption generation device 10 according to the first embodiment during inference. It is assumed that the model parameters of the captioning model have already been learned at the time of inference. The description of inference mainly covers the differences from learning, and descriptions of components similar to those during learning are omitted.

 <Example functional configuration of the caption generation device 10 according to the first embodiment during inference>
 An example functional configuration of the caption generation device 10 according to the first embodiment during inference will be described with reference to FIG. 7. FIG. 7 is a diagram showing an example of the functional configuration of the caption generation device 10 according to the first embodiment during inference.

 As shown in FIG. 7, during inference, the caption generation device 10 according to the first embodiment has the input unit 201, the voice conversion unit 202, the caption generation unit 203, and an output unit 208. The output unit 208 is realized, for example, by processing that one or more programs installed in the caption generation device 10 cause the processor 108 to execute. During inference, the caption generation device 10 according to the first embodiment also has the model parameter storage unit 207.

 When voice data for which a caption is to be generated is given, the input unit 201 inputs this voice data.

 The voice conversion unit 202 converts the voice data input by the input unit 201 into an input voice sequence.

 The caption generation unit 203 is realized by the trained captioning model and generates an output token sequence from the input voice sequence converted by the voice conversion unit 202.

 The output unit 208 outputs a caption composed of the output token sequence generated by the caption generation unit 203 to a predetermined output destination. The output destination is not limited to any specific destination, and any output destination may be targeted; for example, a storage area of the auxiliary storage device 107, the display device 102 such as a display, or another device communicably connected to the caption generation device 10 may be used.

 The model parameter storage unit 207 stores the learned model parameters of the captioning model.

 <Caption generation process according to the first embodiment>
 The caption generation process according to the first embodiment will be described below with reference to FIG. 8. FIG. 8 is a flowchart showing an example of the caption generation process according to the first embodiment. In the following, it is assumed that voice data for which a caption is to be generated has been given to the caption generation device 10.

 First, the input unit 201 inputs the given voice data (step S301).

 Next, the voice conversion unit 202 converts the voice data input in step S301 into an input voice sequence (step S302).

 Next, the caption generation unit 203 generates an output token sequence from the input voice sequence converted in step S302 (step S303). Details of the processing in this step will be described later.

 Then, the output unit 208 outputs the caption composed of the output token sequence generated in step S303 to a predetermined output destination (step S304). As a result, a caption describing in natural language the paralinguistic and non-linguistic information contained in the voice represented by the given voice data is obtained.

 <<Process of generating the output token sequence>>
 Details of the process of generating the output token sequence in step S303 will be described below with reference to FIG. 9. FIG. 9 is a flowchart showing an example of the process of generating the output token sequence according to the first embodiment.

 First, the voice encoding unit 301 of the caption generation unit 203 generates a voice fixed-length vector from the input voice sequence (step S401).

 Next, the vector conversion unit 302 of the caption generation unit 203 converts the voice fixed-length vector generated in step S401 into a text information vector sequence (step S402).

 Then, the text decoding unit 303 of the caption generation unit 203 generates an output token sequence from the text information vector sequence converted in step S402 (step S403). The output token sequence can be generated, for example, by sampling output tokens according to the generation probabilities included in the output token probability sequence.
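 A minimal sketch of this sampling step is shown below; greedy selection of the highest-probability token at each position would be an equally valid alternative.

    # Sketch of sampling an output token sequence from the output token probability sequence.
    import torch

    def sample_tokens(prob_sequence: torch.Tensor) -> torch.Tensor:
        # prob_sequence: (I, N) generation probabilities for I output positions
        return torch.multinomial(prob_sequence, num_samples=1).squeeze(-1)  # (I,) token indices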

 [Second embodiment]
 The second embodiment will be described below. For example, when a classification model that classifies paralinguistic and non-linguistic information contained in voice is applied to the medical field to detect an illness or its signs from voice, the classification result of the classification model is expected to require a justification, and the natural language description expressing that justification is expected to change depending on the classification result.

 Therefore, the second embodiment describes a case in which the model parameters are learned and captions are generated while also taking into account the classification result of a classification model that classifies paralinguistic and non-linguistic information contained in voice.

 The second embodiment is described mainly in terms of its differences from the first embodiment, and descriptions of components similar to those of the first embodiment are omitted.

 - During learning
 The following describes the caption generation device 10 according to the second embodiment during learning.

 <Example functional configuration of the caption generation device 10 according to the second embodiment during learning>
 An example functional configuration of the caption generation device 10 according to the second embodiment during learning will be described with reference to FIG. 10. FIG. 10 is a diagram showing an example of the functional configuration of the caption generation device 10 according to the second embodiment during learning.

 As shown in FIG. 10, during learning, the caption generation device 10 according to the second embodiment has, in addition to the units described in the first embodiment, a classification unit 209 realized by a classification model that classifies paralinguistic and non-linguistic information contained in voice. The classification unit 209 is realized, for example, by processing that one or more programs installed in the caption generation device 10 cause the processor 108 to execute.

 The classification unit 209 generates, as a classification result, the result of classifying paralinguistic and non-linguistic information from the input voice sequence converted by the voice conversion unit 202. The classification model realizing the classification unit 209 is not limited to any specific model; any classification model capable of classifying arbitrary paralinguistic and non-linguistic information, such as illness, gender, or emotion, from the input voice sequence can be used. The classification unit 209 may be realized by a single classification model or by a plurality of classification models. When the classification unit 209 is realized by a plurality of classification models, it generates, as the classification result, the results of classifying a plurality of pieces of paralinguistic and non-linguistic information from the input voice sequence.

 The one or more classification models realizing the classification unit 209 may be trained in advance or may be trained together with the captioning model. When they are trained together with the captioning model, the learning data includes, in addition to the voice data and the correct caption, correct paralinguistic and non-linguistic information representing the correct classification result.
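 For illustration only, the following sketch shows one possible form of such a classification model: a recurrent network over the input voice sequence followed by one sigmoid output per attribute. The attributes, feature dimension, and network size are assumptions, and any classification model may be substituted.

    # Minimal sketch under stated assumptions; any classifier over the input voice sequence may be used.
    import torch
    import torch.nn as nn

    class ParalinguisticClassifier(nn.Module):
        def __init__(self, feat_dim: int = 80, num_attributes: int = 3):
            super().__init__()
            self.rnn = nn.GRU(feat_dim, 128, batch_first=True)
            self.head = nn.Linear(128, num_attributes)

        def forward(self, voice_seq: torch.Tensor) -> torch.Tensor:
            # voice_seq: (batch, frames, feat_dim) -> (batch, num_attributes) classification results in [0, 1]
            _, h = self.rnn(voice_seq)
            return torch.sigmoid(self.head(h[-1]))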

 The caption generation unit 203 generates an output token probability sequence from the input voice sequence converted by the voice conversion unit 202 and the classification result generated by the classification unit 209. A detailed example of the functional configuration of the caption generation unit 203 will be described later.

 <<Detailed example of the functional configuration of the caption generation unit 203 according to the second embodiment>>
 A detailed example of the functional configuration of the caption generation unit 203 according to the second embodiment will be described with reference to FIG. 11. FIG. 11 is a diagram showing an example of the detailed functional configuration of the caption generation unit 203 according to the second embodiment. As in the first embodiment, the following describes, as an example, a case in which the captioning model is composed of three machine learning models: a voice encoder, a conversion network, and a text decoder.

 As shown in FIG. 11, the caption generation unit 203 according to the second embodiment is composed of the voice encoding unit 301 realized by the voice encoder, the vector conversion unit 302 realized by the conversion network, and the text decoding unit 303 realized by the text decoder.

 The vector conversion unit 302 receives, as input, the voice fixed-length vector generated by the voice encoding unit 301 and the classification result generated by the classification unit 209, and generates a text information vector sequence. In other words, the vector conversion unit 302 converts the input voice fixed-length vector and the input classification result into a text information vector sequence. Specifically, the vector conversion unit 302 takes a vector obtained by concatenating the voice fixed-length vector generated by the voice encoding unit 301 and the classification result generated by the classification unit 209 as a new voice fixed-length vector, and then generates a text information vector sequence from this new vector in the same manner as the vector conversion unit 302 according to the first embodiment. This makes it possible to generate a text information vector sequence that takes into account not only the voice fixed-length vector but also the classification result of the classification model, and consequently to generate a caption composed of an output token sequence that takes both into account.

 The vector obtained by concatenating the voice fixed-length vector and the classification result is, for example, when the voice fixed-length vector is an M-dimensional vector and there are L classification results each of which is a scalar taking the binary value 0 or 1, an (M+L)-dimensional vector whose elements are the elements of the voice fixed-length vector and the L classification results.
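 As a small sketch of this concatenation (the dimensions M = 768 and L = 3 are assumptions chosen for illustration):

    # Sketch of forming the new voice fixed-length vector from an M-dimensional vector and L scalar results.
    import torch

    voice_vec = torch.randn(1, 768)                  # M = 768 (assumed)
    class_results = torch.tensor([[1.0, 0.0, 1.0]])  # L = 3 binary classification results (assumed)
    combined = torch.cat([voice_vec, class_results], dim=-1)  # shape (1, 771), i.e. M + L dimensions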

 <Learning process according to the second embodiment>
 The learning process according to the second embodiment will be described below with reference to FIG. 12. FIG. 12 is a flowchart showing an example of the learning process according to the second embodiment. In the following, online learning is assumed as an example, and steps S501 to S506 in FIG. 12 are repeatedly executed for each piece of learning data.

 Steps S501 and S502 in FIG. 12 are the same as steps S101 and S102 in FIG. 5, respectively, and their description is therefore omitted.

 Following step S502, the classification unit 209 generates, as a classification result, the result of classifying paralinguistic and non-linguistic information from the input voice sequence converted in step S502 (step S503).

 Next, the caption generation unit 203 generates an output token probability sequence from the input voice sequence converted in step S502 and the classification result generated in step S503 (step S504). Details of the processing in this step will be described later.

 Next, as in step S104 of FIG. 5, the correct-caption conversion unit 204 converts the correct caption included in the learning data input in step S501 into a correct-token probability sequence (step S505).

 Then, as in step S105 of FIG. 5, the parameter update unit 205 updates the model parameters using the error between the output token probability sequence generated in step S504 and the correct-token probability sequence converted in step S505 (step S506). However, when the one or more classification models realizing the classification unit 209 are also trained, the parameter update unit 205 additionally uses the error between the correct paralinguistic and non-linguistic information included in the learning data input in step S501 and the paralinguistic and non-linguistic information represented by the classification result generated in step S503, and updates the learnable parameters of the one or more classification models realizing the classification unit 209 in addition to the model parameters.
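 One way this joint update could be written is sketched below; the equal weighting of the two error terms is an assumption, as this disclosure does not fix a particular way of combining them.

    # Sketch of a joint loss: caption error plus classification error (weighting is an assumption).
    import torch
    import torch.nn.functional as F

    def joint_loss(output_probs, correct_probs, predicted_labels, correct_labels, weight: float = 1.0):
        caption_err = -(correct_probs * torch.log(output_probs + 1e-9)).sum()
        class_err = F.binary_cross_entropy(predicted_labels, correct_labels)
        return caption_err + weight * class_err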

 By repeatedly executing steps S501 to S506 described above, a captioning model in which learned model parameters are set (that is, a trained captioning model) is obtained.

 <<Process of generating the output token probability sequence>>
 Details of the process of generating the output token probability sequence in step S504 will be described below with reference to FIG. 13. FIG. 13 is a flowchart showing an example of the process of generating the output token probability sequence according to the second embodiment.

 First, as in step S201 of FIG. 6, the voice encoding unit 301 of the caption generation unit 203 generates a voice fixed-length vector from the input voice sequence (step S601).

 Next, the vector conversion unit 302 of the caption generation unit 203 converts the voice fixed-length vector generated in step S601 and the classification result generated in step S503 of FIG. 12 into a text information vector sequence (step S602).

 Then, as in step S203 of FIG. 6, the text decoding unit 303 of the caption generation unit 203 generates an output token probability sequence from the text information vector sequence converted in step S602 (step S603).

 - During inference
 The following describes the caption generation device 10 according to the second embodiment during inference. It is assumed that both the model parameters of the captioning model and the learnable parameters of the classification model have already been learned at the time of inference. The description of inference mainly covers the differences from learning, and descriptions of components similar to those during learning are omitted.

 <Example functional configuration of the caption generation device 10 according to the second embodiment during inference>
 An example functional configuration of the caption generation device 10 according to the second embodiment during inference will be described with reference to FIG. 14. FIG. 14 is a diagram showing an example of the functional configuration of the caption generation device 10 according to the second embodiment during inference.

 As shown in FIG. 14, during inference, the caption generation device 10 according to the second embodiment has the classification unit 209 in addition to the units described in the first embodiment.

 The classification unit 209 generates, as a classification result, the result of classifying paralinguistic and non-linguistic information from the input voice sequence converted by the voice conversion unit 202.

 The caption generation unit 203 is realized by the trained captioning model and generates an output token sequence from the input voice sequence converted by the voice conversion unit 202 and the classification result generated by the classification unit 209.

 <Caption generation process according to the second embodiment>
 The caption generation process according to the second embodiment will be described below with reference to FIG. 15. FIG. 15 is a flowchart showing an example of the caption generation process according to the second embodiment. In the following, it is assumed that voice data for which a caption is to be generated has been given to the caption generation device 10.

 Steps S701 and S702 in FIG. 15 are the same as steps S301 and S302 in FIG. 8, respectively, and their description is therefore omitted.

 Following step S702, the classification unit 209 generates, as a classification result, the result of classifying paralinguistic and non-linguistic information from the input voice sequence converted in step S702 (step S703).

 Next, the caption generation unit 203 generates an output token sequence from the input voice sequence converted in step S702 and the classification result generated in step S703 (step S704).

 Then, as in step S304 of FIG. 8, the output unit 208 outputs the caption composed of the output token sequence generated in step S704 to a predetermined output destination (step S705). As a result, a caption describing in natural language the paralinguistic and non-linguistic information contained in the voice represented by the given voice data is obtained, while also taking into account the classification result obtained from that voice data by the classification model.

 <<Process of generating the output token sequence>>
 Details of the process of generating the output token sequence in step S704 will be described below with reference to FIG. 16. FIG. 16 is a flowchart showing an example of the process of generating the output token sequence according to the second embodiment.

 First, as in step S401 of FIG. 9, the voice encoding unit 301 of the caption generation unit 203 generates a voice fixed-length vector from the input voice sequence (step S801).

 Next, the vector conversion unit 302 of the caption generation unit 203 converts the voice fixed-length vector generated in step S801 and the classification result generated in step S703 of FIG. 15 into a text information vector sequence (step S802).

 Then, as in step S403 of FIG. 9, the text decoding unit 303 of the caption generation unit 203 generates an output token sequence from the text information vector sequence converted in step S802 (step S803).

 [Summary]
 As described above, the caption generation device 10 according to the first embodiment learns a captioning model from learning data that includes voice data and a correct caption describing in natural language the paralinguistic and non-linguistic information contained in the voice represented by the voice data. The caption generation device 10 according to the first embodiment can then use the trained captioning model to generate a caption describing in natural language the paralinguistic and non-linguistic information contained in the voice represented by given voice data. This makes it possible to output the paralinguistic and non-linguistic information contained in voice in the highly interpretable form of a caption.

 The caption generation device 10 according to the second embodiment also learns the captioning model using the classification result of a classification model that classifies paralinguistic and non-linguistic information from the voice represented by the voice data. The caption generation device 10 according to the second embodiment can then use the trained captioning model and the classification model to generate a caption for the given voice data and to output the paralinguistic and non-linguistic information as a classification result. This makes it possible to obtain both the paralinguistic and non-linguistic information contained in the voice and its caption.

 The present invention is not limited to the specifically disclosed embodiments described above, and various modifications, changes, combinations with known techniques, and the like are possible without departing from the scope of the claims.

 [References]
 Reference 1: Y. Wang, D. Stanton, Y. Zhang, R.-S. Ryan, E. Battenberg, J. Shor, et al., "Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis," in Proc. ICML, 2018.
 Reference 2: Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed, "HuBERT: Self-supervised speech representation learning by masked prediction of hidden units," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451-3460, 2021.
 Reference 3: Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al., "WavLM: Large-scale self-supervised pre-training for full stack speech processing," IEEE J-STSP, vol. 16, no. 6, pp. 1505-1518, 2022.
 Reference 4: R. Mokady, A. Hertz, and A. H. Bermano, "ClipCap: CLIP prefix for image captioning," arXiv preprint arXiv:2111.09734, 2021.
 Reference 5: T. Brown, B. Mann, N. Ryder, et al., "Language models are few-shot learners," in Proc. NeurIPS, 2020.
 Reference 6: E. J. Hu, Y. Shen, P. Wallis, et al., "LoRA: Low-rank adaptation of large language models," in Proc. ICLR, 2022.

 10    Caption generation device
 101   Input device
 102   Display device
 103   External I/F
 103a  Recording medium
 104   Communication I/F
 105   RAM
 106   ROM
 107   Auxiliary storage device
 108   Processor
 109   Bus
 201   Input unit
 202   Voice conversion unit
 203   Caption generation unit
 204   Correct-caption conversion unit
 205   Parameter update unit
 206   Learning data storage unit
 207   Model parameter storage unit
 208   Output unit
 209   Classification unit

Claims (8)

 1. A learning device comprising:
 an input unit that inputs learning data including speech data and teacher data representing a natural language description of paralinguistic information or non-linguistic information contained in the speech represented by the speech data;
 a calculation unit that receives, as input, a speech sequence obtained by converting the speech data into a predetermined format for each predetermined unit and calculates, using a machine learning model that calculates the generation probability of a natural language description of paralinguistic information or non-linguistic information contained in speech, the generation probability of a natural language description of the paralinguistic information or non-linguistic information contained in the speech represented by the speech data from which the input speech sequence was converted; and
 an update unit that updates a learnable parameter of the machine learning model based on an error between a natural language description generated according to the generation probability and the teacher data included in the learning data.
 2. The learning device according to claim 1, wherein the calculation unit
 converts the input speech sequence into fixed-length first information representing features of the speech sequence,
 converts the first information into a sequence of second information representing word embedding expressions that reflect the features of the first information, and
 calculates, from the sequence of second information, the generation probability of the sequence of characters for each predetermined unit included in the natural language description.
 3. The learning device according to claim 2, further comprising a discrimination unit that receives the speech sequence as input and discriminates, using a discrimination model that discriminates paralinguistic information or non-linguistic information contained in speech, the paralinguistic information or non-linguistic information contained in the speech represented by the speech data from which the input speech sequence was converted,
 wherein the calculation unit converts third information, composed of the first information and information representing the paralinguistic information or non-linguistic information discriminated by the discrimination unit, into a sequence of second information representing word embedding expressions that reflect features of the third information.
 4. A generation device comprising a generation unit that receives, as input, a speech sequence obtained by converting given speech data into a predetermined format for each predetermined unit and generates, using a trained machine learning model that generates a natural language description of paralinguistic information or non-linguistic information contained in speech, a natural language description of the paralinguistic information or non-linguistic information contained in the speech represented by the speech data from which the input speech sequence was converted.
 5. The generation device according to claim 4, further comprising a discrimination unit that receives the speech sequence as input and discriminates, using a discrimination model that discriminates paralinguistic information or non-linguistic information contained in speech, the paralinguistic information or non-linguistic information contained in the speech represented by the speech data from which the input speech sequence was converted,
 wherein the generation unit generates the natural language description of the paralinguistic information or non-linguistic information based on the paralinguistic information or non-linguistic information discriminated by the discrimination unit.
 6. A learning method executed by a computer, comprising:
 an input step of inputting learning data including speech data and teacher data representing a natural language description of paralinguistic information or non-linguistic information contained in the speech represented by the speech data;
 a calculation step of receiving, as input, a speech sequence obtained by converting the speech data into a predetermined format for each predetermined unit and calculating, using a machine learning model that calculates the generation probability of a natural language description of paralinguistic information or non-linguistic information contained in speech, the generation probability of a natural language description of the paralinguistic information or non-linguistic information contained in the speech represented by the speech data from which the input speech sequence was converted; and
 an update step of updating a learnable parameter of the machine learning model based on an error between a natural language description generated according to the generation probability and the teacher data included in the learning data.
 7. A generation method executed by a computer, comprising a generation step of receiving, as input, a speech sequence obtained by converting given speech data into a predetermined format for each predetermined unit and generating, using a trained machine learning model that generates a natural language description of paralinguistic information or non-linguistic information contained in speech, a natural language description of the paralinguistic information or non-linguistic information contained in the speech represented by the speech data from which the input speech sequence was converted.
 8. A program that causes a computer to function as the learning device according to any one of claims 1 to 3 or as the generation device according to claim 4 or 5.
PCT/JP2023/040613 2023-11-10 2023-11-10 Learning device, generation device, learning method, generation method, and program Pending WO2025099939A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2023/040613 WO2025099939A1 (en) 2023-11-10 2023-11-10 Learning device, generation device, learning method, generation method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2023/040613 WO2025099939A1 (en) 2023-11-10 2023-11-10 Learning device, generation device, learning method, generation method, and program

Publications (1)

Publication Number Publication Date
WO2025099939A1 true WO2025099939A1 (en) 2025-05-15

Family

ID=95695273

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/040613 Pending WO2025099939A1 (en) 2023-11-10 2023-11-10 Learning device, generation device, learning method, generation method, and program

Country Status (1)

Country Link
WO (1) WO2025099939A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020085070A1 (en) * 2018-10-22 2020-04-30 日本電信電話株式会社 Paralanguage information estimation device, method for estimating paralanguage information, and program
US10878840B1 (en) * 2019-10-15 2020-12-29 Audio Analytic Ltd Method of recognising a sound event
JP2021124642A (en) * 2020-02-06 2021-08-30 本田技研工業株式会社 Information processing device, vehicle, program, and information processing method
JP2022507189A (en) * 2019-04-17 2022-01-18 ▲騰▼▲訊▼科技(深▲セン▼)有限公司 Hidden state generation method and device in recurrent neural network for language processing
JP2023539397A (en) * 2020-06-22 2023-09-14 エスアールアイ インターナショナル Controllable natural paralanguage for text-to-speech synthesis

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020085070A1 (en) * 2018-10-22 2020-04-30 日本電信電話株式会社 Paralanguage information estimation device, method for estimating paralanguage information, and program
JP2022507189A (en) * 2019-04-17 2022-01-18 ▲騰▼▲訊▼科技(深▲セン▼)有限公司 Hidden state generation method and device in recurrent neural network for language processing
US10878840B1 (en) * 2019-10-15 2020-12-29 Audio Analytic Ltd Method of recognising a sound event
JP2021124642A (en) * 2020-02-06 2021-08-30 本田技研工業株式会社 Information processing device, vehicle, program, and information processing method
JP2023539397A (en) * 2020-06-22 2023-09-14 エスアールアイ インターナショナル Controllable natural paralanguage for text-to-speech synthesis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DROSSOS, KONSTANTINOS ET AL.: "Automated audio captioning with recurrent neural networks", IEEE WORKSHOP ON APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS (WASPAA), 2017, pages 374-378, XP033264965, Retrieved from the Internet <URL:https://ieeexplore.ieee.org/abstract/document/8170058> [retrieved on 20240117], DOI: 10.1109/WASPAA.2017.8170058 *

Similar Documents

Publication Publication Date Title
Vashisht et al. Speech recognition using machine learning
CN113692616B (en) Phoneme-based contextualization for cross-language speech recognition in end-to-end models
US10796105B2 (en) Device and method for converting dialect into standard language
CN105679317B (en) Method and apparatus for training language models and recognizing speech
EP3826007B1 (en) Method and apparatus with speech processing
US10963819B1 (en) Goal-oriented dialog systems and methods
JP7678227B2 (en) Joint Unsupervised and Supervised Training (JUST) for Multilingual Automatic Speech Recognition
JP7418991B2 (en) Speech recognition method and device
Dubey et al. Deep speech based end-to-end automated speech recognition (asr) for indian-english accents
US20240304178A1 (en) Using text-injection to recognize speech without transcription
Sen et al. Speech processing and recognition system
Liang et al. A hybrid CTC+ Attention model based on end-to-end framework for multilingual speech recognition
Nigar et al. An intelligent framework based on deep learning for online Quran learning during pandemic
US12073825B2 (en) Method and apparatus for speech recognition
US20250279089A1 (en) Using Synthetic Data to Improve Word Error Rate of Differentially Private ASR Models
US20250217638A1 (en) Methods and systems for speech emotion retrieval via natural language prompts
Santos et al. Automatic Speech Recognition: Comparisons Between Convolutional Neural Networks, Hidden Markov Model and Hybrid Architecture
WO2025099939A1 (en) Learning device, generation device, learning method, generation method, and program
Hsu et al. Niesr: Nuisance invariant end-to-end speech recognition
Baranwal et al. Improved Mispronunciation detection system using a hybrid CTC-ATT based approach for L2 English speakers
Kim End-to-end speech recognition on conversations
Soundarya et al. Analysis of mispronunciation detection and diagnosis based on conventional deep learning techniques
Tornay et al. Subunits inference and lexicon development based on pairwise comparison of utterances and signs
Rudrappa et al. KHiTE: Multilingual Speech Acquisition to Monolingual Text Translation
Nagano et al. Unsupervised phoneme and word acquisition from continuous speech based on a hierarchical probabilistic generative model

Legal Events

Date Code Title Description
121 Ep: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23958338

Country of ref document: EP

Kind code of ref document: A1