WO2024182135A1 - Knowledge distillation from non-streaming to streaming encoder
- Publication number
- WO2024182135A1 (PCT/US2024/016101)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- streaming
- model
- loss
- streaming model
- encoder
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/065—Adaptation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L2015/088—Word spotting
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
Definitions
- This disclosure relates to non-streaming and streaming model encoders.
- Automatic speech recognition (ASR) may be used by digital assistants. Such digital assistants may include stand-alone virtual assistant devices, smartphone applications, or the like.
- This disclosure relates generally to techniques and devices for speech related streaming models, such as ASR models, and to training techniques for such models.
- Various aspects of the techniques of this disclosure may provide for improved streaming model performance. While the techniques of this disclosure are generally discussed in terms of ASR models, these techniques may be applicable to any speech related models that may be categorized as either non-streaming or streaming.
- There may be an information gap between non-streaming ASR models and streaming ASR models, with non-streaming ASR models normally performing better than streaming ASR models.
- Non-streaming ASR models may have issues as well. Processing associated with streaming ASR models typically has much lower latency because there is no need to wait for the speech (e.g., an utterance) to end before processing of the captured utterance begins. Therefore, non-streaming ASR models may not be desirable for on-device (e.g., not in a cloud computing environment), real-time ASR, as the latency attributes of streaming ASR models may be better suited for on-device, real-time ASR.
- Knowledge distillation (KD) may be used to transfer learned knowledge from a teacher model to a student model.
- For example, a streaming ASR model student may be trained by applying KD techniques from a non-streaming ASR model teacher.
- a streaming ASR model (the streaming ASR model student) may mimic behavior of a non-streaming ASR model teacher.
- a system may apply KD only to an encoder of the system, which may be a part (e.g., not all) of the entire model.
- the techniques of this disclosure may result in faster training and/or processing, may not require labeling of all data, and may not result in output misalignment between the non-streaming model teacher and the streaming model student.
- various aspects of the techniques are directed to a device including memory configured to store a speech signal representative of speech and a streaming model, the streaming model including an on-device, real-time streaming model; one or more processors implemented in circuitry coupled to the memory, the one or more processors being configured to: determine one or more words in the speech signal based on one or more transfers of learned knowledge from a non-streaming model to the streaming model; and take an action based on the determined one or more words.
- various aspects of the techniques are directed to a method including determining one or more words in a speech signal based on one or more transfers of learned knowledge from a non-streaming model to a streaming model, the streaming model including an on-device, real-time streaming model; and taking an action based on the determined one or more words.
- various aspects of the techniques are directed to a method including transferring learned knowledge from a non-streaming model to an on-device, real-time streaming model.
- various aspects of the techniques are directed to a non- transitory computer-readable storage medium having stored thereon instructions that, when executed, cause one or more processors to determine one or more words in a speech signal based on one or more transfers of learned knowledge from a non-streaming model to a streaming model, the streaming model including an on-device, real-time streaming model; and take an action based on the determined one or more words.
- various aspects of the techniques are directed to a device including means for determining one or more words in a speech signal based on one or more transfers of learned knowledge from a non-streaming model to a streaming model, the streaming model including an on-device, real-time streaming model; and means for taking an action based on the determined one or more words.
- FIG. 1 is a block diagram of an example system for automatic speech recognition according to the techniques of this disclosure.
- FIG. 2 is a block diagram illustrating an implementation of a system for training a streaming model according to the techniques of this disclosure.
- FIG. 3 is a block diagram illustrating example application of KD from a non-streaming ASR model encoder teacher to a streaming ASR model encoder student according to the techniques of this disclosure.
- FIG. 4 is a conceptual diagram illustrating an example use of a KD loss function according to the techniques of this disclosure.
- FIG. 5 is a conceptual diagram illustrating example transformer attention masks according to the techniques of this disclosure.
- FIG. 6 is a chart illustrating test results of streaming ASR model encoders.
- FIG. 7 is a block diagram illustrating an example of a device according to the techniques of this disclosure.
- FIG. 8 is a flow diagram illustrating example techniques for KD from a non-streaming to a streaming encoder according to one or more aspects of this disclosure.
- ASR models may be categorized into two groups: non-streaming ASR models or streaming ASR models.
- Non-streaming ASR models may use an entire captured audio signal to transcribe a captured utterance (e.g., phrase, sentence, command, query, etc.).
- An utterance may be a continuous piece of speech or an uninterrupted chain of spoken language which may begin and/or end with a pause.
- an utterance may be a word, a sentence, or a sentence fragment (e.g., one or more words).
- the entire captured audio signal may include the captured utterance.
- a non-streaming ASR model may not make an inference about the captured audio signal (e.g., what word(s) were in the spoken utterance) until the entire utterance is captured.
- a non-streaming ASR model may be implemented as a neural network having a plurality of non-streaming layers. Each layer of the non-streaming ASR model may perform a specific task. Such layers may include an input layer, an output layer, and one or more hidden layers.
- Streaming ASR models may only use the past context and, thus, may not need to use the entire captured utterance.
- streaming ASR models may make an inference in real-time based on a portion of an utterance captured up to that point in time.
- streaming ASR models may update the inference as more of the utterance is captured.
- a streaming ASR model may be implemented as a neural network having a plurality of streaming layers. Speech related models other than ASR models may also be categorized as non-streaming or streaming. Each layer of the streaming ASR model may perform a specific task. Such layers may include an input layer, an output layer, and one or more hidden layers.
- non-streaming ASR models normally perform better than streaming ASR models.
- processing associated with streaming ASR models typically has a much lower latency because there is no need to wait for the speech (e.g., utterance) to end prior to starting the processing of the captured utterance. Therefore, the latency attributes of streaming ASR models are desirable for on-device real-time ASR.
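- For illustration only, the following Python sketch (with hypothetical function names that do not appear in this disclosure) contrasts the two processing styles: a non-streaming recognizer waits for the entire utterance before decoding, while a streaming recognizer emits, and may revise, a hypothesis as each chunk of past context arrives.

```python
# Hypothetical sketch contrasting non-streaming and streaming recognition.
# The function names and the `encode`/`decode` callables passed in are
# illustrative stand-ins only and are not taken from this disclosure.
from typing import Callable, Iterable, Iterator, List


def non_streaming_recognize(full_audio: List[float],
                            encode: Callable, decode: Callable) -> str:
    """Waits for the whole utterance, then produces a single final hypothesis."""
    features = encode(full_audio)          # sees the entire captured signal at once
    return decode(features)


def streaming_recognize(chunks: Iterable[List[float]],
                        encode: Callable, decode: Callable) -> Iterator[str]:
    """Emits (and may revise) a partial hypothesis as each chunk arrives."""
    past_context: List[float] = []
    for chunk in chunks:
        past_context.extend(chunk)         # only past context is available
        yield decode(encode(past_context)) # low-latency partial hypothesis
```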
- KD techniques may be used to transfer learned knowledge from a non-streaming ASR model to a streaming ASR model in an attempt to improve the performance of a streaming ASR model while maintaining the latency benefits of the streaming ASR model.
- In some KD approaches, KD is applied to final output probabilities. Such approaches may have drawbacks: (1) all data may require labeling (e.g., a text transcription of captured audio data); and (2) the output data will likely be misaligned between the non-streaming ASR model teacher and the streaming ASR model student, which may negatively affect further training.
- This disclosure relates to systems, devices, and techniques for applying, and that may result from applying, KD only to encoders of ASR models.
- Such an encoder may be a part (e.g., not all) of an entire ASR model.
- the techniques of this disclosure may result in faster training and/or processing, may not require labeling of all data, and may not result in output misalignment between the non-streaming ASR model teacher and the streaming ASR model student.
- Such techniques may include using auxiliary non-streaming layers during training. Additionally, or alternatively, the system may include a specialized loss function for KD from the non-streaming ASR model teacher to streaming ASR model student.
- the techniques of this disclosure may achieve a clear margin of improvement compared to other techniques and may not require labeled data, thereby fundamentally removing the heavy data labeling cost.
- the techniques of this disclosure may provide no additional overhead for the inference stage (e.g., the streaming ASR model making inferences after training).
- the techniques of this disclosure may improve on-device, real-time streaming ASR models, for example, running on smartphones or other devices.
- the techniques of this disclosure may improve streaming ASR model performance for various speech-related tasks, such as keyword detection, voice assistance, speaker verification, or the like.
- a system of the present disclosure is configured to automatically recognize speech and take an action based on the recognized speech (e.g., elements of the recognized speech, such as words of an utterance).
- the system may be integrated into a device, such as a mobile device, a smart speaker system (e.g., a speaker within a user’s home that is capable of playing audio, receiving spoken user commands, and performing actions based on the user commands), a vehicle, a robot, or the like.
- a user may be in the kitchen using cutlery and speak the command “turn on the kitchen light.”
- the system may receive audio data that corresponds to the user’s speech (e.g., “turn on the kitchen light”).
- the system may identify the words within the utterance “turn on the kitchen light” and respond by taking the action of turning on the kitchen light.
- FIG. 1 is a block diagram of an example system for automatic speech recognition according to the techniques of this disclosure.
- a system 100 includes a processor 102, memory 104 coupled to the processor 102, a microphone 110, a transmitter 140, and a receiver 142.
- the transmitter 140 and the receiver 142 may be configured to facilitate the interaction of system 100 with a second device 144 (e.g., the kitchen light switch or another device).
- the system 100 may optionally include an interface device 106, a display device 108, a camera 116, and a position sensor 118.
- the system 100 is implemented in a smart speaker system (e.g., a wireless speaker and voice command device that is integrated with a virtual assistant).
- the system 100 is implemented in a mobile device, such as a mobile phone (e.g., a smartphone), a laptop computer, a tablet computer, a computerized watch, etc.
- the system 100 is implemented in one or more Internet of Things (loT) devices, such as smart appliances or the like.
- the system 100 is implemented in a vehicle, such as an automobile or a self-driving vehicle or the like.
- the system 100 is implemented in a robot.
- the system 100 is configured to perform automatic speech recognition and take an action based on the recognized speech.
- the processor 102 is configured to automatically recognize speech, such as a spoken utterance, and to perform one or more tasks based on the recognized speech. Such tasks may include processing of recognized speech into text, responding to commands, responding to queries (such as a request for information from the Internet by retrieving information therefrom), or the like. For example, processor 102 may be configured to identify a particular spoken word or words of an utterance and take an action based on the identity of the word(s).
- the memory 104 is configured to store a streaming ASR model 120 including an encoder 121.
- Although the streaming ASR model 120 is illustrated as being stored at the memory 104 of the system 100 (e.g., an on-device model), in other implementations, the streaming ASR model 120 (or a portion thereof) may be stored remotely in network-based storage (e.g., "the cloud").
- the encoder 121 may include a plurality of streaming layers. In some examples, the encoder 121 does not include any non-streaming layers. For example, when the encoder 121 has already been trained, the encoder 121 may not include any non-streaming layers.
- the microphone 110 (which may be one or more microphones) is configured to capture an audio input 112 (e.g., speech) and to generate input audio data 114 (e.g., a speech signal) based on the audio input 112.
- the audio input 112 may include an utterance (e.g., speech) from a speaker (e.g., a person).
- the processor 102 is configured to automatically recognize the input audio data 114.
- the processor 102 may execute the streaming ASR model 120 to automatically recognize the input audio data 114.
- the processor 102 may compare the input audio data 114 (or a portion thereof) to known models for different words as part of automatically recognizing the words represented by the input audio data 114. Such models may be learned by the streaming ASR model 120 as described in this disclosure.
- the processor 102 may take an action based on the recognized input audio data 114.
- the processor 102 may process recognized speech into text, respond to commands, respond to queries, or the like.
- the action may include converting the recognized input audio data 114 into a text string 150.
- the processor 102 may be configured to perform speech to text conversion on the input audio data 114 to convert the input audio data 114, or a portion thereof that includes speech, into the text string 150.
- the text string 150 may include a textual representation of the speech included in the input audio data 114.
- FIG. 2 is a block diagram illustrating an implementation of a system for training a streaming model according to the techniques of this disclosure.
- the system 200 may include or correspond to the system 100 or portions thereof.
- the elements of the system 200 may include or correspond to hardware within the processor 102.
- Streaming ASR model 210 may be an example of streaming ASR model 120.
- Each of the elements of the system 200 may be represented in hardware, such as via an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA), or the operations described with reference to the elements may be performed by one or more processors executing computer-readable instructions.
- the system 200 includes a streaming ASR model 210.
- the streaming ASR model 210 includes an encoder 202 and an inferencer 204.
- the encoder 202 may include a plurality of streaming layers and may be configured to receive input audio data, such as the input audio data 114 and to encode the input audio data 114 to be used by inferencer 204.
- Inferencer 204 may use the encoded input audio data to perform inferencing.
- the inferencing may include operations to identify words within speech of audio inputs.
- the streaming ASR model 210 may be trained using an already trained non-streaming ASR model 220, for example using KD techniques.
- an encoder 214 of the non-streaming ASR model 220, which includes a plurality of non-streaming layers, may be used to train the encoder 202.
- the training of the encoder 202 by the encoder 214 may utilize a plurality of auxiliary non-streaming layers 230.
- After training, the non-streaming ASR model 220 and the auxiliary non-streaming layers 230 may be removed from the system 200 such that, when the streaming ASR model 210 receives the input audio data, the streaming ASR model 210 may automatically recognize speech in the input audio data and the system 200 may take an action 208 based on the recognized speech.
- Such automatic speech recognition may be undertaken by the streaming ASR model 210 without any additional overhead (e.g., processing power, memory usage, latency, etc.) that may be associated with the auxiliary non-streaming layers 230 or the non-streaming ASR model 220.
- the streaming ASR model 210 may perform the automatic speech recognition on-device (e.g., on system 100) and in real-time.
- FIG. 3 is a block diagram illustrating example application of KD from a non-streaming ASR model encoder teacher to a streaming ASR model encoder student according to the techniques of this disclosure.
- System 300 includes a non-streaming ASR model 310, auxiliary non-streaming layers 304, and a streaming ASR model 312.
- System 300 may represent an example of system 200 during training, with the non-streaming ASR model 310 representing the non-streaming ASR model 220, the auxiliary non-streaming layers 304 representing the auxiliary non-streaming layers 230, and the streaming ASR model 312 representing the streaming ASR model 210.
- the non-streaming ASR model 310 may include an encoder 308, which may represent the encoder 214.
- the streaming ASR model 312 may include an encoder 302, which may represent the encoder 202.
- the encoder 302 may be trained using KD techniques by the encoder 308 via the auxiliary non-streaming layers 304.
- the encoder 302 may be trained through layer-wise distillation of selected layers (represented by the layers pointed to by the dotted lines).
- The encoder 308 and the encoder 302 may receive the same input data, which may include labeled data (e.g., a spoken word accompanied by a label identifying the word) and/or unlabeled data (e.g., a spoken word without an accompanying label).
- the KD techniques may be utilized without including any labeled data as input data for training the encoder 302.
- the auxiliary non-streaming layers 304 may be inserted between the encoder 308 and the encoder 302. In some examples, the auxiliary non-streaming layers 304 may be inserted between the encoder 302 and the encoder 308 on the streaming ASR model 312. In some examples, the auxiliary non-streaming layers 304 may be inserted between the encoder 302 and the encoder 308 on the non-streaming ASR model 310. In some examples, the auxiliary non-streaming layers 304 may be inserted between the encoder 302 and the encoder 308 in a location other than on the non-streaming ASR model 310 and the streaming ASR model 312. Once the encoder 302 is trained, the auxiliary non-streaming layers 304 may be removed so as to not affect the overhead of operating the streaming ASR model 312 when operating to automatically recognize speech (e.g., to make inferences).
- the trained encoder 308 of the non-streaming ASR model 310 may function as a teacher, while the encoder 302 may function as a student.
- the layers of the encoder 302 may include only streaming layers, as shown.
- During training, the system 300 may use at least one of two losses: an ASR loss and a KD loss.
- the ASR loss may be used to train the streaming ASR model 312 to accurately transcribe the speech, while the KD loss may be used to influence the encoder 302 (e.g., the student) to better follow the behavior of the encoder 308 (e.g., the teacher).
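- A minimal PyTorch-style sketch of this training arrangement is shown below; the stand-in modules, dimensions, learning rate, and the simple mean-squared-error stand-in for the KD loss are assumptions made for illustration, not the disclosed implementation.

```python
# Illustrative sketch only: frozen non-streaming teacher encoder, streaming
# student encoder, and auxiliary non-streaming layers used only during training.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM, HID = 80, 256

# Teacher: already-trained non-streaming (bidirectional) encoder, kept frozen.
teacher_encoder = nn.GRU(FEAT_DIM, HID, num_layers=2, bidirectional=True, batch_first=True)
# Student: streaming (unidirectional) encoder that only sees past context.
student_encoder = nn.GRU(FEAT_DIM, HID, num_layers=2, bidirectional=False, batch_first=True)
# Auxiliary non-streaming layers inserted for training and removed afterwards.
aux_layer = nn.TransformerEncoderLayer(d_model=HID, nhead=4, batch_first=True)
aux_nonstreaming = nn.TransformerEncoder(aux_layer, num_layers=2)

for p in teacher_encoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(
    list(student_encoder.parameters()) + list(aux_nonstreaming.parameters()), lr=1e-4
)


def training_step(features, targets=None, asr_loss_fn=None, kd_weight=1.0):
    """One training step combining an (optional) ASR loss with a KD loss."""
    student_feats, _ = student_encoder(features)      # (B, T, HID), streaming
    aux_feats = aux_nonstreaming(student_feats)       # non-streaming view of student
    with torch.no_grad():
        teacher_feats, _ = teacher_encoder(features)  # (B, T, 2*HID), non-streaming
    teacher_proj = teacher_feats[..., :HID]           # crude dimension match for this sketch
    kd_loss = F.mse_loss(aux_feats, teacher_proj)     # stand-in for the specialized KD loss
    loss = kd_weight * kd_loss
    if asr_loss_fn is not None and targets is not None:  # labeled data: also apply the ASR loss
        loss = loss + asr_loss_fn(student_feats, targets)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return float(loss.detach())


# Example usage with random features (batch of 2 utterances, 100 frames each):
dummy = torch.randn(2, 100, FEAT_DIM)
print(training_step(dummy))
```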
- FIG. 4 is a conceptual diagram illustrating an example use of a KD loss function by a system according to the techniques of this disclosure.
- the system 400 may be an example of the system 300 of FIG. 3.
- the system 400 may use a specialized KD loss for streaming ASR training.
- a KD loss function may be determined or computed as a weighted sum of up to three losses.
- the three losses may include a distance (DIS) loss, a Kullback-Leibler divergence (KLD) loss, and an autoregressive predictive coding (APC) loss.
- the KLD loss may be applied between a non-streaming layer of the encoder 308 and an associated non-streaming layer of the auxiliary non-streaming layers 304.
- the DIS loss may be applied prior to a shift of N steps by the encoder 308 and a unidirectional long short-term memory (LSTM) layer by the encoder 302.
- the APC loss may be applied after the shift of N steps 402 by the encoder 308 and the unidirectional LSTM layer 404 by the encoder 302.
- the DIS loss may be determined, for example, as L_DIS = (1/(L·D)) Σ_{t=1}^{L} ‖h_t^teacher − h_t^student‖², where h is an output feature sequence of the teacher/student layer, D is a feature dimension, t is an index of each element in a sequence, and L is a number of elements in the sequence.
- the APC loss may be determined, for example, as L_APC = Σ_{t=1}^{L−n} |h_{t+n} − y_t|, where n is a distance indicating how far a target for prediction is located, h is the output feature sequence serving as the prediction target, and y is an output feature sequence of an LSTM layer (e.g., the unidirectional LSTM layer 404).
- these three losses may be reinterpreted as performing the following functions: (1) DIS loss: reducing a gap between the features of each frame extracted by the encoder 308 and the encoder 302; (2) KLD loss: matching a frame-to-frame relationship between all frames of the encoder 308 and the encoder 302; and (3) APC loss: predicting a future frame by only using the past context.
- the KLD loss may be a combination of three losses: a KLD query loss, a KLD key loss, and a KLD value loss.
- a KLD query/key/value loss may be determined by applying a KLD loss between the corresponding teacher and student matrices, where A is the query/key/value matrix, respectively.
- the weighted sum of the three losses may be represented, for example, as L_KD = α·L_DIS + β·(L_KLD^query + L_KLD^key + L_KLD^value) + γ·L_APC, where L_KD is a knowledge distillation loss, L_DIS is a distance loss, L_KLD^query is a KLD query loss, L_KLD^key is a KLD key loss, L_KLD^value is a KLD value loss, L_APC is an APC loss, and α, β, and γ are weights, as shown in FIG. 4.
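- The sketch below illustrates one way the DIS, KLD (query/key/value), and APC losses and their weighted sum might be computed in PyTorch; the particular distance metrics, the softmax normalization, and the grouping of the weights α, β, and γ are assumptions consistent with, but not dictated by, the description above.

```python
# Illustrative KD-loss sketch; metric choices and normalization are assumptions.
import torch
import torch.nn.functional as F


def dis_loss(h_teacher, h_student):
    # DIS loss: reduce the per-frame feature gap between teacher and student
    # (mean squared distance over the L frames and D feature dimensions).
    return F.mse_loss(h_student, h_teacher)


def kld_loss(a_teacher, a_student):
    # KLD loss: match the frame-to-frame relationship encoded in a
    # query/key/value matrix by comparing softmax-normalized rows.
    log_p_student = F.log_softmax(a_student, dim=-1)
    p_teacher = F.softmax(a_teacher, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")


def apc_loss(h_teacher, y_student, n):
    # APC loss: predict a target frame located n steps in the future from the
    # past-only output y of the student's unidirectional LSTM layer.
    return F.l1_loss(y_student[:-n], h_teacher[n:])


def kd_loss(h_t, h_s, qkv_t, qkv_s, y_s, n, alpha, beta, gamma):
    # Weighted sum of the three losses; the grouping of the weights is an
    # assumption consistent with the description above.
    kld = sum(kld_loss(a_t, a_s) for a_t, a_s in zip(qkv_t, qkv_s))
    return alpha * dis_loss(h_t, h_s) + beta * kld + gamma * apc_loss(h_t, y_s, n)


# Example with random (T=20, D=64) feature sequences and unit weights.
T, D = 20, 64
h_t, h_s, y_s = torch.randn(T, D), torch.randn(T, D), torch.randn(T, D)
qkv_t = [torch.randn(T, T) for _ in range(3)]   # stand-in query/key/value matrices
qkv_s = [torch.randn(T, T) for _ in range(3)]
print(kd_loss(h_t, h_s, qkv_t, qkv_s, y_s, n=4, alpha=1.0, beta=1.0, gamma=1.0))
```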
- FIG. 5 is a conceptual diagram illustrating example transformer attention masks according to the techniques of this disclosure. Using the three losses discussed with respect to FIG. 4 may be particularly helpful for streaming ASR models, where a model cannot access future information when making an inference.
- In the non-streaming attention mask 500 of FIG. 5, the X-axis represents a current frame index and the Y-axis represents a target frame.
- the non-streaming attention mask 500 represents an attention mask of a non-streaming ASR model. Using the non-streaming attention mask 500, during the represented 12 frames, each individual current frame would be able to access all 12 frames, whether the frames were past, present, or future frames (relative to the current frame). Using the “chunk-wise” streaming attention mask 502 of FIG. 5, a given current frame is able to access only the frames within the “chunk.” For example, with a 2-frame chunk, a current frame may access itself and the other frame in its chunk, which may be either the immediately preceding frame or the immediately following frame.
- the system 300 may modify a transformer attention mask.
- the system 400 may apply the non-streaming attention mask for the APC loss (mask 504).
- a current frame may not access the following four frames.
- By applying such a modified transformer attention mask, for example, to the auxiliary non-streaming layers 304 during training, the encoder 302 cannot directly obtain the APC target frame from the self-attention mechanism, thus preventing “cheating” by the encoder 302. For example, the encoder 302 cannot infer the next frame by simply performing a “cut and paste” operation.
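- The sketch below builds boolean versions of the three attention-mask patterns discussed above (the non-streaming mask 500, the chunk-wise streaming mask 502, and the modified non-streaming mask 504 used for the APC loss); the frame count, chunk size, and number of hidden future frames are illustrative values only.

```python
# Illustrative attention-mask construction; True means "frame i may attend to frame j".
import torch

NUM_FRAMES, CHUNK, N_HIDDEN_FUTURE = 12, 2, 4

idx = torch.arange(NUM_FRAMES)
i, j = torch.meshgrid(idx, idx, indexing="ij")   # i = current frame, j = target frame

# Non-streaming mask (cf. mask 500): every current frame may attend to all past,
# present, and future frames.
non_streaming_mask = torch.ones(NUM_FRAMES, NUM_FRAMES, dtype=torch.bool)

# Chunk-wise streaming mask (cf. mask 502): a current frame may attend only to
# the frames within its own chunk.
chunk_mask = (i // CHUNK) == (j // CHUNK)

# Modified non-streaming mask for the APC loss (cf. mask 504): a current frame
# may attend to anything except the next N_HIDDEN_FUTURE frames, so the APC
# target frame cannot simply be copied through self-attention.
apc_mask = ~((j > i) & (j <= i + N_HIDDEN_FUTURE))

print(chunk_mask.int())
print(apc_mask.int())
```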
- FIG. 6 is a chart illustrating test results of streaming ASR model encoders.
- a streaming ASR model 312 having an encoder, e.g., the encoder 302, trained according to the techniques of this disclosure, may exhibit improved ASR performance when compared to otherwise trained streaming ASR models.
- Streaming ASR models were trained using different techniques.
- a baseline model was tested without using KD.
- Another model was tested after being trained using previous techniques having KD performed on output token probabilities.
- a third model was tested after its encoder was trained using a DIS loss as the KD loss.
- a fourth model was tested after its encoder was trained using the DIS and KLD losses to determine the KD loss.
- a fifth model was tested after its encoder was trained using the DIS, KLD, and APC losses to determine the KD loss.
- the displayed metric in FIG. 6 is word error rate (WER), presented as a percentage, where a lower WER is better than a higher WER.
- the dataset used for the testing was LibriSpeech (dev-clean, dev-other subsets).
- the numbers set forth in FIG. 6 are numbers compared using the same setting (e.g., same epochs). As can be seen from FIG. 6, the techniques of this disclosure resulted in a better WER for a streaming ASR model than previous techniques.
- FIG. 7 is a block diagram illustrating an example of a device according to the techniques of this disclosure.
- the device 700 may be a wireless communication device, such as a smartphone.
- the device 700 may have more or fewer components than illustrated in FIG. 7.
- the device 700 may perform one or more operations described with reference to the techniques discussed with respect to FIGS. 1-6.
- the device 700 includes a processor 710, such as a central processing unit (CPU) or a digital signal processor (DSP), coupled to memory 732.
- the memory 732 includes instructions 768 (e.g., executable instructions) such as computer-readable instructions or processor-readable instructions.
- the instructions 768 may include one or more instructions that are executable by a computer, such as the processor 710.
- the memory 732 also includes the streaming ASR model 120 as described with reference to FIG. 1.
- the device 700 may include a display controller 726 that is coupled to the processor 710 and to a display 728.
- a coder/decoder (CODEC) 734 may also be coupled to the processor 710.
- a speaker 736 and a microphone 738 may be coupled to the CODEC 734.
- FIG. 7 also illustrates that a wireless interface 740, such as a wireless controller, and a transceiver 746 may be coupled to the processor 710 and to an antenna 742, such that wireless data received via the antenna 742, the transceiver 746, and the wireless interface 740 may be provided to the processor 710.
- the processor 710, the display controller 726, the memory 732, the CODEC 734, the wireless interface 740, and the transceiver 746 are included in a system-in-package or system-on-chip device 722.
- an input device 730 and a power supply 744 are coupled to the system-on-chip device 722.
- each of the display 728, the input device 730, the speaker 736, the microphone 738, the antenna 742, and the power supply 744 is external to the system-on-chip device 722.
- each of the display 728, the input device 730, the speaker 736, the microphone 738, the antenna 742, and the power supply 744 may be coupled to a component of the system-on-chip device 722, such as an interface or a controller.
- the memory 732 includes or stores the instructions 768 (e.g., executable instructions), such as computer-readable instructions or processor-readable instructions.
- the memory 732 may include or correspond to a non-transitory, computer readable medium storing the instructions 768.
- the instructions 768 may include one or more instructions that are executable by a computer, such as the processor 710.
- the device 700 includes a non-transitory, computer readable medium (e.g., the memory 732) storing instructions (e.g., the instructions 768) that, when executed by one or more processors (e.g., the processor 710), may cause the one or more processors to perform operations including determining one or more words in a speech signal (e.g., input audio data 114) based on one or more transfers of learned knowledge from a non-streaming model (e.g., non-streaming ASR model 220) to a streaming model (e.g., streaming ASR model 210), the streaming model including an on-device, real-time streaming model (e.g., streaming ASR model 120).
- the instructions may also cause the one or more processors to take an action (e.g., action 208) based on the determined one or more words.
- the device 700 may include a wireless telephone, a mobile communication device, a mobile device, a mobile phone, a smartphone, a cellular phone, a laptop computer, a desktop computer, a computer, a tablet computer, a set top box, a personal digital assistant (PDA), a display device, a television, a gaming console, an augmented reality (AR) device, a virtual reality (VR) device, a music player, a radio, a video player, an entertainment unit, a communication device, a fixed location data unit, a personal media player, a digital video player, a digital video disc (DVD) player, a tuner, a camera, a navigation device, a decoder system, an encoder system, a vehicle, a component of a vehicle, or any combination thereof.
- This disclosure may state that one or more words in a speech signal may be determined based on one or more transfers of learned knowledge from a non-streaming model to a streaming model and that the streaming model is an on-device, real-time streaming model.
- Such language is intended to include the following examples.
- Example 1 the transfer of learned knowledge may occur while the streaming model is located on the device (e.g., device 700), for example, while the streaming model is resident in memory 732.
- Example 2 the transfer of learned knowledge may occur prior to the streaming model being installed on a device, where the streaming model is intended to perform on-device, real-time processing, such as ASR.
- streaming ASR model 120 may be trained (e.g., the transfer of learned knowledge from non-streaming ASR model 220 to streaming ASR model 210) while being located on a different device (e.g., in a laboratory, in a cloud computing environment, on a server, etc.) than device 700 before being loaded onto device 700.
- Example 3 the transfer of learned knowledge may partially occur prior to the streaming model being located on the device and partially occur while the streaming model is located on the device.
- the transfer of learned knowledge may occur in one or more transfers (e.g., in a single transfer or a plurality of transfers). While located on device 700, streaming ASR model 120 may be said to be an on-device, real-time streaming model.
- an apparatus or device may include means for storing one or more category labels associated with one or more categories of a natural language processing library.
- the means for storing may include or correspond to the memory 104 of FIG. 1, the memory 732 of FIG. 7, one or more other structures or circuits configured to store one or more category labels associated with one or more categories of a natural language processing library, or any combination thereof.
- the apparatus or device may further include means for processing.
- the means for processing may include means for determining one or more words in a speech signal (e.g., input audio data 114) based on one or more transfers of learned knowledge from a non-streaming model (e.g., non-streaming ASR model 220) to a streaming model (e.g., streaming ASR model 210), the streaming model including an on-device, real-time streaming model (e.g., streaming ASR model 120).
- the means for processing may also include means for taking an action (e.g., action 208) based on the determined one or more words.
- One or more of the disclosed aspects may be implemented in a system or an apparatus, such as the device 700, that may include a communications device, a fixed location data unit, a mobile location data unit, a mobile phone, a cellular phone, a satellite phone, a computer, a tablet, a portable computer, a display device, a media player, or a desktop computer.
- the device 700 may include a set top box, an entertainment unit, a navigation device, a personal digital assistant (PDA), a monitor, a computer monitor, a television, a tuner, a radio, a satellite radio, a music player, a digital music player, a portable music player, a video player, a digital video player, a digital video disc (DVD) player, a portable digital video player, a satellite, a vehicle, a component integrated within a vehicle, any other device that includes a processor or that stores or retrieves data or computer instructions, or a combination thereof.
- the device 700 may include remote units, such as handheld personal communication systems (PCS) units, portable data units such as global positioning system (GPS) enabled devices, meter reading equipment, or any other device that includes a processor or that stores or retrieves data or computer instructions, or any combination thereof.
- remote units such as handheld personal communication systems (PCS) units, portable data units such as global positioning system (GPS) enabled devices, meter reading equipment, or any other device that includes a processor or that stores or retrieves data or computer instructions, or any combination thereof.
- PCS personal communication systems
- GPS global positioning system
- FIG. 7 illustrates a wireless communication device including a processor configured to perform automatic speech recognition
- a processor configured to perform automatic speech recognition may be included in various other electronic devices.
- a processor configured to perform automatic speech recognition as described with reference to FIGS. 1-7 may be included in one or more components of a base station.
- a base station may be part of a wireless communication system.
- the wireless communication system may include multiple base stations and multiple wireless devices.
- the wireless communication system may be a Long Term Evolution (LTE) system, a Code Division Multiple Access (CDMA) system, a Global System for Mobile Communications (GSM) system, a wireless local area network (WLAN) system, or some other wireless system.
- a CDMA system may implement Wideband CDMA (WCDMA), CDMA 1X, Evolution-Data Optimized (EVDO), Time Division Synchronous CDMA (TD-SCDMA), or some other version of CDMA.
- Various functions may be performed by one or more components of the base station, such as sending and receiving messages and data (e.g., audio data).
- the one or more components of the base station may include a processor (e.g., a CPU), a transcoder, a memory, a network connection, a media gateway, a demodulator, a transmission data processor, a receiver data processor, a transmission multiple input-multiple output (MIMO) processor, transmitters and receivers (e.g., transceivers), an array of antennas, or a combination thereof.
- the base station, or one or more of the components of the base station, may include a processor configured to perform automatic speech recognition, as described above with reference to FIGS. 1-7.
- one or more antennas of the base station may receive a data stream from a wireless device.
- a transceiver may receive the data stream from the one or more antennas and may provide the data stream to the demodulator.
- the demodulator may demodulate modulated signals of the data stream and provide demodulated data to the receiver data processor.
- the receiver data processor may extract audio data from the demodulated data and provide the extracted audio data to the processor.
- the processor may provide the audio data to the transcoder for transcoding.
- the decoder of the transcoder may decode the audio data from a first format into decoded audio data and the encoder may encode the decoded audio data into a second format.
- the encoder may encode the audio data using a higher data rate (e.g., upconvert) or a lower data rate (e.g., downconvert) than received from the wireless device.
- the audio data may not be transcoded.
- Transcoding operations (e.g., decoding and encoding) may be performed by one or more components of the base station.
- the processor may provide the audio data to the media gateway for conversion to another transmission protocol, coding scheme, or both.
- the media gateway may provide the converted data to another base station or core network via the network connection.
- FIG. 8 is a flow diagram illustrating example techniques for KD from a non-streaming to a streaming encoder according to one or more aspects of this disclosure.
- Processor 102 may determine one or more words in the speech signal based on one or more transfers of learned knowledge from a non-streaming model to a streaming model (800).
- audio input 112 may include speech.
- Microphone 110 may capture the speech in audio input 112 and convert the audio input 112 into a signal, such as input audio data 114.
- Input audio data 114 may include a speech signal.
- Processor 102 may execute streaming ASR model 120.
- Streaming ASR model 120 may be an example of streaming ASR model 210 and/or streaming ASR model 312.
- Streaming ASR model 210 may be trained by one or more transfers of learned knowledge from non-streaming ASR model 220.
- Streaming ASR model 312 may be trained by one or more transfers of learned knowledge from non-streaming ASR model 310.
- Such training may be utilized by processor 102 executing streaming ASR model 120 to determine the one or more words in the speech signal.
- Streaming ASR model 120 may be an on-device, real-time streaming model.
- streaming ASR model 120 may be trained while resident in memory 104.
- streaming ASR model 120 may be trained prior to streaming ASR model 120 becoming resident in memory 104.
- streaming ASR model 120 may be trained while streaming ASR model 120 is resident on a different device (e.g., trained prior to being stored in memory 104).
- Processor 102 may take an action based on the determined one or more words (802). For example, processor 102 may process the one or more words into text, respond to commands within the one or more words (e.g., turn on the kitchen light in response to determining the one or more words correspond to “turn on the kitchen light”), respond to queries (e.g., retrieve and audibly present a weather report from the Internet in response to determining the one or more words correspond to “what is the current weather”), or the like.
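- As a toy illustration only (the command strings and handlers below are hypothetical and not part of this disclosure), recognized words might be dispatched to an action as follows.

```python
# Hypothetical dispatch mapping recognized words to an action (text output,
# command handling, or a query response); commands and handlers are illustrative.
def take_action(recognized_text: str) -> str:
    text = recognized_text.lower().strip()
    if text == "turn on the kitchen light":
        return "ACTION: switch kitchen light on"               # respond to a command
    if text == "what is the current weather":
        return "ACTION: fetch and read out a weather report"   # respond to a query
    return f"TRANSCRIPT: {recognized_text}"                     # default: speech-to-text output


print(take_action("turn on the kitchen light"))
```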
- the non-streaming model includes a trained non-streaming model and the one or more transfers of learned knowledge from the non-streaming model to the streaming model includes training the streaming model using the non-streaming model.
- the one or more transfers of learned knowledge are based on an encoder (e.g., encoder 121, encoder 202, encoder 214, encoder 302, and/or encoder 308) configured to encode the speech.
- the encoder includes multiple layers. In some examples, the encoder transfers knowledge at selected layers of the multiple layers.
- the one or more transfers of learned knowledge are from an encoder of the non-streaming model (e.g., encoder 214 and/or encoder 308) to an encoder of the streaming model (e.g., encoder 202 and/or encoder 302).
- the streaming model includes a streaming ASR model (e.g., streaming ASR model 120, streaming ASR model 210, and/or streaming ASR model 312) and the non-streaming model includes a non-streaming ASR model (e.g., non-streaming ASR model 220 and/or non-streaming ASR model 310).
- the one or more transfers of learned knowledge are based on a plurality of auxiliary non-streaming layers (e.g., auxiliary non-streaming layers 230 and/or auxiliary non-streaming layers 304) between the streaming model and the non-streaming model.
- the one or more transfers of learned knowledge are based on a modified attention mask (e.g., the mask 504 of FIG. 5) associated with the plurality of auxiliary non-streaming layers.
- the one or more transfers of learned knowledge are based on KD. In some examples, the one or more transfers of learned knowledge are based on a KD loss function. In some examples, the KD loss function includes at least one of a distance loss, a KLD loss, or an autoregressive predictive coding loss. In some examples, the KD loss function includes at least two of a distance loss, a Kullback-Leibler divergence loss, or an autoregressive predictive coding loss. In some examples, the KD loss function includes a weighted sum of the distance loss, the Kullback-Leibler divergence loss, and the autoregressive predictive coding loss.
- the KD loss function includes: L_KD = α·L_DIS + β·(L_KLD^query + L_KLD^key + L_KLD^value) + γ·L_APC, where L_KD is a knowledge distillation loss, L_DIS is a distance loss, L_KLD^query is a KLD query loss, L_KLD^key is a KLD key loss, L_KLD^value is a KLD value loss, L_APC is an APC loss, and α, β, and γ are weights.
- the speech includes an utterance.
- the utterance includes the one or more words.
- At least one of the one or more transfers of learned knowledge occurs prior to the streaming model being located on the device.
- a transfer of learned knowledge may occur when streaming ASR model 120 is located on a server.
- at least one of the one or more transfers of learned knowledge occurs after the streaming model is located on the device.
- a transfer of learned knowledge may occur when streaming ASR model 120 is located on device 700.
- the one or more transfers of learned knowledge may partially occur prior to the streaming model being located on the device and partially occur after the streaming model is located on the device.
- While FIGS. 1-8 may illustrate systems, apparatuses, and/or methods according to the teachings of the disclosure, the disclosure is not limited to these illustrated systems, apparatuses, and/or methods.
- One or more functions or components of any of FIGS. 1-8 as illustrated or described herein may be combined with one or more other portions of another of FIGS. 1-8. Accordingly, no single implementation described herein should be construed as limiting, and implementations of the disclosure may be suitably combined without departing from the teachings of the disclosure.
- one or more operations described with reference to FIGS. 1-7 may be optional, may be performed at least partially concurrently, and/or may be performed in a different order than shown or described.
- Clause 1A A device configured to automatically recognize speech, the device comprising: memory configured to store a speech signal representative of speech and an on-device, real-time streaming model; one or more processors implemented in circuitry coupled to the memory, the one or more processors being configured to: determine one or more words in the speech signal based on transfer of learned knowledge from a non-streaming model to the streaming model; and take an action based on the determined one or more words.
- Clause 2A The device of clause 1A, wherein the transfer of learned knowledge is based on an encoder configured to encode the speech.
- Clause 3A The device of clause 2A, wherein the encoder comprises multiple layers.
- Clause 4A The device of clause 3A, wherein the encoder transfers knowledge at selected layers of the multiple layers.
- Clause 5A The device of any of clauses 2A-4A, wherein the transfer of learned knowledge is from an encoder of the non-streaming model to an encoder of the streaming model.
- Clause 6 A The device of any of clauses 1 A-5A, wherein the streaming model comprises a streaming automatic speech recognition (ASR) model and the non-streaming model comprises a non-streaming ASR model.
- Clause 7 A The device of any of clauses 1A-6A, wherein the transfer of learned knowledge is based on a plurality of auxiliary non-streaming layers between the streaming model and the non-streaming model.
- Clause 8A The device of clause 7A, wherein the transfer of learned knowledge is based on a modified attention mask associated with the plurality of auxiliary non-streaming layers.
- Clause 9A The device of any of clauses 1A-8A, wherein the transfer of learned knowledge is based on knowledge distillation (KD).
- Clause 10 A The device of clause 9 A, wherein the transfer of learned knowledge is based on a KD loss function.
- Clause 11 A The device of clause 10A, wherein the KD loss function comprises at least one of a distance loss, a Kullback-Leibler divergence loss, or an autoregressive predictive coding loss.
- Clause 12A The device of clause 11 A, wherein the KD loss function comprises at least two of a distance loss, a Kullback-Leibler divergence loss, or an autoregressive predictive coding loss.
- Clause 13A The device of clause 12A, wherein the KD loss function comprises a weighted sum of the distance loss, the Kullback-Leibler divergence loss, and the autoregressive predictive coding loss.
- Clause 14A The device of clause 13A, wherein the KD loss function comprises: L_KD = α·L_DIS + β·(L_KLD^query + L_KLD^key + L_KLD^value) + γ·L_APC, where L_KD is a knowledge distillation loss, L_DIS is a distance loss, L_KLD^query is a KLD query loss, L_KLD^key is a KLD key loss, L_KLD^value is a KLD value loss, L_APC is an APC loss, and α, β, and γ are weights.
- Clause 15 A The device of any of clauses 1A-14A, wherein the speech comprises an utterance.
- Clause 16 A The device of clause 15 A, wherein the utterance comprises the one or more words.
- Clause 17 A The device of any of clauses 1A-16A, further comprising one or more microphones configured to capture the speech signal.
- Clause 18 A The device of any of clauses 1A-17A, wherein the transfer of learned knowledge occurs prior to the streaming model being located on the device.
- Clause 19 A The device of any of clauses 1A-17A, wherein the transfer of learned knowledge occurs after the streaming model is located on the device.
- Clause 20A The device of any of clauses 1A-19A, wherein the action comprises processing speech into text, responding to a command, or responding to a query.
- Clause 21A A method comprising: determining one or more words in a speech signal based on transfer of learned knowledge from a non-streaming model to a streaming model, the streaming model being an on-device, real-time streaming model; and taking an action based on the determined one or more words.
- Clause 22 A The method of clause 21 A, wherein the transfer of learned knowledge is based on an encoder configured to encode the speech.
- Clause 23 A The method of clause 22A, wherein the encoder comprises multiple layers.
- Clause 24A The method of clause 23 A, wherein the encoder transfers knowledge at selected layers of the multiple layers.
- Clause 25 A The method of any of clauses 21A-24A, wherein the transfer of learned knowledge is from an encoder of the non-streaming model to an encoder of the streaming model.
- Clause 26A The method of any of clauses 21A-25A, wherein the streaming model comprises a streaming automatic speech recognition (ASR) model and the non-streaming model comprises a non-streaming ASR model.
- Clause 27A The method of any of clauses 21A-26A, wherein the transfer of learned knowledge is based on a plurality of auxiliary non-streaming layers between the streaming model and the non-streaming model.
- Clause 28A The method of clause 27A, wherein the transfer of learned knowledge is based on a modified attention mask associated with the plurality of auxiliary non-streaming layers.
- Clause 29A The method of any of clauses 21A-28A, wherein the transfer of learned knowledge is based on knowledge distillation (KD).
- Clause 30 A The method of clause 29 A, wherein the transfer of learned knowledge is based on a KD loss function.
- Clause 31 A The method of clause 30A, wherein the KD loss function comprises at least one of a distance loss, a Kullback-Leibler divergence loss, or an autoregressive predictive coding loss.
- Clause 32A The method of clause 31 A, wherein the KD loss function comprises at least two of a distance loss, a Kullback-Leibler divergence loss, or an autoregressive predictive coding loss.
- Clause 33A The method of clause 32A, wherein the KD loss function comprises a weighted sum of the distance loss, the Kullback-Leibler divergence loss, and the autoregressive predictive coding loss.
- Clause 34A The method of clause 33A, wherein the KD loss function comprises: L_KD = α·L_DIS + β·(L_KLD^query + L_KLD^key + L_KLD^value) + γ·L_APC, where L_KD is a knowledge distillation loss, L_DIS is a distance loss, L_KLD^query is a KLD query loss, L_KLD^key is a KLD key loss, L_KLD^value is a KLD value loss, L_APC is an APC loss, and α, β, and γ are weights.
- Clause 35A The method of any of clauses 21A-34A, wherein the speech comprises an utterance.
- Clause 36A The method of clause 35A, wherein the utterance comprises the one or more words.
- Clause 37A A method of training a streaming model comprising: transferring learned knowledge from a non-streaming model to an on-device, real-time streaming model.
- Clause 38A A non-transitory computer-readable storage medium having stored thereon instructions that, when executed, cause one or more processors to: determine one or more words in a speech signal based on transfer of learned knowledge from a non-streaming model to a streaming model, the streaming model comprising an on-device, real-time streaming model; and take an action based on the determined one or more words.
- Clause 39A A device comprising: means for determining one or more words in a speech signal based on transfer of learned knowledge from a non-streaming model to a streaming model, the streaming model comprising an on-device, real-time streaming model; and means for taking an action based on the determined one or more words.
- Clause 1B A device configured to automatically recognize speech, the device comprising: memory configured to store a speech signal representative of speech and a streaming model, the streaming model comprising an on-device, real-time streaming model; one or more processors implemented in circuitry coupled to the memory, the one or more processors being configured to: determine one or more words in the speech signal based on one or more transfers of learned knowledge from a non-streaming model to the streaming model; and take an action based on the determined one or more words.
- Clause 2B The device of clause 1B, wherein the non-streaming model comprises a trained non-streaming model and wherein the one or more transfers of learned knowledge from the non-streaming model to the streaming model comprise training the streaming model using the non-streaming model.
- Clause 3B The device of clause 1B or clause 2B, wherein the one or more transfers of learned knowledge are based on an encoder configured to encode the speech.
- Clause 4B The device of clause 3B, wherein the encoder comprises multiple layers.
- Clause 5B The device of clause 4B, wherein the encoder transfers knowledge at selected layers of the multiple layers.
- Clause 6B The device of any of clauses 3B-5B, wherein the one or more transfers of learned knowledge are from an encoder of the non-streaming model to an encoder of the streaming model.
- Clause 7B The device of any of clauses 1B-6B, wherein the streaming model comprises a streaming automatic speech recognition (ASR) model and the non-streaming model comprises a non-streaming ASR model.
- Clause 8B The device of any of clauses 1B-7B, wherein the one or more transfers of learned knowledge are based on a plurality of auxiliary non-streaming layers between the streaming model and the non-streaming model.
- Clause 9B The device of clause 8B, wherein the one or more transfers of learned knowledge are based on a modified attention mask associated with the plurality of auxiliary non-streaming layers.
- Clause 10B The device of any of clauses 1B-9B, wherein the one or more transfers of learned knowledge are based on a KD loss function.
- Clause 11B The device of clause 10B, wherein the KD loss function comprises at least one of a distance loss, a Kullback-Leibler divergence loss, or an autoregressive predictive coding loss.
- Clause 12B The device of clause 11B, wherein the KD loss function comprises at least two of a distance loss, a Kullback-Leibler divergence loss, or an autoregressive predictive coding loss.
- Clause 13B The device of clause 12B, wherein the KD loss function comprises a weighted sum of the distance loss, the Kullback-Leibler divergence loss, and the autoregressive predictive coding loss.
- Clause 14B The device of clause 13B, wherein the KD loss function comprises: L_KD = α·L_DIS + β·(L_KLD^query + L_KLD^key + L_KLD^value) + γ·L_APC, where L_KD is a knowledge distillation loss, L_DIS is a distance loss, L_KLD^query is a Kullback-Leibler divergence (KLD) query loss, L_KLD^key is a KLD key loss, L_KLD^value is a KLD value loss, L_APC is an autoregressive predictive coding (APC) loss, and α, β, and γ are weights.
- Clause 15B The device of any of clauses 1B-14B, wherein the speech comprises an utterance comprising the one or more words.
- Clause 16B The device of any of clauses 1B-15B, further comprising one or more microphones configured to capture the speech signal.
- Clause 17B The device of any of clauses 1B-16B, wherein at least one of the one or more transfers of learned knowledge occurs prior to the streaming model being located on the device.
- Clause 18B The device of any of clauses 1B-16B, wherein at least one of the one or more transfers of learned knowledge occurs after the streaming model is located on the device.
- Clause 19B The device of any of clauses 1B-18B, wherein the action comprises at least one of processing speech into text, responding to a command, or responding to a query.
- Clause 20B A method comprising: determining one or more words in a speech signal based on one or more transfers of learned knowledge from a non-streaming model to a streaming model, the streaming model comprising an on-device, real-time streaming model; and taking an action based on the determined one or more words.
- Clause 21B The method of clause 20B, wherein the non-streaming model comprises a trained non-streaming model and wherein the one or more transfers of learned knowledge from the non-streaming model to the streaming model comprise training the streaming model using the non-streaming model.
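- Clause 21B frames the transfer of learned knowledge as training the streaming model using the already trained non-streaming model. A heavily simplified, hypothetical training step under that reading is sketched below: the streaming student is optimized with its own task loss plus a weighted distillation term computed against the frozen non-streaming teacher. The model interfaces (the encode method and the (audio, targets) call signature), the stand-in L1 distance, and kd_weight are all assumptions for illustration only.

```python
import torch

def distillation_step(streaming_model, nonstreaming_model, batch, optimizer, kd_weight=0.5):
    """One hypothetical student-update step combining a task loss with a KD term."""
    with torch.no_grad():
        # Assumed teacher API: returns encoder features for the utterance.
        teacher_feats = nonstreaming_model.encode(batch["audio"])
    # Assumed student API: returns encoder features and a task (e.g., ASR) loss.
    student_feats, task_loss = streaming_model(batch["audio"], batch["targets"])
    kd_term = torch.nn.functional.l1_loss(student_feats, teacher_feats)  # stand-in distance
    loss = task_loss + kd_weight * kd_term
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```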
- Clause 22B The method of clause 20B or clause 21B, wherein the one or more transfers of learned knowledge are based on an encoder configured to encode the speech.
- Clause 23B The method of clause 22B, wherein the encoder comprises multiple layers.
- Clause 24B The method of clause 23B, wherein the encoder transfers knowledge at selected layers of the multiple layers.
- Clause 25B The method of any of clauses 22B-24B, wherein the one or more transfers of learned knowledge are from an encoder of the non-streaming model to an encoder of the streaming model.
- Clause 26B The method of any of clauses 20B-25B, wherein the streaming model comprises a streaming automatic speech recognition (ASR) model and the non-streaming model comprises a non-streaming ASR model.
- Clause 27B The method of any of clauses 20B-26B, wherein the one or more transfers of learned knowledge are based on a plurality of auxiliary non-streaming layers between the streaming model and the non-streaming model.
- Clause 28B The method of any of clauses 20B-27B, wherein the action comprises at least one of processing speech into text, responding to a command, or responding to a query.
- A non-transitory computer-readable storage medium having stored thereon instructions that, when executed, cause one or more processors to: determine one or more words in a speech signal based on one or more transfers of learned knowledge from a non-streaming model to a streaming model, the streaming model comprising an on-device, real-time streaming model; and take an action based on the determined one or more words.
- A device comprising: means for determining one or more words in a speech signal based on one or more transfers of learned knowledge from a non-streaming model to a streaming model, the streaming model comprising an on-device, real-time streaming model; and means for taking an action based on the determined one or more words.
- Computer-readable media may include computer-readable storage media, which corresponds to a tangible medium such as data storage media, or communication media including any medium that facilitates transfer of a computer program from one place to another, e.g., according to a communication protocol.
- Computer-readable media generally may correspond to (1) tangible computer-readable storage media which is non-transitory or (2) a communication medium such as a signal or carrier wave.
- Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementation of the techniques described in this disclosure.
- A computer program product may include a computer-readable medium.
- Such computer-readable storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage, or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium.
- For example, if instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium.
- Computer-readable storage media and data storage media do not, however, include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory, tangible storage media.
- Disk and disc include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
- Instructions may be executed by one or more processors, such as one or more digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuitry.
- The term "processor," as used herein, may refer to any of the foregoing structure or any other structure suitable for implementation of the techniques described herein.
- The functionality described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated in a combined codec. Also, the techniques could be fully implemented in one or more circuits or logic elements.
- The techniques of this disclosure may be implemented in a wide variety of devices or apparatuses, including a wireless handset, an integrated circuit (IC), or a set of ICs (e.g., a chip set).
- Various components, modules, or units are described in this disclosure to emphasize functional aspects of devices configured to perform the disclosed techniques, but do not necessarily require realization by different hardware units. Rather, as described above, various units may be combined in a codec hardware unit or provided by a collection of interoperative hardware units, including one or more processors as described above, in conjunction with suitable software and/or firmware.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Machine Translation (AREA)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202480014101.6A CN120752696A (en) | 2023-02-28 | 2024-02-16 | Knowledge Distillation from Non-Streaming Encoder to Streaming Encoder |
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202363487449P | 2023-02-28 | 2023-02-28 | |
| US63/487,449 | 2023-02-28 | ||
| US18/355,055 US20240290332A1 (en) | 2023-02-28 | 2023-07-19 | Knowledge distillation from non-streaming to streaming encoder |
| US18/355,055 | 2023-07-19 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024182135A1 true WO2024182135A1 (en) | 2024-09-06 |
Family
ID=90361718
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2024/016101 Ceased WO2024182135A1 (en) | 2023-02-28 | 2024-02-16 | Knowledge distillation from non-streaming to streaming encoder |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2024182135A1 (en) |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20220343894A1 (en) * | 2021-04-23 | 2022-10-27 | Google Llc | Streaming Automatic Speech Recognition With Non-Streaming Model Distillation |
- 2024-02-16: WO PCT/US2024/016101 patent/WO2024182135A1/en, not_active (Ceased)
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20220343894A1 (en) * | 2021-04-23 | 2022-10-27 | Google Llc | Streaming Automatic Speech Recognition With Non-Streaming Model Distillation |
Non-Patent Citations (5)
| Title |
|---|
| CHANG HENG-JUI ET AL: "Distilhubert: Speech Representation Learning by Layer-Wise Distillation of Hidden-Unit Bert", ICASSP 2022 - 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 23 May 2022 (2022-05-23), pages 7087 - 7091, XP034157217, DOI: 10.1109/ICASSP43922.2022.9747490 * |
| GOU JIANPING ET AL: "Knowledge Distillation: A Survey", INTERNATIONAL JOURNAL OF COMPUTER VISION, vol. 129, no. 6, 30 June 2020 (2020-06-30), New York, pages 1789 - 1819, XP093103205, ISSN: 0920-5691, Retrieved from the Internet <URL:https://arxiv.org/pdf/2006.05525v3.pdf> DOI: 10.1007/s11263-021-01453-z * |
| JINYU LI: "Recent Advances in End-to-End Automatic Speech Recognition", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 2 November 2021 (2021-11-02), XP091093401 * |
| KURATA GAKUTO ET AL: "Knowledge Distillation from Offline to Streaming RNN Transducer for End-to-End Speech Recognition", INTERSPEECH 2020, 25 October 2020 (2020-10-25), ISCA, pages 2117 - 2121, XP093000384, Retrieved from the Internet <URL:http://www.interspeech2020.org/uploadfile/pdf/Wed-1-5-3.pdf> DOI: 10.21437/Interspeech.2020-2442 * |
| KYUHONG SHIM ET AL: "Knowledge Distillation from Non-streaming to Streaming ASR Encoder using Auxiliary Non-streaming Layer", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 31 August 2023 (2023-08-31), XP091601518 * |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN111883110B (en) | Acoustic model training method, system, equipment and medium for speech recognition | |
| US10380996B2 (en) | Method and apparatus for correcting speech recognition result, device and computer-readable storage medium | |
| US20240028841A1 (en) | Speech translation method, device, and storage medium | |
| CN111402861B (en) | Voice recognition method, device, equipment and storage medium | |
| CN107657017B (en) | Method and apparatus for providing voice service | |
| US8930187B2 (en) | Methods, apparatuses and computer program products for implementing automatic speech recognition and sentiment detection on a device | |
| WO2021174757A1 (en) | Method and apparatus for recognizing emotion in voice, electronic device and computer-readable storage medium | |
| CN112530408A (en) | Method, apparatus, electronic device, and medium for recognizing speech | |
| CN112883967B (en) | Image character recognition method, device, medium and electronic equipment | |
| CN113327595B (en) | Pronunciation deviation detection method and device and storage medium | |
| CN108986790A (en) | The method and apparatus of voice recognition of contact | |
| US20230298567A1 (en) | Speech synthesis and speech recognition | |
| WO2021169825A1 (en) | Speech synthesis method and apparatus, device and storage medium | |
| US20250006199A1 (en) | Phone recognition method and apparatus, electronic device and storage medium | |
| US20250232762A1 (en) | Adaptive visual speech recognition | |
| CN114330371A (en) | Session intention identification method and device based on prompt learning and electronic equipment | |
| CN112509562A (en) | Method, apparatus, electronic device and medium for text post-processing | |
| JP7348447B2 (en) | Speaker diarization correction method and system utilizing text-based speaker change detection | |
| CN114694637A (en) | Hybrid speech recognition method, device, electronic device and storage medium | |
| CN110647613A (en) | Courseware construction method, courseware construction device, courseware construction server and storage medium | |
| CN118471191A (en) | Audio generation method, model training method, device, equipment and storage medium | |
| CN113793598A (en) | Training method and data enhancement method, device and equipment for speech processing model | |
| US20240290332A1 (en) | Knowledge distillation from non-streaming to streaming encoder | |
| WO2024182135A1 (en) | Knowledge distillation from non-streaming to streaming encoder | |
| CN113112993B (en) | Audio information processing method, device, electronic equipment and storage medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24710003; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 202547064029; Country of ref document: IN |
| | WWP | Wipo information: published in national office | Ref document number: 202547064029; Country of ref document: IN |
| | WWE | Wipo information: entry into national phase | Ref document number: 202480014101.6; Country of ref document: CN |
| | WWE | Wipo information: entry into national phase | Ref document number: 2024710003; Country of ref document: EP |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWP | Wipo information: published in national office | Ref document number: 202480014101.6; Country of ref document: CN |
| | ENP | Entry into the national phase | Ref document number: 2024710003; Country of ref document: EP; Effective date: 20250929 |