GB2626038A - Systems and Methods for Speech Processing - Google Patents
Systems and Methods for Speech Processing
- Publication number
- GB2626038A (application GB2300276.9A / GB202300276A)
- Authority
- GB
- United Kingdom
- Prior art keywords
- intermediate layer
- transcription
- utterance
- token
- decoder
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- Computing Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Acoustics & Sound (AREA)
- Human Computer Interaction (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Multimedia (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Machine Translation (AREA)
- Storage Device Security (AREA)
Abstract
An audio signal 502 is received capturing a user’s speech and processed using a non-autoregressive encoder neural network (NN) 510 to generate an embedding 506 and initial transcription 508 comprising one or more tokens each associated with respective confidence scores of being the correct token for that position in the transcription. The initial transcription is modified to generate a masked token sequence by masking 514 tokens whose confidence score is below a threshold. The masked token sequence 516 and embedding are then processed by a non-autoregressive bi-directional decoder NN 522 to generate an output transcription of the utterance 526, a label classifying the utterance 528, and labels 513 indicating a parameter type associated with respective words in the transcription. Also disclosed is a method of training the combined encoder-decoder system, where embeddings and conditional probability distributions of candidate transcriptions are used together with ground-truth transcriptions to compute an encoder loss value to be minimised by adjusting trainable parameters of the encoder NN, and where the decoder NN is trained based on its output probability distributions for candidate tokens, utterance classification and parameter types, and corresponding ground-truth values.
Description
Systems and Methods for Speech Processing
FIELD
Embodiments described herein relate to systems and methods for speech processing, in particular, for automatic speech recognition and spoken language understanding.
BACKGROUND
Speech recognition methods and systems receive speech audio and recognise the content of such speech audio, e.g. the textual content of such speech audio. Speech recognition systems include hybrid systems, and may include an acoustic model (AM), pronunciation lexicon and language model (LM) to determine the content of speech audio, e.g. decode speech. Earlier hybrid systems utilized Hidden Markov Models (HMMs) or similar statistical methods for the acoustic model and/or the language model. Later hybrid systems utilize neural networks for at least one of the acoustic model and/or the language model. These systems may be referred to as deep speech recognition systems.
Speech recognition systems with end-to-end architectures have also been introduced. In these systems, a single neural network is used into which the acoustic model, pronunciation lexicon and language model can be considered to be implicitly integrated. The single neural network may be a recurrent neural network. More recently, transformer models have been used for speech recognition systems. Transformer models may perform speech recognition using a self-attention mechanism by which the dependencies are captured regardless of their distance. Transformer models may employ an encoder-decoder framework.
BRIEF DESCRIPTION OF FIGURES
Systems and methods in accordance with non-limiting examples will now be described with reference to the accompanying figures in which: Fig. 1A is an illustration of a voice assistant system in accordance with example embodiments; Fig. 1B is an illustration of a speech transcription system in accordance with example embodiments; Fig. 1C is a flow diagram of a method for performing voice assistance in accordance with example embodiments; Fig. 1D is a flow diagram of a method for performing speech transcription in accordance with example embodiments; Fig. 2 is a flow diagram of a method for speech processing in accordance with example embodiments; Fig. 3 is an illustration of exemplary speech processing output; Fig. 4 is an illustration of an iterative refinement process for generating a transcription; Fig. 5 is a schematic illustration of a speech processing system in accordance with example embodiments; Fig. 6 is a schematic illustration of an exemplary decoder block; Fig. 7 is a schematic illustration of an exemplary encoder block; Fig. 8 is a schematic illustration of an encoder neural network in accordance with example embodiments; Fig. 9 is a schematic illustration of processing at an intermediate layer of an encoder neural network; Fig. 10 is a flow diagram of a method for training a speech processing system in accordance with example embodiments; Fig. 11 is a flow diagram of processing carried out at an intermediate layer of an encoder neural network during training; Fig. 12 is a schematic illustration of hardware for implementing methods and systems in accordance with example embodiments.
Like reference numerals refer to like elements in the Figures.
DETAILED DESCRIPTION
According to an aspect, there is provided a computer-implemented method for speech processing. The method comprises receiving an audio signal capturing an utterance spoken by a user. The method further comprises processing, by an encoder comprising a non-autoregressive encoder neural network, the audio signal to generate an embedding of the audio signal and an initial transcription of the utterance, wherein the initial transcription comprises one or more tokens and each of the one or more tokens is associated with a confidence score of being the correct token for that position in the transcription. The method further comprises modifying the initial transcription to generate a masked token sequence, wherein modifying the initial transcription comprises masking the one or more tokens in the initial transcription having a confidence score below a threshold confidence level. The method further comprises processing, by an output decoder comprising a non-autoregressive, bi-directional decoder neural network, the embedding of the audio signal and the masked token sequence to generate a speech processing output comprising: a predicted token for each of the masked tokens in the masked token sequence to generate an output transcription of the utterance; a label indicating a classification of the utterance; and one or more labels indicating a parameter type associated with respective words in the output transcription, the parameter types associated with the classification of the utterance.
The disclosed method jointly performs automatic speech recognition and spoken language understanding on an input audio signal (or acoustic features of the audio signal). In particular, the predicted tokens and the output transcription elements of the speech processing output relate to automatic speech recognition whilst the label indicating a classification of the utterance and one or more labels indicating a parameter type relate to spoken language understanding. For example, the label indicating a classification of the utterance may correspond to an action that is to be performed in response to the utterance by an appropriate device or system. This type of speech processing output is sometimes known as "intent classification". With respect to the parameter type labels, an action may require determining certain parameter values in order to fully form and execute the action. The words (or sub-word units) in the output transcription may be labelled with the corresponding parameter types for filling in the required parameters. This type of speech processing output is sometimes known as "slot filling". The method may then further comprise causing the device or system to carry out the action. For example, by transmitting an appropriate command.
The method uses non-autoregressive neural networks. A non-autoregressive neural network produces all its outputs concurrently in one time step compared to an autoregressive neural network which generates its output over multiple time steps, typically one element at a time with the output generated so far fed-back as input. Non-autoregressive approaches typically generate their outputs faster than autoregressive approaches and may be used in computational systems with more limited computational resources such as in mobile devices or embedded systems or where low-latency is desired. For example, inference speed of Spoken Language Understanding systems may be measured using real-time factor (RTF), a ratio of the execution time to the length of the input utterance. Using a non-autoregressive system according to the methods described herein can achieve a reduction in RTF by a factor of six compared to an equivalent autoregressive baseline.
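By way of illustration only, a minimal sketch of how the RTF metric might be measured is given below; `process_fn`, `audio` and `audio_duration_s` are placeholder names introduced here and are not taken from the patent.

```python
import time

def real_time_factor(process_fn, audio, audio_duration_s):
    """RTF = execution time / length of the input utterance.
    `process_fn` and `audio` stand in for an SLU system and its input."""
    start = time.perf_counter()
    process_fn(audio)
    elapsed = time.perf_counter() - start
    return elapsed / audio_duration_s

# An RTF of 0.05 means the system runs 20x faster than real time; the claim
# above is that the non-autoregressive approach reduces RTF roughly six-fold
# relative to an equivalent autoregressive baseline.
```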
The decoder neural network is bi-directional. That is, when processing a particular element of an input sequence, a bi-directional decoder neural network processes the particular element based on both the elements that precede the particular element in the input sequence and the subsequent elements that follow the particular element in the sequence. By comparison, a uni-directional decoder neural network typically only considers the elements preceding the particular element. In this way, the decoder neural network can generate an output by considering the entirety of the input. This is particularly suitable as the tasks performed by the decoder, prediction of masked tokens and generating labels for spoken language understanding, benefit from being able to condition on the entirety of the input to the decoder. The encoder neural network may also be bi-directional.
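The distinction can be illustrated with a self-attention layer: a uni-directional decoder applies a causal mask so each position only attends to earlier positions, whereas a bi-directional decoder omits the mask. The PyTorch sketch below is an illustrative assumption of a Transformer-style attention layer, not the specific architecture of the patent.

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Uni-directional (autoregressive-style) masking: position i may only
    # attend to positions <= i, so True marks disallowed attention.
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

seq_len = 5
attn = torch.nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
x = torch.randn(1, seq_len, 32)

# Uni-directional: future positions are hidden from each query position.
uni_out, _ = attn(x, x, x, attn_mask=causal_mask(seq_len))

# Bi-directional (as in the decoder described here): no mask, so every
# position conditions on the full sequence, left and right context alike.
bi_out, _ = attn(x, x, x)
```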
In addition, the split of tasks performed by the encoder and decoder further facilitates the joint performance of automatic speech recognition (ASR) and spoken language understanding (SLU). In particular, the encoder is used to provide an initial ASR transcription and any low confidence tokens in the transcription are masked. The decoder performs the tasks of predicting the low-confidence masked tokens from the available high-confidence tokens as well as the SLU tasks of intent classification and slot filling which are more directly related tasks than pure ASR and SLU are. Thus, the decoder is able to operate more effectively as it is performing more relatable tasks and can learn more effective joint features for performing the tasks.
The embedding of the audio signal generated by the encoder neural network provides a latent representation of the audio signal containing salient information for use in generating the initial transcription and for the decoder in generating the speech processing output. The embedding of the audio signal may be a vector or tensor representation of the audio signal.
The tokens may correspond to words or sub-word units, such as word-pieces, phonemes, characters or other suitable unit as deemed appropriate by a person skilled in the art. For example, the set of tokens, or vocabulary, may be generated by applying Byte Pair Encoding (BPE) to a suitable corpus such as a training data set.
The initial transcription of the utterance may be generated based upon a Connectionist Temporal Classification (CTC) algorithm. A greedy search may be used to determine the most likely token at each position in the sequence independently. Alternatively, a beam search may be used.
The decoder neural network of the output decoder may comprise a first output head for generating the predicted token for each of the masked tokens in the masked token sequence; a second output head for generating the one or more labels indicating a parameter type associated with the respective words in the output transcription; and wherein the first output head and second output heads are distinct output heads. In this way, one part of the decoder neural network can become specialized in generating the masked token predictions and another part of the decoder neural network can become specialized in generating the utterance-level label and parameter type labels. In addition, two separate output sequences, one from each output head, may be generated. This may be beneficial as the length of each output sequence is reduced as traditionally, encoder/decoder architectures have difficulty with long sequences and modelling long-term dependencies.
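A minimal sketch of this arrangement is given below, assuming a Transformer-style decoder in PyTorch. The class name, dimensions and the split of the second head into separate intent and slot projections are illustrative choices for clarity, not details taken from the patent.

```python
import torch
import torch.nn as nn

class DualHeadDecoder(nn.Module):
    """Illustrative non-autoregressive, bi-directional decoder with two
    distinct output heads: one for masked-token prediction, one for the
    utterance-level label and per-word parameter-type (slot) labels."""

    def __init__(self, d_model=256, vocab_size=500, n_intents=20, n_slot_labels=40):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        # No target mask is applied, so self-attention is bi-directional.
        self.blocks = nn.TransformerDecoder(layer, num_layers=4)
        # First head: predicts a token for each (masked) position.
        self.token_head = nn.Linear(d_model, vocab_size)
        # Second head: utterance-level intent read from the <CLS> position,
        # and a slot label for every remaining position.
        self.intent_head = nn.Linear(d_model, n_intents)
        self.slot_head = nn.Linear(d_model, n_slot_labels)

    def forward(self, masked_tokens_emb, audio_embedding):
        h = self.blocks(tgt=masked_tokens_emb, memory=audio_embedding)
        token_logits = self.token_head(h)           # (B, T, vocab)
        intent_logits = self.intent_head(h[:, 0])   # utterance-level label
        slot_logits = self.slot_head(h[:, 1:])      # per-word labels
        return token_logits, intent_logits, slot_logits
```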
Generating an output transcription of the utterance may comprise iteratively refining the token predictions for the masked tokens in the masked token sequence. The iterative refinement may comprise generating a predicted token for each of the masked tokens in the masked token sequence; wherein each predicted token is associated with a confidence score of being the correct token; and for each predicted token, replacing the corresponding masked token in the masked token sequence with the predicted token when the confidence score is above a threshold confidence level, otherwise maintaining the masked token in the masked token sequence. That is, on subsequent iterations, the new masked token sequence may be processed again by the decoder to generate a set of predictions for the remaining masked tokens until either no masked tokens remain or a fixed number of iterations have been performed. The iterative refinement enables predictions to be conditioned on earlier higher confidence tokens.
As the higher confidence tokens are progressively filled-in, it becomes easier to make predictions for the previously lower confidence token positions.
Processing by the encoder may comprise, at one or more intermediate layers of the encoder neural network: generating an intermediate layer embedding of the audio signal and an intermediate layer transcription of the utterance, the intermediate layer transcription comprising one or more tokens and each of the one or more tokens is associated with a confidence score of being the correct token for that position in the intermediate layer transcription; modifying the intermediate layer transcription to generate an intermediate layer masked token sequence, wherein modifying the intermediate layer transcription comprises masking the one or more tokens in the intermediate layer transcription having a confidence score below an intermediate layer threshold confidence level; processing, by an intermediate layer decoder comprising a non-autoregressive, bi-directional decoder neural network, the intermediate layer embedding of the audio signal and the intermediate layer masked token sequence to generate an intermediate layer decoding output; combining the intermediate layer embedding of the audio signal and the intermediate layer decoding output to provide an input for one or more subsequent neural network layers of the encoder neural network following the respective intermediate layer of the encoder neural network; and processing, by the one or more subsequent neural network layers, the input to generate the embedding of the audio signal and the initial transcription of the utterance.
That is, the encoder neural network may comprise an intermediate layer decoder which may be configured to carry out the same tasks as the output decoder (and may have the same architecture and may share parameters). The output of the intermediate layer decoder is provided to the subsequent layers of the encoder neural network for processing and generating the embedding of the audio signal. In this way, the encoder can generate a representation of the audio signal that is more useful for the processing carried out by the output decoder. In addition, certain algorithms for generating transcriptions such as the CTC algorithm have strong conditional independence assumptions. That is, each token in the transcription is assumed to be conditionally independent. By using an intermediate layer decoder and processing the intermediate layer decoding output in the subsequent layers of the encoder neural network, the conditional independence assumption may be mitigated when generating the initial transcription.
The encoder neural network may comprise a plurality of intermediate layer decoders at different intermediate layers of the encoder neural network.
The intermediate layer decoding output may comprise a masked token probability distribution comprising a probability distribution over candidate tokens for predicting a token for each of the masked tokens in the intermediate layer masked token sequence; an utterance classification probability distribution comprising a probability distribution over labels indicating a classification of the utterance; and a parameter type probability distribution comprising a probability distribution over labels indicating a parameter type for the respective words in the intermediate layer transcription.
Combining the intermediate layer embedding of the audio signal and the intermediate layer decoding output may comprise: combining the masked token probability distribution, the utterance classification probability distribution and the parameter type probability distribution; performing a linear projection of the combined probability distributions such that the dimensionality matches the intermediate layer embedding of the audio signal; and summing the projected combined probability distribution and the intermediate layer embedding of the audio signal. The masked token probability distribution, the utterance classification probability distribution and the parameter type probability distribution may be combined based upon concatenation or other appropriate combination operation.
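The combination step might look like the following sketch (PyTorch). The dimensions, and the assumption that all three distributions have already been mapped to per-segment positions as described next, are illustrative rather than prescribed by the patent.

```python
import torch
import torch.nn as nn

# Illustrative shapes: T audio segments, d_model encoder width.
d_model, vocab, n_intents, n_slots, T = 256, 500, 20, 40, 120

project = nn.Linear(vocab + n_intents + n_slots, d_model)

def combine(intermediate_emb, token_probs, intent_probs, slot_probs):
    # 1. Concatenate the three probability distributions at each position.
    combined = torch.cat([token_probs, intent_probs, slot_probs], dim=-1)
    # 2. Linearly project so the dimensionality matches the embedding.
    projected = project(combined)
    # 3. Sum with the intermediate layer embedding to form the input to the
    #    subsequent encoder layers.
    return intermediate_emb + projected

emb = combine(torch.randn(1, T, d_model),
              torch.softmax(torch.randn(1, T, vocab), -1),
              torch.softmax(torch.randn(1, T, n_intents), -1),
              torch.softmax(torch.randn(1, T, n_slots), -1))
```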
The audio signal may comprise a plurality of audio segments. Generating an intermediate layer transcription of the utterance may comprise generating a probability distribution over candidate tokens for transcribing each audio segment and generating the intermediate layer transcription based upon the probability distributions for each audio segment. Combining the intermediate layer embedding of the audio signal and the intermediate layer decoding output may comprise: determining an audio segment position corresponding to each token in the intermediate layer transcription; mapping the masked token probability distribution and the parameter type probability distribution to audio segment positions based upon the corresponding tokens in the intermediate layer transcription; and associating the utterance classification probability distribution with each of the audio segment positions in the mapping. Combining the masked token probability distribution, the utterance classification probability distribution and parameter type probability distribution may comprise combining the probability distribution over candidate tokens, the masked token probability distribution, the utterance classification probability distribution and the parameter type probability distribution having corresponding audio segment positions according to the mapping.
The method may further comprise, adding a pre-fix token to the start of the masked token sequence prior to processing by the output decoder for generating the label indicating a classification of the utterance. Where the encoder neural network comprises an intermediate layer decoder, the method may further comprise adding a pre-fix token to the start of the intermediate layer masked token sequence prior to processing by the intermediate layer decoder for generating utterance classification probability distribution. The pre-fix token added to the start of the masked token sequence may facilitate generating a label indicating a classification of the utterance. The decoder may use the first element position to carry the utterance-level information and the first element in the output sequence may correspond to the utterance-level label. The pre-fix token may be denoted with "<CLS>".
The decoder neural network of the output decoder and/or intermediate layer decoder may be based upon a Transformer architecture. In this regard, the decoder neural network(s) may comprise a plurality of stacked Transformer decoder blocks.
The encoder neural network may be based upon a Conformer architecture and may comprise a plurality of stacked Conformer blocks.
The encoder may comprise an audio-processing front-end. The audio processing front-end may be configured to extract acoustic features from the audio signal. The audio processing front-end may comprise a convolutional neural network. Alternatively, the audio signal may be pre-processed externally and the received audio signal may comprise acoustic features.
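For instance, a log-mel filterbank front-end could be sketched as below using torchaudio; the sample rate, 80 mel bins and file name are assumptions made here for illustration, and the patent equally allows a convolutional front-end or external pre-processing.

```python
import torch
import torchaudio

# Hypothetical input file; in practice the signal would come from a microphone.
waveform, sample_rate = torchaudio.load("utterance.wav")
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=400, hop_length=160, n_mels=80)
features = torch.log(mel(waveform) + 1e-6)       # (channels, n_mels, frames)
features = features.squeeze(0).transpose(0, 1)   # (frames, n_mels) acoustic feature vectors
```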
According to another aspect, there is provided a computer-implemented method of training a speech processing system. The speech processing system comprises an encoder comprising a non-autoregressive encoder neural network having a plurality of trainable parameters; and an output decoder comprising a non-autoregressive, bi-directional decoder neural network having a plurality of trainable parameters. The method comprises receiving an audio signal capturing an utterance spoken by a user for training the speech processing system. The method further comprises processing, by the encoder, the audio signal to generate an embedding of the audio signal and a conditional probability distribution over candidate transcriptions of the utterance given the input audio signal. The method further comprises receiving a ground-truth transcription of the utterance, wherein the ground-truth transcription comprises a sequence of one or more tokens. The method further comprises computing an encoder loss value based upon the conditional probability distribution over candidate transcriptions of the utterance and the ground-truth transcription of the utterance. The method further comprises modifying the ground-truth transcription to generate a masked token sequence, wherein modifying the ground-truth transcription comprises masking one or more tokens in the ground-truth transcription. The method further comprises processing, by the output decoder, the embedding of the audio signal and the masked token sequence to generate a speech processing output comprising: a probability distribution over candidate tokens for predicting a token for each of the masked tokens in the masked token sequence; a probability distribution over labels indicating a classification of the utterance; and a probability distribution over labels indicating a parameter type associated with respective words in an output transcription, the parameter types associated with the classification of the utterance. The method further comprises receiving ground-truth labels indicating the classification of the utterance and the parameter types respectively. The method further comprises computing a decoder loss value based upon the speech processing output and the ground-truth transcription and labels. The method further comprises adjusting the values of the plurality of trainable parameters of the decoder neural network of the output decoder and encoder neural network based upon the decoder loss value. The method further comprises adjusting the values of the plurality of trainable parameters of the encoder neural network based upon the encoder loss value.
In the training method, the ground-truth transcripts are used for generating the masked token sequence provided as input to the decoder. This is to avoid the decoder learning any erroneous relationships due to the errors in the initial transcription generated by the encoder, as would be expected during the early phases of training. The decoder is still provided with the embedding of the audio signal from the encoder as input.
Adjusting the values of the trainable parameters of the encoder/decoder neural networks may be based upon backpropagation and stochastic gradient descent methods. The training process may be repeated for a number of iterations and/or until a stopping criterion is reached.
Processing by the encoder may comprise, at one or more intermediate layers of the encoder neural network: generating an intermediate layer embedding of the audio signal and an intermediate layer conditional probability distribution over candidate transcriptions of the utterance given the output of the layer preceding the intermediate layer. The method may further comprise computing an intermediate layer loss value based upon the intermediate layer conditional probability distribution and the ground-truth transcription; and wherein the encoder loss value is further based upon the intermediate layer loss value. The method may further comprise generating an intermediate layer transcription of the utterance based upon the intermediate layer conditional probability distribution, wherein the intermediate layer transcription comprises a sequence of one or more tokens. The method may further comprise modifying the intermediate layer transcription to generate an intermediate layer masked token sequence, wherein modifying the intermediate layer transcription comprises masking the one or more tokens in the intermediate layer transcription having a confidence score below an intermediate layer threshold confidence level. The method may further comprise processing, by an intermediate layer decoder comprising a non-autoregressive, bi-directional decoder neural network, the intermediate embedding of the audio signal and the intermediate layer masked token sequence to generate an intermediate layer decoding output. The method may further comprise combining the intermediate layer embedding of the audio signal and the intermediate layer decoding output to provide an input for one or more subsequent neural network layers of the encoder neural network following the respective intermediate layer of the encoder neural network. The method may further comprise processing, by the one or more subsequent neural network layers, the input to generate the embedding of the audio signal and the conditional probability distribution over candidate transcriptions of the utterance given the input audio signal.
In the case where the encoder neural network comprises an intermediate layer decoder, the intermediate layer loss value is not directly based upon any of the intermediate layer decoding outputs. The intermediate layer decoding outputs will factor into the processing by the rest of the encoder neural network to generate the encoder transcription (initial transcription) and any further intermediate layer transcriptions along the way, and are hence incorporated into the encoder loss value only indirectly.
The encoder and decoder loss values may be indicative of an error produced by the system. The encoder loss value and/or the intermediate loss value may be computed based upon a CTC loss function. The decoder loss value may be based upon a negative log likelihood of the ground-truth masked tokens obtained from the probability distribution over candidate tokens for predicting a token for each of the masked tokens in the masked token sequence. The decoder loss value may be based upon a classification loss value determined based upon a negative log likelihood of the ground-truth labels obtained from the probability distributions over the labels indicating a classification of the utterance and over the labels indicating a parameter type respectively.
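A hedged sketch of these loss terms is given below (PyTorch). The tensor names, shapes and the simple unweighted sum are illustrative assumptions rather than the patent's exact formulation.

```python
import torch
import torch.nn.functional as F

def encoder_ctc_loss(log_probs, targets, input_lengths, target_lengths):
    # log_probs: (T, B, vocab) per-segment log-distributions from the encoder,
    # with index 0 assumed reserved for the CTC blank token.
    return F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)

def decoder_loss(token_logits, masked_positions, token_targets,
                 intent_logits, intent_target, slot_logits, slot_targets):
    # Negative log-likelihood of the ground-truth tokens, evaluated only at
    # the masked positions; token_logits is (B, T, vocab) and
    # masked_positions is a boolean (B, T) mask.
    token_nll = F.cross_entropy(token_logits[masked_positions], token_targets)
    # Classification losses for the utterance-level label and the per-word
    # parameter-type (slot) labels.
    intent_nll = F.cross_entropy(intent_logits, intent_target)
    slot_nll = F.cross_entropy(slot_logits.transpose(1, 2), slot_targets)
    return token_nll + intent_nll + slot_nll
```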
Any of the above probability distributions may be parameterized by the encoder and/or decoder neural networks. A softmax layer may be used to generate a probability distribution from a particular layer of a neural network.
The encoder and decoder may be configured to carry out the operations described in connection with the first method aspect.
The methods are computer-implemented methods. Since some methods in accordance with embodiments can be implemented by software, some embodiments encompass computer code provided to a general purpose computer on any suitable carrier medium. The carrier medium can comprise any storage medium such as a floppy disk, a CD ROM, a magnetic device or a programmable memory device, or any transient medium such as any signal e.g. an electrical, optical or microwave signal. The carrier medium may comprise a non-transitory computer readable storage medium. According to a further aspect, there is provided a carrier medium comprising computer readable code configured to cause a computer to perform any of the above described methods.
According to another aspect, there is provided a system comprising a memory storing processor readable instructions; and a processor arranged to read and execute instructions in said memory. The processor readable instructions are arranged to control the system to carry out operations according to any of the above method aspects.
Aspects can be combined and it will be readily appreciated that features described in the context of one aspect can be combined with other aspects.
For the purposes of illustration, example contexts in which the subject innovations can be applied are described in relation to Figures 1A-1D. However, it should be understood that these are exemplary, and the subject innovations may be applied in any suitable context, e.g. any context in which speech recognition/processing is applicable.
Voice Assistant System
Figure 1A is an illustration of a voice assistant system 120 in accordance with example embodiments.
A user 110 may speak a command 112, 114, 116 to the voice assistant system 120. In response to the user 110 speaking the command 112, 114, 116, the voice assistant system performs the command, which may include outputting an audible response.
To receive the spoken command 112, 114, 116, the voice assistant system 120 includes or is connected to a microphone. To output an audible response, the voice assistant system 120 includes or is connected to a speaker. The voice assistant system 120 may include functionality, e.g. software and/or hardware, suitable for recognising the spoken command, performing the command or causing the command to be performed, and/or causing a suitable audible response to be output. Alternatively, or additionally, the voice assistant system 120 may be connected via a network, e.g. via the internet and/or a local area network, to one or more other system(s) suitable for recognising the spoken command, causing the command to be performed, e.g. a cloud computing system and/or a local server. A first part of the functionality may be performed by hardware and/or software of the voice assistant system 120 and a second part of the functionality may be performed by the one or more other systems. In some examples, the functionality, or a greater part thereof, may be provided by the one or more other systems where these one or more other systems are accessible over the network, but the functionality may be provided by the voice assistant system 120 when they are not, e.g. due to the disconnection of the voice assistant system 120 from the network and/or the failure of the one or more other systems. In these examples, the voice assistant system 120 may be able to take advantage of the greater computational resources and data availability of the one or more other systems, e.g. to be able to perform a greater range of commands, to improve the quality of speech recognition, and/or to improve the quality of the audible output, while still being able to operate without a connection to the one or more other systems.
For example, in the command 112, the user 110 asks "What is X?". This command 112 may be interpreted by the voice assistant system 120 as a spoken command to provide a definition of the term X. In response to the command, the voice assistant system 120 may query a knowledge source, e.g. a local database, a remote database, or another type of local or remote index, to obtain a definition of the term X. The term X may be any term for which a definition can be obtained. For example, the term X could be a dictionary term, e.g. a noun, verb or adjective; or an entity name, e.g. the name of a person or a business. When the definition has been obtained from the knowledge source, the definition may be synthesised into a sentence, e.g. a sentence in the form of "X is [definition]". The sentence may then be converted into an audible output 122, e.g. using text-to-speech functionality of the voice assistant system 120, and output using the speaker included in or connected to the voice assistant system 120.
As another example, in the command 114, the user 110 says "Turn Off Lights". The command 114 may be interpreted by the voice assistant system as a spoken command to turn off one or more lights. The command 114 may be interpreted by the voice assistant system 120 in a context sensitive manner. For example, the voice assistant system 120 may be aware of the room in which it is located and turn off the lights in that room specifically. In response to the command, the voice assistant system 120 may cause one or more lights to be turned off, e.g. cause one or more smart bulbs to no longer emit light. The voice assistant system 120 may cause the one or more lights to be turned off by directly interacting with the one or more lights, e.g. over a wireless connection, such as a Bluetooth connection, between the voice assistant system and the one or more lights; or by indirectly interacting with the lights, e.g. sending one or more messages to turn the lights off to a smart home hub or a cloud smart home control server. The voice assistant system 120 may also produce an audible response 124, e.g. a spoken voice saying lights off, confirming to the user that the command has been heard and understood by the voice assistant system 120.
As an additional example, in the command 116, the user 110 says "Play Music". The command 116 may be interpreted by the voice assistant system as a spoken command to play music. In response to the command, the voice assistant system 120 may: access a music source, such as local music files or a music streaming service, stream music from the music source, and output the streamed music 126 from the speaker included in or connected to the voice assistant system 120. The music 126 outputted by the voice assistant system 120 may be personalised to the user 110. For example, the voice assistant system 120 may recognise the user 110, e.g. by the properties of the voice of user 110, or may be statically associated with the user 110, then resume the music previously played by the user 110 or play a playlist personalised to the user 110.
Speech Transcription System
Figure 1B is an illustration of a speech transcription system in accordance with example embodiments.
A user 130 may speak to a computer 140. In response to the user speaking, the computer 140 produces a textual output 142 representing the content of the speech 132.
To receive the speech, the computer 140 includes or is connected to a microphone.
The computer 140 may include software suitable for recognising the content of the speech audio and outputting text representing the content of the speech, e.g. transcribe the content of the speech. Alternatively, or additionally, the computer 140 may be connected via a network, e.g. via the internet and/or a local area network, to one or more other system(s) suitable for recognising the content of the speech audio and outputting text representing the content of the speech. A first part of the functionality may be performed by hardware and/or software of the computer 140 and a second part of the functionality may be performed by the one or more other systems. In some examples, the functionality, or a greater part thereof, may be provided by the one or more other systems where these one or more other systems are accessible over the network, but the functionality may be provided by the computer 140 when they are not, e.g. due to the disconnection of the computer 140 from the network and/or the failure of the one or more other systems. In these examples, the computer 140 may be able to take advantage of the greater computational resources and data availability of the one or more other systems, e.g. to improve the quality of speech transcription, while still being able to operate without a connection to the one or more other systems.
The outputted text 142 may be displayed on a display included in or connected to the computer 140. The outputted text may be input to one or more computer programs running on the computer 140, e.g. a word processing computer program or a web browser computer program.
Voice Assistance Method
Figure 1C is a flow diagram of a method 150 for performing voice assistance in accordance with example embodiments. Optional steps are indicated by dashed lines.
The example method 150 may be implemented as one or more computer-executable instructions executed by one or more computing devices, e.g. the hardware 1200 described in relation to Figure 12. The one or more computing devices may be or include a voice assistant system, e.g. the voice assistant system 120, and/or may be integrated into a multi-purpose computing device, such as a desktop computer, laptop computer, smartphone, smart television, or games console.
In step 152, speech audio is received using a microphone, e.g. a microphone of a voice assistant system or a microphone integrated into or connected to a multi-purpose computing device. As the speech audio is received, the speech audio may be buffered in a memory, e.g. a memory of a voice assistant system or a multi-purpose computing device.
In step 154, the content of the speech audio is recognised. The content of speech audio may be recognised using methods described herein, e.g. the method 200 of Fig. 2. Prior to the use of such methods, the audio may be pre-processed as described in further detail below. The recognised content of the speech audio may be text, syntactic content, and/or semantic content. The recognised content may be represented using one or more vectors. Additionally, e.g. after further processing, or alternatively, the recognised content may be represented using one or more tokens. Where the recognised content is text, each token and/or vector may represent a character, a phoneme, a morpheme or other morphological unit, a word part, or a word.
In step 156, a command is performed based on the content of the speech audio. The performed command may be, but is not limited to, any of the commands 112, 114, 116 described in relation to Fig. 1A, and may be performed in the manner described. The command to be performed may be determined by matching the recognised content to one or more command phrases or command patterns. The match may be approximate. For example, for the command 114 which turns off lights, the command may be matched to phrases containing the words "lights" and "off", e.g. "turn the lights off" or "lights off". The command 114 may also be matched to phrases that approximately semantically correspond to "turn the lights off", such as "close the lights" or "lamp off".
In step 158, an audible response is output based on the content of the speech audio, e.g. using a speaker included in or connected to a voice assistant system or multi-purpose computing device. The audible response may be any of the audible responses 122, 124, 126 described in relation to Fig. 1A, and may be produced in the same or a similar manner to that described. The audible response may be a spoken sentence, word or phrase; music; or another sound, e.g. a sound effect or alarm. The audible response may be based on the content of the speech audio in itself and/or may be indirectly based on the content of the speech audio, e.g. be based on the command performed, which is itself based on the content of the speech audio.
Where the audible response is a spoken sentence, phrase or word, outputting the audible response may include using text-to-speech functionality to transform a textual, vector or token representation of a sentence, phrase or word into spoken audio corresponding to the sentence, phrase or word. The representation of the sentence or phrase may have been synthesised on the basis of the content of the speech audio in itself and/or the command performed. For example, where the command is a definition retrieval command in the form "What is X?", the content of the speech audio includes X, and the command causes a definition, [definition], to be retrieved from a knowledge source. A sentence in the form "X is [definition]" is synthesised, where X is from the content of the speech audio and [definition] is content retrieved from a knowledge source by the command being performed.
As another example, where the command is a command causing a smart device to perform a function, such as a turn lights off command that causes one or more smart bulbs to turn off, the audible response may be a sound effect indicating that the function has been or is being performed.
As indicated by the dashed lines in the figures, the step of producing an audible response is optional and may not occur for some commands and/or in some implementations. For example, in the case of a command causing a smart device to perform a function, the function may be performed without an audible response being output. An audible response may not be output, because the user has other feedback that the command has been successfully completed, e.g. the light being off.
Speech Transcription Method
Figure 1D is a flow diagram of a method 160 for performing speech transcription in accordance with example embodiments. The example method 160 may be implemented as one or more computer-executable instructions executed by one or more computing devices, e.g. the hardware 1200 described in relation to Figure 12. The one or more computing devices may be a computing device, such as a desktop computer, laptop computer, smartphone, smart television, or games console.
In step 162, speech audio is received using a microphone, e.g. a microphone integrated into or connected to a computing device. As the speech audio is received, the speech audio may be buffered in a memory, e.g. a memory of a computing device.
In step 164, the content of the speech audio is recognised. The content of speech audio may be recognised using methods described herein, e.g. the method 200 of Fig. 2. Prior to the use of such methods, the audio may be pre-processed as described in more detail below. The recognised content of the speech audio may be textual content, syntactic content, and/or semantic content. The recognised content may be represented using one or more vectors. Additionally, e.g. after further processing, or alternatively, the recognised content may be represented using one or more tokens. Where the recognised content is text, each token and/or vector may represent a character, a phoneme, a morpheme or other morphological unit, a word part, or a word.
In step 166, text is output based on the content of the speech audio. Where the recognised content of the speech audio is textual content, the outputted text may be the textual content, or may be derived from the textual content as recognised. For example, the textual content may be represented using one or more tokens, and the outputted text may be derived by converting the tokens into the characters, the phonemes, the morphemes or other morphological units, word parts, or words that they represent. Where the recognised content of the speech audio is or includes semantic content, output text having a meaning corresponding to the semantic content may be derived. Where the recognised content of the speech audio is or includes syntactic content, output text having a structure, e.g. a grammatical structure, corresponding to the syntactic content may be derived.
The outputted text may be displayed. The outputted text may be input to one or more computer programs, such as a word processor or web browser. Further processing may be performed on the outputted text. For example, spelling and grammar errors in the outputted text may be highlighted or corrected. In another example, the outputted text may be translated, e.g. using a machine translation system.
Figure 2 - Flow diagram of method
Fig. 2 is a flow diagram of a speech processing method 200. In particular, the speech processing method jointly performs automatic speech recognition and spoken language understanding. In step 210, an audio signal capturing an utterance spoken by a user is received. The audio signal may be obtained from a sound capture apparatus such as a microphone on a user device. The utterance may express a command or action that the user wishes the device to perform.
In step 220, the audio signal is processed by an encoder to generate an embedding of the audio signal and an initial transcription of the utterance. The encoder comprises a non-autoregressive encoder neural network configured to generate the embedding of the audio signal. A non-autoregressive neural network produces all its outputs concurrently in one time step compared to an autoregressive neural network which generates its output over multiple time steps, typically one element at a time with the output generated so far fed-back as input. Thus, the non-autoregressive encoder neural network generates the embedding of the audio signal in full in one time step.
The embedding of the audio signal provides a latent representation of the audio signal containing salient information for use in generating the initial transcription and for a decoder in generating the speech processing output. The embedding of the audio signal may be a vector or tensor representation of the audio signal.
The encoder may perform pre-processing on the audio signal prior to processing by the encoder neural network to generate the embedding of the audio signal. For example, the encoder may carry out a frequency transformation of the audio signal to generate acoustic feature vectors. Alternatively, the audio signal may be pre-processed prior to its receipt in step 210. Further details with respect to the encoder are described in more detail below.
The initial transcription of the utterance comprises one or more tokens. The tokens may correspond to words or sub-word units, such as word-pieces, phonemes, characters or other suitable unit as deemed appropriate by a person skilled in the art. In one example, the set of tokens, or vocabulary, is generated by applying Byte Pair Encoding (BPE) to a suitable corpus such as a training dataset. For instance, the vocabulary may comprise 500 word-pieces generated by BPE.
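As one concrete, hedged example of building such a vocabulary, the SentencePiece toolkit can train a 500-piece BPE model; the choice of toolkit, the file names and the example sentence below are illustrative assumptions and are not prescribed by the patent.

```python
import sentencepiece as spm

# Train a 500-word-piece BPE vocabulary from a (hypothetical) transcript file.
spm.SentencePieceTrainer.train(
    input="training_transcripts.txt", model_prefix="bpe500",
    vocab_size=500, model_type="bpe")

sp = spm.SentencePieceProcessor(model_file="bpe500.model")
print(sp.encode("turn off the lights", out_type=str))
# Possible output, depending on the corpus: ['▁turn', '▁off', '▁the', '▁light', 's']
```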
Each of the one or more tokens in the initial transcription is associated with a confidence score of being the correct token for that position in the transcription. For example, the audio signal may be pre-processed into temporal audio segments and for each audio segment, the encoder, using the encoder neural network, may generate a probability distribution over the token vocabulary representing the likelihood that the audio segment comprises speech corresponding to a particular token. Any suitable speech recognition decoding method may be used to determine a transcription of the utterance from the probability distributions. For example, a Connectionist Temporal Classification (CTC) method may be used. The confidence scores associated with the one or more tokens may be generated based upon the probability distributions generated by the encoder and/or the particular decoding method used. Further details with respect to generating the initial transcription are described below.
At step 230, the initial transcription is modified to generate a masked token sequence. The modification comprises masking the one or more tokens in the initial transcription having a confidence score below a threshold confidence level. For example, a special "<MASK>" token may be used to replace a token having a confidence score below the threshold confidence level. Thus, only the tokens in the initial transcription with high confidence of being correct are retained whilst low confidence tokens are masked out. In one example, the confidence score threshold is 0.999 although it will be appreciated that any other value may be used as deemed appropriate by a person skilled in the art.
A pre-fix token may also be added to the start of the masked token sequence to facilitate generating a label indicating a classification of the utterance. The pre-fix token may be denoted with "<CLS>".
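Step 230 and the prefix addition can be sketched as follows in plain Python; the function name and the example tokens and confidences are illustrative, and the 0.999 threshold simply mirrors the example value above.

```python
MASK = "<MASK>"
CLS = "<CLS>"

def build_masked_sequence(tokens, confidences, threshold=0.999):
    """Replace low-confidence tokens with <MASK> and prepend the <CLS> prefix
    token used for utterance-level classification."""
    masked = [tok if conf >= threshold else MASK
              for tok, conf in zip(tokens, confidences)]
    return [CLS] + masked

# e.g. build_masked_sequence(["turn", "off", "the", "ligh"],
#                            [0.9995, 0.9999, 0.9991, 0.42])
# -> ["<CLS>", "turn", "off", "the", "<MASK>"]
```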
At step 240, the embedding of the audio signal generated by the encoder at step 220 and the masked token sequence generated at step 230 are processed by an output decoder to generate a speech processing output. The output decoder comprises a non-autoregressive, bi-directional decoder neural network. As discussed above, a non-autoregressive neural network generates its output concurrently in one time step. When processing a particular element of an input sequence, a bi-directional decoder neural network processes the particular element based on both the elements that precede the particular element in the input sequence and the subsequent elements that follow the particular element in the sequence. By comparison, a uni-directional decoder neural network typically only considers the elements preceding the particular element. Further details regarding the decoder are described below.
The speech processing output comprises a predicted token for each of the masked tokens in the masked token sequence to generate an output transcription of the utterance. That is, the output decoder uses the embedding of the audio signal and the high confidence tokens from the initial transcription produced by the encoder to help derive the masked-out low confidence tokens. The decoder may be specifically trained using a token prediction task for this purpose. Training of the decoder is described in more detail below. The predicted tokens generated by the decoder may be used to replace the corresponding masked tokens in the masked token sequence to generate an output transcription of the utterance.
The speech processing output further comprises a label indicating a classification of the utterance. For example, the label may correspond to an action that is to be performed in response to the utterance by an appropriate device or system. This type of speech processing output is sometimes known as "intent classification". The method may further comprise causing the device or system to carry out the action. For example, by transmitting an appropriate command.
The speech processing output further comprises one or more labels indicating a parameter type associated with respective words in the output transcription. The parameter types are associated with the classification of the utterance. For example, an action may require determining certain parameter values in order to fully form and execute the action. The words (or sub-word units) in the output transcription may be labelled with the corresponding parameter types for filling in the required parameters.
This type of speech processing output is sometimes known as "slot filling".
The speech processing output may be provided as two separate sequences, a first sequence comprising the output transcription of the utterance and a second sequence comprising the utterance-level label and the parameter type labels.
Figure 3 provides an example output transcription 310, together with an intent label 320 classifying the utterance and slot labels 330 indicating the corresponding parameter types of the words in the transcribed utterance.
Referring back to Figure 2, the method 200 provides a speech processing output comprising a predicted token for each of the masked tokens in the masked token sequence to generate an output transcription of the utterance, a label indicating a classification of the utterance, and one or more labels indicating a parameter type associated with respective words in the output transcription starting from an input audio signal capturing an utterance spoken by a user. The method therefore provides end-to-end, joint automatic speech recognition and spoken language understanding.
Referring now to Figure 4, the generation of the output transcription of the utterance may comprise iteratively refining the token predictions for the masked tokens in the token sequence. For example, in Figure 4, the masked token sequence 410 received by the decoder comprises the sequence, "H", "<MASK>", "<MASK>", "L", "O". In a first iteration, the decoder may make predictions 420 for each of the masked tokens in the masked token sequence. For example, the decoder may predict the first masked token is an "A" with confidence score 0.5. The decoder may predict the second masked token is an "L" with confidence score 0.95. The decoder may then, for each predicted token, replace the corresponding masked token in the masked token sequence with the predicted token when the confidence score is above a threshold level. Otherwise, if the confidence score is not above the threshold level, the masked token is maintained, i.e. not replaced. For example, in Figure 4, the prediction of "A" with confidence score 0.5 is less than the threshold level 0.9 and therefore the first masked token is maintained and not replaced with the predicted token. The prediction of "L" however for the second masked token has a confidence score greater than the threshold level and as such, the prediction "L" replaces the second "<MASK>" token in the masked token sequence. The modified masked token sequence 430 is therefore, "H", "<MASK>", "L", "L", "O".
A second iteration may be carried out to determine the remaining masked tokens in the masked token sequence by re-processing the masked token sequence with the newly filled token by the decoder. For example, based on the newly filled token, the decoder now makes a prediction 440 that the masked token is an "E" with a 0.95 confidence score. This is greater than the confidence threshold level and therefore the masked token is replaced with the predicted token "E". The new masked token sequence 450 is "H", "E", "L", "L", "O". There are no further masked tokens in the masked token sequence and therefore the transcription may be considered complete. In general, the iterative refinement may be carried out until all masked tokens have been replaced or until a fixed number of iterations have been performed. In the latter case, the masked tokens are replaced with their predictions in the final iteration regardless of the confidence score. In one example, a maximum of ten iterations are performed, though it will be appreciated that any suitable maximum may be selected as deemed appropriate by a person skilled in the art.
The iterative refinement enables predictions to be conditioned on earlier higher confidence tokens. As the higher confidence tokens are progressively filled-in, it becomes easier to make predictions for the previously lower confidence token positions.
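As an illustration, a minimal sketch of this confidence-thresholded refinement loop is given below. The `decoder` callable and its return format (a predicted token and a confidence score per position) are assumptions for illustration only and do not correspond to any particular library or to the specific implementation of the decoder described herein.

```python
MASK = "<MASK>"

def refine(decoder, audio_embedding, tokens, threshold=0.9, max_iters=10):
    """Iteratively fill in masked tokens whose predictions exceed `threshold`.
    On the final iteration any remaining masks are filled regardless of score."""
    for iteration in range(max_iters):
        if MASK not in tokens:
            break  # transcription complete
        predictions = decoder(audio_embedding, tokens)  # [(token, confidence), ...]
        last_pass = iteration == max_iters - 1
        tokens = [
            pred if tok == MASK and (conf >= threshold or last_pass) else tok
            for tok, (pred, conf) in zip(tokens, predictions)
        ]
    return tokens
```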
An exemplary method of generating an initial transcription will now be described. The exemplary method uses a CTC-based algorithm. As noted above, the audio signal may be pre-processed to provide a plurality of temporal audio segments. The audio segments may be processed by the encoder neural network to provide a probability distribution over the token vocabulary indicating the likelihood that the audio segment contains speech corresponding to a particular token in the vocabulary. The probability distribution over the token vocabulary for each audio segment may be used to determine the token for that audio segment. The transcription can be generated using a greedy selection process, that is, the token with the highest probability at each audio segment may be individually selected. Alternatively, a beam search may be used to consider multiple candidate token sequences, as selecting the individually most likely token at each segment may not necessarily yield the most likely sequence overall. Dynamic programming may be used to efficiently compute the probability of a token sequence.
It is likely that multiple audio segments will correspond to a single token and there may also be audio segments corresponding to silence. Consecutive repeating tokens may be collapsed to a single token instance for the transcription. In the CTC algorithm, the vocabulary includes a special blank token to enable audio segments to be labelled as silence and to provide a separator to prevent legitimate consecutive tokens from being collapsed, e.g. the double "L" in "HELLO". Thus, the repeated tokens are first collapsed and then the blank tokens are removed to provide the transcription. Further details with respect to training and the CTC loss function are described in further detail below.
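The following sketch illustrates greedy CTC decoding with the collapse-and-remove-blanks step described above. The blank token index, tensor shapes and the use of PyTorch are assumptions for illustration rather than details of the described system.

```python
import torch

BLANK_ID = 0  # assumed index of the CTC blank token in the vocabulary

def ctc_greedy_decode(log_probs: torch.Tensor) -> list:
    """Greedy CTC decoding: pick the most likely token per audio segment,
    collapse consecutive repeats, then drop blanks.
    `log_probs` has shape (num_segments, vocab_size)."""
    best_path = log_probs.argmax(dim=-1).tolist()   # e.g. [h, h, e, -, l, l, -, l, o]
    collapsed = []
    previous = None
    for token_id in best_path:
        if token_id != previous:                    # collapse repeated tokens
            collapsed.append(token_id)
        previous = token_id
    return [t for t in collapsed if t != BLANK_ID]  # remove blank separators
```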
Architecture

Fig. 5 is a schematic illustration of a system for performing speech processing in accordance with example embodiments. In particular, the system 500 may implement the processing of Figures 1 to 4. The system 500 may be implemented using one or more computer-executable instructions on one or more computing devices, e.g. the hardware described below.
The system 500 is configured to receive an audio signal 502 that captures an utterance spoken by a user. As discussed above, the audio signal may be obtained from a sound capture apparatus such as a microphone on a user device.
The system 500 comprises an encoder 504 which is configured to process the audio signal 502 to generate an embedding 506 of the audio signal and an initial transcription 508 of the utterance. The encoder 504 comprises a non-autoregressive encoder neural network 510 which is configured to generate the embedding 506 of the audio signal.
The encoder 504 may optionally comprise a transcription engine 512 configured to generate the initial transcription 508 of the utterance. For example, the transcription engine 512 may be configured to implement the CTC algorithm discussed above. Alternatively, the encoder neural network 510 may directly output the initial transcription 508.
The system 500 further comprises a masking engine 514 configured to modify the initial transcription 508 to generate a masked token sequence 516. As discussed above, the initial transcription 508 comprises one or more tokens and each of the one or more tokens is associated with a confidence score of being the correct token for that position in the transcription. Modifying the initial transcription 508 comprises masking the one or more tokens in the initial transcription 508 having a confidence score below a threshold confidence level. In Figure 5, the masking engine 514 is depicted as being part of the encoder 504; however, the masking engine 514 may be external to the encoder 504.
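A minimal sketch of the masking step, assuming the initial transcription is available as a list of tokens with per-token confidence scores; the function name and threshold default are illustrative assumptions.

```python
MASK = "<MASK>"

def mask_low_confidence(tokens, confidences, threshold=0.9):
    """Replace tokens whose confidence falls below `threshold` with <MASK>."""
    return [tok if conf >= threshold else MASK
            for tok, conf in zip(tokens, confidences)]

# e.g. mask_low_confidence(["H", "A", "L", "L", "O"], [0.99, 0.4, 0.95, 0.2, 0.97])
# -> ["H", "<MASK>", "L", "<MASK>", "O"]
```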
The system 500 further comprises an output decoder 518 which is configured to process the embedding 506 of the audio signal and the masked token sequence 516 to generate a speech processing output 520.
The output decoder 518 comprises a non-autoregressive, bi-directional decoder neural network 522. The speech processing output 520 comprises a predicted token 524 for each of the masked tokens in the masked token sequence 516 to generate an output transcription 526 of the utterance; a label 528 indicating a classification of the utterance; and one or more labels 530 indicating a parameter type associated with respective words in the output transcription 526, the parameter types associated with the classification of the utterance.
The decoder neural network 522 may be configured to generate the speech processing output 520 directly or may generate output for use in generating the speech processing output, such as probability distributions for predicting the masked tokens and determining the labels 528, 530. The decoder neural network 522 may comprise a first output head 532 for the predicted tokens and a second output head 534 for the labels 528, 530, the first output head 532 being distinct from the second output head 534. In this way, one part of the decoder neural network 522 becomes specialized in generating masked token predictions 524 and another part of the decoder neural network 522 becomes specialized in generating labels 528, 530. Two separate output sequences, one from each output head, may thus be generated.
Prior to the processing by the output decoder 518, a pre-fix token may be added to the start of the masked token sequence to facilitate generating the label indicating a classification of the utterance. The decoder neural network 522 may use the first element position to carry the utterance-level information and the first element in the output sequence comprising the labels 528, 530 may correspond to the utterance-level label 528. The pre-fix token may be denoted with "<CLS>".
The decoder neural network 522 may be based upon a Transformer architecture. Details regarding Transformer architectures can be found in Vaswani et al., "Attention is all you need", Advances in Neural Information Processing Systems, pp. 5998-6008, 2017, which is hereby incorporated by reference in its entirety. As discussed above, however, unlike a conventional Transformer decoder as in Vaswani et al., the decoder neural network 522 is both non-autoregressive and bi-directional. That is, where a conventional Transformer decoder generates one output element at a time conditioned on the previously generated elements, the decoder neural network 522 generates an output for all positions in the output sequence concurrently. In addition, when generating an output element, a conventional Transformer decoder avoids attending to positions of elements that have yet to be generated by masking those positions. The decoder neural network 522, when based upon a Transformer architecture, attends to all positions in the output and is hence bi-directional. Thus, when based upon a Transformer architecture, the decoder neural network 522 operates without the future position masking and is non-autoregressive. In this regard, the decoder neural network 522 operates more akin to an encoder of a Transformer. However, the decoder neural network 522 may still utilize the same structure as a conventional Transformer decoder block.
Figure 6 provides a schematic illustration of an exemplary Transformer decoder block 600 for use in constructing the decoder neural network 522. The decoder block 600 may comprise an input 602 to the block 600. For example, the input 602 may be the masked token sequence 516 if the decoder block 600 is the first decoder block of the decoder neural network 522 or the input 602 may be output of a previous decoder block. In addition, the input 602 of a first decoder block 600 may include a positional encoding to provide relative positional information regarding the input elements.
The decoder block 600 may further comprise a Multi-head Self-Attention layer 604 configured to process the input 602 and to perform a self-attention operation to generate an attention layer output. In general, a self-attention operation relates pairs of elements in the input 602 to determine a new representation of the input 602. The self-attention operation may comprise generating a query vector Q from the input 602 using a linear transformation matrix Wq. The self-attention operation may further comprise generating a key vector K from the input 602 using a second linear transformation matrix Wk. The self-attention operation may further comprise generating a value vector V from the input 602 using a third linear transformation matrix Wv. In general, the key vector acts as a means of content-based addressing for the values V. The values may be extracted based upon a similarity between the query and key vectors.
For example, a dot product may be taken between the query vector and key vector to determine a similarity score between the elements of the query vector and key vector. The similarity scores may be scaled by a scaling factor such as 1 / the square root of the dimensionality of the key vector. A softmax operation may be applied to the similarity scores to provide a set of attention weights. The attention weights may be used to perform a weighted sum of the values V to provide the output of the self-attention operation and the new representation of the input 602.
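A minimal single-head sketch of this scaled dot-product attention is shown below; the use of PyTorch is an illustrative assumption rather than a statement about the decoder's actual implementation.

```python
import math
import torch

def scaled_dot_product_attention(query, key, value):
    """Similarity scores from a query/key dot product, scaled by 1/sqrt(d_k),
    softmax-normalised into attention weights, and used to take a weighted
    sum of the values."""
    d_k = key.size(-1)
    scores = query @ key.transpose(-2, -1) / math.sqrt(d_k)  # similarity scores
    weights = torch.softmax(scores, dim=-1)                  # attention weights
    return weights @ value                                   # weighted sum of values
```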
The decoder block 600 may further comprise an "Add & Norm" layer 606 that is configured to combine the input 602 of the block 600 (via residual connection 608) with the output of the Multi-head Self-Attention layer 604 and to perform a layer normalization operation on the combination. The combination operation may be a summation.
The decoder block 600 may further comprise a Multi-head Cross-Attention layer 610 that is configured to perform an attention operation between the embedding 506 of the audio signal generated by the encoder neural network 510 and the output of the Add & Norm layer 606. The attention operation in the Multi-head Cross-Attention layer 610 may be the same as the attention operation in the Multi-head Self-Attention layer 604, except that the key vector and value vector are generated from the embedding 506 of the audio signal from the encoder neural network 510. The query vector may be generated using the output of the Add & Norm layer 606. In this way, the attention operation relates the embedding 506 generated by the encoder neural network 510 with the representation provided by the decoder block 600 to modify the representation based upon information provided by the encoder neural network 510.
The decoder block 600 may further comprise a second Add & Norm layer 612 configured to perform the same operations as the first Add & Norm layer 606 using the output of the Multi-head Cross-Attention layer 610 and the output of the first Add & Norm layer 606 (via residual connection 614) as input.
The decoder block 600 may further comprise a Feed-forward layer 616 configured to process the output of the second Add & Norm layer 612 to generate an output of the layer. The decoder block 600 may further comprise a third Add & Norm layer 618 configured to perform the same operations as the first and second Add & Norm layers 606, 612 using the output of the Feed-forward layer 616 and the output of the second Add & Norm layer 612 (via residual connection 620) to generate the output 622 of the decoder block 600.
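The following is a sketch, under stated assumptions, of a decoder block of the kind shown in Figure 6: self-attention, cross-attention against the encoder embedding and a feed-forward layer, each followed by a residual connection and layer normalisation, with no causal (future-position) mask applied. The dimensions and dropout value are illustrative defaults, not values prescribed by this description.

```python
import torch
from torch import nn

class NonAutoregressiveDecoderBlock(nn.Module):
    """Bi-directional, non-autoregressive Transformer decoder block sketch."""

    def __init__(self, d_model=256, n_heads=4, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, encoder_embedding):
        # Multi-head self-attention over all positions (no future masking).
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)                      # Add & Norm (cf. 606)
        # Cross-attention: queries from the decoder, keys/values from the encoder embedding.
        cross_out, _ = self.cross_attn(x, encoder_embedding, encoder_embedding)
        x = self.norm2(x + cross_out)                     # Add & Norm (cf. 612)
        x = self.norm3(x + self.ff(x))                    # Feed-forward + Add & Norm (cf. 618)
        return x
```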
The decoder neural network 522 may comprise a plurality of stacked decoder blocks 600. In one example, the decoder neural network 522 comprises six stacked decoder blocks. However, it will be appreciated that other numbers of decoder blocks may be used. The architectural parameters for the decoder block may be the same as the encoder block for the attention-type layers and the feed-forward layers.
As discussed above, the decoder neural network 522 may comprise a first output head 532 and a second output head 534. Each output head may comprise a linear layer followed by a softmax layer to generate a probability distribution. The input of an output head may be the output of a final decoder block 600.
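A simple sketch of such an output head (a linear layer followed by a softmax producing a probability distribution); the class name and default sizes are assumptions for illustration.

```python
import torch
from torch import nn

class OutputHead(nn.Module):
    """Linear layer followed by softmax, applied to the final decoder block output."""

    def __init__(self, d_model, num_classes):
        super().__init__()
        self.linear = nn.Linear(d_model, num_classes)

    def forward(self, decoder_output):                # (batch, positions, d_model)
        return torch.softmax(self.linear(decoder_output), dim=-1)

# The decoder may use one such head for masked-token predictions (532) and a
# distinct head for the utterance-level and parameter type labels (534).
```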
Referring back to the encoder 504, the encoder 504 may optionally comprise an audio processing front-end 536. The audio processing front-end 536 may be configured to extract acoustic features from the audio signal 502. For example, the audio signal 502 may be split into frames of 25ms with a 10ms shift. Acoustic features may be extracted for each frame using a filter bank. In one example, the acoustic features comprise 80-dimensional filter banks together with 3-dimensional pitch-related information. Processing for acoustic feature extraction may however be carried out externally to the encoder 504 and the received audio signal 502 may comprise acoustic features to be processed by the encoder neural network 510.
The audio processing front-end 536 may comprise a two-layer convolutional neural network. The kernel size of the convolutional layers may be 3 with stride 2. This has the effect of reducing the frame-rate by a factor of 4.
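A sketch of such a convolutional front-end is given below, assuming 2-D convolutions over the time-feature plane; the channel count and the exact reshaping are illustrative assumptions. Each stride-2 layer roughly halves the frame rate, giving an overall reduction by a factor of 4.

```python
import torch
from torch import nn

class ConvFrontEnd(nn.Module):
    """Two convolutional layers with kernel size 3 and stride 2 (sketch)."""

    def __init__(self, out_channels=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, out_channels, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=2), nn.ReLU(),
        )

    def forward(self, features):
        # features: (batch, frames, feature_dim), e.g. 80 filter banks + 3 pitch features
        x = self.conv(features.unsqueeze(1))              # add a channel dimension
        batch, channels, frames, feat = x.shape
        return x.permute(0, 2, 1, 3).reshape(batch, frames, channels * feat)
```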
The encoder neural network 510 may be based upon a Conformer architecture. Details regarding the Conformer architecture can be found in Gulati et al., "Conformer: Convolution-augmented Transformer for speech recognition," in Interspeech, 2020, pp. 5036-5040 which is hereby incorporated by reference in its entirety. In brief however, the Conformer attempts to incorporate convolutions into a Transformer block to provide a Conformer block. Figure 7 provides a schematic illustration of a Conformer encoder block 700.
The exemplary Conformer encoder block 700 of Figure 7 comprises an input 702 to the encoder block 700. The encoder block 700 may comprise a first half-step feed-forward layer 704 configured to process the input 702. A half-step feed-forward layer is a feed-forward layer where the output of the feed-forward layer is multiplied by 0.5. The encoder block 700 may be configured to combine 706 the output of the first half-step feed-forward layer 704 with the input 702 to the encoder block 700. The combination 706 may be a summation.
The encoder block 700 may further comprise a Multi-head Self-Attention layer 708 configured to process the output of the combination 706 to perform a self-attention operation to generate an attention layer output. The self-attention operation may be carried out as described above in relation to the decoder block 600. The input to the Multi-head Self-Attention layer 708 may be augmented to include positional information which may utilize a relative sinusoidal positional encoding scheme.
The encoder block 700 may be further configured to combine 710 the output of the Multi-head Self Attention layer 708 and the output of the previous combination 706 via a residual connection. The combination 710 may also be a summation.
The encoder block 700 may further comprise a Convolutional layer 712 configured to process the output of the combination 710. The Convolutional layer 712 may implement a single convolution operation or may implement a plurality of convolution operations. For example, the convolution operations may comprise a pointwise convolution and/or a 1D depthwise convolution. The Convolutional layer 712 may include further operations before or after the convolution operations such as layer normalization, gated liner units, batch normalization, swish activation and dropout.
The encoder block 700 may be further configured to combine 714 the output of the Convolutional layer 712 and the output of the previous combination 710 via a residual connection. The combination 714 may also be a summation.
The encoder block 700 may further comprise a second half-step feed-forward layer 716 configured to process the output of the combination 714. The encoder block 700 may be further configured to combine 718 the output of the second half-step feed-forward layer 716 and the output of the previous combination 714 via a residual connection.
The encoder block 700 may further comprise a Layer Normalization operation 720 configured to process the output of the combination 718 to provide an output 722 of the encoder block 700.
The encoder neural network 510 may comprise a plurality of stacked Conformer blocks 700. In one example, the encoder neural network 510 comprises 12 stacked Conformer blocks with each block having 2048 units in the half-step feed-forward layers, 4 attention heads in the Multi-head Self-Attention layers with an attention dimensionality of 256, and a convolution kernel size of 15 in the Convolutional layers. It will be appreciated however that other numbers of blocks and architectural parameters may be used as deemed appropriate by a person skilled in the art.
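A compact sketch of a Conformer block along the lines of Figure 7 is shown below, using the example sizes above (256-dim attention, 4 heads, 2048 feed-forward units, kernel size 15). The positional encoding, dropout and gating details are simplified, and the PyTorch realisation is an assumption for illustration.

```python
import torch
from torch import nn

class ConvModule(nn.Module):
    """Convolution module: pointwise conv + GLU, depthwise conv, batch norm, swish, pointwise conv."""
    def __init__(self, d_model=256, kernel_size=15):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, kernel_size=1)
        self.glu = nn.GLU(dim=1)
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.bn = nn.BatchNorm1d(d_model)
        self.act = nn.SiLU()                          # swish activation
        self.pointwise2 = nn.Conv1d(d_model, d_model, kernel_size=1)

    def forward(self, x):                             # x: (batch, time, d_model)
        y = self.norm(x).transpose(1, 2)              # -> (batch, d_model, time)
        y = self.glu(self.pointwise1(y))
        y = self.act(self.bn(self.depthwise(y)))
        return self.pointwise2(y).transpose(1, 2)

class ConformerBlock(nn.Module):
    """Half-step FFN, self-attention, convolution module, half-step FFN, final layer norm."""
    def __init__(self, d_model=256, n_heads=4, d_ff=2048, kernel_size=15):
        super().__init__()
        ff = lambda: nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, d_ff),
                                   nn.SiLU(), nn.Linear(d_ff, d_model))
        self.ff1, self.ff2 = ff(), ff()
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = ConvModule(d_model, kernel_size)
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + 0.5 * self.ff1(x)                     # first half-step feed-forward (704/706)
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a)[0]                 # multi-head self-attention (708/710)
        x = x + self.conv(x)                          # convolution module (712/714)
        x = x + 0.5 * self.ff2(x)                     # second half-step feed-forward (716/718)
        return self.final_norm(x)                     # layer normalization (720)
```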
Referring now to Figure 8, the encoder neural network 510 may comprise an intermediate layer decoder 802 at one or more intermediate layers of the encoder neural network 510 (i.e. a hidden layer of the neural network). Where the encoder neural network 510 comprises a plurality of stacked Conformer blocks, the intermediate layer decoder 802 may be provided in between two blocks. In one example, intermediate layer decoders are provided at the end of the 3rd, 6th and 9th blocks (out of 12 blocks in the encoder neural network).
The intermediate layer decoder 802 may be configured in substantially the same way as the output decoder 518 and may have the same architecture and may share parameters.
The particular intermediate layer 804 may be configured to generate an intermediate layer embedding 806 of the audio signal. This may simply be the output of the intermediate layer 804. An intermediate layer transcription 808 may be generated from the intermediate layer embedding 806. The encoder 504 may further comprise an intermediate layer transcription engine 810 configured to generate the intermediate layer transcription 808. Alternatively, the transcription engine 512 may be used or the intermediate layer 804 may directly provide the intermediate layer transcription 808.
The encoder 504 may further comprise an intermediate layer masking engine 812 configured to modify the intermediate layer transcription 808 to generate an intermediate layer masked token sequence 814 in the same way as the masking engine 514 is configured to modify the initial transcription 508 to generate the masked token sequence 516. That is, modifying the intermediate layer transcription 808 comprises masking the one or more tokens in the intermediate layer transcription 808 having a confidence score below an intermediate layer threshold confidence level. The intermediate layer threshold may be set to a different value to the threshold confidence level used for masking the initial transcription 508. Where there is a plurality of intermediate layer decoders, the intermediate layer thresholds may differ between the decoders or may have the same value. For example, where there are intermediate layer decoders at the end of the 3rd, 6th and 9th blocks, the intermediate layer thresholds may be 0.9, 0.99 and 0.999 respectively. The masking engine 514 may be used instead of a separate intermediate layer masking engine 812.
The intermediate layer decoder 802 may be configured to process the intermediate layer embedding 806 of the audio signal and the intermediate layer masked token sequence 814 to generate an intermediate layer decoding output 816. The encoder 504 is configured to combine the intermediate layer embedding 806 of the audio signal and the intermediate layer decoding output 816 to provide an input 818 for one or more subsequent neural network layers of the encoder neural network 510 following the respective intermediate layer 804 of the encoder neural network 510. The combination may be a concatenation or a summation or other function as deemed appropriate by a person skilled in the art. An exemplary combination operation is described in more detail below.
The one or more subsequent neural network layers may be configured to process the input 818 to generate the embedding 506 of the audio signal and the initial transcription 508 of the utterance as discussed above.
The intermediate layer decoding output 816 may comprise data relating to masked token prediction of the intermediate layer masked token sequence 814, generation of an utterance-level classification and generation of parameter type labels. For example, the intermediate layer decoding output 816 may comprise a masked token probability distribution comprising a probability distribution over candidate tokens for predicting a token for each of the masked tokens in the intermediate layer masked token sequence 814. The intermediate layer decoding output 816 may comprise an utterance classification probability distribution comprising a probability distribution over labels indicating a classification of the utterance. The intermediate layer decoding output 816 may comprise a parameter type probability distribution comprising a probability distribution over labels indicating a parameter type for the respective words in the intermediate layer transcription 808.
In this regard, the encoder 504 may be configured to combine the intermediate layer embedding 806 of the audio signal and the intermediate layer decoding output 816 by combining the masked token probability distribution, the utterance classification probability distribution and the parameter type probability distribution. The probability distributions may be combined by concatenating the distributions into a single vector or tensor or through other appropriate techniques. The encoder 504 may be further configured to perform a linear projection of the combined probability distributions such that the dimensionality matches the intermediate layer embedding 806 of the audio signal. The linear projection may be learned as part of training the encoder. Training is described in more detail below.
The encoder 504 may be further configured to sum the projected combined probability distribution and the intermediate layer embedding 806 of the audio signal to provide the input 818 for the one or more subsequent neural network layers. The combination operation may be implemented as further neural network layers of the encoder neural network 510.
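A minimal sketch of this combination, assuming the decoder outputs are already aligned per position: the three probability distributions are concatenated, linearly projected to the model dimension and summed with the intermediate layer embedding. The class name, shapes and dimensions are illustrative assumptions.

```python
import torch
from torch import nn

class IntermediateFusion(nn.Module):
    """Concatenate decoder output distributions, project, and sum with the embedding."""

    def __init__(self, vocab_size, num_intents, num_slots, d_model=256):
        super().__init__()
        self.project = nn.Linear(vocab_size + num_intents + num_slots, d_model)

    def forward(self, embedding, token_probs, intent_probs, slot_probs):
        # embedding:    (batch, time, d_model)  intermediate layer embedding 806
        # token_probs:  (batch, time, vocab_size)   masked token distribution
        # intent_probs: (batch, time, num_intents)  utterance distribution, broadcast per position
        # slot_probs:   (batch, time, num_slots)    parameter type distribution
        combined = torch.cat([token_probs, intent_probs, slot_probs], dim=-1)
        return embedding + self.project(combined)     # input 818 for subsequent layers
```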
Referring now to Figure 9, a further combination operation is described. The audio signal 502 may comprise a plurality of audio segments. The encoder 504 may be configured to generate an intermediate layer transcription 808 of the utterance based upon a generated probability distribution over candidate tokens for transcribing each audio segment and generating the intermediate layer transcription 808 based upon the probability distributions for each audio segment. The probability distributions may be generated based upon the intermediate layer embedding 806. For example, as shown in Figure 9, the intermediate layer embedding 806 is transformed using a linear neural network layer and a softmax layer 902 to generate the probability distributions over the candidate tokens for each audio segment. The probability distributions for each audio segment are represented by the vectors labelled Z 904.
The encoder 504 may then be configured to generate the intermediate layer transcription 808 from the probability distributions 904, for example using the CTC algorithm discussed above. In Figure 9, the intermediate layer transcription 808 is labelled as Y 906 having tokens y1, y2, y3 and y4. The encoder 504 may be configured to determine an audio segment position corresponding to each token in the intermediate layer transcription 808, which will be used in the combination operation later.
As discussed above, the encoder 504 may be configured to generate an intermediate layer masked token sequence 814. In Figure 9, the token y2 has been replaced by a "<MASK>" token. In addition, a "<CLS>" token is pre-fixed to the start of the intermediate layer masked token sequence 814 for generating the utterance-level classification probability distribution / label.
As discussed above, the intermediate layer decoder 802 may be configured to generate a masked token probability distribution, an utterance classification probability distribution and a parameter type probability distribution. The masked token probability distribution and the parameter type probability distribution may be mapped to the audio segment positions based upon the corresponding tokens in the intermediate layer transcription 808. This is shown schematically in Figure 9 whereby the masked token probability distribution, labelled y2, 908 is positioned in line with the corresponding masked token, <Mask>, 907. Likewise, the parameter type probability distributions are positioned in line with their corresponding tokens, that is, os1, 910 is positioned in line with y1, 912; os2, 914 is positioned in line with <Mask>/y2, 907; os3, 916 is positioned in line with y3, 918; and os4, 920 is positioned in line with y4, 922 to represent the mapping to audio segment positions.
The utterance classification probability distribution may be associated with each of the audio segment positions in the mapping as shown schematically in Figure 9 by the utterance classification probability distribution, oi, 924 being broadcast to each of the above audio segment positions.
The probability distributions at each audio segment position may then be combined. The combination may be a concatenation of the probability distributions mapped to the corresponding audio segment positions. To facilitate the combination, the probability distribution over candidate tokens 904 may include additional slots for the other probability distributions to be placed in. The token vocabulary may therefore include the utterance level classes and the parameter type classes.
As shown in Figure 9, the encoder neural network 510 may further comprise a linear neural network layer 926 configured to apply a linear projection to the combined probability distributions such that the dimensionality matches that of the intermediate layer embedding 806. The encoder neural network 510 may further comprise a layer normalization layer 928 configured to perform a layer normalization operation on the intermediate layer embedding 806 prior to summation with the linearly projected combined probability distribution.
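A sketch of the Figure 9 combination, under stated assumptions: token-level decoder outputs are placed at the audio segment position of their corresponding token, the utterance-level distribution is broadcast to every position, the result is concatenated with the per-segment distribution Z, linearly projected, and summed with the layer-normalised intermediate embedding. Shapes, function names and the use of concatenation (rather than a single expanded vocabulary with additional slots) are illustrative assumptions.

```python
import torch
from torch import nn

def fuse_with_alignment(embedding, segment_probs, token_positions,
                        decoder_token_probs, slot_probs, intent_probs,
                        linear_proj: nn.Linear, layer_norm: nn.LayerNorm):
    """embedding:           (time, d_model)   intermediate layer embedding 806
       segment_probs:       (time, vocab)     per-segment distributions Z 904
       token_positions:     list of segment indices, one per transcription token
       decoder_token_probs: (num_tokens, vocab)     masked-token distributions
       slot_probs:          (num_tokens, num_slots) parameter type distributions
       intent_probs:        (num_intents,)          utterance-level distribution
       linear_proj must accept vocab*2 + num_slots + num_intents input features."""
    time = segment_probs.size(0)
    mapped_tokens = torch.zeros_like(segment_probs)
    mapped_slots = torch.zeros(time, slot_probs.size(-1))
    for i, pos in enumerate(token_positions):         # place token-level outputs
        mapped_tokens[pos] = decoder_token_probs[i]   # at their segment positions
        mapped_slots[pos] = slot_probs[i]
    mapped_intent = intent_probs.expand(time, -1)     # broadcast utterance label
    combined = torch.cat([segment_probs, mapped_tokens, mapped_slots, mapped_intent], dim=-1)
    return layer_norm(embedding) + linear_proj(combined)
```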
Training

Figure 10 is a flow diagram showing processing 1000 for training a speech processing system such as the system of Figure 5. The speech processing system comprises an encoder and an output decoder. The encoder comprises a non-autoregressive encoder neural network and the output decoder comprises a non-autoregressive, bi-directional decoder neural network, both of which comprise a plurality of trainable parameters. The training aims to obtain optimal values for the plurality of trainable parameters for the speech processing tasks described herein.
At step 1005, an audio signal capturing an utterance spoken by a user for training the speech processing system is received. The audio signal may be obtained from a training data set. For example, a suitable training data set may be the SLURP dataset, described in Bastianelli et al., "SLURP: A spoken language understanding resource package," in 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 7252-7262, which is hereby incorporated by reference in its entirety. In brief, the SLURP dataset comprises 140,000 spoken utterances from the domain of an in-home personal robot assistant.
At step 1010, the audio signal is processed by the encoder to generate an embedding of the audio signal and a conditional probability distribution over candidate transcriptions of the utterance given the input audio signal. The conditional probability distribution may be generated based on a particular method used for generating candidate transcriptions for the utterance. For example, the conditional probability distribution may be generated based upon the CTC algorithm. Further details regarding generation of the conditional probability distribution are described below.
At step 1015, a ground-truth transcription of the utterance is received. The ground-truth transcription comprises a sequence of one or more tokens and is one target output for the system. The training data set typically includes the ground-truth transcription for the input audio signal. It will be appreciated that the ground-truth transcription may be received at an earlier point, for example, together with the audio signal at step 1005.
At step 1020, an encoder loss value is computed based upon the conditional probability distribution over candidate transcriptions of the utterance and the ground-truth transcription of the utterance. The encoder loss value may be indicative of the error produced by the encoder when the conditional probability distribution is used to generate a transcription of the utterance. The encoder loss value is computed based upon a loss function and may comprise a negative log likelihood. The loss function may also be dependent on the particular transcription method as noted above. For example, the encoder loss value may be based upon a CTC loss function. An exemplary loss function is described in more detail below.
At step 1025, the ground-truth transcription is modified to generate a masked token sequence, that is, one or more tokens in the ground-truth transcription are replaced with a "<MASK>" token. The number of tokens selected for masking may be selected at random from a uniform distribution. Alternatively, the number of tokens selected for masking may be determined as a proportion of the length of the ground-truth transcription, or a fixed number of tokens may be selected for masking. The one or more tokens selected for masking may also be selected at random from a uniform distribution.
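A minimal sketch of this training-time masking, assuming both the number of masked positions and the positions themselves are drawn uniformly at random; the function name is an illustrative assumption.

```python
import random

MASK = "<MASK>"

def mask_ground_truth(tokens, rng=random):
    """Mask a uniformly-sampled number of ground-truth tokens at uniformly-sampled positions."""
    num_to_mask = rng.randint(1, len(tokens))            # uniform number of masks
    positions = set(rng.sample(range(len(tokens)), num_to_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]
```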
At step 1030, the embedding of the audio signal and the masked token sequence is processed by the output decoder to generate a speech processing output. The speech processing output comprises a probability distribution over candidate tokens for predicting a token for each of the masked tokens in the masked token sequence; a probability distribution over labels indicating a classification of the utterance; and a probability distribution over labels indicating a parameter type associated with respective words in an output transcription, the parameter types associated with the classification of the utterance. Each of the probability distributions may be generated as discussed above.
At step 1035, ground-truth labels indicating the classification of the utterance and the parameter types respectively are received. The ground-truth labels comprise another target output of the system. The training data set typically includes the ground-truth labels. It will be appreciated that the ground-truth labels may be received at an earlier point, for example, together with the ground-truth transcriptions at step 1015 or with the audio signal at step 1005.
At step 1040, a decoder loss value based upon the speech processing output and the ground-truth transcription and labels is computed. The decoder loss value may be indicative of the error produced by the decoder when the probability distributions from the speech processing output are used to generate an output transcription of the utterance and utterance-level and parameter type classifications. The decoder loss value is computed based upon a loss function. The loss function may be a joint loss function and may comprise the encoder loss value and decoder loss value as components. The decoder loss value may comprise a negative log likelihood. An exemplary loss function is described in more detail below.
At step 1045, the values of the plurality of trainable parameters of the decoder neural network and encoder neural network are adjusted based upon the decoder loss value. At step 1050, the values of the plurality of trainable parameters of the encoder neural network are adjusted based upon the encoder loss value. The adjustment of the plurality of trainable parameters may be carried out using standard neural network techniques, for example using backpropagation and stochastic gradient descent. In one example, stochastic gradient descent is used with mini-batches of size 64.
The training process 1000 may be repeated for a number of iterations and/or until a stopping criterion is reached. In one example, the training process is repeated for a maximum of 400 epochs (i.e. 400 passes through the training data set).
The training data set may also be augmented by applying the SpecAugment algorithm. Details regarding SpecAugment can be found in Park et al., "Specaugment: A simple data augmentation method for automatic speech recognition," in Interspeech, 2019, pp.2613-2617, which is hereby incorporated by reference in its entirety. In brief, for each training audio signal, SpecAugment converts the audio signal into spectrogram form and then applies one or more transformations on the spectrogram. Example transformations include warping in the time dimension, masking blocks of consecutive frequency channels and masking blocks of utterances in time. The transformed spectrogram may be converted back into an audio signal or alternatively, the system may process the transformed spectrogram directly as acoustic features or to generate acoustic features.
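A sketch of SpecAugment-style masking using torchaudio's transforms; the choice of library, the mask parameters and the dummy spectrogram are assumptions for illustration, and time warping is omitted for brevity.

```python
import torch
import torchaudio

# Frequency and time masking applied to a spectrogram of shape (channel, freq bins, frames).
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=27)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=100)

spectrogram = torch.randn(1, 80, 1000)        # dummy data for illustration
augmented = time_mask(freq_mask(spectrogram)) # masked blocks of frequencies and frames
```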
Referring now to Figure 11, where the speech processing system comprises an intermediate layer decoder such as the systems of Figures 8 and 9, the processing by the encoder at step 1010 during training may further comprise the following processing 1100.
At step 1105, an intermediate layer embedding of the audio signal and an intermediate layer conditional probability distribution over candidate transcriptions of the utterance given the output of the layer preceding the intermediate layer is generated. This intermediate layer conditional probability distribution may be generated as discussed above with reference to Figures 8 and 9.
At step 1110, an intermediate layer loss value is computed based upon the intermediate layer conditional probability distribution and the ground-truth transcription. The encoder loss value may be further based upon the intermediate layer loss value. For example, the intermediate layer loss value may be summed together with the encoder loss value described above. The intermediate layer loss value may be computed in the same way as the encoder loss value, for example using a CTC loss function. Further details regarding the intermediate layer loss value are described below.
At step 1115, an intermediate layer transcription of the utterance based upon the intermediate layer conditional probability distribution is generated. The intermediate layer transcription may be generated as discussed above, for example, using a CTC algorithm.
At step 1120, the intermediate layer transcription is modified to generate an intermediate layer masked token sequence. That is, the one or more tokens in the intermediate layer transcription having a confidence score below an intermediate layer threshold confidence level are replaced with a "<MASK>" token as discussed above.
At step 1125, the intermediate embedding of the audio signal and the intermediate layer masked token sequence are processed by an intermediate layer decoder to generate an intermediate layer decoding output. The intermediate layer decoder comprises a non-autoregressive, bi-directional decoder neural network with a plurality of trainable parameters. The trainable parameters of the intermediate layer decoder may be shared with the output decoder and any other intermediate layer decoder present.
At step 1130, the intermediate layer embedding of the audio signal and the intermediate layer decoding output are combined to provide an input for one or more subsequent neural network layers of the encoder neural network following the respective intermediate layer of the encoder neural network. The combining may be carried out as discussed above with reference to Figures 8 and 9.
At step 1135, the input is processed by the one or more subsequent neural network layers to generate the embedding of the audio signal and the conditional probability distribution over candidate transcriptions of the utterance given the input audio signal.
Exemplary loss functions suitable for computing the encoder loss value and the decoder loss value will now be described in more detail. As noted above, the encoder loss value may be based upon the particular decoding method for generating transcriptions. For example, where the CTC algorithm is used, the standard CTC loss function may be used as the encoder loss function to compute the encoder loss value.
As discussed above, the encoder neural network may provide a conditional probability distribution over the token vocabulary for each temporal audio segment. A token is selected for each temporal audio segment to provide a token sequence known as an alignment. More formally, the conditional probability of an alignment, A, given input audio, X, may be formulated as a product of the per-segment token probabilities:

$$P_{ctc}(A \mid X) = \prod_{t=1}^{T} p_{ctc}(a_t \mid X)$$

where T is the number of temporal audio segments.
The conditional probability of a transcript, Y, given input audio, X, may be computed by marginalizing over the set of valid alignments for the transcript, Y:

$$P_{ctc}(Y \mid X) = \sum_{A \in \beta^{-1}(Y)} P_{ctc}(A \mid X)$$

where $\beta^{-1}(Y)$ returns all possible valid alignments for transcript, Y. Given the properties of the CTC alignments, the summation can be computed efficiently using dynamic programming.
The CTC loss function may be based upon minimizing the negative log-likelihood:

$$L_{ctc} = -\log P_{ctc}(Y \mid X)$$

The encoder loss value may be the negative log-likelihood for the input training audio signal and its corresponding ground-truth transcription. Alternatively, depending on the type of update rule used for adjusting the trainable parameters, the encoder loss value may be a summation of the negative log-likelihood over a batch of training data items or the whole training data set.
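By way of illustration, the CTC negative log-likelihood can be computed with PyTorch's built-in loss; the shapes, batch size, blank index and vocabulary size below are assumptions for the sketch and the random tensors stand in for real encoder outputs and targets.

```python
import torch
from torch import nn

ctc_loss = nn.CTCLoss(blank=0)                          # blank token assumed at index 0

log_probs = torch.randn(100, 8, 682).log_softmax(-1)    # (T segments, batch, vocab)
targets = torch.randint(1, 682, (8, 20))                # ground-truth token ids (batch, length)
input_lengths = torch.full((8,), 100)                   # segments per utterance
target_lengths = torch.full((8,), 20)                   # tokens per transcription

encoder_loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
```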
Where intermediate layer transcriptions are generated, the encoder loss value may further comprise an intermediate layer loss value. The intermediate layer loss value may be computed in the same way as the encoder loss value and may be combined to generate the overall encoder loss value. For example, the overall encoder loss value may be a weighted sum of the encoder loss value and the intermediate layer loss value according to the following equation:

$$L_{enc} = \eta L_{ctc} + (1 - \eta) L_{inter\text{-}ctc}$$

where $L_{enc}$ is the overall encoder loss value, $L_{ctc}$ is the encoder loss value computed based upon the initial transcription, $L_{inter\text{-}ctc}$ is the intermediate layer loss value, and $\eta$ is a predetermined hyper-parameter. In one example, $\eta$ is set to 0.5, though it will be appreciated that other values may be used.
In the case where a plurality of intermediate layer transcriptions is generated, an intermediate layer loss value may be computed for each and an average taken to provide the overall intermediate layer loss value for use in the above equation. In these examples, it is noted that the intermediate layer loss value is not directly based upon any of the intermediate layer decoding outputs. The intermediate layer decoding outputs will factor into the processing by the rest of the encoder neural network to generate the encoder transcription (initial transcription) and any further intermediate layer transcriptions along the way and is hence incorporated into the encoder loss value only indirectly.
The decoder loss value may comprise a first component based upon the probability distribution over candidate tokens for predicting the masked tokens and a second component based upon the probability distributions over the utterance-level label and parameter type labels. For example, the conditional probability distribution for predicting the masked tokens may be represented as:

$$P_{dec\text{-}mask}(Y_{mask} \mid Y_{obs}, X) = \prod_{y \in Y_{mask}} P(y \mid Y_{obs}, X)$$

where $Y_{mask}$ is the set of masked tokens, $Y_{obs}$ is the set of observed tokens and X is the embedding of the audio signal generated by the encoder neural network. The first component of the decoder loss value may then be computed as the negative log likelihood as shown in the equation below:

$$L_{dec\text{-}mask} = -\log P_{dec\text{-}mask}(Y_{mask} \mid Y_{obs}, X)$$

Like the encoder loss value, the above negative log-likelihood may be computed for a single training audio signal and its corresponding ground-truth, or may be a summation of the negative log-likelihood over a batch of training audio signals or the whole training data set.
The probability distribution $P_{dec\text{-}mask}$ may be parameterized by the decoder neural network and in particular, by a first output head having a softmax output as described above.
The probability distributions over the utterance-level label and the parameter type labels may be parameterized by the decoder neural network and by a second output head also having a softmax output as described above. The probability distribution may be represented according to the following equation:

$$P_{dec\text{-}labels}(O \mid Y, X) = P(o_i \mid h_0) \prod_{n=1}^{N} P(o_s^n \mid h_n)$$

where $O = [o_i; o_s]$ is the utterance-level label, $o_i$, followed by the parameter type labels $o_s = \{o_s^1, \dots, o_s^N\}$, Y is the masked token sequence including a pre-fix token, $\{h_0, \dots, h_N\}$ are the final hidden states of the second output head of the decoder neural network and N is the length of the ground-truth transcript.
The second component of the decoder loss value may be computed as the negative log likelihood as shown below:

$$L_{dec\text{-}labels} = -\log P_{dec\text{-}labels}(O \mid Y, X)$$

Again, the above negative log-likelihood may be computed for a single training audio signal and its corresponding ground-truth labels, or may be a summation of the negative log-likelihood over a batch of training audio signals or the whole training data set.
The overall decoder loss value may be a weighted sum of the above first and second components as shown below:

$$L_{dec} = \gamma L_{dec\text{-}mask} + (1 - \gamma) L_{dec\text{-}labels}$$

where $\gamma$ is a pre-determined hyper-parameter. In one example, $\gamma$ is set to 0.5, though it will be appreciated that other values may be used.
The encoder loss values and decoder loss values may also be combined to form an overall loss function for training the system. For example, a weighted sum may be computed as shown below:

$$L_{sys} = \mu L_{enc} + (1 - \mu) L_{dec}$$

where $\mu$ is a pre-determined hyper-parameter. In one example, $\mu$ is set to 0.4, though it will be appreciated that other values may be used.
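For illustration, the three weighted sums above can be combined in a single helper; the function name and the default weights (taken from the examples above) are assumptions of this sketch.

```python
def joint_loss(l_ctc, l_inter_ctc, l_dec_mask, l_dec_labels,
               eta=0.5, gamma=0.5, mu=0.4):
    """L_enc = eta*L_ctc + (1-eta)*L_inter-ctc
       L_dec = gamma*L_dec-mask + (1-gamma)*L_dec-labels
       L_sys = mu*L_enc + (1-mu)*L_dec"""
    l_enc = eta * l_ctc + (1 - eta) * l_inter_ctc
    l_dec = gamma * l_dec_mask + (1 - gamma) * l_dec_labels
    return mu * l_enc + (1 - mu) * l_dec
```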
The overall loss function may be used to determine an error value for backpropagation and gradient descent to determine adjusted values for the trainable parameters of the encoder and decoder neural networks. It will be appreciated that the encoder loss value component of the overall loss function will be used in the determination of the adjusted values for the trainable parameters of the encoder neural network whilst the decoder loss value component of the overall loss function will be used for both the decoder and encoder neural networks given the structure of the encoder/decoder architecture of the speech processing system.
The token vocabulary may be generated from the training data set using BPE. In one example, the token vocabulary comprises 500 word-pieces generated using BPE from the SLURP dataset. The token vocabulary may also be expanded to include the utterance-level labels and the parameter type labels. For example, the token vocabulary may comprise 70 utterance-level labels and 56 parameter type labels from the SLURP dataset. The token vocabulary may also comprise further functional tokens such as the <MASK>, <CLS>, <EOS>, <SOS> and <BLANK> tokens. In one example, the token vocabulary comprises 682 tokens in total from the combination of word-pieces, labels and functional tokens.
Computer hardware

Fig. 12 is a schematic of the hardware that can be used to implement methods and systems in accordance with embodiments described herein. It should be noted that this is just one example and other arrangements can be used.
The hardware comprises a computing section 1200. In this particular example, the components of this section will be described together. However, it will be appreciated they are not necessarily co-located.
Components of the computing system 1200 may include, but are not limited to, a processing unit 1213 (such as a central processing unit, CPU), a system memory 1201, and a system bus 1211 that couples various system components including the system memory 1201 to the processing unit 1213. The system bus 1211 may be any of several types of bus structure including a memory bus or memory controller, a peripheral bus and a local bus using any of a variety of bus architectures. The computing section 1200 also includes external memory 1215 connected to the bus 1211.
The system memory 1201 includes computer storage media in the form of volatile and/or non-volatile memory such as read-only memory. A basic input output system (BIOS) 1203, containing the routines that help transfer information between the elements within the computer, such as during start-up, is typically stored in system memory 1201. In addition, the system memory contains the operating system 1205, application programs 1207 and program data 1209 that are in use by the CPU 1213.
An interface 1225 is also connected to the bus 1211. The interface may be a network interface for the computer system to receive information from further devices. The interface may also be a user interface that allows a user to respond to certain commands, and so on.
In this example, a video interface 1217 is provided. The video interface 1217 comprises a graphics processing unit 1219 which is connected to a graphics processing memory 1221.
Graphics processing unit (GPU) 1219 is particularly well suited to the training of the speech recognition system due to its adaptation to data parallel operations, such as neural network training. Therefore, in an embodiment, the processing for training the speech recognition system may be divided between CPU 1213 and GPU 1219.
It should be noted that in some embodiments different hardware may be used for training the speech recognition system and for performing speech recognition. For example, the training of the speech recognition system may occur on one or more local desktop or workstation computers or on devices of a cloud computing system, which may include one or more discrete desktop or workstation GPUs, one or more discrete desktop or workstation CPUs, e.g. processors having a PC-oriented architecture, and a substantial amount of volatile system memory, e.g. 16GB or more. The performance of speech recognition, on the other hand, may use mobile or embedded hardware, which may include a mobile GPU as part of a system on a chip (SoC), or no GPU; one or more mobile or embedded CPUs, e.g. processors having a mobile-oriented or microcontroller-oriented architecture; and a lesser amount of volatile memory, e.g. less than 1GB. For example, the hardware performing speech recognition may be a voice assistant system 120, such as a smart speaker, or a mobile phone including a virtual assistant. The hardware used for training the speech recognition system may have significantly more computational power, e.g. be able to perform more operations per second and have more memory, than the hardware used for performing tasks using the agent. Using hardware having lesser resources is possible because performing speech recognition, e.g. by performing inference using one or more neural networks, is substantially less computationally resource intensive than training the speech recognition system, e.g. by training one or more neural networks. Furthermore, techniques can be employed to reduce the computational resources used for performing speech recognition, e.g. for performing inference using one or more neural networks. Examples of such techniques include model distillation and, for neural networks, neural network compression techniques such as pruning and quantization.
Whilst certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel devices, and methods described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the devices, methods and products described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.
Claims (19)
- CLAIMS: 1. A computer-implemented method for speech processing, the method comprising: receiving an audio signal capturing an utterance spoken by a user; processing, by an encoder comprising a non-autoregressive encoder neural network, the audio signal to generate an embedding of the audio signal and an initial transcription of the utterance, wherein the initial transcription comprises one or more tokens and each of the one or more tokens is associated with a confidence score of being the correct token for that position in the transcription; modifying the initial transcription to generate a masked token sequence, wherein modifying the initial transcription comprises masking the one or more tokens in the initial transcription having a confidence score below a threshold confidence level; processing, by an output decoder comprising a non-autoregressive, bi-directional decoder neural network, the embedding of the audio signal and the masked token sequence to generate a speech processing output comprising: a predicted token for each of the masked tokens in the masked token sequence to generate an output transcription of the utterance; a label indicating a classification of the utterance; and one or more labels indicating a parameter type associated with respective words in the output transcription, the parameter types associated with the classification of the utterance.
- 2. The method of claim 1, wherein the decoder neural network of the output decoder comprises: a first output head for generating the predicted token for each of the masked tokens in the masked token sequence; a second output head for generating the one or more labels indicating a parameter type associated with the respective words in the output transcription; and wherein the first output head and second output heads are distinct output heads.
- 3. The method of claim 1 or 2, wherein generating an output transcription of the utterance comprises iteratively refining the token predictions for the masked tokens in the masked token sequence; the iterative refinement comprising: generating a predicted token for each of the masked tokens in the masked token sequence; wherein each predicted token is associated with a confidence score of being the correct token; for each predicted token, replacing the corresponding masked token in the masked token sequence with the predicted token when the confidence score is above a threshold confidence level, otherwise maintaining the masked token in the masked token sequence.
- 4. The method of any preceding claim, wherein processing by the encoder comprises, at one or more intermediate layers of the encoder neural network: generating an intermediate layer embedding of the audio signal and an intermediate layer transcription of the utterance, the intermediate layer transcription comprising one or more tokens and each of the one or more tokens is associated with a confidence score of being the correct token for that position in the intermediate layer transcription; modifying the intermediate layer transcription to generate an intermediate layer masked token sequence, wherein modifying the intermediate layer transcription comprises masking the one or more tokens in the intermediate layer transcription having a confidence score below an intermediate layer threshold confidence level; processing, by an intermediate layer decoder comprising a non-autoregressive, bi-directional decoder neural network, the intermediate layer embedding of the audio signal and the intermediate layer masked token sequence to generate an intermediate layer decoding output; combining the intermediate layer embedding of the audio signal and the intermediate layer decoding output to provide an input for one or more subsequent neural network layers of the encoder neural network following the respective intermediate layer of the encoder neural network; and processing, by the one or more subsequent neural network layers, the input to generate the embedding of the audio signal and the initial transcription of the utterance.
- 5. The method of claim 4, wherein the intermediate layer decoding output comprises: a masked token probability distribution comprising a probability distribution over candidate tokens for predicting a token for each of the masked tokens in the intermediate layer masked token sequence; an utterance classification probability distribution comprising a probability distribution over labels indicating a classification of the utterance; and a parameter type probability distribution comprising a probability distribution over labels indicating a parameter type for the respective words in the intermediate layer transcription.
- 6. The method of claim 5, wherein combining the intermediate layer embedding of the audio signal and the intermediate layer decoding output comprises: combining the masked token probability distribution, the utterance classification probability distribution and the parameter type probability distribution; performing a linear projection of the combined probability distributions such that the dimensionality matches the intermediate layer embedding of the audio signal; and summing the projected combined probability distribution and the intermediate layer embedding of the audio signal.
- 7. The method of claim 6, wherein the audio signal comprises a plurality of audio segments and wherein generating an intermediate layer transcription of the utterance comprises generating a probability distribution over candidate tokens for transcribing each audio segment and generating the intermediate layer transcription based upon the probability distributions for each audio segment; and wherein combining the intermediate layer embedding of the audio signal and the intermediate layer decoding output comprises: determining an audio segment position corresponding to each token in the intermediate layer transcription; mapping the masked token probability distribution and the parameter type probability distribution to audio segment positions based upon the corresponding tokens in the intermediate layer transcription; associating the utterance classification probability distribution with each of the audio segment positions in the mapping; and wherein combining the masked token probability distribution, the utterance classification probability distribution and parameter type probability distribution comprises: combining the probability distribution over candidate tokens, the masked token probability distribution, the utterance classification probability distribution and the parameter type probability distribution having corresponding audio segment positions according to the mapping.
- 8. The method of any preceding claim, wherein the initial transcription and/or the intermediate layer transcription of the utterance is generated based upon a CTC algorithm.
- 9. The method of any preceding claim, further comprising: adding a pre-fix token to the start of the masked token sequence prior to processing by the output decoder for generating the label indicating a classification of the utterance; and/or adding a pre-fix token to the start of the intermediate layer masked token sequence prior to processing by the intermediate layer decoder for generating utterance classification probability distribution.
- 10. The method of any preceding claim, wherein the decoder neural network of the output decoder and/or intermediate layer decoder is based upon a Transformer architecture.
- 11. The method of any preceding claim, wherein the encoder neural network is based upon a Conformer architecture.
- 12. The method of any preceding claim, wherein the classification of the utterance comprises an indication of an action to be performed in response to the utterance. 25
- 13. A computer-implemented method of training a speech processing system, the speech processing system comprising: an encoder comprising a non-autoregressive encoder neural network having a plurality of trainable parameters; an output decoder comprising a non-autoregressive, bi-directional decoder neural network having a plurality of trainable parameters; the method comprising: receiving an audio signal capturing an utterance spoken by a user for training the speech processing system; processing, by the encoder, the audio signal to generate an embedding of the audio signal and a conditional probability distribution over candidate transcriptions of the utterance given the input audio signal; receiving a ground-truth transcription of the utterance, wherein the ground-truth transcription comprises a sequence of one or more tokens; computing an encoder loss value based upon the conditional probability distribution over candidate transcriptions of the utterance and the ground-truth transcription of the utterance; modifying the ground-truth transcription to generate a masked token sequence, wherein modifying the ground-truth transcription comprises masking one or more tokens in the ground-truth transcription; processing, by the output decoder, the embedding of the audio signal and the masked token sequence to generate a speech processing output comprising: a probability distribution over candidate tokens for predicting a token for each of the masked tokens in the masked token sequence; a probability distribution over labels indicating a classification of the utterance; and a probability distribution over labels indicating a parameter type associated with respective words in an output transcription, the parameter types associated with the classification of the utterance; receiving ground-truth labels indicating the classification of the utterance and the parameter types respectively; computing a decoder loss value based upon the speech processing output and the ground-truth transcription and labels; adjusting the values of the plurality of trainable parameters of the decoder neural network of the output decoder and encoder neural network based upon the decoder loss value; adjusting the values of the plurality of trainable parameters of the encoder neural network based upon the encoder loss value.
- 14. The method of claim 13, wherein processing by the encoder comprises, at one or more intermediate layers of the encoder neural network: generating an intermediate layer embedding of the audio signal and an intermediate layer conditional probability distribution over candidate transcriptions of the utterance given the output of the layer preceding the intermediate layer; computing an intermediate layer loss value based upon the intermediate layer conditional probability distribution and the ground-truth transcription; and wherein the encoder loss value is further based upon the intermediate layer loss value; generating an intermediate layer transcription of the utterance based upon the intermediate layer conditional probability distribution, wherein the intermediate layer transcription comprises a sequence of one or more tokens; modifying the intermediate layer transcription to generate an intermediate layer masked token sequence, wherein modifying the intermediate layer transcription comprises masking the one or more tokens in the intermediate layer transcription having a confidence score below an intermediate layer threshold confidence level; processing, by an intermediate layer decoder comprising a non-autoregressive, bi-directional decoder neural network, the intermediate embedding of the audio signal and the intermediate layer masked token sequence to generate an intermediate layer decoding output; combining the intermediate layer embedding of the audio signal and the intermediate layer decoding output to provide an input for one or more subsequent neural network layers of the encoder neural network following the respective intermediate layer of the encoder neural network; and processing, by the one or more subsequent neural network layers, the input to generate the embedding of the audio signal and the conditional probability distribution over candidate transcriptions of the utterance given the input audio signal.
- 15. The method of claim 13 or 14, wherein the encoder loss value and/or the intermediate layer loss value is computed based upon a CTC loss function.
- 16. The method of any of claims 13 to 15, wherein the decoder loss value is based upon a negative log likelihood of the ground-truth masked tokens obtained from the probability distribution over candidate tokens for predicting a token for each of the masked tokens in the masked token sequence. [More precisely, the probability distribution is over all potential tokens in the vocabulary, conditioned on the observed tokens input to the decoder and on the embedding.]
- 17. The method of any of claims 13 to 16, wherein the decoder loss value is based upon a classification loss value determined based upon a negative log likelihood of the ground-truth labels obtained from the probability distributions over the labels indicating a classification of the utterance and over the labels indicating a parameter type respectively.
- 18. A system comprising: a memory storing processor readable instructions; a processor arranged to read and execute instructions in said memory; wherein said processor readable instructions comprise instructions arranged to control the system to carry out a method according to any preceding claim.
- 19. A non-transitory computer readable medium carrying computer readable instructions configured to cause a computer to carry out a method according to any one of claims 1 to 17.
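A minimal sketch of the combination described in claim 7: the token-level decoder distributions are scattered back onto the audio segment positions from which the tokens were emitted, and the utterance-level distribution is attached to every mapped position. The tensor shapes, the equal-weight fusion and all variable names below are assumptions rather than the claimed implementation.

```python
import torch

def combine_intermediate_outputs(
    ctc_frame_probs,        # (T, V): per-segment distribution over candidate tokens
    token_segment_index,    # length-N list: segment position of each emitted token
    masked_token_probs,     # (N, V): decoder distribution per token position
    param_type_probs,       # (N, P): parameter-type distribution per token position
    utt_class_probs,        # (C,):   utterance-classification distribution
):
    T, V = ctc_frame_probs.shape
    P, C = param_type_probs.shape[1], utt_class_probs.shape[0]

    fused_tokens = ctc_frame_probs.clone()
    extra = torch.zeros(T, P + C)

    for tok, seg in enumerate(token_segment_index):
        # Combine segment-level and token-level distributions at the segment the
        # token was emitted from (equal weighting assumed for illustration).
        fused_tokens[seg] = 0.5 * (fused_tokens[seg] + masked_token_probs[tok])
        # Attach the parameter-type and (broadcast) utterance-class distributions.
        extra[seg] = torch.cat([param_type_probs[tok], utt_class_probs])

    # The concatenated features would feed the encoder layers that follow the
    # intermediate layer.
    return torch.cat([fused_tokens, extra], dim=-1)
```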
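The transcriptions referenced in claim 8 can be read off CTC-style per-segment distributions. A minimal greedy read-out (best token per segment, collapse repeats, drop blanks) is sketched below; the blank id and array layout are assumptions.

```python
import numpy as np

def ctc_greedy_decode(frame_log_probs: np.ndarray, blank_id: int = 0) -> list:
    """Pick the best token per audio segment, collapse repeats, then drop blanks."""
    best = frame_log_probs.argmax(axis=-1)                   # (T,) best token id per segment
    collapsed = [int(best[0])] + [int(t) for prev, t in zip(best[:-1], best[1:]) if t != prev]
    return [t for t in collapsed if t != blank_id]

# Example: segment-wise argmax ids [0, 5, 5, 0, 7, 7, 0] decode to the token sequence [5, 7].
```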
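The prefix token of claim 9 plays a role similar to a [CLS] token: it reserves a decoder position whose output can be read as the utterance classification. A sketch of building the decoder input, with the token ids and confidence threshold as assumed values:

```python
# Token ids and the confidence threshold are illustrative assumptions.
CLS_ID, MASK_ID = 1, 2

def build_decoder_input(token_ids, confidences, threshold=0.9):
    # Mask low-confidence tokens (cf. claim 14) and prepend the prefix token (claim 9);
    # the decoder output at position 0 is then read as the utterance classification.
    masked = [tid if conf >= threshold else MASK_ID
              for tid, conf in zip(token_ids, confidences)]
    return [CLS_ID] + masked

# The second token is uncertain, so the decoder is asked to re-predict it:
# build_decoder_input([17, 42, 99], [0.98, 0.35, 0.95]) -> [1, 17, 2, 99]
```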
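For claim 10, a non-autoregressive, bi-directional decoder can be sketched with a standard Transformer decoder in which no causal target mask is applied, so every token position attends to every other and all masked positions are filled in one parallel pass. The layer sizes below are arbitrary illustrations, not the claimed model.

```python
import torch
import torch.nn as nn

d_model, nhead, vocab = 256, 4, 1000   # sizes chosen arbitrarily for illustration

embed = nn.Embedding(vocab, d_model)
layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=2)
lm_head = nn.Linear(d_model, vocab)

tokens = torch.randint(0, vocab, (1, 12))       # masked token sequence (batch, N)
audio_embedding = torch.randn(1, 80, d_model)   # encoder embedding (batch, T, d_model)

# No tgt_mask is supplied, so self-attention over the token sequence is
# bi-directional and every masked position is predicted in parallel
# (non-autoregressive), while cross-attention attends to the audio embedding.
hidden = decoder(embed(tokens), memory=audio_embedding)
token_logits = lm_head(hidden)                  # (1, 12, vocab)
```

The encoder of claim 11 would sit upstream of this, producing audio_embedding; a convolution-augmented Transformer (Conformer) stack, for instance the Conformer module provided by torchaudio, is one off-the-shelf way to prototype that part.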
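The encoder-side training signal of claims 13 and 15 is a CTC objective between the frame-wise distributions and the ground-truth token sequence. A hedged sketch using PyTorch's CTC loss, with the shapes and the blank index assumed:

```python
import torch
import torch.nn.functional as F

T, B, V = 120, 1, 1001                                  # frames, batch, vocabulary incl. blank
log_probs = torch.randn(T, B, V).log_softmax(dim=-1)    # encoder's frame-wise log-probabilities

target = torch.tensor([[23, 7, 512, 88]])               # ground-truth transcription token ids
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.tensor([4])

# CTC marginalises over every alignment of the target tokens to the T frames.
encoder_loss = F.ctc_loss(log_probs, target, input_lengths, target_lengths, blank=0)
```

An intermediate layer loss of the same form, computed from an intermediate layer's distributions as in claims 14 and 15, could simply be added to this value before back-propagation.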
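Claims 16 and 17 describe the decoder loss as negative log likelihoods: one evaluated only at the masked token positions, and one over the utterance-classification and parameter-type labels. A minimal sketch, with all shapes and the equal weighting assumed:

```python
import torch
import torch.nn.functional as F

N, V, C, P = 12, 1000, 20, 30          # token positions, vocab, utterance classes, parameter types
token_logits = torch.randn(1, N, V)    # decoder head over candidate tokens
class_logits = torch.randn(1, C)       # utterance-classification head
slot_logits = torch.randn(1, N, P)     # parameter-type head per token position

gt_tokens = torch.randint(0, V, (1, N))
gt_class = torch.tensor([3])
gt_slots = torch.randint(0, P, (1, N))
masked = torch.zeros(1, N, dtype=torch.bool)
masked[0, [2, 5, 9]] = True            # positions that were replaced by the mask token

# Negative log likelihood of the ground-truth tokens at the masked positions only (claim 16).
token_loss = F.cross_entropy(token_logits[masked], gt_tokens[masked])

# Negative log likelihoods of the utterance-class and parameter-type labels (claim 17).
class_loss = F.cross_entropy(class_logits, gt_class)
slot_loss = F.cross_entropy(slot_logits.reshape(-1, P), gt_slots.reshape(-1))

decoder_loss = token_loss + class_loss + slot_loss   # equal weighting is an assumption
```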
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB2300276.9A GB2626038B (en) | 2023-01-09 | 2023-01-09 | Systems and Methods for Speech Processing |
| JP2023208671A JP7673166B2 (en) | 2023-01-09 | 2023-12-11 | Systems and methods for audio processing |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB2300276.9A GB2626038B (en) | 2023-01-09 | 2023-01-09 | Systems and Methods for Speech Processing |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| GB2626038A true GB2626038A (en) | 2024-07-10 |
| GB2626038B GB2626038B (en) | 2025-09-24 |
Family
ID=91472160
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| GB2300276.9A Active GB2626038B (en) | 2023-01-09 | 2023-01-09 | Systems and Methods for Speech Processing |
Country Status (2)
| Country | Link |
|---|---|
| JP (1) | JP7673166B2 (en) |
| GB (1) | GB2626038B (en) |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113593574A (en) * | 2021-08-25 | 2021-11-02 | 广州虎牙科技有限公司 | Voice recognition method, computer program product and electronic equipment |
Family Cites Families (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPH0252400A (en) * | 1988-08-15 | 1990-02-21 | Oki Electric Ind Co Ltd | Speech recognition device |
| JP5688677B2 (en) * | 2010-10-04 | 2015-03-25 | 日本電気株式会社 | Voice input support device |
| JP7210938B2 (en) * | 2018-08-29 | 2023-01-24 | 富士通株式会社 | Text generation device, text generation program and text generation method |
| KR102868990B1 (en) * | 2018-11-14 | 2025-10-10 | 삼성전자주식회사 | A decoding method in an artificial neural network and an apparatus thereof |
| KR102413616B1 (en) * | 2019-07-09 | 2022-06-27 | 구글 엘엘씨 | On-device speech synthesis of text segments for training on-device speech recognition models |
| JP2021039216A (en) * | 2019-09-02 | 2021-03-11 | 日本電信電話株式会社 | Speech recognition device, speech recognition method and speech recognition program |
| US11521595B2 (en) * | 2020-05-01 | 2022-12-06 | Google Llc | End-to-end multi-talker overlapping speech recognition |
| KR20230088455A (en) * | 2020-10-20 | 2023-06-19 | 구글 엘엘씨 | High-speed emission low-latency streaming ASR with sequence-level emission normalization |
| US20220261631A1 (en) * | 2021-02-12 | 2022-08-18 | Nvidia Corporation | Pipelines for efficient training and deployment of machine learning models |
2023
- 2023-01-09: GB GB2300276.9A, patent GB2626038B (en), status: Active
- 2023-12-11: JP JP2023208671A, patent JP7673166B2 (en), status: Active
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113593574A (en) * | 2021-08-25 | 2021-11-02 | 广州虎牙科技有限公司 | Voice recognition method, computer program product and electronic equipment |
Non-Patent Citations (1)
| Title |
|---|
| Higuchi et al, 2020. "Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict". Available at https://arxiv.org/pdf/2005.08700.pdf [Accessed 10 July 2023] * |
Also Published As
| Publication number | Publication date |
|---|---|
| JP7673166B2 (en) | 2025-05-08 |
| GB2626038B (en) | 2025-09-24 |
| JP2024098143A (en) | 2024-07-22 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| JP7575640B1 (en) | Large-scale language model data selection for rare word speech recognition | |
| US12002450B2 (en) | Speech recognition systems and methods | |
| KR20230147685A (en) | Word-level reliability learning for subword end-to-end automatic speech recognition | |
| WO2020231522A1 (en) | Using context information with end-to-end models for speech recognition | |
| US12051404B2 (en) | Efficient streaming non-recurrent on-device end-to-end model | |
| JP2023545988A (en) | Transformer transducer: One model that combines streaming and non-streaming speech recognition | |
| US12334059B2 (en) | Contrastive Siamese network for semi-supervised speech recognition | |
| JP7544989B2 (en) | Lookup Table Recurrent Language Models | |
| US20240371379A1 (en) | Reducing Streaming ASR Model Delay With Self Alignment | |
| EP4405936A1 (en) | Joint unsupervised and supervised training for multilingual automatic speech recognition | |
| WO2024108071A1 (en) | End-to-end segmentation in a two-pass cascaded encoder automatic speech recognition model | |
| KR20250028493A (en) | Training automatic speech recognition models using aligned text and speech representations without transcribed speech data | |
| US20230343332A1 (en) | Joint Segmenting and Automatic Speech Recognition | |
| WO2024129324A1 (en) | Training a language model of an end-to-end automatic speech recognition model using random encoder features | |
| WO2025054135A1 (en) | Scaling multilingual speech synthesis with zero supervision of found data | |
| US20240135923A1 (en) | Universal Monolingual Output Layer for Multilingual Speech Recognition | |
| US20240153495A1 (en) | Multi-Output Decoders for Multi-Task Learning of ASR and Auxiliary Tasks | |
| US20230326461A1 (en) | Unified Cascaded Encoder ASR model for Dynamic Model Sizes | |
| JP7673166B2 (en) | 2025-05-08 | Systems and methods for audio processing | |
| JP2024536388A (en) | A study of streaming RNN transducers with non-autoregressive decoding | |
| US20250078813A1 (en) | Zero-Shot Task Expansion of ASR Models Using Task Vectors | |
| WO2025183882A1 (en) | Using synthetic data to improve word error rate of differentially private asr models | |
| WO2025193634A1 (en) | Quantization aware training for universal speech models with recurrent neural network-transducer decoders |