WO2021029642A1 - System and method for recognizing a user's voice - Google Patents
- Publication number
- WO2021029642A1 (PCT/KR2020/010565)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- output value
- server
- encoder
- domain
- encoder output
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G10L15/30—Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/008—Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/08—Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
- G10L19/09—Long term prediction, i.e. removing periodical redundancies, e.g. by using adaptive codebook or pitch predictor
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/226—Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
Definitions
- the disclosure relates to a system and method for recognizing a user's speech, and for example, to a system and method by which a device and a server interwork together to recognize a user's speech.
- a device receives human speech as input, recognizes the input speech, and converts it into text using a pre-trained speech recognition model in the device.
- the text is provided as a final output.
- Embodiments of the disclosure provide a system and method for recognizing a user's speech by providing a server with an output value of an encoder of an end-to-end automatic speech recognition (ASR) model in a device.
- Embodiments of the disclosure also provide a system and method for recognizing a user's speech using a decoder corresponding to a domain related to an output value of an encoder of an end-to-end ASR model in a device.
- Embodiments of the disclosure also provide a system and method capable of more accurately recognizing a user's speech by performing encoding for an end-to-end ASR model at a device while performing decoding for an end-to-end ASR model at a server.
- FIG. 1 is a diagram illustrating an example automatic speech recognition system according to an embodiment of the disclosure
- FIG. 2 is a diagram illustrating an example speech recognition system including decoders associated with a plurality of domains, according to an embodiment of the disclosure
- FIG. 3 is a flowchart illustrating an example method, performed by a device and a server in a speech recognition system, of recognizing a speech input and obtaining a text string, according to an embodiment of the disclosure
- FIG. 4 is a flowchart illustrating an example method, performed by a device, of transmitting output values obtained from a plurality of layers in an encoder to a server, according to an embodiment of the disclosure
- FIG. 5 is a flowchart illustrating an example method, performed by a server, of inputting an encoder output value to a selected decoder, according to an embodiment of the disclosure
- FIG. 6A is a diagram illustrating an example in which a server selects one decoder corresponding to a particular domain to process an encoder output value, according to an embodiment of the disclosure
- FIG. 6B is a diagram illustrating an example in which a server selects a plurality of decoders corresponding to a particular domain to process an encoder output value, according to an embodiment of the disclosure
- FIG. 6C is a diagram illustrating an example in which a server selects a plurality of decoders corresponding to a plurality of domains to process an encoder output value, according to an embodiment of the disclosure
- FIG. 6D is a diagram illustrating an example in which a server selects, based on a decoder of the same type as an encoder of a device not being in the server, a different type of decoder to process an encoder output value, according to an embodiment of the disclosure;
- FIG. 7A is a diagram illustrating an example in which a device and a server obtain a text string from a speech signal based on the device and the server respectively including attention-based automatic speech recognition (ASR) models according to an embodiment of the disclosure;
- FIG. 7B is a diagram illustrating an example in which a device and a server obtain a text string from a speech signal based on the device and the server respectively including recurrent neural network transducer (RNN-T) based ASR models according to an embodiment of the disclosure;
- FIG. 8A is a diagram illustrating an example in which a device and a server obtain a text string from a speech signal based on encoders of attention-based ASR models not being included in the server, according to an embodiment of the disclosure;
- FIG. 8B is a diagram illustrating an example in which a device and a server obtain a text string from a speech signal based on encoders of RNN-T based ASR models not being included in the server, according to an embodiment of the disclosure;
- FIG. 9 is a flowchart illustrating an example method, performed by a device and a server, of performing speech recognition and natural language understanding (NLU) processing on a speech input, according to an embodiment of the disclosure
- FIG. 10 is a flowchart illustrating an example method, performed by a device and a server, of performing speech recognition and NLU processing on a speech input, according to an embodiment of the disclosure
- FIG. 11 is a flowchart illustrating an example method, performed by a device and a server, of performing speech recognition and NLU processing on a speech input, according to an embodiment of the disclosure
- FIG. 12 is a flowchart illustrating an example method, performed by a device and a server, of performing speech recognition and NLU processing on a speech input, according to an embodiment of the disclosure
- FIG. 13 is a block diagram illustrating an example server according to an embodiment of the disclosure.
- FIG. 14 is a block diagram illustrating an example configuration of an example device according to an embodiment of the disclosure.
- a method, performed by a server, of providing a text string for a speech signal input to a device includes: receiving, from the device, an encoder output value derived from an encoder of an end-to-end automatic speech recognition (ASR) model in the device; identifying a domain corresponding to the received encoder output value; selecting a decoder corresponding to the identified domain from among a plurality of decoders of an end-to-end ASR model included in the server; obtaining a text string from the received encoder output value using the selected decoder; and providing the obtained text string to the device, wherein the encoder output value is derived by the device by encoding the speech signal input to the device.
- a server for providing a text string for a speech signal input to a device includes: a communication interface comprising communication circuitry; a storage storing a program including one or more instructions; and a processor configured to execute the one or more instructions of the program stored in the storage to: receive, from the device, an encoder output value derived from an encoder of an end-to-end automatic speech recognition (ASR) model in the device; identify a domain corresponding to the received encoder output value; select a decoder corresponding to the identified domain from among a plurality of decoders of an end-to-end ASR model included in the server; obtain a text string from the received encoder output value using the selected decoder; and provide the obtained text string to the device, wherein the encoder output value is derived by the device by encoding the speech signal input to the device.
- the expression "at least one of a, b or c" indicates only a, only b, only c, both a and b, both a and c, both b and c, all of a, b, and c, or variations thereof.
- FIG. 1 is a diagram illustrating an example speech recognition system according to an embodiment of the disclosure.
- the speech recognition system includes a device 1000 and a server 2000.
- the device 1000 may include an encoder of an end-to-end automatic speech recognition (ASR) model
- the server 2000 may include a decoder of an end-to-end ASR model.
- the device 1000 may perform an operation of encoding a user's speech input using the encoder of the end-to-end ASR model in order to recognize the user's speech input
- the server 2000 may perform an operation of decoding the user's speech input encoded by the device 1000 using the decoder of the end-to-end ASR model. In this manner, the end-to-end ASR model may be executed jointly by the device 1000 and the server 2000.
- the end-to-end ASR model may refer, for example, to a speech recognition model that recognizes a text string from speech using an integrated neural network, and may have a structure including an integrated neural network without separately using an acoustic model, a pronunciation dictionary, and a language model. Using the integrated neural network, the end-to-end ASR model may convert speech into text without having to perform a process of recognizing phonemes in the speech and converting the recognized phonemes into text.
- the end-to-end ASR model may, for example, have a structure with a recurrent neural network (RNN) and include an encoder for encoding speech input and a decoder for estimating a text string from an encoder output value.
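The device/server encoder-decoder split described above can be illustrated with a toy sketch. Everything here is invented for illustration (the class names, the fixed hidden size, and the stand-in encode/decode rules); a real end-to-end ASR model would use trained recurrent networks rather than these hand-written stand-ins.

```python
from typing import List

class ToyEncoder:
    """Device-side stand-in: maps a feature-vector sequence to a hidden-vector sequence."""
    def __init__(self, hidden_size: int = 4):
        self.hidden_size = hidden_size

    def encode(self, features: List[List[float]]) -> List[List[float]]:
        # A real encoder would be stacked recurrent layers; here each frame is
        # collapsed to its mean and repeated to a fixed hidden size.
        return [[sum(frame) / len(frame)] * self.hidden_size for frame in features]

class ToyDecoder:
    """Server-side stand-in: estimates a text string from the encoder output value."""
    def decode(self, hidden_seq: List[List[float]]) -> str:
        # Stand-in for attention / RNN-T decoding: one symbol per hidden vector.
        return "".join("a" if h[0] > 0 else "b" for h in hidden_seq)

# The device encodes locally; only the hidden-vector sequence crosses the network.
encoder, decoder = ToyEncoder(), ToyDecoder()
encoder_output = encoder.encode([[0.5, 0.7], [-0.2, -0.4]])
text = decoder.decode(encoder_output)  # performed by the server in this design
```

The point of the split is that the raw speech never leaves the device; only the encoder output value (the hidden-vector sequence) is transmitted.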
- the device 1000 may receive the user's speech input, encode the received user's speech input using the encoder of the end-to-end ASR model, and provide an encoder output value to the server 2000. Furthermore, the server 2000 may receive the encoder output value from the device 1000 and decode the received encoder output value using the decoder of the end-to-end ASR model. The server 2000 may obtain a speech recognition result by decoding the encoder output value and provide the obtained speech recognition result to the device 1000.
- Examples of the device 1000 may include, but are not limited to, a smartphone, a tablet PC, a PC, a smart TV, a mobile phone, a personal digital assistant (PDA), a laptop, a media player, a micro server, a global positioning system (GPS), an electronic book terminal, a digital broadcasting terminal, a navigation device, a kiosk, an MP3 player, a digital camera, and other mobile or non-mobile computing devices, or the like.
- the device 1000 may be a wearable device such as a watch, glasses, a hair band, and a ring, having a communication function and a data processing function, or the like.
- the device 1000 may include any apparatus capable of exchanging data with the server 2000 via a network.
- the network may include, for example, and without limitation, a local area network (LAN), a wide area network (WAN), a value added network (VAN), a mobile radio communication network, a satellite communication network, or the like, or any combination thereof.
- the network may include a data communication network, in a comprehensive sense, configured to enable smooth communication across network entities shown in FIG. 1 and includes, for example, and without limitation, a wired Internet, a wireless Internet, a mobile wireless communication network, or the like.
- FIG. 2 is a diagram illustrating an example speech recognition system including decoders associated with a plurality of domains, according to an embodiment of the disclosure.
- the device 1000 may include an encoder of an end-to-end ASR system
- the server 2000 may include decoders of the end-to-end ASR system.
- the server 2000 may include a plurality of first decoders corresponding to a first domain and a plurality of second decoders corresponding to a second domain.
- the device 1000 may extract features from a speech input to obtain a feature vector and then input the obtained feature vector to the encoder of the end-to-end ASR system.
- the device 1000 may provide an encoder output value to the server 2000.
- the encoder output value may, for example, be in the form of a sequence, e.g., a hidden layer vector sequence that is an output value of a neural network layer in the encoder.
- the server 2000 may receive an encoder output value from the device 1000 and select a domain related to the encoder output value based on the encoder output value.
- the domain may, for example, represent a field related to an input speech and may be preset according to the semantic meaning, attributes, etc. of the input speech. For example, the domain may be classified according to a service related to the input speech.
- an end-to-end ASR model may be trained for each domain, and in this case, the end-to-end ASR model trained for each domain may include, for example, a model trained using an input speech related to the domain and its corresponding text as the correct answer.
- the server 2000 may select at least one of a plurality of preset domains and then at least one of decoders corresponding to the selected domain.
- the server 2000 may select a decoder corresponding to the encoder in the device 1000. Furthermore, the server 2000 may obtain a decoder output value by inputting the encoder output value to the selected decoder. The server 2000 may obtain a text string that is a result of recognizing the speech input using a decoder output value and provide the obtained text string to the device 1000.
- FIG. 3 is a flowchart illustrating an example method, performed by the device 1000 and the server 2000 in a speech recognition system, of recognizing a speech input and obtaining a text string, according to an embodiment of the disclosure.
- the device 1000 may perform various example operations illustrated in FIG. 3 by executing instructions stored in a memory of the device 1000.
- the device 1000 may perform its operations illustrated in FIG. 3 by executing at least one of a speech recognition evaluation module 1430, a natural language understanding (NLU) determination module 1440, a domain identification module 1450, a domain registration module 1460, an ASR model 1410, an NLU model 1420, or the like, as described below with reference to FIG. 14.
- the device 1000 may execute other programs stored in the memory in order to perform certain operations.
- the server 2000 may perform its operations illustrated in FIG. 3 by executing instructions stored in a storage of the server 2000.
- the server 2000 may perform its operations illustrated in FIG. 3 by executing at least one of a speech recognition management module 2310, an ASR module 2320, an NLU module 2330, or a speech interpretation management module 2340, as described below with reference to FIG. 13.
- embodiments of the disclosure are not limited thereto, and the server 2000 may execute other programs stored in the memory in order to perform certain operations.
- the device 1000 may generate a feature vector for a speech signal (operation S300).
- the device 1000 may receive a user's speech input (e.g., a user utterance) via a microphone and generate, based on the speech signal obtained via the microphone, a feature vector representing features of the speech signal.
- the device 1000 may remove noise from the speech signal and obtain a feature vector based on the speech signal from which the noise has been removed.
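Operation S300 can be sketched with a toy feature extractor. Real ASR front ends typically compute log-mel filterbank features; the per-frame energy used here, and the function name `frame_features`, are illustrative assumptions only.

```python
def frame_features(samples, frame_size=4):
    """Split a waveform into frames and compute one energy value per frame."""
    frames = [samples[i:i + frame_size] for i in range(0, len(samples), frame_size)]
    return [[sum(x * x for x in f) / len(f)] for f in frames]

# Eight samples -> two frames -> a sequence of two 1-dimensional feature vectors.
feat = frame_features([0.1, -0.1, 0.2, -0.2, 0.3, -0.3, 0.4, -0.4])
```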
- the device 1000 may encode the feature vector using an encoder (operation S305).
- the device 1000 may input a feature vector to an encoder of an end-to-end ASR model in the device 1000 to recognize a user's speech.
- the device 1000 may convert the feature vector into a format suitable for the encoder of the end-to-end ASR model.
- the device 1000 may encode the feature vector using an encoder 1411 in the ASR model 1410 of FIG. 14.
- the device 1000 may obtain a confidence level of an encoder output value (operation S310).
- An encoder of an end-to-end ASR model in the device 1000 may include a plurality of layers, e.g., a plurality of stacked long short-term memory (LSTM) layers.
- the encoder output value may be one of the output values obtained from the plurality of layers in the encoder.
- the encoder output value may be a hidden vector output from a layer in the encoder.
- the confidence level of the encoder output value may include, for example, a numerical value indicating a degree of matching between the encoder output value and the input speech, such as a confidence score, but is not limited thereto.
- the confidence level of the encoder output value may indicate a degree of matching between text represented by the encoder output value and the input speech.
- in order to obtain text from the encoder output value, a projection layer trained using a connectionist temporal classification (CTC) loss function may, for example, be connected to an output terminal of the encoder.
- the device 1000 may calculate a confidence level of an encoder output value based on the text obtained from the projection layer connected to the output terminal of the encoder.
- a method of calculating a confidence level is not limited thereto, and for example, the device 1000 may calculate a confidence level directly from an encoder output value using a preset algorithm.
- the device 1000 may obtain a confidence level of an encoder output value using the speech recognition evaluation module (e.g., 1430 of FIG. 14).
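One way to turn projection-layer outputs into a confidence level is sketched below. The averaging-of-max-posteriors rule is an illustrative stand-in, not the patent's exact formula, and the logit values are made up; a real projection layer would be trained with the CTC loss mentioned above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one frame's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def encoder_confidence(frame_logits):
    """Average per-frame maximum posterior, used as a crude confidence score."""
    posteriors = [softmax(frame) for frame in frame_logits]
    return sum(max(p) for p in posteriors) / len(posteriors)

# Two frames of made-up projection-layer logits over a 3-symbol vocabulary.
conf = encoder_confidence([[2.0, 0.1, 0.1], [1.5, 1.4, 0.2]])
```

A sharply peaked per-frame distribution (the model is sure of each symbol) drives the score toward 1; flat distributions drive it toward 1/vocabulary-size.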
- the device 1000 may determine whether to transmit the encoder output value to the server 2000 (operation S315).
- the device 1000 may determine whether to transmit the encoder output value to the server 2000 by comparing the confidence level of the encoder output value with a preset threshold. When the confidence level of the encoder output value is greater than or equal to the preset threshold, the device 1000 may determine to transmit the encoder output value to the server 2000. On the other hand, when the confidence level of the encoder output value is less than the preset threshold, the device 1000 may determine not to transmit the encoder output value to the server 2000.
- the device 1000 may determine whether to use an output value of an ASR model in the device 1000 based on the output value of the ASR model. In this case, the device 1000 may determine whether to transmit an encoder output value to the server 2000 based on a confidence level of the output value of the ASR model.
- the output value of the ASR model in the device 1000 may be at least one text string obtained when decoding an encoder output value from an encoder of the ASR model in the device 1000 using a decoder of the ASR model in the device 1000.
- a confidence level of the output value of the ASR model may be a numerical value indicating the degree to which a text string output from the ASR model matches the input speech.
- a confidence score indicating the degree to which each text string matches the input speech may be calculated for each text string output from the ASR model.
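The threshold comparison in operation S315 amounts to a simple routing rule, sketched below. The threshold value and function name are invented for illustration, not taken from the patent.

```python
PRESET_THRESHOLD = 0.6  # illustrative value, not from the patent

def route_encoder_output(confidence: float, threshold: float = PRESET_THRESHOLD) -> str:
    """Decide where decoding should happen for this utterance."""
    return "server" if confidence >= threshold else "device"

high_route = route_encoder_output(0.82)  # confident enough to send to the server
low_route = route_encoder_output(0.41)   # kept on the device
```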
- the device 1000 may request a text string from the server 2000 (operation S320).
- the device 1000 may request a text string from the server 2000 while transmitting the encoder output value to the server 2000.
- the device 1000 may transmit, to the server 2000, an output value obtained from a last layer among a plurality of layers in the encoder.
- the device 1000 may transmit, to the server 2000, an output value selected from among output values obtained from a plurality of layers in the encoder.
- the device 1000 may provide encoding information regarding the encoder in the device 1000 to the server 2000 while requesting a text string from the server 2000.
- Encoding information may include, for example, information related to the encoder output value, such as a type of the end-to-end ASR model in the device 1000, an identification value of the end-to-end ASR model, an identification value of the encoder in the end-to-end ASR model, a type of the encoder, properties of the encoder, and information indicating a degree of encoding, but is not limited thereto.
- the information indicating the degree of encoding may include, for example, information for identifying a layer that outputs the encoder output value.
- the device 1000 may provide domain information related to the encoder output value to the server 2000 while requesting a text string from the server 2000.
- the domain information may include, for example, information for identifying a domain and may include, for example, a domain name and a domain identifier, but is not limited thereto.
- the device 1000 may identify a domain related to the encoder output value using the domain identification module (e.g., 1450 of FIG. 14) in the device 1000.
- the device 1000 may identify a domain related to the encoder output value in text obtained from a projection layer connected to an output terminal of the encoder.
- the projection layer may be a layer that outputs text by taking as input a hidden vector value output from the encoder.
- the device 1000 may identify a domain related to the encoder output value based on a domain confidence level for the text generated from the encoder output value.
- a domain confidence level may include, for example, a numerical value indicating how closely at least a portion of text is relevant to a particular domain.
- the device 1000 may calculate a confidence score indicating the degree of relevance of the text obtained from the encoder output value to a domain pre-registered for decoding the encoder output value.
- the device 1000 may identify a domain related to the encoder output value based on a domain confidence level calculated for the pre-registered domain.
- the device 1000 may identify a domain related to the encoder output value based on a rule or obtain a domain confidence level related to the encoder output value using an artificial intelligence (AI) model trained for domain identification.
- the AI model for domain identification may be a part of an NLU model or a model separate therefrom.
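As a hedged stand-in for the rule-based branch of domain identification, the sketch below scores text against pre-registered domains by keyword overlap. The domain names and keyword sets are invented examples; the patent's trained AI model would replace this scoring function.

```python
DOMAIN_KEYWORDS = {
    "movie": {"play", "movie", "actor"},
    "place": {"restaurant", "near", "directions"},
}

def domain_confidence(text: str, keywords: set) -> float:
    """Fraction of a domain's keywords appearing in the text (toy rule)."""
    words = set(text.lower().split())
    return len(words & keywords) / len(keywords) if keywords else 0.0

def identify_domain(text: str) -> str:
    """Pick the pre-registered domain with the highest domain confidence level."""
    return max(DOMAIN_KEYWORDS, key=lambda d: domain_confidence(text, DOMAIN_KEYWORDS[d]))

best = identify_domain("play the new movie")
```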
- the device 1000 provides an encoder output value, encoding information, and domain information to the server 2000 while requesting a text string from the server 2000
- embodiments of the disclosure are not limited thereto.
- the device 1000 may provide only an encoder output value or only the encoder output value and encoding information to the server 2000.
- the server 2000 may determine a domain for decoding (operation S325).
- the server 2000 may identify a domain related to the encoder output value based on the domain information.
- the server 2000 may identify a domain related to the encoder output value using the domain identification module 1450 (refer to FIG. 14) in the server 2000.
- the server 2000 may receive, from the device 1000, text obtained from a projection layer connected to an output terminal of the encoder in the device 1000 and identify a domain related to the encoder output value using the received text.
- the server 2000 may obtain text by applying a projection layer to the encoder output value based on encoding information received from the device 1000 and identify a domain related to the encoder output value using the obtained text.
- the server 2000 may identify a domain related to the encoder output value based on a domain confidence level for the text generated from the encoder output value.
- a domain confidence level may include, for example, a numerical value indicating how closely at least a portion of text is relevant to a particular domain.
- the server 2000 may calculate a confidence score indicating the degree of relevance of the text obtained from the encoder output value to a domain pre-registered for decoding the encoder output value.
- the server 2000 may identify a domain related to the encoder output value based on a domain confidence level calculated for the pre-registered domain.
- the server 2000 may identify a domain related to the encoder output value based on a rule or obtain a domain confidence level related to the encoder output value using an AI model trained for domain identification.
- the AI model for domain identification may be a part of an NLU model or a model separate therefrom.
- the server 2000 may identify a domain related to an encoder output value by analyzing the encoder output value and a format of the encoder output value and applying a projection layer to the encoder output value based on a result of the analysis.
- the server 2000 may identify a type of the encoder and a degree of encoding (operation S330).
- the server 2000 may identify a type of the encoder that outputs the encoder output value and a degree of encoding based on the encoding information.
- the server 2000 may identify a type of the encoder and a degree of encoding by analyzing the encoder output value and a format of the encoder output value. For example, the server 2000 may identify a type of the encoder and a degree of encoding using an encoder identification module 2311 of FIG. 13.
- the server 2000 may select a decoder for generating a text string (operation S335).
- the server 2000 may include decoders related to a plurality of domains and select at least one domain related to an encoder output value from among the domains. For example, the server 2000 may select a domain corresponding to the domain determined in operation S325.
- the domain corresponding to the determined domain may be identical or similar to the determined domain. For example, when a plurality of domains related to decoders in the server 2000 are "movie", "place", and "region name", and the domain identified in operation S325 is "movie", the server 2000 may select "movie".
- alternatively, the server 2000 may select a similar domain, such as "video content".
- information about identification values similar to an identification value of the domain selected by the server 2000 may be stored in the server 2000.
- the server 2000 may select, from among a plurality of decoders corresponding to the selected domain, at least one decoder to decode the encoder output value.
- the server 2000 may select at least one decoder capable of decoding the encoder output value, based on the type of encoder identified in operation S330. For example, when the identified type of encoder is an encoder used in an attention-based end-to-end ASR model, the server 2000 may select a decoder used in the attention-based end-to-end ASR model.
- similarly, when the identified type of encoder is an encoder used in a recurrent neural network transducer (RNN-T) based end-to-end ASR model, the server 2000 may select a decoder used in the RNN-T based end-to-end ASR model.
- the server 2000 may select various decoders according to preset criteria, and in this case, may convert a format of the encoder output value into a format suitable for the selected decoder.
- the server 2000 may select a decoder using a decoder selection module (e.g., 2313 of FIG. 13).
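Operations S325-S335 can be sketched as a registry lookup that matches both the identified domain and the device encoder's model type. The registry contents, field names, and decoder identifiers below are invented examples, not the patent's actual configuration.

```python
DECODER_REGISTRY = [
    {"id": "dec-1", "domain": "movie", "model_type": "attention"},
    {"id": "dec-2", "domain": "movie", "model_type": "rnn-t"},
    {"id": "dec-3", "domain": "place", "model_type": "attention"},
]

def select_decoders(domain: str, encoder_type: str):
    """Return decoders matching both the identified domain and the encoder type."""
    return [d for d in DECODER_REGISTRY
            if d["domain"] == domain and d["model_type"] == encoder_type]

# An attention-based device encoder with an identified "movie" domain.
chosen = select_decoders("movie", "attention")
```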
- the server 2000 may generate a text string based on the encoder output value using the selected decoder (operation S340).
- the server 2000 may input the encoder output value to the selected decoder and obtain a text string based on a decoder output value from the decoder.
- the server 2000 may preprocess the format of the encoder output value into a format suitable for the selected decoder and input a preprocessed result to the decoder.
- a method, performed by the server 2000, of preprocessing an encoder output value will be described in greater detail below with reference to FIG. 5.
- the server 2000 may input the encoder output value to each of a plurality of selected decoders and generate a plurality of text strings using the output values respectively obtained from the decoders.
- the server 2000 may provide the text string to the device 1000 (operation S345).
- the server 2000 may compare confidence levels of the text strings with one another and provide a text string having a high confidence level to the device 1000.
- a confidence level of a generated text string may include, for example, a numerical value indicating the degree of matching between the generated text string and an input speech, and the confidence level may include, for example, a confidence score, but is not limited thereto.
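The comparison in operation S345 reduces to picking the candidate with the highest confidence score. The candidate strings and scores below are made-up illustration values.

```python
# Candidate text strings with made-up confidence scores from different decoders.
candidates = [("play the latest movie", 0.91), ("lay the latest movie", 0.62)]

def best_text_string(candidates):
    """Return the candidate text string with the highest confidence level."""
    return max(candidates, key=lambda c: c[1])[0]

result = best_text_string(candidates)
```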
- the device 1000 may provide the encoder output value to a decoder in the device 1000 (operation S350) and obtain a text string from the encoder output value using the decoder (operation S355).
- the device 1000 may register the domain with itself (operation S360).
- the device 1000 may process the encoder output value using a decoder in the device 1000 without transmitting the encoder output value to the server 2000.
- the device 1000 may register a domain using the domain registration module (e.g., 1460 of FIG. 14).
- the device 1000 may evaluate the text string received from the server 2000 and register a domain related to the evaluated text string with the device 1000 based on a result of the evaluation.
- the device 1000 may obtain text from the encoder output value and evaluate the text string received from the server 2000 by comparing the obtained text with the received text string. Furthermore, when text obtained from an encoder output value related to a particular domain matches the text string received from the server 2000 more than a preset number of times, the particular domain may be registered with the device 1000.
- the device 1000 may identify confidence levels of encoder output values and a domain related to the encoder output values, and when the confidence levels of the encoder output values related to the identified domain are each greater than or equal to a preset threshold, the device 1000 may register the identified domain with the device 1000.
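The match-count criterion for domain registration described above can be sketched as a small counter per domain. The threshold value and the exact-equality comparison are simplifying assumptions; the disclosure only requires matches "more than a preset number of times":

```python
from collections import defaultdict

class DomainRegistrar:
    """Sketch of on-device domain registration: a domain is registered once
    text decoded on-device matches the server's text string more than a
    preset number of times."""

    def __init__(self, required_matches=3):
        self.required_matches = required_matches
        self._matches = defaultdict(int)
        self.registered = set()

    def observe(self, domain, device_text, server_text):
        """Record one comparison; return whether the domain is registered."""
        if device_text == server_text:
            self._matches[domain] += 1
            if self._matches[domain] > self.required_matches:
                self.registered.add(domain)
        return domain in self.registered
```

Once a domain is in `registered`, later encoder output values related to it would be decoded on the device without being transmitted to the server.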
- a voice assistant service may include, for example, a service that provides conversations with a user.
- a voice assistant may provide a response message to the user, taking into account the user's situation, device conditions, etc., as if a human were talking directly to the user.
- the voice assistant may act as a user's personal assistant to appropriately create information for the user and provide the information to the user.
- the voice assistant service may be linked to various types of services to provide requested information or functions to the user.
- the various types of services may include, for example, and without limitation, broadcast services, content sharing services, content provision services, power management services, game provision services, chat services, document creation services, search services, call services, photo capture services, transportation recommendation services, video playback services, and the like.
- the server 2000 may provide information for performing a conversation with the user to the device 1000 using, for example, an NLU model, a dialog manager (DM) model, a natural language generating (NLG) model, etc. in the server 2000. Furthermore, the server 2000 may directly control another device 1000 based on a result of interpreting the text string. In addition, the server 2000 may generate control information necessary for the device 1000 to control the other device 1000 based on the result of interpreting the text string and provide the generated control information to the device 1000.
- FIG. 4 is a flowchart illustrating an example method, performed by the device 1000, of transmitting output values obtained from a plurality of layers in an encoder to the server 2000, according to an embodiment of the disclosure.
- the device 1000 may perform its operations illustrated in FIG. 4 by executing instructions stored in the memory of the device 1000.
- the device 1000 may perform its operations illustrated in FIG. 4 by executing at least one of the speech recognition evaluation module 1430, the domain identification module 1450, the ASR model 1410, or the NLU model 1420, as described in greater detail below with reference to FIG. 14.
- the device 1000 may execute other programs stored in the memory in order to perform preset operations.
- the device 1000 may obtain a first encoder output value from a first layer in the encoder (operation S400).
- the first layer in the encoder may be one of a plurality of stacked LSTM layers in the encoder, and the first encoder output value may include, for example, a hidden layer vector output from the first layer.
- the device 1000 may obtain a confidence level of the first encoder output value (operation S405).
- the device 1000 may obtain text from the first encoder output value from the first layer using, for example, a projection layer connected to an output terminal of the first layer. Furthermore, the device 1000 may calculate a confidence level of the first encoder output value based on the obtained text.
- the confidence level of the first encoder output value may include, for example, a numerical value indicating a degree of matching between the first encoder output value and an input speech, and the confidence level may include, for example, a confidence score, but is not limited thereto.
- the device 1000 may determine whether to transmit the first encoder output value to the server 2000 (operation S410).
- the device 1000 may compare the confidence level of the first encoder output value with a preset threshold and determine whether to transmit the first encoder output value to the server 2000 based on a result of the comparison.
- the device 1000 may identify whether a domain related to the first encoder output value corresponds to a domain registered with the device 1000, and when the domain related to the first encoder output value does not correspond to the registered domain, the device 1000 may determine whether to transmit the first encoder output value to the server 2000.
- the device 1000 may transmit the first encoder output value to the server 2000 (operation S415).
- the server 2000 may have both an encoder and a decoder of an end-to-end ASR model, and the first encoder output value transmitted to the server 2000 may be input to an output terminal of a first layer of the encoder in the server 2000.
- the device 1000 may obtain a second encoder output value from a second layer of the encoder (operation S420).
- the second layer of the encoder may include, for example, one of a plurality of stacked LSTM layers in the encoder, and may be a layer after the first layer.
- the second encoder output value may be a hidden layer vector output from the second layer.
- the device 1000 may obtain a confidence level of the second encoder output value (operation S425).
- the device 1000 may obtain text from the second encoder output value from the second layer using, for example, a projection layer connected to an output terminal of the second layer. Furthermore, the device 1000 may calculate a confidence level of the second encoder output value based on the obtained text.
- the confidence level of the second encoder output value may include, for example, a numerical value indicating a degree of matching between the second encoder output value and the input speech, and the confidence level may include, for example, a confidence score, but is not limited thereto.
- the device 1000 may determine whether to transmit the second encoder output value to the server 2000 (operation S430).
- the device 1000 may compare the confidence level of the second encoder output value with a preset threshold and determine whether to transmit the second encoder output value to the server 2000 based on a result of the comparison.
- the device 1000 may identify whether a domain related to the second encoder output value corresponds to a domain registered with the device 1000, and when the domain related to the second encoder output value does not correspond to the registered domain, the device 1000 may determine whether to transmit the second encoder output value to the server 2000.
- the device 1000 may transmit the second encoder output value to the server 2000 (operation S435).
- the server 2000 may have both an encoder and a decoder of an end-to-end ASR model, and the second encoder output value transmitted to the server 2000 may be input to an output terminal of a second layer of the encoder in the server 2000.
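The per-layer decisions in operations S400 through S435 can be sketched as a simple confidence gate: each layer's hidden vector is transmitted to the server only when its confidence falls below a preset threshold. The threshold value and the scoring function are stand-ins, and the additional domain-registration check is omitted for brevity:

```python
def outputs_to_transmit(layer_outputs, confidence_fn, threshold=0.8):
    """Return the (layer_index, hidden_vector) pairs whose confidence level
    is below the threshold and which would therefore go to the server."""
    return [(idx, vec) for idx, vec in layer_outputs
            if confidence_fn(vec) < threshold]

# Illustrative hidden vectors from a first and a second encoder layer,
# with max-activation standing in for a real confidence score.
layers = [(1, [0.2, 0.4]), (2, [0.9, 0.7])]
sent = outputs_to_transmit(layers, confidence_fn=lambda v: max(v))
print(sent)  # only layer 1's hidden vector is transmitted
```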
- FIG. 5 is a flowchart illustrating an example method, performed by the server 2000, of inputting an encoder output value to a selected decoder, according to an embodiment of the disclosure.
- the server 2000 may perform its operations illustrated in FIG. 5 by executing instructions stored in the storage of the server 2000.
- the server 2000 may perform its operations illustrated in FIG. 5 by executing an output value conversion module 2314 as described in detail below with reference to FIG. 13.
- the server 2000 may convert a format of an encoder output value according to a selected decoder (operation S500).
- the server 2000 may convert the format of the encoder output value received from the device 1000.
- the server 2000 may convert the format of the encoder output value using a tool for converting the encoder output value into a format compatible with the selected decoder.
- the server 2000 may store a conversion tool corresponding to each combination of a plurality of encoder types and a plurality of decoder types. For example, to allow n types of encoder output values to be fed into m types of decoders, the server 2000 may store at least one conversion tool capable of performing n x m types of data conversions.
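The n x m arrangement above can be sketched as a registry keyed by (encoder type, decoder type) pairs. The conversions themselves are toy stand-ins (padding and truncation); real tools would reshape hidden vectors as the target decoder requires:

```python
def _identity(vec):
    """Same-type pairs may need no conversion."""
    return vec

def _pad(vec, size=6):
    """Hypothetical dimension-matching conversion: zero-pad to a length."""
    return vec + [0.0] * (size - len(vec))

# One conversion tool per (encoder_type, decoder_type) combination,
# i.e. n x m entries overall; all names here are illustrative.
CONVERSION_TOOLS = {
    ("attention", "attention"): _identity,
    ("attention", "rnnt"): _pad,
    ("rnnt", "attention"): lambda vec: vec[:4],
    ("rnnt", "rnnt"): _identity,
}

def convert(encoder_type, decoder_type, encoder_output):
    """Convert an encoder output value into a format compatible with the
    selected decoder using the matching conversion tool."""
    return CONVERSION_TOOLS[(encoder_type, decoder_type)](encoder_output)

print(convert("attention", "rnnt", [0.1, 0.2]))  # zero-padded to length 6
```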
- the server 2000 may input the resulting encoder output value to the selected decoder (operation S510).
- the server 2000 may obtain a text string using a decoder output value that is obtained from the decoder based on the resulting encoder output value.
- FIG. 6A is a diagram illustrating an example in which the server 2000 selects one decoder corresponding to a particular domain to process an encoder output value, according to an embodiment of the disclosure.
- an encoder output value from an encoder of first type 60 in the device 1000 may be provided to an encoder identification module (e.g., including processing circuitry and/or executable program elements) 2311 in the server 2000.
- the encoder identification module 2311 of the server 2000 may receive the encoder output value and identify a type of the encoder of first type 60 that outputs the encoder output value.
- the encoder identification module 2311 may identify a type of the encoder of first type 60 as being the first type based on the encoding information provided by the device 1000.
- the encoder identification module 2311 may identify the type of the encoder of first type 60 as being the first type by analyzing the encoder output value provided by the device 1000 and a format of the encoder output value.
- a domain identification module 2312 of the server 2000 may identify a domain related to the encoder output value provided by the device 1000.
- the domain identification module 2312 may identify that the encoder output value is related to a first domain based on the domain information provided by the device 1000.
- the domain identification module 2312 may use the encoder output value provided by the device 1000 to identify a domain related to the encoder output value as being the first domain.
- the domain identification module 2312 may obtain text from the projection layer and identify a domain related to the encoder output value as being the first domain based on the obtained text.
- the server 2000 may identify a domain related to the encoder output value based on a domain confidence level for the text generated from the encoder output value.
- the server 2000 may calculate domain confidence levels, each indicating the degree of relevance between the text obtained from the encoder output value and each of the first and second domains.
- the server 2000 may identify a domain related to the encoder output value as being the first domain by comparing the domain confidence levels respectively calculated for the first and second domains with each other.
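The comparison of per-domain confidence levels described above reduces to an argmax over domain scores. In this sketch, keyword-based scorers stand in for trained domain-identification models, and all names are illustrative:

```python
def identify_domain(text, scorers):
    """Compute a domain confidence level for the text against each candidate
    domain and return the domain with the highest confidence."""
    scores = {domain: scorer(text) for domain, scorer in scorers.items()}
    return max(scores, key=scores.get)

# Toy stand-ins for domain confidence models.
scorers = {
    "first_domain": lambda t: 1.0 if "play" in t else 0.1,
    "second_domain": lambda t: 1.0 if "weather" in t else 0.1,
}
print(identify_domain("play some jazz", scorers))  # first_domain
```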
- a decoder selection module (e.g., including processing circuitry and/or executable program elements) 2313 of the server 2000 may select a decoder to decode the encoder output value.
- the decoder selection module 2313 may select a decoder of first type 61 corresponding to the first domain from among a plurality of decoders, e.g., a set of decoders of first and second types 61 and 62 and a set of decoders of first and second types 63 and 64 in the server 2000, based on the type of encoder identified by the encoder identification module 2311 and the domain identified by the domain identification module 2312.
- the encoder output value may be input to the decoder of first type 61, and the server 2000 may obtain a text string based on an output value from the decoder of first type 61. In addition, the server 2000 may provide the obtained text string to the device 1000.
- FIG. 6B is a diagram illustrating an example in which the server 2000 selects a plurality of decoders corresponding to a particular domain to process an encoder output value, according to an embodiment of the disclosure.
- an encoder output value from the encoder of first type 60 in the device 1000 may be provided to a domain identification module 2312 in the server 2000.
- the domain identification module 2312 may identify a domain related to the encoder output value provided by the device 1000.
- the domain identification module 2312 may identify that the encoder output value is related to a first domain based on the domain information provided by the device 1000.
- the domain identification module 2312 may use the encoder output value provided by the device 1000 to identify a domain related to the encoder output value as being a first domain.
- the domain identification module 2312 may obtain text from the projection layer and identify a domain related to the encoder output value as being the first domain based on the obtained text.
- the server 2000 may identify a domain related to the encoder output value based on a domain confidence level for the text generated from the encoder output value.
- a decoder selection module 2313 of the server 2000 may select a decoder to decode the encoder output value.
- the decoder selection module 2313 may select, based on the domain identified by the domain identification module 2312, a decoder of first type 61 and a decoder of second type 62, each corresponding to the first domain, from among a plurality of decoders, e.g., a set of the decoders of first and second types 61 and 62 and a set of decoders of first and second types 63 and 64 in the server 2000.
- the encoder output value may be input to the decoder of first type 61, and the server 2000 may obtain a first text string based on an output value from the decoder of first type 61. Furthermore, the encoder output value may be converted into a format suitable for the decoder of second type 62 and then input to the decoder of second type 62, and the server 2000 may obtain a second text string based on an output value from the decoder of second type 62.
- the server 2000 may compare a confidence level of the first text string with that of the second text string, select a text string having a higher confidence level, and provide the selected text string to the device 1000.
- a confidence level of a text string may be a numerical value indicating a degree of matching between the obtained text string and the input speech, and the confidence level may include, for example, a confidence score, but is not limited thereto.
- FIG. 6C is a diagram illustrating an example in which the server 2000 selects a plurality of decoders corresponding to a plurality of domains to process an encoder output value, according to an embodiment of the disclosure.
- an encoder output value from the encoder of first type 60 in the device 1000 may be provided to a decoder selection module 2313 in the server 2000.
- the decoder selection module 2313 of the server 2000 may select a decoder to decode the encoder output value.
- the decoder selection module 2313 may select a plurality of decoders, e.g., a set of decoders of first and second types 61 and 62 and a set of decoders of first and second types 63 and 64, the sets respectively corresponding to a plurality of domains, e.g., first and second domains in the server 2000.
- the encoder output value may be input to the decoder of first type 61 corresponding to the first domain, and the server 2000 may obtain a first text string based on an output value from the decoder of first type 61 corresponding to the first domain. Furthermore, the encoder output value may be converted into a format suitable for the decoder of second type 62 corresponding to the first domain and then input to the decoder of second type 62 corresponding to the first domain, and the server 2000 may obtain a second text string based on an output value from the decoder of second type 62 corresponding to the first domain.
- the encoder output value may be input to the decoder of first type 63 corresponding to the second domain, and the server 2000 may obtain a third text string based on an output value from the decoder of first type 63 corresponding to the second domain.
- the encoder output value may be converted into a format suitable for the decoder of second type 64 corresponding to the second domain and then input to the decoder of second type 64 corresponding to the second domain, and the server 2000 may obtain a fourth text string based on the output value from the decoder of second type 64 corresponding to the second domain.
- the server 2000 may compare confidence levels of the first through fourth text strings with one another, select a text string having a high confidence level, and provide the selected text string to the device 1000.
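The fan-out to multiple decoders across multiple domains described above can be sketched as follows. All components are stubs: the decoder entries, the conversion step, and the returned confidence scores are illustrative, not details from the disclosure:

```python
def decode_with_all(encoder_output, decoders, convert):
    """Feed an encoder output value to every candidate decoder (converting
    its format per decoder type) and keep the highest-confidence string."""
    results = []
    for dec in decoders:
        value = convert(encoder_output, dec["type"])
        results.append(dec["run"](value))  # each returns (text, confidence)
    return max(results, key=lambda r: r[1])

# Toy decoders standing in for the first- and second-type decoders.
decoders = [
    {"type": "first", "run": lambda v: ("call mom", 0.6)},
    {"type": "second", "run": lambda v: ("call tom", 0.8)},
]
best = decode_with_all([0.1], decoders, convert=lambda v, t: v)
print(best)  # ('call tom', 0.8)
```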
- FIG. 6D is a diagram illustrating an example in which the server 2000 selects, based on a decoder of the same type as an encoder of a device not being in the server 2000, a different type of decoder to process an encoder output value, according to an embodiment of the disclosure.
- the server 2000 may not include a decoder of the same type as the encoder of first type 60 in the device 1000.
- a decoder selection module 2313 may select a decoder to process an encoder output value from the encoder of first type 60 from among a plurality of decoders, e.g., a set of decoders of third through fifth types 65 through 67 and a set of decoders of third through fifth types 68 through 70, the sets respectively corresponding to a plurality of domains, e.g., first and second domains.
- the server 2000 may select the decoder of third type 65 and the decoder of fourth type 66 that are of different types than the encoder of first type 60.
- the encoder output value may be converted into a format suitable for the decoder of third type 65 and then input to the decoder of third type 65, and the server 2000 may obtain a fifth text string based on an output value from the decoder of third type 65.
- the encoder output value may be converted into a format suitable for the decoder of fourth type 66 and then input to the decoder of fourth type 66, and the server 2000 may obtain a sixth text string based on an output value from the decoder of fourth type 66.
- the server 2000 may compare a confidence level of the fifth text string with that of the sixth text string, select a text string having a higher confidence level, and provide the selected text string to the device 1000.
- FIG. 7A is a diagram illustrating an example in which the device 1000 and the server 2000 obtain a text string from a speech signal based on the device 1000 and the server 2000 each including attention-based ASR models according to an embodiment of the disclosure.
- the device 1000 and the server 2000 may each include an attention-based ASR model.
- An attention-based ASR model may include an encoder and a decoder, each including an RNN.
- An encoder of an attention-based ASR model may compute an output from a feature vector for a speech signal frame-by-frame, and the decoder may determine which frame's encoder output value should be attended to, select an encoder output value according to an attention weight, and estimate a text string using the selected encoder output value as input.
- An encoder of an attention-based ASR model in the device 1000 may include a plurality of stacked LSTM layers, and the device 1000 may provide the server 2000 with a hidden layer vector that is an encoder output value from a last LSTM layer 71 in the encoder of the attention-based ASR model in the device 1000 itself. Furthermore, the hidden layer vector received from the device 1000 may be fed into an output terminal of a last LSTM layer 73 in an encoder of an attention-based ASR model in the server 2000.
- the device 1000 may provide the server 2000 with a hidden layer vector that is an encoder output value from an LSTM layer 72 in the encoder of the attention-based ASR model in the device 1000. Furthermore, the hidden layer vector received from the device 1000 may be fed into an output terminal of an LSTM layer 74 in the encoder of the attention-based ASR model in the server 2000.
- the LSTM layer 74 in the encoder of the attention-based ASR model in the server 2000 may be a layer corresponding to the LSTM layer 72 in the encoder of the attention-based ASR model in the device 1000.
- the server 2000 may provide the device 1000 with a text string that is output from the attention-based ASR model using, for example, a hidden layer vector received from the device 1000.
- the server 2000 may identify a domain related to the hidden layer vector received from the device 1000 and select an attention-based ASR model corresponding to the identified domain from among a plurality of attention-based ASR models in the server 2000.
- the server 2000 may obtain a text string as an output value by inputting the hidden layer vector received from the device 1000 to the selected attention-based ASR model.
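Feeding the device's hidden layer vector into the output terminal of the corresponding server layer, as described above, means the server resumes encoding at that point and runs only its remaining layers. A minimal sketch, with plain functions standing in for LSTM layers:

```python
def resume_encoding(hidden_vector, server_layers, after_layer):
    """Inject a device hidden layer vector at the output terminal of server
    layer `after_layer`, so only the layers after it are executed."""
    value = hidden_vector
    for layer in server_layers[after_layer:]:
        value = layer(value)
    return value

# Illustrative two-layer server encoder.
server_layers = [lambda v: [x + 1 for x in v],   # layer 1
                 lambda v: [x * 2 for x in v]]   # layer 2

# Vector from the device's first layer enters after server layer 1,
# so only server layer 2 runs.
print(resume_encoding([1.0, 2.0], server_layers, after_layer=1))  # [2.0, 4.0]
```

When the hidden vector comes from the device's last LSTM layer, `after_layer` equals the number of server layers and the vector passes straight to the decoder, matching the last-layer case above.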
- FIG. 7B is a diagram illustrating an example in which the device 1000 and the server 2000 obtain a text string from a speech signal based on the device 1000 and the server 2000 each including RNN-T based ASR models according to an embodiment of the disclosure.
- the device 1000 and the server 2000 may each include, for example, an RNN-T based ASR model.
- a prediction network may be connected to an encoder to achieve an effect of post-processing an encoder output value via a language model.
- An encoder of an RNN-T based ASR model in the device 1000 may include a plurality of stacked LSTM layers, and the device 1000 may provide the server 2000 with a hidden layer vector that is an encoder output value from a last LSTM layer 75 in the encoder of the RNN-T based ASR model in the device 1000. Furthermore, the hidden layer vector received from the device 1000 may be fed into an output terminal of a last LSTM layer 77 in an encoder of an RNN-T based ASR model in the server 2000.
- the device 1000 may provide the server 2000 with a hidden layer vector that is an encoder output value from an LSTM layer 76 in the encoder of the RNN-T based ASR model in the device 1000. Furthermore, the hidden layer vector received from the device 1000 may be fed into an output terminal of an LSTM layer 78 in the encoder of the RNN-T based ASR model in the server 2000.
- the LSTM layer 78 in the encoder of the RNN-T based ASR model in the server 2000 may be a layer corresponding to the LSTM layer 76 in the encoder of the RNN-T based ASR model in the device 1000.
- the server 2000 may provide the device 1000 with a text string that is output from the RNN-T based ASR model using, for example, a hidden layer vector received from the device 1000.
- the server 2000 may identify a domain related to the hidden layer vector received from the device 1000 and select an RNN-T based ASR model corresponding to the identified domain from among a plurality of RNN-T based ASR models in the server 2000.
- the server 2000 may obtain a text string as an output value by inputting the hidden layer vector received from the device 1000 to the selected RNN-T based ASR model.
- FIG. 8A is a diagram illustrating an example in which the device 1000 and the server 2000 obtain a text string from a speech signal based on encoders of attention-based ASR models not being included in the server 2000, according to an embodiment of the disclosure.
- the device 1000 may include both an encoder and a decoder of an attention-based ASR model, while the server 2000 may include only decoders of an attention-based ASR model.
- the server 2000 may include, as the decoders of the attention-based ASR model, a decoder corresponding to domain A and a decoder corresponding to domain B. Accordingly, an output value of the encoder of the attention-based ASR model in the device 1000 may be provided to at least one of the decoder corresponding to domain A or the decoder corresponding to domain B in the server 2000.
- FIG. 8B is a diagram illustrating an example in which the device 1000 and the server 2000 obtain a text string from a speech signal based on encoders of RNN-T based ASR models not being included in the server 2000, according to an embodiment of the disclosure.
- the device 1000 may include both an encoder and a decoder of an RNN-T based ASR model, while the server 2000 may include only decoders of an RNN-T based ASR model.
- the server 2000 may include, as the decoders of the RNN-T based ASR model, a decoder corresponding to domain A and a decoder corresponding to domain B. Accordingly, an output value of the encoder of the RNN-T based ASR model in the device 1000 may be provided to at least one of the decoder corresponding to domain A or the decoder corresponding to domain B in the server 2000.
- FIG. 9 is a flowchart illustrating an example method, performed by the device 1000 and the server 2000, of performing speech recognition and NLU processing on a speech input, according to an embodiment of the disclosure.
- the device 1000 may determine whether to use an output value of its ASR model and whether to perform NLU processing at the device 1000 itself.
- the device 1000 may determine whether to use an output value of its ASR model based on an encoder output value from the ASR model (operation 900).
- the device 1000 may obtain a confidence level of the encoder output value and determine whether to use the output value of the ASR model by comparing the confidence level of the encoder output value with a preset threshold. When the confidence level of the encoder output value is greater than or equal to the preset threshold, the device 1000 may determine to use the output value of the ASR model. On the other hand, when the confidence level of the encoder output value is less than the preset threshold, the device 1000 may determine not to use the output value of the ASR model.
- the confidence level of the encoder output value may represent a degree of matching between text represented by the encoder output value and an input speech.
- for example, a projection layer trained using a connectionist temporal classification (CTC) loss function may be used, and text may be obtained from the projection layer.
- the device 1000 may calculate a confidence level of an encoder output value based on the text obtained from the projection layer connected to the output terminal of the encoder.
- a method of calculating (e.g., determining) a confidence level is not limited thereto, and for example, the device 1000 may calculate a confidence level directly from an encoder output value using a preset algorithm.
- the device 1000 may obtain a confidence level of an encoder output value using the speech recognition evaluation module (e.g., 1430 of FIG. 14).
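One simple way to turn projection-layer outputs into a confidence level, consistent with the description above, is to average each frame's maximum posterior probability. The disclosure leaves the exact scoring method open, so this is only an illustrative stand-in:

```python
import math

def softmax(logits):
    """Numerically stable softmax over one frame's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def encoder_confidence(frame_logits):
    """Average per-frame maximum posterior: peaked frame distributions
    give a confidence near 1.0, flat ones near 1/num_classes."""
    return sum(max(softmax(f)) for f in frame_logits) / len(frame_logits)

# Two strongly peaked frames yield a confidence close to 1.0.
print(encoder_confidence([[9.0, 0.0, 0.0], [0.0, 9.0, 0.0]]))
```

The resulting score would then be compared with the preset threshold to decide whether the device's output value is usable.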
- the device 1000 may determine whether to use an output value of its ASR model based on the output value of the ASR model (operation 900).
- the output value of the ASR model of the device 1000 may be, for example, at least one text string generated from a feature vector for a speech signal.
- the output value of the ASR model of the device 1000 may be a text string obtained when decoding an encoder output value from the encoder (1411 of FIG. 14) in the ASR model of the device 1000 using the decoder (1412 of FIG. 14) in the ASR model of the device 1000.
- the device 1000 may determine whether to use the output value of the ASR model for a voice assistant service based on a confidence level of the output value of the ASR model.
- the confidence level of the output value of the ASR model may represent, for example, a degree of matching between a text string output from the ASR model and the input speech.
- the device 1000 may provide the server 2000 with an encoder output value from the encoder in the ASR model of the device 1000.
- the device 1000 may provide the server 2000 with a hidden vector output from a hidden layer in the encoder or graphemes generated from the hidden vector.
- an ASR model 905 of the server 2000 may obtain its output value using the encoder output value received from the device 1000.
- the server 2000 may input the encoder output value received from the device 1000 to a decoder in the ASR model 905 of the server 2000.
- the server 2000 may generate a text string from the encoder output value, as in operations S325 through S340 of FIG. 3.
- the server 2000 may provide the output value of the ASR model 905 thereof to an NLU model 910 installed thereon.
- the device 1000 may determine whether to perform NLU processing at the device 1000 itself using the output value of its ASR model (operation 915). For example, the device 1000 may determine whether to perform NLU processing therein based on a domain confidence level for the output value of its ASR model.
- a domain confidence level may include, for example, a numerical value indicating how relevant a text string is to a particular domain.
- the device 1000 may calculate a domain confidence level indicating the degree of relevance of the output value of its ASR model to a domain pre-registered for NLU processing and determine whether to perform the NLU processing at the device 1000 itself based on the domain confidence level calculated for the pre-registered domain. For example, the device 1000 may obtain, for each of a plurality of pre-registered domains, a confidence score indicating the degree of relevance of the output value of the ASR model to the pre-registered domain and determine whether to perform NLU processing at the device 1000 itself based on the obtained confidence scores.
- the device 1000 may identify a domain related to the output value of the ASR model based on a rule or obtain a domain confidence level related to the output value of the ASR model using an AI model trained for domain identification.
- the AI model for domain identification may be a part of an NLU model or a model separate therefrom.
- the domain confidence level for the output value of the ASR model may be calculated by the domain identification module 1450 of FIG. 14.
- the device 1000 may provide the output value of the ASR model to the server 2000. For example, when the confidence levels for output values of the ASR model of the device 1000 with respect to the pre-registered domains are less than a preset threshold, such that the output values are determined to be unrelated or only slightly related to the pre-registered domains, the device 1000 may determine not to perform the NLU processing at the device 1000 itself.
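The routing decision in operation 915 can be sketched as a threshold check over pre-registered domain scores. The threshold value and domain names are illustrative assumptions:

```python
def route_nlu(domain_scores, threshold=0.5):
    """Decide where NLU runs: on the device when the ASR output is
    sufficiently relevant to some pre-registered domain, otherwise on
    the server. Returns (target, matched_domain_or_None)."""
    if domain_scores and max(domain_scores.values()) >= threshold:
        return "device", max(domain_scores, key=domain_scores.get)
    return "server", None

print(route_nlu({"music": 0.82, "weather": 0.10}))  # ('device', 'music')
print(route_nlu({"music": 0.12, "weather": 0.10}))  # ('server', None)
```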
- the NLU model 910 of the server 2000 may perform NLU processing using, for example, at least one of the output value of the ASR model 905 of the server 2000 or the output value of the ASR model of the device 1000.
- the server 2000 may provide an NLU result value output from the NLU model 910 to the device 1000.
- the NLU result value is data representing a result of interpreting text, and may be, for example, data output from the NLU model 910.
- the NLU result value may include intent and parameters.
- the intent is information determined by interpreting text using an NLU model, and may indicate, for example, a user's intention of an utterance.
- the intent may include information indicating a user's intention of an utterance (hereinafter referred to as "intent information"), as well as a numerical value corresponding to the intent information.
- a numerical value may indicate a probability that the text will be related to information indicating a particular intent.
- a piece of intent information having the maximum corresponding numerical value may be determined as the intent.
- the parameters may indicate detailed information related to the intent.
- the parameters are information related to the intent, and a plurality of types of parameters may correspond to a single intent.
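The intent-selection rule above amounts to choosing the candidate with the maximum numerical value. The sketch below is illustrative: the candidate intents, probabilities, and parameter fields are invented for the example and do not come from the disclosure.

```python
# Illustrative NLU result value: each intent candidate carries a numerical
# value (probability that the text relates to that intent); the candidate
# with the maximum value becomes the intent, and parameters hold the detail.
def select_intent(candidates: dict[str, float]) -> str:
    """Pick the intent information with the maximum numerical value."""
    return max(candidates, key=candidates.get)

candidates = {"play_music": 0.83, "set_alarm": 0.11, "get_weather": 0.06}
intent = select_intent(candidates)
result = {"intent": intent,
          "parameters": {"artist": "example", "track": "example"}}
print(result["intent"])  # play_music
```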
- the device 1000 may provide the output value of the ASR model to an NLU model 920 installed thereon.
- the NLU model 920 of the device 1000 may perform NLU processing using the output value of the ASR model of the device 1000.
- the device 1000 may determine whether to register a domain with itself (operation 925). When the device 1000 registers a particular domain therewith and determines that an output value of its ASR model generated after the registration is related to the registered domain, the device 1000 may set (e.g., configure) the output value of the ASR model to be processed using the NLU model 920 of the device 1000 instead of transmitting the output value of the ASR model to the server 2000. For domain registration, for example, the device 1000 may evaluate an NLU result value received from the server 2000 and an NLU result value output from the NLU model 920 thereof and register a particular domain with itself based on a result of the evaluation.
- the device 1000 may compare intent and parameters in the NLU result value output from the NLU model 910 of the server 2000 with intent and parameters in the NLU result value output from the NLU model 920 of the device 1000 and then register a particular domain with itself based on a result of the comparison. For example, when the intent and parameters in the NLU result value output from the NLU model 920 of the device 1000 are substantially identical or similar to the intent and parameters in the NLU result value output from the NLU model 910 of the server 2000, the device 1000 may register a domain corresponding to the NLU model 920 therewith.
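The comparison-based registration step can be sketched as follows. This is a minimal sketch under stated assumptions: the dictionary field names (`intent`, `parameters`, `domain`) and the strict-equality comparison are stand-ins; the disclosure also allows registration when the results are merely similar.

```python
# Hedged sketch of domain registration: if the device NLU result matches
# the server NLU result, register the corresponding domain with the device
# so later utterances in that domain can be processed locally.
def maybe_register(server_result: dict, device_result: dict,
                   registered: set[str]) -> None:
    same_intent = server_result["intent"] == device_result["intent"]
    same_params = server_result["parameters"] == device_result["parameters"]
    if same_intent and same_params:
        registered.add(device_result["domain"])

registered: set[str] = set()
maybe_register({"intent": "play_music", "parameters": {"track": "x"}},
               {"intent": "play_music", "parameters": {"track": "x"},
                "domain": "music"},
               registered)
print(registered)  # {'music'}
```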
- the device 1000 may display, on a screen of the device 1000, a graphical user interface (GUI) for evaluating the NLU result value output from the NLU model 910 of the server 2000 and the NLU result value output from the NLU model 920 of the device 1000 and then evaluate the NLU result value from the NLU model 910 and the NLU result value from the NLU model 920 based on a user input via the displayed GUI.
- the device 1000 may register a particular domain therewith using the domain registration module (e.g., 1460 of FIG. 14).
- FIG. 10 is a flowchart illustrating an example method, performed by the device 1000 and the server 2000, of performing speech recognition and NLU processing on a speech input, according to an embodiment of the disclosure.
- the device 1000 may determine whether to use an output value of its ASR model and determine whether to perform NLU processing at the device 1000 itself using an output value of an ASR model 105 of the server 2000.
- the device 1000 may determine whether to use an output value of its ASR model based on an encoder output value from its ASR model (operation 100).
- the device 1000 may obtain a confidence level of the encoder output value and determine whether to use the output value of the ASR model by comparing the confidence level of the encoder output value with a preset threshold. When the confidence level of the encoder output value is greater than or equal to the preset threshold, the device 1000 may determine to use the output value of the ASR model. On the other hand, when the confidence level of the encoder output value is less than the preset threshold, the device 1000 may determine not to use the output value of the ASR model.
- the confidence level of the encoder output value may represent a degree of matching between text represented by the encoder output value and an input speech. The confidence level of the encoder output value may be calculated using, for example, the speech recognition evaluation module (e.g., 1430 of FIG. 14).
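Operation 100 reduces to a threshold comparison on the encoder-output confidence. The sketch below is an assumption-laden stand-in: the threshold value and function name are illustrative, and the confidence itself would come from the speech recognition evaluation module.

```python
# Minimal sketch of operation 100: compare the encoder-output confidence
# against a preset threshold to decide whether the on-device ASR output
# value is used for the voice assistant service.
def use_on_device_asr(encoder_confidence: float,
                      threshold: float = 0.8) -> bool:
    """True when the output value of the device's ASR model should be used."""
    return encoder_confidence >= threshold

print(use_on_device_asr(0.92))  # True: keep the on-device result
print(use_on_device_asr(0.41))  # False: fall back to the server ASR model
```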
- the device 1000 may determine whether to use an output value of its ASR model based on the output value of the ASR model (operation 100).
- the device 1000 may determine whether to use the output value of the ASR model for a voice assistant service based on a confidence level of the output value of the ASR model.
- the confidence level of the output value of the ASR model may represent, for example, a degree of matching between a text string output from the ASR model and the input speech.
- the device 1000 may provide the server 2000 with an encoder output value from an encoder in the ASR model of the device 1000.
- An ASR model 105 of the server 2000 may obtain its output value using the encoder output value received from the device 1000.
- the server 2000 may input the encoder output value received from the device 1000 to a decoder of the ASR model 105 of the server 2000.
- the server 2000 may generate a text string from the encoder output value, as in operations S325 through S340 of FIG. 3.
- the server 2000 may provide the output value of the ASR model 105 thereof to the device 1000.
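The hand-off described in the operations above can be sketched as a small payload protocol. Every name here (`EncoderPayload`, the field names, `server_decode`, the encoder type string) is an illustrative assumption, not the disclosed interface.

```python
# Assumed sketch of the device-to-server hand-off: the device ships its
# encoder output value (plus encoding metadata) to the server, whose
# decoder finishes recognition and returns a text string to the device.
from dataclasses import dataclass

@dataclass
class EncoderPayload:
    features: list            # encoder output value (feature frames)
    encoder_type: str         # type of the device's encoder (assumed field)
    degree_of_encoding: int   # how far the device encoded the speech

def server_decode(payload: EncoderPayload) -> str:
    """Stand-in for the decoder of the server's ASR model 105."""
    return f"decoded({len(payload.features)} frames, {payload.encoder_type})"

text = server_decode(EncoderPayload([0.1, 0.2, 0.3], "transformer", 1))
print(text)  # decoded(3 frames, transformer)
```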
- the device 1000 may determine whether to perform NLU processing at the device 1000 itself using the output value of its ASR model (operation 110). In this case, for example, the device 1000 may determine whether to perform NLU processing therein based on a domain confidence level for the output value of its ASR model.
- the device 1000 may identify a domain related to the output value of its ASR model based on a rule or obtain a domain confidence level related to the output value of the ASR model using an AI model trained for domain identification.
- the AI model for domain identification may be a part of an NLU model or a model separate therefrom.
- the domain confidence level for the output value of the ASR model of the device 1000 may be calculated by the domain identification module (e.g., 1450 of FIG. 14).
- the device 1000 may determine whether to perform NLU processing at the device 1000 itself using the output value of the ASR model 105 of the server 2000 (operation 110). For example, the device 1000 may determine whether to perform NLU processing therein based on a domain confidence level for the output value of the ASR model 105 of the server 2000. The device 1000 may calculate a domain confidence level indicating the degree of relevance of the output value of the ASR model 105 of the server 2000 to a domain pre-registered for NLU processing and determine whether to perform the NLU processing at the device 1000 itself based on the domain confidence level calculated for the pre-registered domain.
- the device 1000 may obtain, for each of a plurality of pre-registered domains, a confidence score indicating the degree of relevance of the output value of the ASR model 105 of the server 2000 to that pre-registered domain and determine whether to perform NLU processing at the device 1000 itself based on the obtained confidence scores.
- the device 1000 may identify a domain related to the output value of the ASR model 105 of the server 2000 based on a rule or obtain a domain confidence level related to the output value of the ASR model 105 using an AI model trained for domain identification.
- the AI model for domain identification may be a part of an NLU model or a model separate therefrom.
- the domain confidence level for the output value of the ASR model 105 of the server 2000 may be calculated by the domain identification module (e.g., 1450 of FIG. 14).
- the device 1000 may provide the output value of its ASR model to an NLU model 120 of the device 1000. Furthermore, when the device 1000 determines to perform the NLU processing therein using the output value of the ASR model 105 of the server 2000, the device 1000 may provide the output value of the ASR model 105 of the server 2000 to the NLU model 120 of the device 1000. In this case, the NLU model 120 of the device 1000 may perform NLU processing using the output value of its ASR model or the output value of the ASR model 105 of the server 2000.
- the device 1000 may provide the output value of its ASR model to the server 2000. Furthermore, when the device 1000 determines not to perform the NLU processing therein using the output value of the ASR model 105 of the server 2000, the device 1000 may provide the output value of the ASR model 105 of the server 2000 to the server 2000.
- An NLU model 115 of the server 2000 may perform NLU processing using at least one of the output value of the ASR model 105 of the server 2000 or the output value of the ASR model of the device 1000. Furthermore, the server 2000 may provide an NLU result value output from the NLU model 115 to the device 1000.
- the device 1000 may determine whether to register a domain with itself (operation 125). When the device 1000 registers a particular domain therewith and determines that an output value of its ASR model generated after the registration is related to the registered domain, the device 1000 may set (e.g., configure) the output value of the ASR model to be processed using the NLU model 120 in the device 1000 instead of transmitting the output value of the ASR model to the server 2000.
- FIG. 11 is a flowchart illustrating an example method, performed by the device 1000 and the server 2000, of performing speech recognition and NLU processing on a speech input, according to an embodiment of the disclosure.
- the device 1000 may determine whether to use an output value of its ASR model and whether to perform NLU processing at the device 1000 itself. Furthermore, the server 2000 may determine whether NLU processing is to be performed at the device 1000 based on an output value of an ASR model 215 of the server 2000.
- the device 1000 may determine whether to use an output value of its ASR model based on an encoder output value from its ASR model (operation 200).
- the device 1000 may obtain a confidence level of the encoder output value and determine whether to use the output value of its ASR model by comparing the confidence level of the encoder output value with a preset threshold.
- the device 1000 may determine whether to use the output value of its ASR model based on the output value of its ASR model (operation 200).
- the device 1000 may determine whether to perform NLU processing at the device 1000 itself using the output value of its ASR model (operation 205). For example, the device 1000 may determine whether to perform NLU processing therein based on a domain confidence level for the output value of its ASR model. The device 1000 may calculate a domain confidence level indicating the degree of relevance of the output value of its ASR model to a domain pre-registered for NLU processing and determine whether to perform the NLU processing at the device 1000 itself based on the domain confidence level calculated for the pre-registered domain.
- the device 1000 may identify a domain related to the output value of its ASR model based on a rule or obtain a domain confidence level related to the output value of the ASR model using an AI model trained for domain identification.
- the AI model for domain identification may be a part of an NLU model or a model separate therefrom.
- the domain confidence level for the output value of the ASR model of the device 1000 may be calculated by the domain identification module (e.g., 1450 of FIG. 14).
- the device 1000 may provide the output value of its ASR model to the server 2000.
- the server 2000 may then input the output value of the ASR model of the device 1000 to an NLU model 210 of the server 2000 and obtain an NLU result value from the NLU model 210. Furthermore, the server 2000 may provide the NLU result value obtained from the NLU model 210 to the device 1000.
- the device 1000 may input the output value of its ASR model to an NLU model 225 of the device 1000 and obtain an NLU result value from the NLU model 225.
- the device 1000 may provide the server 2000 with an encoder output value from an encoder in the ASR model of the device 1000.
- the ASR model 215 of the server 2000 may obtain its output value using the encoder output value received from the device 1000.
- the server 2000 may input the encoder output value received from the device 1000 to a decoder in the ASR model 215 of the server 2000 and obtain an output value from the ASR model 215.
- the server 2000 may generate a text string from the encoder output value, as in operations S325 through S340 of FIG. 3.
- the server 2000 may then determine whether NLU processing is to be performed at the device 1000 based on the output value of the ASR model 215 (operation 220).
- the server 2000 may determine whether a domain related to the output value of the ASR model 215 is a domain in which the device 1000 is able to perform NLU processing.
- the server 2000 may determine that the device 1000 is to perform NLU processing (“Yes” in operation 220).
- the server 2000 may determine that the device 1000 is not to perform NLU processing (“No” in operation 220).
- the server 2000 may determine whether NLU processing is to be performed at the device 1000 based on a domain confidence level for the output value of the ASR model 215.
- the server 2000 may calculate a domain confidence level indicating the degree of relevance of the output value of the ASR model 215 to a domain pre-registered with the device 1000 for NLU processing and determine whether the NLU processing is to be performed at the device 1000 based on the domain confidence level calculated for the pre-registered domain.
- the server 2000 may obtain, for each of a plurality of pre-registered domains, a confidence score indicating the degree of relevance of the output value of the ASR model 215 of the server 2000 to that pre-registered domain and determine whether NLU processing is to be performed at the device 1000 based on the obtained confidence scores.
- information indicating which domain is registered with the device 1000 may be prestored in the server 2000.
- Information indicating which domain is registered with the device 1000 may be provided from the device 1000 to the server 2000 in response to a request by the server 2000.
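The server-side routing decision of operation 220 can be sketched as a membership check against the device's registered-domain list. The function and variable names below are illustrative assumptions; the actual determination may instead use domain confidence levels as described above.

```python
# Sketch of operation 220: the server keeps (or requests from the device)
# the list of domains registered with the device, identifies the domain of
# its ASR output value, and routes NLU processing accordingly.
def nlu_at_device(identified_domain: str,
                  device_registered_domains: set) -> bool:
    """'Yes' in operation 220 when the device can handle the domain."""
    return identified_domain in device_registered_domains

registered = {"music", "alarm"}  # prestored, or provided on the server's request
print(nlu_at_device("music", registered))       # True -> ASR output goes to device
print(nlu_at_device("navigation", registered))  # False -> NLU stays at server
```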
- the server 2000 may identify a domain related to the output value of the ASR model 215 based on a rule or obtain a domain confidence level related to the output value of the ASR model 215 using an AI model trained for domain identification.
- the AI model for domain identification may be a part of an NLU model or a model separate therefrom.
- the domain confidence level for the output value of the ASR model 215 of the server 2000 may be calculated by a speech recognition evaluation module (e.g., 2341 of FIG. 13).
- the server 2000 may provide the output value of the ASR model 215 to the device 1000.
- the device 1000 may then input the output value of the ASR model 215 of the server 2000 to the NLU model 225 of the device 1000 and obtain an NLU result value from the NLU model 225.
- the server 2000 may provide the output value of the ASR model 215 to the NLU model 210.
- the server 2000 may then input the output value of the ASR model 215 to the NLU model 210 and obtain an NLU result value from the NLU model 210. Furthermore, the server 2000 may provide the obtained NLU result value to the device 1000.
- the device 1000 may determine whether to register a domain with itself (operation 230). When the device 1000 registers a particular domain therewith and determines that an output value of its ASR model, which is generated after the registration, is related to the registered domain, the device 1000 may set (e.g., configure) the output value of the ASR model to be processed using the NLU model 225 thereof instead of transmitting the output value of the ASR model to the server 2000.
- FIG. 12 is a flowchart illustrating an example method, performed by the device 1000 and the server 2000, of performing speech recognition and NLU processing on a speech input, according to an embodiment of the disclosure. Referring to FIG. 12, the device 1000 may immediately determine whether to perform NLU processing therein.
- the device 1000 may determine whether to perform NLU processing at the device 1000 itself based on the output value of its ASR model (operation 300).
- the device 1000 may determine whether to perform NLU processing at the device 1000 itself based on at least one of a confidence level of the output value of its ASR model or a domain confidence level for its ASR model.
- the confidence level of the output value of the ASR model may include a confidence score indicating the degree of matching between a text string output from the ASR model and an input speech.
- the domain confidence level may include, for example, a confidence score indicating the degree of relevance of a text string output from the ASR model to a particular domain.
- the device 1000 may determine whether to perform the NLU processing therein based on a weighted sum of the confidence level of the output value of its ASR model and the domain confidence level for its ASR model. For example, the device 1000 may apply a first weight to the confidence level of the output value of the ASR model, apply a second weight to the domain confidence level for the ASR model, and determine whether to perform NLU processing at the device 1000 itself based on the confidence level weighted by the first weight and the domain confidence level weighted by the second weight.
- the confidence level of the output value of the ASR model and the domain confidence level for the ASR model may be values normalized according to a preset criterion.
- the weighted sum of the confidence level of the output value of the ASR model and the domain confidence level for the ASR model may be calculated by the speech recognition evaluation module (e.g., 1430 of FIG. 14).
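The weighted-sum decision of operation 300 can be written out directly. The weights and threshold below are illustrative assumptions; per the description above, both confidence levels are taken to be normalized to a common range according to the preset criterion.

```python
# Sketch of the weighted-sum decision: a first weight applies to the ASR
# output confidence, a second weight to the domain confidence, and the
# combined score is compared against a threshold.
def weighted_decision(asr_confidence: float, domain_confidence: float,
                      w1: float = 0.6, w2: float = 0.4,
                      threshold: float = 0.75) -> bool:
    """True when NLU processing should be performed at the device itself."""
    score = w1 * asr_confidence + w2 * domain_confidence
    return score >= threshold

print(weighted_decision(0.9, 0.8))  # 0.86 >= 0.75 -> on-device NLU
print(weighted_decision(0.5, 0.4))  # 0.46 <  0.75 -> defer to the server
```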
- the device 1000 may input an output value of its ASR model to an NLU model 305 thereof and obtain an NLU result value output from the NLU model 305.
- the device 1000 may provide the server 2000 with an encoder output value from an encoder in the ASR model of the device 1000.
- the server 2000 may input the encoder output value to an ASR model 315 thereof and input an output value of the ASR model 315 to an NLU model 320 of the server 2000.
- the server 2000 may obtain an NLU result value output from the NLU model 320 and provide it to the device 1000.
- the device 1000 may determine whether to register a domain with itself (operation 325). When the device 1000 registers a particular domain therewith and determines that an output value of its ASR model generated after the registration is related to the registered domain, the device 1000 may set (e.g., configure) the output value of the ASR model to be processed using the NLU model 305.
- FIG. 13 is a block diagram illustrating an example server 2000 according to an embodiment of the disclosure.
- the server 2000 may include a communication interface (e.g., including communication circuitry) 2100, a processor (e.g., including processing circuitry) 2200, and a storage 2300, and the storage 2300 may include a speech recognition management module (e.g., including executable program elements) 2310, an ASR module (e.g., including executable program elements) 2320, an NLU module (e.g., including executable program elements) 2330, and a speech interpretation management module (e.g., including executable program elements) 2340.
- the communication interface 2100 may include one or more components (e.g., circuitry) for performing communication with the device 1000 and other servers (not shown).
- the communication interface 2100 may exchange information for speech recognition and voice assistant services with the device 1000 and the other servers.
- the communication interface 2100 may perform communication via a local area network (LAN), a wide area network (WAN), a value added network (VAN), a mobile radio communication network, a satellite communication network, or combinations thereof, but is not limited thereto.
- the processor 2200 may include various processing circuitry and control all operations of the server 2000.
- the processor 2200 may execute programs stored in the storage 2300 to control all operations of the server 2000 presented in the disclosure.
- the storage 2300 may store programs necessary for processing or control operations performed by the processor 2200 or store data input to or output from the server 2000.
- the storage 2300 may include at least one type of storage medium, e.g., a flash memory-type memory, a hard disk-type memory, a multimedia card micro-type memory, a card-type memory (e.g., an SD card or an XD memory), random access memory (RAM), static RAM (SRAM), read-only memory (ROM), electrically erasable programmable ROM (EEPROM), programmable ROM (PROM), a magnetic memory, a magnetic disc, or an optical disc, but is not limited thereto.
- Programs stored in the storage 2300 may be classified into a plurality of modules according to their functions, such as, for example, the speech interpretation management module 2340, the ASR module 2320, and the NLU module 2330.
- the speech recognition management module 2310 may include various executable program elements, including various modules, and may provide an encoder output value received from the device 1000 to the ASR module 2320.
- the speech recognition management module 2310 may include an encoder identification module 2311, a domain identification module 2312, a decoder selection module 2313, and an output value conversion module 2314.
- the domain identification module 2312 may identify a domain related to an encoder output value received from the device 1000.
- the domain identification module 2312 may identify a domain related to an encoder output value based on the domain information.
- the domain identification module 2312 may identify a domain related to an encoder output value based on the encoder output value.
- the server 2000 may receive, from the device 1000, text obtained from a projection layer connected to an output terminal of an encoder in the device 1000, and the domain identification module 2312 may identify a domain related to an encoder output value using the received text.
- the domain identification module 2312 may obtain text by applying a projection layer to an encoder output value based on encoding information received from the device 1000 and identify a domain related to the encoder output value using the obtained text.
- the domain identification module 2312 may identify a domain related to the encoder output value based on a domain confidence level for the text generated from the encoder output value.
- the domain identification module 2312 may calculate a confidence score indicating the degree of relevance of text obtained from an encoder output value to a domain pre-registered to decode the encoder output value.
- the domain identification module 2312 may identify a domain related to the encoder output value based on a domain confidence level calculated for the pre-registered domain.
- the domain identification module 2312 may identify a domain related to an encoder output value by analyzing the encoder output value and a format of the encoder output value and applying a projection layer to the encoder output value based on a result of the analysis.
- the encoder identification module 2311 may identify a type of an encoder in the device 1000 and a degree of encoding related to an encoder output value.
- the encoder identification module 2311 may identify a type of an encoder that outputs an encoder output value and a degree of encoding based on the encoding information.
- the server 2000 may identify a type of an encoder and a degree of encoding by analyzing an encoder output value and a format of the encoder output value.
- the decoder selection module 2313 may select a decoder to decode an encoder output value.
- the decoder selection module 2313 may select a decoder to decode the encoder output value based on at least one of a domain identified by the domain identification module 2312, a type of an encoder, or encoding information, wherein the type of the encoder and the encoding information are identified by the encoder identification module 2311.
- the decoder selection module 2313 may select at least some of a plurality of decoders in the server 2000 according to a preset criterion.
- the decoder selection module 2313 may select one decoder corresponding to a particular domain.
- the decoder selection module 2313 may select, for example, one decoder of the same type as an encoder output value from among a plurality of decoders corresponding to a domain related to an encoder output value.
- the decoder selection module 2313 may select a plurality of decoders corresponding to a particular domain.
- the decoder selection module 2313 may select, for example, a plurality of decoders corresponding to one domain related to an encoder output value.
- the selected decoders may include a decoder of the same type as the encoder output value and a decoder of a different type.
- the decoder selection module 2313 may select a plurality of decoders corresponding to a plurality of domains. For example, the decoder selection module 2313 may select a plurality of decoders corresponding to a plurality of domains related to an encoder output value. The selected decoders may include a decoder of the same type as and a decoder of a different type than the encoder output value. The decoder selection module 2313 may select all decoders in the server 2000.
- the decoder selection module 2313 may select a decoder of a different type than an encoder output value to process the encoder output value.
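The selection logic of the decoder selection module 2313 can be sketched as a registry lookup keyed by domain, preferring same-type decoders. The registry layout, decoder names, and type strings below are invented for illustration.

```python
# Hedged sketch of decoder selection: given the identified domain and the
# encoder type, pick matching decoders from the server's registry,
# preferring decoders of the same type as the encoder output value but
# falling back to different-type decoders when none match.
def select_decoders(domain: str, encoder_type: str,
                    registry: dict) -> list:
    candidates = registry.get(domain, [])
    same_type = [d for d in candidates if d["type"] == encoder_type]
    return same_type or candidates  # fall back to different-type decoders

registry = {"music": [{"name": "dec-A", "type": "transformer"},
                      {"name": "dec-B", "type": "rnn-t"}]}
chosen = select_decoders("music", "transformer", registry)
print([d["name"] for d in chosen])  # ['dec-A']
```

A variant of this routine could return several decoders for one domain, or decoders across several domains, matching the alternatives described above.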
- the output value conversion module 2314 may convert a format of an encoder output value according to a selected decoder.
- the output value conversion module 2314 may convert the format of the encoder output value received from the device 1000.
- the output value conversion module 2314 may convert the format of the encoder output value using a tool for converting the encoder output value into a format compatible with the selected decoder.
- the server 2000 may store conversion tools corresponding to each of combinations of a plurality of types of encoders and a plurality of types of decoders.
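The per-combination conversion tools can be sketched as a lookup table keyed by (encoder type, decoder type). The table contents and the placeholder transforms below are assumptions; the real converters would reshape the encoder output into the selected decoder's expected format.

```python
# Sketch of the output value conversion module 2314: one conversion tool
# per (encoder type, decoder type) combination, looked up before decoding.
# The lambda bodies are placeholders standing in for real format converters.
conversion_tools = {
    ("rnn-t", "transformer"): lambda v: [x * 2.0 for x in v],  # assumed
    ("transformer", "rnn-t"): lambda v: [x / 2.0 for x in v],  # assumed
}

def convert(encoder_output: list,
            encoder_type: str, decoder_type: str) -> list:
    if encoder_type == decoder_type:
        return encoder_output  # already compatible; no conversion needed
    tool = conversion_tools[(encoder_type, decoder_type)]
    return tool(encoder_output)

print(convert([1.0, 2.0], "rnn-t", "transformer"))  # [2.0, 4.0]
```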
- the ASR module 2320 may obtain a text string by decoding an encoder output value received from the device 1000.
- the ASR module 2320 may include a plurality of ASR models corresponding to a plurality of domains, such as a first ASR model 2321, a second ASR model 2322, etc.
- the first ASR model 2321 and the second ASR model 2322 may include end-to-end ASR models.
- the first ASR model 2321 may include an encoder 2321-1 and a decoder 2321-2
- the second ASR model 2322 may include an encoder 2322-1 and a decoder 2322-2.
- the encoder 2321-1 may not be included in the first ASR model 2321
- the encoder 2322-1 may not be included in the second ASR model 2322.
- the NLU module 2330 may interpret a text string output from the ASR module 2320.
- the NLU module 2330 may include a plurality of NLU models corresponding to a plurality of domains, such as a first NLU model 2331, a second NLU model 2332, etc.
- the NLU module 2330 may interpret the output value of the ASR model, which is received from the device 1000.
- the speech interpretation management module 2340 may evaluate a speech recognition result obtained from the ASR module 2320 of the server 2000 and determine whether to perform NLU processing on the speech recognition result.
- the speech interpretation management module 2340 may include a speech recognition evaluation module 2341 and an NLU determination module 2342.
- the speech recognition evaluation module 2341 may calculate a confidence level of a text string generated based on an output value of a decoder in the server 2000. When a plurality of decoders are selected and a plurality of text strings are generated using the selected decoders, the speech recognition evaluation module 2341 may compare the confidence levels of the text strings with one another and select the text string having the highest confidence level as the text string to be provided to the device 1000.
- a confidence level of a text string may be a numerical value indicating the degree of matching between the text string and an input speech, and the confidence level may include, for example, a confidence score, but is not limited thereto.
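The selection step above can be sketched as choosing the maximum-confidence pair. The pair layout and sample strings are illustrative only.

```python
# Minimal sketch of the evaluation module's selection step: among text
# strings produced by several decoders, return the one with the highest
# confidence score to provide to the device.
def select_best_string(results: list) -> str:
    """results: (text string, confidence score) pairs, one per decoder."""
    return max(results, key=lambda r: r[1])[0]

results = [("play some jazz", 0.94), ("play sum jas", 0.41)]
print(select_best_string(results))  # play some jazz
```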
- the speech recognition evaluation module 2341 may calculate a domain confidence level for a text string generated based on an output value of a decoder.
- the speech recognition evaluation module 2341 may calculate a domain confidence level indicating the degree of relevance of an output value of an ASR model of the server 2000 to a domain pre-registered with the device 1000 for NLU processing and determine whether NLU processing is to be performed at the device 1000 based on the domain confidence level calculated for the pre-registered domain.
- the speech recognition evaluation module 2341 may obtain, for each of a plurality of domains pre-registered with the device 1000, a confidence score indicating the degree of relevance of an output value of an ASR model of the server 2000 to that pre-registered domain and determine whether NLU processing is to be performed at the device 1000 based on the obtained confidence scores.
- information indicating which domain is registered with the device 1000 may be prestored in the server 2000.
- information indicating which domain is registered in the device 1000 may be provided from the device 1000 to the server 2000 in response to a request by the server 2000.
- the NLU determination module 2342 may determine whether NLU processing is to be performed at the device 1000 or the server 2000 with respect to a speech recognition result from the ASR module 2320 in the server 2000.
- the NLU determination module 2342 may determine whether a domain related to an output value of the ASR module 2320 is a domain in which the device 1000 is able to perform NLU processing.
- the NLU determination module 2342 may determine that the device 1000 is to perform NLU processing.
- the NLU determination module 2342 may determine that the device 1000 is not to perform NLU processing.
- the server 2000 may receive a list of domains pre-registered with the device 1000 from the device 1000 and store the list in the storage 2300.
- FIG. 14 is a block diagram illustrating an example configuration of device 1000 according to an embodiment of the disclosure.
- the device 1000 may include a communication interface (e.g., including communication circuitry) 1100, an input/output (I/O) interface (e.g., including I/O circuitry) 1200, a processor (e.g., including processing circuitry) 1300, and a memory 1400, and the memory 1400 may include an ASR model 1410, an NLU model 1420, a speech recognition evaluation module 1430, an NLU determination module 1440, a domain identification module 1450, and a domain registration module 1460.
- the communication interface 1100 may include one or more components (e.g., circuitry) for performing communication with the server 2000 and external devices (not shown).
- the communication interface 1100 may exchange information for speech recognition and voice assistant services with the server 2000 and the external devices.
- the communication interface 1100 may perform communication via a LAN, a WAN, a VAN, a mobile radio communication network, a satellite communication network, or combinations thereof, but is not limited thereto.
- the I/O interface 1200 may include various I/O circuitry and receive data input to the device 1000 and output data from the device 1000.
- the I/O interface 1200 may include, for example, and without limitation, a user input interface, a camera, a microphone, a display, an audio output interface, or the like.
- Examples of the user input interface may include, but are not limited to, a keypad, a dome switch, a touch pad (a capacitive overlay type, a resistive overlay type, an infrared beam type, a surface acoustic wave type, an integral strain gauge type, a piezoelectric type, etc.), a jog wheel, a jog switch, etc.
- the display may display and output information processed by the device 1000.
- the display may display a GUI for voice assistant services.
- when the display and a touch pad form a layer structure to construct a touch screen, the display may be used as an input device as well as an output device.
- the display may include, for example, and without limitation, at least one of a liquid crystal display (LCD), a thin-film transistor-LCD (TFT-LCD), an organic light-emitting diode (OLED) display, a flexible display, a three-dimensional (3D) display, an electrophoretic display, or the like.
- the audio output interface may output audio data and may include, for example, a speaker, a buzzer, etc.
- the camera may obtain an image frame such as a still or moving image via an image sensor in a video call mode or image-capturing mode.
- An image captured via the image sensor may be processed by the processor 1300 or a separate image processor (not shown).
- the microphone may receive a user's utterance and process the user's utterance as electrical speech data.
- the processor 1300 may include various processing circuitry and control all operations of the device 1000.
- the processor 1300 may execute programs stored in the memory 1400 to control all operations of the device 1000 presented in the disclosure.
- the memory 1400 may store programs necessary for processing or control operations performed by the processor 1300 or store data input to or output from the device 1000.
- the memory 1400 may include at least one of types of storage media, e.g., a flash memory-type memory, a hard disk-type memory, a multimedia card micro-type memory, a card-type memory (e.g., an SD card or an XD memory), RAM, SRAM, ROM, EEPROM, PROM, a magnetic memory, a magnetic disc, or an optical disc, but is not limited thereto.
- Programs stored in the memory 1400 may be classified into a plurality of modules according to their functions, such as the ASR model 1410, the NLU model 1420, the speech recognition evaluation module 1430, the NLU determination module 1440, the domain identification module 1450, and the domain registration module 1460.
- the ASR model 1410 may encode a speech signal input to the device 1000.
- the ASR model 1410 may be an end-to-end ASR model and include an encoder 1411 and a decoder 1412.
- the speech signal input to the device 1000 may be encoded by the encoder 1411 in the ASR model 1410.
- the encoder 1411 in the ASR model 1410 may include a plurality of layers, e.g., a plurality of stacked LSTM layers.
- an encoded output value may be one of output values derived from the layers in the encoder 1411.
- the encoded output value may be a hidden vector output from a layer in the encoder 1411.
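As a toy illustration of taking an encoder output value from one of several stacked layers, the sketch below stands in for the stacked LSTM layers of the encoder 1411. The layer count, vector size, and tanh layers are invented for the sketch and are not from the patent:

```python
import numpy as np

# A stack of simple layers standing in for stacked LSTM layers: each layer
# maps a hidden vector to the next, and the "encoder output value" can be
# the hidden vector tapped from any layer in the stack.

rng = np.random.default_rng(0)
layers = [rng.standard_normal((8, 8)) * 0.1 for _ in range(3)]

def encode(x, tap_layer):
    """Run the stack and return the hidden vector produced by `tap_layer`."""
    h = x
    hiddens = []
    for w in layers:
        h = np.tanh(h @ w)   # stand-in for an LSTM step
        hiddens.append(h)
    return hiddens[tap_layer]

x = rng.standard_normal(8)
h_mid = encode(x, tap_layer=1)   # hidden vector from an intermediate layer
h_top = encode(x, tap_layer=2)   # hidden vector from the top layer
```

Either hidden vector could serve as the encoded output value sent onward for decoding; which layer is tapped is a design choice.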
- the ASR model 1410 may generate a text string from a speech signal using the encoder 1411 and the decoder 1412.
- the NLU model 1420 may interpret a text string output from the ASR model 1410.
- the NLU model 1420 may interpret a text string provided from an ASR model in the server 2000.
- when the server 2000 provides an output value of its ASR model to the device 1000, the NLU model 1420 may interpret the output value of the ASR model provided by the server 2000.
- the speech recognition evaluation module 1430 may evaluate an output value of the encoder 1411 and an output value of the ASR model 1410.
- the speech recognition evaluation module 1430 may obtain a confidence level of the output value of the encoder 1411.
- a confidence level of an encoded output value may include, for example, a numerical value indicating a degree of matching between the encoded output value and an input speech, and the confidence level may include, for example, a confidence score, but is not limited thereto.
- the confidence level of the encoded output value may represent a degree of matching between text represented by the encoded output value and the input speech. For example, when the encoder 1411 of the ASR model 1410 in the device 1000 is trained, a CTC loss function may be used.
- the speech recognition evaluation module 1430 may calculate a confidence level of an encoder output value based on the text obtained from the projection layer connected to the output terminal of the encoder.
- a method of calculating a confidence level is not limited thereto, and for example, the speech recognition evaluation module 1430 may calculate a confidence level directly from an encoder output value using a preset algorithm.
- Encoder output values may be respectively derived from the layers in the encoder 1411, and the speech recognition evaluation module 1430 may calculate a confidence level of each of the derived encoder output values. Furthermore, it may be determined, based on a confidence level calculated by the speech recognition evaluation module 1430, whether to provide an encoder output value to the server 2000.
- the speech recognition evaluation module 1430 may calculate a confidence level of a text string output from the ASR model 1410.
- a confidence level of a text string may be a numerical value indicating the degree of matching between the text string and an input speech, and the confidence level may include, for example, a confidence score, but is not limited thereto. It may be determined, based on a confidence level calculated by the speech recognition evaluation module 1430, whether the device 1000 is to use an output value of the ASR model 1410 thereof.
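One common way to turn projection-layer (softmax) outputs into a single confidence score is to average the per-frame maximum posterior probability. The patent does not prescribe this particular formula; the sketch below is an assumption for illustration:

```python
import numpy as np

# Illustrative confidence score: mean of the best-token probability
# across frames of a projection layer's output.

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def confidence_score(frame_logits):
    """frame_logits: (num_frames, vocab_size) array from a projection layer."""
    probs = softmax(frame_logits)
    return float(probs.max(axis=-1).mean())  # mean best-token probability

rng = np.random.default_rng(1)
score = confidence_score(rng.standard_normal((20, 30)))
assert 0.0 < score <= 1.0
```

A score near 1.0 means the projection layer was confident in one token per frame; a score near 1/vocab_size means the distribution was nearly uniform, suggesting the encoder output value matches the input speech poorly.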
- the NLU determination module 1440 may determine whether NLU processing is to be performed on a text string output from the ASR model 1410 at the device 1000 or the server 2000.
- the NLU determination module 1440 may determine whether a domain related to an output value of the ASR model 1410 is a domain in which the device 1000 is able to perform NLU processing.
- the NLU determination module 1440 may determine that the device 1000 is to perform NLU processing.
- the NLU determination module 1440 may determine that the device 1000 is not to perform NLU processing.
- the domain identification module 1450 may identify a domain related to an encoder output value derived from the encoder 1411.
- the domain identification module 1450 may identify a domain related to an encoder output value based on the encoder output value.
- the domain identification module 1450 may obtain text from a projection layer connected to an output terminal of an encoder in the device 1000 and identify a domain related to an encoder output value using the obtained text.
- the domain identification module 1450 may identify a domain related to an encoder output value based on a domain confidence level for text generated from the encoder output value.
- the domain identification module 1450 may calculate a confidence score indicating the degree of relevance of text obtained from an encoder output value to a preset domain.
- the domain identification module 1450 may identify a domain related to the encoder output value based on a domain confidence level calculated for the preset domain.
- the domain identification module 1450 may calculate a domain confidence level for an output value of the ASR model 1410 in the device 1000.
- the domain identification module 1450 may calculate a domain confidence level indicating the degree of relevance of the output value of the ASR model 1410 in the device 1000 to a domain pre-registered for NLU processing.
- the domain identification module 1450 may obtain, for each of a plurality of pre-registered domains, a confidence score indicating the degree of relevance of an output value of the ASR model 1410 in the device 1000 to the pre-registered domain.
- the domain identification module 1450 may calculate a domain confidence level for an output value of an ASR model in the server 2000.
- the domain identification module 1450 may calculate a domain confidence level indicating the degree of relevance of the output value of the ASR model in the server 2000 to a domain pre-registered for NLU processing.
- the domain identification module 1450 may obtain, for each of a plurality of pre-registered domains, a confidence score indicating the degree of relevance of an output value of an ASR model in the server 2000 to the pre-registered domain.
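A minimal sketch of scoring text against pre-registered domains is keyword overlap: each domain's score is the fraction of its vocabulary found in the recognized text. A real domain identification module would use a trained classifier; the keyword lists and scoring rule here are invented for illustration:

```python
# Hypothetical keyword-overlap domain scorer: not the patent's method,
# just one simple way to produce per-domain confidence scores.

DOMAIN_KEYWORDS = {
    "music": {"play", "song", "album", "artist"},
    "weather": {"weather", "rain", "temperature", "forecast"},
}

def domain_confidences(text):
    """Return a confidence score in [0, 1] for each registered domain."""
    words = set(text.lower().split())
    return {
        domain: len(words & keywords) / len(keywords)
        for domain, keywords in DOMAIN_KEYWORDS.items()
    }

def identify_domain(text):
    scores = domain_confidences(text)
    return max(scores, key=scores.get)

print(identify_domain("play the new album by this artist"))  # → music
```

The same per-domain scores could then feed the NLU determination step: if the winning domain is one the device can process, NLU runs on-device.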
- the domain registration module 1460 may register a domain with the device 1000.
- the encoder output value generated after the registration may be processed using a decoder in the device 1000 without being transmitted to the server 2000.
- the domain registration module 1460 may register a particular domain with the device 1000 based on a result of evaluating an encoder output value. As another example, the domain registration module 1460 may register a particular domain with the device 1000 based on a result of evaluating a text string output from the ASR model 1410. As another example, the domain registration module 1460 may register, based on a result of evaluating a text string received from the server 2000, a domain related to the evaluated text string with the device 1000.
- functions related to AI may operate via a processor and a memory.
- the processor may be configured as one or a plurality of processors.
- the one or plurality of processors may include, for example, and without limitation, a general-purpose processor such as a central processing unit (CPU), an application processor (AP), or a digital signal processor (DSP); a dedicated graphics processor such as a graphics processing unit (GPU) or a vision processing unit (VPU); or a dedicated AI processor such as a neural processing unit (NPU).
- the one or plurality of processors may control input data to be processed according to predefined operation rules or an AI model stored in the memory.
- the dedicated AI processor may be designed with a hardware structure specialized for processing a particular AI model.
- the predefined operation rules or AI model may be created via a training process.
- the creation via the training process may refer to the predefined operation rules or AI model being created by training a basic AI model on a large number of training data via a learning algorithm, such that the rules or model are set (e.g., configured) to perform desired characteristics (or a desired purpose).
- the training process may be performed by an apparatus itself in which AI is performed or via a separate server and/or system.
- Examples of a learning algorithm may include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.
- An AI model may include a plurality of neural network layers.
- Each of the neural network layers has a plurality of weight values and may perform neural network computations via calculations between a result of computations in a previous layer and its weight values.
- a plurality of weight values assigned to each of the neural network layers may be optimized based on a result of training the AI model. For example, a plurality of weight values may be modified to reduce or minimize a loss or cost value obtained by the AI model during a training process.
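The training process described above — repeatedly adjusting weight values to reduce a loss value — can be illustrated with a one-parameter gradient-descent sketch. The quadratic loss and learning rate are illustrative only:

```python
# Minimal gradient-descent sketch: a single weight is repeatedly updated
# in the direction that reduces the loss (w - target)**2.

def train(w, target, lr=0.1, steps=100):
    for _ in range(steps):
        grad = 2 * (w - target)   # derivative of the loss w.r.t. w
        w = w - lr * grad         # update weight to reduce the loss
    return w

w_final = train(w=5.0, target=1.0)
print(round(w_final, 4))  # → 1.0
```

An AI model's layers each carry many such weights, and training updates them all at once via backpropagation, but the principle per weight is the same.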
- An artificial neural network may include a deep neural network (DNN) and may include, for example, and without limitation, a convolutional neural network (CNN), a recurrent neural network (RNN), a restricted Boltzmann machine (RBM), a deep belief network (DBN), a bidirectional recurrent DNN (BRDNN), deep Q-networks (DQN), or the like, but is not limited thereto.
- DNN deep neural network
- CNN convolutional neural network
- RNN recurrent neural network
- RBM restricted Boltzmann machine
- DBN deep belief network
- BRDNN bidirectional recurrent DNN
- DQN deep Q-networks
- Embodiments of the disclosure may be implemented through computer-readable recording media having recorded thereon computer-executable instructions such as program modules that are executed by a computer.
- the computer-readable recording media may be any available media that can be accessed by a computer and include both volatile and nonvolatile media and both detachable and non-detachable media.
- the computer-readable recording media may include computer storage media and communication media.
- the computer storage media include both volatile and nonvolatile media and both detachable and non-detachable media implemented by any method or technique for storing information such as computer-readable instructions, data structures, program modules, or other data.
- Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal.
- unit may include a hardware component such as a processor or circuit and/or a software component that is executed by a hardware component such as a processor.
Abstract
Provided are a system and method for recognizing a user's voice. A method, performed by a server, of providing a device with a text string for an input speech signal includes: receiving, from the device, an encoder output value derived from an encoder of an end-to-end automatic speech recognition (ASR) model included in the device; identifying a domain corresponding to the received encoder output value; selecting a decoder corresponding to the identified domain from among a plurality of decoders of an end-to-end ASR model included in the server; obtaining a text string from the received encoder output value using the selected decoder; and providing the obtained text string to the device.
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP20852027.0A EP3980991B1 (fr) | 2019-08-13 | 2020-08-10 | Système et procédé pour reconnaître la voix d'un utilisateur |
| CN202080055338.0A CN114207711B (zh) | 2019-08-13 | 2020-08-10 | 用于识别用户的语音的系统和方法 |
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201962886027P | 2019-08-13 | 2019-08-13 | |
| US62/886,027 | 2019-08-13 | ||
| KR10-2019-0146177 | 2019-11-14 | ||
| KR1020190146177A KR20210019920A (ko) | 2019-08-13 | 2019-11-14 | 사용자의 음성을 인식하는 시스템 및 방법 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2021029642A1 true WO2021029642A1 (fr) | 2021-02-18 |
Family
ID=74568424
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/KR2020/010565 Ceased WO2021029642A1 (fr) | 2019-08-13 | 2020-08-10 | Système et procédé pour reconnaître la voix d'un utilisateur |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US11532310B2 (fr) |
| EP (1) | EP3980991B1 (fr) |
| CN (1) | CN114207711B (fr) |
| WO (1) | WO2021029642A1 (fr) |
Families Citing this family (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11328721B2 (en) * | 2020-02-04 | 2022-05-10 | Soundhound, Inc. | Wake suppression for audio playing and listening devices |
| CN114503193A (zh) * | 2020-02-13 | 2022-05-13 | 谷歌有限责任公司 | 多流递归神经网络换能器 |
| WO2022131851A1 (fr) | 2020-12-18 | 2022-06-23 | Samsung Electronics Co., Ltd. | Procédé et systèmes de décodage d'une interrogation audio |
| US11562734B2 (en) * | 2021-01-04 | 2023-01-24 | Kwai Inc. | Systems and methods for automatic speech recognition based on graphics processing units |
| US11830480B2 (en) * | 2021-02-17 | 2023-11-28 | Kwai Inc. | Systems and methods for accelerating automatic speech recognition based on compression and decompression |
| US12112742B2 (en) | 2021-03-03 | 2024-10-08 | Samsung Electronics Co., Ltd. | Electronic device for correcting speech input of user and operating method thereof |
| CN115762528A (zh) * | 2022-11-14 | 2023-03-07 | 北京宾理信息科技有限公司 | 用于语音识别的方法、装置、设备、车辆和介质 |
Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20050182628A1 (en) * | 2004-02-18 | 2005-08-18 | Samsung Electronics Co., Ltd. | Domain-based dialog speech recognition method and apparatus |
| US20170148431A1 (en) * | 2015-11-25 | 2017-05-25 | Baidu Usa Llc | End-to-end speech recognition |
| US20170256254A1 (en) | 2016-03-04 | 2017-09-07 | Microsoft Technology Licensing, Llc | Modular deep learning model |
| KR20180001889A (ko) * | 2016-06-28 | 2018-01-05 | 삼성전자주식회사 | 언어 처리 방법 및 장치 |
| US20180057683A1 (en) | 2015-03-26 | 2018-03-01 | Dsm Ip Assets B.V. | Cover for a tablet or a mobile phone or a laptop bottom and a watch strap consisting at least partly of a polymer composition |
| US20180180740A1 (en) | 2016-12-23 | 2018-06-28 | X Development Llc | Detecting Sensor Orientation Characteristics Using Marker-Based Localization |
| JP2019120841A (ja) * | 2018-01-09 | 2019-07-22 | 国立大学法人 奈良先端科学技術大学院大学 | スピーチチェイン装置、コンピュータプログラムおよびdnn音声認識・合成相互学習方法 |
| US10366692B1 (en) | 2017-05-15 | 2019-07-30 | Amazon Technologies, Inc. | Accessory for a voice-controlled device |
Family Cites Families (38)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPS5834852A (ja) | 1981-08-25 | 1983-03-01 | Japan Synthetic Rubber Co Ltd | 高弾性率ゴム組成物 |
| JP2000348141A (ja) | 1999-06-08 | 2000-12-15 | Toshiba Corp | 入力情報の予測方法と装置、ならびにプログラム記憶媒体 |
| AU7486200A (en) * | 1999-09-22 | 2001-04-24 | Conexant Systems, Inc. | Multimode speech encoder |
| KR20030012064A (ko) | 2001-07-30 | 2003-02-12 | 와이더덴닷컴 주식회사 | 서버-씬 클라이언트 구성용 분산형 음성 인식 시스템 |
| JP4012143B2 (ja) | 2003-12-16 | 2007-11-21 | キヤノン株式会社 | 情報処理装置およびデータ入力方法 |
| JP4274962B2 (ja) * | 2004-02-04 | 2009-06-10 | 株式会社国際電気通信基礎技術研究所 | 音声認識システム |
| GB0416720D0 (en) * | 2004-07-27 | 2004-09-01 | British Telecomm | Method and system for voice over IP streaming optimisation |
| US8352273B2 (en) | 2005-07-26 | 2013-01-08 | Honda Motor Co., Ltd. | Device, method, and program for performing interaction between user and machine |
| KR20090035222A (ko) * | 2007-10-05 | 2009-04-09 | 한국전자통신연구원 | 음성 인식 시스템 및 방법 |
| US7933777B2 (en) | 2008-08-29 | 2011-04-26 | Multimodal Technologies, Inc. | Hybrid speech recognition |
| US9171541B2 (en) | 2009-11-10 | 2015-10-27 | Voicebox Technologies Corporation | System and method for hybrid processing in a natural language voice services environment |
| JP6317111B2 (ja) | 2011-02-22 | 2018-04-25 | スピーク・ウィズ・ミー・インコーポレイテッドSpeak With Me,Inc. | ハイブリッド型クライアントサーバ音声認識 |
| US9202465B2 (en) * | 2011-03-25 | 2015-12-01 | General Motors Llc | Speech recognition dependent on text message content |
| KR101197010B1 (ko) | 2011-03-30 | 2012-11-05 | 포항공과대학교 산학협력단 | 음성 처리 장치 및 방법 |
| US20150149167A1 (en) | 2011-03-31 | 2015-05-28 | Google Inc. | Dynamic selection among acoustic transforms |
| US9620122B2 (en) | 2011-12-08 | 2017-04-11 | Lenovo (Singapore) Pte. Ltd | Hybrid speech recognition |
| US8924211B2 (en) | 2012-07-09 | 2014-12-30 | Nuance Communications, Inc. | Detecting potential significant errors in speech recognition results |
| US9263036B1 (en) | 2012-11-29 | 2016-02-16 | Google Inc. | System and method for speech recognition using deep recurrent neural networks |
| KR20140077423A (ko) * | 2012-12-14 | 2014-06-24 | 한국전자통신연구원 | 음성인식기의 언어모델 저장방법 |
| US9131369B2 (en) | 2013-01-24 | 2015-09-08 | Nuance Communications, Inc. | Protection of private information in a client/server automatic speech recognition system |
| US9430465B2 (en) | 2013-05-13 | 2016-08-30 | Facebook, Inc. | Hybrid, offline/online speech translation system |
| US9305554B2 (en) | 2013-07-17 | 2016-04-05 | Samsung Electronics Co., Ltd. | Multi-level speech recognition |
| JP6222821B2 (ja) | 2013-10-10 | 2017-11-01 | 日本放送協会 | 誤り修正モデル学習装置、及びプログラム |
| KR102215579B1 (ko) | 2014-01-22 | 2021-02-15 | 삼성전자주식회사 | 대화형 시스템, 디스플레이 장치 및 그 제어 방법 |
| KR102069700B1 (ko) | 2014-05-20 | 2020-01-23 | 한국전자통신연구원 | 특화영역 교체형 음성인식 시스템, 모바일 장치 및 그 방법 |
| US9710464B1 (en) * | 2016-08-29 | 2017-07-18 | Le Technology, Inc. | Language translation of encoded voice packets during a cellular communication session |
| US10957322B2 (en) | 2016-09-09 | 2021-03-23 | Sony Corporation | Speech processing apparatus, information processing apparatus, speech processing method, and information processing method |
| WO2018051841A1 (fr) | 2016-09-16 | 2018-03-22 | 日本電信電話株式会社 | Dispositif d'apprentissage de modèle, procédé associé et programme |
| US10127908B1 (en) | 2016-11-11 | 2018-11-13 | Amazon Technologies, Inc. | Connected accessory for a voice-controlled device |
| KR20180071029A (ko) | 2016-12-19 | 2018-06-27 | 삼성전자주식회사 | 음성 인식 방법 및 장치 |
| US10032451B1 (en) | 2016-12-20 | 2018-07-24 | Amazon Technologies, Inc. | User recognition for speech processing systems |
| CN106896933B (zh) * | 2017-01-19 | 2019-12-06 | 深圳情景智能有限公司 | 将语音输入转换成文本输入的方法、装置和语音输入设备 |
| US10839790B2 (en) * | 2017-02-06 | 2020-11-17 | Facebook, Inc. | Sequence-to-sequence convolutional architecture |
| US10706840B2 (en) * | 2017-08-18 | 2020-07-07 | Google Llc | Encoder-decoder models for sequence to sequence mapping |
| US10540970B2 (en) * | 2017-12-12 | 2020-01-21 | Amazon Technologies, Inc. | Architectures and topologies for vehicle-based, voice-controlled devices |
| US10672388B2 (en) | 2017-12-15 | 2020-06-02 | Mitsubishi Electric Research Laboratories, Inc. | Method and apparatus for open-vocabulary end-to-end speech recognition |
| CN109949796B (zh) * | 2019-02-28 | 2021-04-06 | 天津大学 | 一种基于藏文部件的端到端架构拉萨方言语音识别方法 |
| KR20210019920A (ko) * | 2019-08-13 | 2021-02-23 | 삼성전자주식회사 | 사용자의 음성을 인식하는 시스템 및 방법 |
- 2020
- 2020-08-10 EP EP20852027.0A patent/EP3980991B1/fr active Active
- 2020-08-10 US US16/988,929 patent/US11532310B2/en active Active
- 2020-08-10 WO PCT/KR2020/010565 patent/WO2021029642A1/fr not_active Ceased
- 2020-08-10 CN CN202080055338.0A patent/CN114207711B/zh active Active
Patent Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20050182628A1 (en) * | 2004-02-18 | 2005-08-18 | Samsung Electronics Co., Ltd. | Domain-based dialog speech recognition method and apparatus |
| US20180057683A1 (en) | 2015-03-26 | 2018-03-01 | Dsm Ip Assets B.V. | Cover for a tablet or a mobile phone or a laptop bottom and a watch strap consisting at least partly of a polymer composition |
| US20170148431A1 (en) * | 2015-11-25 | 2017-05-25 | Baidu Usa Llc | End-to-end speech recognition |
| US20170256254A1 (en) | 2016-03-04 | 2017-09-07 | Microsoft Technology Licensing, Llc | Modular deep learning model |
| KR20180001889A (ko) * | 2016-06-28 | 2018-01-05 | 삼성전자주식회사 | 언어 처리 방법 및 장치 |
| US20180180740A1 (en) | 2016-12-23 | 2018-06-28 | X Development Llc | Detecting Sensor Orientation Characteristics Using Marker-Based Localization |
| US10366692B1 (en) | 2017-05-15 | 2019-07-30 | Amazon Technologies, Inc. | Accessory for a voice-controlled device |
| JP2019120841A (ja) * | 2018-01-09 | 2019-07-22 | 国立大学法人 奈良先端科学技術大学院大学 | スピーチチェイン装置、コンピュータプログラムおよびdnn音声認識・合成相互学習方法 |
Non-Patent Citations (2)
| Title |
|---|
| KIM HOON: "[Kakao AI Report] Voice Recognition Method and Kakao i's Voice Engine", pages 1 - 9, XP055781397, Retrieved from the Internet <URL:https://brunch.co.kr/@kakao-it/105> * |
| See also references of EP3980991A4 |
Also Published As
| Publication number | Publication date |
|---|---|
| EP3980991B1 (fr) | 2024-01-03 |
| US11532310B2 (en) | 2022-12-20 |
| US20210050016A1 (en) | 2021-02-18 |
| CN114207711B (zh) | 2025-09-05 |
| EP3980991C0 (fr) | 2024-01-03 |
| CN114207711A (zh) | 2022-03-18 |
| EP3980991A4 (fr) | 2022-08-10 |
| EP3980991A1 (fr) | 2022-04-13 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| WO2021029642A1 (fr) | Système et procédé pour reconnaître la voix d'un utilisateur | |
| WO2020153736A1 (fr) | Procédé et dispositif de reconnaissance de la parole | |
| WO2020105856A1 (fr) | Appareil électronique pour traitement d'énoncé utilisateur et son procédé de commande | |
| WO2018124620A1 (fr) | Procédé et dispositif permettant d'émettre et de recevoir des données audio | |
| WO2019117466A1 (fr) | Dispositif électronique pour analyser la signification de la parole, et son procédé de fonctionnement | |
| WO2021029643A1 (fr) | Système et procédé de modification d'un résultat de reconnaissance vocale | |
| WO2020130447A1 (fr) | Procédé de fourniture de phrases basé sur un personnage et dispositif électronique de prise en charge de ce dernier | |
| EP3850622A1 (fr) | Procédé et dispositif de reconnaissance de la parole | |
| WO2021020877A1 (fr) | Système et procédé d'enregistrement de dispositif pour service d'assistant vocal | |
| WO2020116818A1 (fr) | Dispositif électronique et son procédé de commande | |
| WO2020060151A1 (fr) | Système et procédé de fourniture d'un service d'assistant vocal | |
| WO2018097439A1 (fr) | Dispositif électronique destiné à la réalisation d'une traduction par le partage d'un contexte d'émission de parole et son procédé de fonctionnement | |
| WO2021107390A1 (fr) | Dispositif électronique et procédé de commande du dispositif électronique | |
| WO2021158040A1 (fr) | Dispositif électronique fournissant un énoncé correspondant au contexte d'une conversation, et procédé d'utilisation associé | |
| WO2018124464A1 (fr) | Dispositif électronique et procédé de fourniture de service de recherche de dispositif électronique | |
| WO2023101343A1 (fr) | Procédé et appareil d'exécution de journalisation de locuteur sur des signaux vocaux à bande passante mixte | |
| WO2023038292A1 (fr) | Dispositif électronique et procédé de traitement de la parole de dispositif électronique | |
| WO2022164192A1 (fr) | Dispositif et procédé pour fournir des phrases recommandées associées à une entrée d'énoncé d'un utilisateur | |
| WO2020101389A1 (fr) | Dispositif électronique d'affichage d'une image fondée sur la reconnaissance vocale | |
| WO2022131566A1 (fr) | Dispositif électronique et procédé de fonctionnement de dispositif électronique | |
| WO2022186435A1 (fr) | Dispositif électronique pour corriger une entrée vocale d'un utilisateur et son procédé de fonctionnement | |
| EP3545519A1 (fr) | Procédé et dispositif permettant d'émettre et de recevoir des données audio | |
| WO2022139122A1 (fr) | Dispositif électronique et son procédé de commande | |
| WO2021246812A1 (fr) | Solution et dispositif d'analyse de niveau de positivité d'actualités utilisant un modèle nlp à apprentissage profond | |
| WO2022250383A1 (fr) | Dispositif électronique et procédé de commande de dispositif électronique |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20852027 Country of ref document: EP Kind code of ref document: A1 |
|
| ENP | Entry into the national phase |
Ref document number: 2020852027 Country of ref document: EP Effective date: 20220104 |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| WWG | Wipo information: grant in national office |
Ref document number: 202080055338.0 Country of ref document: CN |