WO2002089112A1 - Adaptive learning of language models for speech recognition - Google Patents
Adaptive learning of language models for speech recognition
- Publication number
- WO2002089112A1 WO2002089112A1 PCT/GB2002/002048 GB0202048W WO02089112A1 WO 2002089112 A1 WO2002089112 A1 WO 2002089112A1 GB 0202048 W GB0202048 W GB 0202048W WO 02089112 A1 WO02089112 A1 WO 02089112A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- adaptive learning
- user
- speech
- adaptive
- hybrid
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/065—Adaptation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/19—Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
Definitions
- This invention relates to pattern matching, and in particular to adaptive learning in pattern matching or pattern recognition systems. It is particularly, but not exclusively, applicable to adaptive learning of spoken utterances, for example in spoken language interfaces.
- Adaptive Learning is the ability to change certain behaviours as a consequence of historical actions.
- a spoken language interface is a system that allows a user to complete useful tasks by entering into a spoken dialogue with a machine.
- the system itself can be trained to adapt as it learns the behaviour patterns of users. It can learn to anticipate the kind of phrasing a user is likely to use, and it should be able to adapt its behaviour in real time - or by processing while the user is not logged on to the system - in response to specific events.
- the initially speaker and line independent acoustic models used in the speech pattern matching process can also adapt to accommodate the acoustic peculiarities of a particular speaker or line.
- the system can be designed to anticipate gradual user behavioural adaptation.
- users can be defined as beginners, intermediates, and advanced users and both the quantitative and qualitative nature of prompts can be tailored accordingly.
- prompts are automatically tailored to the task. More verbose help is available for the beginner, terse questions for the experienced user.
- This classification can distinguish between users and can also capture progressive behavioural changes within a single user. By detecting progressive behavioural changes, the system can be automatically adapted.
- Another form of AL concerns changing the prompts and interaction style on the basis of background noise and the strength of a mobile phone signal.
- the system 'understands' that para-conversational conditions are changing and adapts its conversational behaviour accordingly. For instance, if the user is using a mobile phone and there is a large amount of background noise, the system can be trained to predict that it will have difficulty recognising a multi-slot verbose utterance (an utterance with many data parameters, such as "I want to fly to Paris tomorrow at 3 pm", where Paris, tomorrow and 3 pm are slots; up to 8 slots may typically be filled). Adapting to these conditions, the system either plays a message asking the user to use single-slot terse speech until conditions improve, or dynamically and automatically limits the number of slots to a small number (2-3).
- Such system-level adaptive learning is used to guide the user's behaviour and improve recognition accuracy. It does this by automatically adapting the prompt wording.
- the prompts are stored as phrases and the phrase that is played is predicated on the condition (behavioural or para-conversational).
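- As a minimal sketch of how a stored phrase might be selected on the basis of such conditions, the following Python fragment chooses between verbose and terse prompts; the prompt texts, user levels and noise threshold are hypothetical illustrations, not values taken from this application:

```python
# Hypothetical sketch of condition-based prompt selection; prompt texts,
# user levels and the noise threshold are illustrative assumptions.
PROMPTS = {
    ("flight_booking", "beginner"): ("Please tell me your destination city. "
                                     "For example, say 'I want to fly to Paris'."),
    ("flight_booking", "advanced"): "Destination?",
}

def select_prompt(task, user_level, noise_level, noise_threshold=0.7):
    """Pick a stored phrase predicated on behavioural and para-conversational conditions."""
    prompt = PROMPTS.get((task, user_level), PROMPTS[(task, "beginner")])
    if noise_level > noise_threshold:
        # Para-conversational adaptation: ask for terse, single-slot speech.
        prompt += " Conditions are noisy, so please give one detail at a time."
    return prompt

print(select_prompt("flight_booking", "advanced", noise_level=0.9))
```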
- Typical embodiments of an SLI incorporate components that recognise the natural language of a caller via automatic speech recognition (ASR) and interpret the spoken utterance to allow the caller to complete useful tasks.
- In order to recognise what the user says, the ASR component must be provided with a model of the language that it expects to hear.
- the SLI can contain grammars to model language. To allow users to speak naturally, these grammars must specify a large number of utterances to capture the variety of language styles that different users employ: some will say "I want to send an email", others might say "May I send an email" or just "send an email".
- Adaptive Learning modifies the language model to selectively listen for utterances that match the style of speech used by callers when talking to the SLI.
- Perplexity is a measure of the average branching factor of the model - roughly the average number of words that the recogniser must be attempting to match to a segment of the speech signal at any moment in time.
- reducing the perplexity of a language model used in ASR will lead to an increase in the recognition accuracy.
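- For concreteness, the standard perplexity computation can be sketched as follows (this is the textbook formulation, not code from the application); the uniform case makes the "average branching factor" reading visible:

```python
import math

def perplexity(word_probs):
    """Perplexity over a test sequence, given the model's probability for
    each word in that sequence; lower values mean a smaller effective
    branching factor for the recogniser."""
    log_sum = sum(math.log2(p) for p in word_probs)
    return 2 ** (-log_sum / len(word_probs))

# A model that assigns each word probability 0.1 has perplexity 10:
# the recogniser is effectively choosing among 10 words at each step.
print(perplexity([0.1] * 5))  # 10.0
```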
- In a dialogue system, the process whereby the user and the system negotiate a common "understanding" of each other's informational state is known as grounding. Grounding may be achieved explicitly, e.g. "Did you say you want to fly to Paris?", or implicitly, e.g. "On which day would you like to travel to Paris?", and depending on the particular dialogue, the process of grounding a piece of information may extend, or be delayed, e.g. for several turns.
- a model such as, for example, a language model or other model used in interactive pattern recognition, may be trained and adapted for an individual user or a group of users.
- the interactions of a user can be used to modify an original model so that the likelihood of correct interpretations/recognitions is increased and the likelihood of incorrect interpretations/recognitions is decreased.
- In supervised adaptation, user input is manually classified as being either correct or incorrect. The classified input is then used to adapt an original model that is used to interpret subsequent user input.
- supervised adaptation is clearly undesirable for complex systems dealing with many users, since it is too slow and labour-intensive to be practicable.
- unsupervised adaptation may be used.
- In unsupervised adaptation, a confidence parameter is chosen or derived that indicates a confidence measure associated with a plurality of possible interpretations of user input. The confidence parameter can be used to weight adaptation material, or to select a subset of material that is most suitable for adaptation, to provide a model that is subsequently used to interpret user input.
- the confidence parameter may be used to provide an adaptive learning mechanism that is semi-supervised, i.e. neither fully supervised nor fully unsupervised.
- under such semi-supervision, in which grounding may be linked with the confidence parameter, the quality of adaptation material may be improved, thereby improving recogniser accuracy in comparison to known unsupervised learning systems.
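- A minimal sketch of confidence-based selection of adaptation material is given below; the threshold value and data layout are assumptions for illustration only:

```python
# Illustrative sketch: keep only hypotheses confident enough to serve
# as adaptation material; the 0.8 threshold is an assumed value.
def select_adaptation_material(hypotheses, threshold=0.8):
    return [(text, conf) for text, conf in hypotheses if conf >= threshold]

hyps = [("i want to send an email", 0.93),
        ("i want to fend an email", 0.41)]
print(select_adaptation_material(hyps))  # retains only the confident hypothesis
```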
- an adaptive learning mechanism for use in a spoken language interface, wherein the adaptive learning mechanism comprises an automatic speech recognition mechanism operable to produce a weighted set of output hypotheses from a language model in response to an input from a source, and an analyser mechanism operable to analyse one or more weighted sets of output hypotheses so that respective weights associated with the output hypotheses may be adjusted to provide an updated language model.
- the source may be one or more user or caller, who may be identified each time he/she/they communicate with the adaptive learning mechanism. Each user/caller may be assigned a respective language model.
- the language model may be updated each time the source is identified.
- the adaptive learning mechanism may store weighted sets of output hypotheses, or information relating to them, for use as historical information relating to a user (or caller) or group of users (or callers). In contrast to known learning schemes, this allows the adaptive learning mechanism to modify the language model without explicit confirmation of the user's utterances during a current dialogue.
- the adaptive learning mechanism may be operable to adjust individual weighted output hypotheses in weighted sets according to one or more associated hybrid confidence measure.
- the language model may be modified in dependence upon one or more hybrid confidence measure.
- the hybrid confidence measure may be a single parameter derived from a current utterance and other utterances.
- a hybrid confidence measure may be derived from a current utterance and preceding and/or succeeding utterances derived from previous dialogues.
- the adaptive learning mechanism may be operable to update the language model or models in response to non-explicit instructions from the source or sources.
- a spoken language interface mechanism for providing communication between a user and one or more software application, comprising an adaptive learning mechanism according to the first aspect of the invention and a speech generation mechanism for providing at least one response to the user.
- a computer program product comprising a computer usable medium having computer readable program code embodied in the computer usable medium.
- the computer readable program code comprises computer readable program code for causing at least one computer to provide the adaptive learning mechanism according to the first aspect of the invention or the spoken language interface mechanism according to the second aspect of the invention.
- the carrier medium may include at least one of the following set of media: a radio-frequency signal, an optical signal, an electronic signal, a magnetic disc or tape, solid-state memory, an optical disc, a magneto-optical disc, a compact disc and a digital versatile disc.
- a computer system configured to provide the adaptive learning mechanism according to the first aspect of the invention or the spoken language interface mechanism according to the second aspect of the invention.
- a method for providing adaptive learning in a spoken language interface comprising producing a weighted set of output hypotheses from a language model in response to an input from a source, analysing the weighted set of output hypotheses in dependence upon any previous input from the source, and adapting the language model for applying to any subsequent input from the source.
- the method may comprise adapting the language model each time the source is identified.
- the method may comprise providing a plurality of language models, each of the plurality of language models being associated with a respective source or group of sources.
- the source or sources may comprise one or more user or caller.
- the method may comprise updating the language model or models in response to non-explicit instructions from the source or sources.
- the method may comprise adjusting individual weighted output hypotheses in weighted sets according to one or more hybrid confidence measure.
- Embodiments and preferred embodiments of the invention have the advantage that ASR accuracy is improved by automatically modelling and classifying a user's language profile.
- a user's language style is monitored to selectively tune the recogniser to listen out for the user's preferred utterances. This contrasts with prior art systems which have poor accuracy and use grammars that equally weight a large vocabulary of utterances, or prior art adaptive systems that require human intervention in the adaptation process.
- the term "utterances" as used herein includes spoken words and speech as well as sounds, abstractions or parts of words or speech.
- Figure 1 is a schematic view of a spoken language interface
- Figure 2 is a logical model of the SLI architecture
- Figure 3 is a more detailed view of the SLI architecture.
- Figure 4 shows a graph of a uniform probability distribution
- Figure 5 shows a graph of a probability of utterance across a language set
- Figure 6 shows a graph of probability of utterance for a given caller
- Figure 7 shows a computer system that may be used to implement embodiments of the invention.
- Figure 8 shows an adaptive learning mechanism according to an embodiment of the present invention receiving an input from a sound source.
- the system schematically outlined in Figure 1 is a spoken language interface intended for communication with applications via mobile, satellite, or landline telephone.
- communication is via a mobile telephone 18 but any other voice telecommunications device such as a conventional telephone can be utilised.
- Calls to the system are handled by a telephony unit 20.
- Connected to the telephony unit are a Voice Controller 19, an Automatic Speech Recognition system (ASR) 22 and an automatic speech generation system (ASG) 26.
- the ASR 22 and ASG systems are each connected to the voice controller 19.
- a dialogue manager 24 is connected to the voice controller 19 and also to a spoken language interface (SLI) repository 30, a personalisation and adaptive learning unit 32 which is also attached to the SLI repository 30, and a session and notification manager 28.
- the Dialogue Manager is also connected to a plurality of Application Managers AM, 34 each of which is connected to an application which may be content provision external to the system.
- the content layer includes e-mail, news, travel, information, diary, banking etc. The nature of the content provided is not important to the principles of the invention.
- the SLI repository is also connected to a development suite 35.
- FIG 2 provides a more detailed overview of the architecture of the system.
- the automatic speech generation unit 26 of Figure 1 includes a basic text-to-speech (TTS) unit and a batch TTS unit 120, connected to a prompt cache 124 and an audio player 122.
- pre-recorded speech may be played to the user under the control of the voice control 19. In the embodiment illustrated, a mixture of pre-recorded voice and TTS is used.
- the system then comprises three levels: a session level, an application level and a non-application level.
- the session level comprises a location manager 126 and a dialogue manager 128.
- the session level also includes an interactive device control 130 and a session manager 132 which includes the functions of user identification and Help Desk.
- the application layer comprises the application framework 134 under which an application manager controls an application. Many application managers and applications will be provided, such as UMS (Unified Messaging Service), Call connect & conferencing, e-Commerce, Dictation etc.
- the non-application level 124 comprises a back office subsystem 140 which includes functions such as reporting, billing, account management, system administration, "push" advertising and current user profile.
- a transaction subsystem 142 includes a transaction log together with a transaction monitor and message broker.
- an activity log 144 and a user profile repository 146 communicate with an adaptive learning unit 148.
- the adaptive learning unit also communicates with the dialogue manager 128.
- a personalisation module 150 also communicates with the user profiles repository 146 and the dialogue manager 128.
- the voice controller allows the system to be independent of the ASR 22 and TTS 26 by providing an interface to either proprietary or non-proprietary speech recognition, text-to-speech and telephony components.
- the TTS may be replaced by, or supplemented by, recorded voice.
- the voice control also provides for logging and assessing call quality, and will optimise the performance of the ASR.
Spoken Language Interface Repository 30
- grammars (that is, constructs and user utterances for which the system listens), prompts and workflow descriptors are stored as data in a database rather than written in time-consuming ASR/TTS-specific scripts.
- multiple languages can be readily supported with greatly reduced development time, a multi-user development environment is facilitated and the database can be updated at any time to reflect new or updated applications without taking the system down.
- the data is stored in a notation independent form.
- the data is converted or compiled between the repository and the voice control to the optimal notation for the ASR being used. This enables the system to be ASR independent.
- the voice engine is effectively dumb as all control comes from the dialogue manager via the voice control.
- the dialogue manager controls the dialogue across multiple voice servers and other interactive servers (e.g. WAP, Web, etc.).
- As well as controlling dialogue flow, it controls the steps required for a user to complete a task through mixed initiative - by permitting the user to change initiative with respect to specifying a data element (e.g. destination city for travel).
- the Dialog Manager may support comprehensive mixed initiative, allowing the user to change the topic of conversation across multiple applications while maintaining state representations of where the user left off in the many domain-specific conversations. Currently, state of conversation is maintained as initiative is changed across two applications. Within the system, the dialogue manager controls the workflow.
- the adaptive learning agent collects user speaking data from call data records. This data, collected from a large domain of calls (thousands), provides the general profile of language usage across the population of speakers. This profile, or mean language model, forms the basis for the first step in adjusting the language model probabilities to improve ASR accuracy.
- the individual user's profile is generated and adaptively tuned across the user's subsequent calls.
- the dialog manager includes a personalisation engine. Given the user demographics (age, sex, dialect), a specific personality tuned to the characteristics of that user's demographic group is invoked.
- the dialog manager also allows dialogue structures and applications to be updated or added without shutting the system down. It enables users to move easily between contexts (for example from flight booking to calendar); hang up and resume the conversation at any point; specify information either step-by-step or in one complex sentence; cut in and direct the conversation; or pause the conversation temporarily.
- the telephony component includes the physical telephony interface and the software API that controls it.
- the physical interface controls inbound and outbound calls, handles conferencing, and other telephony related functionality.
- the Session Manager initiates and maintains user and application sessions. These are persistent in the event of a voluntary or involuntary disconnection. They can reinstate the call at the position it had reached in the system at any time within a given period, for example 24 hours.
- a major problem in achieving this level of session storage and retrieval relates to retrieving a session in which a conversation is stored while either a dialogue structure, a workflow structure or an application manager has been upgraded. In the preferred embodiment this problem is overcome through versioning of dialogue structures, workflow structures and application managers. The system maintains a count of active sessions for each version and only retires old versions once the version's count reaches zero.
- An alternative, which may be implemented, requires new versions of dialogue structures, workflow structures and application managers to supply upgrade agents. These agents are invoked by the session manager whenever it encounters old versions in a stored session. A log is kept by the system of the most recent version number. It may be beneficial to implement a combination of these solutions: the former for dialogue structures and workflow structures, and the latter for application managers.
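- The first, reference-counting solution can be sketched as follows (the class and method names are hypothetical; the description above specifies only the counting behaviour):

```python
# Hypothetical sketch: each dialogue/workflow version keeps a count of
# active sessions and is retired only when that count reaches zero.
class VersionRegistry:
    def __init__(self):
        self.active = {}    # version -> count of active sessions
        self.latest = None  # most recently deployed version

    def deploy(self, version):
        self.active.setdefault(version, 0)
        self.latest = version

    def open_session(self, version):
        self.active[version] += 1

    def close_session(self, version):
        self.active[version] -= 1
        if self.active[version] == 0 and version != self.latest:
            del self.active[version]  # retire the old version

registry = VersionRegistry()
registry.deploy("v1")
registry.open_session("v1")
registry.deploy("v2")         # upgrade while a v1 session is still live
registry.close_session("v1")  # v1 is retired once its count reaches zero
print(registry.active)        # {'v2': 0}
```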
- the notification manager brings events to a user's attention, such as the movement of a share price by a predefined margin. This can be accomplished while the user is online, through interaction with the dialogue manager, or offline. Offline notification is achieved either by the system calling the user and initiating an online session or through other media channels, for example SMS, pager, fax, email or another device.
- Application Managers (AM)
- Each application manager (there is one for every content supplier) exposes a set of functions to the dialogue manager to allow business transactions to be realised (e.g. GetEmail(), SendEmail(), BookFlight(), GetNewsItem(), etc.).
- Functions require the DM to pass the complete set of parameters required to complete the transaction.
- the AM returns the successful result or an error code to be handled in a predetermined fashion by the DM.
- An AM is also responsible for handling some stateful information. For example, User A has been passed the first 5 unread emails. Additionally, it stores information relevant to a current user task. For example, flight booking details. It is able to facilitate user access to secure systems, such as banking, email or other. It can also deal with offline events, such as email arriving while a user is offline or notification from a flight reservation system that a booking has been confirmed. In these instances the AM's role is to pass the information to the Notification Manager.
- An AM also exposes functions to other devices or channels, such as web, WAP, etc. This facilitates the multi channel conversation discussed earlier.
- AMs are able to communicate with each other to facilitate aggregation of tasks. For example, booking a flight would primarily involve a flight booking AM, but this would directly utilise a Calendar AM in order to enter flight times into a user's Calendar.
- Because AMs are discrete components built, for example, as Enterprise Java Beans (EJBs), they can be added or updated while the system is live.
- the Transaction and Message Broker records every logical transaction, identifies revenue-generating transactions, routes messages and facilitates system recovery.
- Spoken conversational language reflects a good deal of a user's psychology, socio-economic background, dialect and speech style. These confounding factors are the reason an SLI is a challenge - a challenge met by embodiments of the invention.
- Embodiments of the invention provide a method of modelling these features and then tuning the system to effectively listen out for the most likely occurring features.
- a very large vocabulary of phrases encompassing all dialects and speech styles (verbose, terse or declarative) results in a complex listening task for any recogniser.
- User profiling solves the problem of recognition accuracy by tuning the recogniser to listen out for only the likely occurring subset of utterances in a large domain of options.
- the Help Assistant & Interactive Training component allows users to receive real-time interactive assistance and training.
- the component provides for simultaneous, multi-channel conversation (i.e. the user can talk through a voice interface and at the same time see a visual representation of their interaction through another device, such as the web).
- the system uses a commercially available database such as Oracle 8i from Oracle Corp.
- the Central Directory stores information on users, available applications, available devices, locations of servers and other directory type information.
- the System Administration - Applications provides centralised, web-based functionality to administer the custom build components of the system (e.g. Application Managers, Content Negotiators, etc.).
- Development Suite (35): this provides an environment for building spoken language systems, incorporating dialogue and prompt design, workflow and business process design, version control and system testing. It is also used to manage deployment of system updates and versioning.
- Rather than having to laboriously code likely occurring user responses in a cumbersome grammar (e.g. BNF - Backus-Naur Form), resulting in time-consuming, detailed syntactic specification, the development suite provides an intuitive, hierarchical, graphical display of language, leaving the modelling act to creatively uncover the precise utterances but reducing the coding act to simple entry of a data string.
- the development suite provides a Rapid Application Development (RAD) tool that combines language modelling with business process design (workflow).
- the Dialogue Subsystem manages, controls and provides the interface for human dialogue via speech and sound. Referring to Figure 1, it includes the dialogue manager, spoken language interface repository, session and notification managers, the voice controller 19, the Automatic Speech Recognition unit 22, the Automatic Speech Generation unit 26 and telephony components 20. The subsystem is illustrated in the more detailed architecture of the interface shown in Figure 3.
- a SLI refers to the hardware, software and data components that allow users to interact with a computer through spoken language.
- the term "interface” is particularly apt in the context of voice interaction, since the SLIacts as a conversational mediator, allowing information to be exchanged between user and system via speech. In its idealised form, this interface would be "invisible” and the interaction would, from the user's standpoint, appear as seamless and natural as a conversation with another person. In fact, one principle aim of most SLI projects is to create a system that is as near as possible to a human-human conversation.
- the objective for the SLI development team is to create the ears, mind and voice of the machine.
- the ears of the system are created by the Automatic Speech Recognition (ASR) System 22.
- the voice is created via the Automatic Speech Generation (ASG) software 26, and the mind is made up of the computational power of the hardware and the databases of information contained in the system.
- the present system uses software developed by other companies for its ASR and ASG. Suitable systems are available from Nuance and Lernout & Hauspie respectively. These systems will not be described further. However, it should be noted that the system allows great flexibility in the selection of these components from different vendors.
- the basic Text To Speech unit supplied, for example, by Lernout & Hauspie may be supplemented by an audio subsystem which facilitates batch recording of TTS (to reduce system latency and CPU requirements), streaming of audio data from other sources (e.g. music, audio news, etc.) and playing of audio output from standard digital audio file formats.
- a voice controller 19 and the dialogue manager 24 control and manage the dialogue between the system and the end user.
- the dialogue is dynamically generated at run time from a SLI repository which is managed by a separate component, the development suite.
- the ASR unit 22 comprises a plurality of ASR servers.
- the ASG unit 26 comprises a plurality of speech servers. Both are managed and controlled by the voice controller.
- the telephony unit 20 comprises a number of telephony board servers and communicates with the voice controller, the ASR servers and the ASG servers.
- Calls from users, shown as mobile phone 18, are handled initially by the telephony server 20, which makes contact with a free voice controller.
- the voice controller locates an available ASR resource.
- the voice controller 19 identifies the relevant ASR and ASG ports to the telephony server.
- the telephony server can now stream voice data from the user to the ASR server, and the ASG can stream audio to the telephony server.
- the voice controller, having established contact with the ASR and ASG servers, now informs the Dialogue Manager, which requests a session on behalf of the user from the session manager.
- the user is required to provide authentication information before this step can take place.
- This request is made to the session manager 28 which is represented logically at 132 in the session layer in Figure 2.
- the session manager server 28 checks with a dropped session store (not shown) whether the user has a recently dropped session.
- a dropped session could be caused by, for example, a user on a mobile entering a tunnel. This facility enables the user to be reconnected to a session without having to start over again.
- the dialogue manager 24 communicates with the application managers 34 which in turn communicate with the internal/external services or applications to which the user has access.
- the application managers each communicate with a business transaction log 50, which records transactions and with the notification manager 28b. Communications from the application manager to the notification manager are asynchronous and communications from the notification manager to the application managers are synchronous.
- the notification manager also sends communications asynchronously to the dialogue manager 24.
- the dialogue manager 24 has a synchronous link with the session manager 28a, which has a synchronous link with the notification manager.
- the dialogue manager 24 communicates with the adaptive learning unit 33 via an event log 52 which records user activity so that the system can learn from the user's interaction. This log also provides debugging and reporting information.
- the adaptive learning unit is connected to the personalisation module 34 which is in turn connected to the dialogue manager.
- Workflow 56, Dialogue 58 and Personalisation repositories 60 are also connected to the dialogue manager 24 through the personalisation module 554 so that a personalised view is always handled by the dialogue manager 24.
- the personalisation module can also write to the personalisation repository 60.
- the Development Suite 35 is connected to the workflow and dialogue repositories 56, 58 and implements functional specifications of applications, storing the relevant grammars, dialogues, workflow and application manager function references for each application in the repositories. It also facilitates the design and implementation of system, help, navigation and misrecognition grammars, dialogues, workflow and action references in the same repositories.
- the dialogue manager 24 provides the following key areas of functionality: the dynamic management of task-oriented conversation and dialogue; the management of synchronous conversations across multiple formats; and the management of resources within the dialogue subsystem. Each of these will now be considered in turn.
- the conversation a user has with a system is determined by a set of dialogue and workflow structures, typically one set for each application.
- the structures store the speech to which the user listens, the keywords for which the ASR listens and the steps required to complete a task (workflow).
- the DM determines its next contribution to the conversation or action to be carried out by the AMs.
- the system allows the user to move between applications or context using either hotword or natural language navigation.
- the complex issues relating to managing state as the user moves from one application to the next or even between multiple instances of the same application is handled by the DM.
- This state management allows users to leave an application and return to it at the same point as when they left.
- This functionality is extended by another component, the session manager, to allow users to leave the system entirely and return to the same point in an application when they log back in - this is discussed more fully later under Session Manager.
- the dialogue manager communicates via the voice controller with both the speech engine (ASG) 26 and the voice recognition engine (ASR) 22.
- the output from the speech generator 26 is voice data from the dialogue structures, which is played back to the user either as dynamic text to speech, as a pre-recorded voice or other stored audio format.
- the ASR listens for keywords or phrases that the user might say.
- the dialogue structures are predetermined (but stochastic language models, or hybrids of the two, could be employed in an implementation of the system).
- Predetermined dialogue structures or grammars are statically generated when the system is inactive. This is acceptable in prior art systems as scripts tended to be simple and did not change often once a system was activated.
- the dialogue structures can be complex and may be modified frequently when the system is activated.
- the dialogue structure is stored as data in a run time repository, together with the mappings between recognised conversation points and application functionality.
- the repository is dynamically accessed and modified by multiple sources even when active users are on-line.
- the dialogue subsystem comprises a plurality of voice controllers 19 and dialogue managers 24 (shown as a single server in Figure 3) .
- the ability to update the dialogue and workflow structures dynamically greatly increases the flexibility of the system.
- it allows updates of the voice interface and applications without taking the system down; and provides for adaptive learning functionality which enriches the voice experience to the user as the system becomes more responsive and friendly to a user's particular syntax and phraseology with time.
- a typical SLI system works as follows.
- a prompt is played to a user.
- the user can reply to the prompt, and be understood, as long as they use an utterance that has been predicted by the grammar writers. Analysis of likely coverage and usability of the grammars is discussed in another patent application.
- the accuracy with which the utterance will be recognised is determined, among other things, by the perplexity of the grammars. As large predicted pools of utterances (with correspondingly large perplexity values) are necessary to accommodate a wide and varied user group, this may have a negative impact on recognition accuracy.
- a frequency based adaptive learning algorithm can automatically assign probability-of- use values to each predicted utterance in a grammar.
- the algorithm is tuned on an on-going basis by the collection of thousands of utterances pooled across all users. Statistical analysis and summaries of the pooled user data result in the specification of weights applied to the ASR language model.
- When users use an utterance that has a high probability-of-use, the increased probability of the utterance in the grammar increases the probability of correct recognition, regardless of the size of the pool of predicted utterances. Naturally, this method leads to a reduction in recognition accuracy for users who make use of infrequently used utterances that have a low probability-of-use. Modelling of individual users addresses this.
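- A sketch of the frequency-based assignment of probability-of-use values follows; the corpus contents and the add-one smoothing are illustrative assumptions. Running the same computation over a single caller's history rather than the pooled data yields the user-dependent weights discussed below:

```python
from collections import Counter

# Sketch: assign a probability-of-use to each utterance predicted by a
# grammar, from observed usage; the smoothing scheme is an assumption.
def probability_of_use(grammar_utterances, observed_utterances):
    counts = Counter(u for u in observed_utterances if u in grammar_utterances)
    total = sum(counts.values())
    # Add-one smoothing so unseen utterances keep a small, non-zero weight.
    return {u: (counts[u] + 1) / (total + len(grammar_utterances))
            for u in grammar_utterances}

grammar = ["send an email", "i want to send an email", "may i send an email"]
observed = ["send an email"] * 8 + ["i want to send an email"] * 2
print(probability_of_use(grammar, observed))
```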
- AL can be used to probabilistically model the language of individual users. Over the course of repeated uses, the statistics of language use are monitored. Data collection can occur in real time during the course of the user-machine session.
- Given a model of an individual user's language, it is possible to automatically assign user-dependent probability-of-use values (by dynamic weighting of the language model) to each predicted utterance in a grammar.
- user-dependent pools of predicted utterances are automatically identified from within larger pools of predicted utterances. The identification of such sub-pools of utterances, and the assignment of user-dependent probabilities-of-use, significantly improve recognition accuracy, especially for very large pools of predicted utterances where a particular user uses unusual speech patterns.
- the AL methodologies described above generalise to novel grammars, such that an experienced user can start to interact with an entirely new application and immediately achieve high levels of recognition accuracy, because the language model derived in one application informs the distribution of probabilities in the new application's grammars.
- a disadvantage of AL techniques such as those described above is that they require human transcription of users' utterances into text so that the utterances may be used for adaptation.
- the following description discusses adapting a language model according to new observations made on-line of the user/caller input utterances, as opposed to adaptation using a hand-transcribed corpus of caller utterances.
- Selection of appropriate ASR hypotheses for real-time automatic adaptation is achieved using a hybrid confidence measure that enables the system itself to decide when sentence hypotheses are sufficiently reliable to be included in the adaptation data.
- the system can take account of both the quality and quantity of adaptation data when combining language models derived from different training corpora.
- the method relies on existing techniques to capture the statistical likelihood of an utterance given the current context of the application. What follows is an outline of some of the existing techniques to which the unique aspects of this invention can be applied.
- the example implementation relies on three key components:
- This approach takes advantage of both rule-based (e.g. Context Free Grammars - CFGs) and data-driven (e.g. n-gram) approaches.
- the advantage of an n-gram approach is that it quite powerfully models the statistics of language from examples of real usage (the corpus).
- the example sentences used to derive the n-gram probabilities must be representative of language usage in a particular context, but it can be difficult to find enough relevant examples for a new application.
- the benefit of a CFG approach is that a grammar writer can draw upon their own experience of language to tightly define the appropriate utterances for the topic - but they cannot reliably estimate the likelihood of each of these sentences being used in a live system.
- grammars may be written to quickly and easily define the scope of a language model in light of the context, and data-driven approaches can be used to accurately estimate the statistics of real usage.
- the grammar writer uses rules to conveniently define a list of possible utterances using an appropriate grammar formalism.
- Count(X) is the number of times sentence X occurs in our training corpus.
- N is the number of sentences in our training corpus.
- <s> and </s> are the beginning and end of sentence markers.
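- The relative-frequency estimate implied by these definitions (a reconstruction offered under the assumption of the standard formulation, since the equation itself did not survive into this text) is:

$$P(X) = \frac{\mathrm{Count}(\langle s\rangle\, X\, \langle/s\rangle)}{N}$$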
- the straightforward grammar defines the set of allowable utterances in a domain, but the probability of each utterance is uniform.
- a non-probabilistic grammar assumes that each of these words is equally likely, but the n-gram approximation described earlier can be used to estimate the probability of each possible word u_ni following the word history h.
- p(u_ni | h) ≈ K · P(u_ni | h'), where h' is the history h truncated to the most recent n-1 words and K is a normalising constant.
- combining grammars and statistical language models in this way may be implemented directly in the ASR architecture, or used indirectly to train a probabilistic context free grammar (PCFG), a probabilistic finite state grammar (PFSG) or another probabilistic grammar formalism.
- TFIDF: Term Frequency Inverse Document Frequency
- TFIDF vectors can be used as a measure of the similarity of one document (or grammar or set of utterances in a dialogue history) with another. Below is the definition of TFIDF.
- TF(i,j) is the Term Frequency - the number of times a term i (a word taken from the vocabulary) occurs in document j. The IDF component weights each term by log(N/DF(i)), where N is the number of documents and DF(i) is the number of documents containing term i, so that TFIDF(i,j) = TF(i,j) * log(N/DF(i)).
- Each document can be represented as a vector whose components are the TFIDF values for each word in the vocabulary.
- the cosine of the angle between the two vectors can be calculated from the dot product of the two vectors: cos(θ) = (N · M) / (|N| |M|).
- θ is the angle between vectors N and M.
- TFIDF vector for "I want to send an email": [1*log(3/3), 1*log(3/2), 1*log(3/2), 1*log(3/1), 1*log(3/1), 1*log(3/1), 0*log(3/1), 0*log(3/2), 0*log(3/1), 0*log(3/1), 0*log(3/1), 0*log(3/1), 0*log(3/1), 0*log(3/1), 0*log(3/1), 0*log(3/1), 0*log(3/1)]
- Such a similarity measure can be used to find sub-corpora for adaptation: 1) Using a grammar: the grammar is used to generate example sentences, and these form the query document.
- 2) Using dialogue histories: the query document may contain dialogue histories for a single user or for all users, and these histories may be further divided into application or grammar contexts.
- the query document is then compared with all documents in the corpus in order to create a sub-corpus containing those documents that are most similar based on a particular measure - such as TFIDF. Hence a sub-corpus is selected that is expected to be more representative of real usage in the context defined by the query document than the original corpus.
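- The selection step can be sketched in a few lines; the toy corpus, the smoothed IDF term and the similarity cut-off below are illustrative assumptions rather than values from the application:

```python
import math
from collections import Counter

def tfidf_vector(doc, corpus, vocab):
    tf = Counter(doc)
    n_docs = len(corpus)
    # Smoothed IDF avoids division by zero for terms absent from the corpus.
    return [tf[w] * math.log((1 + n_docs) / (1 + sum(w in d for d in corpus)))
            for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

corpus = [["i", "want", "to", "send", "an", "email"],
          ["book", "a", "flight", "to", "paris"],
          ["send", "an", "email", "to", "john"]]
query = ["may", "i", "send", "an", "email"]  # the "query document"
vocab = sorted({w for d in corpus + [query] for w in d})

query_vec = tfidf_vector(query, corpus, vocab)
scores = [cosine(query_vec, tfidf_vector(d, corpus, vocab)) for d in corpus]
sub_corpus = [d for d, s in zip(corpus, scores) if s > 0.1]
print(scores)      # the two email documents score highest
print(sub_corpus)  # and form the selected sub-corpus
```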
- a hybrid confidence measure is defined to take account of the recogniser confidence values and other accuracy cues in the dialogue history.
- S: subsequent dialogue history - e.g. confirms, disconfirms, task switches, asking for help, etc.
- the form of the function must be determined appropriately for the application and recognition engine.
- N-gram derived from a sub-corpus selected to contain documents similar to sentences covered by the grammar.
- N-gram derived from all users' dialogue histories either directly or via selection of a sub corpus.
- N-gram derived from a particular user's dialogue history either directly or via selection of a sub corpus.
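- One plausible way to combine these three n-gram sources is linear interpolation; the toy models and mixture weights below, which on the approach described here should reflect the quality and quantity of each source's data, are invented for illustration:

```python
# Sketch of combining n-gram models from different sources by linear
# interpolation; the models and mixture weights are invented examples.
def interpolate(models, weights):
    """models: dicts mapping an n-gram to its probability in that source."""
    assert abs(sum(weights) - 1.0) < 1e-9
    ngrams = set().union(*models)
    return {g: sum(w * m.get(g, 0.0) for w, m in zip(weights, models))
            for g in ngrams}

grammar_lm = {("send", "email"): 0.5, ("book", "flight"): 0.5}
all_users  = {("send", "email"): 0.7, ("book", "flight"): 0.3}
this_user  = {("send", "email"): 0.9, ("book", "flight"): 0.1}

combined = interpolate([grammar_lm, all_users, this_user], [0.2, 0.3, 0.5])
print(combined)  # the individual user's habits dominate the combined model
```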
- the user adapts to the system, just as the system should attempt to adapt to the user. This means that the user's interaction style is likely to change gradually over time. To accommodate this effect in the system adaptation process, more attention should be paid to more recent dialogue histories than those in the distant past.
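- A simple way to realise this recency effect, sketched here with an assumed decay rate, is to weight each past turn's contribution by an exponential decay over its age:

```python
# Sketch of recency weighting: recent dialogue turns contribute more to
# adaptation than older ones; the decay rate 0.9 is an assumed value.
def recency_weights(n_turns, decay=0.9):
    """Weight for each past turn, oldest first."""
    return [decay ** (n_turns - 1 - i) for i in range(n_turns)]

print(recency_weights(5))  # approximately [0.6561, 0.729, 0.81, 0.9, 1.0]
```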
- the embodiment described enables a spoken language interface (SLI) or other pattern matching system to adapt automatically.
- the automatic adaptation is based on grammar probabilities.
- the embodiment described has the further advantage that the determination of what data is suitable for use in the adaptation process is made according to a hybrid confidence measure.
- individual word confidence scores and the hybrid confidence function can advantageously be used to bias the contribution of individual n-grams in the combined model.
- Figure 7 shows a computer system 700 that may be used to implement embodiments of the invention.
- the computer system 700 may be used to implement an adaptive learning mechanism and/or spoken language interface mechanism according to aspects of the present invention.
- the computer system 700 may be used to provide at least the adaptive learning mechanism of a spoken language interface.
- the computer system 700 comprises various data processing resources such as a processor (CPU) 730 coupled to a bus structure 738. Also connected to the bus structure 738 are further data processing resources such as read only memory 732 and random access memory 734.
- a display adapter 736 connects a display device 718 having screen 720 to the bus structure 738.
- One or more user-input device adapters 740 connect the user-input devices, including the keyboard 722 and mouse 724 to the bus structure 738.
- An adapter 741 for the connection of a printer 721 may also be provided.
- One or more media drive adapters 742 can be provided for connecting the media drives, for example the optical disk drive 714, the floppy disk drive 716 and hard disk drive 719, to the bus structure 738.
- One or more telecommunications adapters 744 can be provided thereby providing processing resource interface means for connecting the computer system to one or more networks or to other computer systems or devices.
- the communications adapters 744 could include a local area network adapter, a modem and/or ISDN terminal adapter, or serial or parallel port adapter etc, as required.
- the processor 730 will execute computer program instructions that may be stored in one or more of the read only memory 732, random access memory 734, the hard disk drive 719, a floppy disk in the floppy disk drive 716 and an optical disc, for example a compact disc (CD) or digital versatile disc (DVD), in the optical disc drive, or dynamically loaded via adapter 744.
- the results of the processing performed may be displayed to a user via the display adapter 736 and display device 718.
- User inputs for controlling the operation of the computer system 700 may be received via the user-input device adapters 740 from the user-input devices.
- a computer program for implementing various functions or conveying various information can be written in a variety of different computer languages and can be supplied on carrier media.
- a program or program element may be supplied on one or more CDs, DVDs and/or floppy disks and then stored on a hard disk, for example.
- a program may also be embodied as an electronic signal supplied on a telecommunications medium, for example over a telecommunications network.
- Figure 8 shows the adaptive learning mechanism 800 according to an embodiment of the invention receiving input from a source 810.
- Automatic speech recogniser mechanism (ASRM) 820 generates a set of weighted hypotheses 830 corresponding to what the ASRM 820 determines to be the most likely input according to language model 850.
- Each hypothesis has one or more associated parameters, such as, for example, a confidence measure C.
- Hypotheses associated with successive turns in a dialogue are stored in a store 860.
- the analyser mechanism 840 checks information regarding successive turns and determines a hybrid confidence measure for each hypothesis.
- the analyser mechanism 840 updates the language model on the basis of the hypotheses and the associated hybrid confidences.
- Hybrid_C(P,C,S), where Hybrid_C is a function selected according to whether a correct or incorrect interpretation of the hypothesis is determined, P represents information from the preceding dialogue history, S represents information from the succeeding dialogue history, and C represents information relating to the current recognition step (dialogue turn).
- Hybrid_C+ and Hybrid_C- are used according to whether the interpretation is deemed correct or incorrect.
- Hybrid_C- should return a value near the maximum of its range.
- Hybrid_C- and Hybrid_C+ may be determined by hand, based on an analysis of the grounding strategies of the dialogue, or induced from a corpus of hand annotated data using well-known machine learning algorithms.
- the values of the Hybrid_C measures (appropriately transformed if necessary) are used to weight the contribution of examples in the model adaptation process.
- examples in the adaptation corpus are selected, classified and weighted on the basis of supervisory information (mainly grounding and dialogue context) implicitly provided by the user of the system.
- example Hybrid_C functions may be defined in terms of the following quantities:
- C is the previous confidence value, and takes a value between 0 and 100; x is the number of words in hypothesis t_i that are confirmed by subsequent turns t_j where j>i; y is the number of words in hypothesis t_i that are disconfirmed by subsequent turns t_j where j>i; and w is the total number of words in hypothesis t_i that are not confirmed by subsequent turns t_j.
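- The exact functional form is left to be determined per application and recognition engine, so the pair of functions below is only one hypothetical combination of C, x, y and w:

```python
# Hypothetical Hybrid_C functions; the 50/50 mixing of ASR confidence
# with grounding evidence is an illustrative assumption.
def hybrid_c_plus(C, x, y, w):
    """Confidence that a hypothesis deemed correct really is correct."""
    n = x + y + w
    grounding = (x + 0.5 * w) / n if n else 0.0  # confirmed words dominate
    return 0.5 * (C / 100.0) + 0.5 * grounding

def hybrid_c_minus(C, x, y, w):
    """Confidence that a hypothesis deemed incorrect really is incorrect."""
    n = x + y + w
    return 0.5 * (1.0 - C / 100.0) + 0.5 * (y / n if n else 0.0)

# A hypothesis with ASR confidence 60 whose words were mostly confirmed
# in later turns receives a high positive hybrid confidence:
print(hybrid_c_plus(C=60, x=4, y=0, w=1))  # 0.75
```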
- the recognition system will hypothesise one or more transcriptions of the user's utterance with an associated confidence level, and the machine will then respond.
- the machine's response, and therefore the continuation of the dialogue, is based on the hypotheses at each turn. For example:
- the hypothesis is incorrect.
- the user notices this and corrects the machine.
- the hypothesis is correct, and at turn 4, this is confirmed by the user.
- the confidence measure (as with all confidence measures) is not perfect: it has labelled correct hypotheses with low confidence, and incorrect hypotheses with medium confidence. The degree of accuracy of the confidence measure (how well it can discriminate between correct and incorrect hypotheses) directly affects the performance of unsupervised adaptation. By introducing the hybrid confidence measure, this invention improves the selection of correct and incorrect hypotheses for adaptation of a language model.
- insofar as the foregoing methods are implementable using a software-controlled programmable processing device, such as a Digital Signal Processor, microprocessor, other processing device, data processing apparatus or computer system, a computer program for configuring a programmable device, apparatus or system to implement the foregoing described methods is envisaged as an aspect of the present invention.
- the computer program may be embodied as source code and undergo compilation for implementation on a processing device, apparatus or system, or may be embodied as object code, for example.
- the term computer system in its most general sense encompasses programmable devices such as referred to above, and data processing apparatus and firmware embodied equivalents, whether part of a distributed computer system or not.
- Software components may be implemented as plug-ins, modules and/or objects, for example, and may be provided as a computer program product stored on a carrier medium in machine or device readable form.
- a computer program may be stored, for example, in solid-state memory, magnetic memory such as disc or tape, optically or magneto-optically readable memory, such as compact disc read-only or read-write memory (CD-ROM, CD-RW), digital versatile disc (DVD) etc., and the processing device utilises the program or a part thereof to configure it for operation.
- the computer program product may be supplied from a remote source embodied on a communications medium such as an electronic signal, radio frequency carrier wave or optical carrier wave.
- Such carrier media are also envisaged as aspects of the present invention.
- any communication link between a user and a mechanism, interface and/or system according to aspects of the invention may be implemented using any available mechanisms, including mechanisms using one or more of: wired, WWW, LAN, Internet, WAN, wireless, optical, satellite, TV, cable, microwave, telephone, cellular etc.
- the communication link may also be a secure link.
- the communication link can be a secure link created over the Internet using public key cryptographic encryption techniques or as an SSL link.
- Embodiments of the invention may also employ voice recognition techniques for identifying a user.
Landscapes
- Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Machine Translation (AREA)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB0326758A GB2391680B (en) | 2001-05-02 | 2002-05-02 | Adaptive learning of language models for speech recognition |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB0110810.9 | 2001-05-02 | ||
| GB0110810A GB2375211A (en) | 2001-05-02 | 2001-05-02 | Adaptive learning in speech recognition |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2002089112A1 true WO2002089112A1 (en) | 2002-11-07 |
Family
ID=9913924
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/GB2002/002048 Ceased WO2002089112A1 (en) | 2001-05-02 | 2002-05-02 | Adaptive learning of language models for speech recognition |
Country Status (2)
| Country | Link |
|---|---|
| GB (2) | GB2375211A (en) |
| WO (1) | WO2002089112A1 (en) |
Cited By (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| ES2311351A1 (en) * | 2006-05-31 | 2009-02-01 | France Telecom España, S.A. | Method to adapt dynamically the acoustic models of recognition of the speech to the user. (Machine-translation by Google Translate, not legally binding) |
| US7783488B2 (en) | 2005-12-19 | 2010-08-24 | Nuance Communications, Inc. | Remote tracing and debugging of automatic speech recognition servers by speech reconstruction from cepstra and pitch information |
| EP2008189A4 (en) * | 2006-04-03 | 2010-10-20 | Google Inc | Automatic language model update |
| US7925506B2 (en) | 2004-10-05 | 2011-04-12 | Inago Corporation | Speech recognition accuracy via concept to keyword mapping |
| EP2317507A1 (en) * | 2004-10-05 | 2011-05-04 | Inago Corporation | Corpus compilation for language model generation |
| EP2711923A3 (en) * | 2006-04-03 | 2014-04-09 | Vocollect, Inc. | Methods and systems for assessing and improving the performance of a speech recognition system |
| US9928829B2 (en) | 2005-02-04 | 2018-03-27 | Vocollect, Inc. | Methods and systems for identifying errors in a speech recognition system |
| US20220139373A1 (en) * | 2020-07-08 | 2022-05-05 | Google Llc | Identification and utilization of misrecognitions in automatic speech recognition |
| US11984118B2 (en) | 2018-08-27 | 2024-05-14 | Beijing Didi Infinity Technology And Development Co., Ltd. | Artificial intelligent systems and methods for displaying destination on mobile device |
Families Citing this family (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| DE10341305A1 (en) * | 2003-09-05 | 2005-03-31 | Daimlerchrysler Ag | Intelligent user adaptation in dialog systems |
| US7899671B2 (en) * | 2004-02-05 | 2011-03-01 | Avaya, Inc. | Recognition results postprocessor for use in voice recognition systems |
| RU2297676C2 (en) * | 2005-03-30 | 2007-04-20 | Федеральное государственное научное учреждение научно-исследовательский институт "Специализированные вычислительные устройства защиты и автоматика" | Method for recognizing words in continuous speech |
| US9508346B2 (en) | 2013-08-28 | 2016-11-29 | Verint Systems Ltd. | System and method of automated language model adaptation |
| CN104681023A (en) * | 2015-02-15 | 2015-06-03 | 联想(北京)有限公司 | Information processing method and electronic equipment |
| US12079825B2 (en) * | 2016-09-03 | 2024-09-03 | Neustar, Inc. | Automated learning of models for domain theories |
Family Cites Families (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP0241170B1 (en) * | 1986-03-28 | 1992-05-27 | AT&T Corp. | Adaptive speech feature signal generation arrangement |
| US6026359A (en) * | 1996-09-20 | 2000-02-15 | Nippon Telegraph And Telephone Corporation | Scheme for model adaptation in pattern recognition based on Taylor expansion |
| DE19708183A1 (en) * | 1997-02-28 | 1998-09-03 | Philips Patentverwaltung | Method for speech recognition with language model adaptation |
| EP1426923B1 (en) * | 1998-12-17 | 2006-03-29 | Sony Deutschland GmbH | Semi-supervised speaker adaptation |
| JP2001100781A (en) * | 1999-09-30 | 2001-04-13 | Sony Corp | Audio processing device, audio processing method, and recording medium |
2001
- 2001-05-02 GB: application GB0110810A (published as GB2375211A); status: Withdrawn
2002
- 2002-05-02 GB: application GB0326758A (published as GB2391680B); status: Expired - Fee Related
- 2002-05-02 WO: application PCT/GB2002/002048 (published as WO2002089112A1); status: Ceased
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6173266B1 (en) * | 1997-05-06 | 2001-01-09 | Speechworks International, Inc. | System and method for developing interactive speech applications |
| WO2001026093A1 (en) * | 1999-10-05 | 2001-04-12 | One Voice Technologies, Inc. | Interactive user interface using speech recognition and natural language processing |
| WO2001050453A2 (en) * | 2000-01-04 | 2001-07-12 | Heyanita, Inc. | Interactive voice response system |
Non-Patent Citations (1)
| Title |
|---|
| RICCARDI, G. et al., "Stochastic language adaptation over time and state in natural spoken dialog systems", IEEE Transactions on Speech and Audio Processing, vol. 8, no. 1, Jan. 2000, pp. 3-10, ISSN 1063-6676, XP002205299 * |
Cited By (20)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP2317507A1 (en) * | 2004-10-05 | 2011-05-04 | Inago Corporation | Corpus compilation for language model generation |
| US8352266B2 (en) | 2004-10-05 | 2013-01-08 | Inago Corporation | System and methods for improving accuracy of speech recognition utilizing concept to keyword mapping |
| US7925506B2 (en) | 2004-10-05 | 2011-04-12 | Inago Corporation | Speech recognition accuracy via concept to keyword mapping |
| US9928829B2 (en) | 2005-02-04 | 2018-03-27 | Vocollect, Inc. | Methods and systems for identifying errors in a speech recognition system |
| US7783488B2 (en) | 2005-12-19 | 2010-08-24 | Nuance Communications, Inc. | Remote tracing and debugging of automatic speech recognition servers by speech reconstruction from cepstra and pitch information |
| EP2711923A3 (en) * | 2006-04-03 | 2014-04-09 | Vocollect, Inc. | Methods and systems for assessing and improving the performance of a speech recognition system |
| US10410627B2 (en) | 2006-04-03 | 2019-09-10 | Google Llc | Automatic language model update |
| EP2008189A4 (en) * | 2006-04-03 | 2010-10-20 | Google Inc | Automatic language model update |
| US8423359B2 (en) | 2006-04-03 | 2013-04-16 | Google Inc. | Automatic language model update |
| US8447600B2 (en) | 2006-04-03 | 2013-05-21 | Google Inc. | Automatic language model update |
| EP3627497A1 (en) * | 2006-04-03 | 2020-03-25 | Vocollect, Inc. | Methods and systems for assessing and improving the performance of a speech recognition system |
| US9159316B2 (en) | 2006-04-03 | 2015-10-13 | Google Inc. | Automatic language model update |
| EP2453436A3 (en) * | 2006-04-03 | 2012-07-04 | Google Inc. | Automatic language model update |
| US9953636B2 (en) | 2006-04-03 | 2018-04-24 | Google Llc | Automatic language model update |
| EP2541545B1 (en) * | 2006-04-03 | 2018-12-19 | Vocollect, Inc. | Methods and systems for adapting a model for a speech recognition system |
| ES2311351B1 (en) * | 2006-05-31 | 2009-12-17 | France Telecom España, S.A. | Method for dynamically adapting the speech recognition acoustic models to the user |
| ES2311351A1 (en) * | 2006-05-31 | 2009-02-01 | France Telecom España, S.A. | Method for dynamically adapting the speech recognition acoustic models to the user (machine translation by Google Translate, not legally binding) |
| US11984118B2 (en) | 2018-08-27 | 2024-05-14 | Beijing Didi Infinity Technology And Development Co., Ltd. | Artificial intelligent systems and methods for displaying destination on mobile device |
| US20220139373A1 (en) * | 2020-07-08 | 2022-05-05 | Google Llc | Identification and utilization of misrecognitions in automatic speech recognition |
| US12165628B2 (en) * | 2020-07-08 | 2024-12-10 | Google Llc | Identification and utilization of misrecognitions in automatic speech recognition |
Also Published As
| Publication number | Publication date |
|---|---|
| GB2375211A (en) | 2002-11-06 |
| GB0110810D0 (en) | 2001-06-27 |
| GB0326758D0 (en) | 2003-12-17 |
| GB2391680B (en) | 2005-07-20 |
| GB2391680A (en) | 2004-02-11 |
Similar Documents
| Publication | Title |
|---|---|
| EP1602102B1 (en) | Management of conversations |
| US9495956B2 (en) | Dealing with switch latency in speech recognition |
| CA2576605C (en) | Natural language classification within an automated response system |
| US9619572B2 (en) | Multiple web-based content category searching in mobile search application |
| US20050033582A1 (en) | Spoken language interface |
| US7103542B2 (en) | Automatically improving a voice recognition system |
| US8635243B2 (en) | Sending a communications header with voice recording to send metadata for use in speech recognition, formatting, and search mobile search application |
| US11580959B2 (en) | Improving speech recognition transcriptions |
| US20040260543A1 (en) | Pattern cross-matching |
| US20110054899A1 (en) | Command and control utilizing content information in a mobile voice-to-speech application |
| US20110054900A1 (en) | Hybrid command and control between resident and remote speech recognition facilities in a mobile voice-to-speech application |
| US20110060587A1 (en) | Command and control utilizing ancillary information in a mobile voice-to-speech application |
| US20110054894A1 (en) | Speech recognition through the collection of contact information in mobile dictation application |
| US20110054895A1 (en) | Utilizing user transmitted text to improve language model in mobile dictation application |
| US20110054898A1 (en) | Multiple web-based content search user interface in mobile search application |
| US20110054896A1 (en) | Sending a communications header with voice recording to send metadata for use in speech recognition and formatting in mobile dictation application |
| US20110054897A1 (en) | Transmitting signal quality information in mobile dictation application |
| WO2005122145A1 (en) | Speech recognition dialog management |
| WO2002089112A1 (en) | Adaptive learning of language models for speech recognition |
| US12243517B1 (en) | Utterance endpointing in task-oriented conversational systems |
| US20220101835A1 (en) | Speech recognition transcriptions |
| KR20250051049A (en) | System and method for optimizing a user interaction session within an interactive voice response system |
| GB2375210A (en) | Grammar coverage tool for spoken language interface |
| CN114283810A (en) | Improve speech recognition transcription |
| Williams | A probabilistic model of human/computer dialogue with application to a partially observable Markov decision process |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| AK | Designated states |
Kind code of ref document: A1
Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SL TJ TM TN TR TT TZ UA UG US UZ VN YU ZA ZM ZW
|
| AL | Designated countries for regional patents |
Kind code of ref document: A1
Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG
|
| ENP | Entry into the national phase |
Ref document number: 0326758
Country of ref document: GB
Kind code of ref document: A
Free format text: PCT FILING DATE = 20020502
Format of ref document f/p: F
|
| 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | | |
| DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 20040101) | | |
| WWE | WIPO information: entry into national phase |
Ref document number: 0326758
Country of ref document: GB
|
| REG | Reference to national code |
Ref country code: DE
Ref legal event code: 8642
|
| 122 | Ep: PCT application non-entry in European phase | | |
| NENP | Non-entry into the national phase |
Ref country code: JP |
|
| WWW | WIPO information: withdrawn in national office |
Country of ref document: JP |