
WO1994009485A1 - Apparatus and method for continuous speech recognition - Google Patents


Info

Publication number
WO1994009485A1
Authority
WO
WIPO (PCT)
Prior art keywords
words
signal
word
segment
utterance
Prior art date
Application number
PCT/CA1993/000420
Other languages
French (fr)
Inventor
Hanavi M. Hirsh
Original Assignee
Hirsh Hanavi M
Priority date
Filing date
Publication date
Application filed by Hirsh Hanavi M filed Critical Hirsh Hanavi M
Priority to AU51467/93A priority Critical patent/AU5146793A/en
Publication of WO1994009485A1 publication Critical patent/WO1994009485A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 Sound input; Sound output
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/04 Segmentation; Word boundary detection

Definitions

  • the operator of my CSR system explicitly declares the start of each word by alternately pressing the right and left buttons of signal-sending device 18, using the index finger and the middle finger of the dominant hand.
  • the time that each word-marker signal is received by the digital computer 22 is stored in a file after it has been adjusted by the speaker's characteristic delay factor which has been established during operator training process 52.
  • the utterance segment delimited by each pair of signals thus contains a single word, and the digital data for that segment, which is stored in a set of data elements, each of which can be descriptive of, say, 100 μsec-long time slices, can readily be extracted.
  • the data is then grouped together in time frames which may, for computational convenience, each be 25.6ms in duration so that each frame will contain 256 data elements.
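As an illustration of the framing arithmetic above, here is a minimal sketch (the helper, its names, and the array shapes are assumptions, not taken from the disclosure) that groups one band's 100 μsec data elements into 25.6 ms frames of 256 elements:

```python
import numpy as np

SLICE_SEC = 100e-6                      # one data element per 100 microsecond time slice
FRAME_ELEMENTS = 256                    # elements per frame
FRAME_SEC = SLICE_SEC * FRAME_ELEMENTS  # 0.0256 s, i.e. the 25.6 ms frames in the text

def frames_for_band(band_samples: np.ndarray) -> np.ndarray:
    """Reshape one band's stream of time-sliced elements into whole frames,
    discarding any trailing partial frame."""
    n_frames = len(band_samples) // FRAME_ELEMENTS
    return band_samples[:n_frames * FRAME_ELEMENTS].reshape(n_frames, FRAME_ELEMENTS)

one_second = np.zeros(10_000)             # 10,000 slices of 100 us = 1 s of data
print(frames_for_band(one_second).shape)  # (39, 256): 39 full frames per second
```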
  • Signal 14 sent by the operator will only give an approximate indication of the break between words.
  • Analytical techniques employed by a word boundary detection process 58 to pin down the break position precisely will start looking in the time frame in which the word-marker signal falls. If no definitive break identification can be found there, the adjacent time frames will be examined. Extraneous productions of noise, such as throat-clearing or breathing sounds, will typically occur after a word is spoken and before the next signal is sent. Noise productions are recognized and excised by a noise excision process 59, which uses techniques chosen from among those developed for this purpose which are well known to those skilled in the art.
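The outward search described above might be sketched as follows; `find_break` stands in for whatever acoustic break detector an implementation chooses, and all names here are illustrative assumptions:

```python
def locate_boundary(frames, marker_frame, find_break, max_offset=4):
    """Look for a definitive break in the frame holding the word-marker
    signal first, then in frames at increasing distance from it."""
    candidates = [marker_frame]
    for offset in range(1, max_offset + 1):
        candidates += [marker_frame - offset, marker_frame + offset]
    for idx in candidates:
        if 0 <= idx < len(frames) and find_break(frames[idx]):
            return idx
    return marker_frame  # fall back to the operator's approximate mark

frames = ["speech", "speech", "gap", "speech"]           # toy frame labels
print(locate_boundary(frames, 1, lambda f: f == "gap"))  # 2
```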
  • a word end-point determination process 60 is used which is similar to that used in isolated word or discrete utterance recognition.
  • a contiguous juncture boundary determination process 62, which is closely related to that used in isolated word recognition to determine syllabic breaks, is used.
  • Such algorithms or methods are well-known to those skilled in the art and exist in many specialized versions. The choice of an optimal boundary detection instrument depends on the specific pair of phonological units which must be divided.
  • the speech recognition problem is reduced, substantially, to one of isolated word recognition.
  • the size of the vocabulary is, at most, a few hundred words.
  • the words in the lexicon can be expected to number in the thousands.
  • the task of discriminating between words in my CSR system can also draw on syntactic and semantic constraints. Many methods of doing so, including expert systems and network models, are well known to those skilled in the art.
  • the system depicted in Fig 6B includes a segment classification process 66 which counts the number of syllables in each word-length utterance segment. It employs techniques well known to those skilled in the art which can identify syllabic breaks with a high level of reliability.
  • the word-length segments are divided into three classes:
  • Class 'A': one-syllable words
  • Class 'B': two-syllable words
  • Class 'C': three-or-more-syllable words
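A minimal sketch of this three-way classification, assuming the syllable count has already been produced by a syllable-break detector of the kind mentioned above:

```python
def classify_segment(syllable_count: int) -> str:
    """Map a word-length segment's syllable count to its processing class."""
    if syllable_count == 1:
        return "A"      # one-syllable words
    if syllable_count == 2:
        return "B"      # two-syllable words
    return "C"          # three-or-more-syllable words

assert classify_segment(1) == "A" and classify_segment(5) == "C"
```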
  • My CSR system enables the most efficient recognition strategy to be used for each utterance segment, one that will make use of the most appropriate distinctive characteristics in each case.
  • a hierarchical ordering of parameters and successive hypothesize-and-test iterations will enable the process to converge to a recognized word in as few steps as possible.
  • Although each parameter, in itself, is liable to be unreliable as a fine discriminator, the application of a series of constraints will quickly bring the number of possible word candidates down to a single best choice.
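The convergence idea can be sketched as an ordered filtering loop; the constraint functions below are placeholders for the duration, stress, manner-class, and syntactic tests discussed in this disclosure, and the fallback behaviour is an assumption:

```python
def narrow_candidates(candidates, constraints):
    """Apply constraints in order, coarse to fine, until one candidate is left.
    A constraint that rejects every remaining candidate is treated as an
    unreliable discriminator and skipped."""
    for constraint in constraints:
        filtered = [w for w in candidates if constraint(w)]
        if filtered:
            candidates = filtered
        if len(candidates) == 1:
            return candidates[0]
    return candidates[0] if candidates else None

candidates = ["of", "off", "offer"]
constraints = [lambda w: len(w) <= 3,          # stand-in for a duration test
               lambda w: w.endswith("ff")]     # stand-in for a stress/feature test
print(narrow_candidates(candidates, constraints))  # 'off'
```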
  • Computer 22 keeps a record of the time that each word-marker signal 14 is received. The start of each signal marks the creation of a new instantiation of word-recognition process 64. Thus a separate process 64 runs for each segment of the utterance, with all segments being analyzed simultaneously. Each process 64 will employ techniques which are appropriate to the class, as determined above, of the utterance segment that is being processed. This is done by calling the appropriate servant process, class 'A' analysis process 84, class 'B' analysis process 86, or class 'C' analysis process 88, which are described below. The way that computer resources are allocated to the different concurrently executing processes depends on the type of computer system used.
  • a multiprogramming environment will have each process share the same processor, running in different memory partitions in a time-sharing mode.
  • a multiprocessor system will divide the processes among the independent processors.
  • the system is designed to process sentence-long utterances which have a maximum duration and maximum number of words that depends on the main memory capacity and processing speed of the computer that is employed.
  • the system will, if possible, simultaneously run an independent process for each word in the sentence.
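One plausible modern realization of this per-segment concurrency, sketched with Python's standard process pool (an implementation assumption on my part; the disclosure speaks only of memory partitions and independent processors):

```python
from concurrent.futures import ProcessPoolExecutor

def recognize_word(segment):
    """Placeholder for one word-recognition process 64: the class 'A', 'B',
    or 'C' analysis appropriate to the segment would run here."""
    ...

def recognize_sentence(segments):
    # One task per word-length segment, all analyzed simultaneously.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(recognize_word, segments))
```

On some platforms the call to `recognize_sentence` needs the usual `if __name__ == "__main__":` guard.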
  • phrase-analysis process 68 and sentence-analysis process 70 are running.
  • syntactic-analysis process 72, semantic-analysis process 74, and, in some applications, pragmatic-analysis process 76 act as servant processes which can be brought into play by the multiple concurrent word-recognition processes 64. These processes consult the syntactic, semantic, and pragmatic knowledge bases. As each word-recognition process 64 terminates, the results of the analysis are passed to sentence-analysis process 70 and the next copy of word-analysis process 64 can run in the freed partition to process the next word-length segment. Warning signal 90 asks the speaker to pause if the system processing capacity is about to be exceeded, as detected by system-monitor process 82. In such cases, a sentence fragment will be processed.
  • Each admissible word is associated with a "volume" in lexicon 40 whose hierarchical arrangement of volumes determines the order of consultation.
  • the structure of the lexicon is context-dependent. If the application relates to travel, next in sequence after the volume containing high-frequency standard vocabulary is a volume containing a set of specialized words used in the context of travel. The specialized vocabulary would be different if the context is, for instance, an architectural specification. Subsequent volumes contain words of decreasing frequency. The fact that a word is recognized more frequently by the system than expected will lead to its being promoted to the appropriate volume.
  • Words other than those found in the high frequency standard vocabulary volume are associated with other lexical entries which appear most often with them in the same phrase.
  • This aspect of the lexicon is compiled by means of word-relation process 48 that extracts such information from many samples of text pertaining to a certain context that are entered as part of lexicon compilation process 44.
  • the system "learns" more about such connections between words as it is used, by means of lexicon-adaptation process 50.
  • a duration value is stored for each word in the lexicon.
  • prosodic parameters which characterize significant distinctive non-spectral features of an utterance segment, including parameters related to syllable stress, syllable duration, syllable intonation, and segment overall duration, are handled by a prosodic analysis process 78.
  • the first is the average speaking rate of the person whose speech is to be recognized. He or she will speak a known text before the recognition process begins, during system-to-speaker adaptation process 56. This enables the system to be adapted to the speech of that particular speaker by means of special parameters which compensate for any deviations from the system's standard values, including the rate of speech production.
  • The phrase in question is the sequence of words falling within a continuous intonation contour that includes the word in question.
  • Comparing the average interval between stressed syllables for that phrase, as computed by phrase analysis process 68, with the overall average for the speaker will yield a second factor. Both factors would be applied to the measured duration of an utterance segment before that value is used to make a comparison with the value for words in the lexicon.
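A minimal sketch of that two-factor adjustment, assuming both factors are expressed as ratios to the system's standard speaking rate:

```python
def normalized_duration(measured_sec, speaker_rate_factor, phrase_rate_factor):
    """Scale a segment's measured duration onto the lexicon's time scale
    before comparing it with stored word durations."""
    return measured_sec / (speaker_rate_factor * phrase_rate_factor)

# e.g. a fast speaker (factor 0.8) in a hurried phrase (factor 0.9):
print(normalized_duration(0.45, 0.8, 0.9))  # 0.625 s on the lexicon's scale
```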
  • Stress is another characteristic that can help distinguish one monosyllabic word from another one that is acoustically similar. For instance, while differences between the sounds of "of" and "off" can be difficult to distinguish, "of" will usually be unstressed while "off" will likely be stressed. As is the case for duration, stress is a relative measure that only yields a meaningful comparison when it is applied to two words in the same phrase.
  • a Class 'A' analysis process 84 therefore employs precise and unambiguous phonological units: demi-syllables and affixes.
  • spectral variants for the words in the lexicon are handled by associating them with different classes of speakers who participate in reference table compilation process 46 which resulted in the compilation of the acoustic reference data stored in the system tables.
  • Variations which result from interactions with adjacent words are handled by maintaining a plurality of templates for the same word, or by the application of phonologic rules.
  • a speaker's characteristic pronunciation is ascertained during speaker verification process 54. Although some adaptation parameters will be set as a consequence, the major adaptation is accomplished by placing the speaker in a particular classification.

Class 'B' Analysis Process
  • Two-syllable words can be divided into 16 different classifications based on each syllable being either long or short and stressed or unstressed.
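The arithmetic behind the 16 classifications: each syllable is long or short and stressed or unstressed, giving 4 states per syllable and 4 x 4 = 16 classes for the word. The integer encoding below is an illustrative assumption:

```python
def syllable_code(is_long: bool, is_stressed: bool) -> int:
    """Encode one syllable's (length, stress) state as a value 0..3."""
    return (2 if is_long else 0) | (1 if is_stressed else 0)

def two_syllable_class(s1, s2) -> int:
    """s1 and s2 are (is_long, is_stressed) pairs; the result is 0..15."""
    return 4 * syllable_code(*s1) + syllable_code(*s2)

print(two_syllable_class((True, True), (False, False)))  # 12
```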
  • the intonation pattern of a word, i.e. the change in fundamental frequency between the two syllables, can help discriminate between different word candidates.
  • the consideration of this pattern must be done in the context of the overall intonation pattern of the phrase.
  • Intonation analysis process 80 measures this characteristic when required.
  • a word's characteristic intonational contour may change according to its syntactical role.
  • An example of this pattern is the word, "German". It is high-low as a noun, but becomes low-low when used as an adjective, as in "German shepherd". Such a distinction can be helpful in determining the syntactical role played by a word in a particular context.
  • a first analysis of spectral data leads to a determination of the manner classes of each sound in the utterance, i.e. a grouping of sounds by the manner in which they are produced.
  • the form of classification used consists of: vowels, plosives, fricatives, nasals, glides, affricates, silence, and others.
  • the manner class determination is accomplished by a manner class determination process 92.
  • Each Class 'C' entry in the lexicon will be characterized by the sounds it contains in terms of manner classes. This will avoid the computationally more expensive and inherently more error-prone process of analyzing the utterance into specific phonological units.
  • the lexicon also includes strings of symbols representing finer resolution phonological units, such as demi-syllables, for long as well as short words in the lexicon. These reference strings are used, on an exception basis, for the purposes of disambiguating word candidates when the computationally simpler techniques fail to discriminate between them. In such cases, the utterance segment must be analyzed into comparable units.
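A hedged sketch of that coarse-first strategy: a segment's manner-class string is matched against the class strings stored for lexicon entries, with finer phonological units reserved for ties. The one-letter class codes and the sample strings are invented for illustration:

```python
def manner_match(segment_classes: str, lexicon: dict) -> list:
    """Return words whose stored manner-class string equals the segment's,
    e.g. 'fvfv' = fricative, vowel, fricative, vowel."""
    return [w for w, classes in lexicon.items() if classes == segment_classes]

lexicon = {"seven": "fvvvn", "sofa": "fvfv"}  # made-up manner-class strings
print(manner_match("fvfv", lexicon))          # ['sofa']; had several words
                                              # matched, demi-syllable data
                                              # would disambiguate them
```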
  • each word is added to the sequence of recognized words which are displayed on the computer terminal.
  • the operator uses signal-sending device 18 to send an acceptance or a rejection signal to the system in response to the single highlighted word on the screen, as each word in the sentence is highlighted in turn.
  • the operator can choose the correct one from a list of alternative candidate words which are displayed as soon as the rejection signal has been received. If the correct word is not on the list, the operator has the choice of either speaking the word again or spelling the word out by means of the computer terminal keyboard.
  • Signal-sending device 18 is also used during confirmation and correction process 16 to indicate desired hyphens, punctuation, and capitalization.
  • the CSR system of this invention can form the basis of a voice-actuated typewriter with the operator being, for instance, a keyboard-shy executive.
  • My CSR system could be used by a relatively unskilled operator to transcribe pre-recorded speech or speech which is being received from a remote location via telephone.
  • Another application would use my CSR system to transcribe prewritten text, such as a hand-written document, which is not amenable to optical character recognition. This would be particularly valuable for research projects which deal with voluminous archival material.
  • My CSR system can also be operated cooperatively by two people, one who marks the words and a second who uses a separate terminal and signal-sending device to effect the confirmation and correction of the orthographic output.
  • Such a configuration will make possible the production of transcriptions of ongoing conversational speech in real time. This could be most useful for conferences in which speakers, such as those participating in a discussion or debate, do not speak from prepared texts. Transcriptions of the conference session would be available for distribution a few minutes after its closing.
  • the output from such a conference transcription system could be directed to a computer-aided translation system run by a third operator.
  • the syntactic and semantic analyses already performed by the CSR system for its own purposes would be available to the translation system that needs such information to prepare an accurate version of the speaker's words in the second language.
  • the output from the above translation system could be directed to a large digitized text display unit that is visible to all participants in the conference, including those who do not understand the speaker, as is commonly done for subtitles of foreign language films shown in film festivals.
  • a variant would see the same-language transcription of the speaker's words displayed on a large digitized display, even in single-language conferences, for the benefit of those participants who are hearing impaired.
  • the same orthographic output from a simultaneous translation system based on my CSR system can be transformed into speech in the second language by means of a conventional text-to-speech process.
  • the translation would then be made available to conference participants, almost in real time, via FM transmission to headsets.
  • FM receivers are commonly used by conventional simultaneous translation services which require the efforts of highly skilled and expensive interpreters.
  • the same system is useful, even when the translation is also displayed on a large digitized display, for the benefit of participants who are visually impaired.
  • a portable version of the above translation system could be used by a single operator, with slower response time. Someone would use such an automatic translator to communicate while traveling in a foreign country in a way that is much more graceful than the common thumbing and stumbling through a phrase book.
  • the CSR system of my invention could be used remotely by someone who wants to enter continuous speech into a system via telephone. Complex database inquiries could be made in this way, with the desired information being given to the caller in voice-response fashion by means of a text-to-speech system.
  • the operator uses a variation of the signal-sending device that has been described as being a standard computer mouse in the preferred embodiment.
  • a two-button finger-actuated portable device that sends an analog signal through the telephone handset microphone could be employed.
  • a simpler device would be in the form of two finger-size sleeves with hard protuberances at their ends which are worn over the index and middle fingers of the dominant hand.
  • the speaker can tap directly on the handset to generate high-frequency clicks at the onset of each spoken word.
  • the word-marking clicks will be conveyed by the handset microphone and telephone connection together with the spoken words to the CSR system.
  • prosodic parameters are very effective discriminatory instruments for long words. This substantially reduces the time-consuming computation that is inherent in conventional systems, and makes use of parameters which are more robust than those which are available to conventional CSR systems.
  • prosodic data contribute additional orthogonal parameters to help differentiate between word candidates proposed by parameters derived from acoustic data.
  • a capital advantage conferred by the present invention which is applicable to words of any length in an utterance is that the analysis of different word-length utterances can be undertaken simultaneously, by means of readily available multiprogramming and multiprocessor computers. This brings the benefit of a dramatic increase in recognition speed in comparison with the results obtainable by conventional techniques, which cannot make efficient use of such computer hardware, making real-time continuous speech recognition possible using low-cost equipment.
  • Connectionist models, including artificial neural nets, can be used to deal with the uncertainties in the network of possibilities that ties together computed parameters, distinctive features, phonological units, and words.
  • CSR systems based on such models can also benefit from the use of the utterance segmentation techniques of my invention.
  • the use of the signal-sending device by an operator who also is the speaker will, without any conscious effort on the operator's part to do so, lead to speech production that is better articulated and which exhibits clearly defined word boundaries in the acoustic data.
  • This phenomenon can be verified by the reader by the simple expedient of tapping on a table alternately with the index and middle fingers of the dominant hand as the words of this disclosure are read out loud. The reader will find that after very little practice the marking of the onset of words can be accomplished accurately, without adversely affecting the fluency of speech.
  • signal-sending device 18 can be foot-operated instead of hand-operated, and the device may be actuated by any type of switch known to the switch-making art.
  • This can include, but is not limited to, electrostatic switches, acoustically-operated switches, and switches operated by the interruption of radiant energy.
  • the processing steps shown which, without exception, use algorithms well known to those skilled in the art, can be employed in arrangements which are very different from that used in the example given, while still taking full advantage of the word-marking information supplied by the signal-sending device.
  • Alternative configurations could use special hardware, such as connectionist machines, including artificial neural network computers, and fuzzy logic circuits which can handle the great variability which is a characteristic of speech.
  • FIG IA A "black box" representation of a conventional CSR system is shown in Fig IA.
  • the single input into a speech recognition system 12 is continuous speech 10.
  • the output from the system is a transcription of the spoken words 11.
  • a CSR system can be viewed as comprising the elements described below.
  • Fig 2A depicts a modification of the CSR system of Fig 1A which includes an interactive process 16 that displays the recognized words for confirmation or correction by the operator before the confirmed and corrected words are presented as final output.
  • a second operator input 20 is shown that is used subsequent to recognition process 12.
  • Fig 3A shows a more detailed block diagram that sets out the hardware components of a typical conventional CSR.
  • a typical CSR system comprises substantially the following elements:
  • microphone 26 as an input device and transducer;
  • a bank of band-pass filters 28 which can extract component signal information, or an equivalent computational process;
  • analog-to-digital converter module 32 to convert the analog wave forms into a stream of discrete data elements 30 that can be analyzed by a digital computer 22;
  • The critical functional elements of a CSR, i.e. the ones which significantly differentiate one system from another, are the various algorithms employed. Some of the steps in the CSR process may employ algorithms which can be implemented either in hardware or in software. Although actual CSR systems will vary considerably in implementation, a typical system will use methods known to the art to accomplish substantially the following steps, including off-line processes which are similar to some of those employed in the present invention, as depicted in Fig 6A, and on-line processes as depicted in Fig 5.
  • Appropriate algorithms will analyze the sets of acoustic data gathered in the training sessions which relate to particular sounds.
  • the data will cover the range of variation that follows from the expected inter-speaker and intra-speaker variability. Values are calculated for, typically, ten or more distinct descriptive parameters. Distinctive features are considered to be present when a certain time-frame of an utterance is associated with certain sets of computed parameters whose values fall within specific ranges.
  • Each phonological unit is defined by a specific constellation of features. Reference tables are built which contain range values for the sets of parameters which are used to describe the orthogonal features of the particular phonological units which have been chosen. Fig 4 depicts this relationship that permits analysis in one direction and synthesis in the other.
  • a computational technique, such as a Fast Fourier Transform, may be employed to accomplish a comparable spectral analysis after the signal has been digitized.
  • a sampling rate of 10kHz is typically used when the analog signal is converted into digital data.
  • Each time slice will comprise multiple data elements within each component frequency band, with between 8 and 35 different bands typically used.
  • the resolution of the system is decided by the sampling rate and the choice of storage width for each data element, which, typically, is 12 bits. A compromise must be chosen between a resolution that is too low to adequately model the relevant features and one that obscures those features by highlighting noise and minute variations which do not represent useful information.
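As a concrete stand-in for the digitized analysis, the sketch below computes per-band energies from one 256-sample frame with an FFT, using the 10 kHz sampling rate mentioned in the text (256 samples at 10 kHz is 25.6 ms); the equal-width band edges and band count are assumptions:

```python
import numpy as np

FS = 10_000    # 10 kHz sampling rate, as in the text
N_BANDS = 16   # within the 8 to 35 bands typically used

def band_energies(frame: np.ndarray) -> np.ndarray:
    """Total spectral energy in each of N_BANDS equal-width frequency bands."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    return np.array([b.sum() for b in np.array_split(spectrum, N_BANDS)])

frame = np.sin(2 * np.pi * 440 * np.arange(256) / FS)  # a 440 Hz test tone
print(band_energies(frame).argmax())                   # energy peaks in band 1
```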
  • Algorithms are known which can effect a digital simulation of substantially any analog analytical process.
  • a recognition strategy will often begin with a normalization process 45 which could be carried out by special circuits before the analog-to-digital conversion or by computational techniques after.
  • the intent is to "massage" the raw utterance into a form that corresponds in terms of overall signal intensity, mean spectral frequency, and syllable length to the norms which applied to the "training" speech used to develop the system's reference data.
  • a sophisticated normalization process will attend to the pattern of variance as well as to the mean value of each orthogonal measure.
  • Word Segmentation as Carried Out In Conventional CSR Systems
  • Word segmentation is a highly problematic process in the field of CSR.
  • segmentation is a serial process in which utterance segments postulated to be word-length are assembled from phonological units. An attempt is then made to match the synthesized segments with lexicon entries.
  • the phonological units have been previously determined by a two-step process of phonological unit segmentation, comprising "extraction" and labeling of the unit, substantially as outlined below.
  • Extraction refers to the partitioning of the utterance into sub-utterance segments which the system can attempt to match with phonological unit reference templates. This first partitioning is accomplished using algorithms which must first successfully locate and excise pauses and breaks, with such breaks marking either breaks between words or between syllables.

Problems Associated With Extraction
  • non-speech components of the utterance such as lip-smacking, breathing noise, throat-clearings, etc. must also be excised.
  • acoustic phenomena are, typically, part of a pattern that is characteristic of a particular speaker.
  • Coarticulation effects at the juncture of syllables and within them can largely be accounted for by analyzing the utterance into unambiguous phonological units, such as demi-syllables, phones, or diphones.
  • the system lexicon will use strings of similar phonological units in the word reference templates. In the case of vocabularies which substantially exceed 500 words, such strings are generally used.
  • a pattern-matching based on whole words has only been successful with small vocabularies. Aside from the excessive data storage requirements that would be created, gathering whole-word spectral data for a large vocabulary, i.e. thousands of words, is an impractical process, as a number of speakers will usually participate in the long training process.
  • Once a segment has been extracted, it must be labelled, i.e. identified as one member of the set of phonological units which comprise all of the sounds used to acoustically describe words in the lexicon. This, in itself, is a two-step process.
  • the first step uses algorithms to compute values for the different parameters which describe significant aspects of the spectral data, as was done at the time-slice level.
  • the second step is some form of pattern-matching process which attempts to match the set of computed values with the sets of values stored as reference data which define different features.
  • Techniques such as dynamic programming make it possible to match features despite time-scale discrepancies between the extracted segment and the reference unit.
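A compact dynamic time warping sketch, shown as one concrete instance of such dynamic programming; the scalar feature values and absolute-difference local cost are simplifying assumptions:

```python
import numpy as np

def dtw_distance(a, b):
    """Minimum cumulative cost of aligning feature sequences a and b,
    tolerant of time-scale differences between them."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0: same shape, stretched in time
```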
  • the system will attempt, in what amounts to trial-and-error fashion, to concatenate a sequence of phonological units into a postulated word-length segment and match it against entries in the lexicon.
  • the matching process is focussed on finding the best match in the lexicon for the phonological unit string that has been assembled. Because the matching process must allow for possible substitution, insertion, or deletion of phonological units, a probabilistic method, such as a hidden Markov model approach, is generally employed. The best match, and the next best matches, become word candidates. The best choice from among the candidates is decided by applying semantic, syntactic, and pragmatic constraints. It can also include whole-word prosodic parameters, such as the position of stressed syllables, but, as was the case at the phonological unit level with spectral data, attention must be given to normalization questions.

Dealing With the Inherent Variability of Speech
  • the first methods are the normalization techniques which have been referred to above in the data preparation step. Methods exist to "massage" the speech representation as a whole, by using either analog or digital techniques, to correct the gross departure of aspects of a particular utterance stream from the expected norms. Normalization techniques are well known in the art to effect either linear or non-linear transformations in one or more of the three basic dimensions: time, frequency, and intensity.
  • the second set of methods comprises a wide variety of techniques which are applied to the pattern-matching process that includes: probabilistic models of possible variations of pronunciation which are stored in the system lexicon; analogous models of the unknown utterance in terms of permutations of different phonological units, such as a phonetic lattice, which might account for the acoustic data; use of features which are determined by parameters whose values fall within certain ranges; and dynamic "time warp" programming methods which are tolerant of variability in the duration of phonemes in a phoneme string.


Abstract

In systems for recognizing continuous speech, this invention provides an improved method for dividing an utterance into word-length segments. In the many earlier attempts undertaken by others in this field, the critical step of utterance segmentation has been accomplished by trial-and-error approaches in which individual sounds must first be delimited, identified, and then concatenated into hypothesized words. Because conventional processes are complex, computationally expensive, and error-prone, with the computational burden quickly growing as the number of permitted words increases, real-time operation requires expensive high-speed computers. In contrast, my invention uses a simple and effective technique to directly segment the utterance. Using a computer mouse, the operator indicates word breaks by sending signals which coincide in time with the boundaries of spoken words. With the words thus delimited, the different segments of an utterance can be processed simultaneously via a low-cost general-purpose computer, using appropriate processing strategies chosen according to segment length. Unknown utterance segments are matched with templates of words stored in the specially prepared lexicon. In addition to conventional acoustic parameters, prosodic ones are used to increase the precision and efficiency of the matching process. Practical applications of this invention, such as voice-actuated typewriters, can be based on low-cost microcomputers. Such devices will be welcomed by keyboard-shy managers, by the blind, and by the physically handicapped. A voice-actuated typewriter can also facilitate communication with the hearing impaired.

Description

APPARATUS AND METHOD FOR CONTINUOUS SPEECH RECOGNITION
Background — Field of Invention
This invention relates to apparatus and methods for the recognition of continuous speech in which an operator interacts with the system.
Summary of Invention
My invention is an improved method for continuous speech recognition in which an operator sends the system signals to explicitly mark the divisions between spoken words. The utterance segments delimited by the signals are analyzed in parallel, using prosodic data as well as conventional spectral data, with different processing strategies being employed for segments of different length.
Background — Description of Prior Art
The most general form of the challenging problem of automatic continuous speech recognition (CSR) might be stated as follows: analyze by automatic means utterances in the form of connected words taken from a normal range of vocabulary spoken in conversational tones and rhythm by a normal range of people in normal environments, and transform the utterances into equivalent orthographic form, i.e. a spelled-out representation of the words which were spoken.
In connected speech, a human listener can recognize a particular word that is spoken at a different pitch or with a different intonation or stress pattern, whether it is spoken loudly or softly, quickly or slowly, even if some small parts of the word have been left out, are distorted, or are obscured by noise. The ultimate goal of efforts in the field of automatic speech recognition is to develop a system with this level of tolerance for variability in the speech signal.
Despite the fact that there are numerous instances where voice entry of speech into a computer would be preferred to the use of a keyboard, and despite the fact that many examples of an automatic CSR system are known to the art, none has had wide success in the marketplace. The different CSR systems currently available suffer from being inadequate in one or more of the following areas: response time, ease of use, flexibility in terms of size of vocabulary or variety of speakers accommodated, ratio of benefits gained compared to system cost, and suitability for use in a normal working environment.
Although no satisfactory solution to the general problem of CSR has been demonstrated, when measured by the number and sophistication of the well-established computational techniques used by researchers in this field, speech recognition must be considered a highly developed art. These techniques are employed in aspects of the recognition process which include signal normalization, utterance segmentation, feature extraction, and feature matching with reference data.
The Appendix to this invention disclosure describes the hardware elements of a conventional CSR as well as the analytical steps typically therein employed. The present invention represents an improvement over existing systems in the area of utterance segmentation. Any CSR implementation that uses the improved method will draw on the large choice of well-developed computational strategies that are already known to those familiar with the art.
Examples of recent prior art which are representative of speech recognition systems which rely on conventional utterance segmentation techniques are described in U.S. Pat. Nos. 4,949,382 Griggs and 5,025,471 Scott et al. Analytical techniques known to the art which can be effectively employed in combination with the novel utterance segmentation technique of the present invention are described in Speech Analysis, Synthesis and Perception, Flanagan, 1972, IEEE Symposium on Speech Recognition, Erman (ed.), 1974, and Speech Recognition by Machine, Ainsworth, 1988.
Problems Associated With a "Bottom-up" Approach
Because it is virtually impossible for conventional CSR systems to identify, at the outset of the process, intervals of the utterance stream which represent word-length data, complex analytical methods are first used to identify the component phonological units by the error-prone processes of extraction and labeling. Only after those processes have been completed can word-length utterance segments be hypothesized by concatenating strings of phonological units. Even then, it is quite likely that a hypothesized segment does not contain the word to be recognized.
In the "bottom up" approach of conventional systems, the initial discrimination process, which focusses exclusively on the basic phonological units, is necessarily carried out in a sequential manner. In contrast, a system that works, from the outset, on word-length segments can process all of the words in a sentence in parallel. Limiting a CSR system to sequential processing during the most computationally expensive part of the whole process is a very serious constraint at a time when very powerful multiprogramming microcomputers are available at low cost. Multiprocessor microcomputers systems and massively parallel processing machines which use standard components are also increasingly available, with their costs steadily dropping, making applications which could benefit from their exceptional parallel processing power increasingly affordable. A system that is constrained to perform sequential processing, as is the case for conventional CSR systems, cannot gain significant benefit from the newly available hardware.
Advantages of My Invention Over Existing Systems
No CSR that claims to be capable of functioning as a speech-to-text transcriber has proven to have the performance capabilities and affordable price that would lead to its gaining recognition in the marketplace. All of the CSR systems heretofore known suffer from a number of specific disadvantages which follow primarily from difficulties associated with the steps leading to the segmentation of utterances into words. In summary, those disadvantages are:
(a) Because a CSR system must contend with the inherent variability of normal speech, it is highly likely that some errors will arise in the initial analytical processes of phonological unit extraction and labeling.
(b) Because the process of identifying a phonological unit is influenced by the state of the preceding one, errors arising from both extraction and labeling will tend to propagate, making correct identification of the succeeding unit more difficult and uncertain.
(c) The degree of uncertainty inherent in the identification of individual phonological units will be multiplied when those units are concatenated into strings for word-level matching with lexical entries.
(d) Bottom-up approaches cannot avail themselves of word-level and sentence-level data, such as the characteristics of stress, intonation, and the relative duration of words and syllables, known collectively as prosodies, before a great amount of other processing has been done.
(e) The sequential processing methods of conventional CSR systems cannot take full advantage of the immense processing power of multiprogramming and parallel processing microcomputers.
This invention relates to a method for continuous speech recognition comprising the step of marking the divisions between spoken words or syllabic segments of such words by generating a signal that coincides in time substantially with the divisions between the words or segments. It also relates to an apparatus for continuous speech recognition comprising signal-sending means for marking the divisions between spoken words or syllabic segments of such words. Accordingly, several advantages of the present invention are that:
(a) my invention incorporates a method of determining robust and effective distinguishing characteristics of word-length utterance segments that is not totally dependent on a highly accurate extraction and labeling of phonological units;
(b) my invention determines distinguishing characteristics for utterance segments, with the success of that determination being substantially independent of the outcome of a similar process which has been applied to preceding segments;
(c) my invention does not totally depend on specific phonological units being concatenated into a string before matching can be attempted with lexical entries;
(d) my invention can use prosodic and other data relating to word-length segments as part of a pattern-matching process with similar data associated with entries in the system lexicon independently of and prior to any lexical entry pattern matching which uses phonological unit data; and
(e) my invention can take full advantage of the processing power of multiprogramming and parallel processing computers by making it possible for the analysis of a number of different word-length utterance segments to take place simultaneously.
Further advantages of the apparatus and method for continuous speech recognition and understanding of my invention are: It can, at an early stage in processing, make use of prosodic patterns in multi-word segments of the utterance as a whole, such as phrases and sentences; it can select processing strategies which are appropriate for each word-length segment according to the number of syllables in the segment; it is easy to use, even by the physically handicapped; it does not require expensive, purpose-built hardware for its realization; it facilitates the production of well articulated and readily recognized utterances by requiring the speaker to explicitly indicate divisions between words; it can be used remotely via telephone; it can be part of a real-time system for speech transcription; it can be part of a computer-assisted translation system for use at conferences and by travelers; and it can be part of systems which are designed to aid communication with the blind and the hearing-impaired. Still further objects and advantages will become apparent from a consideration of the ensuing description and drawings.
Drawing Figures
In the drawings, which illustrate diagrammatically the general form of prior art and the preferred embodiments of the invention, Figs 1A and 1B show block diagram abstractions of conventional CSR systems and of the present invention to indicate the nature of the inputs and outputs.
Figs 2A and 2B show block diagram abstractions of a conventional CSR system and of the present invention in which each employs a confirmation and correction process controlled interactively by the operator. Figs 3A and 3B show block diagrams depicting the hardware components of a conventional CSR and of the preferred embodiment of the invention.
Fig 4 shows the relationship between the elements which link acoustic data to recognized words, according to most CSR systems.
Fig 5 shows the sequence of processing steps that comprise the operation of a conventional CSR system.
Fig 6A shows the sequence of off-line processing steps that comprise the system preparation processes of the invention, and
Fig 6B shows the sequence of on-line processing steps that comprise the operation of the preferred embodiment of the invention.

Reference Numerals in Drawings

Inputs and outputs:
10 speech input
11 speech transcription output
14 word-marker signal
20 confirmation-and-correction input
30 digital data stream
90 warning signal output

Data storage contents:
34 sets of acoustic data
36 algorithm-based programs
38 reference tables
40 system lexicon
42 knowledge-base rules

Hardware elements:
18 signal-sending device
22 digital computer
24 data store
26 microphone
28 band-pass filter bank
32 analog-to-digital-conversion module
34 video display unit

On-line processes:
12 speech recognition
16 confirmation and correction
23 acoustic data collection and preparation
25 output presentation
45 signal normalization
47 phonological unit extraction
49 phonological unit labelling
51 trial segment synthesis
53 pattern matching
58 word-boundary detection
59 noise detection and excision
60 word end-point determination
62 contiguous junction boundary determination
64 word recognition
66 segment classification
68 phrase analysis
70 sentence analysis
72 syntactic analysis
74 semantic analysis
76 pragmatic analysis
78 prosodic analysis
80 intonation analysis
82 system monitor
84 class 'A' analysis
86 class 'B' analysis
88 class 'C' analysis
92 manner class determination

Off-line processes:
42 system preparation
44 lexicon compilation
46 reference table compilation
48 word relation compilation
50 lexicon adaptation
52 operator training
54 speaker verification
56 system-to-speaker adaptation

Description - Figs 1, 2, and 6
Unlike the conventional CSR system shown in Fig 1A which has a single input, a typical embodiment of my invention, as shown in Fig 1B, has two input signals: a speech input 10 and a word-marker signal 14 sent by the operator. Fig 2B shows my invention in an embodiment in which the system operator confirms and corrects the recognized words by means of a confirmation and correction process 16. In contrast with a conventional two-input embodiment of an interactive system, as shown in Fig 2A, Fig 2B shows three inputs: speech input 10; signal 14 from the operator that is received by the system prior to a speech recognition process 12; and a confirmation and correction input 20 which is received from the operator after speech recognition process 12.
Fig 3B shows the functional elements which comprise the preferred embodiment of the invention of the type shown in Fig 2B. All hardware elements are similar to those which comprise a conventional CSR system with the exception of an operator-actuated signal-sending device 18 which sends word-marker signals 14 to digital computer 22. Speech input 10 is received by microphone 26 whose output is an analog signal directed to a band-pass filter bank 28. The output of filter-bank 28 is a set of band-limited analog signals of different central frequencies which cover the most significant range of the original spectrum. The analog signals are transformed by an analog-to-digital conversion module 32 into a digital data stream 30 that is directed to digital computer 22 which stores the data in the form of a time-sliced data set 34 for each frequency band, with each data element being associated with the time when it is received. In parallel with its receiving of data stream 30, computer 22 receives marker signals 14 indicating the divisions between words, as sent by signal-sending device 18.
In an implementation of my speech recognition system in which speech is conveyed to the system by telephone from a remote location, the embodiment will differ from Fig 3B only in that signal-sending device 18 is designed to generate a tone or a click which is picked up by microphone 26, which, in a remote location embodiment, is the microphone built into the telephone handset. A purpose-built telephone for that application would have the signal-sending device generate an electrical analog signal which is added to the signal sent by the handset microphone 26. Although those skilled in the art will recognize that signal-sending device 24 could take many different forms, the preferred embodiment of the invention for local speech input applications uses a two-button mouse. Such a device is commonly available, is inexpensive, and is specifically designed to send signals to a computer.
The preferred use of the device is to tap alternately on the buttons with the index and middle fingers, timing each tap with the beginning of each word. If only one finger is used on a single button, some operators will experience difficulty keeping up with rapid speech. Other methods of marking the divisions between words, such as timing the signal to coincide with the end of each word, or sending one signal before and one signal after each word, were found to be less satisfactory than the preferred method. In some applications, an alternative use of the signal-sending device would have the operator mark the breaks between syllables rather than between words.
The processing steps which comprise the functioning of the preferred embodiment of the invention are set out in block diagram form in Figs 6A and 6B. Fig 6A illustrates the functioning of off-line processes which are employed to prepare the system for recognition use. The system preparation processes include:
(a) a lexicon compilation 44 process and a reference table compilation process 46 which prepare, respectively, a system lexicon 40 and reference tables 38 which reside in a data store 24 with a suite of programs 36;
(b) a word relation process 48;
(c) an operator training process 52; and
(d) a speaker verification process 54.
Prior to using the system for a recognition session, a speaker will use a system-to-speaker adaptation process 56. On a regular basis, a lexicon adaptation process 50 is run to update system lexicon 40 with data in the form of admissible word entries, word relations, and pronunciation variants, based on the experience gained by the system during the most recent recognition sessions. Fig 6B sets out, in block diagram form, the sequence of processing steps which are employed by the preferred embodiment of my CSR system during a recognition session. The preferred method comprises:
(a) a word-boundary detection process 58 which calls on its servant tasks, a word-endpoint determination process 60, a contiguous-boundary determination process 62, and a noise detection and excision process 59;
(b) a sentence analysis process 70;
(c) a phrase analysis process 68;
(d) a segment classification process 66;
(e) an intonation analysis process 80; and
(f) a word recognition process 64.
Separate copies of word recognition process 64 run in different partitions, one for each utterance segment, under the control of a system monitor process 82. System monitor process 82 will cause a warning signal 90 to be generated in the event that speech production is about to exceed the capacity of the system.
Each process 64 will employ servant tasks according to need, with those tasks comprising: a class 'A' analysis process 84, a class 'B' analysis process 86, and a class 'C' analysis process 88. The class-specific analysis processes 84, 86, and 88 will call on their servant tasks, as needed, which comprise:
(a) a syntactic analysis process 72;
(b) a semantic analysis task 74;
(c) a prosodic analysis task 78; and
(d) a pragmatic analysis task 76.
The servant tasks can draw on a set of knowledge base rules 42.
Operation - Figs 6A and 6B
The operation of the preferred embodiment includes the following generic processes:
(a) system preparation;
(b) operator training;
(c) system-to-speaker adaptation;
(d) word recognition; and
(e) word confirmation and correction.
(a) System Preparation
Before the system can be used, the reference data on which the recognition of utterances is based must be entered and structured. This includes various reference tables 38, compiled by reference table compilation process 46, which relate acoustic data to distinctive features, and system lexicon 40, compiled by lexicon compilation process 44, which lists all admissible words together with various characteristics for each entry. A number of different speakers, representative of the expected user population in terms of accent and manner of speaking, train the system by reading sufficient known text to provide for the requisite variety of phonological unit templates, which may be combined into "blurred" templates, against which unknown utterances will be matched. For applications in which a large number of very different speakers are expected to use the system, many training speakers are required. They are sometimes usefully divided into classes according to their manner of speech, e.g. male native speakers, female native speakers, male Hispanic speakers, female Hispanic speakers, etc. Representative templates are gathered for each class.
(b) Operator Training
Each operator who will be using the system must be trained in the use of signal-sending device 18, and the system must know something of their characteristic use of device 18. Individuals will enter the click marking the start of a word in a particular way, with the actual time of the click deviating from the start of the word by a characteristic delay. With this information available to the system, more accurate word boundary marking can be achieved. Interactive operator training process 52, which gives real-time feedback to the operator concerning this delay, quickly helps the operator develop the knack of sending the marker signal at the right time and leads to more consistent and accurate word marks. For a population that includes a number of very divergent speaking styles, each speaker who uses the system will speak a training text, as part of speaker verification process 54, to determine which speaker class he or she falls into, and whether the speaker exhibits speech peculiarities which deviate significantly from the expected pattern. Some of those deviations from the norms used in developing acoustic reference tables 38 can be allowed for by employing adaptation parameters which can be set differently for each speaker.
(c) System-to-speaker Adaptation
The speech of an individual will vary from day to day. Even on the same day, a particular speaker may speak in a different manner just after having a coffee and donut in the morning compared to a session with the recognition system that takes place after a large lunch that includes wine and a martini. To adapt the system to such variations, particularly the prosodic ones which relate to speech rhythm, syllabic stress, and intonation, a very brief known sign-on text is read at the start of each session as part of a system-to-speaker adaptation process 56. Parameters related to speaking manner can thus be set, based on the way that the known text is read.
(d) Word Recognition
The operator of my CSR system explicitly declares the start of each word by alternately pressing the right and left buttons of signal-sending device 18, using the index finger and the middle finger of the dominant hand. The time that each word-marker signal is received by digital computer 22 is stored in a file after it has been adjusted by the speaker's characteristic delay factor, which was established during operator training process 52. The utterance segment delimited by each pair of signals thus contains a single word, and the digital data for that segment, which is stored in a set of data elements, each descriptive of, say, a 100 μs-long time slice, can readily be extracted. The data is then grouped together in time frames which may, for computational convenience, each be 25.6 ms in duration so that each frame will contain 256 data elements.
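By way of illustration only, the following sketch (Python, with NumPy assumed; the function and variable names are hypothetical, not part of the disclosure) shows how the stored data elements of one frequency band falling between two adjacent marker timestamps might be extracted and grouped into 25.6 ms frames:

```python
import numpy as np

SLICE_US = 100        # one data element per 100-microsecond time slice
FRAME_ELEMENTS = 256  # a 25.6 ms frame holds 256 elements

def segment_frames(band_data, marker_times_us, i):
    """Extract the elements of one band lying between the i-th and
    (i+1)-th word-marker timestamps, grouped into whole frames."""
    start = marker_times_us[i] // SLICE_US
    end = marker_times_us[i + 1] // SLICE_US
    segment = np.asarray(band_data[start:end], dtype=float)
    pad = (-len(segment)) % FRAME_ELEMENTS   # fill out the last frame
    segment = np.pad(segment, (0, pad))
    return segment.reshape(-1, FRAME_ELEMENTS)
```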
Signal 14 sent by the operator will only give an approximate indication of the break between words. Analytical techniques employed by a word boundary detection process 58 to pin down the break position precisely will start looking in the time frame in which the word-marker signal falls. If no definitive break identification can be found there, the adjacent time frames will be examined. Extraneous productions of noise, such as throat-clearing or breathing sounds, will typically occur after a word is spoken and before the next signal is sent. Noise productions are recognized and excised by a noise excision process 59, which uses techniques chosen from among those developed for this purpose which are well known to those skilled in the art.
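A minimal sketch of that outward search, assuming a per-frame break test supplied by the boundary-determination processes (is_definitive_break and the other names are placeholders, not the patent's own routines):

```python
def refine_boundary(frames, marker_frame, is_definitive_break, max_offset=3):
    """Look for a definitive break in the frame containing the marker
    signal first, then in progressively more distant neighbours."""
    candidates = [marker_frame]
    for offset in range(1, max_offset + 1):
        candidates += [marker_frame - offset, marker_frame + offset]
    for idx in candidates:
        if 0 <= idx < len(frames) and is_definitive_break(frames[idx]):
            return idx
    return marker_frame  # fall back on the operator's approximate mark
```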
Where clear gaps between adjacent sounds occur, a word end-point determination process 60 is used which is similar to that used in isolated word or discrete utterance recognition. When there is no clear break, a different method, a contiguous juncture boundary determination process 62, which is closely related to that used in isolated word recognition to determine syllabic breaks, is used. Such algorithms or methods are well-known to those skilled in the art and exist in many specialized versions. The choice of an optimal boundary detection instrument depends on the specific pair of phonological units which must be divided.
In normal continuous speech some word breaks will not be readily discernible. The terminal sound of one word will blend with and merge into the initial sound of the adjacent word because of coarticulation and clipping. When the operator of my CSR system is also the speaker, the use of signal-sending device 18 leads to a much more precise articulation of each word, with more definite breaks between words, even without any conscious effort being made by the speaker-operator.
Once the utterance stream has been divided into word-length segments, the speech recognition problem is reduced, substantially, to one of isolated word recognition. There are many methods well known to those skilled in the art which deal effectively with what is generally recognized as being a much easier problem than continuous speech recognition. In most such applications, however, the size of the vocabulary is, at most, a few hundred words. In the present CSR system, the words in the lexicon can be expected to number in the thousands. The task of discriminating between words in my CSR system, however, can also draw on syntactic and semantic constraints. Many methods of doing so, including expert systems and network models, are well known to those skilled in the art.
To deal effectively with a large vocabulary, the system depicted in Fig 6B includes a segment classification process 66 which counts the number of syllables in each word-length utterance segment. It employs techniques well known to those skilled in the art which can identify syllabic breaks with a high level of reliability.
The word-length segments are divided into three classes:
Class 'A': one-syllable words
Class 'B': two-syllable words
Class 'C': three-or-more-syllable words
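The classification step itself is straightforward once the syllable count is available; an illustrative sketch (names invented for the example):

```python
def segment_class(syllable_count):
    """Assign a word-length segment to an analysis class."""
    if syllable_count == 1:
        return 'A'   # monosyllables: fine-grained spectral analysis
    if syllable_count == 2:
        return 'B'   # disyllables: stress/length/intonation patterns
    return 'C'       # three or more syllables: prosody-first strategy
```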
A different analytical strategy will then be applied to each class of segment, as the problem is quite different for each class. A major shortcoming of many heretofore known CSR systems is that a single set of techniques must be applied universally. With any conventional approach that performs a sequential analysis of phonological units, the system cannot know what size of word is being dealt with until the word has been recognized.
My CSR system enables the most efficient recognition strategy to be used for each utterance segment, one that will make use of the most appropriate distinctive characteristics in each case. A hierarchical ordering of parameters and successive hypothesize-and-test iterations will enable the process to converge on a recognized word in as few steps as possible. Although each parameter, taken by itself, is an unreliable fine discriminator, the application of a series of constraints will quickly bring the number of possible word candidates down to a single best choice.
Computer 22 keeps a record of the time that each word-marker signal 14 is received. The start of each signal marks the creation of a new instantiation of word-recognition process 64. Thus a separate process 64 runs for each segment of the utterance, with all segments being analyzed simultaneously. Each process 64 will employ techniques which are appropriate to the class, as determined above, of the utterance segment that is being processed. This is done by calling the appropriate servant process: class 'A' analysis process 84, class 'B' analysis process 86, or class 'C' analysis process 88, which are described below. The way that computer resources are allocated to the different concurrently executing processes depends on the type of computer system used. A multiprogramming environment will have each process share the same processor, running in different memory partitions in a time-sharing mode. A multiprocessor system will divide the processes among the independent processors. The system is designed to process sentence-long utterances whose maximum duration and maximum number of words depend on the main memory capacity and processing speed of the computer that is employed. The system will, if possible, simultaneously run an independent process for each word in the sentence. In parallel with the word analysis processes, phrase-analysis process 68 and sentence-analysis process 70 are running. In addition, syntactic-analysis process 72, semantic-analysis process 74, and, in some applications, pragmatic-analysis process 76 act as servant processes which can be brought into play by the multiple concurrent word-recognition processes 64. These processes consult the syntactic, semantic, and pragmatic knowledge bases. As each word-recognition process 64 terminates, the results of the analysis are passed to sentence-analysis process 70, and the next copy of word-analysis process 64 can run in the freed partition to process the next word-length segment. Warning signal 90 asks the speaker to pause if the system processing capacity is about to be exceeded, as detected by system-monitor process 82. In such cases, a sentence fragment will be processed.
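An illustrative sketch of this one-task-per-segment arrangement, using a standard Python process pool rather than the memory-partition scheme described above (recognize_word is a placeholder; for a process pool it must be a picklable top-level function):

```python
from concurrent.futures import ProcessPoolExecutor

def recognize_sentence(segments, recognize_word, max_workers=8):
    """Run one independent recognition task per word-length segment;
    results come back in sentence order for the sentence analysis."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(recognize_word, segments))
```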
Class 'A' Analysis Process
In normal speech, more than 50% of the words are drawn from a small sub-set of the overall vocabulary. All of those 300 or so common words have one or two syllables, and 75% of those are monosyllabic. This means that a strategy that tries to identify single-syllable utterances in continuous speech would do well to first look for a match from among the most common words.
Each admissible word is associated with a "volume" in lexicon 40 whose hierarchical arrangement of volumes determines the order of consultation. The structure of the lexicon is context-dependent. If the application relates to travel, next in sequence after the volume containing high-frequency standard vocabulary is a volume containing a set of specialized words used in the context of travel. The specialized vocabulary would be different if the context is, for instance, an architectural specification. Subsequent volumes contain words of decreasing frequency. The fact that a word is recognized more frequently by the system than expected will lead to its being promoted to the appropriate volume.
Words other than those found in the high frequency standard vocabulary volume are associated with other lexical entries which appear most often with them in the same phrase. This aspect of the lexicon is compiled by means of word-relation process 48 that extracts such information from many samples of text pertaining to a certain context that are entered as part of lexicon compilation process 44. The system "learns" more about such connections between words as it is used, by means of lexicon-adaptation process 50.
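The ordered consultation of volumes can be sketched as follows, assuming the volumes are supplied as a list pre-ordered by priority and match() stands in for whatever pattern-matching test is in force (all names are illustrative):

```python
def find_candidates(features, volumes, match, max_candidates=5):
    """Consult lexicon volumes in priority order: common words first,
    then context-specific vocabulary, then decreasing-frequency volumes."""
    candidates = []
    for volume in volumes:              # volumes are pre-ordered
        for entry in volume:
            if match(features, entry):
                candidates.append(entry)
                if len(candidates) == max_candidates:
                    return candidates
    return candidates
```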
Another parameter which can help distinguish one word from another is the duration of the spoken word. A monosyllabic word such as "bit" may be quite similar, acoustically, to "beet", but the latter is significantly longer in duration. A duration value, based on a standard rate of speech production, e.g. the normal number of stressed syllables per minute, is stored for each word in the lexicon.
The computation of values for the prosodic parameters which characterize significant distinctive non-spectral features of an utterance segment, including parameters related to syllable stress, syllable duration, syllable intonation, and overall segment duration, is handled by a prosodic analysis process 78.
If a useful comparison is to be made between a lexical entry's duration and the duration of an unknown utterance segment, two levels of normalization can first be considered.
The first is the average speaking rate of the person whose speech is to be recognized. He or she will speak a known text before the recognition process begins, during system-to-speaker adaptation process 56. This enables the system to be adapted to the speech of that particular speaker by means of special parameters which compensate for any deviations from the system's standard values, including the rate of speech production.
A second normalization relates to the particular phrase being analyzed. The phrase in question is the sequence of words falling within a continuous intonation contour that includes the word in question. Comparing the average interval between stressed syllables for that phrase, as computed by phrase analysis process 68, with the overall average for the speaker yields a second factor. Both factors are applied to the measured duration of an utterance segment before that value is compared with the values for words in the lexicon, as sketched below.
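A sketch of the two-factor normalization; the direction of each ratio is one plausible convention, not a prescription of the patent, and all names are hypothetical:

```python
def normalized_duration(measured_ms, speaker_rate, standard_rate,
                        phrase_interval, speaker_interval):
    """Scale a segment's measured duration before comparing it with
    lexicon values. Rates are in stressed syllables per minute;
    intervals are mean gaps between stressed syllables."""
    speaker_factor = speaker_rate / standard_rate      # fast talker -> scale up
    phrase_factor = speaker_interval / phrase_interval # fast phrase -> scale up
    return measured_ms * speaker_factor * phrase_factor
```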
Stress is another characteristic that can help distinguish one monosyllabic word from another that is acoustically similar. For instance, while the differences between the sounds of "of" and "off" can be difficult to distinguish, "of" will usually be unstressed while "off" will likely be stressed. As is the case for duration, stress is a relative measure that only yields a meaningful comparison when it is applied to two words in the same phrase.
Because a monosyllabic word, in comparison with a long word, contains a relatively small number of distinguishing features, all the significant nuances of its features must be employed to ensure a level of redundancy that is sufficient for the reliable recognition of the word from the often imperfect data which is obtainable in situations outside of the laboratory. A reflection of this concern is the careful analysis of spectral data that is required for monosyllabic words. Class 'A' analysis process 84 therefore employs precise and unambiguous phonological units: demi-syllables and affixes.
Normally occurring spectral variants for the words in the lexicon are handled by associating them with different classes of speakers who participate in reference table compilation process 46, which resulted in the compilation of the acoustic reference data stored in the system tables. Variations which result from interactions with adjacent words are handled by maintaining a plurality of templates for the same word, or by the application of phonological rules.
A speaker's characteristic pronunciation is ascertained during speaker verification process 54. Although some adaptation parameters will be set as a consequence, the major adaptation is accomplished by placing the speaker in a particular classification.
Class 'B' Analysis Process
Two-syllable words can be divided into 16 different classifications based on each syllable being either long or short and stressed or unstressed. In some instances, the intonation pattern of a word, i.e. the change in fundamental frequency between the two syllables, can help discriminate between different word candidates. The consideration of this pattern must be done in the context of the overall intonation pattern of the phrase. Intonation analysis process 80 measures this characteristic when required. A word's characteristic intonational contour may change according to its syntactical role. An example of this pattern is the word, "German". It is high-low as a noun, but becomes low-low when used as an adjective, as in "German shepherd". Such a distinction can be helpful in determining the syntactical role played by a word in a particular context.
Class 'C' Analysis Process
As the number of syllables in a word-length utterance segment increases, the prosodic characteristics of syllabic stress and duration, in combination with the normalized total utterance duration, become increasingly determinant. The simplest parameter is word duration. A stress pattern can usually be detected by syllabic variations in total energy. Some speakers will characteristically raise the pitch of the emphasized syllable instead of the intensity. Such an idiosyncrasy can be detected by changes in the fundamental frequency which are not explained by the overall intonational contour of the phrase. A notation that uses capital letters for stressed syllables can describe the duration-stress character of the word "redundant" as short-LONG-short and of "industrious" as short-LONG-short-long. A database representation of the same information requires only two bits per syllable, as in the sketch below.
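One possible two-bit packing; the particular bit assignments are an illustration, not the patent's encoding:

```python
def encode_pattern(pattern):
    """Pack a notation such as 'short-LONG-short' into an integer,
    two bits per syllable: bit 0 = long, bit 1 = stressed (capitals)."""
    code = 0
    for i, syl in enumerate(pattern.split('-')):
        bits = (syl.isupper() << 1) | (syl.lower() == 'long')
        code |= bits << (2 * i)
    return code

# encode_pattern('short-LONG-short')      -> 0b001100 = 12
# encode_pattern('short-LONG-short-long') -> 0b01001100 = 76
```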
In utterance segments of this class, it is sufficient to limit a first analysis of spectral data to a determination of the manner classes of each sound in the utterance, i.e. a grouping of sounds by the manner in which they are produced. The form of classification used consists of: vowels, plosives, fricatives, nasals, glides, affricates, silence, and others. The manner class determination is accomplished by a manner class determination process 92. Each Class 'C' entry in the lexicon is characterized by the sounds it contains in terms of manner classes. This avoids the computationally more expensive and inherently more error-prone process of analyzing the utterance into specific phonological units. The lexicon also includes strings of symbols representing finer-resolution phonological units, such as demi-syllables, for long as well as short words in the lexicon. These reference strings are used, on an exception basis, to disambiguate word candidates when the computationally simpler techniques fail to discriminate between them. In such cases, the utterance segment must be analyzed into comparable units.
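A sketch of such a coarse first-pass comparison; the single-letter codes are invented for the example and do not come from the disclosure:

```python
MANNER_CODES = {'vowel': 'V', 'plosive': 'P', 'fricative': 'F',
                'nasal': 'N', 'glide': 'G', 'affricate': 'A',
                'silence': '_', 'other': 'O'}

def manner_string(sounds):
    """Reduce a sound sequence to a coarse manner-class string,
    e.g. ['plosive', 'vowel', 'nasal'] -> 'PVN'."""
    return ''.join(MANNER_CODES.get(s, 'O') for s in sounds)

def coarse_match(segment_sounds, entry_manner_string):
    """First-pass Class 'C' comparison on manner classes alone; finer
    units such as demi-syllables are consulted only to break ties."""
    return manner_string(segment_sounds) == entry_manner_string
```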
(e) Word confirmation and correction
As each word is recognized, it is added to the sequence of recognized words displayed on the computer terminal. At the end of every sentence, the operator uses signal-sending device 18 to send an acceptance or a rejection signal to the system in response to the single highlighted word on the screen, as each word in the sentence is highlighted in turn.
In the case of a rejected word, the operator can choose the correct one from a list of alternative candidate words which are displayed as soon as the rejection signal has been received. If the correct word is not on the list, the operator has the choice of either speaking the word again or spelling the word out by means of the computer terminal keyboard. Signal-sending device 18 is also used during confirmation and correction process 16 to indicate desired hyphens, punctuation, and capitalization.
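The shape of that review loop can be sketched as follows, with get_signal and get_correction standing in for the device and display handlers (both are hypothetical names):

```python
def confirm_sentence(words, get_signal, get_correction):
    """Step through the recognized sentence; an accept signal keeps the
    highlighted word, a reject signal opens the correction dialogue."""
    final = []
    for word in words:
        # at this point the current word is highlighted on screen
        final.append(word if get_signal() == 'accept'
                     else get_correction(word))
    return final
```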
Ramifications
The CSR system of this invention can form the basis of a voice-actuated typewriter with the operator being, for instance, a keyboard-shy executive.
Letters and documents, including those of a confidential nature, can be created without the delay and inconvenience of an intermediate dictation process and without having to use the services of a skilled typist.
The same CSR system could be used by a relatively unskilled operator to transcribe pre-recorded speech or speech which is being received from a remote location via telephone. Another application would use my CSR system to transcribe prewritten text, such as a hand-written document, which is not amenable to optical character recognition. This would be particularly valuable for research projects which deal with voluminous archival material.
My CSR system can also be operated cooperatively by two people, one who marks the words and a second who uses a separate terminal and signal-sending device to effect the confirmation and correction of the orthographic output. Such a configuration will make possible the production of transcriptions of ongoing conversational speech in real time. This could be most useful for conferences in which speakers, such as those participating in a discussion or debate, do not speak from prepared texts. Transcriptions of the conference session would be available for distribution a few minutes after its closing.
The output from such a conference transcription system could be directed to a computer-aided translation system run by a third operator. The syntactic and semantic analyses already performed by the CSR system for its own purposes would be available to the translation system that needs such information to prepare an accurate version of the speaker's words in the second language.
The output from the above translation system could be directed to a large digitized text display unit that is visible to all participants in the conference, including those who do not understand the speaker, as is commonly done for subtitles of foreign language films shown in film festivals.
A variant would see the same-language transcription of the speaker's words displayed on a large digitized display, even in single-language conferences, for the benefit of those participants who are hearing impaired.
The same orthographic output from a simultaneous translation system based on my CSR system can be transformed into speech in the second language by means of a conventional text-to-speech process. The translation would then be made available to conference participants, almost in real time, via FM transmission to headsets. Such FM receivers are commonly used by conventional simultaneous translation services which require the efforts of highly skilled and expensive interpreters. The same system is useful, even when the translation is also displayed on a large digitized display, for the benefit of participants who are visually impaired.
A portable version of the above translation system could be used by a single operator, with slower response time. Someone would use such an automatic translator to communicate while traveling in a foreign country in a way that is much more graceful than the common thumbing and stumbling through a phrase book.

The CSR system of my invention could be used remotely, by someone who wants to enter continuous speech into a system via telephone. Complex database inquiries could be made in this way, with the desired information being given to the caller in voice-response fashion by means of a text-to-speech system. In this circumstance, the operator uses a variation of the signal-sending device that has been described as being a standard computer mouse in the preferred embodiment. For remote applications, a two-button finger-actuated portable device that sends an analog signal through the telephone handset microphone could be employed. A simpler device would take the form of two finger-size sleeves with hard protuberances at their ends, worn over the index and middle fingers of the dominant hand. With this device, the speaker can tap directly on the handset to generate high-frequency clicks at the onset of each spoken word. The word-marking clicks will be conveyed by the handset microphone and telephone connection together with the spoken words to the CSR system.
Conclusions, Further Ramifications, and Scope
Accordingly, the reader will see that the method and apparatus of this invention, in which an indication of the location of the start of words in continuous speech is explicitly given to a speech recognition system by the operator by means of a commonly available signal-sending device, greatly facilitates the task of analyzing the utterance. Powerful analytic techniques which make use of various parameters derived from word-length utterance segments can be applied from the outset. An utterance that includes the words "recognize speech" presents a difficult problem for conventional CSR systems, as acoustic, syntactic, and semantic analyses can easily lead to "wreck a nice beach". A CSR system that uses the improved segmentation technique of my invention will greatly reduce the number of such ambiguities.
In the case of polysyllabic words, only an approximate classification of sounds in the utterance segment is required, as easily computed prosodic parameters are very effective discriminatory instruments for long words. This substantially reduces the time-consuming computation that is inherent in conventional systems, and makes use of parameters which are more robust than those available to conventional CSR systems. In the case of shorter words, prosodic data contribute additional orthogonal parameters to help differentiate between word candidates proposed by parameters derived from acoustic data. A capital advantage conferred by the present invention, applicable to words of any length in an utterance, is that the analysis of different word-length utterance segments can be undertaken simultaneously, by means of readily available multiprogramming and multiprocessor computers. This brings the benefit of a dramatic increase in recognition speed in comparison with the results obtainable by conventional techniques, which cannot make efficient use of such computer hardware, and makes real-time continuous speech recognition possible using low-cost equipment.
Current workers in the art make use of connectionist models, including artificial neural nets, to deal with the uncertainties in the network of possibilities that ties together computed parameters, distinctive features, phonological units, and words. CSR systems based on such models can also benefit from the use of the utterance segmentation techniques of my invention.
Furthermore, the use of the signal-sending device by an operator who is also the speaker will, without any conscious effort on the operator's part, lead to speech production that is better articulated and which exhibits clearly defined word boundaries in the acoustic data. This phenomenon can be verified by the reader by the simple expedient of tapping on a table alternately with the index and middle fingers of the dominant hand as the words of this disclosure are read out loud. The reader will find that after very little practice the marking of the onset of words can be accomplished accurately, without adversely affecting the fluency of speech.
Although the description above contains many specificities, these should not be construed as limiting the scope of the invention but as merely providing illustrations of some of the presently preferred embodiments of this invention. For example, signal-sending device 18 can be foot-operated instead of hand-operated, and the device may be actuated by any type of switch known to the switch-making art. This can include, but is not limited to, electrostatic switches, acoustically-operated switches, and switches operated by the interruption of radiant energy. The processing steps shown, which, without exception, use algorithms well known to those skilled in the art, can be employed in arrangements which are very different from the one used in the example given, while still taking full advantage of the word-marking information supplied by the signal-sending device. Alternative configurations could use special hardware, such as connectionist machines, including artificial neural network computers, and fuzzy logic circuits, which can handle the great variability that is characteristic of speech.
Thus the scope of the invention should be determined by the appended claims and their legal equivalents, rather than by the examples given.
Appendix Elements of a Typical Conventional CSR System
A "black box" representation of a conventional CSR system is shown in Fig IA. The single input into a speech recognition system 12 is continuous speech 10. The output from the system is a transcription of the spoken words 11.
In essence, a CSR system can be viewed as comprising:
a means to collect and prepare acoustic data from speech input;
• stored knowledge bases, a lexicon, and tables of reference data;
a suitable computer system programmed with a set of algorithms which act upon and analyze the input data in order to effect a comparison with the reference data; and
an output system to present the recognized words.
Fig 2A depicts a modification of the CSR system of Fig 1A which includes an interactive process 16 that displays the recognized words for confirmation or correction by the operator before the confirmed and corrected words are presented as final output. In this configuration, a second operator input 20 is shown, which is used subsequent to recognition process 12.
Fig 3A shows a more detailed block diagram that sets out the hardware components of a typical conventional CSR. A typical CSR system comprises substantially the following elements:
Speech Input and Data Preparation
• microphone 26 as an input device and transducer;
• a bank of band-pass filters 28 which can extract component signal information, or an equivalent computational process;
analog-to-digital converter module 32 to convert the analog wave forms into a stream of discrete data elements 30 that can be analyzed by a digital computer 22;
System Reference Data
• stored reference data characterizing phonological units and words;
• structured knowledge bases relating to semantic, syntactical, phonological, and other constraints;
Algorithms
• a set of algorithms implemented in the form of computer programs, possibly supplemented by special-purpose signal processing circuits, to operate on the received data, in conjunction with the stored reference data, resulting in words being identified for each word-length utterance segment;
in the case of interactive systems as depicted in Fig 2A, an interactive process 16 that enables the operator to confirm or correct the words proposed by the CSR internal processes;
Data Processing Hardware
a digital computer 22 with suitable processing speed, RAM work space, and data storage capacity to carry out the above computational processes; and
• one or more suitable output devices.
Typical Operation of a Conventional CSR System
The critical functional elements of a CSR, i.e. the ones which significantly differentiate one system from another, are the various algorithms employed. Some of the steps in the CSR process may employ algorithms which can be implemented either in hardware or in software. Although actual CSR systems vary considerably in implementation, a typical system will use methods known to the art to accomplish substantially the following steps, including off-line processes which are similar to some of those employed in the present invention, as depicted in Fig 6A, and on-line processes as depicted in Fig 5.
Off-line Processes
(a) Prepare tables of acoustic reference data 38 which are used in the processes that link acoustic data received by the system to phonological units which represent distinct sounds by means of a reference table compilation process 46.
This is typically done by "training" the system using a number of speakers, representative of the expected system user population, who repeatedly speak a training text. The spoken words are known to comprise specific sounds. This can be accurately reflected in their phonological unit transcription if sound-specific units are chosen. If the common phonetic transcription is used, comprehensive phonological rules must also be employed. Although a particular phoneme, to a human listener, represents one specific sound, that same symbol actually covers a number of significantly different acoustic variations, known as allophones, which result from the coarticulation effects occasioned by the proximate production of adjacent sounds.
Appropriate algorithms will analyze the sets of acoustic data gathered in the training sessions which relate to particular sounds. The data will cover the range of variation that follows from the expected inter-speaker and intra-speaker variability. Values are calculated for, typically, ten or more distinct descriptive parameters. Distinctive features are considered to be present when a certain time-frame of an utterance is associated with certain sets of computed parameters whose values fall within specific ranges. Each phonological unit is defined by a specific constellation of features. Reference tables are built which contain range values for the sets of parameters which are used to describe the orthogonal features of the particular phonological units which have been chosen. Fig 4 depicts this relationship, which permits analysis in one direction and synthesis in the other.
(b) Build system lexicon 40, which contains all admissible words together with all of the relevant data for each word that can help differentiate one word from another in the context of connected speech, by means of a lexicon compilation process 44.
This can include a phonological representation and syntactical, semantic, and prosodic information.
(c) Verify, by means of a speaker verification process 54, that a speaker's manner of speaking is acceptable to the system.
In many applications, it is important to know from the outset whether or not a person's characteristic speech can be readily recognized by the system. This is ascertained by having the person read a known text. Some systems incorporate an initialization process in which the reading of a known text by a new speaker is used to place the speaker in a certain classification or to establish particular general characteristics of that person's speech, setting some classification-specific or speaker-specific system parameters to adapt the system to the speaker.
Online Functions
To make it easier to understand and appreciate the significance and value of my invention, it is helpful to first consider the problems inherent in the known art which arise in the following steps of a conventional CSR process:
(a) Arrange the acoustic data into a form amenable to digital analysis by means of an acoustic data collection and preparation process.
Analyze the original analog speech signal, as it is received from the microphone transducer, into a set of signals of different bandwidths by means of a bank of band-pass filters. Alternatively, a computational technique, such as a Fast Fourier Transform, may be employed to accomplish a comparable spectral analysis after the signal has been digitized.
A sampling rate of 10kHz is typically used when the analog signal is converted into digital data.
The computations used to effect the initial phonological segmentation process work with time-slices of data, each slice typically being between 5 ms and 30 ms in width. Within a frequency band, each time-slice will contain a set of N data elements, with N = time-slice width × sampling rate; for example, a 25.6 ms slice sampled at 10 kHz yields N = 256 elements.
Each time slice will comprise multiple data elements within each component frequency band, with between 8 and 35 different bands typically used.
The resolution of the system is decided by the sampling rate and the choice of storage width for each data element, which, typically, is 12 bits. A compromise must be chosen between a resolution that is too low to adequately model the relevant features and one that obscures those features by highlighting noise and minute variations which do not represent useful information.
A purpose-built signal-processing "front-end", using specialized analog components and circuits, has been employed in some CSR systems. Algorithms are known which can effect a digital simulation of substantially any analog analytical process. The current availability of high-speed general-purpose digital signal processing chips and low-cost 32-bit microprocessors operating at 50 MHz, however, now makes it practical and economical, in many cases, to convert the original analog speech signal into a digital representation before any complex processing is undertaken.
Signal Normalization
A recognition strategy will often begin with a normalization process 45, which can be carried out by special circuits before the analog-to-digital conversion or by computational techniques after it. The intent is to "massage" the raw utterance into a form that corresponds, in terms of overall signal intensity, mean spectral frequency, and syllable length, to the norms which applied to the "training" speech used to develop the system's reference data. A sophisticated normalization process will attend to the pattern of variance as well as to the mean value of each orthogonal measure.
Word Segmentation as Carried Out in Conventional CSR Systems
Word segmentation is a highly problematic process in the field of CSR. In conventional CSR systems, segmentation is a serial process in which utterance segments postulated to be word-length are assembled from phonological units. An attempt is then made to match the synthesized segments with lexicon entries. The phonological units have been previously determined by a two-step process of phonological unit segmentation, comprising "extraction" and labeling of the unit, substantially as outlined below.
(b) Extract the phonological units from the utterance by means of a phonological unit extraction process 47.
Extraction refers to the partitioning of the utterance into sub-utterance segments which the system can attempt to match with phonological unit reference templates. This first partitioning is accomplished using algorithms which must first successfully locate and excise pauses and breaks, with such breaks marking either breaks between words or between syllables.
Problems Associated With Extraction
Often, breaks between phonological units in the spectral data will not be readily apparent. As a consequence, complex algorithms must be used to detect the subtle pattern changes which indicate the passage from one sound to another. Breaks, where they exist, may turn out to be inter-syllable or inter-word. No acoustic analysis can reliably differentiate between the two cases.
Complicating matters immensely are coarticulation effects. These are phonemic modifications which arise from interactions between adjacent phonemes or words. This phenomenon is a consequence of hysteretic effects which are characteristic of the functioning of the human vocal apparatus.
To delimit a "clean" segment, non-speech components of the utterance, such as lip-smacking, breathing noise, throat-clearings, etc. must also be excised. Such acoustic phenomema are, typically, part of a pattern that is characteristic of a particular speaker.
Coarticulation effects at the juncture of syllables and within them can largely be accounted for by analyzing the utterance into unambiguous phonological units, such as demi-syllables, phones, or diphones. The system lexicon will use strings of similar phonological units in the word reference templates. In the case of vocabularies which substantially exceed 500 words, such strings are generally used. Pattern matching based on whole words has only been successful with small vocabularies. Aside from the excessive data storage requirements that would be created, gathering whole-word spectral data for a large vocabulary, i.e. thousands of words, is an impractical process, as a number of speakers will usually participate in the long training process.
Coarticulation effects between adjacent words are much more problematic, as overlap, clipping, and word-pair-specific effects, which are not fully predictable by phonological rule generalizations, often occur at such junctures. Maintaining reference templates for each possible word pair for a large vocabulary, even with syntactical and semantic constraints considered, is not practical. The critical problem of this first segmenting of an utterance has received much attention in the art, as any errors made here will propagate and compromise all subsequent analyses. A wrongly placed break obviously leads to an analysis being applied to inaccurate data sets for the segment preceding the break as well as for the one following. Furthermore, once one sound has been incorrectly labelled, any attempt to employ phonological rules to explain the next sound based on the previous one will lead to another error.
In practice, most CSR systems will first calculate parameters for each time slice of an utterance. Adjacent slices with similar parameters, or whose parameters demonstrate an expected variation for features which vary characteristically over time, are grouped together. Special algorithms will make a decision as to the dividing point between one phonological unit and another, although in practice there may be breaks, clipping, or distortions resulting from coarticulation phenomena. Distorted data resulting from sound interaction can often be explained by employing an elaborate set of phonological rules.
(c) Labeling of phonological units by phonological unit labeling process 49.
Once a segment has been extracted, it must be labelled, i.e. identified as one member of the set of phonological units which comprise all of the sounds used to acoustically describe words in the lexicon. This, in itself, is a two-step process.
The first step uses algorithms to compute values for the different parameters which describe significant aspects of the spectral data, as was done at the time-slice level.
The second step is some form of pattern-matching process which attempts to match the set of computed values with the sets of values stored as reference data which define different features. Techniques such as dynamic programming make it possible to match features despite time-scale discrepancies between the extracted segment and the reference unit.
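As an illustrative aside (not drawn from the patent), the classic dynamic-programming "time warp" recurrence can be sketched as follows; dist is any frame-to-frame distance function, and the names are hypothetical:

```python
def dtw_distance(a, b, dist):
    """Dynamic 'time warp' alignment cost between a computed feature
    sequence and a reference template; tolerant of differing lengths."""
    INF = float('inf')
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],       # skip in template
                                 D[i][j - 1],       # skip in utterance
                                 D[i - 1][j - 1])   # advance both
    return D[n][m]
```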
Problems Associated with Labelling
The labeling process leads to problems when a poorly articulated sound must be dealt with. A label must be affixed to this sound, even if the actual acoustic data makes it impossible to decide on the label with a high degree of confidence. For systems which do not associate a measure of confidence with each label, the subsequent steps must assume that each label represents an equally definitive identification. If the system is to correctly interpret the meaning of the acoustic features of a particular segment of speech, it will bring to bear a set of phonological rules which, to be useful, require the correct identification of the preceding sound. It is evident that labeling errors will propagate and compromise all future processing steps as is the case for segmentation errors.
(d) Postulating word candidates by means of trial segment synthesis process 51.
Once a string of phonological units has been identified, the system will attempt, in what amounts to trial-and-error fashion, to concatenate a sequence of phonological units into a postulated word-length segment and match it against entries in the lexicon.
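Illustratively, that trial-and-error concatenation can be sketched as a nested scan over runs of units (lexicon_index and the length bound are assumptions for the example):

```python
def propose_words(units, lexicon_index, max_units=6):
    """Concatenate successively longer runs of phonological units into
    trial word-length segments and look each one up in the lexicon."""
    hypotheses = []
    for start in range(len(units)):
        for end in range(start + 1, min(start + max_units, len(units)) + 1):
            key = '-'.join(units[start:end])
            if key in lexicon_index:
                hypotheses.append((start, end, lexicon_index[key]))
    return hypotheses
```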
If a system must deal with a large vocabulary, however, homonyms become a problem. Ambiguities proliferate in continuous speech, particularly when speech is being analyzed by a conventional CSR system that cannot reliably differentiate between syllabic breaks and word boundaries. Dixon and Martin, p. 3, make this point with the example, "Our machine can wreck a nice beach." Or was that, "Our machine can recognize speech."? Conventional CSR systems simply do not have the means to decide between such ambiguities.
(e) Matching of word-length segments with entries in the lexicon by means of pattern-matching process 53
The matching process is focused on finding the best match in the lexicon for the phonological unit string that has been assembled. Because the matching process must allow for possible substitution, insertion, or deletion of phonological units, a probabilistic method, such as a hidden Markov model approach, is generally employed. The best match, and the next best matches, become word candidates. The best choice from among the candidates is decided by applying semantic, syntactic, and pragmatic constraints. It can also take into account whole-word prosodic parameters, such as the position of stressed syllables, but, as was the case at the phonological unit level with spectral data, attention must be given to normalization questions.
Dealing With the Inherent Variability of Speech
When words are characterized in a machine-based system by the acoustic parameters derived from spectral data, the same word when spoken by different people will appear to be very different. The same word, even when spoken by the same person, will be different, acoustically, when spoken at different times, under different conditions, or when preceded or followed by different words. The variability described above extends as well, quite obviously, to the word sub-segments, i.e. the phonological units which comprise the whole word.
To be effective, efforts at pattern matching for the purposes of analyzing speech, whether that process is carried out at the level of words or at the level of phonological units, and even when only one speaker is involved, must tolerate considerable variability. Workers in the field therefore resort to a variety of methods well known to the art which fall into two categories.
The first methods are the normalization techniques which have been referred to above in the data preparation step. Methods exist to "massage" the speech representation as a whole, by using either analog or digital techniques, to correct the gross departure of aspects of a particular utterance stream from the expected norms. Normalization techniques are well known in the art to effect either linear or non-linear transformations in one or more of the three basic dimensions: time, frequency, and intensity.
The second set of methods comprises a wide variety of techniques which are applied to the pattern-matching process that includes: probabilistic models of possible variations of pronunciation which are stored in the system lexicon; analogous models of the unknown utterance in terms of permutations of different phonological units, such as a phonetic lattice, which might account for the acoustic data; use of features which are determined by parameters whose values fall within certain ranges; and dynamic "time warp" programming methods which are tolerant of variability in the duration of phonemes in a phoneme string.

Claims

I claim:
1. A method for continuous speech recognition comprising the step of marking the divisions between spoken words or syllabic segments of such words by generating a signal that coincides in time substantially with the divisions between the words or segments.
2. A method as defined in claim 1 in which said signal is generated by a device that is actuated by one or more finger-actuated switches.
3. A method as defined in claim 2 in which said device is a computer mouse.
4. A method as defined in claim 1 in which said signal is generated by a device that is actuated by one or more foot pedals.
5. A method as defined in claim 1 in which said signal is generated by a tone generator.
6. A method as defined in claim 1 in which said signal is generated by means of a device that equips one or more fingertips with a hard protuberance so that a signal can be generated by tapping.
7. A method as defined in claim 1 which further comprises analyzing the acoustic data as segmented by said signals.
8. A method as defined in claim 7 in which the step of analyzing comprises:
(a) determining the length of each segment; and
(b) choosing an analytical strategy for each segment that is designed to be
efficient for a segment of that particular length.
9. A method as defined in claim 8 in which the analytical strategy designed for polysyllabic words comprises:
(a) determining prosodic parameters descriptive of said words; and
(b) determining an approximate phonological representation of the words.
10. A method as defined in claim 8 in which said analytical strategies chosen for each segment of an utterance are carried out simultaneously by a multiprogramming or multiprocessor computer.
11. An apparatus for continuous speech recognition comprising signal-sending means for marking the divisions between spoken words or syllabic segments of such words.
12. An apparatus as defined in claim 11 in which said signal-sending means comprises a device that uses one or more finger-actuated switches.
13. An apparatus as defined in claim 12 in which said device comprises a computer mouse.
14. An apparatus as defined in claim 11 in which said signal-sending means comprises a device that uses one or more foot pedals.
15. An apparatus as defined in claim 11 in which said signal-sending means comprises a tone generator.
16. An apparatus as defined in claim 11 in which said signal-sending means comprises a device that equips one or more fingertips with a hard protuberance so that a signal can be generated by tapping.
17. An apparatus as defined in claim 11 which further comprises analyzing means for analyzing the acoustic data as segmented by said signal-sending means.
18. An apparatus as defined in claim 17 in which said analyzing means comprises:
(a) segment analysis means for determining the length of each segment; and
(b) feature analysis means that is adapted to each segment according to each segment's length.
19. An apparatus as defined in claim 18 in which said feature analysis means, when adapted to polysyllabic words, comprises:
(a) computing means to determine prosodic parameters descriptive of said words; and
(b) computing means to determine an approximate phonological representation of said words.
20. An apparatus as defined in claim 19 which further comprises system preparation means comprising:
(a) reference table compilation means to determine parameters descriptive of particular sounds;
(b) lexicon compilation means to determine parameters descriptive of particular words;
(c) speaker verification means to determine whether the speech of a particular speaker can be recognized by the system and into which speaker classification a particular speaker falls;
(d) system-to-speaker adaptation means to adjust the recognition means to the state of the speaker;
(e) word relation compilation means to determine, in the case of specialized vocabularies, which words are most likely to accompany other words in a particular phrase; and
(f) operator training means to train an operator in the use of the signal-sending device while recording said operator's characteristic use of the device.
21. A voice-actuated typewriter which comprises:
(a) acoustic processing means to transform speech into digital data;
(b) signal-sending means to generate word delimiting signals;
(c) analyzing means to determine parameters descriptive of each delimited word-length segment;
(d) pattern-matching means to match said segments with entries in a system lexicon;
(e) display means to display a proposed transcription on the video display terminal;
(f) confirmation and correction means to prepare a final version of said transcription; and
(g) printing means to print said verified transcription.
PCT/CA1993/000420 1992-10-22 1993-10-13 Apparatus and method for continuous speech recognition WO1994009485A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU51467/93A AU5146793A (en) 1992-10-22 1993-10-13 Apparatus and method for continuous speech recognition

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CA002081188A CA2081188A1 (en) 1992-10-22 1992-10-22 Apparatus and method for continuous speech recognition
CA2,081,188 1992-10-22

Publications (1)

Publication Number Publication Date
WO1994009485A1 true WO1994009485A1 (en) 1994-04-28

Family

ID=4150587

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CA1993/000420 WO1994009485A1 (en) 1992-10-22 1993-10-13 Apparatus and method for continuous speech recognition

Country Status (3)

Country Link
AU (1) AU5146793A (en)
CA (1) CA2081188A1 (en)
WO (1) WO1994009485A1 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104267922B (en) * 2014-09-16 2019-05-31 联想(北京)有限公司 A kind of information processing method and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0180047A2 (en) * 1984-10-30 1986-05-07 International Business Machines Corporation Text editor for speech input
EP0241183A1 (en) * 1986-03-25 1987-10-14 International Business Machines Corporation Speech recognition system
US5136654A (en) * 1989-10-19 1992-08-04 Kurzweil Applied Intelligence, Inc. Vocabulary partitioned speech recognition apparatus


Cited By (9)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
DE19938649A1 * | 1999-08-05 | 2001-02-15 | Deutsche Telekom Ag | Method and device for speech recognition, triggering speech-controlled procedures by recognizing specific keywords in detected speech signals based on a prosodic examination or intonation analysis of the keywords
GB2502944A * | 2012-03-30 | 2013-12-18 | Jpal Ltd | Segmentation and transcription of speech
US9786283B2 | 2012-03-30 | 2017-10-10 | Jpal Limited | Transcription of speech
TWI503813B * | 2012-09-10 | 2015-10-11 | Univ Nat Chiao Tung | Prosody signal generating device capable of controlling speech rate and hierarchical rhythm module with speech rate dependence
CN111429942A * | 2020-03-19 | 2020-07-17 | Beijing ByteDance Network Technology Co., Ltd. | Audio data processing method and device, electronic equipment and storage medium
CN111429942B | 2020-03-19 | 2023-07-14 | Beijing Volcano Engine Technology Co., Ltd. | Audio data processing method and device, electronic equipment and storage medium
CN112466328A * | 2020-10-29 | 2021-03-09 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Breath sound detection method and device, and electronic equipment
CN112466328B | 2020-10-29 | 2023-10-24 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Breath sound detection method and device, and electronic equipment
CN114822536A * | 2022-04-27 | 2022-07-29 | Vivo Mobile Communication Co., Ltd. | Voice recognition method and device, electronic equipment and readable storage medium

Also Published As

Publication number | Publication date
CA2081188A1 (en) | 1994-04-23
AU5146793A (en) | 1994-05-09

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AU BG BR BY CA CZ FI HU JP KR NO NZ PL RO RU SK UA US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB GR IE IT LU MC NL PT SE

121 EP: The EPO has been informed by WIPO that EP was designated in this application
122 EP: PCT application non-entry into the European phase
NENP Non-entry into the national phase

Ref country code: CA