WO2005060595A2 - Mobile telephone with a speech interface

Info

Publication number: WO2005060595A2
Application number: PCT/US2004/040922
Authority: WO
Inventors: Dong-Jian Yue; Gui-Lin Chen; Zhen-Li Yu; Yi-Qing Zu
Original assignee: Motorola Inc
Current assignee: Motorola Solutions Inc
Priority date: 2003-12-17
Filing date: 2004-12-07
Publication date: 2005-07-07
Anticipated expiration: 2006-06-17
Also published as: WO2005060595A3; CN1630394A

Abstract

A mobile telephone (1) includes a voice function button on a keypad (8), a speech recognition function and a text to speech synthesis function displayed on a screen (6). The voice function button is used by the user to select a desired one of three speech recognition modes performed by a processor (4). These modes are name recognition, number recognition and command recognition. The telephone (1) bases its approach to speech recognition of input speech on the selected speech mode. If the name recognition mode is selected, a set of names is used as the basis for recognition. If the number recognition mode is selected, numbers are used as the basis for recognition. If the command recognition mode is selected, a set of commands is used as the basis for recognition. Text to speech synthesis is used to play back the recognised speech before going ahead with dialling a telephone number, through a communications unit (2), based on the recognised name or number, or going ahead with the commanded function.

Description

MOBILE TELEPHONE WITH A SPEECH INTERFACE

FIELD OF THE INVENTION

This invention relates in general to electronic devices with a speech interface. The invention is particularly useful for, but not necessarily limited to, such devices with a telephone function.

BACKGROUND ART OF THE INVENTION

Mobile electronic devices such as mobile telephones and personal digital assistants (PDAs) are globally popular, especially as they gain features and as hybrids, which function as both, enter the market. In many countries such devices are ubiquitous, are owned by the majority of teenagers and young adults, and have become essential as data storage devices, personal organisers and, especially, personal communicators. The predominantly used interface between such mobile devices and their users is the tactile-visual interface, which is a graphical user interface (GUI) similar to that used by most personal computers and other devices. For most mobile telephones, this involves using buttons to navigate one's way through sequences of menus. For most PDAs, one similarly navigates one's way through sequences of menus using a touch screen and stylus. Many devices offer both touch screen and button navigation.

Thus, using a typical mobile telephone, if a user wants to call a person whose number is recorded in the phonebook stored on the telephone, the user has at least to do the following steps, each requiring a separate button pushing exercise: enter the phonebook, locate the target person's record and select the corresponding telephone number. The problem is not so much in having to push the same or different buttons one or more times for each step. Instead, the problem is that the user must pay attention to the screen, and possibly the buttons, at every step of the operation. Generally, with small screens and small text, as tend to be used on mobile electronic devices, the user has to concentrate more than he would if the screen and text were larger. Thus visibility of the screen and a good line of sight during use are essential to using such devices. Unfortunately such an interface is not easily usable by the blind or otherwise visually impaired.

Moreover, and more importantly for most people, such an interface is inconvenient or dangerous to use when one is moving fast, shaking around too much to focus on the screen or concentrating on other matters, especially when driving. In particular, because such interfaces require so much concentration, in some countries it is illegal, or soon to be illegal, to use them whilst driving.

The use of other interfaces is also known for certain purposes. For instance, United States Patent Application Publication No. 2003/0,139,922, published on 24 July 2003, to Hoffmann et al., mentions a speech recognition system for speech-controlled inputting of short messages into a mobile telephone for using the Short Messaging Service (SMS). Further, United States Patent Application Publication No. 2003/0,008,680, published on 9 January 2003, to Huh et al., describes a pocket and docking station system for a mobile telephone, where the system can receive and recognise speech commands and convert them to electronic signals for instructing the mobile telephone. The system can also provide audible voice prompts to acknowledge commands given by the user.

SUMMARY OF THE INVENTION

In this specification, including the claims, the terms 'comprises', 'comprising' or similar terms are intended to mean a non-exclusive inclusion, such that a method or apparatus that comprises a list of elements does not include those elements solely, but may well include other elements not listed.

According to one aspect of the invention there is provided an electronic device. This device comprises voice function selection means and a processor. The voice function selection means is user operable to select a first speech recognition mode or a second, different speech recognition mode. The processor is operable to perform speech recognition on a received speech signal, according to the mode selected by the voice function selection means, and to perform a further function based on recognised received speech.

According to another aspect of the invention there is provided a method of controlling a mobile electronic device to perform a desired function. The method comprises receiving a selection of one speech recognition mode from a plurality of speech recognition modes and receiving a speech signal. The method then performs speech recognition on the received speech signal according to the selected speech recognition mode. The method performs a further function based on recognised received speech.

According to a further aspect of the invention there is provided computer software for controlling a mobile electronic device to perform a desired function. The software comprises computer code means for instructing a processor. The code means instructs the processor to receive an input as a selection of a speech recognition mode from a plurality of speech recognition modes and to receive a speech signal. The processor is instructed to perform speech recognition on the received speech signal according to the selected speech recognition mode. The processor is further instructed to perform a further function based on recognised received speech.

BRIEF DESCRIPTION OF THE DRAWINGS

In order that the invention may be readily understood and put into practical effect, reference will now be made to a preferred non-limiting embodiment, as illustrated with reference to the accompanying drawings in which:

Figure 1 is a block diagram illustrating components of a mobile telephone in accordance with an embodiment of the invention;

Figure 2 is a schematic view of a displayed page of a telephone following speech recognition in one mode;

Figure 3 is a schematic view of a displayed page of a telephone following speech recognition in another mode; and

Figure 4 is a flow diagram relating to the selection and operation of speech recognition modes.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT OF THE INVENTION

In the drawings, like numerals on different figures are used to indicate like elements throughout. Figure 1 is a block diagram of components of an electronic device in the form of a mobile radio telephone 1 in accordance with an embodiment of the invention. The radio telephone 1 has a radio frequency communications unit 2 coupled to be in communication with a processor 4. A standard input interface in the form of a screen 6 and a keypad 8 is also coupled to be in communication with the processor 4.

The processor 4 includes an encoder/decoder 10 with an associated Read Only Memory (ROM) 12 storing data for encoding and decoding voice or other signals that may be transmitted or received by the radio telephone 1. The processor 4 also includes a microprocessor 14 coupled, by a common data and address bus 16, to the encoder/decoder 10 and an associated character Read Only Memory (ROM) 18, an acoustic unit inventory Read Only Memory (ROM) 20, a Random Access Memory (RAM) 22, a static programmable memory 24 and a removable SIM module 26. The static programmable memory 24 and SIM module 26 each can store, amongst other things, selected incoming text messages and a telephone book database of telephone numbers. The microprocessor 14 has ports for coupling to the keypad 8, the screen 6, an alert module 28 that contains a vibrator motor and associated driver, a microphone 30 and a speaker 32. The microphone 30 and speaker 32 in this embodiment also form part of the interface between a user and the telephone 1.

The character ROM 18 stores code for decoding or encoding text messages that may be received by the communications unit 2 or input at the keypad 8. The character ROM 18 and the inventory ROM 20 both also store operating code (OC) for the microprocessor 14, the OC in the inventory ROM 20 being used for Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) synthesis.

The radio frequency communications unit 2 is a combined receiver and transmitter having a common antenna 34. The communications unit 2 has a transceiver 36 coupled to the antenna 34, via a radio frequency amplifier 38.

The transceiver 36 is also coupled to a combined modulator/demodulator 40 that couples the communications unit 2 to the processor 4.

The above mobile telephone is operable as a standard telephone in terms of making and receiving telephone calls, sending and receiving SMS messages, etc. The difference is in the user interface. The processor 4 of the telephone is able to operate as an ASR engine and conduct automatic speech recognition on speech received by the microphone 30 and converted into electrical signals. The recognition is based on suitable software stored in the code ROM 12, comparing incoming signals with speech models stored in the inventory ROM 20. The processor 4 of the telephone is also able to operate as a TTS engine and conduct text to speech synthesis on text received, for instance as an SMS message, or text read out from a memory item, for instance a menu heading or its contents. The synthesised speech signals are communicated to the user through the speaker 32. The speech synthesis is based on suitable software stored in the code ROM 12, concatenating acoustic units stored in the inventory ROM 20.

The telephone of Figure 1 has a multi-modal user interface, integrating the screen 6, the keypad 8, an embedded ASR engine used with the microphone 30 and an embedded TTS engine used with the speaker 32, to enhance the usability of the mobile phone. The user can input data and commands by any of the screen 6, the keypad 8 and the ASR engine. The telephone is able to output data and messages by either of the screen 6 and the TTS engine. In addition, voice signals can be picked up and transmitted through the microphone 30, and received and played through the speaker 32, when the telephone is used in a telephone call.

The mobile telephone of Figure 1 is operable in various ASR modes and TTS modes. The mobile telephone includes a voice function button (typically disposed on the keypad 8), a speech recognition function and a text to speech synthesis function. The voice function button is used by the user to select a desired one of several speech recognition modes (in this embodiment: name recognition, number recognition and command recognition). The telephone bases its approach to speech recognition of input speech on the selected speech mode. If the name recognition mode is selected, a set of names is used as the basis for recognition. If the number recognition mode is selected, numbers are used as the basis for recognition. If the command recognition mode is selected, a set of commands is used as the basis for recognition. Text to speech synthesis is used to play back the recognised speech before going ahead with dialling a telephone number based on the recognised name or number, or going ahead with the commanded function.

The ASR engine can work in three speech recognition modes: i) a name recognition mode, in the form of the Name Dial mode; ii) a command recognition mode, in the form of the Command Control mode; and iii) a number recognition mode, in the form of the Digit Dial mode.
The user can selectively access these by way of different access actions.

i) Name Dial mode (accessible from any existing mode except a Dial Pad mode) - This mode is used to retrieve and dial a telephone number based on a voiced input of a person's name. Thus if the user were to say a name (for instance "Charlie Farley"), whilst the telephone is in Name Dial mode, the ASR would seek to recognise the spoken term ("Charlie Farley") and to match it with the names in the telephone book in the static programmable memory 24. Once a match is made, the name is deemed to have been recognised, the associated telephone number is dialled and a call is made. The telephone uses TTS to read out the name it has matched, as confirmation.

ii) Command Control mode (accessible from any existing mode) - This mode is used to input command keywords (which may be words or phrases) to the telephone. Thus if the user were to say something whilst the telephone is in Command Control mode, the ASR would seek to recognise what is said and match it with a command keyword in the inventory ROM 20. Once a match is made, a command is deemed to have been recognised, and the appropriate action is taken. This may, for instance, be: to start recording an SMS message, to fetch out and play a newly or previously received SMS, to find and read out the events in today's day planner, to check the current signal or battery level, to open the dial pad, etc. Some of the actions require a spoken response from the telephone, for which the TTS engine is used. Examples of some possible commands, with the relevant resulting actions, are provided in Table 1 below.

Table 1. Some Possible Voice Commands

iii) Digit Dial mode (accessible only when in the Dial Pad mode, which is reached through the suitable command in the Command Control mode) - This mode is used to dial a telephone number directly from voiced input. Thus if the user were to say a series of numbers, whilst the telephone is in Digit Dial mode, the ASR would seek to recognise each number. Once each number in the entire sequence of numbers has been determined, the associated number is dialled and a call is made. The telephone uses TTS to read out the sequence of numbers it has recognised, as confirmation.

Thus the processor 4 performs speech recognition based on different categories of data (in this embodiment: name, number or command), according to the speech recognition mode selected.

All three ASR modes are accessible in this embodiment through pushing a single specific button on the telephone, in this case a voice function button. The voice function button is part of the keypad 8, although it is most usefully placed on the side of the telephone (it can be sited elsewhere, without necessarily being located with the rest or majority of the keypad buttons). The voice function button in this embodiment is multiplexed with another function, to go back a page or level if it is clicked twice in quick succession. That is, it allows a user to return from the current function or menu level to the previous function or up a level. For example, if a user were browsing the Internet, he could leave the current program and return to the previous program or home page by double-clicking the voice function button. This button operation is assigned the same function as a "go-back" menu on a program page. It speeds up navigation among the functions or programs in the mobile telephone. Alternatively, the voice function button can be multiplexed with other function buttons and/or to perform other functions. In another alternative embodiment it provides only the operations mentioned above.
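The mode-dependent recognition described above can be illustrated with a minimal Python sketch. It is not the patented ASR engine: it assumes the recogniser has already produced a text hypothesis, and it uses approximate string matching from the standard library as a stand-in for acoustic matching against speech models. The phone book entries, telephone numbers, command keywords and action names are all hypothetical. The point it shows is that the selected mode restricts the set of candidates against which the input is matched.

```python
from difflib import get_close_matches
from enum import Enum


class Mode(Enum):
    NAME_DIAL = "name"
    DIGIT_DIAL = "number"
    COMMAND_CONTROL = "command"


# Hypothetical data, for illustration only.
PHONE_BOOK = {"Charlie Farley": "5550101", "George Ferackis": "5550102"}
COMMANDS = {
    "open dial pad": "OPEN_DIAL_PAD",
    "read message": "READ_MESSAGE",
    "battery level": "CHECK_BATTERY",
}
DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7",
    "eight": "8", "nine": "9",
}


def recognise(mode, heard):
    """Match a text hypothesis against the vocabulary of the selected mode.

    Returns (confirmation_text, action), or None if nothing is recognised.
    The confirmation text is what a TTS engine would read back.
    """
    text = heard.strip().lower()
    if mode is Mode.DIGIT_DIAL:
        # Every spoken word must be a digit word, otherwise reject.
        digits = [DIGIT_WORDS.get(word) for word in text.split()]
        if not digits or None in digits:
            return None
        number = "".join(digits)
        return number, ("dial", number)

    # Name and command modes differ only in which vocabulary is searched.
    vocabulary = PHONE_BOOK if mode is Mode.NAME_DIAL else COMMANDS
    by_lower = {key.lower(): key for key in vocabulary}
    match = get_close_matches(text, list(by_lower), n=1, cutoff=0.6)
    if not match:
        return None
    key = by_lower[match[0]]
    if mode is Mode.NAME_DIAL:
        return key, ("dial", PHONE_BOOK[key])
    return key, ("command", COMMANDS[key])
```

In this sketch the same utterance gives different outcomes in different modes: a name is matched only in the name mode and is rejected in the number mode, which mirrors the way each mode uses its own category of data as the basis for recognition.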
The Name Dial mode is accessed by pushing the voice function button and holding it down until there is a beep. At this point the user says the name and then releases the voice function button. The Digit Dial mode is accessed in the same way, but only when already in the Dial Pad mode. Further, the user says a telephone number, rather than a name. The Command Control mode is accessed by clicking the voice function button, then promptly pushing the voice function button again and holding it down until there is a beep. At this point the user says the command keyword and then releases the voice function button. All ASR operations in this embodiment are based on push-to-talk, using and holding down the voice function button. This may be different in other embodiments.

In this embodiment, in each mode, whether it involves dialling or a command, the telephone provides voice confirmation of what speech it recognised, before going ahead (thereby giving the user an opportunity to stop the operation). In an alternative embodiment, the device awaits confirmation to continue before doing so (dialling the relevant number or following the relevant command instruction).

The TTS engine can work in two modes: i) the ASR Confirmation mode and ii) the Talking mode. The user can selectively access only the Talking mode; the ASR Confirmation mode is automatic, in response to ASR mode operations.

i) ASR Confirmation mode (automatically accessed from any of the three ASR modes, above) - This mode is accessed by the telephone in response to an ASR action by the user in the three above-mentioned ASR modes. In particular, TTS is used to speech synthesise: the name matched in the Name Dial mode; a response or a confirmation of an action in response to a command matched in the Command Control mode; or a sequence of recognised numbers input in the Digit Dial mode.

ii) Talking mode (accessible from any existing mode) - This mode involves the TTS providing further voice feedback from command keywords in the Command Control mode, for instance reading out a message in the message box. It is also used to provide feedback (as in reading out a current line of text, a menu heading, a page or more of an address book, several numbers relating to the same person, etc.) in GUI navigation and a voice alert for selected applications, for instance: indicating when there is an incoming call and who the caller is (based on caller recognition), instead of or as well as a ringing tone; reading a web page accessed by a web-browser, where there is one on the electronic device; indicating a time; indicating an appointment; indicating that a geographical position has been reached (where the mobile device includes a GPS or other location system receiver).

The Talking mode can be turned on or off, using a check icon in the telephone configuration pull-up in the status bar in the GUI. Alternatively, it could be turned off and on using a command keyword in the Command Control mode. In the preferred embodiment, the ASR Confirmation mode is always on. However, in alternative embodiments, it may be possible to turn it on or off as desired, for instance through a check icon or command keyword in the Command Control mode.

Whilst the ASR modes are selected using the voice function button in particular ways, in other embodiments, different operations may be used to access the different modes mentioned earlier. In further embodiments, different buttons can be used to access different modes. The means for selecting the voice function need not be one or more buttons. It could instead be screen operated or use some other input mode.

The usefulness of requiring the user to press a button and, in particular, to hold it down, is that it prevents accidental operation and makes accurate ASR and ultimate operation more likely, especially when only certain operations are possible in each ASR mode. Whilst this still requires some kind of contact between the user and telephone (or other device) and therefore requires some notice to be paid, it is just to one button in the preferred embodiment and not to the screen. Moreover, with the button placed at the side of the device, that button is easier to find by touch alone. Thus it becomes optional to look at the screen whilst using the device.

During operation of the mobile telephone, using ASR or TTS, the screen of the telephone still shows the screens it would show during normal operation by keypad or display. For instance, in the Name Dial mode, when a name is spoken, once the telephone determines it has recognised a name, it still displays the telephone book page on the screen 6 with that name on it.

Figure 2 is a schematic view of a displayed page on the screen 6 of the telephone 1 following speech recognition in one mode. Figure 2 shows a displayed page 40 on the screen 6, when, in the Name Dial mode, the user has spoken the name George Ferackis and that name has been recognised. At the same time, the TTS synthesiser causes the name "George Ferackis" to be spoken from the speaker 32.

Figure 3 is a schematic view of a displayed page of the telephone 1, following speech recognition in the Digit Dial mode. In the Digit Dial mode, a dial pad page 42 is displayed on the telephone screen 6. When a number sequence is spoken, as the telephone determines it recognises the numbers, the telephone displays the recognised numbers in a digit string box 44. In the case of Figure 3, the telephone has determined that the user has spoken the telephone number 6785567 and those numbers are displayed. At the same time, the TTS synthesiser causes the number "678 5567" to be spoken from the telephone speaker 32. The display page of Figure 3 is also used for other dialling, and not just in Digit Dial mode.
Thus the page also includes a dial pad 46 (for manually dialling a number), a delete action button 48 (for deleting a number from the digit string box 44), a return key button 50 (for going to a previous page), a dial action button 52 (for dialling out the number in the digit string box 44 to make a call) and a confirm action button 54 (for speaking out the number in the digit string box 44 one time, for confirmation before dialling out).

The operation of the speech recognition and text to speech modes of the mobile telephone of Figure 1 is discussed below with reference to Figure 4, which is a flow diagram relating to the selection and operation of speech recognition modes. This process is controlled by the microprocessor 14.

In step S100, the device or telephone 1 is turned on and then awaits an input. Input detection occurs at step S102. Step S104 then determines if the input is via the voice function button. If the input is not via the voice function button, the device or telephone 1 performs whatever other function is called for in step S106, then reverts to step S102, waiting for the next input.

If the input is via the voice function button, then, at step S108, the process determines if the voice function button is currently being held down (actuated). If the voice function button is currently being held down, then at step S110 the process determines if the device is in the Dial Pad mode. If the device or telephone 1 is in the Dial Pad mode, then, in step S112, any speech is recorded in the RAM 22, from the microphone 30, for as long as the voice function button continues to be held down. Once the voice function button is no longer held down, the recording stops and, in step S114, the device or telephone 1 performs number recognition on the received speech signal, using the ASR engine. Once the number has been recognised, the TTS engine is used to synthesise the recognised number and play it back through the speaker 32, in step S116.
In step S118, the device then performs the further function of automatically dialling the recognised number to make a call via the radio communications unit 2, then reverts to step S102, awaiting further input.

If at step S110, it is determined that the device or telephone 1 is not in the Dial Pad mode, then, in step S120, any speech is still recorded in the RAM 22, from the microphone 30, for as long as the voice function button continues to be held down. Once the voice function button is no longer held down, the recording stops and, in step S122, the device or telephone 1 performs name recognition on the received speech signal, using the ASR engine. Once the name has been recognised, the TTS engine is used to synthesise the recognised name and play it back through the speaker 32, in step S124. This is again followed by step S118, where the telephone 1 performs the further function of automatically dialling the number corresponding to the recognised name to make a call, then reverts to step S102, awaiting further input.

If at step S108, it is determined that the voice function button is not currently being held down, step S126 determines if the voice function button was previously just clicked and is now being held down. If this is determined to be what is happening, any speech is recorded in the RAM 22, from the microphone 30, for as long as the voice function button continues to be held down, in step S128. Once the voice function button is released, in step S130 the telephone 1 performs command keyword recognition on the received speech signal, using the ASR engine. Once the command keyword has been recognised, the TTS engine is used to synthesise the recognised command keyword and to play it back through the speaker 32, in step S132. This is followed, in step S134, by the device automatically performing the commanded further function corresponding to the recognised command keyword. The process then reverts to step S102, awaiting further input.
If it is determined at step S126 that the voice function button was not just clicked once and held, the process determines at step S136 whether the voice function button was clicked twice in quick succession. If the voice function button was so clicked twice in quick succession, the processor moves the current display back one page, at step S138, then reverts to step S102, awaiting further input. If it is determined at step S136 that the voice function button was not clicked twice in quick succession, the processor simply reverts to step S102, awaiting further input, on the assumption that there has been an error.

Whilst not specifically described, the process can also deal with cases where the ASR is unable to recognise a name, number or command keyword, or where the voice function button is held down too long, or other possibilities. Whilst one particular flow of operations has been described, other flows are possible for the same results or other similar results as required within the scope of the invention.

The described embodiments of the invention use ASR. The relevant data used in recognising input speech can be found in the inventory ROM 20 and the static memory 24. The electronic device or telephone 1 can also include a learning program to improve the accuracy of the ASR over time, with the results also stored in the static memory 24. Further, both the ASR and TTS can readily be personalised and set up for each user, in that all the voice functions may be itemised further and set as on/off in the mobile phone system set-up.

The above-described telephone has an ASR function and a TTS function and uses a multi-modal interface. By using the tactile, auditory and visual interfaces the user can operate the mobile telephone and access information conveniently and efficiently. The ASR allows much of the operation of the telephone to be voice controlled.
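The decision structure of Figure 4, as described in the preceding paragraphs, can be sketched as a small dispatcher. This is only an illustration: the `Press` categories and the `dispatch` function are hypothetical names, and the sketch assumes the button pattern has already been classified (the description does not say how holds, clicks and double clicks are timed). It returns the sequence of step labels the description says would be visited for one input, ending just before the return to step S102.

```python
from enum import Enum, auto


class Press(Enum):
    """Hypothetical classification of a voice function button pattern."""
    HOLD = auto()             # pushed and held down
    CLICK_THEN_HOLD = auto()  # clicked, then promptly pushed and held
    DOUBLE_CLICK = auto()     # clicked twice in quick succession
    OTHER = auto()            # anything else, treated as an error


def dispatch(is_voice_button, press, in_dial_pad):
    """Return the Figure 4 step labels visited for a single input."""
    steps = ["S102", "S104"]
    if not is_voice_button:
        # Any other key: perform whatever other function is called for.
        return steps + ["S106"]

    steps.append("S108")
    if press is Press.HOLD:
        steps.append("S110")
        if in_dial_pad:
            # Digit Dial: record, recognise number, read back, dial.
            return steps + ["S112", "S114", "S116", "S118"]
        # Name Dial: record, recognise name, read back, dial.
        return steps + ["S120", "S122", "S124", "S118"]

    steps.append("S126")
    if press is Press.CLICK_THEN_HOLD:
        # Command Control: record, recognise keyword, read back, act.
        return steps + ["S128", "S130", "S132", "S134"]

    steps.append("S136")
    if press is Press.DOUBLE_CLICK:
        steps.append("S138")  # move the display back one page
    return steps
```

For example, `dispatch(True, Press.HOLD, in_dial_pad=True)` follows the number recognition branch, whereas the same press outside the Dial Pad mode follows the name recognition branch, so one button operation selects different modes depending on the prior state of the device.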
Further, TTS allows synthetic speech to provide almost any information to the user relating to the telephone's operation. Moreover, the use of just a single button to select a specific ASR function, or one of a limited choice of ASR functions, allows easier and more accurate use of the ASR.

This multi-modal scheme can be integrated into existing mobile telephone designs and approaches seamlessly. Existing mobile telephone function structures and most operation logic can remain unchanged and preserved to be consistent with the existing style. Thus the keypad typically includes a voice function selection means, user operable to select a first speech recognition mode or a second, different speech recognition mode, the speech recognition mode being based on a sequence of operations of the keypad. Also, as described, the processor 4 is operable to perform speech recognition on a received speech signal, according to the mode selected by the voice function selection means, and to perform a further function based on recognised received speech. The new speech functions can easily be inserted and implemented. This may even be done through new or added software (i.e. computer code means to instruct the processor to perform certain functions on incoming inputs and signals, and to record certain data in certain memories), whether into a new or existing telephone or other device.

The detailed description provides a preferred exemplary embodiment only, and is not intended to limit the scope, applicability, or configuration of the invention. Rather, the detailed description of the preferred exemplary embodiment provides those skilled in the art with an enabling description for implementing a preferred exemplary embodiment of the invention. It should be understood that various changes may be made in the function and arrangement of elements without departing from the spirit and scope of the invention as set forth in the appended claims.

Claims

WE CLAIM:

1. A mobile electronic device comprising: voice function selection means, user operable to select a first speech recognition mode or a second, different speech recognition mode; and a processor operable to perform speech recognition on a received speech signal, according to the mode selected by the voice function selection means, and to perform a further function based on recognised received speech.

2. A device according to claim 1, wherein the voice function selection means is operable to select the speech recognition mode based on a sequence of operations of the voice function selection means by a user.

3. A device according to claim 2, wherein the voice function selection means comprises a single button which is user operable.

4. A device according to claim 3, wherein the processor is operable to perform speech recognition on speech which is received for as long as the button is held down.

5. A device according to claim 1, wherein the processor is operable to recognise the received speech based on different categories of data, according to the mode selected by the voice function selection means.

6. A device according to claim 1, wherein the processor is operable to recognise the received speech as a name, a number or a command, depending on the selected speech recognition mode.

7. A device according to claim 1, wherein the voice function selection means is user operable to select a third speech recognition mode.

8. A device according to claim 7, wherein a same operation of the voice function selection means selects the first speech recognition mode or the third speech recognition mode, depending on the setting of the device prior to that operation of the voice function selection means.

9. A device according to claim 1, wherein the first speech recognition mode comprises a name recognition mode.

10. A device according to claim 7, wherein the third speech recognition mode comprises a number recognition mode, seeking to recognise the received speech as a number sequence.

11. A device according to claim 1, wherein the further function comprises providing a signal as a telephone number based on recognised received speech in the first mode.

12. A device according to claim 1, wherein the second speech recognition mode comprises a command recognition mode, seeking to recognise the received speech as a command.

13. A device according to claim 12, wherein the further function comprises performing a function as commanded by a recognised command.

14. A device according to claim 1, wherein the processor is further operable to provide feedback as to the recognised received speech prior to performing the further function.

15. A device according to claim 14, wherein the processor is further operable to perform text to speech synthesis to provide sound signals corresponding to the recognised received speech as the feedback.

16. A device according to claim 14, wherein the processor is further operable to wait for further input thereto, after providing the feedback, before performing the further function.

17. A device according to claim 1, embodying a telephone and further comprising a microphone, a speaker, a transmitter and a receiver.

18. A method of controlling a mobile electronic device to perform a desired function, comprising: receiving a selection of one speech recognition mode from a plurality of speech recognition modes; receiving a speech signal; performing speech recognition on the received speech signal according to the selected speech recognition mode; and performing a further function based on recognised received speech.