WO2019203016A1 - Information processing device, information processing method, and program - Google Patents
Information processing device, information processing method, and program
- Publication number
- WO2019203016A1 (PCT/JP2019/015070)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- presentation
- information
- phrase
- user
- information processing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/16—Sound input; Sound output
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
Definitions
- the present technology relates to an information processing device, an information processing method, and a program, and more particularly, to an information processing device, an information processing method, and a program that improve the accuracy of speech recognition.
- Patent Document 1 can only be applied to the case where each alphabet is input by voice one by one.
- the user needs to know and follow the alphabet input method in advance, which increases the burden on the user.
- the present technology has been made in view of such a situation, and is intended to improve voice recognition accuracy while suppressing the burden on the user.
- An information processing apparatus according to one aspect of the present technology includes a presentation information generation unit that generates presentation information, which is information to be presented to a user, and a presentation control unit that controls the presentation of the presentation information to the user and guides the use of phrases with high voice recognition accuracy.
- An information processing method generates presentation information that is information to be presented to a user, controls presentation of the presentation information to the user, and induces use of a phrase with high voice recognition accuracy.
- a program according to one aspect of the present technology causes a computer to execute processing that generates presentation information, which is information to be presented to a user, controls the presentation of the presentation information to the user, and induces the use of phrases with high voice recognition accuracy.
- presentation information that is information to be presented to the user is generated, the presentation of the presentation information to the user is controlled, and the use of a phrase with high voice recognition accuracy is induced.
- the voice recognition accuracy is improved while suppressing the burden on the user.
- FIG. 1 is a block diagram illustrating a configuration example of an interactive system 11 to which the present technology is applied.
- the dialogue system 11 is a system that performs dialogue with the user using at least one of visual information (for example, images) and auditory information (for example, voice).
- the interactive system 11 includes various systems and devices (for example, interactive robots, smart speakers, and the like) having a user interface using voice recognition.
- the dialogue system 11 includes an input unit 21, an information processing unit 22, and an output unit 23.
- the input unit 21 is used for inputting input data to the information processing unit 22.
- the input unit 21 includes a microphone 31 and a camera 32.
- the microphone 31 collects surrounding sounds including the user's voice, and supplies the voice data indicating the collected voice to the voice recognition unit 41, the noise recognition unit 42, and the user recognition unit 43 of the information processing unit 22.
- the camera 32 shoots the user and supplies image data indicating the obtained image to the voice recognition unit 41 and the user recognition unit 43 of the information processing unit 22.
- the information processing unit 22 executes various processes for performing a dialog with the user.
- the information processing unit 22 includes a voice recognition unit 41, a noise recognition unit 42, a user recognition unit 43, an interpretation unit 44, a learning unit 45, a user information storage unit 46, a phrase information storage unit 47, and a presentation processing unit 48.
- the voice recognition unit 41 performs a process for recognizing the voice emitted by the user based on the voice data from the microphone 31 and the image data from the camera 32.
- the voice recognition unit 41 supplies data indicating the result of voice recognition to the interpretation unit 44 and the learning unit 45.
- the noise recognition unit 42 performs processing for recognizing noise around the dialog system 11 (or the user) based on the voice data from the microphone 31.
- the noise recognition unit 42 supplies data indicating the result of noise recognition to the presentation processing unit 48.
- the user recognition unit 43 performs a user recognition process based on the audio data from the microphone 31 and the image data from the camera 32.
- the user recognition unit 43 supplies data indicating the user recognition result to the learning unit 45.
- the interpretation unit 44 interprets the user's utterance content based on the speech recognition result and the phrase information stored in the phrase information storage unit 47.
- the interpretation unit 44 supplies data indicating the user's utterance content to the learning unit 45 and the presentation processing unit 48.
- the learning unit 45 learns the pronunciation characteristics of the user recognized by the user recognition unit 43 based on the speech recognition result and the utterance content, and updates the pronunciation characteristic information indicating the user's pronunciation characteristics.
- the user information storage unit 46 stores user information (for example, pronunciation characteristic information, profile, etc.) related to each user who uses the dialogue system 11.
- the phrase information storage unit 47 stores phrase information (for example, a dictionary, speech recognition accuracy of each phrase, etc.) related to the phrase used for the presentation information presented to the user.
- the presentation processing unit 48 performs a process of presenting presentation information to the user.
- the presentation processing unit 48 includes a presentation information generation unit 61 and a presentation control unit 62.
- the presentation information generation unit 61 generates presentation information based on the user's utterance content, the user information stored in the user information storage unit 46, and the phrase information stored in the phrase information storage unit 47.
- the presentation information generation unit 61 includes a presentation content setting unit 71 and a phrase selection unit 72.
- the presentation content setting unit 71 sets the content of the information to be presented to the user based on the user's utterance content, the user information, and the phrase selection result from the phrase selection unit 72, and generates presentation information indicating the set content.
- the presentation content setting unit 71 supplies the presentation information to the presentation control unit 62.
- the phrase selection unit 72 selects a phrase to be used for the presentation information based on the user information and the phrase information.
- the presentation control unit 62 controls the presentation of presentation information to the user, and induces the use of words with high speech recognition accuracy and the suppression of the use of words with low speech recognition accuracy, as will be described later.
- the presentation control unit 62 controls the display of the image indicating the presentation information by generating image data for visually presenting the presentation information and supplying the image data to the display 81 of the output unit 23.
- the presentation control unit 62 generates voice data for aural presentation of the presentation information and supplies it to the speaker 82 of the output unit 23 to control the output of voice indicating the presentation information.
- the output unit 23 is used for outputting presentation information.
- the output unit 23 includes a display 81 and a speaker 82.
- the display 81 displays an image based on the image data from the presentation control unit 62.
- the speaker 82 outputs audio based on the audio data from the presentation control unit 62.
- This process is started when, for example, the power of the dialog system 11 is turned on, and is ended when the power of the dialog system 11 is turned off.
- in step S1, the dialogue system 11 starts acquiring input data.
- the microphone 31 starts collecting surrounding sounds, including sounds uttered by the user, and starts processing to supply voice data indicating the collected sounds to the voice recognition unit 41, the noise recognition unit 42, and the user recognition unit 43.
- the camera 32 starts photographing the user and starts processing to supply image data indicating the obtained images to the voice recognition unit 41 and the user recognition unit 43.
- in step S2, the user recognition unit 43 starts user recognition. Specifically, the user recognition unit 43 performs user recognition processing based on the image data and the voice data, and starts processing to supply data indicating the recognition result to the learning unit 45.
- for the user recognition processing, any method can be used according to the required accuracy, processing speed, and the like.
- in step S3, the voice recognition unit 41 starts voice recognition. Specifically, the voice recognition unit 41 performs processing to recognize the voice uttered by the user based on the voice data, and starts processing to supply data indicating the recognition result to the interpretation unit 44 and the learning unit 45.
- the voice recognition unit 41 may perform voice recognition using not only voice data but also image data (for example, a user's mouth shape) as necessary.
- in step S4, the noise recognition unit 42 starts noise recognition. Specifically, the noise recognition unit 42 starts processing to recognize noise around the dialogue system 11 (or the user) based on the voice data and to supply data indicating the recognition result to the presentation processing unit 48. For example, sounds other than the voice of the user targeted for voice recognition (for example, other people's voices, environmental sounds, etc.) are recognized as noise.
- for the noise recognition, any method can be used depending on, for example, the required accuracy and processing speed.
- in step S5, the interpretation unit 44 starts interpreting the user's utterance content. Specifically, the interpretation unit 44 starts processing to interpret the content (meaning) of what the user said based on the recognition result of the user's voice and the phrase information stored in the phrase information storage unit 47. In addition, the interpretation unit 44 starts processing to supply data indicating the interpreted utterance content to the learning unit 45 and the presentation processing unit 48.
- for interpreting the utterance content, any method can be used depending on, for example, the required accuracy and processing speed.
- in step S6, the presentation information generation unit 61 generates presentation information for the user's utterance content. For example, when the presentation content setting unit 71 determines that a response to the user's utterance is required, it sets the content of the information to be presented to the user (for example, the response content) and generates presentation information indicating the set content. At this time, the phrase selection unit 72 selects, as necessary, the phrases to be used in the presentation information based on the user's pronunciation characteristics stored in the user information storage unit 46, the speech recognition accuracy of each phrase stored in the phrase information storage unit 47, and the result of ambient-noise recognition by the noise recognition unit 42. The presentation content setting unit 71 uses the phrases selected by the phrase selection unit 72 in the presentation information and supplies the generated presentation information to the presentation control unit 62.
- in step S7, the dialogue system 11 presents the presentation information.
- the presentation control unit 62 generates at least one of image data and audio data for presenting presentation information to the user.
- the presentation control unit 62 supplies the image data to the display 81 when the image data is generated.
- the display 81 displays an image indicating presentation information based on the image data. Thereby, the presentation information is visually presented to the user.
- when the presentation control unit 62 generates voice data, it supplies the voice data to the speaker 82.
- the speaker 82 outputs a sound indicating the presentation information based on the sound data. Thereby, presentation information is auditorily presented to the user.
- in step S8, the learning unit 45 learns the user's pronunciation characteristics.
- the learning unit 45 determines whether the speech recognition result was correct based on the user's utterance content. For example, when the user's utterance is a negative reaction such as "that's wrong" or "not that one", the learning unit 45 determines that there is an error in the speech recognition result of the phrase the user uttered before the presentation information that triggered the negative reaction was presented. On the other hand, for example, when the user responds normally to the presentation information, it determines that the speech recognition result of the phrase the user uttered before that presentation information was presented is correct.
- the learning unit 45 learns the user's pronunciation characteristics based on whether the speech recognition results were correct. For example, the learning unit 45 learns, for each user, the phrases and pronunciations that the voice recognition unit 41 handles well (high recognition accuracy) and those that it handles poorly (low recognition accuracy).
- for example, when speech recognition of the same phrase repeatedly fails for the user being learned, the learning unit 45 determines that the voice recognition unit 41 is poor at recognizing that phrase as uttered by that user. This can occur, for example, when the user's pronunciation of the phrase is unusual or when the voice recognition unit 41 and the user's pronunciation of the phrase are a poor match. Conversely, when speech recognition of the same phrase repeatedly succeeds for that user, the learning unit 45 determines that the voice recognition unit 41 recognizes the phrase well for that user.
- furthermore, based on the tendency of phrases that the voice recognition unit 41 handles poorly, the learning unit 45 learns which of the user's pronunciations the voice recognition unit 41 is poor at (for example, pronunciation of the "sa" row of kana).
- the learning unit 45 updates the user's pronunciation characteristic information stored in the user information storage unit 46 based on the learning result.
- the timing for learning the user's pronunciation characteristics is not limited to this example. For example, learning may be performed after the dialogue with the user has ended.
- thereafter, the process returns to step S6, and the processes from step S6 onward are executed.
- for example, the presentation information generation unit 61 takes advantage of this human tendency and generates presentation information that guides the user toward using phrases for which the recognition accuracy (voice recognition accuracy) of the voice recognition unit 41 is high (hereinafter referred to as high recognition phrases). Also, for example, the presentation information generation unit 61 generates presentation information that guides the user toward suppressing the use of phrases for which the recognition accuracy of the voice recognition unit 41 is low (hereinafter referred to as low recognition phrases).
- high recognition phrases and low recognition phrases may be determined individually based on the recognition accuracy of the voice recognition unit 41, or relatively, based on how the recognition accuracy of the voice recognition unit 41 compares between phrases. In the former case, for example, a phrase whose recognition accuracy is at or above a predetermined first threshold is classified as a high recognition phrase, and a phrase whose recognition accuracy is at or below a second threshold smaller than the first threshold is classified as a low recognition phrase. In the latter case, for example, of two phrases, the one with the higher recognition accuracy of the voice recognition unit 41 is classified as a high recognition phrase and the one with the lower recognition accuracy as a low recognition phrase.
- for example, the phrase selection unit 72 raises the priority for use in the presentation information of phrases with higher recognition accuracy of the voice recognition unit 41, and lowers the priority of phrases with lower recognition accuracy.
- as a result, in the presentation information, high-priority phrases are used more frequently and low-priority phrases less frequently.
- for example, when multiple usable phrases exist (for example, phrases with the same meaning), the phrase selection unit 72 selects the one with the higher priority. Also, for example, the phrase selection unit 72 replaces a low recognition phrase with a high recognition phrase having the same meaning, appends a high recognition phrase to a low recognition phrase, or selects a high recognition phrase as a phrase to be presented together with a low recognition phrase.
- the presentation content setting unit 71 sets the presentation content using the phrase selected by the phrase selection unit 72, and generates presentation information indicating the set presentation content.
- the recognition accuracy of the voice uttered by the user is improved.
- the priority of each word may be fixed regardless of the user and surrounding conditions, or may be dynamically changed according to the user and surrounding conditions.
- the phrase selection unit 72 sets the priority of each phrase based on the speech recognition accuracy of each phrase stored in the phrase information storage unit 47 without considering the user and surrounding circumstances. Thereby, the priority of each word is fixed. For example, when the speech recognition accuracy of each word is appropriately updated based on past recognition results, the priority of each word changes when viewed in a long span.
- the phrase selection unit 72 dynamically changes the priority of each phrase in consideration of the user and surrounding conditions in addition to the voice recognition accuracy of each phrase.
- for example, the phrase selection unit 72 adjusts the speech recognition accuracy of each phrase based on the user's pronunciation characteristics stored in the user information storage unit 46, and sets the priority of each phrase based on the adjusted speech recognition accuracy.
- voice recognition accuracy may change even for the same word or phrase due to ambient noise.
- a short phrase is more easily buried in noise than a long phrase, so its voice recognition accuracy is more susceptible to noise.
- the phrase selection unit 72 adjusts the speech recognition accuracy of each phrase based on the recognition result of ambient noise, and sets the priority of each phrase based on the adjusted speech recognition accuracy.
- FIGS. 3 to 6 show an example in which presentation information including options such as “Inu”, “Saru”, “Kiji” is displayed on the display 81 and the user selects it.
- for example, in the example of FIG. 3, a number is displayed in front of each option as the phrase used for selecting that option (hereinafter referred to as a selected term phrase).
- as a result, the user can select an option by uttering the selected term phrase ("1", "2", "3") in addition to the phrase representing the option ("Inu", "Saru", "Kiji").
- the recognition rate (hereinafter referred to as the selection recognition rate) of options selected by the user by voice is improved.
- in general, a short phrase (for example, a phrase of one or two syllables) tends to have lower voice recognition accuracy than a longer phrase.
- therefore, for example, long phrases such as "Alpha", "Bravo", and "Charlie" may be used as the selected term phrases instead of short numbers or letters, as in the example of FIG. 5. This improves the selection recognition rate.
- speech recognition for short words is more susceptible to ambient noise than speech recognition for long words. Therefore, for example, the selected term phrase may be lengthened as the ambient noise level increases.
- when long phrases are used as selected term phrases, the space required to display them increases. As a result, for example, the space for displaying the options may shrink, or the options may become harder to see.
- depending on the characters used in a selected term phrase (for example, when the options contain the same characters as the selected term phrase), it may be difficult to distinguish the selected term phrase from the options.
- depending on the types of characters used in the selected term phrases, it may also be difficult to distinguish the selected term phrases from one another. Therefore, for example, instead of a selected term phrase, a symbol representing it (hereinafter referred to as a selection symbol) may be displayed together with the option. For example, an icon, a sign, an image, or the like is used as the selection symbol.
- in the example of FIG. 6, icons representing a star, the moon, and the sun are used as the selection symbols. Accordingly, the user can select an option by uttering "Hoshi", "Tsuki", or "Taiyo", the words (selected term phrases) represented by the selection symbols, in addition to the words representing the options.
- phrase selection unit 72 selects, for example, a phrase having higher speech recognition accuracy than a phrase representing an option as a selected term phrase. Thereby, the selection recognition rate is improved as compared with the case where only the options are presented.
- the presentation control unit 62 may control the presentation of the presentation information by a method that induces the use of the high recognition phrase and induces the suppression of the use of the low recognition phrase.
- specifically, for example, the presentation control unit 62 may control the presentation so that each option is presented in a different aspect, each aspect being represented by a high recognition phrase, and the user may select an option by uttering the phrase representing that aspect.
- each option may be displayed using different display effects (for example, color, size, font, etc.), and the options may be selected by a word representing the display effect.
- for example, FIG. 7 shows an example in which each option and its selected term phrase are displayed in different colors (the color differences are not visible here because the figure is monochrome). Specifically, in FIG. 7, "A. Inu" is displayed in red, "B. Saru" in blue, and "C. Kiji" in green. As a result, the user can select an option by uttering the word representing the color in which it is displayed (for example, "red", "blue", "green") in addition to the phrase representing the option and the selected term phrase.
- each option may be displayed at a different position so that the option can be selected by a word representing the display position.
- for example, in the example of FIG. 8, the options and their selected term phrases are displayed at the top, bottom, left, and right, respectively.
- as a result, the user can select an option by uttering a word representing its display position (for example, "up", "down", "left", "right") in addition to the phrase representing the option and the selected term phrase.
- note that the presentation control unit 62 uses, as the phrase representing the aspect in which each option is presented, a phrase with higher voice recognition accuracy than the phrase representing the option; in other words, it controls the presentation so that each option is presented in an aspect represented by a phrase whose voice recognition accuracy is higher than that of the phrase representing the option. This improves the selection recognition rate compared with presenting all the options in the same aspect.
- the selection recognition rate can be further improved by using, for example, the user's gesture or the direction of the line of sight.
- each option may be presented in a different manner without adding a selected term phrase to each option.
- a phrase with higher voice recognition accuracy may be used as a phrase representing an option.
- the choices are “Keiyo Line”, “Keikyu Line”, and “Keio Line”, since the sounds between the choices are similar, the selection recognition rate may decrease.
- the selection recognition rate can be improved by presenting the full names of each route such as “JR Keiyo Line”, “Keihin Express Line”, and “Keio Line” as options.
- for example, when the options are "1 piece", "2 pieces", and "3 pieces", each option is short and the user is likely to pronounce only the number; in such a case, the selection recognition rate can be improved by presenting "hitotsu", "futatsu", and "mittsu" as the options instead.
- the case where the option is displayed on the display 81 is taken as an example.
- even when the options are presented by voice, the selection recognition rate can similarly be improved by using selected term phrases.
- the selection recognition rate can be improved by outputting the selected term phrase and the choices by voice and allowing the user to input the voice of the selected term phrase.
- for standard expressions (for example, "next", "back") used in user interface navigation and the like, phrases with higher voice recognition accuracy may be displayed on the display 81 as the standard expressions. As a result, the user comes to use expressions with higher speech recognition accuracy for these standard expressions.
- the user may be guided to utter some words and phrases serving as pilot information, and the user's speech characteristics may be learned. Thereby, it becomes possible to learn a user's speech characteristic efficiently and in a short period of time.
- for example, once a user's pronunciation habits have been recognized to some extent, the pronunciation characteristic information of other users with similar habits may be used until learning of that user's pronunciation characteristics has progressed sufficiently.
- only one of preferential use of the high recognition phrase in the presentation information and suppression of use of the low recognition phrase in the presentation information may be performed.
- the user is mainly guided to use the high recognition phrase.
- the user is mainly guided to suppress the use of low-recognition words.
- the pronunciation characteristic information of each user may be shared between different systems (for example, between a plurality of agents).
- the present technology is not limited to a scene where a dialogue with a user is performed, but can be applied to various scenes where a user's voice recognition is performed.
- FIG. 9 is a block diagram showing an example of the hardware configuration of a computer that executes the above-described series of processing by a program.
- in the computer 500, a CPU (Central Processing Unit) 501, a ROM (Read Only Memory) 502, and a RAM (Random Access Memory) 503 are connected to one another via a bus 504.
- An input / output interface 505 is further connected to the bus 504.
- An input unit 506, an output unit 507, a recording unit 508, a communication unit 509, and a drive 510 are connected to the input / output interface 505.
- the input unit 506 includes an input switch, a button, a microphone, an image sensor, and the like.
- the output unit 507 includes a display, a speaker, and the like.
- the recording unit 508 includes a hard disk, a nonvolatile memory, and the like.
- the communication unit 509 includes a network interface or the like.
- the drive 510 drives a removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory.
- the CPU 501 loads the program recorded in the recording unit 508 into the RAM 503 via the input / output interface 505 and the bus 504 and executes it, whereby the series of processes described above is performed.
- the program executed by the computer 500 can be provided by being recorded on a removable medium 511 as a package medium, for example.
- the program can be provided via a wired or wireless transmission medium such as a local area network, the Internet, or digital satellite broadcasting.
- the program can be installed in the recording unit 508 via the input / output interface 505 by attaching the removable medium 511 to the drive 510. Further, the program can be received by the communication unit 509 via a wired or wireless transmission medium and installed in the recording unit 508. In addition, the program can be installed in the ROM 502 or the recording unit 508 in advance.
- the program executed by the computer may be a program whose processing is performed in time series in the order described in this specification, or a program whose processing is performed in parallel or at necessary timings, such as when a call is made.
- in this specification, a system means a set of a plurality of components (devices, modules (parts), etc.), regardless of whether all the components are in the same housing. Therefore, a plurality of devices housed in separate housings and connected via a network, and a single device with a plurality of modules housed in one housing, are both systems.
- the present technology can take a cloud computing configuration in which one function is shared by a plurality of devices via a network and is jointly processed.
- each step described in the above flowchart can be executed by one device or can be shared by a plurality of devices.
- the plurality of processes included in the one step can be executed by being shared by a plurality of apparatuses in addition to being executed by one apparatus.
- An information processing apparatus including: a presentation information generation unit that generates presentation information, which is information to be presented to the user; and a presentation control unit that controls presentation of the presentation information to the user and guides the use of phrases with high voice recognition accuracy.
- the presentation information generation unit generates the presentation information that induces use of a word with high voice recognition accuracy.
- the presentation information generation unit selects a phrase to be used for the presentation information based on voice recognition accuracy of each phrase.
- the presentation information generation unit preferentially uses words with high voice recognition accuracy for the presentation information.
- (5) The information processing apparatus according to (3) or (4), wherein the presentation information generation unit, when presenting options in the presentation information, selects a selected term phrase, which is a phrase used for selecting an option, based on the speech recognition accuracy of each phrase.
- (6) The information processing apparatus according to (5), wherein the presentation information generation unit selects, as the selected term phrase, a phrase having higher speech recognition accuracy than a phrase representing the option.
- (7) The information processing apparatus according to (5) or (6), wherein the presentation information generation unit generates the presentation information that presents a symbol representing the selected term phrase together with the options.
- (8) The information processing apparatus according to any one of (3) to (7), wherein the presentation information generation unit further selects a word / phrase to be used for the presentation information based on the pronunciation characteristics of the user.
- the information processing apparatus according to (8), further including a learning unit that learns the pronunciation characteristics of the user.
- the presentation information generation unit further selects a word / phrase to be used for the presentation information based on ambient noise.
- the information processing apparatus according to (10), further including a noise recognition unit that recognizes the ambient noise.
- the presentation control unit controls presentation of the presentation information by a method for guiding the use of a word with high voice recognition accuracy.
- (13) The information processing apparatus according to (12), wherein the presentation control unit controls the presentation of the presentation information so that the options are presented in mutually different aspects, each aspect being represented by a phrase with high voice recognition accuracy.
- (14) The information processing apparatus according to (13), wherein the aspect is a display effect or a display position of the option.
- (15) the presentation control unit uses aspects in which the phrase representing the aspect has higher voice recognition accuracy than the phrase representing the option.
- (16) The information processing apparatus according to any one of (1) to (15), wherein the presentation control unit controls presentation of the presentation information to the user and induces suppression of use of a phrase with low speech recognition accuracy.
- the information processing apparatus according to any one of (1) to (16), further including a voice recognition unit that performs voice recognition of the user.
- the presentation information includes a response to the utterance content of the user.
- 11 dialogue system 21 input unit, 22 information processing unit, 23 output unit, 31 microphone, 32 camera, 41 voice recognition unit, 42 noise recognition unit, 43 user recognition unit, 44 interpretation unit, 45 learning unit, 46 user information storage Section, 47 phrase information storage section, 48 presentation processing section, 61 presentation information generation section, 62 presentation control section, 71 presentation content setting section, 72 phrase selection section, 81 display, 82 speaker
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Computational Linguistics (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
Description
The present technology relates to an information processing device, an information processing method, and a program, and more particularly, to an information processing device, an information processing method, and a program that improve the accuracy of speech recognition.
Conventionally, it has been proposed to improve the speech recognition accuracy of individual letters of the alphabet by adding a preset word (for example, "dot") before each letter when it is input by voice (see, for example, Patent Document 1).
However, the invention described in Patent Document 1 can only be applied to the case where letters of the alphabet are input by voice one at a time. In addition, the user needs to know the alphabet input method in advance and follow it, which increases the burden on the user.
The present technology has been made in view of such a situation, and is intended to improve voice recognition accuracy while suppressing the burden on the user.
An information processing apparatus according to one aspect of the present technology includes a presentation information generation unit that generates presentation information, which is information to be presented to a user, and a presentation control unit that controls the presentation of the presentation information to the user and guides the use of phrases with high voice recognition accuracy.
An information processing method according to one aspect of the present technology generates presentation information, which is information to be presented to a user, controls the presentation of the presentation information to the user, and induces the use of phrases with high voice recognition accuracy.
A program according to one aspect of the present technology causes a computer to execute processing that generates presentation information, which is information to be presented to a user, controls the presentation of the presentation information to the user, and induces the use of phrases with high voice recognition accuracy.
In one aspect of the present technology, presentation information, which is information to be presented to the user, is generated, the presentation of the presentation information to the user is controlled, and the use of phrases with high voice recognition accuracy is induced.
According to one aspect of the present technology, voice recognition accuracy is improved while the burden on the user is suppressed.
Note that the effects described here are not necessarily limiting, and the effects may be any of the effects described in the present disclosure.
Hereinafter, embodiments for carrying out the present technology will be described. The description will be given in the following order.
1. Embodiment
2. Modifications
3. Others
<<1. Embodiment>>
<Configuration example of the dialogue system 11>
FIG. 1 is a block diagram illustrating a configuration example of a dialogue system 11 to which the present technology is applied.
The dialogue system 11 is a system that conducts a dialogue with the user using at least one of visual information (for example, images) and auditory information (for example, voice). For example, the dialogue system 11 consists of various systems and devices (for example, interactive robots, smart speakers, and the like) that have a user interface using voice recognition. The dialogue system 11 includes an input unit 21, an information processing unit 22, and an output unit 23.
The input unit 21 is used for inputting input data to the information processing unit 22. The input unit 21 includes a microphone 31 and a camera 32.
The microphone 31 collects surrounding sounds including the user's voice, and supplies voice data indicating the collected sounds to the voice recognition unit 41, the noise recognition unit 42, and the user recognition unit 43 of the information processing unit 22.
The camera 32 photographs the user and supplies image data indicating the obtained images to the voice recognition unit 41 and the user recognition unit 43 of the information processing unit 22.
The information processing unit 22 executes various processes for conducting a dialogue with the user. The information processing unit 22 includes a voice recognition unit 41, a noise recognition unit 42, a user recognition unit 43, an interpretation unit 44, a learning unit 45, a user information storage unit 46, a phrase information storage unit 47, and a presentation processing unit 48.
The voice recognition unit 41 performs processing to recognize the voice uttered by the user based on the voice data from the microphone 31 and the image data from the camera 32. The voice recognition unit 41 supplies data indicating the result of the voice recognition to the interpretation unit 44 and the learning unit 45.
The noise recognition unit 42 performs processing to recognize noise around the dialogue system 11 (or the user) based on the voice data from the microphone 31. The noise recognition unit 42 supplies data indicating the result of the noise recognition to the presentation processing unit 48.
The user recognition unit 43 performs user recognition processing based on the voice data from the microphone 31 and the image data from the camera 32. The user recognition unit 43 supplies data indicating the user recognition result to the learning unit 45.
The interpretation unit 44 interprets the user's utterance content based on the result of the voice recognition and the phrase information stored in the phrase information storage unit 47. The interpretation unit 44 supplies data indicating the user's utterance content to the learning unit 45 and the presentation processing unit 48.
The learning unit 45 learns the pronunciation characteristics of the user recognized by the user recognition unit 43 based on the result of the voice recognition and the utterance content, and updates the pronunciation characteristic information indicating the user's pronunciation characteristics.
The user information storage unit 46 stores user information (for example, pronunciation characteristic information, profiles, etc.) about each user who uses the dialogue system 11.
The phrase information storage unit 47 stores phrase information (for example, a dictionary, the speech recognition accuracy of each phrase, etc.) about the phrases used in the presentation information presented to the user.
The presentation processing unit 48 performs processing to present the presentation information to the user. The presentation processing unit 48 includes a presentation information generation unit 61 and a presentation control unit 62.
The presentation information generation unit 61 generates the presentation information based on the user's utterance content, the user information stored in the user information storage unit 46, and the phrase information stored in the phrase information storage unit 47. The presentation information generation unit 61 includes a presentation content setting unit 71 and a phrase selection unit 72.
The presentation content setting unit 71 sets the content of the information to be presented to the user based on the user's utterance content, the user information, and the phrase selection result from the phrase selection unit 72, and generates presentation information indicating the set content. The presentation content setting unit 71 supplies the presentation information to the presentation control unit 62.
The phrase selection unit 72 selects the phrases to be used in the presentation information based on the user information and the phrase information.
The presentation control unit 62 controls the presentation of the presentation information to the user and, as described later, induces the use of phrases with high speech recognition accuracy and the suppression of the use of phrases with low speech recognition accuracy. For example, the presentation control unit 62 generates image data for visually presenting the presentation information and supplies it to the display 81 of the output unit 23, thereby controlling the display of images indicating the presentation information. The presentation control unit 62 also generates voice data for aurally presenting the presentation information and supplies it to the speaker 82 of the output unit 23, thereby controlling the output of voice indicating the presentation information.
The output unit 23 is used for outputting the presentation information. The output unit 23 includes a display 81 and a speaker 82.
The display 81 displays images based on the image data from the presentation control unit 62.
The speaker 82 outputs voice based on the voice data from the presentation control unit 62.
<Dialogue processing>
Next, the dialogue processing executed by the dialogue system 11 will be described with reference to the flowchart in FIG. 2.
This processing is started, for example, when the dialogue system 11 is powered on, and ends when the dialogue system 11 is powered off.
In step S1, the dialogue system 11 starts acquiring input data.
Specifically, the microphone 31 starts processing to collect surrounding sounds, including sounds uttered by the user, and to supply voice data indicating the collected sounds to the voice recognition unit 41, the noise recognition unit 42, and the user recognition unit 43.
The camera 32 starts photographing the user and starts processing to supply image data indicating the obtained images to the voice recognition unit 41 and the user recognition unit 43.
In step S2, the user recognition unit 43 starts user recognition. Specifically, the user recognition unit 43 performs user recognition processing based on the image data and the voice data, and starts processing to supply data indicating the recognition result to the learning unit 45.
For the user recognition processing, any method can be used according to, for example, the required accuracy and processing speed.
In step S3, the voice recognition unit 41 starts voice recognition. Specifically, the voice recognition unit 41 performs processing to recognize the voice uttered by the user based on the voice data, and starts processing to supply data indicating the recognition result to the interpretation unit 44 and the learning unit 45.
For the voice recognition, any method can be used according to, for example, the required accuracy and processing speed. The voice recognition unit 41 may also perform voice recognition using not only the voice data but also the image data (for example, the shape of the user's mouth) as necessary.
In step S4, the noise recognition unit 42 starts noise recognition. Specifically, the noise recognition unit 42 starts processing to recognize noise around the dialogue system 11 (or the user) based on the voice data and to supply data indicating the recognition result to the presentation processing unit 48. For example, sounds other than the voice of the user targeted for voice recognition (for example, other people's voices, environmental sounds, etc.) are recognized as noise.
For the noise recognition, any method can be used according to, for example, the required accuracy and processing speed.
In step S5, the interpretation unit 44 starts interpreting the user's utterance content. Specifically, the interpretation unit 44 starts processing to interpret the content (meaning) of what the user said based on the recognition result of the user's voice and the phrase information stored in the phrase information storage unit 47. The interpretation unit 44 also starts processing to supply data indicating the interpreted utterance content to the learning unit 45 and the presentation processing unit 48.
For interpreting the utterance content, any method can be used according to, for example, the required accuracy and processing speed.
In step S6, the presentation information generation unit 61 generates presentation information for the user's utterance content. For example, when the presentation content setting unit 71 determines that a response to the user's utterance is required, it sets the content of the information to be presented to the user (for example, the response content) and generates presentation information indicating the set content. At this time, the phrase selection unit 72 selects, as necessary, the phrases to be used in the presentation information based on the user's pronunciation characteristics stored in the user information storage unit 46, the speech recognition accuracy of each phrase stored in the phrase information storage unit 47, and the result of ambient-noise recognition by the noise recognition unit 42. The presentation content setting unit 71 uses the phrases selected by the phrase selection unit 72 in the presentation information and supplies the generated presentation information to the presentation control unit 62.
An example of a method for generating the presentation information will be described later.
In step S7, the dialogue system 11 presents the presentation information. Specifically, the presentation control unit 62 generates at least one of image data and voice data for presenting the presentation information to the user.
When the presentation control unit 62 generates image data, it supplies the image data to the display 81. The display 81 displays an image indicating the presentation information based on the image data. The presentation information is thereby visually presented to the user.
When the presentation control unit 62 generates voice data, it supplies the voice data to the speaker 82. The speaker 82 outputs voice indicating the presentation information based on the voice data. The presentation information is thereby aurally presented to the user.
An example of a method for presenting the presentation information will be described later.
In step S8, the learning unit 45 learns the user's pronunciation characteristics.
For example, the learning unit 45 determines whether the speech recognition result was correct based on the user's utterance content. For example, when the user's utterance is a negative reaction such as "that's wrong" or "not that one", the learning unit 45 determines that there is an error in the speech recognition result of the phrase the user uttered before the presentation information that triggered the negative reaction was presented. On the other hand, for example, when the user responds normally to the presentation information, it determines that the speech recognition result of the phrase the user uttered before that presentation information was presented is correct.
The learning unit 45 then learns the user's pronunciation characteristics based on whether the speech recognition results were correct. For example, the learning unit 45 learns, for each user, the phrases and pronunciations that the voice recognition unit 41 handles well (high recognition accuracy) and those that it handles poorly (low recognition accuracy).
For example, when speech recognition of the same phrase repeatedly fails for the user being learned, the learning unit 45 determines that the voice recognition unit 41 is poor at recognizing that phrase as uttered by that user. This can occur, for example, when the user's pronunciation of the phrase is unusual or when the voice recognition unit 41 and the user's pronunciation of the phrase are a poor match. Conversely, when speech recognition of the same phrase repeatedly succeeds for that user, the learning unit 45 determines that the voice recognition unit 41 recognizes the phrase well for that user.
Furthermore, based on the tendency of phrases that the voice recognition unit 41 handles poorly, the learning unit 45 learns which of the user's pronunciations the voice recognition unit 41 is poor at (for example, pronunciation of the "sa" row of kana).
The learning unit 45 updates the user's pronunciation characteristic information stored in the user information storage unit 46 based on the learning result.
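The per-user, per-phrase bookkeeping described above can be sketched as follows. This Python fragment is not part of the patent; the class name PronunciationProfile, the counters, and the thresholds are assumptions used only to illustrate how repeated successes and failures for the same phrase might be turned into judgments about which phrases the recognizer handles well or poorly for a given user.

```python
from collections import defaultdict

# Assumed thresholds; the patent only speaks of "repeated" failures or successes.
FAIL_THRESHOLD = 3
SUCCESS_THRESHOLD = 3

class PronunciationProfile:
    """Hypothetical per-user pronunciation characteristic information."""

    def __init__(self):
        self.failures = defaultdict(int)   # phrase -> consecutive recognition failures
        self.successes = defaultdict(int)  # phrase -> consecutive recognition successes
        self.weak_phrases = set()          # phrases the recognizer handles poorly for this user
        self.strong_phrases = set()        # phrases the recognizer handles well for this user

    def record(self, phrase: str, recognized_correctly: bool) -> None:
        """Update the profile once the correctness of a recognition result has been
        judged (for example, from a negative reaction such as "that's wrong")."""
        if recognized_correctly:
            self.successes[phrase] += 1
            self.failures[phrase] = 0
            if self.successes[phrase] >= SUCCESS_THRESHOLD:
                self.strong_phrases.add(phrase)
                self.weak_phrases.discard(phrase)
        else:
            self.failures[phrase] += 1
            self.successes[phrase] = 0
            if self.failures[phrase] >= FAIL_THRESHOLD:
                self.weak_phrases.add(phrase)
                self.strong_phrases.discard(phrase)
```

In this sketch, such a structure would loosely correspond to the pronunciation characteristic information kept in the user information storage unit 46 and updated by the learning unit 45 after each correctness judgment.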
Note that the timing for learning the user's pronunciation characteristics is not limited to this example. For example, learning may be performed after the dialogue with the user has ended.
Thereafter, the processing returns to step S6, and the processes from step S6 onward are executed.
Here, examples of the presentation information and the presentation method will be described.
When people converse, they do not always use the same phrases. For example, people generally tend to follow the phrases used by the other party. For example, people tend to match the phrases used by the other party in order to keep the conversation smooth, or to unconsciously use phrases that have entered their field of view.
Therefore, for example, the presentation information generation unit 61 takes advantage of this human tendency and generates presentation information that guides the user toward using phrases for which the recognition accuracy (speech recognition accuracy) of the voice recognition unit 41 is high (hereinafter referred to as high recognition phrases). Also, for example, the presentation information generation unit 61 generates presentation information that guides the user toward suppressing the use of phrases for which the recognition accuracy of the voice recognition unit 41 is low (hereinafter referred to as low recognition phrases).
High recognition phrases and low recognition phrases may be determined individually based on the recognition accuracy of the voice recognition unit 41, or relatively, based on how the recognition accuracy of the voice recognition unit 41 compares between phrases. In the former case, for example, a phrase whose recognition accuracy is at or above a predetermined first threshold is classified as a high recognition phrase, and a phrase whose recognition accuracy is at or below a second threshold smaller than the first threshold is classified as a low recognition phrase. In the latter case, for example, of two phrases, the one with the higher recognition accuracy of the voice recognition unit 41 is classified as a high recognition phrase and the one with the lower recognition accuracy as a low recognition phrase.
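As a rough illustration only, the two classification schemes just described (absolute thresholds versus pairwise comparison) could look like the sketch below. The concrete threshold values and function names are assumptions, not values taken from the patent.

```python
HIGH_THRESHOLD = 0.9  # first threshold (assumed value)
LOW_THRESHOLD = 0.6   # second threshold, smaller than the first (assumed value)

def classify_absolute(accuracy: float) -> str:
    """Individual classification of a phrase against fixed accuracy thresholds."""
    if accuracy >= HIGH_THRESHOLD:
        return "high recognition phrase"
    if accuracy <= LOW_THRESHOLD:
        return "low recognition phrase"
    return "neither"

def classify_relative(phrase_a: tuple, phrase_b: tuple) -> dict:
    """Relative classification: of two (phrase, accuracy) pairs, the more accurate
    one is treated as the high recognition phrase and the other as the low one."""
    high, low = sorted([phrase_a, phrase_b], key=lambda p: p[1], reverse=True)
    return {"high": high[0], "low": low[0]}

print(classify_absolute(0.95))                         # -> "high recognition phrase"
print(classify_relative(("saru", 0.58), ("B", 0.85)))  # -> {'high': 'B', 'low': 'saru'}
```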
For example, the phrase selection unit 72 raises the priority for use in the presentation information of phrases with higher recognition accuracy of the voice recognition unit 41, and lowers the priority of phrases with lower recognition accuracy. As a result, in the presentation information, high-priority phrases are used more frequently and low-priority phrases less frequently.
For example, when multiple usable phrases exist (for example, phrases with the same meaning), the phrase selection unit 72 selects the one with the higher priority. Also, for example, the phrase selection unit 72 replaces a low recognition phrase with a high recognition phrase having the same meaning, appends a high recognition phrase to a low recognition phrase, or selects a high recognition phrase as a phrase to be presented together with a low recognition phrase. The presentation content setting unit 71 sets the presentation content using the phrases selected by the phrase selection unit 72, and generates presentation information indicating the set presentation content.
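A minimal sketch of this priority-based selection, assuming a simple phrase-to-priority mapping and a synonym table (both names and the example values are hypothetical), might look as follows.

```python
def select_phrase(candidates: dict) -> str:
    """Among interchangeable phrases (for example, synonyms), pick the one with the
    highest priority; `candidates` maps phrase -> priority."""
    return max(candidates, key=candidates.get)

def compose_presentation(text: str, synonym_table: dict, priority: dict) -> str:
    """Replace each word that has higher-priority alternatives of the same meaning,
    so that the presentation text favors high recognition phrases."""
    out = []
    for word in text.split():
        alternatives = synonym_table.get(word, [word])
        out.append(max(alternatives, key=lambda p: priority.get(p, 0.0)))
    return " ".join(out)

# Hypothetical example: "ticket" is assumed to be recognized more reliably than "stub".
priority = {"stub": 0.4, "ticket": 0.9}
synonyms = {"stub": ["stub", "ticket"]}
print(compose_presentation("please show your stub", synonyms, priority))
```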
As a result, high recognition phrases are preferentially presented to the user, so the user is unconsciously guided toward using high recognition phrases. Meanwhile, since the presentation of low recognition phrases is suppressed, the user is unconsciously guided toward suppressing the use of low recognition phrases. Consequently, the recognition accuracy of the voice uttered by the user improves.
The priority of each phrase may be fixed regardless of the user and the surrounding conditions, or may be changed dynamically according to the user and the surrounding conditions.
For example, the phrase selection unit 72 sets the priority of each phrase based on the speech recognition accuracy of each phrase stored in the phrase information storage unit 47, without considering the user or the surrounding conditions. The priority of each phrase is thereby fixed. Note that when, for example, the speech recognition accuracy of each phrase is updated as appropriate based on past recognition results, the priority of each phrase does change over a long span.
On the other hand, for example, the phrase selection unit 72 may change the priority of each phrase dynamically, taking into account the user and the surrounding conditions in addition to the speech recognition accuracy of each phrase.
For example, due to the user's pronunciation characteristics (for example, pronunciation habits), speech recognition accuracy may differ between users even for the same phrase. Therefore, the phrase selection unit 72 adjusts the speech recognition accuracy of each phrase based on the user's pronunciation characteristics stored in the user information storage unit 46, and sets the priority of each phrase based on the adjusted speech recognition accuracy.
Also, for example, speech recognition accuracy may change even for the same phrase due to ambient noise. For example, in general, a short phrase is more easily buried in noise than a long phrase, so its speech recognition accuracy is more susceptible to noise. For example, as noise increases, the recognition accuracy of short phrases generally drops more than that of long phrases. Also, for example, when other people around the user are talking about numbers, the recognition accuracy for numbers decreases. Therefore, the phrase selection unit 72 adjusts the speech recognition accuracy of each phrase based on the recognition result of the ambient noise, and sets the priority of each phrase based on the adjusted speech recognition accuracy.
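The dynamic adjustment described here, where a stored per-phrase accuracy is modified by the current user's weak phrases and by the ambient noise level before being used as a priority, could be sketched as below. The penalty factors, the noise scale of 0 to 1, and the function names are illustrative assumptions, not values from the patent.

```python
def adjusted_accuracy(phrase: str, base_accuracy: float,
                      weak_phrases: set, noise_level: float) -> float:
    """Adjust a phrase's stored recognition accuracy for the current user and the
    ambient noise level (assumed normalized to 0..1), and use the result as the
    phrase's selection priority."""
    acc = base_accuracy
    if phrase in weak_phrases:                 # user's pronunciation characteristics
        acc *= 0.7
    # Short phrases are assumed to be buried in noise more easily than long ones.
    length_factor = min(len(phrase), 8) / 8.0
    acc *= 1.0 - noise_level * (1.0 - length_factor) * 0.5
    return acc

def set_priorities(stored_accuracy: dict, weak_phrases: set, noise_level: float) -> dict:
    """stored_accuracy maps phrase -> accuracy from the phrase information storage;
    returns phrase -> dynamically adjusted priority."""
    return {p: adjusted_accuracy(p, a, weak_phrases, noise_level)
            for p, a in stored_accuracy.items()}

# Under heavy noise the longer phrase keeps a higher priority than the short one.
print(set_priorities({"A": 0.8, "Alpha": 0.8}, set(), noise_level=0.8))
```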
Here, specific examples of the presentation information will be described with reference to FIGS. 3 to 6. FIGS. 3 to 6 show examples in which presentation information including options such as "Inu", "Saru", and "Kiji" is displayed on the display 81 and the user is asked to select one.
For example, in the example of FIG. 3, a number is displayed in front of each option as the phrase used for selecting that option (hereinafter referred to as a selected term phrase). As a result, the user can select an option by uttering the selected term phrase ("1", "2", "3") in addition to the phrase representing the option ("Inu", "Saru", "Kiji").
Here, it is desirable to use phrases with higher speech recognition accuracy as the selected term phrases. For example, when letters of the alphabet have higher speech recognition accuracy than numbers, letters ("A", "B", "C") are used as the selected term phrases instead of numbers, as in the example of FIG. 4. This improves the recognition rate of the options selected by the user by voice (hereinafter referred to as the selection recognition rate).
In general, a short phrase (for example, a phrase of one or two syllables) tends to have lower speech recognition accuracy than a longer phrase. Therefore, for example, long phrases such as "Alpha", "Bravo", and "Charlie" may be used as the selected term phrases instead of short numbers and letters, as in the example of FIG. 5. This improves the selection recognition rate.
As described above, speech recognition of short phrases is more susceptible to ambient noise than speech recognition of long phrases. Therefore, for example, the selected term phrases may be lengthened as the ambient noise level increases.
Furthermore, when long phrases are used as selected term phrases, the space required to display them increases. As a result, for example, the space for displaying the options may shrink, or the options may become harder to see. Also, depending on the characters used in the selected term phrases (for example, when the options contain the same characters as a selected term phrase), it may be difficult to distinguish the selected term phrases from the options. Furthermore, depending on the types of characters used in the selected term phrases, it may be difficult to distinguish the selected term phrases from one another. Therefore, for example, instead of the selected term phrases, symbols representing them (hereinafter referred to as selection symbols) may be displayed together with the options. For example, icons, signs, images, or the like are used as the selection symbols.
In the example of FIG. 6, icons representing a star, the moon, and the sun are used as the selection symbols. Accordingly, the user can select an option by uttering "Hoshi", "Tsuki", or "Taiyo", the words (selected term phrases) represented by the selection symbols, in addition to the words representing the options.
Note that the phrase selection unit 72 selects, for example, a phrase with higher speech recognition accuracy than the phrase representing the option as the selected term phrase. This improves the selection recognition rate compared with presenting only the options.
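One way to realize this rule, choosing for each option a selected term phrase whose estimated recognition accuracy exceeds that of the option phrase itself, is sketched below under assumed accuracy values; the function name and the data layout are not from the patent.

```python
def choose_selected_term_phrases(options: dict, label_candidates: dict) -> dict:
    """For each option (phrase -> accuracy), pick an unused label from
    label_candidates (label phrase -> accuracy) whose accuracy exceeds that of the
    option phrase, preferring the most accurate labels. Returns option -> label."""
    labels = sorted(label_candidates.items(), key=lambda kv: kv[1], reverse=True)
    assignment, used = {}, set()
    for option, option_acc in options.items():
        assignment[option] = None
        for label, label_acc in labels:
            if label not in used and label_acc > option_acc:
                assignment[option] = label
                used.add(label)
                break
    return assignment

options = {"inu": 0.62, "saru": 0.58, "kiji": 0.55}
labels = {"1": 0.60, "2": 0.60, "3": 0.60, "Alpha": 0.95, "Bravo": 0.94, "Charlie": 0.93}
print(choose_selected_term_phrases(options, labels))
# -> {'inu': 'Alpha', 'saru': 'Bravo', 'kiji': 'Charlie'} under these assumed accuracies
```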
Also, for example, the presentation control unit 62 may control the presentation of the presentation information by a method that induces the use of high recognition phrases and induces the suppression of the use of low recognition phrases. Specifically, for example, the presentation control unit 62 may control the presentation so that each option is presented in a different aspect, each aspect being represented by a high recognition phrase, and allow the user to select an option by uttering the phrase representing its aspect.
For example, each option may be displayed with a different display effect (for example, color, size, font, etc.), and the option may be selected by a word representing the display effect. For example, FIG. 7 shows an example in which each option and its selected term phrase are displayed in different colors (the color differences are not visible here because the figure is monochrome). Specifically, in FIG. 7, "A. Inu" is displayed in red, "B. Saru" in blue, and "C. Kiji" in green. As a result, the user can select an option by uttering the word representing the color in which it is displayed (for example, "red", "blue", "green") in addition to the phrase representing the option and the selected term phrase.
Also, each option may be displayed at a different position so that the option can be selected by a word representing its display position. For example, in the example of FIG. 8, the options and their selected term phrases are displayed at the top, bottom, left, and right, respectively. As a result, the user can select an option by uttering a word representing its display position (for example, "up", "down", "left", "right") in addition to the phrase representing the option and the selected term phrase.
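Accepting any of these aspect words amounts to mapping several aliases (the option word, its selected term phrase, its color word, its position word) onto the same option. A minimal sketch, with aliases loosely mirroring FIGS. 7 and 8 but otherwise assumed, is shown below.

```python
def build_alias_map(options: list) -> dict:
    """Map every phrase the user might utter (the option word itself, its selected
    term phrase, its color word, its position word) to the option it selects."""
    alias_map = {}
    for opt in options:
        for alias in (opt["name"], opt["label"], opt["color"], opt["position"]):
            alias_map[alias.lower()] = opt["name"]
    return alias_map

# The concrete alias values here are assumptions for illustration only.
options = [
    {"name": "inu",  "label": "a", "color": "red",   "position": "up"},
    {"name": "saru", "label": "b", "color": "blue",  "position": "left"},
    {"name": "kiji", "label": "c", "color": "green", "position": "right"},
]
aliases = build_alias_map(options)
print(aliases.get("blue"))  # -> "saru": any aspect word selects the corresponding option
```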
なお、提示制御部62は、例えば、選択肢を表す語句より音声認識精度が高い語句を、各選択肢を提示する態様を表す語句に用いることにより、換言すれば、選択肢を表す語句より音声認識精度が高い語句により表される態様で各選択肢を提示するよう制御する。これにより、各選択肢を同じ態様で提示する場合と比較して、選択認識率が向上する。
In addition, the
また、図8の例において、例えば、ユーザのジェスチャや視線の方向等を補助的に用いることにより、選択認識率をさらに向上させることができる。 In the example of FIG. 8, for example, the selection recognition rate can be further improved by using, for example, the user's gesture or the direction of the line of sight.
さらに、図7及び図8の例において、例えば、選択用語句の提示を省略することが可能である。すなわち、各選択肢に選択用語句を付加せずに、各選択肢をそれぞれ異なる態様で提示するようにしてもよい。 Further, in the examples of FIGS. 7 and 8, for example, it is possible to omit the presentation of the selected term phrase. That is, each option may be presented in a different manner without adding a selected term phrase to each option.
また、例えば、より音声認識精度が高い語句を、選択肢を表す語句に用いるようにしてもよい。例えば、選択肢が「京葉線」、「京急線」、及び、「京王線」の場合、各選択肢間の音が似通っているため、選択認識率が低下するおそれがある。そこで、例えば、「JR京葉線」、「京浜急行線」、及び、「京王線」と各路線のフルネームを選択肢として提示することにより、選択認識率を向上させることができる。 Also, for example, a phrase with higher voice recognition accuracy may be used as a phrase representing an option. For example, when the choices are “Keiyo Line”, “Keikyu Line”, and “Keio Line”, since the sounds between the choices are similar, the selection recognition rate may decrease. Thus, for example, the selection recognition rate can be improved by presenting the full names of each route such as “JR Keiyo Line”, “Keihin Express Line”, and “Keio Line” as options.
Similarly, when the options are "1個", "2個", and "3個" (one, two, and three items), each option is short and the user is likely to utter only the numeral. Presenting "hitotsu", "futatsu", and "mittsu" instead improves the selection recognition rate.
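As a non-limiting illustration of this kind of label substitution, the following sketch replaces option labels that are mutually confusable or very short with fuller phrases; the substitution table and the text-similarity heuristic are assumptions standing in for whatever confusability statistics a real system would use.

```python
# Hypothetical sketch: swap option labels for fuller, more distinctive phrases
# before presentation, as in the "Keiyo/Keikyu/Keio" and "1個/2個/3個" examples.

from difflib import SequenceMatcher

SUBSTITUTIONS = {
    "Keiyo Line":  "JR Keiyo Line",
    "Keikyu Line": "Keihin Kyuko Line",
    "1個": "hitotsu",
    "2個": "futatsu",
    "3個": "mittsu",
}

def too_confusable(labels, threshold=0.6):
    """Crude check: are any two labels textually very similar?"""
    pairs = [(a, b) for i, a in enumerate(labels) for b in labels[i + 1:]]
    return any(SequenceMatcher(None, a, b).ratio() >= threshold for a, b in pairs)

def labels_for_presentation(labels):
    # Substitute when the labels are confusable or too short to recognize reliably.
    if too_confusable(labels) or any(len(label) <= 2 for label in labels):
        return [SUBSTITUTIONS.get(label, label) for label in labels]
    return labels

print(labels_for_presentation(["Keiyo Line", "Keikyu Line", "Keio Line"]))
print(labels_for_presentation(["1個", "2個", "3個"]))
```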
The above description takes the case where the options are displayed on the display 81 as an example, but a selection phrase can likewise be used to improve the selection recognition rate when the options are presented by voice. For example, the selection phrase and the option can be output successively by voice, and the user can be prompted to utter the selection phrase, which improves the selection recognition rate.
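As a non-limiting illustration of such voice-only presentation, the following sketch reads out each selection phrase followed by its option; the `speak` stand-in and the phrase pairs are hypothetical.

```python
# Hypothetical sketch: a voice-only prompt that reads out each selection phrase
# followed by its option, so the user can answer with the more recognizable
# selection phrase. `speak` is a stand-in for the output unit's speech synthesis.

def speak(text: str) -> None:
    print(f"[TTS] {text}")  # placeholder for real text-to-speech output

def present_by_voice(pairs):
    """pairs: list of (selection_phrase, option_label) tuples."""
    for selection, option in pairs:
        speak(f"{selection}: {option}")
    speak("Please answer with one of the phrases you just heard.")

present_by_voice([("hoshi", "inu"), ("tsuki", "saru"), ("taiyou", "kiji")])
```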
In this way, the user comes to use phrases with higher speech recognition accuracy without being aware of it, so that speech recognition accuracy can be improved while keeping the burden on the user low. This makes it possible, for example, to reduce speech recognition failures and to realize a smooth dialogue with the user.
<<2. Modifications>>
Modifications of the embodiment of the present technology described above will now be described.
For example, in a use case in which somewhat fixed expressions (for example, "next" and "back") are used, such as user interface navigation, phrases with higher speech recognition accuracy may be displayed on the display 81 as those fixed expressions. This leads the user to adopt more recognizable expressions for the fixed operations.
Further, for example, the user may be guided to utter several phrases serving as pilot information, and the user's speech characteristics may be learned from those utterances. This makes it possible to learn the user's speech characteristics efficiently and in a short period of time.
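As a non-limiting illustration of learning from pilot utterances, the following sketch prompts the user with a few known phrases and tallies per-phrase recognition outcomes into a simple profile; the phrase list, the `recognize` callback, and the profile structure are all assumptions, not part of the disclosure.

```python
# Hypothetical sketch: collect "pilot" utterances to bootstrap a user's
# pronunciation profile by tallying per-phrase recognition outcomes.

from collections import defaultdict

PILOT_PHRASES = ["hoshi", "tsuki", "taiyou", "hidari", "migi"]

def learn_pronunciation_profile(recognize, user_id, profiles):
    """`recognize(prompt)` is assumed to return the text recognized after the
    user reads `prompt` aloud; matches and mismatches are tallied per phrase."""
    stats = profiles.setdefault(user_id, defaultdict(lambda: {"ok": 0, "ng": 0}))
    for phrase in PILOT_PHRASES:
        result = recognize(phrase)
        key = "ok" if result == phrase else "ng"
        stats[phrase][key] += 1
    return stats

# Simulated recognizer that always mishears "taiyou".
profiles = {}
fake_recognize = lambda prompt: "taiyo" if prompt == "taiyou" else prompt
print(dict(learn_pronunciation_profile(fake_recognize, "user-1", profiles)))
```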
Furthermore, for example, once the user's pronunciation habits have been recognized to some extent, the pronunciation characteristic information of another user with similar habits may be used until the learning of the user's own pronunciation characteristics has progressed sufficiently.
Further, for example, only one of the two measures may be taken: preferential use of high-recognition phrases in the presentation information, or suppression of the use of low-recognition phrases in the presentation information. In the former case, the user is mainly guided to use high-recognition phrases; in the latter case, the user is mainly guided to avoid low-recognition phrases.
Furthermore, for example, the pronunciation characteristic information of each user may be shared between different systems (for example, between a plurality of agents).
The present technology is not limited to scenes in which a dialogue with the user takes place, and can be applied to various scenes in which the user's speech is recognized.
<<3. Others>>
<Computer configuration example>
The series of processes described above can be executed by hardware or by software. When the series of processes is executed by software, a program constituting the software is installed in a computer. Here, the computer includes a computer incorporated in dedicated hardware and, for example, a general-purpose personal computer capable of executing various functions when various programs are installed in it.
FIG. 9 is a block diagram showing a configuration example of the hardware of a computer that executes the series of processes described above by means of a program.
In the computer 500, a CPU (Central Processing Unit) 501, a ROM (Read Only Memory) 502, and a RAM (Random Access Memory) 503 are connected to one another by a bus 504.
An input/output interface 505 is further connected to the bus 504. An input unit 506, an output unit 507, a recording unit 508, a communication unit 509, and a drive 510 are connected to the input/output interface 505.
The input unit 506 includes input switches, buttons, a microphone, an image sensor, and the like. The output unit 507 includes a display, a speaker, and the like. The recording unit 508 includes a hard disk, a nonvolatile memory, and the like. The communication unit 509 includes a network interface and the like. The drive 510 drives a removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory.
In the computer 500 configured as described above, the CPU 501 loads, for example, a program recorded in the recording unit 508 into the RAM 503 via the input/output interface 505 and the bus 504 and executes it, whereby the series of processes described above is performed.
The program executed by the computer 500 (CPU 501) can be provided by being recorded on the removable medium 511 as a package medium or the like, for example. The program can also be provided via a wired or wireless transmission medium such as a local area network, the Internet, or digital satellite broadcasting.
In the computer 500, the program can be installed in the recording unit 508 via the input/output interface 505 by mounting the removable medium 511 on the drive 510. The program can also be received by the communication unit 509 via a wired or wireless transmission medium and installed in the recording unit 508. Alternatively, the program can be installed in advance in the ROM 502 or the recording unit 508.
The program executed by the computer may be a program in which processing is performed in time series in the order described in this specification, or a program in which processing is performed in parallel or at necessary timing, such as when a call is made.
In this specification, a system means a set of a plurality of components (devices, modules (parts), and the like), and it does not matter whether or not all the components are in the same housing. Therefore, a plurality of devices housed in separate housings and connected via a network, and a single device in which a plurality of modules are housed in one housing, are both systems.
Embodiments of the present technology are not limited to the embodiment described above, and various modifications can be made without departing from the gist of the present technology.
For example, the present technology can adopt a cloud computing configuration in which one function is shared and jointly processed by a plurality of devices via a network.
Each step described in the above flowcharts can be executed by one device or shared among a plurality of devices.
Furthermore, when a plurality of processes are included in one step, the plurality of processes included in that one step can be executed by one device or shared among a plurality of devices.
<Combination examples of configurations>
The present technology can also adopt the following configurations.

(1)
An information processing apparatus including:
a presentation information generation unit that generates presentation information, which is information to be presented to a user; and
a presentation control unit that controls the presentation of the presentation information to the user and induces the use of a phrase with high speech recognition accuracy.
(2)
The information processing apparatus according to (1), in which the presentation information generation unit generates the presentation information that induces the use of a phrase with high speech recognition accuracy.
(3)
The information processing apparatus according to (2), in which the presentation information generation unit selects the phrases used in the presentation information on the basis of the speech recognition accuracy of each phrase.
(4)
The information processing apparatus according to (3), in which the presentation information generation unit preferentially uses phrases with high speech recognition accuracy in the presentation information.
(5)
The information processing apparatus according to (3) or (4), in which, when presenting options in the presentation information, the presentation information generation unit selects a selection phrase, which is a phrase used for selecting an option, on the basis of the speech recognition accuracy of each phrase.
(6)
The information processing apparatus according to (5), in which the presentation information generation unit selects, as the selection phrase, a phrase with higher speech recognition accuracy than the phrase representing the option.
(7)
The information processing apparatus according to (5) or (6), in which the presentation information generation unit generates the presentation information that presents a symbol representing the selection phrase together with the option.
(8)
The information processing apparatus according to any one of (3) to (7), in which the presentation information generation unit further selects the phrases used in the presentation information on the basis of pronunciation characteristics of the user.
(9)
The information processing apparatus according to (8), further including a learning unit that learns the pronunciation characteristics of the user.
(10)
The information processing apparatus according to any one of (3) to (9), in which the presentation information generation unit further selects the phrases used in the presentation information on the basis of ambient noise.
(11)
The information processing apparatus according to (10), further including a noise recognition unit that recognizes the ambient noise.
(12)
The information processing apparatus according to any one of (1) to (11), in which the presentation control unit controls the presentation of the presentation information by a method that induces the use of a phrase with high speech recognition accuracy.
(13)
The information processing apparatus according to (12), in which, when presenting options in the presentation information, the presentation control unit controls the presentation of the presentation information so that each option is presented in a different manner, each manner being represented by a phrase with high speech recognition accuracy.
(14)
The information processing apparatus according to (13), in which the manner is a display effect or a display position of the option.
(15)
The information processing apparatus according to (13) or (14), in which the presentation control unit uses a manner whose representing phrase has higher speech recognition accuracy than the phrase representing the option.
(16)
The information processing apparatus according to any one of (1) to (15), in which the presentation control unit controls the presentation of the presentation information to the user and induces suppression of the use of a phrase with low speech recognition accuracy.
(17)
The information processing apparatus according to any one of (1) to (16), further including a speech recognition unit that performs speech recognition of the user.
(18)
The information processing apparatus according to any one of (1) to (17), in which the presentation information includes a response to the content of the user's utterance.
(19)
An information processing method including:
generating presentation information, which is information to be presented to a user; and
controlling the presentation of the presentation information to the user and inducing the use of a phrase with high speech recognition accuracy.
(20)
A program for causing a computer to execute processing of:
generating presentation information, which is information to be presented to a user; and
controlling the presentation of the presentation information to the user and inducing the use of a phrase with high speech recognition accuracy.
Note that the effects described in this specification are merely examples and are not limiting; other effects may also be obtained.
11 dialogue system, 21 input unit, 22 information processing unit, 23 output unit, 31 microphone, 32 camera, 41 speech recognition unit, 42 noise recognition unit, 43 user recognition unit, 44 interpretation unit, 45 learning unit, 46 user information storage unit, 47 phrase information storage unit, 48 presentation processing unit, 61 presentation information generation unit, 62 presentation control unit, 71 presentation content setting unit, 72 phrase selection unit, 81 display, 82 speaker
Claims (20)
1. An information processing apparatus comprising:
a presentation information generation unit that generates presentation information, which is information to be presented to a user; and
a presentation control unit that controls the presentation of the presentation information to the user and induces the use of a phrase with high speech recognition accuracy.
2. The information processing apparatus according to claim 1, wherein the presentation information generation unit generates the presentation information that induces the use of a phrase with high speech recognition accuracy.
3. The information processing apparatus according to claim 2, wherein the presentation information generation unit selects the phrases used in the presentation information on the basis of the speech recognition accuracy of each phrase.
4. The information processing apparatus according to claim 3, wherein the presentation information generation unit preferentially uses phrases with high speech recognition accuracy in the presentation information.
5. The information processing apparatus according to claim 3, wherein, when presenting options in the presentation information, the presentation information generation unit selects a selection phrase, which is a phrase used for selecting an option, on the basis of the speech recognition accuracy of each phrase.
6. The information processing apparatus according to claim 5, wherein the presentation information generation unit selects, as the selection phrase, a phrase with higher speech recognition accuracy than the phrase representing the option.
7. The information processing apparatus according to claim 5, wherein the presentation information generation unit generates the presentation information that presents a symbol representing the selection phrase together with the option.
8. The information processing apparatus according to claim 3, wherein the presentation information generation unit further selects the phrases used in the presentation information on the basis of pronunciation characteristics of the user.
9. The information processing apparatus according to claim 8, further comprising a learning unit that learns the pronunciation characteristics of the user.
10. The information processing apparatus according to claim 3, wherein the presentation information generation unit further selects the phrases used in the presentation information on the basis of ambient noise.
11. The information processing apparatus according to claim 10, further comprising a noise recognition unit that recognizes the ambient noise.
12. The information processing apparatus according to claim 1, wherein the presentation control unit controls the presentation of the presentation information by a method that induces the use of a phrase with high speech recognition accuracy.
13. The information processing apparatus according to claim 12, wherein, when presenting options in the presentation information, the presentation control unit controls the presentation of the presentation information so that each option is presented in a different manner, each manner being represented by a phrase with high speech recognition accuracy.
14. The information processing apparatus according to claim 13, wherein the manner is a display effect or a display position of the option.
15. The information processing apparatus according to claim 13, wherein the presentation control unit uses a manner whose representing phrase has higher speech recognition accuracy than the phrase representing the option.
16. The information processing apparatus according to claim 1, wherein the presentation control unit controls the presentation of the presentation information to the user and induces suppression of the use of a phrase with low speech recognition accuracy.
17. The information processing apparatus according to claim 1, further comprising a speech recognition unit that performs speech recognition of the user.
18. The information processing apparatus according to claim 1, wherein the presentation information includes a response to the content of the user's utterance.
19. An information processing method comprising:
generating presentation information, which is information to be presented to a user; and
controlling the presentation of the presentation information to the user and inducing the use of a phrase with high speech recognition accuracy.
20. A program for causing a computer to execute processing of:
generating presentation information, which is information to be presented to a user; and
controlling the presentation of the presentation information to the user and inducing the use of a phrase with high speech recognition accuracy.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2018-080377 | 2018-04-19 | ||
| JP2018080377 | 2018-04-19 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2019203016A1 (en) | 2019-10-24 |
Family
ID=68238969
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2019/015070 (WO2019203016A1, ceased) | Information processing device, information processing method, and program | 2018-04-19 | 2019-04-05 |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2019203016A1 (en) |
Patent Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPH05313691A (en) * | 1992-05-08 | 1993-11-26 | Sony Corp | Voice processor |
| JPH09266510A (en) * | 1996-03-28 | 1997-10-07 | Mitsubishi Electric Corp | Message creation method for pager |
| US6185530B1 (en) * | 1998-08-14 | 2001-02-06 | International Business Machines Corporation | Apparatus and methods for identifying potential acoustic confusibility among words in a speech recognition system |
| JP2002278587A (en) * | 2001-03-14 | 2002-09-27 | Fujitsu Ltd | Voice recognition input device |
| JP2005004032A (en) * | 2003-06-13 | 2005-01-06 | Sony Corp | Speech recognition apparatus and speech recognition method |
| JP2006146193A (en) * | 2004-11-24 | 2006-06-08 | Microsoft Corp | General spelling mnemonics |
| JP2009223171A (en) * | 2008-03-18 | 2009-10-01 | Advanced Telecommunication Research Institute International | Communication system |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| EP3888084B1 (en) | Method and device for providing voice recognition service | |
| US10650802B2 (en) | Voice recognition method, recording medium, voice recognition device, and robot | |
| US20190087734A1 (en) | Information processing apparatus and information processing method | |
| JP6172417B1 (en) | Language learning system and language learning program | |
| KR101819459B1 (en) | Voice recognition system and apparatus supporting voice recognition error correction | |
| EP1768103B1 (en) | Device in which selection is activated by voice and method in which selection is activated by voice | |
| CN107403011B (en) | Virtual reality environment language learning implementation method and automatic recording control method | |
| US20140372117A1 (en) | Transcription support device, method, and computer program product | |
| JP2019208138A (en) | Utterance recognition device and computer program | |
| JPWO2017175351A1 (en) | Information processing device | |
| WO2018079332A1 (en) | Information processing device and information processing method | |
| KR20230135396A (en) | Method for dialogue management, user terminal and computer-readable medium | |
| TW201737125A (en) | Response generation device, dialog control system, and response generation method | |
| JP2002344915A (en) | Communication grasping device and method | |
| WO2018079294A1 (en) | Information processing device and information processing method | |
| KR101111487B1 (en) | English Learning Apparatus and Method | |
| JP5818753B2 (en) | Spoken dialogue system and spoken dialogue method | |
| KR101104822B1 (en) | Language Systems and Methods Based on Loud Speech | |
| WO2019203016A1 (en) | Information processing device, information processing method, and program | |
| US11403060B2 (en) | Information processing device and non-transitory computer readable medium for executing printing service according to state of utterance | |
| TWI906850B (en) | Information processing apparatus and information processing | |
| EP4293660A1 (en) | Electronic device and method for controlling same | |
| KR20190072777A (en) | Method and apparatus for communication | |
| EP4428854A1 (en) | Method for providing voice synthesis service and system therefor | |
| KR20100044301A (en) | Immersion display apparatus and method for teaching language using voice recognition |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19788075; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 19788075; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: JP |