JP6664359B2

JP6664359B2 - Voice processing device, method and program

Info

Publication number: JP6664359B2
Application number: JP2017172162A
Authority: JP
Inventors: 成宗松村; 純史布引; 貴司細淵
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2017-09-07
Filing date: 2017-09-07
Publication date: 2020-03-13
Anticipated expiration: 2037-09-07
Also published as: JP2019045831A

Description

The present invention relates to a voice processing device, method, and program that support voice dialogue with a user.

Conventionally, various devices having a voice dialogue function, such as smartphones and robots, have been developed.

In such a voice dialogue, a response voice is returned to the user by recognizing the uttered voice from the user, generating response content data based on the recognition result, and synthesizing a response voice corresponding to the response content data (see, for example, Patent Document 1).

Patent Document 1: JP-A-2016-218566

However, recognizing the uttered voice, generating the response content data, and synthesizing the response voice require a certain amount of processing time. Consequently, a user who has spoken may become anxious because the device returns no response until it emits the response voice.

The present invention has been made in view of the above circumstances, and its object is to provide a voice processing device, method, and program that output filler information to the user before output of the response voice to the user's uttered voice begins.

To solve the above problem, a first aspect of the present invention is a voice processing device used together with recognition of an uttered voice from a user, generation of response content data based on the result of the recognition, and synthesis of a response voice corresponding to the response content data. The device comprises a prediction unit that predicts, based on the length of the uttered voice and information on past response content data, the delay time required from the end of the uttered voice until output of the response voice starts, and a filler information output unit that outputs, within the predicted delay time, filler information according to that delay time. The prediction unit comprises: means for detecting the length of the uttered voice and predicting, based on the detected length, a first time required to recognize the uttered voice; means for predicting, based on the information on past response content data, a second time required to generate response content data based on the result of recognizing the uttered voice; means for predicting, based on the information on past response content data, a third time required to synthesize the response voice corresponding to the generated response content data; and means for predicting, based on the predicted first, second, and third times, the delay time required from the end of the uttered voice until output of the response voice starts.

In a second aspect of the present invention, the means for predicting the first time comprises: means for calculating a coefficient relating the length of an uttered voice to the time required to recognize it, based on the lengths of past uttered voices from the user and the times required to recognize uttered voices of those lengths; and means for predicting the first time based on the detected length of the uttered voice and the calculated coefficient.

In a third aspect of the present invention, the information on past response content data includes the times required to synthesize response voices corresponding to past response content data, and the means for predicting the third time predicts the third time based on the average of the times required for a predetermined number of past response voice syntheses.

In a fourth aspect of the present invention, the voice processing device further comprises a determination unit that determines, when reproduction of the output filler information ends, whether synthesis of the response voice to the uttered voice from the user has been completed, and the filler information output unit outputs additional filler information when it is determined that the synthesis of the response voice has not been completed.

In a fifth aspect of the present invention, the voice processing device further comprises a determination unit that determines, when reproduction of the output filler information ends, whether synthesis of the response voice to the uttered voice from the user has been completed and, if it determines that the synthesis has not been completed, further determines whether recognition of the uttered voice from the user and generation of response content data based on the result of the recognition have been completed. The information on past response content data includes the lengths of response content data for past uttered voices from the user and the times required to synthesize the response voices corresponding to response content data of those lengths. The means for predicting the third time comprises: means for calculating a coefficient relating the length of response content data to the time required to synthesize a response voice, based on those lengths and times; and means for re-predicting the third time, when it is determined that the synthesis of the response voice has not been completed and that the recognition and the generation have been completed, based on the length of the response content data for the uttered voice from the user and the calculated coefficient. The means for predicting the delay time re-predicts, based on the re-predicted third time, the delay time required from the end of the uttered voice from the user until output of the response voice starts, and the filler information output unit outputs, within the re-predicted delay time, additional filler information according to the time obtained by subtracting the time elapsed since the end of the uttered voice from the re-predicted delay time.

According to the first aspect of the present invention, the delay time required from the end of the uttered voice from the user until output of the response voice to that uttered voice starts is predicted based on the length of the uttered voice and information on past response content data. Filler information, which is a notification that the response voice is being prepared, is then output within the predicted delay time according to that delay time. This keeps a user who has spoken from becoming anxious because no response is returned. Furthermore, if, for example, a filler voice uttering meaningful words whose duration corresponds to the delay time is output, the user can predict from the type of filler voice how long they must wait for the response voice, which further reassures the user.

Further, according to the first aspect of the present invention, the length of the uttered voice is detected, and the first time required to recognize the uttered voice is predicted from the detected length. The second time required to generate response content data based on the result of recognizing the uttered voice, and the third time required to synthesize the response voice corresponding to the generated response content data, are predicted based on information on past response content data. The delay time is predicted from the predicted first, second, and third times. Thus, the first time can be predicted with high accuracy by using the detected length of the uttered voice. The second time can be predicted with high reliability by using information on past response content data, because in many cases the response content data generation time varies little with differences in the recognized utterance. The third time can likewise be predicted reliably by using information on past response content data.

According to the second aspect of the present invention, a coefficient relating the length of an uttered voice to the time required to recognize it is calculated based on the lengths of past uttered voices from the user and the times required to recognize them. The first time is predicted based on the detected length of the uttered voice and the calculated coefficient. Depending on the implementation, for example when the response voice preparation processing is performed by another device, delays such as communication time may also arise. By using actual information on past uttered voices in this way, the first time can be predicted including such communication time, provided that the uttered voice of interest is processed under the same conditions as those under which the past information was obtained.
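
The coefficient described above could be estimated in several ways; the text does not fix a formula. The following is a minimal sketch that assumes a least-squares fit through the origin over the stored history. The function names `recognition_coefficient` and `predict_first_time` are hypothetical and introduced only for illustration.

```python
from typing import Sequence


def recognition_coefficient(utterance_secs: Sequence[float],
                            recognition_secs: Sequence[float]) -> float:
    """Estimate recognition time per second of speech from past turns.

    Assumption: a least-squares line through the origin; the source only
    says that a coefficient is calculated from past lengths and times.
    """
    if len(utterance_secs) != len(recognition_secs) or not utterance_secs:
        raise ValueError("need equal-length, non-empty histories")
    numerator = sum(u * r for u, r in zip(utterance_secs, recognition_secs))
    denominator = sum(u * u for u in utterance_secs)
    if denominator == 0:
        raise ValueError("all past utterance lengths are zero")
    return numerator / denominator


def predict_first_time(current_utterance_sec: float, coefficient: float) -> float:
    """Predict the first time (recognition) for the utterance of interest."""
    return coefficient * current_utterance_sec
```

Because the measured recognition times include any communication delay incurred when they were recorded, the fitted coefficient absorbs that delay as well, which matches the point made in the paragraph above.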

According to the third aspect of the present invention, the third time is predicted based on the average of the times required for a predetermined number of past response voice syntheses. By using the actual times required for past response voice syntheses in this way, the third time required to synthesize the response voice can be predicted with high reliability.
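
As an illustration of averaging over a predetermined number of past syntheses, here is a small sketch. The class name `SynthesisTimeAverager`, the window size of 5, and the fallback value used when no history exists are all assumptions, not values taken from the source.

```python
from collections import deque
from statistics import fmean


class SynthesisTimeAverager:
    """Keeps the most recent synthesis times and averages them."""

    def __init__(self, window: int = 5) -> None:
        # deque(maxlen=...) silently discards the oldest entry when full.
        self._times = deque(maxlen=window)

    def record(self, synthesis_sec: float) -> None:
        """Store the measured synthesis time of a finished turn."""
        self._times.append(synthesis_sec)

    def predict_third_time(self, default_sec: float = 1.0) -> float:
        """Return the mean of the stored times, or a fallback if empty."""
        return fmean(self._times) if self._times else default_sec
```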

According to the fourth aspect of the present invention, when reproduction of the output filler information ends, it is determined whether synthesis of the response voice has been completed. If it is determined that the synthesis has not been completed, additional filler information is output. Therefore, even when the user who has spoken must wait further for the response voice after reproduction of the output filler information ends, the user is not made anxious by the absence of a response.

According to the fifth aspect of the present invention, when reproduction of the output filler information ends, it is determined whether synthesis of the response voice has been completed. If it is determined that the synthesis has not been completed, it is further determined whether recognition of the uttered voice from the user and generation of response content data based on the result of the recognition have been completed. If these are determined to have been completed, the third time is re-predicted based on the length of the response content data for the uttered voice from the user and on a coefficient, calculated from information on past response content data, relating the length of response content data to the time required to synthesize a response voice. Based on the re-predicted third time, the delay time required from the end of the uttered voice from the user until output of the response voice starts is re-predicted. Additional filler information, chosen according to the time obtained by subtracting the time elapsed since the end of the uttered voice from the re-predicted delay time, is output within the re-predicted delay time. Using the length of the response content data thus allows the third time, and therefore the delay time, to be re-predicted with high accuracy. As a result, the additional filler information can also match the time the user still has to wait, which further reassures the user.
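
The re-prediction step can be sketched as follows. This is an illustration under stated assumptions: the coefficient is taken as total synthesis time divided by total character count, the re-predicted delay is the sum of the three times, and the first and second times are simply carried over from the earlier prediction. None of these choices is specified in the source, and the function names are hypothetical.

```python
from typing import Sequence, Tuple


def synthesis_coefficient(char_counts: Sequence[int],
                          synthesis_secs: Sequence[float]) -> float:
    """Synthesis seconds per response character (assumed ratio estimate)."""
    total_chars = sum(char_counts)
    if len(char_counts) != len(synthesis_secs) or total_chars == 0:
        raise ValueError("need matching histories with at least one character")
    return sum(synthesis_secs) / total_chars


def repredict_remaining(first_sec: float, second_sec: float,
                        response_chars: int, coefficient: float,
                        elapsed_sec: float) -> Tuple[float, float]:
    """Return (re-predicted delay, time still to be covered by filler)."""
    third_sec = coefficient * response_chars        # re-predicted third time
    delay_sec = first_sec + second_sec + third_sec  # assumed simple sum
    remaining_sec = max(0.0, delay_sec - elapsed_sec)
    return delay_sec, remaining_sec
```

The remaining time is clamped at zero so that a response that is already overdue does not yield a negative filler duration.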

That is, according to the present invention, it is possible to provide a voice processing device, method, and program that output filler information to the user before output of the response voice to the user's uttered voice begins.

FIG. 1 is a schematic configuration diagram of a system that implements voice dialogue with a user according to a first embodiment of the present invention.
FIG. 2 is a block diagram showing the functional configuration of the voice dialogue device in the system shown in FIG. 1.
FIG. 3 is a block diagram showing the functional configuration of the server in the system shown in FIG. 1.
FIG. 4 is a flowchart showing an example of filler information output processing executed by the control unit of the voice dialogue device shown in FIG. 2.
FIG. 5 is a flowchart showing an example of filler information output processing executed by the control unit of the voice dialogue device shown in FIG. 2.

Embodiments of the present invention will be described below with reference to the drawings.
[First Embodiment]
(Configuration)
FIG. 1 is a schematic configuration diagram of a system that implements voice dialogue with a user according to a first embodiment of the present invention. In this embodiment, a voice dialogue device is described as a non-limiting example of the voice processing device.

The system of this embodiment consists of a voice dialogue device 1 and a server 2 connected to the voice dialogue device 1 via a communication network.

The voice dialogue device 1 returns a response voice via the speaker 15 to an uttered voice from the user input via the microphone 14 and, before returning the response voice, can output a filler (for example, a spoken notification that the response voice is being prepared) according to the delay time the user is kept waiting. The server 2 receives the uttered voice data from the voice dialogue device 1 and synthesizes the response voice to be output from the voice dialogue device 1. Although this specification describes a system in which the processing of preparing a response voice from an uttered voice is implemented in the server 2, a device separate from the voice dialogue device 1, the preparation processing may instead be implemented in the voice dialogue device 1.

FIG. 2 is a block diagram showing the functional configuration of the voice dialogue device 1 in the system shown in FIG. 1.

The voice dialogue device 1 includes a control unit 11, a storage unit 12, a communication interface unit 13, a microphone 14, and a speaker 15.

The microphone 14 inputs the uttered voice from the user to the control unit 11.

The communication interface unit 13 includes, for example, one or more wired or wireless communication interface units. The communication interface unit 13 acquires the uttered voice data output from the control unit 11 and transmits it to the server 2 via the communication network. The communication interface unit 13 also acquires information such as response voice data from the server 2 via the communication network and inputs the acquired information to the control unit 11.

The speaker 15 reproduces the filler information and the response voice data output from the control unit 11.

The storage unit 12 uses, as its storage medium, a nonvolatile memory that can be written and read at any time, such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive). As storage areas used to implement this embodiment, it includes a voice data storage unit 121, an utterance time storage unit 122, a response preparation time storage unit 123, a response character count storage unit 124, and a filler information storage unit 125. Although the utterance time storage unit 122, the response preparation time storage unit 123, and the response character count storage unit 124 are shown as separate storage units in the drawings, their contents may, for example, be stored together in a single table.
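
The single-table option mentioned above might look like the following sketch, in which one record per dialogue turn holds what units 122, 123, and 124 would otherwise store separately. The record type `DialogueTurnRecord`, its field names, and the helper `completed_turns` are hypothetical; the source does not define a schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DialogueTurnRecord:
    """One row of an assumed combined table (units 122, 123, 124)."""
    utterance_sec: float                     # user utterance time (unit 122)
    recognition_sec: Optional[float] = None  # recognition time (unit 123)
    generation_sec: Optional[float] = None   # content generation time (unit 123)
    synthesis_sec: Optional[float] = None    # synthesis time (unit 123)
    response_chars: Optional[int] = None     # response character count (unit 124)


def completed_turns(rows: List[DialogueTurnRecord]) -> List[DialogueTurnRecord]:
    """Keep only past turns whose measurements are all filled in."""
    return [
        r for r in rows
        if None not in (r.recognition_sec, r.generation_sec,
                        r.synthesis_sec, r.response_chars)
    ]
```

Leaving the timing fields optional lets the row for the utterance of interest be created as soon as its utterance time is known and filled in as each stage finishes.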

The voice data storage unit 121 is used to store voice data acquired via the microphone 14.

The utterance time storage unit 122 is used to store user utterance times, that is, the durations of past uttered voices from the user and the duration of the uttered voice of interest from the user.

The response preparation time storage unit 123 is used to store the time required to recognize each past uttered voice from the user, the time required to generate response content data based on the result of that recognition, and the time required to synthesize the response voice corresponding to that response content data.

The response character count storage unit 124 is used to store response character counts, that is, the lengths of the response content data for past uttered voices from the user.

The filler information storage unit 125 is used to store filler information corresponding to various lengths of time.

The control unit 11 includes a CPU (Central Processing Unit) and, to execute the processing functions of this embodiment, includes a voice data acquisition unit 111, an uttered voice data extraction unit 112, a response preparation time prediction unit 113, a filler information output unit 114, a processing completion notification acquisition unit 115, a response preparation completion determination unit 116, and a response voice data output unit 117. The processing functions of all these units are implemented by causing the CPU to execute a program stored in a program memory (not shown).

The voice data acquisition unit 111 converts the voice input via the microphone 14, including the uttered voice from the user, into digital data and stores the converted voice data in the voice data storage unit 121 of the storage unit 12.

The uttered voice data extraction unit 112 reads the voice data stored in the voice data storage unit 121 of the storage unit 12, extracts from it the section in which the user is actually speaking, and inputs the extracted uttered voice data to the communication interface unit 13. The uttered voice data is transmitted to the server 2 via the communication interface unit 13, and the server 2 performs the response voice preparation processing. The uttered voice data extraction unit 112 also detects, from the extracted uttered voice data, the user utterance time of the uttered voice from the user and stores the detected user utterance time in the utterance time storage unit 122 of the storage unit 12.

The response preparation time prediction unit 113 predicts a first time required to recognize the uttered voice based on the acquired uttered voice data, a second time required to generate response content data based on the result of the recognition, and a third time required to synthesize the response voice corresponding to the response content data. Based on the predicted first, second, and third times, it predicts the delay time required from the end of the uttered voice from the user until output of the response voice to that uttered voice starts. The first, second, and third times are predicted by a speech recognition time prediction unit 1131, a response content generation time prediction unit 1132, and a speech synthesis time prediction unit 1133, respectively, which are included in the response preparation time prediction unit 113.
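
How the three sub-predictions combine into a delay time can be sketched as below. The sketch assumes the three stages run one after another, so the delay is their sum, and it assumes the second and third times are plain means of past measurements. Both are illustrative choices, and `DelayPrediction` and `predict_delay` are hypothetical names.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Sequence


@dataclass(frozen=True)
class DelayPrediction:
    first_sec: float   # speech recognition (unit 1131)
    second_sec: float  # response content generation (unit 1132)
    third_sec: float   # speech synthesis (unit 1133)

    @property
    def total_sec(self) -> float:
        # Assumption: sequential stages, so the predicted times add up.
        return self.first_sec + self.second_sec + self.third_sec


def predict_delay(utterance_sec: float, recognition_coeff: float,
                  past_generation_secs: Sequence[float],
                  past_synthesis_secs: Sequence[float]) -> DelayPrediction:
    """Combine the three predictions; fmean raises on empty histories."""
    first = recognition_coeff * utterance_sec
    second = fmean(past_generation_secs)
    third = fmean(past_synthesis_secs)
    return DelayPrediction(first, second, third)
```

For example, `predict_delay(3.0, 0.5, [0.2, 0.4], [1.0, 2.0]).total_sec` gives a delay of about 3.3 seconds.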

The speech recognition time prediction unit 1131 reads, from the utterance time storage unit 122 of the storage unit 12, the user utterance time based on the acquired uttered voice data. It also reads the user utterance times of past uttered voices stored in the utterance time storage unit 122 and the corresponding times required to recognize those past uttered voices stored in the response preparation time storage unit 123 of the storage unit 12. Based on the user utterance time of the acquired uttered voice data, the user utterance times of past uttered voices, and the times required to recognize those past uttered voices, the speech recognition time prediction unit 1131 predicts the first time required to recognize the uttered voice.

The response content generation time prediction unit 1132 reads the times required to generate past response content data stored in the response preparation time storage unit 123 of the storage unit 12 and, based on those times, predicts the second time required to generate the response content data.

The speech synthesis time prediction unit 1133 reads the response character counts of the response content data for past uttered voices stored in the response character count storage unit 124 of the storage unit 12, and the times required to synthesize the response voices corresponding to past response content data stored in the response preparation time storage unit 123 of the storage unit 12. It also receives from the server 2, via the communication interface unit 13, a notification of the response character count of the response content data for the acquired uttered voice data. Based on at least one of the past response character counts, the past synthesis times, and the notified response character count, the speech synthesis time prediction unit 1133 predicts the third time required to synthesize the response voice.

The filler information output unit 114 reads, from among the filler information for various lengths of time stored in the filler information storage unit 125 of the storage unit 12, the filler information that matches the predicted delay time, and outputs it to the speaker 15 within the delay time. Instead of using the filler information stored in the filler information storage unit 125, filler information matching the predicted delay time may be searched for and acquired from a database on the network each time.
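
One possible selection policy is sketched below: pick the longest stored filler that still fits within the predicted delay, and fall back to the shortest one when none fits. The policy, the `Filler` type, and the sample phrases are assumptions for illustration; the source only states that filler information matching the delay time is read out.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Filler:
    text: str            # phrase to be spoken (hypothetical examples below)
    duration_sec: float  # playback length of the phrase


def select_filler(fillers: Sequence[Filler], predicted_delay_sec: float) -> Filler:
    """Choose the longest filler that ends within the predicted delay."""
    if not fillers:
        raise ValueError("filler store is empty")
    fitting = [f for f in fillers if f.duration_sec <= predicted_delay_sec]
    if fitting:
        return max(fitting, key=lambda f: f.duration_sec)
    # Nothing fits: use the shortest filler rather than staying silent.
    return min(fillers, key=lambda f: f.duration_sec)


STORE = [
    Filler("Um", 0.5),
    Filler("Let me see", 1.2),
    Filler("Please wait a moment while I check", 3.0),
]
```

With this store, a predicted delay of 2.0 seconds selects "Let me see", so a longer phrase signals a longer wait to the user.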

The processing completion notification acquisition unit 115 acquires from the server 2, via the communication interface unit 13, a notification that recognition of the uttered voice has been completed, a notification that generation of the response content data has been completed, and a notification that synthesis of the response voice has been completed, each relating to the uttered voice data transmitted from the voice dialogue device 1 to the server 2.

When playback of the output filler information has finished, the response preparation completion determination unit 116 determines whether synthesis of the response voice to the user's uttered voice has been completed, based on whether all of the above notifications have been acquired.

When it is determined that synthesis of the response voice has not been completed, the filler information output unit 114 outputs additional filler information to the speaker 15.

When it is determined that synthesis of the response voice has been completed, the response voice data output unit 117 acquires the response voice data from the server 2 via the communication interface unit 13 and outputs the acquired response voice data to the speaker 15. The output response voice data is then played back from the speaker 15, and the voice dialogue with the user takes place.

FIG. 3 is a block diagram showing the functional configuration of the server 2 in the system shown in FIG. 1.

The server 2 includes a control unit 21, a storage unit 22, and a communication interface unit 23.

The communication interface unit 23 includes, for example, one or more wired or wireless communication interface units. The communication interface unit 23 acquires utterance voice data from the voice interaction device 1 via the communication network and outputs the acquired utterance voice data to the control unit 21. The communication interface unit 23 also outputs the response voice data for the utterance voice data, which is output from the control unit 21, to the voice interaction device 1 via the communication network.

The storage unit 22 uses, as its storage medium, a non-volatile memory that can be written to and read from at any time, such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive). As storage areas used to realize the present embodiment, it includes an utterance voice data storage unit 221, an utterance text data storage unit 222, a response text data storage unit 223, and a response voice data storage unit 224.

The utterance voice data storage unit 221 is used to store the utterance voice data acquired from the voice interaction device 1.

The utterance text data storage unit 222 is used to store utterance text data, which is the result of recognizing the uttered voice from the utterance voice data.

The response text data storage unit 223 is used to store response text data, which is the response content data based on the recognition result.

The response voice data storage unit 224 is used to store the response voice data corresponding to the response text data.

The control unit 21 includes a CPU (Central Processing Unit) and, to execute the processing functions of the present embodiment, includes a voice recognition function unit 211, a response content generation function unit 212, and a voice synthesis function unit 213. The processing functions of these units are all realized by causing the CPU to execute a program stored in a program memory (not shown).

The voice recognition function unit 211, the response content generation function unit 212, and the voice synthesis function unit 213 respectively recognize the uttered voice from the utterance voice data, generate the response content data, and synthesize the response voice. When the processing described below is completed in each function unit, the voice recognition function unit 211, the response content generation function unit 212, and the voice synthesis function unit 213 respectively transmit to the voice interaction device 1, via the communication interface unit 23, a notification that recognition of the uttered voice has been completed, a notification that generation of the response content data has been completed, and a notification that synthesis of the response voice has been completed.

First, the voice recognition function unit 211 includes an utterance voice data acquisition unit 2111 and an utterance text data generation unit 2112.

The utterance voice data acquisition unit 2111 acquires utterance voice data from the voice interaction device 1 via the communication interface unit 23 and stores the acquired utterance voice data in the utterance voice data storage unit 221 of the storage unit 22.

The utterance text data generation unit 2112 reads the utterance voice data stored in the utterance voice data storage unit 221 of the storage unit 22. It then generates utterance text data corresponding to the read utterance voice data and stores the generated utterance text data in the utterance text data storage unit 222 of the storage unit 22.

The response content generation function unit 212 includes a response text data generation unit 2121.

The response text data generation unit 2121 reads the utterance text data stored in the utterance text data storage unit 222 of the storage unit 22. Based on the read utterance text data, it then generates response text data as the response content data, that is, a response sentence to the user's uttered voice, and stores the generated response text data in the response text data storage unit 223 of the storage unit 22.

The voice synthesis function unit 213 includes a response voice data synthesis unit 2131 and a response voice data output unit 2132.

The response voice data synthesis unit 2131 reads the response text data stored in the response text data storage unit 223 of the storage unit 22, synthesizes response voice data corresponding to the read response text data, and stores the synthesized response voice data in the response voice data storage unit 224 of the storage unit 22.

The response voice data output unit 2132 reads the response voice data stored in the response voice data storage unit 224 of the storage unit 22 and outputs the read response voice data to the voice interaction device 1 via the communication interface unit 23.
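The server-side processing order described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of the embodiment: the function name `run_server_pipeline`, the notification strings, and the injected callables `recognize`, `generate_response`, `synthesize` and `notify` are all hypothetical, and the storage units 221 to 224 are reduced to local variables.

```python
from typing import Callable


def run_server_pipeline(
    utterance_audio: bytes,
    recognize: Callable[[bytes], str],
    generate_response: Callable[[str], str],
    synthesize: Callable[[str], bytes],
    notify: Callable[[str], None],
) -> bytes:
    """Run recognition, response generation and synthesis in order.

    After each stage a completion notification is sent, mirroring the
    three notifications the server 2 sends to the voice interaction device 1.
    All names here are illustrative assumptions.
    """
    # Voice recognition function unit 211: audio -> utterance text.
    utterance_text = recognize(utterance_audio)
    notify("recognition_done")

    # Response content generation function unit 212: text -> response text.
    response_text = generate_response(utterance_text)
    notify("generation_done")

    # Voice synthesis function unit 213: response text -> response audio.
    response_audio = synthesize(response_text)
    notify("synthesis_done")

    return response_audio
```

Passing the three stages in as callables keeps the sketch independent of any particular recognition or synthesis engine, which the description does not specify.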

(Operation)
Next, the operation of the voice interaction device 1 configured as described above will be described.

FIGS. 4A and 4B are flowcharts showing an example of the filler information output process executed by the control unit 11 of the voice interaction device 1 shown in FIG. 2.

First, in step S101, the control unit 11 synthesizes in advance, as filler information, filler voice data of various durations, for example, and stores the synthesized filler voice data in the filler information storage unit 125. For example, ten pieces of filler voice data with durations from 1 second to 10 seconds, in 1-second increments, are stored. For example, filler voice data uttering "Um..." (「ええっと」) is used as the 1-second filler voice data, and filler voice data uttering "I'm thinking, so please wait a moment" (「考えているから、ちょっと待ってね」) is used as the 3-second filler voice data. Although an example using filler voice data as the filler information is described here, the filler information is not limited to voice data. It may be, for example, text data shown on a display (not shown) that keeps the user informed of the delay time until the response voice to the user's uttered voice is output.

In step S102, under the control of the voice data acquisition unit 111, the control unit 11 converts the sound input via the microphone 14, which includes the user's uttered voice, into digital data. Under the control of the utterance voice data extraction unit 112, it then extracts from the digital data the section in which the user is actually speaking, thereby acquiring the user's utterance voice data. Under the control of the utterance voice data extraction unit 112, the control unit 11 also stores the user utterance time, which is the duration of the uttered voice based on the acquired utterance voice data, in the utterance time storage unit 122.

The acquired utterance voice data is transmitted from the voice interaction device 1 to the server 2. The server 2 recognizes the uttered voice based on the utterance voice data, generates response content data based on the recognition result, and synthesizes the response voice corresponding to the response content data.

In step S103, under the control of the response preparation time prediction unit 113, the control unit 11 predicts a first time required to recognize the uttered voice, a second time required to generate the response content data, and a third time required to synthesize the response voice. By calculating, for example, the sum of the predicted first, second, and third times, it predicts the delay time from the end of the user's uttered voice until output of the response voice to that uttered voice starts.

The first time is predicted under the control of the speech recognition time prediction unit 1131 of the response preparation time prediction unit 113. Specifically, under the control of the speech recognition time prediction unit 1131, the control unit 11 reads the user utterance times of past uttered voices, stored in the utterance time storage unit 122, and the times required to recognize the past uttered voices corresponding to those user utterance times, stored in the response preparation time storage unit 123. From the values read, the control unit 11 then calculates a coefficient relating the user utterance time to the time required to recognize the uttered voice. The coefficient is calculated, for example, as the average of the values obtained by dividing the user utterance time by the time required to recognize the uttered voice, or by fitting a linear function using the least-squares method. The control unit 11 then reads, from the utterance time storage unit 122, the user utterance time of the user's uttered voice based on the acquired utterance voice data, and predicts the first time based on the read user utterance time and the calculated coefficient.
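The least-squares option mentioned above can be sketched as follows. This is a hedged illustration only: the description does not give the exact model, so the code assumes a linear function of the form recognition time = a × utterance time + b, and the names `fit_line` and `predict_recognition_time` are hypothetical.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = a * x + b; returns (a, b)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_recognition_time(
    past_utterance_times: Sequence[float],
    past_recognition_times: Sequence[float],
    current_utterance_time: float,
) -> float:
    """Predict the first time (seconds) from the current user utterance time.

    Assumption: recognition time is modelled as a linear function of the
    user utterance time, fitted to the stored history.
    """
    slope, intercept = fit_line(past_utterance_times, past_recognition_times)
    return slope * current_utterance_time + intercept
```

A fitted intercept lets a fixed overhead, such as a constant communication delay, be absorbed into the prediction, which a pure ratio cannot represent.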

The second time is predicted under the control of the response content generation time prediction unit 1132 of the response preparation time prediction unit 113. Specifically, under the control of the response content generation time prediction unit 1132, the control unit 11 reads the times required to generate past response content data, stored in the response preparation time storage unit 123. It then calculates the average of the times required for a predetermined number of past response content data generations and predicts the second time based on the calculated average.

The third time is predicted under the control of the speech synthesis time prediction unit 1133 of the response preparation time prediction unit 113. Specifically, under the control of the speech synthesis time prediction unit 1133, the control unit 11 reads the times required to synthesize the response voices corresponding to past response content data, stored in the response preparation time storage unit 123. It then calculates the average of the times required for a predetermined number of past response voice syntheses and predicts the third time based on the calculated average.
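The averaging used for the second and third times can be sketched as below. This is a minimal example under assumptions: the function name `mean_of_recent` is hypothetical, and taking the most recent entries as the "predetermined number" of past measurements is an assumption, since the description does not say which past measurements are selected.

```python
from typing import Sequence


def mean_of_recent(durations: Sequence[float], count: int) -> float:
    """Average the last `count` stored processing times (seconds).

    If fewer than `count` measurements exist, all of them are averaged.
    """
    if count <= 0:
        raise ValueError("count must be positive")
    recent = list(durations)[-count:]
    if not recent:
        raise ValueError("no past processing times are stored")
    return sum(recent) / len(recent)
```

The same helper could serve both the response content generation history and the response voice synthesis history, since both predictions are described as plain averages.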

In step S104, under the control of the filler information output unit 114, the control unit 11 reads, from the filler voice data stored in the filler information storage unit 125, the filler voice data whose duration is closest to the delay time predicted from, for example, the sum of the first, second, and third times, and outputs the read filler voice data to the speaker 15. As a result, the filler is uttered from the speaker 15 within the delay time.
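Steps S103 and S104 can be illustrated with the short sketch below. The function names and the sample times are hypothetical, and breaking ties in favour of the shorter filler is an assumption that the description does not state.

```python
from typing import Iterable


def predict_delay(first: float, second: float, third: float) -> float:
    """Delay time as the sum of the three predicted times (seconds)."""
    return first + second + third


def nearest_duration(durations: Iterable[float], target: float) -> float:
    """Return the stored filler duration closest to `target`.

    Candidates are sorted first, so a tie resolves to the shorter filler
    (an assumption made for this sketch).
    """
    candidates = sorted(durations)
    if not candidates:
        raise ValueError("no filler voice data is stored")
    return min(candidates, key=lambda d: abs(d - target))


# Ten fillers of 1 s to 10 s in 1 s steps, as in the example of step S101.
stored_durations = [float(s) for s in range(1, 11)]
delay = predict_delay(0.8, 0.5, 1.9)          # illustrative values only
chosen = nearest_duration(stored_durations, delay)
```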

In step S105, under the control of the response preparation completion determination unit 116, when playback of the output filler voice data has finished, the control unit 11 determines whether synthesis of the response voice to the user's uttered voice has been completed. The determination is based on whether the notification that recognition of the uttered voice has been completed, the notification that generation of the response content data has been completed, and the notification that synthesis of the response voice has been completed have all been acquired from the server 2.

If it is determined in step S105 that synthesis of the response voice has been completed, then in step S106, under the control of the response voice data output unit 117, the control unit 11 acquires the response voice data from the server 2 and outputs the acquired response voice data to the speaker 15. The output response voice data is then played back from the speaker 15, and the voice dialogue with the user takes place.

If it is determined in step S105 that synthesis of the response voice has not been completed, processing for uttering an additional filler is executed to let the user know that a further wait is needed before the response voice is output.

First, in step S107, under the control of the response preparation completion determination unit 116, the control unit 11 further determines whether recognition of the uttered voice and generation of the response content data have been completed, based on whether the notification that recognition of the uttered voice has been completed and the notification that generation of the response content data has been completed have been acquired.

If it is determined in step S107 that recognition of the uttered voice and generation of the response content data have been completed, then in step S108 the control unit 11 re-predicts the delay time under the control of the response preparation time prediction unit 113.

Specifically, under the control of the speech synthesis time prediction unit 1133 of the response preparation time prediction unit 113, the control unit 11 reads the number of response characters, that is, the length of the response content data for past uttered voices, stored in the response character number storage unit 124. It also reads the times required to synthesize the response voices corresponding to the past response content data having those numbers of response characters, stored in the response preparation time storage unit 123. From the values read, the control unit 11 calculates a coefficient relating the number of response characters to the time required to synthesize the response voice. The coefficient is calculated, for example, as the average of the values obtained by dividing the number of response characters by the time required to synthesize the response voice, or by fitting a linear function using the least-squares method. The control unit 11 then receives from the server 2 a notification of the number of response characters of the response content data for the acquired utterance voice data, and re-predicts the third time based on that number of response characters and the calculated coefficient. The delay time is re-predicted based on the re-predicted third time. In re-predicting the delay time, the first and second times predicted in step S103 may be used. Alternatively, instead of those predicted values, the timings at which the notification that recognition of the uttered voice has been completed and the notification that generation of the response content data has been completed were acquired, under the control of the processing completion notification acquisition unit 115, may be measured and used.
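The ratio-based option for re-predicting the third time can be sketched as follows. The coefficient is computed as described, as the average of (number of response characters ÷ synthesis time), which amounts to a synthesis rate in characters per second. Dividing the notified number of response characters by that rate is an assumption about how the coefficient is applied, and the function names are hypothetical.

```python
from typing import Sequence


def synthesis_rate(
    past_char_counts: Sequence[int], past_synthesis_times: Sequence[float]
) -> float:
    """Average of (response characters / synthesis time), in characters per second."""
    if not past_char_counts or len(past_char_counts) != len(past_synthesis_times):
        raise ValueError("need matching, non-empty histories")
    if any(t <= 0 for t in past_synthesis_times):
        raise ValueError("synthesis times must be positive")
    ratios = [c / t for c, t in zip(past_char_counts, past_synthesis_times)]
    return sum(ratios) / len(ratios)


def repredict_synthesis_time(
    past_char_counts: Sequence[int],
    past_synthesis_times: Sequence[float],
    notified_char_count: int,
) -> float:
    """Re-predict the third time from the character count notified by the server.

    Assumption: predicted time = notified characters / average rate.
    """
    rate = synthesis_rate(past_char_counts, past_synthesis_times)
    if rate <= 0:
        raise ValueError("rate must be positive")
    return notified_char_count / rate
```

The re-predicted delay time would then be obtained by combining this value with the first and second times, either as predicted in step S103 or as measured from the notification timings.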

In step S109, under the control of the filler information output unit 114, the control unit 11 reads, from the filler voice data stored in the filler information storage unit 125, additional filler voice data whose duration is closest to, for example, the time obtained by subtracting the time elapsed since the end of the user's uttered voice from the re-predicted delay time, and outputs the read filler voice data to the speaker 15. As a result, the additional filler is uttered from the speaker 15 within the re-predicted delay time.
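The selection in step S109 can be sketched as below. The function names are hypothetical, and clamping a negative remaining time to zero (so that the shortest filler is chosen when the re-predicted delay has already elapsed) is an assumption not found in the description.

```python
from typing import Iterable


def remaining_time(repredicted_delay: float, elapsed: float) -> float:
    """Re-predicted delay minus time elapsed since the utterance ended.

    Clamped at zero as an assumption for this sketch.
    """
    return max(0.0, repredicted_delay - elapsed)


def select_additional_filler(
    durations: Iterable[float], repredicted_delay: float, elapsed: float
) -> float:
    """Return the stored filler duration closest to the remaining time."""
    candidates = sorted(durations)
    if not candidates:
        raise ValueError("no filler voice data is stored")
    target = remaining_time(repredicted_delay, elapsed)
    return min(candidates, key=lambda d: abs(d - target))
```

For instance, with fillers of 1 to 10 seconds, a re-predicted delay of 6.0 seconds and 3.4 seconds already elapsed, the remaining 2.6 seconds selects the 3-second filler.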

If it is determined in step S107 that recognition of the uttered voice and generation of the response content data have not been completed, then in step S110, under the control of the filler information output unit 114, the control unit 11 reads filler voice data at random from the filler voice data stored in the filler information storage unit 125 and outputs the read filler voice data to the speaker 15. As a result, the randomly read additional filler is uttered from the speaker 15.

When playback of the additional filler voice data output in step S109 has finished, in step S111, under the control of the response preparation completion determination unit 116, the control unit 11 determines whether synthesis of the response voice to the user's uttered voice has been completed, in the same way as described for step S105.

If it is determined in step S111 that synthesis of the response voice has been completed, then in step S112, under the control of the response voice data output unit 117, the control unit 11 outputs the response voice data to the speaker 15, in the same way as described for step S106. The output response voice data is then played back from the speaker 15, and the voice dialogue with the user takes place.

If it is determined in step S111 that synthesis of the response voice has not been completed, then in step S113, under the control of the filler information output unit 114, the control unit 11 outputs randomly read filler voice data to the speaker 15, in the same way as described for step S110. As a result, the randomly read additional filler is uttered from the speaker 15.

After the randomly read additional filler is uttered in step S110, the operation is repeated from step S105. After the randomly read additional filler is uttered in step S113, the operation is repeated from step S111.
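The branching of steps S104 to S113 can be summarized in the sketch below. It is a simplified model under assumptions, not the flowchart of FIGS. 4A and 4B itself: `play` is assumed to block until playback of a filler finishes, `notifications` returns the set of completion notifications received so far, `repredict_remaining` stands in for steps S108 and S109, and all names and notification strings are hypothetical. Output of the response voice (steps S106 and S112) is left to the caller.

```python
import random
from typing import Callable, Sequence, Set

ALL_DONE = frozenset({"recognition_done", "generation_done", "synthesis_done"})
TEXT_READY = frozenset({"recognition_done", "generation_done"})


def filler_loop(
    durations: Sequence[float],
    predicted_delay: float,
    play: Callable[[float], None],
    notifications: Callable[[], Set[str]],
    repredict_remaining: Callable[[], float],
    rng: random.Random,
) -> None:
    """Play fillers until the response voice is ready (S104 to S113)."""

    def nearest(target: float) -> float:
        return min(sorted(durations), key=lambda d: abs(d - target))

    play(nearest(predicted_delay))                    # S104
    while True:
        received = notifications()                    # S105
        if ALL_DONE <= received:
            return                                    # S106: caller outputs response
        if TEXT_READY <= received:                    # S107
            play(nearest(repredict_remaining()))      # S108, S109
            while not (ALL_DONE <= notifications()):  # S111
                play(rng.choice(list(durations)))     # S113, back to S111
            return                                    # S112: caller outputs response
        play(rng.choice(list(durations)))             # S110, back to S105
```

Injecting the random generator makes the random filler choice reproducible in tests, while the two return points correspond to the two places where the response voice data is output.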

(Effects)
As described in detail above, the first embodiment of the present invention provides the following effects.

(1) Under the control of the voice data acquisition unit 111 and the utterance voice data extraction unit 112, utterance voice data for the user's uttered voice is acquired. The acquired utterance voice data is transmitted to the server 2, where the uttered voice is recognized based on the utterance voice data, response content data is generated based on the recognition result, and the response voice corresponding to the response content data is synthesized. Under the control of the response preparation time prediction unit 113, the first time required to recognize the uttered voice, the second time required to generate the response content data, and the third time required to synthesize the response voice are predicted based on the user utterance time derived from the utterance voice data and on information about the response content data for past uttered voices. By calculating the sum of the predicted first, second, and third times, the delay time from the end of the user's uttered voice until output of the response voice to that uttered voice starts is predicted.

Thus, the first time required to recognize the uttered voice can be predicted with high accuracy by using the user utterance time of the utterance voice data. As for the second time required to generate the response content data, the processing time for generating response content data in many cases varies little with differences in the recognized uttered voice, so a highly reliable prediction can be made by using information about past response content data. The third time required to synthesize the response voice can likewise be predicted reliably by using information about past response content data.

Depending on the implementation, delays due to communication processing and the like may also arise, for example when the response voice is prepared by a separate device. Because actual information about past uttered voices is used in this way, the processing time can be predicted inclusive of such communication processing time, provided that the uttered voice of interest is processed under the same conditions as those under which the past information was acquired.

(2) Under the control of the filler information output unit 114, the filler voice data whose duration is closest to the predicted delay time is read from the stored filler voice data and output to the speaker 15, and the filler is uttered from the speaker 15 within the delay time.

As a result, the user who produced the uttered voice is no longer made anxious by the absence of a response. Moreover, if, for example, filler voices that utter meaningful words with durations corresponding to the delay time are output, the user can estimate from the type of filler voice how long to wait until the response voice is returned, which reassures the user further.

(3) Under the control of the response preparation completion determination unit 116, when playback of the output filler voice data has finished, it is determined whether synthesis of the response voice to the user's uttered voice has been completed. If it is determined that synthesis has not been completed, additional filler voice data is output to the speaker 15 under the control of the filler information output unit 114.

For this reason, even when the user who uttered the voice must wait further for the response voice after reproduction of the output filler voice data has ended, the user is not made anxious by the absence of a response.
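
The check performed at the end of each filler can be sketched, purely for illustration, as the loop below. The callables `play` and `synthesis_done` stand in for the speaker output and the completion determination, and the `max_extra` cap is a safeguard added for this sketch; none of these names appear in the embodiment.

```python
def play_fillers_until_ready(first_filler, extra_filler, play, synthesis_done,
                             max_extra=5):
    """Play a filler, then keep adding fillers until synthesis has completed.

    play(filler) is assumed to block until playback of the filler ends.
    synthesis_done() is assumed to report whether the response voice is ready.
    """
    played = [first_filler]
    play(first_filler)
    # When each playback ends, check whether the response voice is ready.
    while not synthesis_done() and len(played) - 1 < max_extra:
        play(extra_filler)
        played.append(extra_filler)
    return played
```

The function returns the fillers in the order in which they were played, which makes the behavior easy to inspect with stub callables.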

(4) Under the control of the response preparation completion determination unit 116, when it is determined that the synthesis of the response voice to the uttered voice from the user has not been completed, it is further determined whether the recognition of the uttered voice and the generation of the response content data have been completed. When these are determined to have been completed, the third time is re-predicted, under the control of the voice synthesis time prediction unit 1133, based on the number of response characters of the notified response content data and on information on response content data for past uttered voices. Under the control of the response preparation time prediction unit 113, the delay time is re-predicted based on the re-predicted third time. Under the control of the filler information output unit 114, additional filler voice data whose duration is closest to the time obtained by subtracting the elapsed time since the end of the user's uttered voice from the re-predicted delay time is read from the stored filler voice data and output to the speaker 15, and the speaker 15 utters the filler within the re-predicted delay time.

In this way, using the number of characters of the response content data allows the third time to be re-predicted with high accuracy, and consequently the delay time is also re-predicted with high accuracy. The additional filler information can therefore be matched to the time the user still has to wait, which further reassures the user.
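
As a non-limiting sketch, the re-prediction based on the number of response characters and the choice of an additional filler for the remaining wait can be written as follows. All names are hypothetical. The ratio-of-totals coefficient, the summation of the three times, and the clamping of the remaining time at zero are assumptions of this sketch rather than features stated in the embodiment.

```python
def synthesis_coefficient(past_char_counts, past_synthesis_sec):
    """Synthesis time per response character (assumed: ratio of totals)."""
    total_chars = sum(past_char_counts)
    if total_chars <= 0:
        raise ValueError("need at least one past response with characters")
    return sum(past_synthesis_sec) / total_chars


def remaining_wait(first_sec, second_sec, response_chars,
                   past_char_counts, past_synthesis_sec, elapsed_sec):
    """Re-predict the delay and return the time the user still has to wait."""
    # Re-predicted third time: scales with the number of response characters.
    third = synthesis_coefficient(past_char_counts, past_synthesis_sec) * response_chars
    # Assumption: the delay is the sum of the first, second, and third times.
    delay = first_sec + second_sec + third
    # Subtract the time already elapsed since the end of the utterance.
    return max(0.0, delay - elapsed_sec)


def pick_additional_filler(filler_durations, remaining_sec):
    """Name of the filler whose duration is closest to the remaining wait."""
    if not filler_durations:
        raise ValueError("no filler voice data stored")
    return min(filler_durations,
               key=lambda name: abs(filler_durations[name] - remaining_sec))
```

For instance, if the re-predicted delay is 3.5 seconds and 2.0 seconds have already elapsed, the filler closest to 1.5 seconds would be chosen as the additional filler.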

[Other embodiments]
The present invention is not limited to the first embodiment. For example, in the first embodiment, the dialogue with the user is realized by a combination of the voice interaction device and the server. However, the voice interaction device and the server may be realized as a single device. Further, although the first embodiment describes a voice interaction device that outputs both the response voice and the filler information, these outputs may be realized by separate devices.

In addition, the types and configurations of the voice interaction device and the server, the processing for preparing a response voice to the uttered voice, and the like can be variously modified without departing from the gist of the present invention.

In short, the present invention is not limited to the first embodiment as it is, and at the implementation stage its constituent elements can be modified and embodied without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the plurality of constituent elements disclosed in the first embodiment. For example, some constituent elements may be deleted from all the constituent elements shown in the first embodiment. Further, constituent elements of different embodiments may be combined as appropriate.

DESCRIPTION OF SYMBOLS 1: voice interaction device, 11: control unit, 111: voice data acquisition unit, 112: uttered voice data extraction unit, 113: response preparation time prediction unit, 1131: voice recognition time prediction unit, 1132: response content generation time prediction unit, 1133: voice synthesis time prediction unit, 114: filler information output unit, 115: processing completion notification acquisition unit, 116: response preparation completion determination unit, 117: response voice data output unit, 12: storage unit, 121: voice data storage unit, 122: utterance time storage unit, 123: response preparation time storage unit, 124: response character count storage unit, 125: filler information storage unit, 13: communication interface unit, 14: microphone, 15: speaker, 2: server, 21: control unit, 211: voice recognition function unit, 2111: uttered voice data acquisition unit, 2112: utterance text data generation unit, 212: response content generation function unit, 2121: response text data generation unit, 213: voice synthesis function unit, 2131: response voice data synthesis unit, 2132: response voice data output unit, 22: storage unit, 221: uttered voice data storage unit, 222: utterance text data storage unit, 223: response text data storage unit, 224: response voice data storage unit, 23: communication interface unit

Claims

A voice processing device used for recognizing an uttered voice from a user, generating response content data based on the result of the recognition, and synthesizing a response voice corresponding to the response content data, the device comprising:
a prediction unit that predicts, based on the length of the uttered voice and information on past response content data, a delay time required from the end of the uttered voice to the start of output of the response voice; and
a filler information output unit that outputs, within the predicted delay time, filler information according to the delay time,
wherein the prediction unit includes:
means for detecting the length of the uttered voice and predicting, based on the detected length, a first time required for recognition of the uttered voice;
means for predicting, based on the information on the past response content data, a second time required for generating response content data based on a result of the recognition of the uttered voice;
means for predicting, based on the information on the past response content data, a third time required for synthesizing a response voice corresponding to the generated response content data; and
means for predicting, based on the predicted first, second, and third times, the delay time required from the end of the uttered voice to the start of output of the response voice.

The voice processing device according to claim 1, wherein the means for predicting the first time includes:
means for calculating, based on the lengths of past uttered voices from the user and the times required for recognition of uttered voices of those lengths, a coefficient relating the length of an uttered voice to the time required for its recognition; and
means for predicting the first time based on the detected length of the uttered voice and the calculated coefficient.

The voice processing device according to claim 1 or 2, wherein the information on the past response content data includes the time required for synthesizing a response voice corresponding to the past response content data, and
the means for predicting the third time predicts the third time based on the average of the times required for synthesizing the response voices over a predetermined number of past occasions.

The voice processing device according to claim 1, further comprising a determination unit that determines, when reproduction of the output filler information is completed, whether the synthesis of the response voice to the uttered voice from the user has been completed,
wherein the filler information output unit outputs additional filler information when it is determined that the synthesis of the response voice has not been completed.

The voice processing device according to claim 1, further comprising a determination unit that determines, when reproduction of the output filler information is completed, whether the synthesis of the response voice to the uttered voice from the user has been completed and, when it is determined that the synthesis of the response voice has not been completed, further determines whether the recognition of the uttered voice from the user and the generation of response content data based on the result of the recognition have been completed,
wherein the information on the past response content data includes the lengths of response content data for past uttered voices from the user and the times required for synthesizing response voices corresponding to response content data of those lengths,
the means for predicting the third time comprises:
means for calculating, based on the lengths of the response content data for past uttered voices from the user and the times required for synthesizing the corresponding response voices, a coefficient relating the length of response content data to the time required for synthesizing a response voice; and
means for re-predicting the third time, when it is determined that the synthesis of the response voice has not been completed and that the recognition of the uttered voice from the user and the generation of the response content data based on the result of the recognition have been completed, based on the length of the response content data for the uttered voice from the user and the calculated coefficient,
the means for predicting the delay time re-predicts, based on the re-predicted third time, the delay time required from the end of the uttered voice from the user to the start of output of the response voice, and
the filler information output unit outputs, within the re-predicted delay time, additional filler information according to the time obtained by subtracting the elapsed time since the end of the uttered voice from the user from the re-predicted delay time.

A voice processing method performed by an apparatus including a computer and a memory, used for recognizing an uttered voice from a user, generating response content data based on the result of the recognition, and synthesizing a response voice corresponding to the response content data, the method comprising:
a step of predicting, based on the length of the uttered voice and information on past response content data, a delay time required from the end of the uttered voice to the start of output of the response voice; and
a step of outputting, within the predicted delay time, filler information according to the delay time,
wherein the predicting step includes:
a step of detecting the length of the uttered voice and predicting, based on the detected length, a first time required for recognition of the uttered voice;
a step of predicting, based on the information on the past response content data, a second time required for generating response content data based on a result of the recognition of the uttered voice;
a step of predicting, based on the information on the past response content data, a third time required for synthesizing a response voice corresponding to the generated response content data; and
a step of predicting, based on the predicted first, second, and third times, the delay time required from the end of the uttered voice to the start of output of the response voice.

A program for causing a computer to function as each of the units and means of the voice processing device according to any one of claims 1 to 5.