WO2019039352A1 - Information processing device, control method, and program - Google Patents
Information processing device, control method, and program
- Publication number
- WO2019039352A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- person
- utterance
- directed
- voice data
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/24—Speech recognition using non-acoustical features
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/22—Interactive procedures; Man-machine interfaces
Definitions
- the object 20 is an interactive robot.
- the front direction of the face of the person 10-1 who is the speaker is directed to the object 20.
- the direction of the line of sight of the person 10-1, however, differs from the front direction of the face.
- the information processing apparatus 2000 determines, using the line of sight of the person 10-1, whether or not the utterance included in the voice data is directed from the person 10 to the robot. In the example in the right column of FIG. 1, it is determined that the utterance included in the voice data is not directed from the person 10 to the robot, and the robot therefore does not respond to this utterance.
- the computer 1000 may be realized using a plurality of computers.
- the image analysis unit 2020 and the voice determination unit 2040 can be realized by different computers.
- the program modules stored in the storage device of each computer may be only those corresponding to the functional components realized by that computer.
- the camera 30 is an arbitrary camera that captures an image of the person 10 and generates moving image data.
- the captured image is a moving image frame that constitutes this moving image data.
- the camera 30 may be installed on the object 20 or may be installed at a location other than the object 20.
- the object 20 is a robot.
- the camera 30 installed on the object 20 is, for example, a camera (treated as an eye of the robot) used to visually recognize the surrounding situation.
- the microphone 40 may be installed on the object 20 or may be installed in a place other than the object 20.
- the object 20 is a robot.
- the microphone 40 installed on the object 20 is, for example, a microphone (treated as an ear of the robot) used by the robot to aurally recognize the surrounding situation.
- the image analysis unit 2020 acquires a captured image (S102).
- the method by which the image analysis unit 2020 acquires a captured image is arbitrary.
- the image analysis unit 2020 receives a captured image transmitted from the camera 30.
- the image analysis unit 2020 accesses the camera 30 and acquires a captured image stored in the camera 30.
- the image analysis unit 2020 may acquire all captured images generated by the camera 30, or may acquire only some of them. In the latter case, for example, the image analysis unit 2020 acquires one out of every predetermined number of captured images generated by the camera 30.
- the image analysis unit 2020 estimates the line of sight for each of the plurality of people.
- the timing at which the voice determination unit 2040 acquires speech data is arbitrary. For example, every time the object 20 newly generates speech data, the newly generated speech data is transmitted to the information processing apparatus 2000; in this case, the voice determination unit 2040 acquires the speech data at the timing when it is newly generated. Alternatively, for example, the information processing apparatus 2000 may periodically access the object 20, or a storage unit communicably connected to the object 20, to acquire speech data that has not yet been acquired.
- the information processing apparatus 2000 can specify the speaker of the speech represented by the speech data.
- Various methods can be used for this identification.
- the speaker can be identified based on the movement of the person's mouth included in the captured image during the period in which the utterance is performed.
- the information processing apparatus 2000 performs image analysis on each captured image generated during the period in which the utterance represented by the utterance data is performed, thereby detecting a person whose mouth is moving during that period, and identifies that person as the speaker.
- the information processing apparatus 2000 identifies the person whose mouth moves for the longest time during that period as the speaker.
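- As a minimal illustration of this mouth-movement approach, a Python sketch follows; the frame format and the `detect_mouth_open()` detector are illustrative assumptions, not interfaces from the publication (a real detector might measure lip aperture from facial landmarks):

```python
from collections import defaultdict

def detect_mouth_open(face: dict) -> bool:
    """Stub: a real detector would measure lip aperture from facial landmarks."""
    return face.get("mouth_open", False)

def identify_speaker(frames, utterance_start, utterance_end):
    """Return the id of the person whose mouth moves longest during the utterance.

    frames: iterable of dicts such as
      {"t": 12.00, "dt": 0.04, "faces": {"ID0001": {"mouth_open": True}}}
    """
    mouth_time = defaultdict(float)
    for frame in frames:
        if utterance_start <= frame["t"] <= utterance_end:
            for person_id, face in frame["faces"].items():
                if detect_mouth_open(face):
                    mouth_time[person_id] += frame["dt"]
    # The person with the longest mouth-moving time is taken as the speaker.
    return max(mouth_time, key=mouth_time.get) if mouth_time else None
```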
- other existing techniques can also be used to specify the speaker of the utterance contained in the audio data.
- the characteristics of each of a plurality of speaker candidates may be determined in advance, and this information may be used to identify the speaker.
- for example, for each person, information that associates the person's identification information with words having a high probability of being included in that person's utterances is determined in advance and stored in the storage unit.
- the information processing apparatus 2000 determines whether a word included in the utterance is associated with the identification information of each person. For example, in the above-mentioned household example, suppose that image analysis of the captured images generated during the period in which the speech is performed shows that both the father A and the mother B move their mouths. In this case, the information processing apparatus 2000 determines whether a word included in the utterance is associated with the identification information of the father A or of the mother B. For example, if the utterance includes the word "stock price", the utterance includes a word associated with the identification information of the father A, so the information processing apparatus 2000 identifies the speaker as the father A. In this way, by using words associated with each person in advance, the speaker can be identified with higher accuracy.
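- A hedged sketch of this word-association step follows; the household names and keyword lists are illustrative only:

```python
# Pre-registered words with a high probability of appearing in each
# person's utterances (illustrative values, stored in the storage unit).
PERSON_KEYWORDS = {
    "father_A": ("stock price", "golf"),
    "mother_B": ("recipe", "yoga"),
}

def disambiguate(candidates, utterance_text):
    """Among mouth-moving candidates, pick the one whose registered words occur."""
    scores = {
        person: sum(word in utterance_text for word in PERSON_KEYWORDS.get(person, ()))
        for person in candidates
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None  # no keyword hit: leave undecided

# e.g. disambiguate(["father_A", "mother_B"], "I heard the stock price rose")
# -> "father_A"
```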
- the voice determination unit 2040 determines whether the utterance data represents an utterance directed from the person 10 to the object 20 (S108). The determination as to whether a certain utterance is directed from the person 10 to the object 20 is made based on the line of sight of the person 10 during the period in which the utterance is performed. When two or more persons 10 appear in the captured image, the determination is made based on the line of sight of the speaker of the utterance represented by the speech data.
- FIG. 5 is a diagram illustrating the relationship between the length of the period during which the utterance represented by the utterance data is performed and the length of the period during which the line of sight of the person 10 is directed to the object 20.
- the period during which the utterance represented by the utterance data is performed is from time t1 to time t6, and the length of the period is p1.
- the line of sight of the person 10 is directed to the object 20 during the period from time t2 to time t3 and the period from time t4 to time t5, and the lengths of these periods are p2 and p3, respectively.
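- The publication leaves the exact decision criterion open; below is a minimal sketch under the assumption that the utterance counts as directed when the gaze-on-object fraction (p2 + p3) / p1 reaches a threshold (the 0.5 value is an assumption):

```python
def directed_gaze_ratio(utterance, gaze_intervals):
    """Fraction of the utterance period spent gazing at the object.

    utterance: (t1, t6); gaze_intervals: [(t2, t3), (t4, t5), ...]
    """
    t_start, t_end = utterance
    p1 = t_end - t_start                                     # utterance length
    gazed = sum(
        max(0.0, min(t_end, g_end) - max(t_start, g_start))  # p2, p3, ...
        for g_start, g_end in gaze_intervals
    )
    return gazed / p1 if p1 > 0 else 0.0

# Decision under the assumed criterion:
# is_directed = directed_gaze_ratio((t1, t6), [(t2, t3), (t4, t5)]) >= 0.5
```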
- the voice determination unit 2040 uses the line-of-sight direction of the person 10 estimated by the image analysis unit 2020 and the start point of the line of sight of the person 10 in the captured image (for example, the center of the iris) to determine whether or not the line of sight intersects the object 20. When the line of sight of the person 10 intersects the object 20, the voice determination unit 2040 determines that the line of sight of the person 10 is directed to the object 20. On the other hand, when the line of sight of the person 10 does not intersect the object 20, the voice determination unit 2040 determines that the line of sight of the person 10 is not directed to the object 20.
- the image analysis unit 2020 may specify a single line of sight of the person 10 based on the lines of sight of both eyes. For example, the image analysis unit 2020 sets the midpoint between the iris center of the left eye and the iris center of the right eye of the person 10 as the start point of the line of sight of the person 10, and sets the sum of the vector representing the line-of-sight direction of the left eye and the vector representing the line-of-sight direction of the right eye as the line-of-sight direction of the person 10. The voice determination unit 2040 then determines whether the line of sight of the person 10 is directed to the object 20 by determining whether the single line of sight specified in this manner intersects the object 20.
- the voice determination unit 2040 may determine not whether the line of sight of the person 10 intersects the object 20 itself, but whether it intersects a range of a predetermined size that includes the object 20. This is because when a person looks at an object while talking, the line of sight does not necessarily intersect the object itself and sometimes rests near the object instead.
- the predetermined range is, for example, a range in which the size of the object 20 is enlarged at a predetermined rate (for example, 10%).
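- A hedged geometric sketch of this gaze test follows: both eyes are merged into one ray as described above, and intersection is checked against the object's axis-aligned bounding box enlarged by a margin. The box representation and slab test are implementation assumptions, not details from the publication:

```python
import numpy as np

def gaze_hits_object(left_eye, right_eye, left_dir, right_dir,
                     box_min, box_max, margin=0.10):
    """Ray-box slab test between the combined gaze ray and the enlarged box."""
    origin = (np.asarray(left_eye, float) + np.asarray(right_eye, float)) / 2.0
    direction = np.asarray(left_dir, float) + np.asarray(right_dir, float)
    center = (np.asarray(box_min, float) + np.asarray(box_max, float)) / 2.0
    half = (np.asarray(box_max, float) - center) * (1.0 + margin)  # +10% range
    with np.errstate(divide="ignore", invalid="ignore"):
        t1 = (center - half - origin) / direction
        t2 = (center + half - origin) / direction
    t_near = np.nanmax(np.minimum(t1, t2))
    t_far = np.nanmin(np.maximum(t1, t2))
    # The gaze is directed at the object if the forward ray crosses the box.
    return bool(t_far >= max(t_near, 0.0))
```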
- the voice determination unit 2040 stores the speech data in the storage unit 50 (S110).
- the storage unit 50 may store not only speech data representing an utterance directed from the person 10 to the object 20, but also speech data representing an utterance not so directed. In this case, the storage unit 50 stores each piece of utterance data in association with information indicating whether or not the utterance is directed from the person 10 to the object 20.
- the speaker 206 indicates identification information for identifying the person 10 who has made the utterance represented by the utterance data 202.
- the identification information is a feature amount of the face of the speaker obtained from the captured image.
- the method of identifying the speaker is as described above.
- the identification information may be an identifier (such as a unique number) assigned to the person 10.
- the information processing apparatus 2000 repeatedly detects the faces of people in the captured images, and when a new person is detected, stores a new identifier in the storage unit in association with the feature amount of that person's face. The voice determination unit 2040 then sets, in the utterer 206, the identifier associated with the feature amount of the face of the person 10 who made the utterance represented by the utterance data 202.
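- A minimal sketch of this identifier assignment follows; the cosine-similarity matching and the 0.8 threshold are assumptions, not values from the publication:

```python
import numpy as np

class FaceRegistry:
    """Assign a fresh identifier when a face matches no stored feature amount."""

    def __init__(self, threshold=0.8):
        self.features = {}       # identifier -> face feature vector
        self.threshold = threshold
        self._next = 1

    def identify(self, feature):
        f = np.asarray(feature, float)
        for person_id, known in self.features.items():
            sim = f @ known / (np.linalg.norm(f) * np.linalg.norm(known))
            if sim >= self.threshold:
                return person_id              # already-registered person
        person_id = f"ID{self._next:04d}"     # e.g. "ID0001"
        self._next += 1
        self.features[person_id] = f          # store the new face feature
        return person_id
```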
- FIG. 7 is a block diagram illustrating the functional configuration of the information processing apparatus 2000 of the second embodiment.
- the information processing apparatus 2000 of the second embodiment has the same function as the information processing apparatus 2000 of the first embodiment except for the points described below.
- the feature data generation unit 2060 generates feature data of the person 10 by extracting keywords representing the various kinds of information described above from the speech data that the voice determination unit 2040 has determined to represent an utterance directed from the person 10 to the object 20.
- existing techniques can be used to extract keywords from an utterance.
- the feature data is a set of all the keywords extracted from the speech data.
- the feature data is a set of keywords having high importance among keywords extracted from speech data.
- the importance of a keyword is represented by the frequency with which the keyword is included in the speech data. That is, the higher the frequency of the keyword included in the utterance, the higher the degree of importance.
- it is preferable that the extracted keyword be further associated with an attribute of the keyword.
- an attribute "schedule" is associated with the keyword.
- an attribute “interest” is associated with the keyword.
- a plurality of attributes may be associated with one keyword.
- existing techniques can be used to specify, from the utterance, an attribute related to a keyword.
- FIG. 8 is a diagram illustrating feature data in the form of a table.
- the table of FIG. 8 is called a table 300.
- the table 300 has three columns: keyword 302, attribute 304, and importance 306.
- a table 300 representing feature data of a person is associated with identification information of the person.
- FIG. 8 shows feature data of a person specified by “ID 0001”.
- the feature data generation unit 2060 gives priority to keywords extracted from speech data representing an utterance directed from the person 10 to the object 20 over keywords extracted from speech data not representing such an utterance.
- for example, when a keyword A extracted from utterance data representing an utterance directed from the person 10 to the object 20 contradicts a keyword B extracted from utterance data not representing such an utterance, the feature data generation unit 2060 includes the keyword A in the feature data and does not include the keyword B.
- the feature data generation unit 2060 calculates the above-mentioned importance by giving keywords extracted from utterance data representing an utterance directed from the person 10 to the object 20 a larger weight than keywords extracted from utterance data not representing such an utterance. For example, the importance of a keyword can be calculated as the product of its frequency and its weight.
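- A hedged sketch of this weighted importance calculation follows; the two weight values are assumptions chosen only to favor keywords from utterances directed at the object 20:

```python
from collections import Counter

DIRECTED_WEIGHT = 1.0    # keyword heard in an utterance directed at the object
UNDIRECTED_WEIGHT = 0.3  # keyword heard elsewhere (both values are assumptions)

def build_feature_data(occurrences):
    """occurrences: iterable of (keyword, attribute, was_directed) tuples.

    Summing a per-occurrence weight is equivalent to frequency x weight,
    so directed utterances dominate the resulting importance scores.
    """
    importance = Counter()
    attributes = {}
    for keyword, attribute, was_directed in occurrences:
        importance[keyword] += DIRECTED_WEIGHT if was_directed else UNDIRECTED_WEIGHT
        attributes[keyword] = attribute
    return [{"keyword": kw, "attribute": attributes[kw], "importance": imp}
            for kw, imp in importance.most_common()]

# e.g. build_feature_data([("travel", "schedule", True),
#                          ("travel", "schedule", True),
#                          ("hot spring", "interest", False)])
```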
- by specifying the conversation partner, the feature data of the person 10 can be generated in more detail. Specifically, when a certain keyword is included in the feature data of the person 10, the partner to whom the utterance related to that keyword was spoken is also included in the feature data as a related person associated with the keyword.
- the feature data generation unit 2060 estimates from the speech of the person A that the person A is going to travel.
- the feature data generation unit 2060 includes information “keyword: travel, attribute: schedule” in the feature data of the person A.
- the feature data generation unit 2060 estimates whether the person A will travel alone or with other people (that is, whether there is another person related to the travel), and also includes the estimation result in the feature data of the person A.
- Example of hardware configuration: the hardware configuration of a computer that implements the information processing apparatus 2000 of the second embodiment is represented, for example, by FIG. 3, as in the first embodiment. However, the storage device 1080 of the computer 1000 that realizes the information processing apparatus 2000 of the present embodiment further stores program modules that realize the functions of the information processing apparatus 2000 of the present embodiment.
- FIG. 10 is a block diagram illustrating the functional configuration of the information processing apparatus 2000 of the third embodiment.
- the information processing apparatus 2000 of the third embodiment has the same function as the information processing apparatus 2000 of the first or second embodiment except for the points described below.
- the object 20 does not operate in response to voice commands included in all speech data; it operates only in response to voice commands included in utterances directed from the person 10 to the object 20. In this way, the object 20 operates only when the person 10 issues a voice command to the object 20. Thus, for example, even when the words the person 10 speaks to another person happen to include the same words as a voice command, the object 20 can be prevented from operating erroneously.
- the utterance that triggers the object 20 is not limited to a predetermined voice command; it may be an utterance representing an arbitrary request.
- the object 20 has a function of interpreting the content of human speech and performing an operation according to that content. For example, in response to the request "take the cup on the table", an operation of picking up the cup on the table and handing it to the speaker is conceivable.
- the object 20 may have a function of responding according to the content of the utterance of the person 10.
- the information processing apparatus 2000 determines whether to make the object 20 respond to the utterance of the person 10. Specifically, when it is determined that the utterance data represents an utterance directed from the person 10 to the object 20, the process determining unit 2080 decides to make the object 20 reply using the content of the utterance data. On the other hand, when it is determined that the utterance data does not represent an utterance directed from the person 10 to the object 20, the process determining unit 2080 decides not to make the object 20 reply. In this way, the object 20 can be prevented from erroneously replying to an utterance that the person 10 directed at another person rather than at the object 20.
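- A minimal sketch of this gating step follows; `recognize()` and `robot.respond()` are hypothetical interfaces standing in for a speech recognizer and the object's reply mechanism:

```python
def recognize(audio):
    """Stub standing in for an actual speech-recognition engine."""
    return str(audio)

def decide_process(utterance, robot):
    """Let the object 20 act only on utterances judged to be directed at it."""
    if not utterance["directed"]:
        return None                 # speech aimed at another person: stay silent
    command = recognize(utterance["audio"])
    return robot.respond(command)   # hypothetical reply/actuation call
```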
- the information processing apparatus 2000 specifies the travel schedule, the destination, and the like by referring to the feature data and schedule data of the person A, and searches for available hotels based on the specified schedule and destination. Furthermore, the information processing apparatus 2000 refers to the things that the person A is interested in, which are indicated in the feature data of the person A, and preferentially presents, as search results, hotels having a high degree of association with those interests. For example, when "hot spring" is included in the things that the person A is interested in, the information processing apparatus 2000 preferentially presents hotels that have a hot spring facility or that are near one.
- the feature data indicates a related person (e.g., a person who travels together) associated with the keyword.
- the motion of the object 20 is preferably determined in consideration of the relevant person as well.
- the information processing apparatus 2000 recognizes, by referring to the feature data of the person A, that the person A will travel with the person B. The information processing apparatus 2000 then searches for hotels with a vacant room in which two people can stay. Furthermore, the information processing apparatus 2000 refers to the things that the person B is interested in, which are indicated in the feature data of the person B, and searches for hotels in consideration of the person B's interests as well.
- the information processing apparatus 2000 preferentially presents, as search results, hotels having a high degree of association with both "hot spring" and "seafood" (for example, hotels that have a hot spring facility and are close to a seafood restaurant).
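- A hedged sketch of this interest-aware ranking follows; the hotel tags and the overlap score are illustrative assumptions:

```python
def rank_hotels(hotels, interests):
    """Order hotels by overlap between their tags and the travelers' interests."""
    interests = set(interests)
    return sorted(hotels, key=lambda h: len(h["tags"] & interests), reverse=True)

# e.g. combining "hot spring" (person A) and "seafood" (person B)
# from their respective feature data:
hotels = [
    {"name": "Hotel X", "tags": {"hot spring", "seafood"}},
    {"name": "Hotel Y", "tags": {"city view"}},
]
print(rank_hotels(hotels, ["hot spring", "seafood"])[0]["name"])  # Hotel X
```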
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Computational Linguistics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
An information processing device (2000) determines whether an utterance of a person (10) included in voice data is directed from the person (10) toward an object (20). To make this determination, the information processing device (2000) estimates the line of sight of the person (10). The line of sight of the person (10) is estimated by analyzing a captured image that includes the person (10). The determination is made using the estimated line of sight of the person (10).
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2017-162058 | 2017-08-25 | ||
| JP2017162058 | 2017-08-25 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2019039352A1 (fr) | 2019-02-28 |
Family
ID=65439447
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2018/030272 (WO2019039352A1) Ceased | Information processing device, control method, and program | 2017-08-25 | 2018-08-14 |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2019039352A1 (fr) |
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2000347692A (ja) * | 1999-06-07 | 2000-12-15 | Sanyo Electric Co Ltd | Person detection method, person detection device, and control system using the same |
| JP2009206924A (ja) * | 2008-02-28 | 2009-09-10 | Fuji Xerox Co Ltd | Information processing device, information processing system, and information processing program |
| JP2011203455A (ja) * | 2010-03-25 | 2011-10-13 | Aisin Aw Co Ltd | Vehicle information terminal and program |
| JP2012014394A (ja) * | 2010-06-30 | 2012-01-19 | Nippon Hoso Kyokai <Nhk> | User instruction acquisition device, user instruction acquisition program, and television receiver |
| JP2017117371A (ja) * | 2015-12-25 | 2017-06-29 | Panasonic Intellectual Property Corporation of America | Control method, control device, and program |
Non-Patent Citations (1)
| Title |
|---|
| HAYASHI, YUKI ET AL.: "Development of Group Discussion Interaction Corpus and Analysis of the Relationship with Personality Traits", TRANSACTIONS OF INFORMATION PROCESSING SOCIETY OF JAPAN, vol. 56, no. 4, 15 April 2015 (2015-04-15), pages 1217 - 1227 * |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US11875820B1 (en) | Context driven device arbitration | |
| US10083006B1 (en) | Intercom-style communication using multiple computing devices | |
| CN112088402B (zh) | 用于说话者识别的联合神经网络 | |
| JP6912605B2 (ja) | 声識別特徴最適化および動的登録方法、クライアント、ならびにサーバ | |
| CN112088315B (zh) | 多模式语音定位 | |
| US11094316B2 (en) | Audio analytics for natural language processing | |
| US20200135213A1 (en) | Electronic device and control method thereof | |
| KR102356623B1 (ko) | 가상 비서 전자 장치 및 그 제어 방법 | |
| US9858924B2 (en) | Voice processing apparatus and voice processing method | |
| US11514663B2 (en) | Reception apparatus, reception system, reception method, and storage medium | |
| WO2019217101A1 (fr) | Attribution de parole multimodale parmi n locuteurs | |
| WO2019202804A1 (fr) | Dispositif et procédé de traitement vocal | |
| US20210216589A1 (en) | Information processing apparatus, information processing method, program, and dialog system | |
| JP2013257418A (ja) | 情報処理装置、および情報処理方法、並びにプログラム | |
| CN109712606A (zh) | 一种信息获取方法、装置、设备及存储介质 | |
| WO2019150708A1 (fr) | Dispositif de traitement d'informations, système de traitement d'informations, procédé de traitement d'informations et programme | |
| JP7483532B2 (ja) | キーワード抽出装置、キーワード抽出方法及びキーワード抽出プログラム | |
| CN120151651A (zh) | 对焦控制系统和方法、可穿戴设备、介质和程序产品 | |
| WO2019039352A1 (fr) | Dispositif de traitement d'informations, procédé de commande, et programme | |
| JP6866731B2 (ja) | 音声認識装置、音声認識方法、及びプログラム | |
| WO2024209802A1 (fr) | Programme, dispositif de traitement d'informations et procédé de traitement d'informations | |
| Dhondiyal et al. | USER PROFILE MANAGEMENT USING DEEP LEARNING FOR ASSISTIVE HEADGEAR | |
| CN116126417A (zh) | 虚拟对象启动交互方法、装置、电子设备及存储介质 | |
| CN116013262A (zh) | 语音信号处理方法、装置、可读存储介质及电子设备 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 18848066; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 18848066; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: JP |