JP2006279111A

JP2006279111A - Information processor, information processing method and program

Info

Publication number: JP2006279111A
Application number: JP2005090291A
Authority: JP
Inventors: Kazumasa Murai; 和昌村井
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2005-03-25
Filing date: 2005-03-25
Publication date: 2006-10-12

Abstract

PROBLEM TO BE SOLVED: To provide an information processor that can support look-back, together with an information processing method and a program.

SOLUTION: A multi-modal recognition section 3 observes the important parts of a meeting by image processing, e.g. motion analysis of hand raising, standing up, or board writing, as well as speaker shift, speaker identification, extraction of a key person, and the like, and calculates a significance weight for each time interval. The multi-modal recognition section 3 then totals the significance weights in units of utterances and identifies the important parts of the meeting. Furthermore, the multi-modal recognition section 3 calculates the significance factor of each scene by applying predetermined weights to the number of words in an utterance, hesitation interjections (fillers), voice volume, variation in voice pitch, and the inclination of a face within a predetermined section. The multi-modal recognition section 3 then estimates the key person of each scene based on the significance factor.

COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to an information processing apparatus, an information processing method, and a program.

Conventionally, for video conference systems, a technique has been proposed in which a camera is controlled using participant position information acquired by wireless communication. For example, importance information for each conference participant is stored in advance in the main unit of the video conference system, and the camera is controlled based on this information. In addition, a transmitter with a wireless communication function carried by each participant periodically transmits its own station information; the main unit of the video conference system receives this information with two antennas, calculates and stores position information from the difference in radio field intensity, and derives the camera shooting position for the transmitter that holds the camera control right, so that the participant carrying that transmitter is displayed correctly on the screen (Patent Document 1). A method has also been proposed for determining the importance of a video stream from indexes and the importance associated with each index (Patent Document 2).

Patent Document 1: JP 2000-197027 A
Patent Document 2: JP 2004-80769 A

Conventionally, minutes have sometimes been prepared for meetings so that participants can look back on them and so that absentees and other concerned parties can be informed of what took place. In recent years audio and video recording has become easy, and meetings are sometimes recorded. For example, for meetings held regularly at intervals, it is known that viewing a summary of the previous meeting at the start of a meeting improves its efficiency. Doing so, however, requires summarizing the audio or video recording, mainly by hand, and because of the serious cost and labor involved this practice has hardly spread to ordinary meetings.

On the other hand, viewing an unedited audio or video recording without a summary takes as long as the meeting itself, so reuse is inefficient. For this reason, unedited audio and video recordings have rarely been put to use.

The present invention has been made in view of the above problems, and an object thereof is to provide an information processing apparatus, an information processing method, and a program that can support look-back.

In order to solve the above problems, the present invention is an information processing apparatus comprising: extraction means for extracting feature quantities, based on video information photographed by photographing means, using at least one of a technique for recognizing feature quantities from acoustic information and a technique for recognizing feature quantities from image information; totaling means for totaling the feature quantities extracted by the extraction means; and estimation means for estimating the state of the space photographed by the photographing means based on the totaling result of the totaling means. Look-back can thus be supported by using the feature quantities to estimate the state of the space photographed by the photographing means.

The estimation means estimates the state of the space photographed by the photographing means by associating the totaling result of the totaling means with a predetermined importance.

The estimation means estimates a key person in the video based on the totaling result of the totaling means. The estimation means estimates, as the state of the space photographed by the photographing means, at least one of: a portion in which the key person speaks; a portion in which the speech rate is faster than a predetermined rate; a portion in which the voice is louder than a predetermined volume; a portion in which the pitch is higher than a predetermined pitch; a portion following a show of hands; a portion following a vote; a portion in which the speaker changes a predetermined number of times within a predetermined period; a portion in which someone is standing; a portion in which someone is writing on the board; a portion in which side conversations occur fewer than a predetermined number of times; and a portion in which an utterance is followed by a predetermined interval in which no one speaks.

The information processing apparatus of the present invention further comprises recording means for recording a usage history of a playback device that plays back the video information, and control means for learning a function for calculating the importance based on the usage history of the playback device. The extraction means extracts, as the feature quantity, at least one of the number of uttered words, the number of uttered phonemes, the number of occurrences of interjections, voice volume, change in voice volume over time, change in phoneme pitch, and the inclination of a face. The technique for recognizing feature quantities from the acoustic information includes at least one of sound source estimation, utterance unit detection, utterance collision detection, speech rate detection, pitch detection, volume detection, silent interval detection, voice quality identification, hesitation (filler) detection, and applause detection.

The technique for recognizing feature quantities from the image information includes at least one of face detection, action detection, and position detection. The totaling means totals the feature quantities in units of utterances. The information processing apparatus of the present invention further comprises assigning means for assigning an index to the video information based on the state, estimated by the estimation means, of the space photographed by the photographing means.

The present invention is also an information processing method comprising: an extraction step of extracting feature quantities, based on video information photographed by photographing means, using at least one of a technique for recognizing feature quantities from acoustic information and a technique for recognizing feature quantities from image information; a totaling step of totaling the feature quantities; and an estimation step of estimating the state of the space photographed by the photographing means based on the totaling result of the feature quantities. The estimation step estimates the state of the space photographed by the photographing means by associating the totaling result of the feature quantities with a predetermined importance. The estimation step estimates a key person in the video based on the totaling result of the feature quantities.

The information processing method of the present invention further comprises a step of recording a usage history of a playback device that plays back the video information, and a step of learning a function for calculating the importance based on the usage history of the playback device. The information processing method of the present invention further comprises a step of assigning an index to the video information based on the estimated state of the space photographed by the photographing means.

The present invention is also a program for causing a computer to execute: a step of extracting feature quantities, based on video information photographed by photographing means, using at least one of a technique for recognizing feature quantities from acoustic information and a technique for recognizing feature quantities from image information; a step of totaling the feature quantities; and a step of estimating the state of the space photographed by the photographing means based on the totaling result of the feature quantities.

According to the present invention, an information processing apparatus, an information processing method, and a program that can support look-back can be provided.

The best mode for carrying out the present invention is described below.

First, a first embodiment will be described. FIG. 1 shows the overall configuration of a system according to the first embodiment. As shown in FIG. 1, the system 1 includes a sensor 2, a multimodal recognition unit 3, an audio/image recording unit 4, a database 5, a playback/use unit 6, and a usage recording unit 7. The sensor 2 comprises a microphone, a digital video camera (photographing means), and the like. The microphone inputs, for example, audio signals during a meeting. The digital video camera inputs image information of the scene during the meeting. The audio signals and the image information are synchronized.

The multimodal recognition unit 3 estimates the important parts of a meeting through image processing, such as motion analysis of hand raising, standing up, and board writing, together with speaker change detection, speaker identification, extraction of the key person, and the like. In musical terms, the important part of a meeting corresponds to the "sabi" (the hook or chorus of a song). The multimodal recognition unit 3 then totals the importance weights, which are calculated for each time interval, for each utterance unit (utterance) and identifies the important parts of the meeting.

The important parts of a meeting have, for example, the following characteristics.
・The key person speaks.
・The speech rate is fast (the time density of phonemes is high).
・The voice is loud and the pitch is high.
・The portion immediately follows a show of hands or a vote.
・Speaker changes are relatively frequent.
・Someone is standing or writing on the board.
・There are few side conversations among other participants.
・The utterance unit is often followed immediately by an interval in which no one speaks.
・Overlapping utterances occur even in on-site meetings.
The information integration unit 50 estimates these as the state of the space photographed by the photographing means.

When there are a plurality of viewers during the meeting, the multimodal recognition unit 3 adds up the weights and updates the importance. The audio/image recording unit 4 records the audio signals and video signals obtained during the meeting. The database 5 records the importance identified by the multimodal recognition unit 3 as meta information for the audio and images recorded in the database 5. Using this meta information at playback time makes it easy to search for the important parts.

The playback/use unit 6 is a video playback device that allows easy fast-forwarding and rewinding. The usage recording unit 7 records, as a usage record, the portions that people involved in the meeting viewed without fast-forwarding or skipping. The usage recording unit 7 provides importance information to the multimodal recognition unit 3. Here, importance information is information obtained from the operations performed while watching the video, such as portions considered important because the recorded video is watched repeatedly, and portions that are often skipped by fast-forwarding. Important portions may also be labeled by hand and used as importance information. The multimodal recognition unit 3 relearns the function for calculating the importance weight in accordance with the importance information from the usage recording unit 7, thereby improving accuracy.
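As a rough illustration of how a usage record might be turned into importance information, the sketch below computes, for each fixed-length segment of a recording, the fraction of viewers who watched it at normal speed. The log format `(viewer, start, end, action)`, the segment length, and the function name `viewing_rate` are assumptions made for this example and do not appear in the source.

```python
def viewing_rate(log, duration, segment=10.0):
    """Return, per segment, the fraction of viewers who played it normally.

    log: iterable of (viewer_id, start_sec, end_sec, action), where action is
         "play" for normal-speed viewing and anything else (e.g. "skip") for
         fast-forwarding or skipping. This log format is hypothetical.
    """
    n_segments = int(-(-duration // segment))  # ceiling division
    watched = [set() for _ in range(n_segments)]
    viewers = set()
    for viewer, start, end, action in log:
        viewers.add(viewer)
        if action != "play":
            continue  # skipped portions do not count as viewed
        first = int(start // segment)
        last = int(-(-end // segment))  # exclusive upper bound
        for i in range(first, min(last, n_segments)):
            watched[i].add(viewer)
    if not viewers:
        return [0.0] * n_segments
    return [len(w) / len(viewers) for w in watched]
```

Segments with a high rate could then serve as positive examples of importance, and segments that most viewers skipped as negative examples.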

Next, the multimodal recognition unit 3 will be described. FIG. 2 is a block diagram showing the configuration of the multimodal recognition unit 3 according to the first embodiment. As shown in FIG. 2, the multimodal recognition unit 3 includes a speech recognition unit 10, an image recognition unit 20, an unsupervised learning control unit 30, a supervised learning control unit 40, and an information integration unit 50. The terms "recognition" and "detection" are used below in their ordinary senses and are not specifically distinguished.

The speech recognition unit 10 extracts feature quantities, based on video information photographed by the photographing means, using a technique for recognizing feature quantities from acoustic information. The speech recognition unit 10 also uses predetermined speech recognition techniques to enumerate, from the audio signal, information that is likely to be related to importance. The speech recognition unit 10 includes a sound source estimation unit 11, an utterance unit detection unit 12, an utterance collision detection unit 13, a speech rate detection unit 14, a pitch/volume detection unit 15, a silent interval detection unit 16, a hesitation detection unit 17, and an applause detection unit 18. The sound source estimation unit 11 estimates the sound source position using a microphone array or the like. The sound source may also be estimated using the face position in the image, or by fusing audio and image information.

The utterance unit detection unit 12 detects an utterance from its start and end points. The utterance collision detection unit 13 detects utterance collisions from the utterances detected for a plurality of speakers. Here, an utterance collision is a portion in which utterances overlap on the time axis. The speech rate detection unit 14 detects the speech rate from phoneme recognition and the number of phonemes per unit time. The pitch/volume detection unit 15 detects pitch and volume from the audio signal during the meeting. The pitch/volume detection unit 15 detects pitch using, for example, the fundamental frequency (F0) of an accent phrase or the like, and detects volume from the signal energy.
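The three detectors described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: utterances are given as hypothetical `(speaker, start, end)` tuples in seconds, the phoneme count is assumed to come from a separate phoneme recognizer, and root-mean-square amplitude is used here as one common energy-based measure of volume. None of the function names come from the source.

```python
import math


def find_collisions(utterances):
    """Return overlaps between utterances of different speakers.

    utterances: list of (speaker, start, end). Each result is
    (speaker_a, speaker_b, overlap_start, overlap_end).
    """
    collisions = []
    ordered = sorted(utterances, key=lambda u: u[1])
    for i, (spk_a, _start_a, end_a) in enumerate(ordered):
        for spk_b, start_b, end_b in ordered[i + 1:]:
            if start_b >= end_a:
                break  # sorted by start time, so nothing later can overlap
            if spk_a != spk_b:
                collisions.append((spk_a, spk_b, start_b, min(end_a, end_b)))
    return collisions


def speech_rate(phoneme_count, start, end):
    """Phonemes per second over one utterance."""
    duration = end - start
    if duration <= 0:
        raise ValueError("utterance must have positive duration")
    return phoneme_count / duration


def rms_volume(samples):
    """Root-mean-square amplitude of a block of audio samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))
```

For example, `find_collisions([("A", 0.0, 5.0), ("B", 4.0, 6.0)])` reports that A and B overlap between 4.0 s and 5.0 s.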

The silent interval detection unit 16 obtains, from the utterance detection results, the intervals in which no one speaks. The hesitation detection unit 17 recognizes fillers such as "eeto" and "ano" (roughly "um" and "uh") using word spotting or the like. The applause detection unit 18 detects applause from the sound using speech recognition technology. Applause may also be detected from the image. The speech recognition unit 10 preferably does not perform speech recognition involving syntactic analysis, and performs only recognition that is easy to implement, such as detection of pitch and voice volume, the time density of phonemes, and hesitation interjections.
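One way to derive silent intervals from utterance detection results is to merge the detected utterances of all speakers and report the gaps between them. The sketch below assumes utterances are `(start, end)` pairs in seconds; the `min_length` parameter and the function name are hypothetical and are included only for illustration.

```python
def silent_intervals(utterances, session_start, session_end, min_length=0.0):
    """Return (start, end) gaps longer than min_length in which no one speaks.

    utterances: list of (start, end) pairs pooled from all speakers.
    """
    gaps = []
    cursor = session_start  # end of the speech seen so far
    for start, end in sorted(utterances):
        if start - cursor > min_length:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if session_end - cursor > min_length:
        gaps.append((cursor, session_end))
    return gaps
```

A gap that immediately follows an utterance unit is one of the cues listed earlier as typical of an important part of a meeting.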

The image recognition unit 20 extracts feature quantities, based on video information photographed by the photographing means, using a technique for recognizing feature quantities from image information. The image recognition unit 20 also uses predetermined image processing techniques to enumerate, from the image signal, information that is likely to be related to importance. The image recognition unit 20 includes a face detection unit 21, an action detection unit 22, and a position detection unit 23. The face detection unit 21 detects the state of each person's face using conventional image processing techniques. The action detection unit 22 detects actions such as standing up, writing on the board, and gestures. The position detection unit 23 detects the position of each person.

The voice quality/face identification unit 25 identifies the speaker using voice-based speaker identification and face identification. As recognition technology advances, the speech recognition unit 10 and the image recognition unit 20 will be able to recognize or detect other items as well. In actual operation, newly recognizable information can be added and integrated. The unsupervised learning control unit 30 controls unsupervised learning and constructs a map, such as a SOM (self-organizing map), generated from the co-occurrence probabilities of a plurality of inputs. The unsupervised learning control unit 30 performs unsupervised learning from co-occurrence probabilities and the like, based on the information obtained from the speech recognition unit 10 and the image recognition unit 20.
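The co-occurrence statistics that such unsupervised learning starts from can be illustrated as follows. This sketch only estimates pairwise co-occurrence probabilities of detected events per time slot; it does not build a self-organizing map. The event names and the per-slot set representation are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_probabilities(slots):
    """Estimate P(a and b occur in the same time slot) for each event pair.

    slots: list of sets of event names, one set per time slot.
    Returns a dict keyed by alphabetically ordered pairs (a, b).
    """
    counts = Counter()
    for events in slots:
        for pair in combinations(sorted(events), 2):
            counts[pair] += 1
    total = len(slots)
    return {pair: count / total for pair, count in counts.items()}
```

Pairs that co-occur often, such as a loud voice together with someone standing, could then be fed to a clustering or mapping method of the implementer's choice.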

The supervised learning control unit 40 learns the relationship between the information integrated by the information integration unit 50 and the importance information from the usage recording unit 7. A hierarchical neural network, for example, can be used for this learning. It is assumed, however, that relationships between importance and detection results that generally seem related to it are supplied by analogy as initial values, so that reasonable utility is obtained even with the initial values. With this supervised learning, even if individual recognizers misrecognize, an association with importance that takes the misrecognition into account can be obtained, so recognizers with limited accuracy can still be adopted. In other words, importance can be recognized more accurately on the basis of real data. Once learning is complete, importance can be estimated from the integrated information, and additional learning can be performed using the learned result as the initial value. Learning results thus accumulate gradually with use, and the importance can be estimated.
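The idea of starting from hand-chosen initial weights and refining them with observed importance information can be sketched with a deliberately simple model. The source mentions a hierarchical neural network; the example below substitutes a single linear unit trained by stochastic gradient descent, purely to keep the illustration short. The function names, learning rate, and epoch count are assumptions.

```python
def predict(weights, bias, features):
    """Estimated importance as a weighted sum of feature values."""
    return bias + sum(w * x for w, x in zip(weights, features))


def train(weights, bias, samples, lr=0.1, epochs=200):
    """Refine initial weights against (features, target_importance) samples.

    weights and bias are the initial values supplied by hand; the returned
    values can be passed back in later for additional learning.
    """
    weights = list(weights)  # do not modify the caller's list
    for _ in range(epochs):
        for features, target in samples:
            error = predict(weights, bias, features) - target
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias
```

Because the trained weights are returned rather than stored, calling `train` again with new samples and the previous result as the starting point corresponds to the additional learning described above.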

The information integration unit 50 totals the feature quantities extracted by the speech recognition unit 10 and the image recognition unit 20, for example in units of utterances, and estimates the state of the space photographed by the photographing means based on the totaling result. For example, the information integration unit 50 estimates the state of the space photographed by the photographing means by associating the totaling result with a predetermined importance. The information integration unit 50 then assigns an index to the video information based on the estimated state of the space photographed by the photographing means. This makes it easy, for example, to look back on the important parts of a meeting.

The information integration unit 50 also collects and integrates information from the speech recognition unit 10, the image recognition unit 20, other sensor information 60, the unsupervised learning control unit 30, and the supervised learning control unit 40 to create a time/importance graph, and sends the created graph to the database 5. During collection, for example, the speaker's sound source (mouth) position estimated by sound source estimation can be used as basic information for face detection, and conversely the mouth position can be used as a candidate for sound source estimation. Face identification from the detected face and voice quality identification from the voice can be combined to make identification more robust. By using the results of the individual recognition elements mutually in this way, multiple pieces of information can be integrated and obtained.
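A time/importance graph and the index derived from it might look like the following sketch. It assumes that each utterance already carries an importance score, represented as a hypothetical `(start, end, importance)` tuple; the sampling step and the threshold are illustrative parameters that the source does not specify.

```python
def importance_graph(scored_utterances, duration, step=1.0):
    """Sample total importance at regular times; returns [(time, value), ...].

    scored_utterances: list of (start, end, importance) tuples.
    """
    graph = []
    for i in range(int(duration // step)):
        t = i * step
        value = sum(
            (score for start, end, score in scored_utterances if start <= t < end),
            0.0,
        )
        graph.append((t, value))
    return graph


def index_entries(scored_utterances, threshold):
    """Return (start, end) of utterances at or above the threshold, in time order."""
    return [
        (start, end)
        for start, end, score in sorted(scored_utterances)
        if score >= threshold
    ]
```

The index entries could be stored as meta information alongside the recording so that a player can jump directly to them.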

Next, the operation of the system will be described. FIG. 3 is a flowchart of the operation of the system according to the first embodiment. In step S11, the supervised learning control unit 40 learns the relationship between the information integrated by the information integration unit 50 and the importance information from the usage recording unit 7. In step S12, audio signals during the meeting are input from the microphone, and image information of the scene during the meeting is input from the digital video camera. In step S13, the speech recognition unit 10 and the image recognition unit 20 calculate an importance weight for each time interval.

In step S14, the information integration unit 50 totals the importance weights in units of utterances and identifies the important parts of the meeting. In step S15, the importance identified by the information integration unit 50 is recorded in the database 5 as meta information for the audio and images. Using this meta information at playback time makes it easy to search for the important parts of the meeting.
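Steps S13 and S14 can be pictured as follows: importance weights are available per time slot, and each utterance receives the sum of the weights of the slots it covers. The sketch below prorates slots that an utterance covers only partially, which is an assumption of this example rather than something the source specifies; the `Utterance` class and its field names are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str
    start: float  # seconds
    end: float    # seconds


def aggregate_by_utterance(weights, utterances, step=1.0):
    """Total per-slot importance weights over each utterance.

    weights[i] is the importance weight of the slot [i*step, (i+1)*step).
    """
    totals = []
    for u in utterances:
        total = 0.0
        for i, w in enumerate(weights):
            slot_start = i * step
            slot_end = slot_start + step
            overlap = min(u.end, slot_end) - max(u.start, slot_start)
            if overlap > 0:
                total += w * overlap / step  # prorate partial coverage
        totals.append(total)
    return totals
```

The utterance with the largest total is then a natural candidate for the most important part of the meeting.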

To summarize this embodiment, the feature quantities of acoustic information include those that can be recognized easily even with current technology, such as voice volume, sound source direction, voice pitch, speech rate, utterance collisions, speaker changes, and silence, and those that can be recognized to a practical degree, such as speaker identity and applause. The feature quantities from image information include faces, face orientation, gaze direction, identification of a person from the face, hand raising, standing, board writing, note taking, body movement, lighting brightness, image switching on the projector screen, people entering and leaving, blinking frequency, and leg jiggling. These recognition results are totaled and associated (learned) with an importance that is determined in advance (or determined by another method). Based on this association, importance is estimated from the acoustic and image feature quantities.

It may become possible in the future to weight importance by recognizing, at a higher level, the content of the talk or the interests of the viewer, but there is no prospect of this with current technology. The present invention therefore estimates importance by collecting and learning from a large amount of obtainable information, in particular the behavior of humans, who possess advanced recognition capabilities.

For example:
・When there is "(1) an interesting talk", people tend to "(2) raise their faces".
・When "(1) the talk is boring", people "(2) jiggle their legs".
・"(1) Before speaking", a person "(2) waits with the mouth open".
・When "(1) the discussion heats up", "(2) speaker changes occur frequently".
・When "(1) the discussion stalls", "(2) the time in which no one speaks" increases.
The multimodal recognition unit 3 recognizes and totals the items in (2) above, thereby estimating the items in (1) and assigning metadata. Recognizing individual states such as a boring talk or a heated discussion is of limited use in itself, but if a person skilled at judging importance attaches meta information to the important parts while watching the video, a certain degree of association between "recognizable events" and "importance" can be learned.
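The association between recognizable events and hand-labeled importance can be illustrated with a simple frequency estimate: for each cue, the fraction of segments containing that cue that a human labeled as important. This is only one elementary way to express such an association; the cue names and the `(cues, is_important)` segment format are invented for the example.

```python
from collections import defaultdict


def cue_importance_rates(labeled_segments):
    """Return, per cue, the fraction of its segments labeled important.

    labeled_segments: list of (set_of_cue_names, is_important) pairs.
    """
    seen = defaultdict(int)
    important = defaultdict(int)
    for cues, is_important in labeled_segments:
        for cue in cues:
            seen[cue] += 1
            if is_important:
                important[cue] += 1
    return {cue: important[cue] / seen[cue] for cue in seen}
```

With enough labeled segments, cues with a high rate indicate likely important parts even though what is actually being said is never recognized.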

Also, by observing how multiple people view the recorded video, for example that many people watched "the president's talk" whereas everyone fast-forwarded through "D's remarks", it can be inferred that scenes in which the president speaks are likely to be more important than scenes in which D speaks. In this way, importance can be estimated from information that is recognizable within the limits of current technology, without recognizing what is actually being said.

Importance can be determined by "hand labeling", as practiced in speech recognition research, in which labels are entered manually while watching and listening to the recording, or by something like a "viewing rate" obtained by automatically recognizing that portions of the recorded video are watched repeatedly or skipped by fast-forwarding during playback. By repeatedly making these associations, importance can be obtained inductively from the feature quantities.

Next, a second embodiment will be described. FIG. 4 shows the overall configuration of a system according to the second embodiment. As shown in FIG. 4, the system 100 includes a sensor 2, a multimodal recognition unit 103, an audio/image recording unit 4, a database 5, and a playback/use unit 6. Parts identical to those of the first embodiment are denoted by the same reference numerals. FIG. 5 is a block diagram showing the configuration of the multimodal recognition unit 103 according to the second embodiment.

As shown in FIG. 5, the multimodal recognition unit 103 includes a speech recognition unit 10, an image recognition unit 20, an information integration unit 50, and a status/position input unit 70. The speech recognition unit 10 extracts feature quantities, based on the video information photographed by the photographing means, using a technique for recognizing feature quantities from acoustic information. It also uses a predetermined speech recognition technique to count, from the audio signal, information that is likely to be related to importance. The speech recognition unit 10 includes a sound source estimation unit 11, an utterance unit detection unit 12, an utterance collision detection unit 13, a speech rate detection unit 14, a pitch/volume detection unit 15, a silent interval detection unit 16, a hesitation detection unit 17, and an applause detection unit 18. The image recognition unit 20 extracts feature quantities, based on the video information photographed by the photographing means, using a technique for recognizing feature quantities from image information. It also uses a predetermined image processing technique to count, from the image signal, information that is likely to be related to importance. The status/position input unit 70 is used to enter information on a speaker's status and position manually, so that this information can be provided in advance.

The speech recognition unit 10 and the image recognition unit 20 extract, as feature quantities, the number of uttered words, the number of uttered phonemes, the number of occurrences of interjections, the voice volume, the change in voice volume over time, the change in phoneme pitch, and the tilt of the face.

Generally, a key person has the following characteristics.
・ Long speaking time.
・ Few hesitations while speaking.
・ Loud voice with little variation in volume.
・ Relatively little variation in voice pitch.
・ High status or position.
・ Other participants often take notes while the key person is speaking.

On the other hand, a person who is not a key person has the following characteristics.
・ Few words relative to speaking time.
・ Many hesitations and many interjections ("um...", "uh...").
・ Quiet voice with large variation in volume.
・ Large variation in voice pitch.
・ Low status or position.

Therefore, the information integration unit 50 aggregates the extracted feature quantities and, based on the aggregation result, estimates the key person in the space photographed by the photographing means. Within a predetermined section, the information integration unit 50 converts the number of uttered words (the number of phonemes may be used instead), hesitation interjections, voice volume, rate of change of voice volume, change in voice pitch, and tilt of the face (considered to be related to status) using predetermined weights, and calculates an importance coefficient for each scene. Based on this importance coefficient, the information integration unit 50 estimates the key person for each scene. A coefficient can also be derived from e-mail exchanges and the like and taken into account. It is also possible, by manual operation or the like, to add weight to a specific person or to designate a specific person as the key person.
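The weighted conversion and the per-scene key person estimation can be sketched as follows. This is a minimal illustration, assuming that the importance coefficient is a weighted sum of one participant's features within one scene and that the key person is the participant with the largest coefficient. The weight values, feature names, and the `bonus` argument (standing in for the manual weight adjustment) are hypothetical; the source says only that predetermined weights are used, and it does not state the sign for face tilt, so that sign is arbitrary here.

```python
# Hypothetical weights: positive for traits the text associates with key
# persons, negative for traits it associates with other participants.
WEIGHTS = {
    "word_count": 0.02,
    "hesitations": -0.5,
    "volume": 1.0,
    "volume_change": -1.0,
    "pitch_change": -1.0,
    "face_tilt": -0.1,  # sign not given in the source; arbitrary choice
}

def importance_coefficient(features, weights=WEIGHTS):
    """Weighted sum of one participant's features within one scene."""
    return sum(weights[name] * features[name] for name in weights)

def estimate_key_person(scene, bonus=None):
    """Return (key person, scores) for one scene.

    scene: dict mapping participant -> feature dict
    bonus: optional dict mapping participant -> manually added weight
    """
    bonus = bonus or {}
    scores = {person: importance_coefficient(f) + bonus.get(person, 0.0)
              for person, f in scene.items()}
    return max(scores, key=scores.get), scores

# Illustrative scene with two participants.
scene = {
    "A": {"word_count": 120, "hesitations": 1, "volume": 0.8,
          "volume_change": 0.1, "pitch_change": 0.1, "face_tilt": 0.0},
    "B": {"word_count": 40, "hesitations": 6, "volume": 0.4,
          "volume_change": 0.5, "pitch_change": 0.6, "face_tilt": 0.0},
}
key_person, scores = estimate_key_person(scene)
print(key_person, scores)
```

Passing, for example, `bonus={"B": 10.0}` shows how a manually added weight can override the estimate for a specific person.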

Next, the operation of the system according to the second embodiment will be described. FIG. 6 is a flowchart of the operation of the system according to the second embodiment. In step S21, the audio signal of the meeting is input from a microphone, and image information of the meeting scene is input from a digital camera. In step S22, the information integration unit 50 converts, within a predetermined section, the number of uttered words, interjections, voice volume, rate of change of voice volume, change in voice pitch, tilt of the face, and the like using predetermined weights, and calculates an importance coefficient for each scene.

In step S23, the information integration unit 50 estimates the key person for each scene based on the importance coefficient. In step S24, the information integration unit 50 records the importance coefficient in the database 5 as meta-information for the audio and images. Using this meta-information at playback time makes it easy to search for the key person.
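As a rough illustration of steps S23 and S24, the sketch below keeps one meta-information record per scene and searches those records at playback time. The record layout `SceneMeta`, the in-memory list standing in for the database 5, and the function `scenes_for_key_person` are assumptions made for this example; the source does not describe the storage format or a query interface.

```python
from dataclasses import dataclass

@dataclass
class SceneMeta:
    """Hypothetical meta-information stored alongside the audio and images."""
    start_sec: float
    end_sec: float
    key_person: str
    importance: float

def scenes_for_key_person(records, person, min_importance=0.0):
    """Return the scenes whose estimated key person is `person`, most important first."""
    hits = [r for r in records
            if r.key_person == person and r.importance >= min_importance]
    return sorted(hits, key=lambda r: r.importance, reverse=True)

# Illustrative records; a plain list stands in for the database.
database = [
    SceneMeta(0, 60, "A", 2.5),
    SceneMeta(60, 150, "B", 1.2),
    SceneMeta(150, 200, "A", 3.1),
]
hits = scenes_for_key_person(database, "A")
for h in hits:
    print(h.start_sec, h.end_sec, h.importance)
```

A playback tool could then jump directly to the returned time ranges instead of scanning the whole recording.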

According to the second embodiment, even in a brainstorming-style meeting, where the key person cannot necessarily be identified from prior information, the key person can be estimated afterwards. The embodiment can also handle the fact that the key person changes from scene to scene.

The information processing method according to the present invention is realized using, for example, a CPU (Central Processing Unit), a ROM (Read Only Memory), a RAM (Random Access Memory), and the like. A program is installed from a hard disk device or from a portable storage medium such as a CD-ROM, DVD, or flexible disk, or is downloaded via a communication line, and each step is realized by the CPU executing the program. This program causes a computer to execute: a step of extracting feature quantities, based on video information photographed by photographing means, using at least one of a technique for recognizing feature quantities from acoustic information and a technique for recognizing feature quantities from image information; a step of totaling the feature quantities; and a step of estimating the state of the space photographed by the photographing means based on the result of totaling the feature quantities.

Although preferred embodiments of the present invention have been described in detail above, the present invention is not limited to these specific embodiments, and various modifications and changes are possible within the scope of the gist of the present invention as set forth in the claims.

FIG. 1 is a diagram showing the overall configuration of the system according to the first embodiment.
FIG. 2 is a block diagram showing the configuration of the multimodal recognition unit 3 according to the first embodiment.
FIG. 3 is a flowchart of the operation of the system according to the first embodiment.
FIG. 4 is a diagram showing the overall configuration of the system according to the second embodiment.
FIG. 5 is a block diagram showing the configuration of the multimodal recognition unit 103 according to the second embodiment.
FIG. 6 is a flowchart of the operation of the system according to the second embodiment.

Explanation of symbols

1, 100 System
2 Sensor
3, 103 Multimodal recognition unit
4 Audio/image recording unit
5 Database
6 Playback/use unit
7 Usage recording unit
10 Speech recognition unit
20 Image recognition unit
30 Unsupervised learning control unit
40 Supervised learning control unit
50 Information integration unit
70 Status/position input unit

Claims

An information processing apparatus comprising:
extraction means for extracting a feature quantity, based on video information photographed by photographing means, using at least one of a technique for recognizing a feature quantity from acoustic information and a technique for recognizing a feature quantity from image information;
totaling means for totaling the feature quantities extracted by the extraction means; and
estimation means for estimating a state of a space photographed by the photographing means based on a totaling result obtained by the totaling means.

The information processing apparatus according to claim 1, wherein the estimation means estimates the state of the space photographed by the photographing means by associating the totaling result obtained by the totaling means with a predetermined importance.

The information processing apparatus according to claim 1, wherein the estimation means estimates a key person in the space photographed by the photographing means based on the totaling result obtained by the totaling means.

The information processing apparatus according to claim 1, wherein the estimation means estimates, as the state of the space photographed by the photographing means, at least one of: a part in which the key person speaks; a part in which the speech rate is faster than a predetermined rate; a part in which the voice is louder than a predetermined level; a part in which the pitch is higher than a predetermined pitch; a part that follows a raised hand; a part in which the speaker subsequently changes a predetermined number of times within a predetermined period; a part in which a person is standing; a part in which a person is writing on the board; a part in which private talk is below a predetermined amount; and a part in which an utterance is followed by a predetermined section with no speech.

The information processing apparatus according to claim 2, further comprising:
recording means for recording a usage history of a playback device that plays back the video information; and
control means for learning a function for calculating the importance based on the usage history of the playback device.

The information processing apparatus according to claim 1, wherein the extraction means extracts, as the feature quantity, at least one of the number of uttered words, the number of uttered phonemes, the number of occurrences of interjections, the voice volume, the change in voice volume over time, the change in phoneme pitch, and the tilt of the face.

The information processing apparatus according to claim 1, wherein the technique for recognizing a feature quantity from the acoustic information includes at least one of sound source estimation, utterance unit detection, utterance collision detection, speech rate detection, pitch detection, volume detection, silent interval detection, sound quality identification, hesitation detection, and applause detection.

The information processing apparatus according to claim 1, wherein the technique for recognizing a feature quantity from the image information includes at least one of face detection, action detection, and position detection.

The information processing apparatus according to claim 1, wherein the totaling means totals the feature quantities in units of utterances.

The information processing apparatus according to claim 1, further comprising assigning means for assigning an index to the video information based on the state of the space photographed by the photographing means, as estimated by the estimation means.

An information processing method comprising:
an extraction step of extracting a feature quantity, based on video information photographed by photographing means, using at least one of a technique for recognizing a feature quantity from acoustic information and a technique for recognizing a feature quantity from image information;
a totaling step of totaling the feature quantities; and
an estimation step of estimating a state of a space photographed by the photographing means based on the result of totaling the feature quantities.

The information processing method according to claim 11, wherein the estimation step estimates the state of the space photographed by the photographing means by associating the result of totaling the feature quantities with a predetermined importance.

The information processing method according to claim 11, wherein the estimation step estimates a key person in the space photographed by the photographing means based on the result of totaling the feature quantities.

The information processing method according to claim 11, further comprising:
a step of recording a usage history of a playback device that plays back the video information; and
a step of learning a function for calculating the importance based on the usage history of the playback device.

The information processing method according to claim 11, further comprising a step of assigning an index to the video information based on the estimated state of the space photographed by the photographing means.

A program for causing a computer to execute:
a step of extracting a feature quantity, based on video information photographed by photographing means, using at least one of a technique for recognizing a feature quantity from acoustic information and a technique for recognizing a feature quantity from image information;
a step of totaling the feature quantities; and
a step of estimating a state of a space photographed by the photographing means based on the result of totaling the feature quantities.