JP7774997B2 - Imaging device, control method, and program

Info

Publication number: JP7774997B2
Application number: JP2021140207A
Authority: JP
Inventors: 修原田; 宏樹太田
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2021-08-30
Filing date: 2021-08-30
Publication date: 2025-11-25
Anticipated expiration: 2041-08-30
Also published as: JP2023034121A

Description

The present invention relates to an audio processing device that performs audio processing on a person's voice.

When shooting video with an imaging device, it is important to preserve the scene as the photographer pictured it, and this applies to the audio as well as to the video.

Patent Document 1 discloses extracting the voice of a subject and individually adjusting the extracted audio signal according to the subject's position, thereby creating an acoustic space with a sense of presence and a stereo effect.

Patent Document 1: JP 2012-138930 A

However, when a person listens to a conversation, an accurately reproduced acoustic space does not necessarily match what that person perceives. For example, even when many people are chatting among themselves, a person can naturally pick out the conversation of someone they are interested in, or the sound of their own name. It is also said that people rely on visual information in addition to audio information: by watching the speaker, they supplement what they hear with information from the speaker's mouth movements and gestures. In other words, it is also important that the audio recorded in a video matches the conversational audio that remains in a person's memory (their impression of the scene).

However, because the aim of Patent Document 1 is to accurately reproduce the acoustic space of voices based on the positional relationship of the people (sound sources), the resulting video may differ from the photographer's impression.

Accordingly, an object of the present invention is to record video and audio that match the photographer's impression.

The imaging device of the present invention comprises a detection means for detecting subjects from a moving image, a determination means for determining the voices of the subjects from the moving image, a selection means for selecting a main subject from the subjects detected from the moving image, and a judgment means for judging which subjects are related to the main subject selected by the selection means,
wherein, when a subject related to the main subject moves out of the angle of view of the moving image, the judgment means judges, based on the voice of the subject determined by the determination means, whether the subject that has moved out of the angle of view continues to be related to the main subject.

According to the present invention, video and audio that match the photographer's impression can be recorded.

FIG. 1 is a block diagram of the imaging device according to the first embodiment.
FIG. 2 is a block diagram of the imaging unit, image processing unit, audio input unit, and audio processing unit according to the first embodiment (during recording).
FIG. 3 is a block diagram of the image processing unit and audio processing unit according to the first embodiment (during post-processing).
FIG. 4 is a diagram illustrating the main target selection method according to the first embodiment.
FIG. 5 is a diagram showing the operation flow of the video recording sequence according to the first embodiment.
FIG. 6 is a diagram illustrating an assumed scene according to the first embodiment.
FIG. 7 is a diagram illustrating the content of the audio processing according to the first embodiment.
FIG. 8 is a block diagram of the imaging processing unit and audio processing unit according to the second embodiment.
FIG. 9 is a diagram showing the operation flow of the video recording sequence according to the second embodiment.
FIG. 10 is a diagram for explaining the problem addressed by the second embodiment.
FIG. 11 is a diagram illustrating a scene that poses the problem addressed by the second embodiment.

Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

[First embodiment]
In this embodiment, an audio processing device included in an imaging device will be described with reference to FIGS. 1 to 3.

FIG. 1 is a block diagram showing the configuration of the imaging device 100 of the first embodiment.

The imaging unit 101 converts the optical image of the subject captured by the photographing optical lens into an image signal using an imaging element, and the image processing unit 102 performs analog-to-digital conversion, image adjustment processing, and the like to generate image data. The photographing optical lens may be a built-in optical lens or a detachable optical lens. The imaging element may be any photoelectric conversion element, typified by a CCD or CMOS sensor. The audio input unit 103 collects sound around the imaging device 100 using a microphone that is built in or connected via an audio terminal and converts it from analog to digital; the audio processing unit 104 then applies various audio processes to generate audio data. The microphone may be directional or omnidirectional. The memory 105 temporarily stores the image data obtained by the imaging unit 101 and image processing unit 102 and the audio data obtained by the audio input unit 103 and audio processing unit 104. The display control unit 106 displays video based on the image data obtained by the image processing unit 102, as well as the operation screen, menu screens, and the like of the imaging device 100, on the display unit 107 or on an external display via a video terminal (not shown). The display unit 107 has a touch panel function, which the photographer can operate to select menu items, subjects, and so on.

The encoding processing unit 108 reads the image data and audio data temporarily stored in the memory 105 and performs predetermined encoding to generate compressed image data, compressed audio data, and the like. The audio data may also be left uncompressed. The compressed image data may be compressed using any compression method, such as MPEG2 or H.264/MPEG4-AVC. Likewise, the compressed audio data may be compressed using a compression method such as AC3, AAC, ATRAC, or ADPCM. The recording/playback unit 109 records the compressed image data, the compressed or uncompressed audio data, and various other data generated by the encoding processing unit 108 on the recording medium 110, and reads them from the recording medium 110. The recording medium 110 may be any type of recording medium capable of recording image data, audio data, and the like, such as a magnetic disk, an optical disk, or a semiconductor memory.

The control unit 111 controls each block of the imaging device 100 by transmitting control signals to the blocks of the imaging device 100 and the imaging unit 101, and comprises a CPU, memory, and the like for executing various controls. The memory 105 used by the control unit 111 includes ROM for storing various control programs and RAM for arithmetic processing, and also includes memory external to the control unit 111. The operation unit 112 comprises buttons, dials, and the like, and transmits instruction signals to the control unit 111 in response to user operations. In the imaging device of this embodiment, the operation unit 112 includes a shooting button for instructing the start and end of video recording, a zoom lever for instructing an optical or electronic zoom operation on the image, a cross key for various adjustments, and a confirmation key. The audio output unit 113 outputs the audio data or compressed audio data reproduced by the recording/playback unit 109, or audio data output by the control unit 111, to the speaker 114, an audio terminal, or the like. The external output unit 115 outputs the compressed video data, compressed audio data, audio data, and the like reproduced by the recording/playback unit 109 to an external device. The data bus 116 supplies various data, such as audio data and image data, and various control signals to each block of the imaging device 100.

The normal operation of the imaging device 100 of this embodiment will now be described.

When the user operates the operation unit 112 to issue a power-on instruction, the imaging device 100 of this embodiment supplies power from a power supply unit (not shown) to each block of the imaging device.

When power is supplied, the control unit 111 checks, based on an instruction signal from the operation unit 112, which mode (for example, shooting mode or playback mode) is selected by the mode selector switch of the operation unit 112. In video recording mode, the image data (video data) obtained by the imaging unit 101 and image processing unit 102 and the audio data obtained by the audio input unit 103 and audio processing unit 104 are saved as a video file. In playback mode, the compressed image data recorded on the recording medium 110 is played back by the recording/playback unit 109 and displayed on the display unit 107.

In video recording mode, the control unit 111 first transmits control signals to each block of the imaging device 100 to place the device in a shooting standby state, and the blocks operate as follows. The imaging unit 101 converts the optical image of the subject captured by the photographing optical lens into an image signal using the imaging element, and the image processing unit 102 performs image adjustment processing and the like to generate image data. The resulting image data is sent to the display control unit 106 and displayed on the display unit 107. The user prepares for shooting while viewing the screen displayed in this way.

The audio input unit 103 converts the analog audio signals obtained by multiple microphones into digital audio signals and processes them to generate multi-channel audio data. The resulting audio data is sent to the audio output unit 113 and output as sound from the connected speaker 114 or from earphones (not shown). While listening to this output, the user can also adjust the manual volume control that sets the recording level.

Next, when the user operates the recording button of the operation unit 112 and an instruction signal to start shooting is sent to the control unit 111, the control unit 111 transmits the instruction signal to start shooting to each block of the imaging device 100, and the blocks operate as follows.

The imaging unit 101 converts the optical image of the subject captured by the photographing optical lens into an image signal using the imaging element, and the image processing unit 102 performs image adjustment processing and the like to generate image data. The resulting image data is sent to the display control unit 106 and displayed on the display unit 107. It is also sent to the memory 105.

The audio input unit 103 converts the analog audio signals obtained by multiple microphones into digital audio signals, and the audio processing unit 104 processes the resulting digital audio signals to generate multi-channel audio data. The resulting audio data is sent to the memory 105. If there is only one microphone, the single analog audio signal is converted to digital form to generate audio data, which is then sent to the memory 105.

The encoding processing unit 108 reads the image data and audio data temporarily stored in the memory 105 and performs predetermined encoding to generate compressed image data, compressed audio data, and the like.

The control unit 111 then combines the compressed image data and compressed audio data into a data stream and outputs it to the recording/playback unit 109. If the audio data is not compressed, the control unit 111 combines the audio data stored in the memory 105 with the compressed image data into a data stream and outputs it to the recording/playback unit 109. The recording/playback unit 109 writes the data stream to the recording medium 110 as a single video file under the management of a file system such as UDF or FAT. These operations continue for as long as shooting is in progress.

Then, when the user operates the recording button of the operation unit 112 and an instruction signal to end shooting is sent to the control unit 111, the control unit 111 transmits the instruction signal to end shooting to each block of the imaging device 100, and the blocks operate as follows.

The imaging unit 101, image processing unit 102, audio input unit 103, and audio processing unit 104 stop generating image data and audio data, respectively. The encoding processing unit 108 reads the remaining image data and audio data stored in the memory, performs the predetermined encoding, and stops operating once it has finished generating the compressed image data, compressed audio data, and the like. If the audio data is not compressed, it stops operating once it has finished generating the compressed image data.

The control unit 111 then combines this final compressed image data with the compressed or uncompressed audio data into a data stream and outputs it to the recording/playback unit 109. The recording/playback unit 109 writes the data stream to the recording medium 110 as a single video file under the management of a file system such as UDF or FAT. When the supply of the data stream stops, the recording/playback unit 109 finalizes the video file and stops the recording operation. Once the recording operation has stopped, the control unit 111 transmits control signals to each block of the imaging device 100 to return the device to the shooting standby state.

Next, in playback mode, the control unit 111 transmits control signals to each block of the imaging device 100 to place the device in a playback state, and the blocks operate as follows. The recording/playback unit 109 reads a video file consisting of compressed image data and compressed audio data recorded on the recording medium 110, and sends the compressed image data and compressed audio data it has read to the encoding processing unit 108.

The encoding processing unit 108 decodes the compressed image data and compressed audio data and sends them to the display control unit 106 and the audio output unit 113, respectively. The display control unit 106 displays the decoded image data on the display unit 107. The audio output unit 113 outputs the decoded audio data from the built-in speaker or an attached external speaker.

As described above, the imaging device 100 of this embodiment can record and play back images and audio.

In this embodiment, when obtaining an audio signal, the audio input unit 103 and audio processing unit 104 perform processing such as level adjustment on the audio signal obtained by the microphone. This processing may run continuously from the time the device starts up, may begin when a shooting mode is selected, or may begin when a mode related to audio recording is selected. Alternatively, in a mode related to audio recording, the processing may begin in response to the start of audio recording. In this embodiment, the processing is assumed to begin at the moment video shooting starts.

FIG. 2 is a block diagram showing an example of the detailed configuration of the imaging unit 101, image processing unit 102, audio input unit 103, and audio processing unit 104 of the imaging device 100 of this embodiment.

The imaging unit 101 has an optical system, including an optical lens 201 that captures an optical image of a subject, and an imaging element 202 that converts the optical image captured by the optical lens 201 into an electrical signal (image signal). It also has an optical lens control unit 203 that includes a known drive mechanism, such as a position sensor and a motor, for moving the optical lens 201. In this embodiment the optical lens 201 and optical lens control unit 203 are described as being built into the imaging unit 101, but they may instead form a detachable, interchangeable lens. For example, when the user operates the operation unit 112 to input an instruction such as a zoom operation or focus adjustment, the control unit 111 transmits a control signal (drive signal) for moving the optical lens to the optical lens control unit 203. In response to this control signal, the optical lens control unit 203 checks the position of the optical lens 201 with the position sensor and moves the optical lens 201 with the motor or the like.

In the image processing unit 102, the image adjustment unit 221 applies various image quality adjustments to the image signal converted by the imaging element 202 to form image data, which is sent to the memory 105 via the data bus 116. Based on this image data, the control unit 111 performs various adjustments such as focus adjustment and light quantity adjustment.

Furthermore, in this embodiment, the image processing unit 102 has various detection functions. The person detection unit 222 extracts feature points of a person's face, such as the eyes, nose, and mouth, from the image data formed by the image adjustment unit 221, and from them detects the position of the person in the image data, the size of the face, and so on. By storing the feature point information in the memory 105, it is also possible to recognize individual subjects based on that information. The person detection unit 222 also has a person movement detection unit 223 that detects movements of the lips and head, and a person speech detection unit 224 that uses those movements to determine whether the person is speaking. The image processing unit 102 also has a main target selection unit 225 that selects which of the people detected by the person detection unit 222 is to be the primary subject of audio processing (hereinafter also referred to as the main subject or main target). The main target selection unit 225 selects the main target based on conditions set by the control unit 111. These selection conditions are described later.

The image processing unit 102 further includes a conversation group detection unit 226. From among the people detected by the person detection unit 222, the conversation group detection unit 226 detects the people who are conversing with the person selected by the main target selection unit 225. This detection is based on the positional relationship between the people, the direction of their faces, their movements, and so on. For example, the conversation group detection unit 226 determines that the subject closest to the main target is a person conversing with the main target (a related person). As another example, it determines that a subject facing the direction in which the main target's body, face, or line of sight is oriented is a person conversing with the main target. In addition, when the main target is moving, it determines that a subject located ahead in the direction of movement is a person conversing with the main target, because such a subject is expected to converse with the main target in the near future.
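
The three cues above (proximity, facing direction, and direction of movement) can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the detection logic of the conversation group detection unit 226: the `Person` record, the 2D coordinates, the `tolerance_deg` threshold, and the rule that any single cue is sufficient are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Person:
    """Hypothetical per-subject record; none of these fields come from the patent."""
    name: str
    x: float                 # position in the frame (arbitrary units)
    y: float
    facing_deg: float = 0.0  # direction the face points (0 = +x axis)
    vx: float = 0.0          # velocity; (0, 0) means standing still
    vy: float = 0.0


def _bearing(src: Person, dst: Person) -> float:
    """Angle in degrees of the line from src to dst."""
    return math.degrees(math.atan2(dst.y - src.y, dst.x - src.x))


def _angle_diff(a: float, b: float) -> float:
    """Smallest absolute difference between two angles, in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def conversation_partners(main: Person, others: list, tolerance_deg: float = 30.0) -> set:
    """Return names of subjects treated as conversing with the main target."""
    partners = set()
    if not others:
        return partners
    # Cue 1: the subject closest to the main target.
    nearest = min(others, key=lambda p: math.hypot(p.x - main.x, p.y - main.y))
    partners.add(nearest.name)
    moving = main.vx != 0.0 or main.vy != 0.0
    heading = math.degrees(math.atan2(main.vy, main.vx)) if moving else 0.0
    for p in others:
        bearing = _bearing(main, p)
        # Cue 2: the main target's face is oriented toward this subject.
        if _angle_diff(main.facing_deg, bearing) <= tolerance_deg:
            partners.add(p.name)
        # Cue 3: the main target is moving toward this subject.
        if moving and _angle_diff(heading, bearing) <= tolerance_deg:
            partners.add(p.name)
    return partners
```

For example, with a main target at the origin facing along the x axis and walking along the y axis, a nearby subject straight ahead and a distant subject in the walking direction would both be returned, while a subject behind the main target would not.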

Note that when the conversation group detection unit 226 determines that a person who was conversing with the main target has not conversed with the main target for longer than a predetermined time, it treats that person as no longer conversing with (not related to) the main target. In other words, a person who was conversing with the main target continues to be treated as conversing with the main target, even if judged not to be conversing at a given moment, as long as the predetermined time has not elapsed.
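
The hold time described above behaves like a simple grace period. A minimal sketch, assuming timestamps in seconds and a hypothetical `ConversationMembership` helper (the class name, its interface, and the per-person dictionary are illustrative and not taken from the patent):

```python
class ConversationMembership:
    """Keeps a person in the conversation group until a hold time expires."""

    def __init__(self, hold_seconds: float):
        self.hold_seconds = hold_seconds   # the "predetermined time"
        self._last_conversing = {}         # person id -> last time judged conversing

    def update(self, person_id: str, conversing_now: bool, now: float) -> bool:
        """Return True if the person is still treated as conversing with the main target."""
        if conversing_now:
            self._last_conversing[person_id] = now
            return True
        last = self._last_conversing.get(person_id)
        if last is None:
            return False                   # never part of the conversation
        if now - last > self.hold_seconds:
            del self._last_conversing[person_id]
            return False                   # silent for longer than the hold time
        return True                        # still within the hold time
```

With a hold time of 3 seconds, a person judged to be conversing at t = 0 is still treated as a member at t = 2 and t = 3 even if silent, and is dropped at t = 3.5.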

Next, the audio input unit 103 and audio processing unit 104 will be described. The audio input unit 103 has a microphone 211 that converts sound vibrations into an electrical signal and outputs it as an audio signal. In this embodiment the microphone 211 is a stereo configuration with two channels, Lch and Rch (left and right), but it may be a one-channel monaural configuration or a configuration with multiple microphones providing two or more channels. The A/D conversion unit 212 converts the analog audio signal obtained by the microphone 211 into a digital audio signal.

The audio processing unit 104 is a block that applies various audio processes to the audio signal converted by the audio input unit 103. In this embodiment, the audio processing unit 104 has an audio extraction unit 213, an audio adjustment unit 215, and an audio synthesis unit 217. The audio extraction unit 213 can separate (determine) the audio into human voices and all other sound (hereinafter, "non-human audio"). In addition, the human voice extraction unit 214 can separate the human voices into the voice of each individual, based on information from the person detection unit 222. For example, the human voice extraction unit 214 separates individual voices based on the frequency, loudness, and intonation of the audio. Furthermore, in the first embodiment, the control unit 111 can associate each voice with a subject based on the voices extracted by the human voice extraction unit 214 and the subject movements detected by the image processing unit 102. Examples of such subject movements are the frequency of speech, the timing of utterances, and mouth movements.
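
One way to picture the association between voices and subjects is to compare, frame by frame, when each separated voice is active with when each subject's mouth is moving. The sketch below is an assumption made for illustration: the agreement score, the per-frame boolean representation, and the function names are hypothetical, and the patent does not specify how the control unit 111 performs the matching.

```python
def agreement(voice_active: list, mouth_moving: list) -> float:
    """Fraction of frames in which voice activity and mouth movement agree."""
    if len(voice_active) != len(mouth_moving) or not voice_active:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(v == m for v, m in zip(voice_active, mouth_moving))
    return matches / len(voice_active)


def associate_voices(voices: dict, subjects: dict) -> dict:
    """Map each voice id to the subject id whose mouth movement matches it best.

    voices:   voice id   -> per-frame booleans (True = voice is active)
    subjects: subject id -> per-frame booleans (True = mouth is moving)
    """
    return {
        voice_id: max(subjects, key=lambda sid: agreement(activity, subjects[sid]))
        for voice_id, activity in voices.items()
    }
```

This sketch allows two voices to map to the same subject; a one-to-one assignment would need an extra constraint that is not shown here.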

The audio adjustment unit 215 can individually apply audio processing, such as level adjustment and per-frequency-band processing with an equalizer, to each voice extracted by the audio extraction unit 213. In particular, the conversation audio adjustment unit 216 makes adjustments based on information from the conversation group detection unit 226, either emphasizing an extracted voice so that it is easier to hear or attenuating it so that it is less prominent. The details of these adjustments are described later. The audio synthesis unit 217 then mixes the voices individually adjusted by the audio adjustment unit 215 back into a single audio signal. The amplitude of the mixed audio signal is adjusted to a predetermined level by an auto level controller (hereinafter, ALC 219). With this configuration, the audio processing unit 104 applies the predetermined processing to the audio signal, forms audio data, and sends it to the memory 105.
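
The adjust, mix, and level-control chain described above can be sketched as follows. This is a simplified illustration under stated assumptions: gains are given in decibels per voice, samples are plain floats, and `auto_level` is a static peak normalization standing in for the ALC (a real auto level controller adapts its gain over time). The function names and parameters are hypothetical.

```python
def mix_voices(voices: dict, gains_db: dict) -> list:
    """Apply a per-voice gain (in dB) and sum the voices into one signal.

    voices:   voice id -> list of float samples
    gains_db: voice id -> gain in dB (positive emphasizes, negative attenuates);
              voices without an entry are mixed at 0 dB
    """
    length = max((len(samples) for samples in voices.values()), default=0)
    mixed = [0.0] * length
    for voice_id, samples in voices.items():
        gain = 10.0 ** (gains_db.get(voice_id, 0.0) / 20.0)  # dB -> linear amplitude
        for i, sample in enumerate(samples):
            mixed[i] += gain * sample
    return mixed


def auto_level(samples: list, target_peak: float = 0.5) -> list:
    """Scale the signal so that its peak amplitude equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence: nothing to scale
    scale = target_peak / peak
    return [s * scale for s in samples]
```

For instance, `auto_level(mix_voices(voices, {"main": 6.0, "bystander": -6.0}))` would raise the main target's voice and lower a bystander's voice before the overall level is normalized.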

FIG. 3 is a block diagram showing another example of the configuration of the image processing unit 102 and audio processing unit 104 of the imaging device 100 of this embodiment. FIG. 3 differs from FIG. 2 in the input sources of the image data and audio data. In FIG. 2, the image signal comes from the imaging unit 101 and the audio signal comes from the audio input unit 103. In FIG. 3, by contrast, the image and audio inputs are data stored in the memory 105. Using data that has first been stored (held) in the memory 105 in this way makes it possible to apply the proposed method not only during shooting but also as post-processing after recording. It also allows the main target selection unit 225 to select the person targeted by audio processing from the video data as a whole.

Examples of how the main target selection unit 225 selects the main target will now be described with reference to FIG. 4. In this embodiment, the main target is described as the person the photographer is presumed to be paying attention to. In FIG. 4(a), for example, the focus mark 402 indicates the subject on which the imaging device 100 is focusing. Because the main target 401 and the focus mark 402 coincide in FIG. 4(a), the imaging device 100 recognizes the main target 401 as the primary subject and is focusing on it. The main target selection unit 225 judges this person 401 to be the main target. In this way, the person that the device recognizes as the main subject can be selected as the main target.

また、図４（ｂ）では登録された顔画像を用いる方法を示している。登録顔画像４０３はメモリ１０５に事前に登録された被写体の画像である。主対象選定部２２５はその画像の顔と一致すると判断された人物を主対象と選定する。 Figure 4(b) also shows a method using a registered face image. The registered face image 403 is an image of a subject that has been registered in advance in the memory 105. The main subject selection unit 225 selects a person whose face is determined to match the face in the image as the main subject.

また、図４（ｃ）では撮影者の意思によって主対象を決める方法を示す。表示部１０７に表示されている人物に対して、撮影者が表示部１０７のタッチパネルに対してタッチすることで主対象となる被写体を選択する。主対象選定部２２５は、撮影者によって選択された被写体を主対象として判断する。 Figure 4(c) also shows a method for determining the main subject at the photographer's discretion. The photographer selects the main subject by touching the touch panel of the display unit 107 from among the people displayed on the display unit 107. The main subject selection unit 225 determines that the subject selected by the photographer is the main subject.

また、図４（ｄ）では記録済みの動画データを用いる方法を示している。例えば、記録済みの動画データ４０４がメモリ１０５に記録されている場合、主対象選定部２２５は、動画データ４０４の中で最も登場頻度の高い人物４０５を主対象として判断する。ほかにも、例えば、主対象選定部２２５は、フォーカス合焦頻度の高い人物を選択してもよい。 Also, Figure 4(d) shows a method using pre-recorded video data. For example, if pre-recorded video data 404 is stored in memory 105, the main subject selection unit 225 determines the person 405 who appears most frequently in the video data 404 as the main subject. Alternatively, for example, the main subject selection unit 225 may select a person who is frequently in focus.
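The appearance-frequency selection of Figure 4(d) can be illustrated with a minimal Python sketch. It assumes that person detection has already produced, for each frame of the recorded video data, a list of person identifiers; the function name `select_most_frequent_person` and the identifier strings are hypothetical and do not appear in the original description.

```python
from collections import Counter


def select_most_frequent_person(frames):
    """Return the person ID that appears in the largest number of frames.

    frames: iterable of collections of person IDs detected in each frame
            (assumed to come from a person detector; hypothetical input).
    Returns None when no person was detected in any frame.
    """
    counts = Counter()
    for person_ids in frames:
        # Count each person at most once per frame.
        counts.update(set(person_ids))
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

The same counting approach would apply to the alternative mentioned above (selecting the person who is most often in focus) by passing only the in-focus person of each frame.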

なお、主対象選定部２２５は、例えばフォーカスが合わせられている被写体を主対象とする場合、その主対象に対するフォーカスが外れても、所定時間内にその被写体にフォーカスが戻れば主対象として維持する。言い換えれば、主対象選定部２２５は、主対象からフォーカスが所定時間より長く外れた場合、新たに主対象となる被写体を選定する。 Note that when, for example, the main subject selection unit 225 treats the in-focus subject as the main subject, it keeps that subject as the main subject even if focus on it is lost, provided that focus returns to that subject within a predetermined time. In other words, if focus stays off the main subject for longer than the predetermined time, the main subject selection unit 225 selects a new main subject.
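The hold behavior described in this paragraph can be sketched as a small state machine. This is only an illustration under stated assumptions: the class name `MainSubjectTracker`, the `hold_seconds` parameter, and the use of timestamps in seconds are hypothetical, and the description does not specify the value of the predetermined time.

```python
class MainSubjectTracker:
    """Keeps the main subject across short focus losses (illustrative sketch)."""

    def __init__(self, hold_seconds):
        self.hold_seconds = hold_seconds  # the "predetermined time" (assumed value)
        self.main_subject = None
        self._lost_since = None           # time at which focus left the main subject

    def update(self, focused_id, now):
        """focused_id: ID of the subject currently in focus; now: time in seconds."""
        if self.main_subject is None:
            # No main subject yet: adopt whoever is in focus.
            self.main_subject = focused_id
            self._lost_since = None
        elif focused_id == self.main_subject:
            # Focus is on (or has returned to) the main subject: cancel the timer.
            self._lost_since = None
        else:
            if self._lost_since is None:
                self._lost_since = now
            # Reselect only after focus has been away longer than the hold time.
            if now - self._lost_since > self.hold_seconds:
                self.main_subject = focused_id
                self._lost_since = None
        return self.main_subject
```

Calling `update` once per frame with the current focus target yields a main subject that does not flicker when focus briefly moves to another person.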

続いて、本実施形態の撮像装置１００の動作について図５～図７を用いて説明する。 Next, the operation of the imaging device 100 of this embodiment will be explained using Figures 5 to 7.

図５は撮像装置１００の一連の録画記録シーケンスの一例を示すフローチャートである。この撮像装置１００の処理は、ＲＯＭ（不図示）に記録されたソフトウェアをメモリ１０５に展開してＣＰＵが実行することで実現する。また、本フローチャートの処理は、撮像装置１００が電源オンされたことをトリガに開始される。 Figure 5 is a flowchart showing an example of a video recording sequence for the imaging device 100. The processing of this imaging device 100 is realized by software stored in ROM (not shown) being loaded into memory 105 and executed by the CPU. The processing of this flowchart is triggered when the imaging device 100 is powered on.

ステップＳ５０１では、制御部１１１は、ユーザによる操作部１１２の操作により動画記録を開始するための指示を受け付ける。 In step S501, the control unit 111 accepts an instruction to start video recording via a user's operation of the operation unit 112.

ステップＳ５０２では、制御部１１１は、音声録音するための音声のパスを接続する。 In step S502, the control unit 111 connects an audio path for recording audio.

ステップＳ５０３では、制御部１１１は、音声パスが確立した後、本実施形態で説明する制御を含めた信号処理の初期設定をおこない、動画記録のための信号処理を開始する。以降、録音シーケンスについて焦点を当てて説明する。動画記録のための信号処理が終了するまで、制御部１１１は動画に記録される映像を記録している。 In step S503, after the audio path is established, the control unit 111 performs the initial settings for signal processing, including the control described in this embodiment, and starts signal processing for video recording. The following description focuses on the audio recording sequence. Until the signal processing for video recording ends, the control unit 111 continues to record the images that make up the video.

ステップＳ５０４では、画像処理部１０２の人物検出部２２２は被写体を検出する。 In step S504, the person detection unit 222 of the image processing unit 102 detects the subject.

ステップＳ５０５では、画像処理部１０２の主対象選定部２２５は、ステップＳ５０４において検出された被写体から、主対象を選定（判断）する。 In step S505, the main subject selection unit 225 of the image processing unit 102 selects (determines) a main subject from the subjects detected in step S504.

ステップＳ５０６では、画像処理部１０２の会話グループ検出部２２６は、ステップＳ５０５において選定された主対象と会話している人物（被写体）を判断する。 In step S506, the conversation group detection unit 226 of the image processing unit 102 determines the person (subject) who is conversing with the main subject selected in step S505.

ステップＳ５０７では、音声処理部１０４の音声抽出部２１３は、人物音声の抽出を行う。 In step S507, the audio extraction unit 213 of the audio processing unit 104 extracts the person's audio.

音声処理部１０４の音声調整部２１５は、ステップＳ５０７において抽出された音声に対して調整処理を行う。ステップＳ５０７において抽出された音声の被写体（人物）が主対象の会話グループに属する被写体（人物）か否かで音声調整処理の内容を異ならせる。音声調整処理の詳細については、図６、図７を用いて後述するが、本フローチャートでは簡易的に説明する。 The audio adjustment unit 215 of the audio processing unit 104 performs adjustment processing on the audio extracted in step S507. The content of the audio adjustment processing differs depending on whether the subject (person) of the audio extracted in step S507 is a subject (person) belonging to the main conversation group. Details of the audio adjustment processing will be described later using Figures 6 and 7, but this flowchart will provide a simplified explanation.

ステップＳ５０８では、音声処理部１０４の音声調整部２１５は、ステップＳ５０７において抽出された音声の人物が主対象の会話グループに属する被写体か否かを判断する。抽出された音声の人物が主対象の会話グループに属する被写体である場合、ステップＳ５０９の処理が実行される。抽出された音声の人物が主対象の会話グループに属する被写体ではない場合、ステップＳ５１０の処理が実行される。 In step S508, the audio adjustment unit 215 of the audio processing unit 104 determines whether the person whose voice was extracted in step S507 is a subject belonging to the main conversation group. If the person whose voice was extracted is a subject belonging to the main conversation group, the process of step S509 is executed. If the person whose voice was extracted is not a subject belonging to the main conversation group, the process of step S510 is executed.

ステップＳ５０９では、音声処理部１０４の音声調整部２１５は、抽出された音声の音量が大きくなるようにレベル調整する。 In step S509, the audio adjustment unit 215 of the audio processing unit 104 adjusts the level of the extracted audio so that the volume is increased.

ステップＳ５１０では、音声処理部１０４の音声調整部２１５は、抽出された音声の音量が小さくなるようにレベル調整する。 In step S510, the audio adjustment unit 215 of the audio processing unit 104 adjusts the level of the extracted audio to reduce its volume.

ステップＳ５１１では、音声処理部１０４の音声調整部２１５は、抽出された音声に対して、音量以外の調整処理を行う。 In step S511, the audio adjustment unit 215 of the audio processing unit 104 performs adjustment processing other than volume on the extracted audio.

ステップＳ５１２では、音声処理部１０４の音声合成部２１７は、個別に音声調整された抽出音声を合成し、ひとつの音声データを生成する。 In step S512, the audio synthesis unit 217 of the audio processing unit 104 combines the individually adjusted extracted audio signals into a single piece of audio data.

ステップＳ５１３では、制御部１１１は、動画記録を終了するか否かを判断する。例えば、制御部１１１は、ユーザによる操作部１１２の操作によって動画記録の終了を指示された場合や、記録媒体１１０の残り容量が少ないと判断された場合に、動画記録を終了すると判断する。動画記録を終了しないと判断された場合、ステップＳ５０４の処理に戻り、録音シーケンス処理が継続される。動画記録を終了すると判断された場合、ステップＳ５１４の処理が実行される。 In step S513, the control unit 111 determines whether to end video recording. For example, the control unit 111 determines to end video recording when the user operates the operation unit 112 to instruct the end of video recording, or when it is determined that the remaining capacity of the recording medium 110 is low. If it is determined not to end video recording, the process returns to step S504, and the recording sequence process continues. If it is determined to end video recording, the process of step S514 is executed.

ここで、動画記録を終了しないと判断された場合、ステップＳ５０４の処理に戻る。すなわち、動画記録中は、繰り返し主対象および、主対象と会話している人物が判断される。これにより、例えば、主対象である被写体が画角外に消えた場合やフォーカスが外れた場合でも、制御部１１１は別の被写体を主対象として決定できる。また、主対象と会話している人物の人数が増減した場合でも、制御部１１１はそれに合わせて主対象と会話している人物を決定することができる。 If it is determined here that video recording should not be ended, the process returns to step S504. In other words, while video recording is in progress, the main subject and the people who are conversing with the main subject are repeatedly determined. This allows the control unit 111 to determine a different subject as the main subject, even if, for example, the main subject disappears outside the angle of view or goes out of focus. Also, even if the number of people conversing with the main subject increases or decreases, the control unit 111 can determine the people who are conversing with the main subject accordingly.

ステップＳ５１４では、制御部１１１は、音声パスを切断し、信号処理を終了する。 In step S514, the control unit 111 disconnects the audio path and terminates signal processing.

ここで、図６および図７を用いて、音声調整処理について説明する。 Here, we will explain the audio adjustment process using Figures 6 and 7.

図６は音声調整処理の想定シーンを示す図である。いま、人物６０２～人物６０５の４人の被写体（人物）が画角６０１の中に存在し、人物６０２は人物６０３と、人物６０４は人物６０５とそれぞれ会話（発声）をしているものとする。このとき、主対象選定部２２５が選定する、音声処理の主対象が人物６０２であった場合、人物６０２と人物６０３とは、画像データから会話グループ検出部２２６によって会話グループ６１０として検出される。この場合、人物６０２、人物６０３の音声は注目すべき音声として強調するように音声調整され、人物６０４と人物６０５の音声は強調対象ではない不要な音声として音声調整される。 Figure 6 shows a scene assumed for the audio adjustment processing. Assume that four subjects (people), person 602 to person 605, are present within the angle of view 601, that person 602 is conversing (speaking) with person 603, and that person 604 is conversing (speaking) with person 605. If the main subject selection unit 225 selects person 602 as the main subject for audio processing, the conversation group detection unit 226 detects persons 602 and 603 as conversation group 610 from the image data. In this case, the voices of persons 602 and 603 are adjusted so as to be emphasized as voices of interest, while the voices of persons 604 and 605 are adjusted as unnecessary voices that are not targets for emphasis.

図７（ａ）～（ｃ）は音声調整処理を示す図である。図７では、図６における人物６０２、人物６０３、人物６０４をそれぞれ人物Ａ、Ｂ、Ｃとして表記している（人物６０５は不図示）。 Figures 7(a) to (c) show the audio adjustment process. In Figure 7, person 602, person 603, and person 604 in Figure 6 are represented as person A, person B, and person C, respectively (person 605 is not shown).

図７（ａ）は人物音声抽出部２１４にて抽出された、人物Ａ～Ｃのそれぞれの音声信号を示している。つまり、信号７０１は人物Ａ、信号７０２は人物Ｂ、信号７０３は人物Ｃのそれぞれ抽出された音声信号を示している。そして、それぞれの信号において、振幅の大きな区間は、それぞれの人物が発話（発声）している期間（有声タイミング）を示しており、振幅の小さな区間は発話していない期間（無声タイミング）を示している。例えば、信号７０４と信号７０５とを比較してみると、人物Ａと人物Ｂとは会話しているため、有声タイミングと無声タイミングとがほぼ交互に現れている。一方、人物Ｃは人物ＡおよびＢの会話の相手ではないため、信号７０６は信号７０４と信号７０５とは有声タイミングと無声タイミングが交互に現れることは少ない。 Figure 7(a) shows the voice signals of persons A to C extracted by the person voice extraction unit 214. That is, signal 701 is the extracted voice signal of person A, signal 702 that of person B, and signal 703 that of person C. In each signal, sections with large amplitude indicate periods in which the person is speaking (voiced timing), and sections with small amplitude indicate periods in which the person is not speaking (silent timing). For example, comparing signals 704 and 705 shows that, because persons A and B are conversing with each other, their voiced and silent timings appear almost alternately. In contrast, because person C is not a conversation partner of persons A and B, the voiced and silent timings of signal 706 rarely alternate with those of signals 704 and 705.
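The relationship between amplitude and voiced or silent timing, and the observation that conversation partners tend to speak in turn, can be illustrated as follows. This is a rough sketch and not the detection method of the embodiment (which detects the conversation group from image data): the frame length, the amplitude threshold, and the helper names `voiced_frames` and `alternation_score` are all assumptions, and the score is only a crude proxy for turn-taking.

```python
def voiced_frames(samples, frame_len, threshold):
    """Mark each frame as voiced when its mean absolute amplitude exceeds threshold."""
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        flags.append(sum(abs(x) for x in frame) / frame_len > threshold)
    return flags


def alternation_score(voiced_a, voiced_b):
    """Fraction of frames containing speech in which exactly one of the two speaks.

    Values near 1.0 suggest the two speakers take turns; lower values mean
    they frequently talk over each other, as unrelated speakers might.
    """
    active = 0
    exclusive = 0
    for a, b in zip(voiced_a, voiced_b):
        if a or b:
            active += 1
            if a != b:
                exclusive += 1
    return exclusive / active if active else 0.0
```

With signals like those of persons A and B the score would be close to 1.0, whereas pairing either of them with person C would typically give a lower value.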

図７（ｂ）は、それぞれの人物に対しての音声の補正係数を示している。本実施形態においては補正係数が１．０のときはレベル調整（ゲイン調整）が行われないことを示す。また、補正係数が１．０よりも大きい場合の処理は、その音声を強調して聞き取りやすくする（より大きい音量にする）ための音声調整処理であり、係数が１．０よりも小さい場合の処理は、音声を聞こえにくくする（より小さい音量にする）ための処理である。 Figure 7 (b) shows the audio correction coefficient for each person. In this embodiment, a correction coefficient of 1.0 indicates that no level adjustment (gain adjustment) is performed. Furthermore, when the correction coefficient is greater than 1.0, the processing is an audio adjustment process that emphasizes the audio to make it easier to hear (increase the volume), and when the coefficient is less than 1.0, the processing is a process that makes the audio harder to hear (decrease the volume).

例えば、会話グループ検出部２２６によって、期間７１０の間は人物Ａと人物Ｂが会話していると判定された場合を例に説明する。この場合、人物Ａは主対象であることから、会話音声調整部２１６は、人物Ａと人物Ｂのそれぞれの音声を強調する対象として認識し、それぞれの音声に対する補正係数を大きい値にする（係数７１４、係数７１５）。本実施形態では、人物Ａと人物Ｂとの音声に対する補正係数を同じ値にする。これは、撮影者であるユーザはどちらの音声も等しく聞いていることが想定されるからである。一方、会話音声調整部２１６は、人物Ａと会話していないと判断された人物Ｃの音声に対する補正係数を小さく設定し、人物Ｃの音声を比較的聞き取りにくくなるようにする（係数７１６）。このように、会話音声調整部２１６は、主対象の人物Ａおよびその会話相手である人物Ｂの音声を強調し、それ以外の音声を小さくする。例えば、会話音声調整部２１６は、主対象の人物Ａおよびその会話相手である人物Ｂの音声に対するゲインやレベルを、それ以外の音声に対するものより大きくする。これにより、映像および音声が撮影者であるユーザのイメージに沿った動画データとなる。 As an example, consider the case where the conversation group detection unit 226 determines that person A and person B are conversing during period 710. Because person A is the main subject, the conversation audio adjustment unit 216 treats the voices of both person A and person B as targets for emphasis and sets a large correction coefficient for each (coefficients 714 and 715). In this embodiment, the correction coefficients for the voices of person A and person B are set to the same value, because the user (the photographer) is assumed to be listening to both voices equally. On the other hand, the conversation audio adjustment unit 216 sets a small correction coefficient for the voice of person C, who is determined not to be conversing with person A, so that person C's voice becomes comparatively hard to hear (coefficient 716). In this way, the conversation audio adjustment unit 216 emphasizes the voices of the main subject, person A, and the conversation partner, person B, and reduces the other voices. For example, the conversation audio adjustment unit 216 makes the gain or level applied to the voices of person A and person B larger than that applied to the other voices. The resulting video data therefore has images and audio that match the impression held by the user who shot it.

そして、図７（ｃ）は、前述の図７（ｂ）の補正係数に基づいて調整処理された音声信号を示している。例えば、会話音声調整部２１６による音声調整をゲイン調整によって実現した場合、期間７１０の間は、会話判定された人物Ａと人物Ｂの音声（信号７２４、信号７２５）は補正係数が１．０よりも大きいため、音量が大きくなりユーザにとって聞こえやすくなる。また、会話判定されなかった人物Ｃの音声（信号７２６）は、補正係数が１．０よりも小さいため、音量が小さくなり聞こえづらくなる。このように個別調整された抽出音声が音声合成部２１７にて合成されることで、結果として注目対象として判定された会話のみが聞き取りやすい音声データとして生成される。 Figure 7(c) shows the audio signals after adjustment based on the correction coefficients of Figure 7(b). For example, if the conversation audio adjustment unit 216 performs the adjustment by gain adjustment, then during period 710 the voices of persons A and B, who were determined to be conversing (signals 724 and 725), have correction coefficients greater than 1.0 and therefore become louder and easier for the user to hear. The voice of person C, who was not determined to be conversing (signal 726), has a correction coefficient less than 1.0 and therefore becomes quieter and harder to hear. The audio synthesis unit 217 then combines the individually adjusted extracted voices, producing audio data in which only the conversation determined to be of interest is easy to hear.

なお、本実施形態では、主被写体に関する音声を強調（大きくなるよう補正）し、主被写体と関係のない音声を聞こえにくくした（小さくなるように補正した）が、どちらか一方にだけ調整を適用しても構わない。すなわち、主対象となる被写体（人物）およびその会話対象である被写体（人物）の補正係数が、その他の被写体の補正係数よりも大きければよい。 Note that in this embodiment, the sound related to the main subject is emphasized (corrected to be louder) and the sound unrelated to the main subject is made less audible (corrected to be quieter), but adjustments can be applied to only one of them. In other words, it is sufficient if the correction coefficients for the main subject (person) and the subjects (people) with whom they are talking are larger than the correction coefficients for the other subjects.
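The gain adjustment of Figure 7(b) and the subsequent mixing can be sketched in Python as below. The coefficient values 1.5 and 0.5 are placeholders chosen only to satisfy the stated condition (greater than 1.0 for the conversation group, less than 1.0 for others); the description gives no concrete values. The function name `mix_with_gains` and the representation of each extracted voice as a list of float samples are likewise assumptions.

```python
def mix_with_gains(tracks, group, boost=1.5, cut=0.5):
    """Apply a per-person correction coefficient and sum the tracks.

    tracks: dict mapping person ID -> list of float samples (equal lengths assumed)
    group:  set of person IDs in the main subject's conversation group
    boost:  coefficient for group members (assumed value, > 1.0)
    cut:    coefficient for everyone else (assumed value, < 1.0)
    """
    length = len(next(iter(tracks.values()))) if tracks else 0
    mixed = [0.0] * length
    for person_id, samples in tracks.items():
        gain = boost if person_id in group else cut
        for i, sample in enumerate(samples):
            mixed[i] += gain * sample
    return mixed
```

Setting `boost=1.0` or `cut=1.0` corresponds to applying the adjustment to only one side, which the paragraph above permits as long as the group's coefficient remains the larger of the two. The sketch does not handle clipping of the summed signal.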

また、会話音声調整部２１６による強調手法も、前述のようなゲイン全体の調整に限らず、イコライザなどにより人物音声の周波数帯域において周波数別に調整しても構わない。 Furthermore, the emphasis method used by the conversational voice adjustment unit 216 is not limited to adjusting the overall gain as described above, but may also be adjusted by frequency within the frequency band of human voices using an equalizer or the like.

［第二の実施形態］ [Second embodiment]
第一の実施形態では、主対象を選定後、主対象と会話している人物を主対象との位置関係や人物の動作により会話グループを検出し、会話グループの音声を強調し、もしくは不要である他の音声は抑え、注目すべき会話が聞き取りやすい音声データを取得している。 In the first embodiment, after the main subject is selected, a conversation group consisting of the people conversing with the main subject is detected from their positions relative to the main subject and their movements; the audio of the conversation group is then emphasized, or other unnecessary audio is suppressed, to obtain audio data in which the conversation of interest is easy to hear.

第一の実施形態では、会話グループの検出方法は、人物検出部２２２において検出された人物のうちから、主対象選定部２２５にて選定された人物と会話している人物同士の、位置関係や、顔の向き、動作などによって判断されている。このように、第一の実施形態では、会話グループ検出部２２６の検出は、撮像装置１００の画角６０１内に存在する人物によって行われている。 In the first embodiment, the conversation group is determined, from among the people detected by the person detection unit 222, on the basis of the positional relationships, facial orientations, movements, and the like of the people conversing with the person selected by the main subject selection unit 225. Thus, in the first embodiment, detection by the conversation group detection unit 226 is limited to people present within the angle of view 601 of the imaging device 100.

いま、図１０（ａ）のように主対象である人物Ａと、画角６０１内の人物Ｂ、人物Ｄ（６０３、６０６）が会話グループとして検出されたとする。撮影者によるズーム操作やパンニング操作により人物Ｂが画角からはずれてしまった場合、人物Ａ、人物Ｂ、人物Ｄの会話は継続されていても、次の会話グループの検出では人物Ｂは図１０（ｂ）のように会話グループから外れてしまう。その結果、人物Ｂが会話に参加していても、会話グループ検出部２２６は、人物Ｂを会話グループと判断しないため、人物Ｂの音声だけが強調されず聞き取りづらい会話となってしまうおそれがある。 Now, suppose that person A, who is the main subject, and persons B and D (603, 606) within the angle of view 601 are detected as a conversation group, as shown in Figure 10(a). If person B moves out of the angle of view due to a zoom or panning operation by the photographer, even if the conversation between persons A, B, and D continues, person B will be removed from the conversation group when the next conversation group is detected, as shown in Figure 10(b). As a result, even if person B is participating in the conversation, the conversation group detection unit 226 does not determine that person B is part of the conversation group, which could result in the voice of person B not being emphasized and making the conversation difficult to hear.

第二の実施形態は、画角内にいた会話グループの少なくとも１人が画角からはずれても、画角からずれた人の会話が継続していると判断した時には、会話グループを画角からはずれる前の状態で維持し、聞き取りやすい音声を取得し続けることを目的とする。 The second embodiment aims to keep capturing easy-to-hear audio: even if at least one member of a conversation group that was within the angle of view moves out of it, the conversation group is kept as it was before that person left whenever it is determined that the person is still taking part in the conversation.

以下、第二の実施形態について、添付の図面に基づいて詳細に説明する。尚、図１の撮像装置１００の構成は、第一の実施形態と同じため説明を省略する。 The second embodiment will now be described in detail with reference to the accompanying drawings. Note that the configuration of the imaging device 100 in Figure 1 is the same as that of the first embodiment, so a description thereof will be omitted.

図８は本実施形態の撮像装置１００の撮像部１０１、画像処理部１０２、音声入力部１０３、音声処理部１０４の詳細な構成を示すブロック図である。尚、図２と同じ機能を持つブロックは同じ番号を割付し、説明を省略する。 Figure 8 is a block diagram showing the detailed configuration of the imaging unit 101, image processing unit 102, audio input unit 103, and audio processing unit 104 of the imaging device 100 of this embodiment. Note that blocks having the same functions as those in Figure 2 are assigned the same numbers, and their explanations will be omitted.

特徴抽出部８０１は、人物音声抽出部２１４より抽出された音声とその音声に対応する人物とを関連付ける。例えば、特徴抽出部８０１は、音声の特徴と画角内の被写体の動作とに基づいて、抽出された音声と対応する人物とを関連付ける。例えば、上記音声の特徴は、周波数、大きさ、および抑揚である。例えば、被写体の動作は、発話の頻度、発声のタイミング、口の動きである。このような関連付けにより、話者の特定を行うための確度を向上させることができる。これにより、制御部１１１は、会話グループの人物が画角から外れても音声から話者を特定できる。 The feature extraction unit 801 associates the sound extracted by the person voice extraction unit 214 with the person corresponding to that sound. For example, the feature extraction unit 801 associates the extracted sound with the corresponding person based on the sound features and the movement of the subject within the field of view. For example, the sound features are frequency, volume, and intonation. For example, the movement of the subject is the frequency of speech, the timing of speech, and mouth movement. This association can improve the accuracy of identifying the speaker. This allows the control unit 111 to identify the speaker from the sound even if the person in the conversation group is out of the field of view.

会話グループ修正部８０２は、特徴抽出部８０１で取得した人物と関連付けされた音声の特徴から、画角からはずれた人物が会話を継続しているかを判断する。制御部１１１は、この結果と会話グループ検出部２２６の検出結果から画角から外れた人物を考慮した会話グループになるよう修正する。 The conversation group correction unit 802 determines whether a person who is out of the field of view is continuing a conversation based on the voice characteristics associated with the person acquired by the feature extraction unit 801. The control unit 111 uses this result and the detection result of the conversation group detection unit 226 to correct the conversation group so that it takes into account the person who is out of the field of view.

なお、第二の実施形態では特徴抽出部８０１、会話グループ修正部８０２を図２に示すブロック図に追加した形態で説明したが、会話グループ修正部８０２を図３に示すブロック図に追加した形態でも動作内容は同じである。 In the second embodiment, a feature extraction unit 801 and a conversation group correction unit 802 were added to the block diagram shown in Figure 2, but the operation is the same even when the conversation group correction unit 802 is added to the block diagram shown in Figure 3.

次に第二の実施形態の撮像装置１００の動作について図９、１１を用いて説明する。 Next, the operation of the imaging device 100 of the second embodiment will be described using Figures 9 and 11.

図９は撮像装置１００の一連の記録動作を説明したフローチャートである。図９では、図５と同じ動作をするブロックには図５と同じステップ番号を付与している。ここで、先に図９の動作での想定シーン例を図１１を用いて説明する。 Figure 9 is a flowchart explaining a series of recording operations of the imaging device 100. In Figure 9, blocks that perform the same operations as in Figure 5 are assigned the same step numbers as in Figure 5. Here, we will first explain an example of an expected scene for the operation in Figure 9 using Figure 11.

図１１（ａ）では、撮影者により記録釦が押下された時点における場面が示されている。図１１（ａ）に示す場面（以降、初期撮影シーンという）では、画角内に人物Ａ、人物Ｂ、および人物Ｄ（６０２、６０３、６０６）が存在する。主対象を人物Ａとし、主対象を含む会話グループは、人物Ａ、人物Ｂ、人物Ｄの３名が検出される。そして、会話グループの少なくとも１人が画角から外れた場合のシーンを説明する。 Figure 11(a) shows the scene at the time the photographer presses the record button. In the scene shown in Figure 11(a) (hereinafter referred to as the initial shooting scene), Person A, Person B, and Person D (602, 603, 606) are present within the field of view. Person A is the main subject, and a conversation group containing the main subject is detected, consisting of three people: Person A, Person B, and Person D. We will now explain the scene when at least one person in the conversation group is outside the field of view.

会話グループに含まれる人物が画角から外れた場合のシーンの例を、図１１（ｂ）～（ｅ）に示す。図１１（ｂ）～（ｄ）は人物Ｂ（６０３）が画角６０１から外れた場合のシーンである。図１１（ｅ）は撮影者による撮像装置１００のパンニング動作により会話グループの全員が画角から外れた場合のシーンである。また各図中の人物の口付近に表記されている横向きの「ハ」の字は、それぞれの人物からの発声状態を表しており、その線の太さで声量や会話への参加頻度の程度を表現している。また図１１の各シーンを、図１１（ａ）は初期撮影シーン、図１１（ｂ）はシーンｂ、図１１（ｃ）はシーンｃ、図１１（ｄ）はシーンｄ、図１１（ｅ）はシーンｅと記述する。また、各図に登場する人物６０２を人物Ａ、人物６０３を人物Ｂ、人物６０６を人物Ｄと記述する。また、各シーンの主対象を人物Ａとする。また、各図の画角６０１を撮像装置１００の撮影画角、会話グループ６１０は会話グループを示す。 Figures 11(b) to (e) show examples of scenes in which a person in the conversation group moves out of the angle of view. Figures 11(b) to (d) show scenes in which person B (603) has moved out of the angle of view 601. Figure 11(e) shows a scene in which all members of the conversation group are out of the angle of view because the photographer has panned the imaging device 100. The sideways "ハ"-shaped marks drawn near each person's mouth in the figures represent that person's vocalization, and the thickness of the lines expresses the loudness of the voice and how often the person takes part in the conversation. The scenes of Figure 11 are referred to as follows: Figure 11(a) is the initial shooting scene, Figure 11(b) is scene b, Figure 11(c) is scene c, Figure 11(d) is scene d, and Figure 11(e) is scene e. Person 602, person 603, and person 606 appearing in the figures are referred to as person A, person B, and person D, respectively. The main subject in each scene is person A. In each figure, the angle of view 601 is the shooting angle of view of the imaging device 100, and the conversation group 610 indicates the conversation group.

図１１（ａ）～（ｅ）各シーンの想定は、以下のとおりである。 The assumptions for each scene in Figures 11(a) to (e) are as follows:

シーンｂでは、初期撮影シーンに対し、人物Ｂが画角からは外れているが、画角内にいるときと同様に会話を継続しているシーンが示されている。 Scene b shows a scene in which, relative to the initial shooting scene, person B has moved out of the angle of view but continues to converse just as when within the angle of view.

シーンｃでは、初期撮影シーンに対し、人物Ｂが画角から外れており、かつ会話をしていないシーンが示されている。なお、シーンｃでは、人物Ａ、人物Ｄともに人物Ｂの方を向いていない状態である。 Scene c shows a scene in which person B is out of the field of view and is not engaged in conversation, as opposed to the initial shooting scene. Note that in scene c, neither person A nor person D are looking towards person B.

シーンｄでは、シーンｂのシーンに対し、人物Ｂが遠方へ移動しているが会話は継続しているシーンが示されている。なお、シーンｄでは、人物Ｂの音声は撮像装置１００に入力されている。また、画角内にいる人物Ａの顔の向きが、人物Ｂのいる方向を向いており、発声量が大きくなっている。 In scene d, compared to scene b, person B has moved into the distance, but the conversation continues. Note that in scene d, person B's voice is being input to the imaging device 100. Also, person A, who is within the field of view, is facing in the direction of person B, and is speaking louder.

シーンｅでは、初期撮影シーンに対し、人物Ａ、人物Ｂ、および人物Ｄが画角から外れたシーンが示されている。なお、シーンｅでは、人物Ａ、人物Ｂ、および人物Ｄは会話を継続している。 Scene e shows a scene in which Person A, Person B, and Person D are out of the field of view compared to the initial shot scene. Note that in scene e, Person A, Person B, and Person D are continuing to have a conversation.

以上、図９の動作での想定シーン例を図１１を用いて説明した。以降、図９のフローチャートを用いて撮像装置１００の動作を説明する。本実施形態の説明では、主にステップＳ９０１～ステップＳ９０４について行う。 An example of an assumed scene for the operation in FIG. 9 has been described above using FIG. 11. The operation of the imaging device 100 will now be described using the flowchart in FIG. 9. In this embodiment, steps S901 to S904 will be mainly described.

まず、ステップＳ５０１からステップＳ５０７までの処理によって、画角内の人物検出、主対象の特定、主対象と会話している人物の検出、および音声の抽出が実施される。 First, steps S501 to S507 are performed to detect people within the field of view, identify the main subject, detect people conversing with the main subject, and extract audio.

ステップＳ９０１では、制御部１１１は、ステップＳ５０６検出された主対象と会話している人数と、特徴抽出部８０１および会話グループ修正部８０２によって関連付けられた会話グループの人数と一致するか否かを判断する。例えば、制御部１１１は、ステップＳ５０６で検出された会話グループの人数に対する現時点の会話グループの人数との差分をとることで判断する。人数が減少したと判断された場合、特徴抽出部８０１および会話グループ修正部８０２によって関連付けられた会話グループの人物のうち、画角から外れた人物が存在することになる。人数が一致すると判断された場合、ステップＳ９０４の処理が実行される。人数が一致しないと判断された場合、ステップＳ９０２の処理が実行される。 In step S901, control unit 111 determines whether the number of people conversing with the main target detected in step S506 matches the number of people in the conversation group associated by feature extraction unit 801 and conversation group correction unit 802. For example, control unit 111 makes this determination by calculating the difference between the number of people in the conversation group detected in step S506 and the number of people in the current conversation group. If it is determined that the number of people has decreased, it means that there is a person in the conversation group associated by feature extraction unit 801 and conversation group correction unit 802 who is out of the field of view. If it is determined that the number of people matches, processing of step S904 is executed. If it is determined that the number of people does not match, processing of step S902 is executed.

ステップＳ９０２では、会話グループ修正部８０２は、画角から外れている人物と、画角内の人物との会話が継続しているか否かを判断する。画角から外れている人物と画角内の人物との会話が継続してないと判断された場合、制御部１１１は、現在の会話グループはステップＳ５０６での検出結果として、ステップＳ９０４以降の処理を行う。会話が継続していると判断された場合、ステップＳ９０３の処理が実行される。 In step S902, the conversation group correction unit 802 determines whether a conversation is ongoing between a person outside the field of view and a person within the field of view. If it is determined that a conversation is not ongoing between a person outside the field of view and a person within the field of view, the control unit 111 performs the processing from step S904 onwards, treating the current conversation group as the detection result of step S506. If it is determined that a conversation is ongoing, the processing of step S903 is executed.

ステップＳ９０３では、制御部１１１は、画角から外れた人物が画角内の会話グループに含まれるように、ステップＳ５０６での検出された主対象と会話している人物（被写体）を修正する。 In step S903, the control unit 111 corrects the person (subject) who is conversing with the main subject detected in step S506 so that the person outside the field of view is included in the conversation group within the field of view.

ステップＳ９０４では、特徴抽出部８０１は、人物音声抽出部２１４より抽出された被写体（人物）毎に抽出された音声に基づいて、音声とその音声に対応する人物との関連付けを行う。 In step S904, the feature extraction unit 801 associates the audio with the person corresponding to that audio based on the audio extracted for each subject (person) extracted by the person audio extraction unit 214.
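Steps S901 to S903 can be summarized in a short Python sketch. It assumes that the conversation group is represented as a set of person identifiers and that the judgment of step S902 is supplied as a callable; the function name `correct_conversation_group` and both parameter conventions are hypothetical. Note that the description compares the number of people in step S901, whereas this sketch uses a set difference to find out who is missing, which is a simplification.

```python
def correct_conversation_group(detected_group, remembered_group, conversation_continues):
    """Return the conversation group after correction (illustrative sketch).

    detected_group:         set of IDs detected from the image in step S506
    remembered_group:       set of IDs previously associated with voices
                            (feature extraction / conversation group correction)
    conversation_continues: callable(person_id) -> bool, the judgment of step S902
    """
    # S901: find group members who are no longer detected within the angle of view.
    missing = remembered_group - detected_group
    corrected = set(detected_group)
    for person_id in missing:
        # S902: is the off-screen person still taking part in the conversation?
        if conversation_continues(person_id):
            # S903: keep that person in the group.
            corrected.add(person_id)
    return corrected
```

For scene b, calling the function with the detected group {A, D}, the remembered group {A, B, D}, and a judgment that returns true for person B restores the group {A, B, D}; for scene c the judgment returns false and the group stays {A, D}.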

ここで、上述のシーンを用いて、ステップＳ９０２における、人物Ｂが人物Ａ、人物Ｄとの会話を継続しているか否かの判断の一例を説明する。 Here, using the above scene, we will explain an example of determining in step S902 whether person B is continuing a conversation with person A and person D.

シーンｂでは、図９のステップＳ５０５およびステップＳ５０６で、主対象である人物Ａと会話している人物として人物Ｄが特定される。しかし、図９のステップＳ９０１で、初期撮影シーンでは会話グループに属していた人物Ｂが、画角から外れたことがわかる。そして、図９のステップＳ９０２で、会話グループ修正部８０２によって人物Ｂが人物Ａ、および人物Ｄとの会話を継続していることが判断される。そのため、図９ステップＳ９０３で、制御部１１１は、主対象である人物Ａと会話している被写体（人物）に人物Ｂを追加する。すなわち、シーンｂでは、初期撮影シーンと同様の会話グループを維持することになる。 In scene b, steps S505 and S506 of Figure 9 identify person D as the person conversing with person A, the main subject. In step S901 of Figure 9, however, it is found that person B, who belonged to the conversation group in the initial shooting scene, has moved out of the angle of view. Then, in step S902 of Figure 9, the conversation group correction unit 802 determines that person B is continuing to converse with person A and person D. Therefore, in step S903 of Figure 9, the control unit 111 adds person B to the subjects (people) conversing with person A, the main subject. In other words, scene b retains the same conversation group as the initial shooting scene.

ここで、人物Ｂが人物Ａ、Ｄとの会話が続いているか否かの判断の一例を説明する。会話グループ修正部８０２は、特徴抽出部８０１の情報より、人物Ｂの声の大きさや抑揚に変化がなく、人物Ａ、Ｄとの会話時の発話タイミングが合っている場合、会話が継続していると判断する。発話タイミングが合っている場合は、例えば、人物Ａ，Ｄと人物Ｂとが交互に会話している（会話が継続している）場合である。この場合、制御部１１１は、主対象である人物Ａと会話している人物（被写体）に人物Ｂを追加する。また、会話グループ修正部８０２は、画像処理部１０２が被写体の画角から外れた方向と被写体の顔の向きとが判断できる場合、さらに画角内の人物Ａまたは人物Ｄの顔の向きと人物Ｂの画角から外れた方向とに基づいて会話が継続しているか否かを判断する。すなわち、上述の声の大きさや抑揚、発話タイミングで会話が継続していると判断しても、画角内の人物Ａまたは人物Ｄの顔の向きが人物Ｂが画角から外れた方向と一致していない場合、会話グループ修正部８０２は、会話が継続していないと判断する。 Here, an example of how it is determined whether person B is continuing to converse with persons A and D will be described. Based on the information from the feature extraction unit 801, the conversation group correction unit 802 determines that the conversation is continuing if the loudness and intonation of person B's voice have not changed and the timing of person B's speech fits the conversation with persons A and D. The speech timing fits when, for example, persons A and D and person B speak alternately (that is, the conversation is continuing). In this case, the control unit 111 adds person B to the people (subjects) conversing with person A, the main subject. Furthermore, when the image processing unit 102 can determine both the direction in which a subject left the angle of view and the facial orientation of the subjects, the conversation group correction unit 802 additionally bases its determination on the facial orientation of person A or person D within the angle of view and the direction in which person B left the angle of view. That is, even if the loudness, intonation, and speech timing described above indicate that the conversation is continuing, the conversation group correction unit 802 determines that the conversation is not continuing if the face of person A or person D within the angle of view is not oriented toward the direction in which person B left the angle of view.
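The judgment described in this paragraph can be written as a small rule, shown here as a hedged Python sketch. The data class `OffscreenVoice`, its field names, and the tolerance `max_change` are hypothetical: the description states only that the loudness and intonation show "no change" and gives no numeric criterion. The sketch covers only the rule of this paragraph and not the additional cues used in other scenes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OffscreenVoice:
    """Observations about a person who has left the angle of view (assumed fields)."""
    level_change: float           # relative change in loudness (0.0 = unchanged)
    intonation_change: float      # relative change in intonation (0.0 = unchanged)
    alternates_with_group: bool   # speech timing alternates with on-screen members
    faces_toward_exit: Optional[bool] = None  # None when the image cannot tell


def conversation_continues(voice, max_change=0.2):
    """Return True when the off-screen person is judged to still be conversing."""
    voice_stable = (abs(voice.level_change) <= max_change
                    and abs(voice.intonation_change) <= max_change)
    if not (voice_stable and voice.alternates_with_group):
        return False
    # The face-direction check applies only when it can be determined, and it
    # can overrule a positive result from the voice-based check.
    if voice.faces_toward_exit is False:
        return False
    return True
```

Treating `faces_toward_exit` as optional mirrors the condition that the facial orientation is used only when the image processing unit can determine it.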

シーンｃでは、シーンｂに対し、人物Ｂの音声が検出されていない場合である。このような場合、人物Ｂは人物Ａおよび人物Ｄの会話に参加していないと判断され、制御部１１１は、主対象である人物Ａと会話している被写体を人物Ｄのまま、修正は行わない。 Scene c differs from scene b in that person B's voice is not detected. In such a case, person B is determined not to be participating in the conversation between person A and person D, and the control unit 111 leaves person D as the only subject conversing with person A, the main subject, without making any correction.

シーンｄでは、シーンｂの状況から人物Ｂが移動し、人物Ａ、Ｄから遠ざかるも会話は継続しているシーンである。このシーンでは、人物Ｂの声は小さくなっているが、人物Ａおよび人物Ｄとの会話時の発話タイミングは合っている。また、人物Ｂの声は小さくなったが、人物Ａの声はこれに反し大きくなっている。これらの情報から、会話グループ修正部８０２は、人物Ａおよび人物Ｂは会話をしていると判断する。これに応じて、制御部１１１は、主被写体である人物Ａと会話している被写体として人物Ｂを追加する。 Scene d is a scene in which, starting from the situation in scene b, person B has moved away from persons A and D, but the conversation continues. In this scene, person B's voice has become quieter, but person B's utterance timing still matches that of the conversation with persons A and D. In addition, while person B's voice has become quieter, person A's voice has conversely become louder. From this information, the conversation group correction unit 802 determines that person A and person B are conversing. In response, the control unit 111 adds person B as a subject conversing with person A, the main subject.
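
As a rough illustration of the scene d reasoning, the following Python sketch treats a drop in person B's volume as compatible with a continuing conversation when the utterance timing still matches and person A's volume has risen. The function name, the decibel-change inputs, and the threshold are hypothetical; the source states only the qualitative conditions.

```python
def still_conversing_at_distance(
    b_volume_change_db: float,
    a_volume_change_db: float,
    timing_matches: bool,
    threshold_db: float = 3.0,  # assumed threshold, not from the source
) -> bool:
    """Scene d heuristic: B got quieter, A got louder, and the turn timing still matches."""
    b_quieter = b_volume_change_db <= -threshold_db
    a_louder = a_volume_change_db >= threshold_db
    return timing_matches and b_quieter and a_louder
```

In this sketch, `timing_matches` would come from whatever utterance-timing comparison the device already performs; it is passed in rather than computed here.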

シーンｅでは、撮影者が撮像方向を人物Ａ、人物Ｂ、および人物Ｄのいる方向から打ち上げ花火に向けて変更したシーンである。すなわち、人物Ａ、人物Ｂ、および人物Ｄは会話を継続しているが、主対象である人物Ａも画角から消えた状態である。しかし、特徴抽出部８０１の情報より、取得した音声に人物Ａの音声が含まれているため、この場合では、制御部１１１は人物Ａを主対象であると判断する。加えて、特徴抽出部８０１の情報より、人物Ｂおよび人物Ｄの音声も検出され続けているため、制御部１１１は、主対象である人物Ａと会話している被写体として人物Ｂおよび人物Ｄを追加する。このように、シーンｅのようなシーンでは人物は誰も画角内にいないが、会話グループの音声が強調されて記録される。なお、制御部１１１は、図１１（ｆ）に示すように会話の内容をテキスト変換し、吹き出し状などの形態で表示するよう制御してもよい。 In scene e, the photographer changes the imaging direction from the direction of persons A, B, and D toward the fireworks. In other words, persons A, B, and D are continuing their conversation, but person A, who is the main subject, has disappeared from the field of view. However, information from feature extraction unit 801 indicates that the acquired audio includes the voice of person A, so in this case, control unit 111 determines that person A is the main subject. Additionally, information from feature extraction unit 801 indicates that the voices of persons B and D are also continuing to be detected, so control unit 111 adds persons B and D as subjects who are conversing with person A, who is the main subject. In this way, in a scene like scene e, no persons are within the field of view, but the voices of the conversation group are recorded with emphasis. Note that control unit 111 may convert the content of the conversation into text and display it in a form such as a speech bubble, as shown in FIG. 11(f).
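
The scene e behavior, in which the conversation group is maintained from audio alone, might be sketched in Python as follows. The function and its set-based inputs are hypothetical. The source does not say what happens when the main subject's voice is not detected, so returning an empty set in that case is purely an assumption of this sketch.

```python
from __future__ import annotations


def emphasized_speakers(
    main: str, group: set[str], detected_voices: set[str]
) -> set[str]:
    """Speakers whose voices are emphasized when nobody is within the angle of view."""
    if main not in detected_voices:
        # Assumption: without the main subject's voice, no group is emphasized.
        return set()
    # Keep the main subject plus every group member whose voice is still detected.
    return {main} | (group & detected_voices)
```

For scene e, `emphasized_speakers("A", {"B", "D"}, {"A", "B", "D"})` yields the full group of persons A, B, and D even though none of them is in frame.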

以上、第二の実施形態における撮像装置１００の動作について説明した。 The above describes the operation of the imaging device 100 in the second embodiment.

なお、ステップＳ５０６で検出される主対象である被写体と会話している人物の人数は画角内のステップＳ５０４での人物検出に基づくものなので、初期撮影シーン（図１１（ａ））では人物Ｂ、人物Ｄの２名、図１１（ｂ）では人物Ｄの１名である。 Note that the number of people conversing with the main subject detected in step S506 is based on the person detection within the angle of view in step S504, so in the initial shooting scene (Figure 11(a)), there are two people, person B and person D, and in Figure 11(b) there is one person, person D.

なお、第二の実施形態における音声抽出は、動画記録開始から所定時間が経過するまでは人物検出部２２２の検出結果、その後は人物動作検出部２２３の結果と特徴抽出部８０１の情報に基づいて実行される。 In the second embodiment, audio extraction is performed based on the detection results of the person detection unit 222 until a predetermined time has elapsed since the start of video recording, and thereafter based on the results of the person movement detection unit 223 and information from the feature extraction unit 801.
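
The time-based switch of inputs for audio extraction can be pictured with a small Python sketch. The five-second default and the string labels are hypothetical; the source specifies only that a predetermined time separates the two phases.

```python
from __future__ import annotations


def extraction_inputs(elapsed_s: float, switch_after_s: float = 5.0) -> tuple[str, ...]:
    """Select which detection results drive audio extraction at a given recording time."""
    if elapsed_s < switch_after_s:
        # Early phase: rely on person detection within the angle of view.
        return ("person_detection_222",)
    # Later phase: rely on motion detection and extracted voice features.
    return ("person_motion_detection_223", "feature_extraction_801")
```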

以上のように第二の実施形態によれば、会話グループに属する人物が画角から外れた場合でも、会話が継続している場合では適切な会話グループに修正することができる。 As described above, according to the second embodiment, even if a person belonging to a conversation group moves out of the angle of view, the conversation group can be corrected to an appropriate one as long as the conversation is continuing.

第二の実施形態の図１１（ｂ）、（ｃ）、（ｄ）での会話継続の判定について、会話グループに属する人物が画角からはずれた要因について考慮しない前提で説明したが、これを考慮してもよい。例えば、撮影者がレンズ２０１のズーム操作により会話グループに属する人物が画角から外れた場合、その人物が自身の意思とは関係なく画角から外れたため、制御部１１１は、特徴抽出部８０１の情報を使うことなく会話が継続されていると判断してもよい。 In the second embodiment, the determination of conversation continuation in Figures 11(b), (c), and (d) was described on the premise that the cause of a person belonging to the conversation group leaving the angle of view is not considered, but this cause may be taken into account. For example, if a person belonging to the conversation group leaves the angle of view because the photographer zooms the lens 201, that person has left the angle of view regardless of their own intention, so the control unit 111 may determine that the conversation is continuing without using the information from the feature extraction unit 801.
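
One way to picture this cause-aware variation is the Python sketch below, in which the audio-based check is skipped entirely when the person left the angle of view because of a zoom operation. The `ExitCause` enum and the `audio_check` callback are hypothetical names introduced only for illustration.

```python
from enum import Enum, auto
from typing import Callable


class ExitCause(Enum):
    """Hypothetical reasons for a person leaving the angle of view."""
    ZOOM = auto()   # the photographer zoomed the lens
    OTHER = auto()  # any other cause, such as the person walking away


def continues_after_exit(cause: ExitCause, audio_check: Callable[[], bool]) -> bool:
    """Decide continuation, consulting audio features only when the cause is not a zoom."""
    if cause is ExitCause.ZOOM:
        # The person did not choose to leave, so the conversation is assumed to continue.
        return True
    return audio_check()
```

Passing the audio check as a callable keeps the feature extraction from being evaluated at all in the zoom case, which mirrors the statement that its information need not be used.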

以上、本発明の好ましい実施形態について説明したが、本発明はこれらの実施形態に限定されず、その要旨の範囲内で種々の変形及び変更が可能である。 The above describes preferred embodiments of the present invention, but the present invention is not limited to these embodiments, and various modifications and variations are possible within the scope of the invention.

［その他の実施形態］ [Other embodiments]
本発明は、上述の実施形態の１以上の機能を実現するプログラムを、ネットワーク又は記憶媒体を介してシステム又は装置に供給し、そのシステム又は装置のコンピュータにおける１つ以上のプロセッサがプログラムを読出し実行する処理でも実現可能である。また、１以上の機能を実現する回路（例えば、ＡＳＩＣ）によっても実現可能である。 The present invention can also be realized by supplying a program that realizes one or more of the functions of the above-described embodiments to a system or device via a network or a storage medium, and having one or more processors in the computer of the system or device read and execute the program. The present invention can also be realized by a circuit (e.g., an ASIC) that realizes one or more of the functions.

なお、本発明は上記実施形態そのままに限定されるものではなく、実施段階ではその要旨を逸脱しない範囲で構成要素を変形して具体化できる。また、上記実施形態に開示されている複数の構成要素の適宜な組み合わせにより、種々の発明を形成できる。例えば、実施形態に示される全構成要素から幾つかの構成要素を削除してもよい。さらに、異なる実施形態にわたる構成要素を適宜組み合わせてもよい。 The present invention is not limited to the above-described embodiments, and can be embodied by modifying the components within the scope of the spirit of the invention when implemented. Furthermore, various inventions can be created by appropriately combining multiple components disclosed in the above-described embodiments. For example, some components may be omitted from all of the components shown in the embodiments. Furthermore, components from different embodiments may be appropriately combined.

Claims

1. An audio processing device comprising:
a detection means for detecting a subject from a video;
a determining means for determining the sound of a subject from the video;
a selection means for selecting a main subject from the subjects detected from the video; and
a determining means for determining a subject related to the main subject selected by the selection means,
wherein, when a subject related to the main subject moves out of the angle of view of the video, the determination means determines, based on the audio of the subject determined by the determination means, whether the subject that has moved out of the angle of view of the video continues to be related to the main subject.

2. The audio processing device according to claim 1, wherein the determination means determines that the subject outside the field of view of the video is related to the main subject when it is determined that the voice of the main subject and the subject outside the field of view of the video are having an ongoing conversation, and determines that the subject outside the field of view of the video is not related to the main subject when it is determined that the voice of the main subject and the subject outside the field of view of the video are not having an ongoing conversation.

3. The audio processing device according to claim 1, wherein the determination means determines that the subject outside the field of view of the video is not continuously associated with the main subject, even if it can determine that the voice of the main subject and the subject outside the field of view of the video are continuing to have a conversation based on the voice of the subject determined by the determination means, if the direction of the face of the main subject does not match the direction in which the subject outside the field of view of the video is out of the field of view.

4. The audio processing device according to claim 1, wherein the determination means determines that the main subject and a subject related to the main subject continue to be related to each other when it determines that a conversation between the main subject and a subject that is out of the field of view of the video is continuing even when the main subject is out of the field of view of the video.

5. The audio processing device according to any one of claims 1 to 4, wherein the selection means selects a subject that is in focus in the video as the main subject.

6. The audio processing device according to any one of claims 1 to 4, wherein the selection means selects a main subject from the video based on an image recorded as the main subject.

7. The audio processing device according to any one of claims 1 to 4, wherein the selection means selects the subject that appears most frequently among the subjects captured in the video as the main subject.

8. The audio processing device according to any one of claims 1 to 7, wherein the determination means determines subjects related to the main subject based on the distance from the main subject.

9. The audio processing device according to any one of claims 1 to 8, wherein the determination means determines that the subject closest to the main subject is the subject related to the main subject.

10. The audio processing device according to any one of claims 1 to 7, wherein the determination means determines that a subject facing the main subject is a subject related to the main subject.

11. The audio processing device according to any one of claims 1 to 7, wherein the determination means determines subjects related to the main subject based on the movement of the main subject.

12. The audio processing device according to claim 1, further comprising:
an associating means for associating the subject detected by the detecting means with the sound extracted by the determining means; and
an image processing means,
wherein the associating means associates the subject detected by the detecting means with the audio extracted by the determining means based on the audio extracted by the determining means and the movement of the subject detected by the image processing means.

13. The audio processing device according to claim 12, wherein the image processing means detects the frequency of speech, timing of speech, or mouth movements of the subject.

14. The audio processing device according to any one of claims 1 to 13, wherein the determination means extracts the subject's audio based on the audio frequency, volume, and intonation.

15. The audio processing device according to any one of claims 1 to 14, further comprising an imaging means for capturing the video.

16. The audio processing device according to any one of claims 1 to 15, wherein the determination means determines that a subject that has been detected by the detection means has moved out of the angle of view of the video when the subject is no longer detected by the detection means.

17. A method for controlling an audio processing device, comprising:
a detection step of detecting a subject from a video;
a determining step of determining a subject's voice from the video;
a selection step of selecting a main subject from the subjects detected from the video; and
a determining step of determining a subject related to the main subject selected in the selection step,
wherein, in the determination step, when a subject related to the main subject moves out of the angle of view of the video, it is determined, based on the audio of the subject determined in the determination step, whether the subject that has moved out of the angle of view of the video continues to be related to the main subject.

18. A computer-readable program for causing a computer to function as each of the means of the audio processing device according to any one of claims 1 to 15.