JP2014236276A - Image processing device, imaging device, and image processing program

Info

Publication number: JP2014236276A
Application number: JP2013115146A
Authority: JP
Inventors: Keiichi Nitta (新田 啓一)
Original assignee: Nikon Corp
Current assignee: Nikon Corp
Priority date: 2013-05-31
Filing date: 2013-05-31
Publication date: 2014-12-15

Abstract

Problem to be solved: To provide a technique capable of identifying a main subject with high accuracy.
Solution: An image processing device includes an image acquisition unit that acquires a plurality of images captured continuously in time series; an audio acquisition unit that acquires audio recorded while the plurality of images were captured and associated with the plurality of images; a subject estimation unit that extracts audio information from the audio and estimates the main subject of an image from the audio information; and a subject specifying unit that identifies the estimated main subject in the image.
Selected drawing: FIG. 1

Description

The present invention relates to an image processing device, an imaging device, and an image processing program.

Various techniques have been developed for detecting a main subject from captured images.

For example, one technique integrates a plurality of feature quantities from an image to obtain a saliency map, sets image regions whose saliency is at or above a predetermined threshold, and which are therefore likely to attract human visual attention, as seeds of the core region of the main subject, and then applies region segmentation to extract the main subject region from the image (see, for example, Patent Document 1).

Patent Document 1: JP 2011-35636 A

However, because the prior art detects the main subject based on feature quantities such as saliency, the detected subject may differ from the main subject intended by the user who captured the image.

In view of this problem with the prior art, an object of the present invention is to provide a technique capable of identifying a main subject with high accuracy.

To solve the above problem, one aspect of an image processing device exemplifying the present invention includes: an image acquisition unit that acquires a plurality of images captured continuously in time series; an audio acquisition unit that acquires audio recorded while the plurality of images were captured and associated with the plurality of images; a subject estimation unit that extracts audio information from the audio and estimates the main subject of an image from the audio information; and a subject specifying unit that identifies the estimated main subject in the image.

The subject specifying unit may identify the main subject in an image captured after the point in time at which the audio information was extracted.

A speaker specifying unit that identifies, from the image, the person uttering the audio may also be provided.

An information adding unit may also be provided that adds information on the main subject to the image in which the subject specifying unit has identified the main subject.

A storage unit may also be provided that stores information associating audio information with main subjects in advance.

One aspect of an imaging device exemplifying the present invention includes: an imaging unit that captures a scene continuously in time series to generate a plurality of images; a focus adjustment unit that adjusts the focus state of the scene; the image processing device of the present invention; and a control unit that controls the focus adjustment unit so as to focus on the main subject estimated by the subject estimation unit.

One aspect of an image processing program exemplifying the present invention causes a computer to execute: an image acquisition procedure of acquiring a plurality of images captured continuously in time series; an audio acquisition procedure of acquiring audio recorded while the plurality of images were captured and associated with the plurality of images; a subject estimation procedure of extracting audio information from the audio and estimating the main subject of an image from the audio information; and a subject specifying procedure of identifying the estimated main subject in the image.

According to the present invention, a main subject can be identified with high accuracy.

A diagram showing the configuration of a digital camera according to an embodiment of the present invention.
A flowchart showing imaging processing by the digital camera according to an embodiment of the present invention.
A diagram showing an example of the composition of the scene at the start of imaging.
A diagram showing an example of the composition of the scene after estimation of the main subject (part 1).
A diagram showing an example of the composition of the scene after estimation of the main subject (part 2).
A diagram showing an example of the composition of the scene after estimation of the main subject (part 3).
A diagram showing an example of the composition of the scene after estimation of the main subject (part 4).
A diagram showing an example of the composition of the scene after estimation of the main subject (part 5).
A diagram showing another example of the composition of the scene at the start of imaging and after estimation of the main subject.

An embodiment of the present invention is described below with reference to the drawings.

FIG. 1 is a diagram showing the configuration of a digital camera 100 according to an embodiment of the present invention.

The digital camera 100 of this embodiment comprises an imaging lens 11, an image sensor 12, an AFE 13, an image processing unit 14, microphones 15a and 15b, a display control unit 16a, a monitor 16b, an SDRAM 17, a lens driving unit 18, an image sensor driving circuit 19, a CPU 20, an operation member 24, a nonvolatile memory 25, a motion sensor 26, a recording I/F 27, and a bus 29. The image processing unit 14, the microphones 15a and 15b, the display control unit 16a, the SDRAM 17, the CPU 20, and the recording I/F 27 are connected to one another via the bus 29 so that they can exchange information. The lens driving unit 18, the image sensor driving circuit 19, the operation member 24, the nonvolatile memory 25, and the motion sensor 26 are connected to the CPU 20.

The imaging lens 11 consists of a plurality of lens groups including a zoom lens and a focusing lens. The lens position of the imaging lens 11 is adjusted along the optical axis by the lens driving unit 18 according to control instructions from the CPU 20. For simplicity, FIG. 1 shows the imaging lens 11 as a single lens.

The image sensor 12 is a device that is driven by the image sensor driving circuit 19 and, based on control signals from the CPU 20, captures the scene imaged by the light flux that has passed through the imaging lens 11. A plurality of light receiving elements are arranged in a matrix on the light receiving surface of the image sensor 12. Red (R), green (G), and blue (B) color filters are arranged over the light receiving elements in the well-known Bayer array, so each light receiving element outputs an analog image signal corresponding to its color as a result of color separation by the color filter. The output of the image sensor 12 is input to the AFE 13. The image sensor 12 of this embodiment may be either a sequential-scanning solid-state image sensor (such as a CCD) or an XY-addressed solid-state image sensor (such as a CMOS sensor).

The AFE 13 is an analog front-end circuit that applies analog signal processing to the output of the image sensor 12. The AFE 13 performs correlated double sampling, gain adjustment of the image signal, and A/D conversion of the image signal. The output of the AFE 13 is sent to the image processing unit 14. In this embodiment, the image sensor 12 and the AFE 13 constitute the imaging unit.
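The three AFE stages named above can be illustrated with a minimal sketch. It assumes normalized voltages, a sensor whose signal level sits below its reset level, and a 12-bit converter. The function name `afe_sample` and every parameter name and value are hypothetical and are not taken from the patent.

```python
def afe_sample(reset_level, signal_level, gain, full_scale=1.0, bits=12):
    """Convert one pixel's two analog samples to a digital code (illustrative only)."""
    # Correlated double sampling: subtract the two samples of the same pixel
    # so that any offset common to both cancels out.
    pixel = reset_level - signal_level
    # Gain adjustment, followed by clipping to the converter's input range.
    amplified = min(max(pixel * gain, 0.0), full_scale)
    # A/D conversion: map [0, full_scale] onto integer codes 0 .. 2**bits - 1.
    return int(amplified / full_scale * (2 ** bits - 1))
```

A dark pixel (both samples equal) maps to code 0, and a pixel that saturates after gain maps to the maximum code.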

The image processing unit 14 includes a white balance processing circuit, a pixel interpolation (demosaicing) circuit, a matrix processing circuit, a nonlinear conversion (gamma correction) processing circuit, an edge enhancement processing circuit, and the like, and applies white balance, pixel interpolation, matrix, nonlinear conversion (gamma correction), edge enhancement, and other processing to the digital image signal. The pixel interpolation circuit converts the Bayer array signal, which has one color per pixel, into an ordinary color image signal with three colors per pixel.
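Two of these stages, white balance and gamma correction, are simple enough to sketch per pixel. The example below assumes 8-bit values, per-channel gains, and a power-law gamma of 2.2. These are illustrative assumptions; the patent does not specify the circuits' arithmetic.

```python
def white_balance(rgb, gains):
    """Scale each channel by its gain and clip to the 8-bit range."""
    return tuple(min(255, round(c * g)) for c, g in zip(rgb, gains))


def gamma_correct(value, gamma=2.2):
    """Apply a power-law nonlinear conversion to one 8-bit value."""
    return round(255 * (value / 255) ** (1 / gamma))


# Example: boost red and blue, then brighten the mid-tones.
balanced = white_balance((100, 100, 100), (2.0, 1.0, 3.0))
corrected = tuple(gamma_correct(c) for c in balanced)
```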

The three-color image signal output from the image processing unit 14 is stored in the SDRAM 17 via the bus 29. The image data stored in the SDRAM 17 is read out under the control of the CPU 20 and sent to the display control unit 16a. The display control unit 16a converts the input image data into a signal in a predetermined display format (for example, an NTSC color composite video signal) and outputs it as a through image to be displayed on the monitor 16b.

Image data acquired in response to operation of the release button of the operation member 24, described later, is read from the SDRAM 17 and sent to a compression/decompression processing unit (not shown). The compression/decompression processing unit compresses the image data to generate an image file and records it, via the recording I/F 27, on a memory card 28 serving as the recording medium.

The microphones 15a and 15b receive sound emitted by the user, by people in the scene, or from the surroundings, and convert it into an analog electrical audio signal. The analog audio signal is converted into a digital signal by an A/D conversion circuit (not shown) and then stored in the SDRAM 17 as stereo audio data. Ordinary microphones can be used as the microphones 15a and 15b. When a moving image or the like is played back, the recorded audio signal can also be reproduced through a speaker (not shown).

The CPU 20 is a processor that controls all parts of the digital camera 100 in accordance with a control program stored in the nonvolatile memory 25. For example, based on image data captured by the image sensor 12, the CPU 20 executes autofocus (AF) control using well-known contrast detection or phase difference detection, well-known automatic exposure (AE) calculation, and so on. In accordance with an image processing program stored in the nonvolatile memory 25, the CPU 20 of this embodiment also operates as a speaker specifying unit 21, a subject estimation unit 22, and a subject specifying unit 23.
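Contrast-detection AF, mentioned above as a known technique, can be sketched as a search for the lens position that maximizes a contrast measure over the focus region. The contrast measure (sum of squared horizontal differences), the function names, and the dictionary of pre-captured regions are assumptions for illustration; the patent does not describe its AF algorithm in detail.

```python
def contrast(region):
    """Sum of squared differences between horizontally adjacent pixels."""
    return sum(
        (row[i + 1] - row[i]) ** 2
        for row in region
        for i in range(len(row) - 1)
    )


def best_focus_position(regions_by_position):
    """Return the lens position whose focus-region image has the highest contrast.

    regions_by_position maps a lens position to the grayscale focus region
    (a list of pixel rows) captured at that position.
    """
    return max(regions_by_position, key=lambda pos: contrast(regions_by_position[pos]))
```

A sharply focused region shows larger differences between neighboring pixels than a blurred one, so the position with the highest score is taken as in focus.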

The speaker specifying unit 21 uses image data captured by the image sensor 12 to detect and identify the speaker who utters the audio data that the subject estimation unit 22, described later, uses to estimate the main subject. The speaker specifying unit 21 of this embodiment identifies the speaker by applying a well-known subject detection algorithm to the image data. For example, the speaker specifying unit 21 reads a template representing the shape of a person, stored in advance in the nonvolatile memory 25, and applies pattern matching or similar processing to the image data to detect a person's face region and identify that person as the speaker. Based on the result, the CPU 20 preferably configures the microphones 15a and 15b so that the voice of the identified speaker can be captured clearly. When the face regions of a plurality of people are detected, the speaker specifying unit 21 may identify all of them as speakers, or may identify a person selected by the CPU 20 or the user as the speaker. When no person is detected in the image data, the speaker specifying unit 21 preferably identifies the user, or a person outside the angle of view of the image data, as the speaker. The speaker specifying unit 21 may also be configured to identify (estimate) the speaker by performing voiceprint analysis on the audio signals acquired by the microphones 15a and 15b.

The subject estimation unit 22 reads the audio data stored in the SDRAM 17 and, using a well-known speech recognition algorithm, extracts from the audio data the utterance content of the speaker identified by the speaker specifying unit 21 as audio information. Based on dictionary data stored in the nonvolatile memory 25, the subject estimation unit 22 estimates, from the acquired audio information, the main subject that the image sensor 12 will capture next. The dictionary data is a list that associates audio information with main subjects. Specifically, the dictionary data of this embodiment consists of a list of audio information for nouns that denote main subjects, such as people, animals, flowers, vehicles, buildings, and mountains, together with a list of audio information for adjectives and the like, such as "beautiful", that correspond to main subjects such as flowers and rainbows.
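The dictionary lookup can be sketched as two small tables, one for nouns and one for adjectives, with nouns checked first so that "beautiful flower" resolves to "flower" as in Case 1 of this description. The table contents, the English tokenization, and the function name `estimate_main_subject` are assumptions for illustration; the patent does not specify the dictionary format, and the input is assumed to be text already produced by speech recognition.

```python
import re

# Hypothetical dictionary data: recognized word -> template names to search for.
NOUNS = {"flower": ["flower"], "figure": ["graph"]}
ADJECTIVES = {
    "beautiful": ["flower", "rainbow"],
    "fast": ["car", "train", "airplane"],
    "wide": ["sky", "sea"],
}


def estimate_main_subject(utterance):
    """Return (matched word, candidate templates), or None if nothing matches."""
    words = re.findall(r"[a-z]+", utterance.lower())
    # A noun names the subject directly, so it takes priority over an adjective.
    for table in (NOUNS, ADJECTIVES):
        for word in words:
            if word in table:
                return word, table[word]
    return None
```

For example, `estimate_main_subject("Wow, beautiful")` yields the adjective entry with both "flower" and "rainbow" as candidates.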

After the estimation by the subject estimation unit 22, the subject specifying unit 23 applies, for example, subject detection processing to the image data captured by the image sensor 12, and detects and identifies the image region of the main subject estimated by the subject estimation unit 22. For example, the subject specifying unit 23 reads a template representing the shape of the estimated main subject from the nonvolatile memory 25 and applies pattern matching or similar processing to the image data to detect and identify the image region of the main subject. The CPU 20 sets the identified image region of the main subject as the focus region and drives the lens driving unit 18 to move the imaging lens 11 back and forth along the optical axis to adjust the focus. The nonvolatile memory 25 is assumed to store templates for the main subjects registered in the dictionary data.
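One simple form of pattern matching is to slide the template over the image and keep the position with the smallest sum of absolute differences (SAD). The sketch below works on small grayscale grids and is only an illustration; the patent says "pattern matching or similar processing" without naming a specific measure, and the function name is hypothetical.

```python
def match_template(image, template):
    """Return (row, col, sad) of the best match by sum of absolute differences."""
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best = None
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            sad = sum(
                abs(image[r + i][c + j] - template[i][j])
                for i in range(th)
                for j in range(tw)
            )
            # Keep the first position with the lowest score seen so far.
            if best is None or sad < best[2]:
                best = (r, c, sad)
    return best
```

The returned position and the template size together give the image region that would be used as the focus region.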

The operation member 24 consists of, for example, a release button, a power button, a command dial, a cross-shaped cursor key, and an enter button, and receives various inputs to the digital camera 100 from the user. In this embodiment, a touch panel made of a transparent panel with the same shape as the monitor 16b and laminated over the entire surface of the monitor 16b may also be used as the operation member 24. That is, the touch panel may detect the position of a stylus (or fingertip or the like) touching the panel surface and output the detected position information to the CPU 20, thereby accepting instruction input from the user.

The motion sensor 26 is an acceleration sensor, an electronic gyro, or the like; it detects the attitude and movement of the digital camera 100 and outputs a detection signal to the CPU 20.

Next, the imaging operation of the digital camera 100 of this embodiment is described with reference to the flowchart of FIG. 2. This embodiment describes the case where the shooting mode is set to the moving image mode and a moving image is captured, but the same applies when still images are captured in continuous shooting mode or when through images are captured during imaging standby.

The CPU 20 receives a power-on instruction when the user operates the power button of the operation member 24, and turns on the digital camera 100. The CPU 20 reads and executes the control program and the image processing program stored in the nonvolatile memory 25 and initializes the digital camera 100. On receiving an imaging instruction given by the user fully pressing the release button of the operation member 24, the CPU 20 causes the image sensor 12 to start capturing a moving image. At the same time, the CPU 20 causes the microphones 15a and 15b to receive sound and start acquiring audio data. At the start of imaging, the scene being captured is assumed to be a scene 30 whose composition contains only a person 31, as shown in FIG. 3.

In step S101, the speaker specifying unit 21 applies subject detection processing to a frame of the moving image captured by the image sensor 12 and detects the face region of the person 31. It then determines that part of the detected face region (the mouth) is moving in synchronization with the audio signal acquired by the microphones 15a and 15b, and identifies the person 31 as the speaker.
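The synchronization check in step S101 can be sketched as follows: for the frames in which the audio is loud enough to count as speech, test whether the mouth region is also moving. The per-frame inputs, the thresholds, and the function name `is_speaking` are hypothetical; the patent does not state how synchronization is measured.

```python
def is_speaking(mouth_motion, audio_energy,
                motion_thresh=0.5, energy_thresh=0.5, min_overlap=0.6):
    """Return True if the mouth moves in at least min_overlap of the voiced frames.

    mouth_motion and audio_energy are per-frame values of equal length,
    normalized to the range 0..1 (an assumption made for this sketch).
    """
    voiced = [i for i, e in enumerate(audio_energy) if e >= energy_thresh]
    if not voiced:
        return False  # no speech detected, so nobody is identified as speaking
    moving = sum(1 for i in voiced if mouth_motion[i] >= motion_thresh)
    return moving / len(voiced) >= min_overlap
```

With several faces in the frame, the same check could be run per face, and the faces for which it returns True treated as speakers.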

In step S102, the subject estimation unit 22 reads the audio data recorded in the SDRAM 17, applies speech recognition processing, and extracts the utterance content of the person 31 as audio information. Based on that audio information and the dictionary data stored in the nonvolatile memory 25, the subject estimation unit 22 estimates the main subject that the image sensor 12 will capture next.

The specific operations of the subject estimation unit 22 and the subject specifying unit 23 are described below, taking each of the following five cases as an example.
[Case 1]
As shown in FIG. 4(a), when the person 31 in the scene 30 says, for example, "Beautiful flower", the subject estimation unit 22 estimates in step S102 that the main subject to be captured next by the image sensor 12 is a "flower".

In step S103, the subject specifying unit 23 reads the "flower" template from the SDRAM 17 based on the estimation result of the subject estimation unit 22, and starts detecting the main subject, "flower", in the frames captured next by the image sensor 12. When, as a result of the user panning or zooming the digital camera 100, the image sensor 12 captures the scene 40 shown in FIG. 4(b), the subject specifying unit 23 detects and identifies, in that frame, the image region 42 of the flower 41 in the foreground, rather than the mountain behind it, as the main subject.

In step S104, the CPU 20 sets the identified image region 42 of the flower 41 as the focus region and drives the lens driving unit 18 to move the imaging lens 11 back and forth along the optical axis to adjust the focus. At the same time, the CPU 20 displays the through image of the frame on the monitor 16b with a superimposed AF frame indicating that the image region 42 of the flower 41 is the focus region.
[Case 2]
As shown in FIG. 5(a), when the person 31 in the scene 30 says, for example in a meeting, "Please look at the next figure", the subject estimation unit 22 estimates in step S102 that the main subject to be captured next by the image sensor 12 is a "figure".

In step S103, the subject specifying unit 23 reads templates such as a graph from the SDRAM 17 based on the estimation result of the subject estimation unit 22, and starts detecting the main subject, "figure", in the frames captured next by the image sensor 12. When the image sensor 12 captures a scene 50 containing a graph 51 projected by a projector or the like, as shown in FIG. 5(b), the subject specifying unit 23 identifies the image region 52 of the graph 51 in the frame as the main subject.

In step S104, the CPU 20 sets the identified image region 52 of the graph 51 as the focus region and drives the lens driving unit 18 to move the imaging lens 11 back and forth along the optical axis to adjust the focus. At the same time, the CPU 20 displays the through image of the frame on the monitor 16b with a superimposed AF frame indicating that the image region 52 of the graph 51 is the focus region.

In identifying the "figure", the subject specifying unit 23 may, instead of using templates such as a graph, identify the image region of the "figure" using, for example, well-known character recognition or graph recognition techniques.
[Case 3]
As shown in FIG. 6(a), when the person 31 in the scene 30 says, for example, "Wow, beautiful", the subject estimation unit 22 estimates in step S102 that the main subject to be captured next by the image sensor 12 is "something beautiful".

In step S103, the subject specifying unit 23 reads templates such as "flower" and "rainbow", which are associated with "something beautiful", from the SDRAM 17 based on the estimation result of the subject estimation unit 22, and starts detecting the main subject, "something beautiful", in the frames captured next by the image sensor 12. When, as a result of the user panning or zooming the digital camera 100, the image sensor 12 captures the flower 41 shown in FIG. 4(b) or the rainbow 61 shown in FIG. 6(b), the subject specifying unit 23 identifies the image region 42 of the flower 41 or the image region 62 of the rainbow 61 as the main subject.

In step S104, the CPU 20 sets the identified image region 42 of the flower 41 or image region 62 of the rainbow 61 as the focus region and drives the lens driving unit 18 to move the imaging lens 11 back and forth along the optical axis to adjust the focus. At the same time, the CPU 20 displays the through image of the frame on the monitor 16b with a superimposed AF frame indicating that the image region 42 of the flower 41 or the image region 62 of the rainbow 61 is the focus region.

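In identifying "something beautiful", the subject specifying unit 23 may, instead of using templates such as "flower" and "rainbow", identify regions that are both highly saturated and bright as the image region of "something beautiful", based on, for example, a saliency map. In that case, it is preferable to raise the weights of color and brightness in the saliency map and lower the weight of directional edges. The CPU 20 sets the identified highly saturated, bright region as the focus region; when there are several such regions, it preferably sets the nearest region, the largest image region, or the like as the focus region. According to this embodiment, the main subject can be identified even when the utterance does not directly name the main subject of the next scene and is instead expressed as, for example, an adjective or adjectival noun.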
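The reweighting and the tie-break between several candidate regions can be sketched as below. The weight values, the feature names, and the region fields `area` and `distance` are assumptions chosen for illustration; the patent states only that color and brightness should be weighted higher than directional edges and that the nearest or largest region is preferred.

```python
# Hypothetical weights: color and brightness dominate, directional edges count little.
WEIGHTS = {"color": 0.45, "brightness": 0.45, "edge": 0.10}


def saliency(features, weights=WEIGHTS):
    """Weighted sum of per-pixel (or per-region) feature values in 0..1."""
    return sum(weights[name] * features[name] for name in weights)


def pick_focus_region(regions, prefer="nearest"):
    """Choose one focus region from several candidates.

    Each region is a dict with an 'area' in pixels and a subject 'distance'.
    """
    if prefer == "nearest":
        return min(regions, key=lambda r: r["distance"])
    return max(regions, key=lambda r: r["area"])
```

With these weights, a saturated and bright but smooth region scores far higher than a dull region with strong edges, which matches the intent of looking for "something beautiful".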
[Case 4]
As shown in FIG. 7(a), when the person 31 in the scene 30 says, for example, "Wow, fast", the subject estimation unit 22 estimates in step S102 that the main subject to be captured next by the image sensor 12 is "something fast".

In step S103, the subject specifying unit 23 reads templates such as "car", "train", or "airplane", which are associated with "something fast", from the SDRAM 17 based on the estimation result of the subject estimation unit 22, and starts detecting the main subject, "something fast", in the frames captured next by the image sensor 12. When, as a result of the user panning or zooming the digital camera 100, the image sensor 12 captures the train 71 shown in FIG. 7(b), the subject specifying unit 23 identifies the image region 72 of the train 71 running at the foot of the mountain, rather than the mountain behind it, as the main subject.

In step S104, the CPU 20 sets the identified image region 72 of the train 71 as the focus region and drives the lens driving unit 18 to move the imaging lens 11 back and forth along the optical axis to adjust the focus. At the same time, the CPU 20 displays the through image of the frame on the monitor 16b with a superimposed AF frame indicating that the image region 72 of the train 71 is the focus region.

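In identifying "something fast", the subject specifying unit 23 may, instead of using templates such as "train", apply correlation processing between the current frame and the immediately preceding frame to calculate an inter-frame difference or motion vectors, and identify the image region of "something fast" based on the inter-frame difference or motion vectors. However, when the user captures "something fast" while panning the digital camera 100, the relationship between the inter-frame differences or motion vectors of "something fast" and of the background is reversed. In that case, the subject specifying unit 23 preferably determines the movement of the digital camera 100 itself based on, for example, the detection signal of the motion sensor 26, and then identifies the image region of "something fast". The CPU 20 also preferably performs continuous AF on a main subject that is "something fast".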
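The inter-frame difference approach, including the reversal that occurs while panning, can be sketched on small grayscale frames. The threshold, the boolean `camera_panning` flag (standing in for a decision derived from the motion sensor 26), and the function names are assumptions for illustration. A real implementation would also have to deal with textureless backgrounds, which this sketch ignores.

```python
def frame_difference(prev, curr):
    """Per-pixel absolute difference between two grayscale frames."""
    return [[abs(c - p) for p, c in zip(prow, crow)] for prow, crow in zip(prev, curr)]


def fast_subject_mask(prev, curr, camera_panning, threshold=10):
    """Mark pixels likely to belong to the fast-moving subject.

    With a static camera the subject changes between frames and the background
    does not. While the camera pans with the subject the relationship is
    reversed: the background changes and the tracked subject stays put.
    """
    diff = frame_difference(prev, curr)
    if camera_panning:
        return [[d < threshold for d in row] for row in diff]
    return [[d >= threshold for d in row] for row in diff]
```

The resulting mask would then be grouped into a region and handed to the focus control in the same way as a template match.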
[Case 5]
As shown in FIG. 8(a), when the person 31 in the object scene 30 utters, for example, "Wow, that's wide," the subject estimation unit 22 estimates in step S102 that the main subject to be imaged next by the image sensor 12 is a "wide thing."

In step S103, based on the estimation result of the subject estimation unit 22, the subject specifying unit 23 reads from the SDRAM 17 the templates associated with "wide thing," such as "sky" or "sea." The subject specifying unit 23 starts detecting the "wide thing," which is the main subject, from the frame imaged next by the image sensor 12. Then, when the image sensor 12 images the object scene 80 shown in FIG. 8(b) as a result of the user panning the digital camera 100 or the like, the subject specifying unit 23 specifies the "sky," rather than the "flowers" in the foreground, as the main subject.
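The association between an uttered adjective and its candidate templates can be sketched as a simple lookup. This is a minimal illustration, assuming speech recognition has already produced a list of words; the table contents follow the "fast" and "wide" examples in the text, while the names `SUBJECT_DICTIONARY` and `estimate_templates` are hypothetical.

```python
# Hypothetical dictionary data: recognized word -> template names to load.
SUBJECT_DICTIONARY = {
    "fast": ["automobile", "train", "airplane"],
    "wide": ["sky", "sea"],
}


def estimate_templates(utterance_words):
    """Return (keyword, templates) for the first word found in the dictionary."""
    for word in utterance_words:
        key = word.lower().strip(".,!?")
        if key in SUBJECT_DICTIONARY:
            return key, SUBJECT_DICTIONARY[key]
    # No known keyword: no estimate, so no templates are loaded.
    return None, []


keyword, templates = estimate_templates("Wow, wide".split())
print(keyword, templates)
```

A flat table keeps the estimation step cheap enough to run on every utterance; template matching against the next frames would then be limited to the returned candidates.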

In step S104, the CPU 20 performs single AF control at infinity so as to focus on the "sky," driving the lens driving unit 18 to move the photographing lens 11 back and forth along the optical axis and thereby performing focus adjustment.

In step S105, the CPU 20 determines whether an instruction to end imaging has been received through the user's operation of the operation member 24. If the instruction to end imaging has been received, the CPU 20 proceeds to step S106 (YES side). If it has not been received, the CPU 20 returns to step S101 (NO side) and repeats the processing of steps S101 to S104 until the instruction to end imaging is received.

In step S106, the CPU 20 reads the moving image data that has been stored in the SDRAM 17 and processed by the image processing unit 14, and sends it to a compression/decompression processing unit (not shown). The compression/decompression processing unit compresses the moving image data to generate a moving image file and records it, via the recording I/F 27, on the memory card 28 serving as the recording medium. The CPU 20 then ends the series of processes.
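The repetition of steps S101 to S105 can be sketched as a loop. This is a rough outline only: the role of step S101 (acquiring a frame together with its audio) is an assumption, and every callable passed in is a hypothetical stand-in for the corresponding unit of the digital camera 100.

```python
def run_capture(stream, estimate, specify, focus, end_requested):
    """Loop over steps S101-S105 and return the frames to be recorded in S106."""
    recorded = []
    for frame, audio in stream:           # S101 (assumed): acquire frame and audio
        subject = estimate(audio)          # S102: estimate the next main subject
        region = specify(frame, subject)   # S103: specify it in the frame
        focus(region)                      # S104: focus on the specified region
        recorded.append(frame)
        if end_requested():                # S105: end instruction received?
            break
    return recorded                        # S106 would compress and record these


# Usage with stand-in callables; the end instruction arrives after frame 2.
flags = iter([False, True])
focused = []
frames = run_capture(
    [("f1", "a1"), ("f2", "a2"), ("f3", "a3")],
    estimate=lambda audio: "train",
    specify=lambda frame, subject: (frame, subject),
    focus=focused.append,
    end_requested=lambda: next(flags),
)
print(frames)
```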

As described above, in the present embodiment, the next main subject is estimated in advance from the acquired audio data even though it is not yet in the image, so that the main subject can be specified quickly and with high accuracy from the image captured by the image sensor 12.

In addition, by estimating and specifying the next main subject in advance, the digital camera 100 can reliably focus on that main subject and image it in an optimal state.
<< Supplementary notes on the embodiment >>
(1) Although the reproduction processing apparatus of the present invention is realized by causing the CPU 20 of the digital camera 100 to execute the image processing program, the present invention is not limited to this. For example, the present invention is also applicable to a processing program for realizing the processing of the reproduction processing apparatus according to the present invention on an electronic device such as a computer or a smartphone having an imaging unit, and to a medium on which that program is recorded.

When a computer is operated as the image processing apparatus of the present invention and the computer plays back, for example, a moving image file read from the digital camera 100 or the like, the subject estimation unit 22 estimates the main subject based on the audio data attached to the moving image file or on the user's voice, and the subject specifying unit 23 specifies the estimated main subject from the frames played back after the estimation by the subject estimation unit 22. The CPU of the computer then acts as an information adding unit: it associates information such as the specified main subject and the size and position of its image area with the frame, and adds that information to the header area of the moving image file. This makes it easy to check what appears as the main subject in each frame of the moving image, which in turn facilitates editing of the moving image file.
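The per-frame information that the information adding unit attaches can be pictured as a small record per frame. The sketch below is illustrative only: the field names and the JSON layout are assumptions, and a real moving image container would dictate its own header format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SubjectTag:
    """Hypothetical per-frame record: which subject, and where it is."""
    frame_index: int
    subject: str
    x: int
    y: int
    width: int
    height: int


def build_header(tags):
    """Serialize the tags into a text blob that could be stored in a header."""
    return json.dumps({"main_subjects": [asdict(t) for t in tags]}, sort_keys=True)


header = build_header([SubjectTag(120, "train", 40, 60, 200, 80)])
print(header)
```

With such records in place, an editing tool could list every frame whose main subject is, say, "train" without decoding the video itself.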

(2) In the above embodiment, an example was described in which the CPU 20 realizes the processing of the speaker specifying unit 21, the subject estimation unit 22, and the subject specifying unit 23 in software; however, this processing may instead be realized in hardware using an ASIC or the like.

(3) In the above embodiment, the dictionary data stored in the nonvolatile memory 25 associates audio information such as nouns and adjectives with main subjects, but the present invention is not limited to this. For example, the dictionary data may treat the sound emitted by an automobile, a train, or the like as the audio information itself and associate it with the corresponding main subject, such as the automobile or the train.

For example, consider a case where the digital camera 100 is initially imaging with the focus on the mountain in the object scene 90 shown in FIG. 9(a). If, during this imaging, the microphones 15a and 15b of the digital camera 100 receive the sound of a train approaching the object scene 90, and the signal level of that sound is increasing or its frequency components are shifting higher (due to the Doppler effect), the subject estimation unit 22 estimates, based on the received audio information of the train, that the main subject to be imaged next by the image sensor 12 is the train. The subject specifying unit 23 reads the train template from the SDRAM 17 and starts detecting the train from the frame imaged next by the image sensor 12. Then, when the train enters the object scene 90 as shown in FIG. 9(b) without any panning or zooming of the digital camera 100, the subject specifying unit 23 specifies the image area of the train 91 running in the foreground as the main subject, and the CPU 20 switches the in-focus area from the mountain to the image area 92 of the train 91, driving the lens driving unit 18 to move the photographing lens 11 back and forth along the optical axis to perform focus adjustment.
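The "signal level is increasing" condition can be sketched with a windowed RMS comparison. This is a simplified illustration, assuming the audio is available as a list of samples; the window size, the `min_ratio` threshold, and the function names are hypothetical, and the Doppler-based frequency criterion is not covered here.

```python
from math import sqrt


def rms(samples):
    """Root-mean-square level of one window of samples."""
    return sqrt(sum(s * s for s in samples) / len(samples))


def is_approaching(samples, window, min_ratio=1.5):
    """True if the level never drops from window to window and grows overall."""
    levels = [
        rms(samples[i:i + window])
        for i in range(0, len(samples) - window + 1, window)
    ]
    if len(levels) < 2 or levels[0] == 0:
        return False
    rising = all(later >= earlier for earlier, later in zip(levels, levels[1:]))
    return rising and levels[-1] / levels[0] >= min_ratio


# Amplitude doubles in each window of four samples, as for a nearing source.
nearing = [1, -1, 1, -1, 2, -2, 2, -2, 4, -4, 4, -4]
print(is_approaching(nearing, window=4))
```

A practical detector would first have to classify the sound as a train, for example against the sound-based dictionary data described above, before interpreting a rising level as an approaching main subject.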

Note that the subject specifying unit 23 may use the stereo audio data from the microphones 15a and 15b to detect the main subject entering the object scene from the direction in which the sound is approaching.
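One simple way to picture the use of stereo audio is to compare the energy of the two channels. The sketch below is an assumption-laden illustration: it treats the two microphones as left and right channels, uses only the level difference (a real system might also use arrival-time differences), and the `margin` value and function name are hypothetical.

```python
def arrival_side(left, right, margin=1.2):
    """Guess which side the sound comes from by comparing channel energy."""
    left_energy = sum(s * s for s in left)
    right_energy = sum(s * s for s in right)
    if left_energy > right_energy * margin:
        return "left"
    if right_energy > left_energy * margin:
        return "right"
    # Energies are within the margin of each other: no clear side.
    return "center"


print(arrival_side([3, -3, 3, -3], [1, -1, 1, -1]))
```

The result could be used to restrict the template search to the corresponding edge of the frame, where the subject is expected to enter.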

(4) In the above embodiment, the microphones 15a and 15b receive sound from inside and outside the object scene; in addition, a microphone that receives the user's voice may be arranged separately from the microphones 15a and 15b.

(5) In the above embodiment, an example was described in which the main subject of the screen to be captured next is specified based on the audio information acquired by the microphones 15a and 15b. However, for example, if the utterance "Please look at this figure" by the person 31 is recognized while the screen shown in FIG. 5(b) is being captured, the main subject may be changed from the person 31 to the graph 52 within the same captured screen.

The features and advantages of the embodiments will be apparent from the above detailed description. The claims are intended to cover the features and advantages of the embodiments described above to the extent that they do not depart from their spirit and scope. Furthermore, a person having ordinary knowledge in the technical field should be able to readily conceive of any improvements and modifications; there is no intention to limit the scope of the inventive embodiments to what is described above, and appropriate improvements and equivalents falling within the scope disclosed in the embodiments may also be relied upon.

DESCRIPTION OF SYMBOLS: 11 ... imaging lens, 12 ... image sensor, 13 ... AFE, 14 ... image processing unit, 15a, 15b ... microphones, 20 ... CPU, 21 ... speaker specifying unit, 22 ... subject estimation unit, 23 ... subject specifying unit, 100 ... digital camera

Claims

An image processing apparatus comprising:
an image acquisition unit that acquires a plurality of images captured continuously in time series;
an audio acquisition unit that acquires audio recorded when the plurality of images were captured and associated with the plurality of images;
a subject estimation unit that extracts audio information from the audio and estimates a main subject of the images from the audio information; and
a subject specifying unit that specifies the estimated main subject from the images.

The image processing apparatus according to claim 1, wherein the subject specifying unit specifies the main subject from an image captured after the time at which the audio information is extracted.

The image processing apparatus according to claim 1 or 2, further comprising a speaker specifying unit that specifies, from the image, the person who uttered the audio.

The image processing apparatus according to any one of claims 1 to 3, further comprising an information adding unit that adds information on the main subject to the image in which the main subject has been specified by the subject specifying unit.

The image processing apparatus according to claim 1, further comprising a storage unit that stores information in which the audio information and the main subject are associated in advance.

An imaging apparatus comprising:
an imaging unit that captures images continuously in time series and generates a plurality of images;
a focus adjustment unit that adjusts the focus state on the object scene;
the image processing apparatus according to claim 1; and
a control unit that controls the focus adjustment unit so as to focus on the main subject estimated by the subject estimation unit.

An image processing program for causing a computer to execute:
an image acquisition procedure of acquiring a plurality of images captured continuously in time series;
an audio acquisition procedure of acquiring audio recorded when the plurality of images were captured and associated with the plurality of images;
a subject estimation procedure of extracting audio information from the audio and estimating a main subject of the images from the audio information; and
a subject specifying procedure of specifying the estimated main subject from the images.