
WO2025225378A1 - Image processing device, image processing method, and program - Google Patents

Image processing device, image processing method, and program

Info

Publication number
WO2025225378A1
WO2025225378A1 (PCT/JP2025/014112, JP2025014112W)
Authority
WO
WIPO (PCT)
Prior art keywords
virtual
image
model
viewpoint
virtual camera
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/JP2025/014112
Other languages
French (fr)
Japanese (ja)
Inventor
圭吾 米田
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Canon Inc filed Critical Canon Inc
Publication of WO2025225378A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00: 3D [Three Dimensional] image rendering
    • G06T19/00: Manipulating 3D models or images for computer graphics
    • G06T7/00: Image analysis
    • G06T7/50: Depth or shape recovery
    • G06T7/55: Depth or shape recovery from multiple images
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00: Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20: Image signal generators
    • H04N13/271: Image signal generators wherein the generated image signals comprise depth maps or disparity maps
    • H04N23/00: Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60: Control of cameras or camera modules

Definitions

  • the present disclosure relates to an image processing device for compressing and distributing 3D model data representing a 3D model.
  • the distributing device arranges multiple virtual cameras (or a group of virtual cameras) to surround the 3D model and generates depth images and texture images from each of the multiple virtual cameras.
  • the generated depth and texture images are then compressed, and the compressed depth and texture images, along with information indicating the positions and orientations of the multiple virtual cameras, are distributed to the user.
  • the receiving device decodes the compressed depth and texture images, and reconstructs the 3D model based on the decoded depth and texture images and information indicating the positions and orientations of the multiple virtual cameras.
  • Patent Document 1 describes a method for determining the viewpoints of a group of virtual cameras so as to minimize fluctuations in the position of key objects in depth images between frames, with the aim of improving the compression rate of depth images. By reducing the number of motion vectors included in the encoded stream when encoding depth images, it is expected that the compression rate of depth images will improve.
  • depending on how the virtual cameras are set, however, the quality of the 3D model reconstructed on the device on the receiving side may be reduced.
  • if the shape of the 3D model generated on the distribution side is complex, then depending on the positions and orientations of the virtual cameras, that complex shape may not be sufficiently represented in the depth images, and the shape accuracy of the 3D model reconstructed on the user side may be reduced.
  • as a result, the shape of the 3D model generated on the distribution side may differ significantly from the shape of the 3D model reconstructed on the receiving side. Therefore, to reconstruct the 3D model more accurately, it is necessary to appropriately determine the positions and orientations of the virtual cameras.
  • the present disclosure therefore aims to reduce the possibility of a decrease in the quality of the 3D model generated on the receiving device.
  • the image processing device has the following configuration: a setting means for setting the positions and orientations of multiple second virtual cameras based on the optical axes of multiple first virtual cameras in virtual space that correspond to the positions and orientations of multiple image capture devices in real space; and a generation means for generating multiple depth images that indicate the distance between each of the multiple second virtual cameras and a 3D model of a subject that is generated based on the multiple captured images acquired by the multiple image capture devices.
  • This disclosure reduces the possibility of a decrease in the quality of the 3D model generated on the receiving device.
  • FIG. 1A is a block diagram showing an example of the configuration of an image processing system 1.
  • FIG. 1B illustrates an example of a hardware configuration of an image processing apparatus.
  • FIG. 2 is a diagram illustrating an example of the functional configuration of a first image processing device 20 and a second image processing device 30.
  • FIG. 1 is a diagram illustrating an example of the configuration of an imaging system 10.
  • FIG. 3B is a diagram showing a first virtual camera group 305 that corresponds to the physical camera group 302 and a 3D model 304 of a subject that is generated using the first virtual camera group 305.
  • FIG. 3C is a diagram showing a second virtual camera group 307 set based on viewpoint information of the first virtual camera group 305.
  • 10A and 10B are diagrams illustrating generation of viewpoint information of a virtual camera according to the first embodiment.
  • FIG. 5 is a flowchart illustrating an example of a process for compressing and distributing 3D model data according to the first embodiment.
  • 10 is a flowchart illustrating an example of a process for receiving 3D model data and generating a virtual viewpoint image according to the first embodiment.
  • 10 is a flowchart illustrating an example of a process for generating viewpoint information of a group of virtual cameras according to the first embodiment.
  • 10 is a flowchart illustrating an example of a restoration process of a 3D model according to the first embodiment.
  • 10A and 10B are diagrams illustrating generation of viewpoint information of a group of virtual cameras according to the second embodiment.
  • 10 is a flowchart illustrating an example of a process for compressing and distributing 3D model data according to a second embodiment.
  • 10 is a flowchart illustrating an example of a restoration process of a 3D model according to a second embodiment.
  • the image processing device has a setting means for setting the positions and orientations of multiple second virtual cameras based on the optical axes of multiple first virtual cameras in virtual space that correspond to the positions and orientations of multiple imaging devices in real space.
  • the image processing device also has a generation means for generating multiple depth images that indicate the distance between each of the multiple second virtual cameras and a 3D model of the subject generated based on multiple captured images acquired by the multiple imaging devices.
  • the multiple imaging devices are aligned with each other.
  • the multiple first virtual cameras are set based on the positional relationships between the multiple imaging devices in real space. In other words, the multiple first virtual cameras in virtual space are reproductions in virtual space of the multiple imaging devices in real space.
  • the positions and orientations of the multiple second virtual cameras are set based on the positions and orientations of the multiple imaging devices. Therefore, the subject depicted in the depth images of the multiple second virtual cameras is similar to the subject included in the captured images acquired from the multiple imaging devices. Specifically, the shape of the area representing the subject included in the depth images generated from the second virtual cameras is similar to the shape of the area representing the subject included in the captured images.
  • the 3D model of the subject generated by the distribution device is generated based on the multiple captured images acquired from the multiple imaging devices. Therefore, if the depth images used to reconstruct (restore) the 3D model of the subject on the receiving device are similar to the captured images, the 3D model reconstructed on the receiving device will also be similar to the 3D model generated on the distribution device. In other words, it is possible to reduce the possibility that the shape of the 3D model generated on the receiving device will differ from that of the 3D model generated on the distribution device, i.e., the reproducibility of the 3D model will be degraded.
  • the image processing device also has an encoding means for encoding the depth images.
  • the image processing device also has an output means for outputting the encoded depth images and viewpoint information indicating the positions and orientations of the second virtual cameras to another device that reconstructs the 3D model.
  • the other device is a device that reconstructs the 3D model based on the encoded depth images and viewpoint information.
  • the generation means generates a virtual viewpoint image including a 3D model of the subject for each of the plurality of second virtual cameras.
  • the encoding means then encodes the plurality of virtual viewpoint images, and the output means outputs the encoded plurality of virtual viewpoint images to the other device.
  • the colors of each component of the subject included in the virtual viewpoint image (texture image) generated from the second virtual camera are close to the colors of each component of the subject included in the captured image.
  • the generation means generates correspondence information indicating whether each pixel in the texture image corresponds to a component of the 3D model of the subject, for each of the second virtual cameras.
  • the encoding means then encodes the plurality of pieces of correspondence information, and the output means outputs the encoded plurality of pieces of correspondence information to the other device.
  • This aspect allows the receiving device to identify, among the pixels of the texture image, the pixels to be used to color the 3D model of the subject. For example, if a subject is occluded by another subject in an image captured by an imaging device and that captured image is used to generate a texture image for the second virtual camera, a texture image with color information that differs from the color information of that subject could be generated. Therefore, by generating correspondence information that indicates areas where occlusion is not considered to occur, the possibility of colors differing when the 3D model is reconstructed can be reduced.
  • the second virtual cameras are set at positions closer to the 3D model of the subject than the first virtual cameras.
  • the generation means generates one depth image for one second virtual camera.
  • one depth image is generated for one second virtual camera at each of a plurality of time points.
  • the image processing device has an acquisition means for acquiring a plurality of encoded depth images and first viewpoint information indicating the positions and orientations of a plurality of second virtual cameras.
  • a depth image is an image indicating the distance between each of the plurality of second virtual cameras and a 3D model of the subject generated based on a plurality of captured images acquired by the plurality of imaging devices.
  • the plurality of second virtual cameras are virtual cameras set based on the optical axes of a plurality of first virtual cameras in virtual space corresponding to the positions and orientations of the plurality of imaging devices in real space.
  • the image processing device also has a decoding means for decoding the plurality of encoded depth images.
  • the image processing device also has a generation means for generating a 3D model of the subject based on the plurality of decoded depth images and the first viewpoint information.
  • the image processing device also acquires second viewpoint information indicating the position and orientation of a third virtual camera different from the first virtual camera and the second virtual camera.
  • the image processing device also generates a virtual viewpoint image based on the second viewpoint information and a 3D model of the subject.
  • the third virtual camera may be set by a user operating an input device such as a joystick, or may be set based on the position of the generated 3D model.
  • a program causes a computer to execute the functions of the image processing device described above.
  • by executing this program, the computer functions as the image processing device described above.
  • a virtual viewpoint image is an image generated by a user freely manipulating the position and orientation of a virtual camera, and is also called a free viewpoint image or arbitrary viewpoint image.
  • the term "image" will be used to refer to both moving images and still images.
  • Image processing system 1 is a system that generates a virtual viewpoint image that represents a scene from a specified virtual viewpoint, based on multiple images captured by multiple imaging devices and a specified virtual viewpoint.
  • the virtual viewpoint image in this embodiment is also called a free viewpoint video, but is not limited to an image corresponding to a viewpoint freely (arbitrarily) specified by the user; for example, the virtual viewpoint image also includes an image corresponding to a viewpoint selected by the user from multiple candidates.
  • this embodiment will mainly describe a case where the virtual viewpoint is specified by user operation, but the virtual viewpoint may also be specified automatically based on the results of image analysis, etc.
  • this embodiment will mainly describe a case where the virtual viewpoint image is a video, but the virtual viewpoint image may also be a still image.
  • the viewpoint information used to generate a virtual viewpoint image is information that indicates the position and orientation (line of sight) of the virtual viewpoint.
  • the viewpoint information is a parameter set that includes parameters that indicate the three-dimensional position of the virtual viewpoint and parameters that indicate the orientation of the virtual viewpoint in the pan, tilt, and roll directions.
  • the content of the viewpoint information is not limited to the above.
  • the parameter set serving as viewpoint information may include a parameter that indicates the size of the field of view (angle of view) of the virtual viewpoint.
  • the viewpoint information may have multiple parameter sets.
  • the viewpoint information may have multiple parameter sets that respectively correspond to multiple frames that make up a video of the virtual viewpoint image, and may be information that indicates the position and orientation of the virtual viewpoint at each of multiple consecutive points in time.
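For illustration only, the viewpoint information described in the items above could be represented by a structure like the following Python sketch; the class and field names are assumptions for this example and do not appear in the disclosure.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ViewpointParameterSet:
    """One parameter set: the 3D position of the virtual viewpoint and its
    orientation in the pan, tilt, and roll directions."""
    position_xyz: tuple                         # (x, y, z) in virtual-space coordinates
    pan_deg: float
    tilt_deg: float
    roll_deg: float
    angle_of_view_deg: Optional[float] = None   # optional field-of-view parameter

@dataclass
class ViewpointInfo:
    """Viewpoint information: one parameter set per frame of the virtual viewpoint video,
    i.e. the position and orientation at each of multiple consecutive points in time."""
    per_frame: List[ViewpointParameterSet]
```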
  • the image processing system 1 has multiple imaging devices that capture images of an imaging area from multiple directions.
  • the imaging area may be, for example, a stadium where sports such as soccer or karate are held, or a stage where concerts or plays are held.
  • the multiple imaging devices are installed in different positions surrounding such an imaging area and capture images in sync. Note that the multiple imaging devices do not have to be installed around the entire perimeter of the imaging area; depending on installation location restrictions, they may be installed only around a portion of the perimeter of the imaging area.
  • the number of imaging devices is not limited to the example shown in the figure; for example, if the imaging area is a soccer stadium, around 30 imaging devices may be installed around the stadium. Imaging devices with different functions, such as telephoto cameras and wide-angle cameras, may also be installed.
  • the multiple imaging devices are each assumed to be cameras having an independent housing and capable of capturing images from a single viewpoint.
  • however, the configuration is not limited to this, and two or more imaging devices may be configured within the same housing.
  • a single camera equipped with multiple lens groups and multiple sensors and capable of capturing images from multiple viewpoints may be installed as the multiple imaging devices.
  • a virtual viewpoint image can be generated, for example, using the following method. First, multiple images (multiple captured images) are obtained by capturing images from different directions using multiple imaging devices. Next, a foreground image is obtained from the multiple captured images, extracting a foreground area corresponding to a specific object, such as a person or a ball, and a background image is obtained from the multiple captured images, extracting a background area other than the foreground area. A foreground model representing the three-dimensional shape of the specific object and texture data for coloring the foreground model are generated based on the foreground image, and texture data for coloring a background model representing the three-dimensional shape of the background, such as a stadium, is generated based on the background image.
  • the texture data is then mapped to the foreground model and background model, and rendering is performed according to the virtual viewpoint indicated by the viewpoint information, thereby generating a virtual viewpoint image.
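The foreground extraction step in the method above is commonly realized with background subtraction. The following is a minimal Python sketch under that assumption; the function names and the fixed threshold are illustrative and not taken from the disclosure.

```python
import numpy as np

def extract_foreground_mask(captured_bgr: np.ndarray,
                            background_bgr: np.ndarray,
                            threshold: float = 30.0) -> np.ndarray:
    """Return a boolean silhouette mask of the foreground objects (e.g. players, the ball).

    A pixel is treated as foreground when its color differs sufficiently from a
    pre-captured background image of the same physical camera.
    """
    diff = np.linalg.norm(captured_bgr.astype(np.float32)
                          - background_bgr.astype(np.float32), axis=2)
    return diff > threshold

def extract_foreground_image(captured_bgr: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Foreground image: keep the foreground pixels and zero out the background area."""
    return np.where(mask[..., None], captured_bgr, 0)
```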
  • the method for generating a virtual viewpoint image is not limited to this, and various methods can be used, such as generating a virtual viewpoint image by projective transformation of captured images without using a three-dimensional model.
  • a foreground image is an image in which an object's area (foreground area) has been extracted from an image captured by an imaging device.
  • An object extracted as a foreground area is a dynamic object (moving body) that moves (its absolute position and shape can change) when images are captured from the same direction in chronological order. Examples of objects include players, referees, and other people on the field where a sport is taking place, such as the ball in a ball game, or singers, musicians, performers, and presenters in a concert or entertainment event.
  • a background image is an image of at least an area (background area) that is different from the foreground object.
  • a background image is an image in which the foreground object has been removed from the captured image.
  • the background refers to an object that remains stationary or nearly stationary when images are taken from the same direction in chronological order. Examples of such objects include a stage for a concert, a stadium where an event such as a sport is held, a structure such as a goal used in a ball game, or a field.
  • the background is at least an area that is different from the foreground object, and the captured object may include other objects in addition to the object and background.
  • a virtual camera is a virtual camera that is different from the multiple imaging devices actually installed around the imaging area, and is a concept used to conveniently explain the virtual viewpoint involved in generating a virtual viewpoint image.
  • a virtual viewpoint image can be considered to be an image captured from a virtual viewpoint set in a virtual space associated with the imaging area.
  • the position and orientation of the viewpoint in this virtual image can be expressed as the position and orientation of the virtual camera.
  • a virtual viewpoint image can be said to be an image that simulates the captured image obtained by a camera, assuming that the camera exists at the position of the virtual viewpoint set in space.
  • In Example 1, a process for determining the viewpoints of a group of virtual cameras for generating depth images and texture images to be distributed from a server to a client, based on viewpoint information of the physical cameras, will be described.
  • the receiving device will be referred to as the client.
  • the imaging device arranged in real space will be referred to as the physical camera.
  • FIG. 1A is a diagram showing an example of the overall configuration of an image processing system 1 according to this embodiment.
  • the image processing system 1 has a shooting system 10, a first image processing device 20, a second image processing device 30, an input device 40, and a display device 50.
  • the image processing system 1 generates shape information of a 3D model of an object using images (plurality of captured images) taken by multiple physical cameras.
  • the image processing system 1 then generates viewpoint information (first viewpoint information) of a second virtual camera group for generating depth images and texture images to be distributed from the server to the client.
  • the image processing system 1 generates and compresses depth images and texture images based on the generated viewpoint information of the second virtual camera group, and distributes these images and the information necessary to reconstruct (restore) the 3D model to the user as 3D model data.
  • the photography system 10 places multiple physical cameras in different positions and synchronously photographs a subject (object) from multiple viewpoints.
  • the multiple captured images obtained through synchronous photography, as well as viewpoint information (external/internal parameters, image size) for each physical camera of the photography system 10, are then transmitted to the first image processing device 20.
  • the external parameters of a camera are parameters that indicate the position and orientation of the camera (e.g., rotation matrix and position vector).
  • the internal parameters of a camera are internal parameters specific to the camera, such as focal length, image center, and lens distortion parameters.
  • the external and internal parameters of a camera are collectively referred to as camera parameters.
  • Image size refers to the width and height of the image.
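As an illustration of the camera parameters listed above, the Python sketch below assembles the internal parameters (focal length and image center) into a matrix and projects a 3D point using the external parameters (rotation matrix and position vector); lens distortion is omitted and all function and variable names are assumptions.

```python
import numpy as np

def make_intrinsics(fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Internal parameters as a 3x3 matrix (focal lengths and image center; distortion omitted)."""
    return np.array([[fx, 0.0, cx],
                     [0.0, fy, cy],
                     [0.0, 0.0, 1.0]])

def project_point(p_world: np.ndarray, R: np.ndarray, t: np.ndarray, K: np.ndarray):
    """Project a 3D world point with external parameters (R, t) and internal parameters K.

    Returns pixel coordinates (u, v) and the depth along the camera's optical axis.
    """
    p_cam = R @ p_world + t          # world coordinates -> camera coordinates
    depth = p_cam[2]
    uvw = K @ p_cam                  # camera coordinates -> image plane
    return uvw[0] / uvw[2], uvw[1] / uvw[2], depth
```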
  • the viewpoint information of this group of physical cameras is used when determining the viewpoint information (camera parameters and image size) of the second group of virtual cameras used to generate depth images and texture images.
  • the first image processing device 20 generates a 3D model of the foreground object based on the multiple captured images input from the imaging system 10 and the viewpoint information for each physical camera.
  • the foreground object (subject) is, for example, a person or moving object within the imaging range of the imaging system 10.
  • the first image processing device 20 then generates viewpoint information for a second virtual camera group for generating depth images and texture images.
  • the first image processing device 20 generates and compresses depth images and texture images based on the generated viewpoint information for the second virtual camera group, and outputs these compressed images and the information (metadata) required to restore the 3D model to the second image processing device 30.
  • the metadata is, for example, the generated viewpoint information for the second virtual camera group.
  • the first image processing device 20 estimates shape information of the subject based on the multiple captured images and the viewpoint information for each physical camera. The shape information of the subject is estimated using, for example, the volume intersection method (visual hull).
  • Multiple virtual cameras (multiple first virtual cameras) are set in virtual space corresponding to the imaging system 10 existing in real space. These multiple first virtual cameras are referred to as a first virtual camera group.
  • This first virtual camera group is a reproduction of a physical camera group in virtual space.
  • the viewpoint information of the first virtual camera group corresponds to the viewpoint information of the physical camera group in real space.
  • shape information of the subject is generated by using the viewpoint information of the first virtual camera group and the multiple captured images in the visual hull intersection method.
  • a 3D point cloud (a set of points with three-dimensional coordinates) that represents the shape information of the subject is obtained.
  • This 3D point cloud is also called a visual hull.
  • the method of deriving shape information of the subject from captured images is not limited to this.
  • the method of expressing the shape information of the subject is not limited to a 3D point cloud, but can also be a mesh or voxel.
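For reference, the volume intersection (visual hull) computation described above can be sketched as carving a candidate 3D grid against the silhouettes of the subject. The sketch below is a simplified illustration, not the disclosed implementation; it assumes the hypothetical project_point helper from the earlier camera-parameter sketch and an illustrative candidate grid.

```python
import numpy as np

def carve_visual_hull(silhouettes, cameras, grid_points):
    """Keep only the candidate 3D points whose projection falls inside every silhouette.

    silhouettes : list of boolean foreground masks (H, W), one per first virtual camera
    cameras     : list of (R, t, K) tuples for the corresponding cameras
    grid_points : (N, 3) array of candidate points covering the capture volume
    Returns the surviving points as an (M, 3) array, i.e. the 3D point cloud (visual hull).
    """
    keep = np.ones(len(grid_points), dtype=bool)
    for mask, (R, t, K) in zip(silhouettes, cameras):
        h, w = mask.shape
        for i, p in enumerate(grid_points):
            if not keep[i]:
                continue
            u, v, depth = project_point(p, R, t, K)   # helper from the earlier sketch
            ui, vi = int(round(u)), int(round(v))
            # Carve away points that project behind the camera, outside the image,
            # or outside the silhouette.
            if depth <= 0 or not (0 <= ui < w and 0 <= vi < h) or not mask[vi, ui]:
                keep[i] = False
    return np.asarray(grid_points)[keep]
```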
  • a texture image (color information) is determined for each point in the generated 3D point cloud of the subject using multiple captured images. Therefore, 3D model data representing a 3D model of the subject includes shape information indicating the shape and color information indicating the color.
  • in this embodiment, the 3D model is generated from shape information and color information, but the configuration is not limited to this.
  • the 3D model includes shape information and may be managed as data separate from color information. Note that the method of setting a first virtual camera group in virtual space corresponding to a physical camera group in real space is well-known technology in the field of CG, etc., and therefore will not be described here.
  • the first image processing device 20 generates viewpoint information (position and orientation) of multiple second virtual cameras for generating depth images and texture images to be distributed.
  • the multiple second virtual cameras will be referred to as a second virtual camera group.
  • the position of this second virtual camera group is, for example, arranged on the optical axis of the first virtual camera group, and the orientation is set to be the same as the orientation of the first virtual camera. Since the viewpoint information of the first virtual camera group corresponds to the viewpoint information of the physical camera group in real space, the viewpoint information of the second virtual camera group is generated based on the viewpoint information of the physical camera group. The method for generating the viewpoint information of the second virtual camera group will be explained in detail later.
  • the first image processing device 20 generates depth images of the 3D model based on the viewpoint information of this second virtual camera group. One depth image is generated for each second virtual camera. Specifically, each point in the 3D point cloud of the subject is projected onto the same plane as the imaging plane of each second virtual camera. For each second virtual camera, the distance (depth) from that second virtual camera to the subject is calculated for each projected pixel, and a depth value is set for each pixel of the depth image. Furthermore, the first image processing device 20 captures images of the subject based on the viewpoint information of the second virtual camera group and generates texture images.
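The per-camera depth image generation just described can be sketched as a simple z-buffer over the 3D point cloud. The code below is an illustration only; it reuses the hypothetical project_point helper from the earlier sketch and treats 0 as the value for pixels where no subject appears.

```python
import numpy as np

def render_depth_image(points_3d, R, t, K, width, height):
    """Depth image for one second virtual camera: nearest camera-to-point depth per pixel."""
    depth_img = np.full((height, width), np.inf, dtype=np.float32)
    for p in points_3d:
        u, v, depth = project_point(p, R, t, K)
        ui, vi = int(round(u)), int(round(v))
        if depth > 0 and 0 <= ui < width and 0 <= vi < height:
            depth_img[vi, ui] = min(depth_img[vi, ui], depth)  # keep the closest point
    depth_img[np.isinf(depth_img)] = 0.0   # pixels onto which no point projects
    return depth_img
```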
  • the texture image is generated by blending the colors of multiple captured images while increasing the priority of pixel values of images captured by a physical camera with a line of sight close to the line of sight of the second virtual camera.
  • This method of setting a high priority (weight) for pixel values of images captured by a physical camera with a line of sight close to the line of sight of the second virtual camera when blending multiple captured images is referred to as a virtual viewpoint-dependent texture image generation method. Details of the virtual viewpoint-dependent texture image generation method will be described later.
  • this virtual viewpoint-dependent texture image is generated by selecting, depending on the position and orientation of the virtual camera, the physical camera image that determines the pixel values of the subject, so the color of the subject can change when the virtual camera moves.
  • a texture image generated using this method is referred to as a virtual viewpoint-dependent texture image.
  • virtual-viewpoint-independent texture images are generated in a way that the pixel values of the subject do not change depending on the position and orientation of the virtual camera.
  • texture image generation deals with virtual-viewpoint-dependent texture images, but texture images may also be generated in a virtual-viewpoint-independent manner.
  • An example of the process for generating virtual-viewpoint-dependent and virtual-viewpoint-independent texture images is shown below.
  • the process of generating a virtual viewpoint-dependent texture image includes, for example, a process of determining the visibility of points in the 3D point cloud that constitutes the subject, and a process of deriving colors based on the position and orientation of the virtual camera.
  • the physical camera that can capture each point is identified based on the positional relationship between each point in the 3D point cloud and the multiple physical cameras included in the group of physical cameras that the imaging system 10 has.
  • a point in the 3D point cloud is set as a focus point, and the color of that focus point is derived.
  • the following process is performed for each focus point. A focus point that is included in the imaging range of the second virtual camera is selected. Then, a first virtual camera that can capture that focus point and whose line of sight is close to the line of sight of the second virtual camera is selected.
  • since the first virtual camera is a reproduction of a physical camera in virtual space, selecting a first virtual camera is equivalent to selecting the corresponding physical camera.
  • the selected focus point is then projected onto the image captured by the selected physical camera.
  • the color of the pixel at the projection destination is set as the color of that focus point.
  • a physical camera is selected, for example, based on whether the angle between the line of sight from the second virtual camera to the focus point and the line of sight from the physical camera to the focus point is within a certain angle. If the point of interest can be captured by multiple physical cameras, multiple physical cameras with viewing directions close to the viewing direction of the second virtual camera are selected, and the point of interest is projected onto each of the images captured by those physical cameras.
  • the pixel values of the projection destination are then obtained, and a weighted average is calculated so that pixel values from physical cameras with viewing directions close to the second virtual camera are used preferentially, thereby determining the color of the point of interest.
  • this process is performed while changing the focus point, and the colors of the focus points are projected onto the same plane as the imaging plane of the second virtual camera, thereby generating a texture image that is dependent on the virtual viewpoint. Although the explanation here uses the second virtual camera, the process is not limited to it; when generating a texture image dependent on a virtual camera different from the second virtual camera, the above process is performed with that virtual camera in place of the second virtual camera.
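A minimal sketch of the viewpoint-dependent color derivation described above is given below. The angle threshold and the weighting scheme are illustrative assumptions; the colors of the physical cameras are assumed to have been sampled beforehand by projecting the focus point into each captured image.

```python
import numpy as np

def blend_viewpoint_dependent_color(point, virtual_cam_pos, physical_cams,
                                    max_angle_deg=30.0):
    """Color a focus point, preferring physical cameras whose line of sight to the point
    is close to the line of sight of the second virtual camera.

    physical_cams: list of (camera_position, sampled_color) pairs for the cameras that
    can capture the point, where sampled_color is the pixel at the point's projection.
    """
    v_dir = point - virtual_cam_pos
    v_dir = v_dir / np.linalg.norm(v_dir)
    colors, weights = [], []
    for cam_pos, color in physical_cams:
        c_dir = point - cam_pos
        c_dir = c_dir / np.linalg.norm(c_dir)
        angle = np.degrees(np.arccos(np.clip(np.dot(v_dir, c_dir), -1.0, 1.0)))
        if angle <= max_angle_deg:
            colors.append(np.asarray(color, dtype=np.float32))
            weights.append(max_angle_deg - angle + 1e-6)  # closer lines of sight weigh more
    if not colors:
        return None                                        # no suitable physical camera
    return np.average(np.stack(colors), axis=0, weights=np.asarray(weights))
```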
  • the process of generating a texture image independent of the virtual viewpoint includes, for example, the visibility determination process described above and a process of deriving a color that is independent of the position and orientation of the virtual camera.
  • a point in the 3D point cloud is set as the focus point, and that focus point is projected onto an image captured by a physical camera corresponding to a first virtual camera that can capture images, and the color of the pixel at the projection destination is set as the color of that focus point.
  • if the point of interest can be captured by multiple first virtual cameras, the point of interest is projected onto each of the images captured by the multiple physical cameras corresponding to those first virtual cameras.
  • the pixel values at the projection destination are then obtained and the average of the pixel values is calculated to determine the color of the point of interest. This process is performed while changing the point of interest, and the color of the point of interest is projected onto the same surface as the imaging surface of the second virtual camera, thereby generating a texture image that is independent of the virtual viewpoint.
  • the method of generating color information dependent on a virtual viewpoint is known as a method of generating images with higher image quality than those independent of a virtual viewpoint, because priority is given to the pixel values of the image captured by a physical camera with a line of sight close to that of the second virtual camera.
  • a virtual viewpoint-dependent virtual viewpoint image (texture image) generated by a virtual camera with a position and orientation similar to that of a physical camera is heavily influenced by the image captured by that physical camera, making it possible to generate a texture image with image quality close to that of the captured image.
  • a virtual viewpoint-dependent texture image generated from the viewpoint of a virtual camera whose position and orientation differ significantly from those of the physical cameras, or a texture image generated independently of the virtual viewpoint, is generated by interpolating pixel values from multiple physical cameras.
  • as a result, the image may contain pixel values that differ from the color information of the actual subject, or may have low contrast.
  • the first image processing device 20 compresses (encodes) the depth image and texture image using a video compression method such as H.264 or H.265.
  • the compression method is not limited to video compression, and any method that can encode to a size smaller than the original data volume, such as file compression, may be used.
  • the first image processing device 20 does not need to output depth images and texture images that do not show the subject to the second image processing device 30.
  • the second image processing device 30 back-projects the depth images into virtual space based on the viewpoint information of the second virtual camera group, making it possible to reconstruct the shape information of the 3D model.
  • by using the pixel value of the texture image at the same coordinates as each depth value of the depth image as the color information of the point to which that depth value is back-projected, it is possible to add color information to the shape information and reconstruct the 3D model.
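The back-projection just described might look like the following sketch, which converts each valid depth pixel of one second virtual camera back into a world-space point and attaches the texture color at the same coordinates. The camera convention matches the earlier projection sketch, and the names are assumptions.

```python
import numpy as np

def backproject_depth_and_texture(depth_img, texture_img, R, t, K):
    """Reconstruct a colored point cloud from one second virtual camera's depth/texture pair.

    R, t, K are that camera's external and internal parameters, with the same convention
    as the projection sketch (p_cam = R @ p_world + t). Returns (points, colors).
    """
    K_inv = np.linalg.inv(K)
    points, colors = [], []
    h, w = depth_img.shape
    for v in range(h):
        for u in range(w):
            d = depth_img[v, u]
            if d <= 0:                                   # 0 marks pixels with no subject
                continue
            p_cam = K_inv @ np.array([u, v, 1.0]) * d    # pixel -> camera coordinates
            p_world = R.T @ (p_cam - t)                  # camera -> world coordinates
            points.append(p_world)
            colors.append(texture_img[v, u])
    return np.asarray(points), np.asarray(colors)
```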
  • the first image processing device 20 may also extract a rectangular image of the subject's shooting range from the pre-compression depth image and texture image, compress this rectangular image (ROI image), and distribute it.
  • this rectangular image is referred to as an ROI (Region of Interest) image.
  • the metadata may include coordinate information for the cut-out rectangular image.
  • the first image processing device 20 may also arrange the rectangular images to form a single image, compress it, and distribute it. By distributing the rectangular image instead of the entire image, it is possible to reduce the amount of data.
  • the first image processing device 20 may generate depth images containing high-precision depth values, such as single-precision floating-point numbers (32 bits), which cannot be used for video compression using H.264 or H.265.
  • in that case, the depth information is converted to a bit depth that allows video compression (8 bits or 10 bits) before the video is compressed.
  • the conversion method may involve, for example, performing scalar quantization processing, and compressing and distributing the quantized depth image.
  • the minimum and maximum values of the depth value range before quantization may be included as metadata.
  • by using this metadata, the client can reconstruct the 3D model of the subject within the shooting range with a precision that could not be achieved using only the bit depth required for video compression.
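The scalar quantization and the client-side inverse described above could be sketched as follows. The 10-bit depth, the reserved code 0 for pixels without a subject, and the function names are assumptions made for illustration.

```python
import numpy as np

def quantize_depth(depth_f32: np.ndarray, bits: int = 10):
    """Quantize float depth values to 'bits'-bit integers so they fit a video codec.

    Returns the quantized image and the (min, max) depth range to be sent as metadata.
    Pixels with depth 0 (no subject) keep the reserved code 0.
    """
    valid = depth_f32 > 0
    d_min, d_max = float(depth_f32[valid].min()), float(depth_f32[valid].max())
    levels = (1 << bits) - 1                    # e.g. 1023 for 10 bits
    q = np.zeros(depth_f32.shape, dtype=np.uint16)
    q[valid] = np.round((depth_f32[valid] - d_min) / (d_max - d_min)
                        * (levels - 1)).astype(np.uint16) + 1
    return q, (d_min, d_max)

def dequantize_depth(q: np.ndarray, d_min: float, d_max: float, bits: int = 10):
    """Client-side inverse: recover approximate float depth using the metadata range."""
    levels = (1 << bits) - 1
    depth = np.zeros(q.shape, dtype=np.float32)
    valid = q > 0
    depth[valid] = (q[valid].astype(np.float32) - 1) / (levels - 1) * (d_max - d_min) + d_min
    return depth
```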
  • the second image processing device 30 receives and decodes the depth image, texture image, and metadata from the first image processing device 20 and restores the 3D model. As described above, the 3D model is restored by back-projecting the depth image and texture image into virtual space based on the second viewpoint information of the second virtual camera group contained in the metadata.
  • the second image processing device 30 also calculates viewpoint information (second viewpoint information) of a third virtual camera for generating a virtual viewpoint image viewed by the user based on input values received from the input device 40, which will be described later.
  • the second image processing device 30 then generates a virtual viewpoint image based on the calculated second viewpoint information and the restored 3D model.
  • the second image processing device 30 then outputs the generated virtual viewpoint image to the display device 50.
  • the input device 40 accepts input values used by the user to set the third virtual camera and transmits the input values to the second image processing device 30.
  • the input device 40 has input units such as a joystick, jog dial, touch panel, keyboard, and mouse.
  • the user setting the third virtual camera sets the position and orientation of the third virtual camera by operating the input unit.
  • the user sets the position and orientation of the third virtual camera, but this is not limited to this, and the position and orientation of the third virtual camera may also be set using position information of the 3D model.
  • the position information of the 3D model here may be generated by the first image processing device 20 on the distribution side, or may be generated by the second image processing device 30 on the receiving side.
  • the display device 50 displays the virtual viewpoint image generated and output by the second image processing device 30.
  • the user views the virtual viewpoint image displayed on the display device 50 and sets the position and orientation of the virtual camera for the next frame via the input device 40.
  • FIG. 1B is a diagram showing an example of the hardware configuration of the first image processing device 20 of this embodiment.
  • the hardware configuration of the second image processing device 30 is also similar to the configuration of the first image processing device 20 described below.
  • the first image processing device 20 of this embodiment is composed of a CPU 101, RAM 102, ROM 103, and communication unit 104.
  • the CPU 101 controls the entire first image processing device 20 using computer programs and data stored in the RAM 102 and ROM 103, thereby realizing each function of the first image processing device 20 shown in Figures 1A and 1B.
  • the first image processing device 20 may have one or more pieces of dedicated hardware different from the CPU 101, and at least some of the processing by the CPU 101 may be performed by the dedicated hardware. Examples of dedicated hardware include an ASIC (application-specific integrated circuit), an FPGA (field-programmable gate array), and a DSP (digital signal processor).
  • the RAM 102 temporarily stores programs and data supplied from the auxiliary storage device 214, as well as data supplied from the outside via the communication unit 104.
  • the ROM 103 stores programs that do not require modification.
  • the communication unit 104 is used for communication between the first image processing device 20 and an external device. For example, if the first image processing device 20 is connected to an external device via a wired connection, a communication cable is connected to the communication unit 104. If the first image processing device 20 has the function of communicating wirelessly with an external device, the communication unit 104 is equipped with an antenna.
  • FIG. 2 is a diagram showing an example of the functional configuration of the first image processing device 20 and the second image processing device 30. As shown in FIG.
  • the first image processing device 20 is composed of a shape information generation unit 201, a viewpoint determination unit 202, a depth image generation unit 203, a texture image generation unit 204, an encoding unit 205, and a distribution unit 206.
  • the shape information generation unit 201 uses the communication unit 104 to estimate shape information of the subject using the multiple captured images and viewpoint information of the physical cameras received from the imaging system 10.
  • the shape information is estimated using the volume intersection method described above. Therefore, the shape information generation unit 201 also generates viewpoint information of a first virtual camera group in virtual space that corresponds to the physical camera group in real space from the acquired viewpoint information of the physical cameras.
  • the shape information generation unit 201 outputs the estimated shape information to the viewpoint determination unit 202, depth image generation unit 203, and texture image generation unit 204.
  • the shape information generation unit 201 also outputs the received viewpoint information of the physical cameras and multiple captured images to the viewpoint determination unit 202 and texture image generation unit 204.
  • the viewpoint determination unit 202 determines the viewpoints of the second virtual camera group and generates viewpoint information, with the aim of improving the quality of the 3D model restored by the second image processing device 30.
  • the viewpoint determination unit 202 outputs the generated viewpoint information of the second virtual camera group to the depth image generation unit 203 and, via the depth image generation unit 203, to the texture image generation unit 204.
  • the viewpoint information of the second virtual camera group is generated, for example, by matching the position and orientation of the second virtual camera to those of the first virtual camera corresponding to the physical camera and then dolly-zooming the second virtual camera along the line of sight of the first virtual camera.
  • the second virtual camera, placed at the position of the first virtual camera corresponding to the physical camera, approaches the subject along the line of sight of the first virtual camera while adjusting the focal length (zoom) so that the size of the subject in the virtual viewpoint image generated from the second virtual camera is maintained.
  • This process of deriving the viewpoint information of the second virtual camera will be described in detail below using Figures 3A to 3C and 4. By performing this process while changing the physical camera, or while changing the subjects if there are multiple subjects in the captured image, it is possible to generate viewpoint information of the second virtual camera for each subject, for the number of physical cameras that captured the subject.
  • the viewpoint determination unit 202 may also change the image size (width and height) of the second virtual camera.
  • the image size of the second virtual camera is set to the same as the image size of the physical camera, and after determining the position and orientation of the second virtual camera as described above, the image size of the second virtual camera is changed. For example, if the width and height of the virtual viewpoint image of the second virtual camera are changed to 1/N of the width and height of the image captured by the physical camera, the focal length and image center internal parameters of the second virtual camera are also multiplied by 1/N.
  • if the image size of the physical camera is 4K (width 3840, height 2160) and the image size of the second virtual camera is changed to full HD (width 1920, height 1080), the width and height are each halved, so the focal length and image center internal parameters of the second virtual camera are also multiplied by 1/2. This allows the image size to be changed without changing the angle of view, reducing the amount of data sent from the server to the client.
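The 1/N scaling of the internal parameters described above amounts to the following small sketch (names are illustrative). For example, scale_intrinsics(f, cx, cy, 3840, 2160, 2) would give the full-HD equivalent of a 4K camera with an unchanged angle of view.

```python
def scale_intrinsics(focal_length: float, cx: float, cy: float,
                     width: int, height: int, n: float):
    """Change the image size to 1/n of the original without changing the angle of view.

    The focal length and image center are scaled by the same 1/n factor as the width
    and height.
    """
    return (focal_length / n, cx / n, cy / n, int(width / n), int(height / n))
```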
  • the depth image generation unit 203 performs the aforementioned depth image generation process based on the shape information of the subject input from the shape information generation unit 201 and the viewpoint information of the second virtual camera group input from the viewpoint determination unit 202.
  • the depth image generation unit 203 outputs the generated depth image and the viewpoint information of the second virtual camera group to the encoding unit 205.
  • the texture image generation unit 204 performs processing to generate the virtual viewpoint-dependent texture image described above based on input data from the shape information generation unit 201 and viewpoint information of the second virtual camera group input from the viewpoint determination unit 202 via the depth image generation unit 203.
  • the texture image generation unit 204 outputs the generated texture image to the encoding unit 205.
  • the virtual viewpoint-dependent texture image is generated by preferentially referencing the pixel values of the image captured by the physical camera that is closest in position and orientation to the second virtual camera. This makes it possible to generate high-quality texture images, and when the 3D model is restored on the client, a 3D model containing high-quality color information can be reproduced.
  • the texture image generation unit 204 may also generate a valid pixel map based on input data from the shape information generation unit 201 and viewpoint information of the second virtual camera group input from the viewpoint determination unit 202 via the depth image generation unit 203.
  • the texture image generation unit 204 may output the generated valid pixel map to the encoding unit 205.
  • the valid pixel map will be described in detail in Example 2.
  • the encoding unit 205 acquires the depth images and viewpoint information of the second virtual camera group input from the depth image generation unit 203, and the texture images input from the texture image generation unit 204.
  • the encoding unit 205 compresses the depth images and texture images using the compression method described above, and outputs the compressed image group and viewpoint information (metadata) of the second virtual camera group to the distribution unit 206.
  • the encoding unit 205 may compress not only depth images and texture images, but also metadata. Compression methods include the file compression methods described above.
  • the encoding unit 205 may compress the effective pixel map input from the texture image generation unit 204 and output it to the distribution unit 206.
  • the distribution unit 206 uses the communication unit 104 to transmit the compressed depth image, texture image, and viewpoint information of the second virtual camera group input from the encoding unit 205 to the receiving unit 207, which will be described later.
  • the second image processing device 30 is composed of a receiving unit 207, a decoding unit 208, a 3D model restoration unit 209, a virtual camera control unit 210, and a virtual viewpoint image generation unit 211.
  • the receiving unit 207 uses the communication unit 104 to receive the compressed depth image, compressed texture image, and viewpoint information (metadata) of the second virtual camera group from the distribution unit 206, and outputs them to the decoding unit 208.
  • the decoding unit 208 decodes the compressed depth images and compressed texture images acquired from the receiving unit 207 and outputs them to the 3D model restoration unit 209 along with viewpoint information of the second virtual camera group.
  • the decoding unit 208 may also decode metadata in addition to the depth images and texture images.
  • the decoding unit 208 may decode an effective pixel map and output it to the 3D model restoration unit 209.
  • the 3D model restoration unit 209 restores a 3D model using the restoration method described above, based on the decoded depth image and decoded texture image acquired from the decoding unit 208, and the viewpoint information of the second virtual camera group.
  • the restored 3D model is output to the virtual viewpoint image generation unit 211.
  • the 3D model restoration unit 209 may generate color information for the 3D model using the effective pixel map acquired from the decoding unit 208. The method for using the effective pixel map will be explained in detail in Example 2.
  • the virtual camera control unit 210 generates viewpoint information of a third virtual camera for generating a virtual viewpoint image from input values that the user enters via the input device 40 and that are received through the communication unit 104, and outputs the viewpoint information of the third virtual camera to the virtual viewpoint image generation unit 211.
  • the virtual camera control unit 210 may also output the generated viewpoint information of the user-specified third virtual camera to the 3D model restoration unit 209.
  • the virtual viewpoint image generation unit 211 generates a virtual viewpoint image based on the 3D model acquired from the 3D model restoration unit 209 and the viewpoint information of the third virtual camera acquired from the virtual camera control unit 210.
  • the virtual viewpoint image is generated by placing a 3D model of the subject, a 3D model of the background object, and the third virtual camera in a virtual space, and generating an image seen from the third virtual camera.
  • the 3D model of the background object is, for example, a CG (Computer Graphics) model created separately to be combined with the subject, and is created in advance and stored in the second image processing device 30 (for example, stored in ROM 103 in Figure 1B).
  • the 3D model of the subject and the 3D model of the background object are rendered using an existing CG rendering method.
  • the virtual viewpoint image generation unit 211 transmits the generated virtual viewpoint image to the display device 50.
  • FIGS. 3A to 3C are schematic diagrams for explaining an example of a method for generating viewpoint information of a second virtual camera group.
  • FIG. 3A shows a group of physical cameras 302 and a subject 301 arranged in real space.
  • the subject 301 is photographed by the group of physical cameras 302.
  • a straight line 303 indicates the optical axis of each of the group of physical cameras 302.
  • FIG. 3B shows a diagram in which a first virtual camera group 305 corresponding to the physical camera group 302 is set and a 3D model 304 of the subject is generated by the first virtual camera group 305.
  • the first image processing device 20 generates the 3D model 304 of the subject using multiple captured images acquired from the physical camera group 302.
  • a first virtual camera group 305 is generated in virtual space corresponding to the physical camera group 302 in real space.
  • the first virtual camera group 305 is a virtual space reproduction of the physical camera group 302 in real space. Therefore, the optical axis 306 of the first virtual camera group 305 corresponds to the optical axis 303 of the physical camera group 302.
  • the first image processing device 20 then generates the 3D model 304 of the subject 301 using viewpoint information from the generated first virtual camera group 305.
  • 3C is a diagram showing the second virtual camera group 307 set based on the viewpoint information of the first virtual camera group 305.
  • the second virtual camera group 307 is set on the optical axis 306 of the first virtual camera group 305. However, this is not limited to this, and the second virtual camera group 307 may be set at a position near the optical axis 306.
  • the second virtual camera group 307 is also set at a position closer to the 3D model 304 than the first virtual camera group 305.
  • one possible way to set the second virtual camera group 307 used to generate depth images and texture images for distribution is to place a bounding box that encompasses the 3D model 304 and to arrange the second virtual camera group 307 on a spherical surface surrounding the bounding box so that the cameras face the subject.
  • in that case, however, the 3D model restored by the second image processing device 30 may differ significantly in shape from the 3D model 304 before distribution.
  • increasing the number of second virtual cameras in an attempt to accurately restore the 3D model 304 increases the number of depth images and texture images to be distributed. Therefore, if a user attempts to restore the 3D model in an environment with low transmission bandwidth or on a local terminal with low processing performance, the frame rate may decrease. It is desirable for the 3D model displayed on the display device 50 to have higher image quality and a smooth frame rate, such as 60 fps, to provide the user with a high sense of realism.
  • the first image processing device 20 therefore generates viewpoint information for the second virtual camera group 307 based on the viewpoint information of the physical camera group 302 described above. That is, the second virtual camera group 307 is first arranged at the positions of the first virtual camera group 305 corresponding to the physical camera group 302. The positions of the second virtual camera group 307 are then set by dolly-zooming each second virtual camera along the optical axis 306 of the corresponding first virtual camera so as to approach the 3D model 304 of the subject 301 while maintaining the size of the subject on the screen.
  • the position to which the second virtual camera group 307 is dolly-zoomed is predetermined. For example, dolly-zooming may be performed until the distance from the 3D model 304 reaches a predetermined value. Alternatively, dolly-zooming may be performed until the size of the 3D model 304 projected on the imaging surface of the second virtual camera group 307 reaches a predetermined size. Note that the method for determining the viewpoint information of the second virtual camera group 307 is not limited to this method. Alternatively, the second virtual camera group 307 may simply be positioned near the optical axis 306 of the first virtual camera group 305, and internal parameters such as focal length and image size may be determined manually.
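A dolly-zoom placement of a second virtual camera, as described above, can be sketched with the pinhole relation that the projected size of the subject is proportional to the focal length divided by the camera-to-subject distance. The code below is an illustration under that assumption, not the disclosed procedure; the target distance corresponds to the predetermined value mentioned above.

```python
import numpy as np

def dolly_zoom_along_axis(cam_pos, optical_axis_dir, subject_center,
                          focal_length, target_distance):
    """Move a virtual camera along its optical axis toward the subject while keeping the
    subject's size on the image plane constant.

    Projected size is proportional to focal_length / distance, so scaling the focal
    length by (new_distance / old_distance) preserves the subject's on-screen size.
    """
    axis = optical_axis_dir / np.linalg.norm(optical_axis_dir)    # unit vector toward subject
    old_distance = float(np.dot(subject_center - cam_pos, axis))  # distance along the axis
    new_pos = subject_center - axis * target_distance             # stay on the same optical axis
    new_focal_length = focal_length * (target_distance / old_distance)
    return new_pos, new_focal_length
```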
  • the viewpoint information of the second virtual camera group 307 may be determined based on the viewpoint information of the physical camera group 302.
  • the physical camera group 302 is arranged around the subject 301 in the number required to estimate the shape of the subject 301 using, for example, the volume intersection method described above. Therefore, by sending depth images as observed from the viewpoints of the physical camera group 302, the user can accurately restore the shape information of the shape-estimated 3D model 304. Furthermore, by delivering virtual viewpoint-dependent texture images with image quality similar to that of the physical camera group 302, the user can restore high-quality color information, allowing the user to play back a high-quality 3D model.
  • Figure 4 is a schematic diagram in which a first virtual camera 401 captures a 3D model 402 of a subject and generates viewpoint information for a second virtual camera 403 based on the viewpoint information of the first virtual camera 401.
  • the second virtual camera 403 moves along the optical axis of the first virtual camera 401 while dolly zooming to approach the 3D model 402.
  • the second virtual camera 403 is then positioned at a position where it can capture the 3D model 402 within an area 404 that encompasses the 3D model 402 and allows the distance information (depth information) between the 3D model and the second virtual camera 403 to be expressed in 10 bits.
  • the area 404 is not limited to a cube, and may be a sphere centered on the 3D model 402.
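The statement that the distance information within the area 404 can be expressed in 10 bits can be pictured with the following hedged sketch of depth quantization. The linear mapping, the function names, and the use of 0 as a background value are assumptions for illustration, not details taken from the embodiment.

```python
import numpy as np

def quantize_depth(depth_map, near, far, bits=10):
    """Quantize metric depth values inside [near, far] into integer levels.

    depth_map : 2D array of distances from the second virtual camera to the 3D model,
                with values outside [near, far] treated as "no subject".
    near, far : extent of the area (e.g. area 404) along the viewing direction.
    bits      : bit depth of the delivered depth image (10 bits -> 1024 levels).
    """
    levels = (1 << bits) - 1
    inside = (depth_map >= near) & (depth_map <= far)
    normalized = (depth_map - near) / (far - near)            # 0.0 at near plane, 1.0 at far plane
    quantized = np.zeros_like(depth_map, dtype=np.uint16)
    quantized[inside] = np.round(normalized[inside] * levels).astype(np.uint16)
    return quantized

def dequantize_depth(quantized, near, far, bits=10):
    """Inverse mapping used on the receiving side before back-projection."""
    levels = (1 << bits) - 1
    return near + (quantized.astype(np.float32) / levels) * (far - near)
```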
  • the first virtual camera 401 captures the 3D model 402 and generates a captured image 405.
  • the first image processing device 20 estimates the 3D model 402 based on the captured images 405 captured by the physical cameras of the imaging system 10.
  • the estimated 3D model 402 is then photographed by a second virtual camera 403 to generate a depth image 406 and a texture image 407.
  • the size of the 3D model 402 depicted in the depth image 406 and the texture image 407 is the same as the size of the 3D model 402 depicted in the photographed image 405.
  • the determination of the viewpoint information of the second virtual camera 403 is not limited to dolly zoom, and the size of the 3D model 402 depicted in the depth image 406 may be different from the size of the 3D model 402 depicted in the photographed image 405.
  • Although the second virtual camera 403 has been described as moving on the optical axis of the first virtual camera 401, the position and orientation of the second virtual camera 403 need only be close to the position and orientation of the first virtual camera 401, and the second virtual camera 403 does not necessarily have to move on the optical axis of the first virtual camera 401.
  • the determination of the viewpoint information of the second virtual camera 403 may be performed once when the second image processing device 30 is initialized, or may be performed for each frame in accordance with the movement of the 3D model 402.
  • Fig. 5 is a flowchart showing the flow of processing for controlling the compression and distribution of 3D model data in the first image processing device 20 according to this embodiment.
  • the flow shown in Fig. 5 is realized by reading a control program stored in the ROM 103 into the RAM 102 and executing it by the CPU 101. Execution of the flow in Fig. 5 is triggered when the shape information generation unit 201 receives a plurality of captured images and viewpoint information of the physical cameras from the imaging system 10.
  • the shape information generation unit 201 estimates and generates shape information of the subject based on multiple captured images.
  • the generated shape information, viewpoint information of the physical camera group, and multiple captured images are output to the viewpoint determination unit 202 and texture image generation unit 204.
  • the generated shape information is also output to the depth image generation unit 203.
  • The viewpoint information of the first virtual camera group corresponding to the physical camera group used to generate the 3D model of the subject is assumed to have been generated in advance. For example, when the physical camera group is placed in preparation before filming begins, a first virtual camera group corresponding to the placed physical camera group is generated. Shape information of the subject is generated using the viewpoint information of this first virtual camera group and the multiple captured images.
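Since the first virtual camera group simply reproduces the calibrated physical cameras in virtual space, its viewpoint information can be thought of as a copy of the physical cameras' camera parameters and image sizes. The following data-structure sketch is purely illustrative; the field names are assumptions and are reused by the later sketches in this section.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ViewpointInfo:
    """Viewpoint information for one camera (physical or virtual)."""
    rotation: np.ndarray      # (3, 3) extrinsic rotation matrix (world -> camera)
    position: np.ndarray      # (3,) camera position in world coordinates
    focal_length: float       # intrinsic focal length in pixels
    principal_point: tuple    # (cx, cy) image center in pixels
    image_size: tuple         # (width, height) in pixels

def first_virtual_cameras_from_physical(physical_viewpoints):
    """The first virtual camera group reproduces the calibrated physical cameras
    in virtual space, so the viewpoint information is simply copied."""
    return [ViewpointInfo(v.rotation.copy(), v.position.copy(),
                          v.focal_length, v.principal_point, v.image_size)
            for v in physical_viewpoints]
```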
  • the viewpoint determination unit 202 generates viewpoint information for the second virtual camera group based on the viewpoint information for the physical camera group.
  • the generated viewpoint information for the second virtual camera group is output to the depth image generation unit 203 and to the texture image generation unit 204 via the depth image generation unit 203.
  • the process of generating the viewpoint information of the second virtual camera group is explained in Figure 7. Note that the generated viewpoint information of the second virtual camera group may be output to the texture image generation unit 204 without going through the depth image generation unit 203.
  • the depth image generation unit 203 generates a depth image of the 3D model based on the data acquired from the shape information generation unit 201 and the viewpoint determination unit 202.
  • the generated depth image is output to the encoding unit 205.
  • the texture image generation unit 204 generates a texture image of the 3D model based on the data acquired from the shape information generation unit 201 and the viewpoint determination unit 202.
  • the generated texture image is output to the encoding unit 205.
  • the encoding unit 205 encodes the depth image and texture image acquired from the depth image generation unit 203 and texture image generation unit 204.
  • the encoded data is output to the distribution unit 206.
  • the distribution unit 206 distributes 3D model data including the data acquired from the depth image generation unit 203 and the encoding unit 205 and the viewpoint information of the second virtual camera group, and this flow ends.
  • In this embodiment, the 3D model data is distributed to the second image processing device 30, but the distribution destination is not limited to this.
  • the 3D model data may also be distributed to a separate server that stores it.
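The per-frame flow of Fig. 5 described above can be summarized with the following hedged pseudocode sketch of the first image processing device 20. All callables are placeholders standing in for the units described in the text, not actual APIs.

```python
def compress_and_distribute_frame(captured_images, physical_viewpoints,
                                  estimate_shape, determine_viewpoints,
                                  render_depth, render_texture, encode, distribute):
    """One frame of the server-side flow sketched in Fig. 5 (all callables are placeholders)."""
    # Estimate the subject's shape information from the multiple captured images.
    shape_info = estimate_shape(captured_images, physical_viewpoints)

    # Derive the second virtual camera group from the physical camera viewpoints.
    second_viewpoints = determine_viewpoints(shape_info, physical_viewpoints)

    # Generate one depth image and one texture image per second virtual camera.
    depth_images = [render_depth(shape_info, vp) for vp in second_viewpoints]
    texture_images = [render_texture(shape_info, captured_images, physical_viewpoints, vp)
                      for vp in second_viewpoints]

    # Encode the images.
    encoded = encode(depth_images, texture_images)

    # Distribute the encoded images together with the second virtual cameras' viewpoint information.
    distribute(encoded, second_viewpoints)
```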
  • FIG. 6 is a flowchart showing the flow of processing for generating a virtual viewpoint image using a depth image and a texture image in the second image processing device 30 according to this embodiment. Execution of the flow in FIG. 6 is triggered when the virtual camera control unit 210 receives an input value from the input device 40.
  • the virtual camera control unit 210 generates viewpoint information of a third virtual camera designated by the user based on input values from the input device 40.
  • the generated viewpoint information of the third virtual camera is output to the virtual viewpoint image generation unit 211.
  • the receiving unit 207 receives the 3D model data distributed from the distribution unit 206.
  • the received 3D model data is output to the decoding unit 208.
  • the decoding unit 208 decodes the depth images and texture images included in the 3D model data acquired from the receiving unit 207. Furthermore, if the viewpoint information of the second virtual camera group has also been encoded, the viewpoint information of the second virtual camera group is also decoded. The decoded depth images and texture images, as well as the viewpoint information of the second virtual camera group, are then output to the 3D model restoration unit 209.
  • the 3D model restoration unit 209 restores a 3D model of the subject based on the decoded depth image and texture image acquired from the decoding unit 208 and the viewpoint information of the second virtual camera group.
  • the restored 3D model is output to the virtual viewpoint image generation unit 211.
  • the 3D model restoration process is explained in Figure 8.
  • the virtual viewpoint image generation unit 211 generates a virtual viewpoint image based on the viewpoint information of the third virtual camera acquired from the virtual camera control unit 210 and the 3D model acquired from the 3D model restoration unit 209, and this flow ends.
  • the generated virtual viewpoint image is sent to the display device 50 and displayed on the display device 50.
  • <Description of the Process for Generating Viewpoint Information of the Second Virtual Camera Group> FIG. 7 is an example of a flowchart showing the flow of processing for generating viewpoint information of the second virtual camera group according to this embodiment.
  • Here, the method for generating the viewpoint information of the second virtual camera group is described as one based on the dolly zoom of the second virtual camera explained with reference to FIG. 4. That is, the second virtual camera moves along the optical axis of the first virtual camera, from the position of the first virtual camera corresponding to the position of the physical camera, in a direction approaching the 3D model of the subject. Furthermore, the second virtual camera adjusts its focal length while maintaining the size of the subject on its imaging plane.
  • the flow shown in FIG. 7 is executed by the viewpoint determination unit 202. Execution of FIG. 7 is triggered by the reception of a 3D model of the subject, viewpoint information of the physical camera group, and multiple captured images from the shape information generation unit 201.
  • the flow in FIG. 7 provides a detailed explanation of the control for determining the viewpoint of the second virtual camera group based on the viewpoint information of the physical camera group in S502 of FIG. 5.
  • viewpoint information of the first virtual camera group and a 3D model of the subject have been generated by the first image processing device 20.
  • a 3D model of the subject, viewpoint information of the first virtual camera group, and multiple captured images are acquired.
  • viewpoint information of the physical camera group may be acquired without acquiring viewpoint information of the first virtual camera group. In that case, viewpoint information of the first virtual camera group is generated using the viewpoint information of the physical camera group.
  • S703 and S704 are repeated for each physical camera.
  • S704 is repeated for each subject included in the multiple captured images.
  • Subject recognition is performed based on the results of a face detection algorithm, person detection algorithm, etc.
  • the area of each subject in the captured images can be identified by projecting a 3D model generated separately for each subject onto the same plane as the imaging surface of the physical camera using the viewpoint information of the physical camera.
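A minimal sketch of the projection mentioned above (projecting a subject's 3D model onto the imaging plane of a physical camera to identify the subject's area) is shown below. It assumes a simple pinhole model without lens distortion and reuses the illustrative ViewpointInfo fields sketched earlier; the helper names are assumptions.

```python
import numpy as np

def project_points(points, rotation, position, focal_length, principal_point):
    """Project 3D points (N, 3) onto the imaging plane with a pinhole model
    (lens distortion ignored). Returns pixel coordinates (N, 2) and depths (N,)."""
    cam = (rotation @ (points - position).T).T            # world -> camera coordinates
    depths = cam[:, 2]
    u = focal_length * cam[:, 0] / depths + principal_point[0]
    v = focal_length * cam[:, 1] / depths + principal_point[1]
    return np.stack([u, v], axis=1), depths

def subject_mask(points, viewpoint, image_size):
    """Rasterize the projected points of one subject's 3D model into a binary mask
    marking the subject's area in the captured image."""
    (w, h) = image_size
    mask = np.zeros((h, w), dtype=np.uint8)
    pixels, depths = project_points(points, viewpoint.rotation, viewpoint.position,
                                    viewpoint.focal_length, viewpoint.principal_point)
    valid = depths > 0
    px = np.round(pixels[valid]).astype(int)
    inside = (px[:, 0] >= 0) & (px[:, 0] < w) & (px[:, 1] >= 0) & (px[:, 1] < h)
    mask[px[inside, 1], px[inside, 0]] = 1
    return mask
```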
  • viewpoint information for the second virtual camera is generated based on the 3D model and viewpoint information for the first virtual camera.
  • the viewpoint information for the second virtual camera is generated using the method described with reference to Figure 4. The above process is repeated as described in S702 and S703 to generate viewpoint information for the second virtual camera group.
  • the viewpoint information of the generated second virtual camera group is sent to the depth image generation unit 203 and the texture image generation unit 204.
  • FIG. 8 is an example of a flowchart showing the flow of a 3D model restoration process according to this embodiment.
  • the 3D model restoration process is described based on a method for generating a 3D model including virtual viewpoint-dependent color information.
  • the pixel values of the texture image of the second virtual camera group, which are closest to the position and orientation of the user-specified third virtual camera, are given priority as the color information for the 3D model.
  • the flow shown in FIG. 8 is executed by the 3D model restoration unit 209.
  • the flow in FIG. 8 provides a detailed description of the control for restoring a 3D model of a subject based on the decoded data in FIG. 6.
  • the 3D model restoration unit 209 acquires a depth image, a texture image, and viewpoint information of the second virtual camera group.
  • the 3D model restoration unit 209 projects the depth values of each pixel in the depth image into virtual space based on the external and internal parameters of the virtual camera corresponding to the depth image, and generates components of the shape information of the subject. For example, if the 3D model of the subject is represented by a 3D point cloud, each point becomes a component. By performing the above process on all of the multiple acquired depth images, the shape information of the subject is restored.
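The back-projection of a depth image into components of the shape information can be sketched as follows. The camera convention (x_cam = R (X - C)), the use of 0 as a background depth value, and the function name are assumptions for illustration only.

```python
import numpy as np

def backproject_depth_image(depth_map, rotation, position, focal_length, principal_point):
    """Turn each valid pixel of a depth image into a 3D point (a component of the
    subject's shape information). depth_map holds metric distances along the optical
    axis; pixels with value 0 are treated as background."""
    h, w = depth_map.shape
    v, u = np.mgrid[0:h, 0:w]
    valid = depth_map > 0
    z = depth_map[valid]
    x = (u[valid] - principal_point[0]) * z / focal_length
    y = (v[valid] - principal_point[1]) * z / focal_length
    cam_points = np.stack([x, y, z], axis=1)               # points in camera coordinates
    # camera -> world: X = R^T x_cam + C (row-vector form: x_cam @ R + C)
    world_points = cam_points @ rotation + position
    return world_points
```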
  • the pixel values of the texture image of the virtual camera with the closest position and orientation to the user-specified virtual camera are prioritized and determined as the color information for the components. This process is repeated for all components to restore the color information corresponding to the shape information. For example, if the shape information is a point cloud and the components are each point in the point cloud, the color information corresponding to all points is restored.
  • the restored 3D model is output to the virtual viewpoint image generation unit 211.
  • viewpoint information for the second virtual camera group is generated based on viewpoint information from the physical camera group, and a depth image and a texture image are generated using the generated viewpoint information for the second virtual camera group.
  • This type of processing makes it possible to restore a high-quality 3D model on the receiving side without increasing the amount of data, even for 3D models with complex shapes.
  • the compressed 3D model data may be stored, or may be delivered in response to a client request.
  • Although the example described uses a method in which a 3D model containing virtual-viewpoint-dependent color information is generated when the 3D model is restored, this is not limiting.
  • color information for the 3D model may be generated simultaneously with the generation of a virtual viewpoint image. In this case, it is not necessary to assign color information to all of the 3D model's shape information; color information only needs to be generated for the shooting range (viewing angle) of the user-specified third virtual camera.
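Restricting color generation to the shooting range of the third virtual camera could, for example, amount to keeping only the components that project inside that camera's image, as in the following hedged sketch (reusing the illustrative project_points helper and ViewpointInfo fields from the earlier sketches).

```python
import numpy as np

def components_in_view(points, third_viewpoint):
    """Return a boolean mask of components that project inside the third virtual
    camera's image (i.e. lie within its shooting range), so that color information
    only needs to be generated for these components."""
    (w, h) = third_viewpoint.image_size
    pixels, depths = project_points(points, third_viewpoint.rotation,
                                    third_viewpoint.position,
                                    third_viewpoint.focal_length,
                                    third_viewpoint.principal_point)
    return (depths > 0) & (pixels[:, 0] >= 0) & (pixels[:, 0] < w) \
                        & (pixels[:, 1] >= 0) & (pixels[:, 1] < h)
```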
  • <Example 2> In Example 1, a process was described in which viewpoint information of a second virtual camera group is generated based on viewpoint information of a physical camera group, and a depth image and a texture image are generated using the generated viewpoint information of the second virtual camera group. Next, Example 2 will be described, which illustrates an aspect in which an effective pixel map is generated in addition to a depth image and a texture image to deal with cases in which a subject is occluded by another subject or object. Note that the description of parts common to Example 1, such as the hardware configuration and functional configuration of the image processing device, will be omitted or simplified.
  • FIG. 9 is a schematic diagram showing an example of a method for generating an effective pixel map according to this embodiment.
  • a subject is photographed using a physical camera, and a captured image 903 is generated.
  • the subject in the captured image 903 has part of its body hidden by an obstruction.
  • Here, the obstructing object is also treated as a subject.
  • the first image processing device 20 estimates the shapes of the subject and obstructing object using multiple captured images taken by the physical cameras of the imaging system 10, and generates shape information for each. That is, it generates a 3D model 901 of the subject and a 3D model 904 of the obstructing object.
  • the first image processing device 20 generates viewpoint information for the second virtual camera group described in Example 1 based on data such as viewpoint information and shape information from the physical camera group.
  • the second virtual camera 905 is positioned on the optical axis of the first virtual camera 902, within an area 906 that encompasses the 3D model 901 of the subject and is within a range in which the depth of the subject can be expressed in 10 bits.
  • the focal length of the second virtual camera 905 is adjusted so that the size of the subject 901 as viewed from the first virtual camera 902 is maintained.
  • After generating the viewpoint information for the second virtual camera 905, the first image processing device 20 generates a depth image 907 and a texture image 908.
  • When the texture image 908 is generated, the pixel values of the captured image corresponding to the first virtual camera 902 are used preferentially for the area that is not occluded by the 3D model 904 of the obstructing object (the left side of the subject).
  • For the area occluded by the 3D model 904 of the obstructing object, the pixel values of a captured image corresponding to a first virtual camera with a viewpoint different from that of the first virtual camera 902 are used. Therefore, there is a possibility that the image quality of the right and left sides of the subject appearing in the texture image 908 will differ significantly. If the color information of the 3D model is restored using this texture image 908, the image quality of parts of the 3D model will be reduced, and those parts will stand out, which may cause the user to feel uncomfortable. Therefore, an effective pixel map 909 is generated that indicates the high-quality areas of the texture image.
  • the effective pixel map assigns pixel values of 1 to unobstructed areas and 0 to obstructed areas.
  • the image size of the effective pixel map 909 is the same as the image size of the texture image 908.
  • the determination of whether or not an area is an occluded area is made, for example, based on whether each point constituting the shape information of the 3D model 901 of the subject is visible to the first virtual camera 902 using the visibility determination process described above. In other words, due to the visibility determination process, pixel values of the area of the occluding object 904 captured in the captured image 903 are not used as color information of the 3D model 901 of the subject.
  • Then, the region of the texture image that was determined using pixel values of the captured image of the physical camera from which the viewpoint information of the second virtual camera 905 was generated is identified, and in the effective pixel map, the pixel values of the identified region are set to 1 and those of the other regions are set to 0.
  • the method of determining an occluded area is not limited to this.
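A hedged sketch of building such a binary effective pixel map from a per-point visibility flag is shown below. It reuses the illustrative project_points helper from the earlier sketch and assumes that the visibility determination result for the paired first virtual camera is available as a boolean array; all names are assumptions.

```python
import numpy as np

def effective_pixel_map(points, visible_from_first_camera, second_viewpoint, image_size):
    """Build a binary effective pixel map (same size as the texture image).

    points                    : (N, 3) components of the subject's 3D model
    visible_from_first_camera : (N,) boolean result of the visibility determination
                                for the first virtual camera paired with this second camera
    second_viewpoint          : viewpoint information of the second virtual camera
    """
    (w, h) = image_size
    emap = np.zeros((h, w), dtype=np.uint8)
    pixels, depths = project_points(points, second_viewpoint.rotation,
                                    second_viewpoint.position,
                                    second_viewpoint.focal_length,
                                    second_viewpoint.principal_point)
    px = np.round(pixels).astype(int)
    inside = (depths > 0) & (px[:, 0] >= 0) & (px[:, 0] < w) & (px[:, 1] >= 0) & (px[:, 1] < h)
    keep = inside & visible_from_first_camera
    emap[px[keep, 1], px[keep, 0]] = 1     # 1 = texture pixel backed by an unoccluded view
    return emap
```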
  • the second image processing device 30 prioritizes pixel values of areas of the texture image corresponding to areas with a value of 1 in the effective pixel map and uses them as color information of the 3D model.
  • the second image processing device 30 uses pixel values of the texture image corresponding to 1 in the effective pixel map and does not use pixel values of the texture image corresponding to 0 in the effective pixel map.
  • However, if the shape information to which color information is to be assigned is captured only in texture images corresponding to 0 in the effective pixel map, the pixel values of those texture images are used.
  • Although the effective pixel map has been described so far as having binary pixel values of 0 or 1, it can also be multi-valued. In this case, the pixel values of the effective pixel map can be used to weight the color information generated for the 3D model.
  • the priority of the pixel values of the texture image is determined according to the pixel values of the effective pixel map.
  • For example, the pixel values at the subject's outline or at the boundary with an obstructing object can be set to 0 and made to increase linearly up to 255 over a certain distance (e.g., 5 px) inward from the outline or boundary toward the interior of the subject. This reduces the influence of unreliable pixel values of the texture image at the outline or at the boundary with an obstructing object when restoring the color information of the 3D model, allowing the color information of the 3D model to be generated from highly reliable pixel values.
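One way to realize the linear ramp described above is a distance transform over the binary map, as in the following sketch. The use of scipy's Euclidean distance transform and the exact ramp shape are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def graded_effective_pixel_map(binary_map, ramp_px=5):
    """Convert a 0/1 effective pixel map into an 8-bit weight map.

    Valid pixels on the outline of the valid region (next to background or to an
    occluded area) get weight 0; the weight rises linearly to 255 once a pixel is
    about `ramp_px` pixels inside the valid region."""
    # Distance (in pixels) from each valid pixel to the nearest invalid pixel.
    dist = distance_transform_edt(binary_map.astype(bool))
    weights = np.clip((dist - 1.0) / ramp_px, 0.0, 1.0) * 255.0
    return weights.astype(np.uint8)
```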
  • FIG. 10 is a flowchart showing the flow of processing for controlling the compression and distribution of 3D model data in the first image processing device 20 according to this embodiment. Execution of the flow in FIG. 10 begins when the shape information generation unit 201 receives multiple captured images and viewpoint information from the physical cameras from the imaging system 10.
  • S1001 to S1003 are the same as S501 to S503 in Figure 5.
  • The texture image generation unit 204 generates a texture image and an effective pixel map of the foreground model based on the data acquired from the shape information generation unit 201 and the viewpoint determination unit 202.
  • The generated texture image and effective pixel map are output to the encoding unit 205.
  • The encoding unit 205 encodes the depth image, texture image, and effective pixel map acquired from the depth image generation unit 203 and the texture image generation unit 204.
  • The encoded depth image, texture image, and effective pixel map are output to the distribution unit 206.
  • the distribution unit 206 transmits 3D model data including the depth image, texture image, effective pixel map, and viewpoint information of the second virtual camera group acquired from the depth image generation unit 203 and encoding unit 205 to the reception unit 207, and this flow ends.
  • FIG. 11 is an example of a flowchart showing the flow of the 3D model restoration process according to this embodiment.
  • the flow in FIG. 11 is executed by the 3D model restoration unit 209.
  • the flow in FIG. 11 provides a detailed explanation of the control for restoring a 3D model of a subject based on the data decoded in S604 in FIG. 6.
  • the depth value of each pixel in the depth image is projected into virtual space based on the external parameters and internal parameters of the second virtual camera corresponding to the depth image, and shape information of the subject is restored.
  • The pixel values of the texture image of the second virtual camera that is closest in position and orientation to the user-specified third virtual camera are given priority as color information.
  • the priority of the pixel values of the texture image is determined according to the pixel values of the effective pixel map that correspond to the pixel values of the texture image. If the shape information is represented by a 3D point cloud, this process is repeated for all points in the point cloud to generate color information corresponding to the shape information.
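Combining the two criteria described above (closeness of the second virtual camera to the user-specified third virtual camera, and the reliability recorded in the effective pixel map), color selection for one component could look like the following sketch. Here the map is treated as a simple reliability flag; with a multi-valued map, its value could instead weight a blend of candidate colors. The function and variable names are illustrative, and project_points is the helper from the earlier sketch.

```python
import numpy as np

def pick_component_color(point, third_cam_position, second_viewpoints,
                         texture_images, effective_maps):
    """Choose color information for one component (3D point) of the restored model.

    Cameras are ranked by closeness of their position to the user-specified third
    virtual camera; a candidate is skipped when the effective pixel map marks its
    texture pixel as unreliable (value 0), unless no reliable pixel exists at all."""
    order = np.argsort([np.linalg.norm(vp.position - third_cam_position)
                        for vp in second_viewpoints])
    fallback = None
    for idx in order:
        vp = second_viewpoints[idx]
        (w, h) = vp.image_size
        pix, depth = project_points(point[None, :], vp.rotation, vp.position,
                                    vp.focal_length, vp.principal_point)
        u, v = int(round(pix[0, 0])), int(round(pix[0, 1]))
        if depth[0] <= 0 or not (0 <= u < w and 0 <= v < h):
            continue
        color = texture_images[idx][v, u]
        if effective_maps[idx][v, u] > 0:
            return color                       # reliable pixel from the closest usable camera
        if fallback is None:
            fallback = color                   # keep in case only unreliable pixels exist
    return fallback
```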
  • S1104 is the same as S804 in Figure 8.
  • After this flow is completed, the virtual viewpoint image generation unit 211 generates a virtual viewpoint image based on the restored 3D model and the viewpoint information of the third virtual camera specified by the user, and the generated virtual viewpoint image is displayed on the display device 50.
  • the present disclosure can also be realized by a process in which a program that realizes one or more functions of the above-described embodiments is supplied to a system or device via a network or a storage medium, and one or more processors in the computer of the system or device read and execute the program.
  • the present disclosure can also be realized by a circuit (e.g., an ASIC) that realizes one or more functions.
  • the disclosure of this embodiment includes the following configurations, methods, systems, and programs.
  • (Configuration 1) An image processing device comprising: a setting means for setting positions and orientations of a plurality of second virtual cameras based on optical axes of a plurality of first virtual cameras in a virtual space corresponding to positions and orientations of a plurality of imaging devices in a real space; and a generation means for generating a plurality of depth images indicating the distance between each of the plurality of second virtual cameras and a 3D model of a subject generated based on a plurality of captured images acquired by the plurality of imaging devices.
  • (Configuration 2) The image processing device according to configuration 1, wherein the plurality of second virtual cameras are set on the optical axes of the plurality of first virtual cameras.
  • the image processing device further comprises an output means for outputting the encoded depth images and viewpoint information indicating the positions and orientations of the second virtual cameras to another device that reconstructs the 3D model based on the encoded depth images and the viewpoint information.
  • (Configuration 5) The image processing device, wherein the generating means generates a virtual viewpoint image including a 3D model of the subject for each of the second virtual cameras, the encoding means encodes the plurality of virtual viewpoint images, and the output means outputs the encoded virtual viewpoint images to the other device.
  • (Configuration 6) The image processing apparatus according to configuration 5, wherein the generating means generates correspondence information indicating whether each pixel in the virtual viewpoint image corresponds to a component of a 3D model of the subject, for each of the plurality of second virtual cameras, the encoding means encodes a plurality of pieces of correspondence information, and the output means outputs the encoded correspondence information to the other apparatus.
  • An image processing device comprising: an acquisition means for acquiring a plurality of encoded depth images indicating the distance between each of a plurality of second virtual cameras set based on the optical axes of a plurality of first virtual cameras in a virtual space corresponding to the positions and orientations of a plurality of image capturing devices in real space and a 3D model of a subject generated based on a plurality of captured images acquired by the image capturing devices, and first viewpoint information indicating the positions and orientations of the plurality of second virtual cameras; a decoding means for decoding the encoded depth images; and a generating means for generating a 3D model of the subject based on the decoded depth images and the first viewpoint information.
  • (Method 1) An image processing method comprising: a setting step of setting positions and orientations of a plurality of second virtual cameras based on optical axes of a plurality of first virtual cameras in a virtual space corresponding to positions and orientations of a plurality of imaging devices in a real space; and a generation step of generating a plurality of depth images indicating the distance between each of the plurality of second virtual cameras and a 3D model of the subject generated based on a plurality of captured images acquired by the plurality of imaging devices.
  • (Method 2) An image processing method comprising: an acquisition step of acquiring a plurality of encoded depth images indicating the distance between each of a plurality of second virtual cameras set based on the optical axes of a plurality of first virtual cameras in a virtual space corresponding to the positions and orientations of a plurality of imaging devices in real space and a 3D model of a subject generated based on a plurality of captured images acquired by the imaging devices, and first viewpoint information indicating the positions and orientations of the plurality of second virtual cameras; a decoding step of decoding the encoded depth images; and a generation step of generating a 3D model of the subject based on the decoded depth images and the first viewpoint information.
  • (Program 1) A program for causing a computer to function as each of the means of the image processing device according to any one of configurations 1 to 10.


Abstract

A first image processing device 20 sets the positions and orientations of a plurality of second virtual cameras on the basis of the optical axes of the plurality of first virtual cameras in a virtual space corresponding to the positions and orientations of a plurality of imaging devices in a real space, and generates a plurality of depth images indicating the distance between each of the plurality of second virtual cameras and the 3D model of the subject generated on the basis of the plurality of captured images acquired by the plurality of imaging devices.

Description

Image processing device, image processing method, and program

The present disclosure relates to an image processing device for compressing and distributing 3D model data representing a 3D model.

There is a technology that generates 3D model data showing a 3D model of a subject (object) using multiple images captured by multiple imaging devices, and then generates a virtual viewpoint image as seen from a camera (virtual camera) virtually placed in a virtual space in which the 3D model exists. Recently, technology that generates 3D model data of a subject on a server, compresses and distributes the 3D model data, and allows users to operate a virtual camera on their own local terminal (client) such as a PC or tablet to display a virtual viewpoint image has been attracting attention.

In a technology for compressing and distributing 3D model data, the distributing device arranges multiple virtual cameras (a group of virtual cameras) to surround the 3D model and generates depth images and texture images from each of the multiple virtual cameras. The generated depth and texture images are then compressed, and the compressed depth and texture images, along with information indicating the positions and orientations of the multiple virtual cameras, are distributed to the user. The receiving device decodes the compressed depth and texture images, and reconstructs the 3D model based on the decoded depth and texture images and information indicating the positions and orientations of the multiple virtual cameras.

Patent Document 1 describes a method for determining the viewpoints of a group of virtual cameras so as to minimize fluctuations in the position of key objects in depth images between frames, with the aim of improving the compression rate of depth images. By reducing the motion vectors included in the encoded stream when encoding depth images, it is expected that the compression rate of depth images will improve.

International Publication No. 2018/079260

However, depending on the position and orientation of the virtual cameras set on the device on the distribution side, there is a risk that the quality of the 3D model reconstructed on the device on the receiving side may be reduced. For example, if the shape of the 3D model generated on the device on the distribution side is complex, depending on the position and orientation of the virtual cameras, the complex shape of the 3D model may not be sufficiently represented in the depth image, and the shape accuracy of the 3D model reconstructed on the user side may be reduced. In other words, the shape of the 3D model generated on the distribution side may differ significantly from the shape of the 3D model reconstructed on the receiving side. Therefore, to reconstruct a 3D model more accurately, it is necessary to appropriately determine the position and orientation of the virtual cameras.

The present disclosure therefore aims to reduce the possibility of a decrease in the quality of the 3D model generated on the receiving device.

In order to solve the above problems, the image processing device according to the present disclosure has the following configuration: a setting means for setting the positions and orientations of multiple second virtual cameras based on the optical axes of multiple first virtual cameras in virtual space that correspond to the positions and orientations of multiple image capture devices in real space; and a generation means for generating multiple depth images that indicate the distance between each of the multiple second virtual cameras and a 3D model of a subject that is generated based on the multiple captured images acquired by the multiple image capture devices.

This disclosure reduces the possibility of a decrease in the quality of the 3D model generated on the receiving device.

A block diagram showing an example of the configuration of the image processing system 1.
A diagram showing an example of the hardware configuration of an image processing device.
A diagram showing an example of the functional configuration of the first image processing device 20 and the second image processing device 30.
A diagram showing an example of the configuration of the imaging system 10.
A diagram showing a first virtual camera group 305 that corresponds to the physical camera group 302 and a 3D model 304 of a subject that is generated using the first virtual camera group 305.
A diagram showing a second virtual camera group 307 set based on the viewpoint information of the first virtual camera group 305.
A diagram illustrating the generation of viewpoint information of a virtual camera according to Example 1.
A flowchart illustrating an example of a process for compressing and distributing 3D model data according to Example 1.
A flowchart illustrating an example of a process for receiving 3D model data and generating a virtual viewpoint image according to Example 1.
A flowchart illustrating an example of a process for generating viewpoint information of a group of virtual cameras according to Example 1.
A flowchart illustrating an example of a 3D model restoration process according to Example 1.
A diagram illustrating the generation of viewpoint information of a group of virtual cameras according to Example 2.
A flowchart illustrating an example of a process for compressing and distributing 3D model data according to Example 2.
A flowchart illustrating an example of a 3D model restoration process according to Example 2.

According to a preferred embodiment of the present invention, the image processing device has a setting means for setting the positions and orientations of multiple second virtual cameras based on the optical axes of multiple first virtual cameras in virtual space that correspond to the positions and orientations of multiple imaging devices in real space. The image processing device also has a generation means for generating multiple depth images that indicate the distance between each of the multiple second virtual cameras and a 3D model of the subject generated based on multiple captured images acquired by the multiple imaging devices. Here, it is assumed that the multiple imaging devices are aligned with each other. The multiple first virtual cameras are set based on the positional relationships between the multiple imaging devices in real space. In other words, the multiple first virtual cameras in virtual space are reproductions in virtual space of the multiple imaging devices in real space.

In this manner, the positions and orientations of the multiple second virtual cameras are set based on the positions and orientations of the multiple imaging devices. Therefore, the subject depicted in the depth images of the multiple second virtual cameras is similar to the subject included in the captured images acquired from the multiple imaging devices. Specifically, the shape of the area representing the subject included in the depth images generated from the second virtual cameras is similar to the shape of the area representing the subject included in the captured images. The 3D model of the subject generated by the distribution device is generated based on the multiple captured images acquired from the multiple imaging devices. Therefore, if the depth images used to reconstruct (restore) the 3D model of the subject on the receiving device are similar to the captured images, the 3D model reconstructed on the receiving device will also be similar to the 3D model generated on the distribution device. In other words, it is possible to reduce the possibility that the shape of the 3D model generated on the receiving device will differ from that of the 3D model generated on the distribution device, i.e., the reproducibility of the 3D model will be degraded.

The image processing device also has an encoding means for encoding the depth images. The image processing device also has an output means for outputting the encoded depth images and viewpoint information indicating the positions and orientations of the second virtual cameras to another device that reconstructs the 3D model. Here, the other device is a device that reconstructs the 3D model based on the encoded depth images and viewpoint information.

Furthermore, in the above-mentioned image processing device, the generation means generates a virtual viewpoint image including a 3D model of the subject for each of the plurality of second virtual cameras. The encoding means then encodes the plurality of virtual viewpoint images, and the output means outputs the encoded plurality of virtual viewpoint images to the other device.

In this manner, the colors of each component of the subject included in the virtual viewpoint image (texture image) generated from the second virtual camera are close to the colors of each component of the subject included in the captured image. In other words, it is possible to reduce the possibility that the 3D model generated on the receiving device will have different colors than the 3D model generated on the transmitting device, i.e., the possibility that the reproducibility of the 3D model will be degraded.

Furthermore, the generation means generates correspondence information indicating whether each pixel in the texture image corresponds to a component of the 3D model of the subject, for each of the second virtual cameras. The encoding means then encodes the plurality of pieces of correspondence information, and the output means outputs the encoded plurality of pieces of correspondence information to the other device.

This aspect allows the receiving device to identify, among the pixels of the texture image, the pixels to be used to color the 3D model of the subject. For example, if a subject is occluded by another subject in an image captured by the imaging device, that captured image is used to generate a texture image for the second virtual camera, which could result in the generation of a texture image with color information that differs from the color information of that subject. Therefore, by generating correspondence information that indicates areas where occlusion is not thought to occur, the possibility of colors differing when the 3D model is reconstructed can be reduced.

Furthermore, the second virtual cameras are set at positions closer to the 3D model of the subject than the first virtual cameras.

Furthermore, the generation means generates one depth image for one second virtual camera. When distributing a 3D model showing movement over time, one depth image is generated for one second virtual camera at each of a plurality of time points.

According to another preferred embodiment of this embodiment, the image processing device has an acquisition means for acquiring a plurality of encoded depth images and first viewpoint information indicating the positions and orientations of a plurality of second virtual cameras. Here, a depth image is an image indicating the distance between each of the plurality of second virtual cameras and a 3D model of the subject generated based on a plurality of captured images acquired by the imaging device. The plurality of second virtual cameras are virtual cameras set based on the optical axes of a plurality of first virtual cameras in virtual space corresponding to the positions and orientations of the plurality of imaging devices in real space. The image processing device also has a decoding means for decoding the plurality of encoded depth images. The image processing device also has a generation means for generating a 3D model of the subject based on the plurality of decoded depth images and the first viewpoint information.

The image processing device also acquires second viewpoint information indicating the position and orientation of a third virtual camera different from the first virtual camera and the second virtual camera. The image processing device also generates a virtual viewpoint image based on the second viewpoint information and a 3D model of the subject. The third virtual camera may be set by a user operating an input device such as a joystick, or may be set based on the position of the generated 3D model.

This allows a virtual viewpoint image to be generated using a 3D model generated on the receiving device.

According to another preferred embodiment of this embodiment, a program causes a computer to execute the functions of the image processing device described above. By executing this program, the computer preferably functions as the image processing device described above.

<Example>
Preferred embodiments of the present disclosure will be described in detail below with reference to the drawings. Note that the following embodiments do not limit the present disclosure, and not all of the combinations of features described in the embodiments are necessarily essential to the solutions of the present disclosure. Furthermore, in the accompanying drawings, the same reference numerals are used to designate the same or similar components, and redundant descriptions will be omitted.

Note that a virtual viewpoint image is an image generated by a user freely manipulating the position and orientation of a virtual camera, and is also called a free viewpoint image or arbitrary viewpoint image. Furthermore, unless otherwise specified, the term "image" will be used to refer to both moving and still images.

Image processing system 1 is a system that generates a virtual viewpoint image that represents a scene from a specified virtual viewpoint, based on multiple images captured by multiple imaging devices and a specified virtual viewpoint. The virtual viewpoint image in this embodiment is also called a free viewpoint video, but is not limited to an image corresponding to a viewpoint freely (arbitrarily) specified by the user; for example, the virtual viewpoint image also includes an image corresponding to a viewpoint selected by the user from multiple candidates. Furthermore, this embodiment will mainly describe a case where the virtual viewpoint is specified by user operation, but the virtual viewpoint may also be specified automatically based on the results of image analysis, etc. Furthermore, this embodiment will mainly describe a case where the virtual viewpoint image is a video, but the virtual viewpoint image may also be a still image.

The viewpoint information used to generate a virtual viewpoint image is information that indicates the position and orientation (line of sight) of the virtual viewpoint. Specifically, the viewpoint information is a parameter set that includes parameters that indicate the three-dimensional position of the virtual viewpoint and parameters that indicate the orientation of the virtual viewpoint in the pan, tilt, and roll directions. Note that the content of the viewpoint information is not limited to the above. For example, the parameter set serving as viewpoint information may include a parameter that indicates the size of the field of view (angle of view) of the virtual viewpoint. Furthermore, the viewpoint information may have multiple parameter sets. For example, the viewpoint information may have multiple parameter sets that respectively correspond to multiple frames that make up a video of the virtual viewpoint image, and may be information that indicates the position and orientation of the virtual viewpoint at each of multiple consecutive points in time.

The image processing system 1 has multiple imaging devices that capture images of an imaging area from multiple directions. The imaging area may be, for example, a stadium where sports such as soccer or karate are held, or a stage where concerts or plays are held. The multiple imaging devices are installed in different positions surrounding such an imaging area and capture images in sync. Note that the multiple imaging devices do not have to be installed around the entire perimeter of the imaging area; depending on installation location restrictions, they may be installed only around a portion of the perimeter of the imaging area. The number of imaging devices is not limited to the example shown in the figure; for example, if the imaging area is a soccer stadium, around 30 imaging devices may be installed around the stadium. Imaging devices with different functions, such as telephoto cameras and wide-angle cameras, may also be installed.

Note that in this embodiment, the multiple imaging devices are each assumed to be cameras having an independent housing and capable of capturing images from a single viewpoint. However, this is not limited to this, and two or more imaging devices may be configured within the same housing. For example, a single camera equipped with multiple lens groups and multiple sensors and capable of capturing images from multiple viewpoints may be installed as the multiple imaging devices.

A virtual viewpoint image is generated, for example, by the following method. First, multiple images (multiple captured images) are obtained by capturing images from different directions using multiple imaging devices. Next, a foreground image is obtained from the multiple captured images, extracting a foreground area corresponding to a specific object, such as a person or a ball, and a background image is obtained from the multiple captured images, extracting a background area other than the foreground area. A foreground model representing the three-dimensional shape of the specific object and texture data for coloring the foreground model are generated based on the foreground image, and texture data for coloring a background model representing the three-dimensional shape of the background, such as a stadium, is generated based on the background image. The texture data is then mapped to the foreground model and background model, and rendering is performed according to the virtual viewpoint indicated by the viewpoint information, thereby generating a virtual viewpoint image. However, the method for generating a virtual viewpoint image is not limited to this, and various methods can be used, such as generating a virtual viewpoint image by projective transformation of captured images without using a three-dimensional model.

A foreground image is an image in which an object's area (foreground area) has been extracted from an image captured by an imaging device. An object extracted as a foreground area is a dynamic object (moving body) that moves (its absolute position and shape can change) when images are captured from the same direction in chronological order. Examples of objects include players, referees, and other people on the field where a sport is taking place, such as the ball in a ball game, or singers, musicians, performers, and presenters in a concert or entertainment event.

A background image is an image of at least an area (background area) that is different from the foreground object. Specifically, a background image is an image in which the foreground object has been removed from the captured image. Furthermore, the background refers to an object that remains stationary or nearly stationary when images are taken from the same direction in chronological order. Examples of such objects include a stage for a concert, a stadium where an event such as a sport is held, a structure such as a goal used in a ball game, or a field. However, the background is at least an area that is different from the foreground object, and the captured object may include other objects in addition to the object and background.

A virtual camera is a virtual camera that is different from the multiple imaging devices actually installed around the imaging area, and is a concept used to conveniently explain the virtual viewpoint involved in generating a virtual viewpoint image. In other words, a virtual viewpoint image can be considered to be an image captured from a virtual viewpoint set in a virtual space associated with the imaging area. The position and orientation of the viewpoint in this virtual image can be expressed as the position and orientation of the virtual camera. In other words, a virtual viewpoint image can be said to be an image that simulates the captured image obtained by a camera, assuming that the camera exists at the position of the virtual viewpoint set in space.

<Example 1>
In this embodiment, a process for determining the viewpoints of a group of virtual cameras for generating depth images and texture images to be distributed from a server to a client based on viewpoint information of a physical camera will be described. Note that in this embodiment, the receiving device will be referred to as the client. Also, in this embodiment, the imaging device arranged in real space will be referred to as the physical camera.

<Hardware configuration of image processing system>
FIG. 1A is a diagram showing an example of the overall configuration of an image processing system 1 according to this embodiment.

The image processing system 1 has a shooting system 10, a first image processing device 20, a second image processing device 30, an input device 40, and a display device 50. The image processing system 1 generates shape information of a 3D model of an object using images (plurality of captured images) taken by multiple physical cameras. The image processing system 1 then generates viewpoint information (first viewpoint information) of a second virtual camera group for generating depth images and texture images to be distributed from the server to the client. Furthermore, the image processing system 1 generates and compresses depth images and texture images based on the generated viewpoint information of the second virtual camera group, and distributes these images and the information necessary to reconstruct (restore) the 3D model to the user as 3D model data.

 The imaging system 10 places a plurality of physical cameras at different positions and synchronously captures a subject (object) from multiple viewpoints. The plurality of captured images obtained through this synchronous capture, together with the viewpoint information of each physical camera of the imaging system 10 (extrinsic/intrinsic parameters and image size), are transmitted to the first image processing device 20. The extrinsic parameters of a camera are parameters that indicate the position and orientation of the camera (for example, a rotation matrix and a position vector). The intrinsic parameters of a camera are parameters specific to that camera, such as the focal length, image center, and lens distortion parameters. The extrinsic and intrinsic parameters of a camera are collectively referred to as camera parameters. The image size is the width and height of the image. The viewpoint information of this group of physical cameras is used when determining the viewpoint information (camera parameters and image size) of the second virtual camera group used to generate the depth images and texture images.
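 As a concrete illustration of the camera parameters described above, the following Python sketch shows one possible way to hold the extrinsic parameters (rotation matrix and position vector), the intrinsic parameters (focal length and image center), and the image size, together with a pinhole projection of a 3D point into pixel coordinates. This is only an illustrative sketch; the class and function names are hypothetical, lens distortion is omitted, and the described system does not prescribe any particular data layout.

import numpy as np
from dataclasses import dataclass

@dataclass
class CameraParams:
    """Hypothetical container for one camera's viewpoint information."""
    R: np.ndarray   # 3x3 rotation matrix (extrinsic: orientation)
    t: np.ndarray   # 3-vector camera position (extrinsic: position)
    f: float        # focal length in pixels (intrinsic)
    cx: float       # image center x (intrinsic)
    cy: float       # image center y (intrinsic)
    width: int      # image width
    height: int     # image height

def project(point_world: np.ndarray, cam: CameraParams):
    """Project a 3D world point to pixel coordinates; also return its depth."""
    # Transform into the camera coordinate frame.
    p_cam = cam.R @ (point_world - cam.t)
    depth = p_cam[2]
    # Pinhole projection with a single focal length and principal point.
    u = cam.f * p_cam[0] / depth + cam.cx
    v = cam.f * p_cam[1] / depth + cam.cy
    return u, v, depth

# Example: one camera at the origin looking down the world +Z axis (4K image).
cam = CameraParams(R=np.eye(3), t=np.zeros(3), f=1000.0,
                   cx=1920.0, cy=1080.0, width=3840, height=2160)
print(project(np.array([0.1, 0.0, 5.0]), cam))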

 The first image processing device 20 generates a 3D model of the foreground object based on the plurality of captured images input from the imaging system 10 and the viewpoint information of each physical camera. The foreground object (subject) is, for example, a person or a moving body within the capture range of the imaging system 10. The first image processing device 20 then generates viewpoint information of the second virtual camera group used to generate the depth images and texture images. Furthermore, based on the generated viewpoint information of the second virtual camera group, the first image processing device 20 generates and compresses depth images and texture images, and outputs the compressed images to the second image processing device 30 together with the information (metadata) necessary to restore the 3D model. The metadata is, for example, the generated viewpoint information of the second virtual camera group.

 First, the first image processing device 20 estimates the shape information of the subject based on the plurality of captured images and the viewpoint information of each physical camera. The shape information of the subject is estimated using, for example, the volume intersection method (shape from silhouette). A plurality of virtual cameras (a plurality of first virtual cameras) are set in the virtual space so as to correspond to the imaging system 10 existing in real space. These first virtual cameras are referred to as a first virtual camera group. The first virtual camera group reproduces the physical camera group in the virtual space; that is, the viewpoint information of the first virtual camera group corresponds to the viewpoint information of the physical camera group in real space. The shape information of the subject is then generated by applying the volume intersection method to the viewpoint information of the first virtual camera group and the plurality of captured images. As a result of this processing, a 3D point cloud (a set of points having three-dimensional coordinates) representing the shape information of the subject is obtained. This 3D point cloud is also called a visual hull. Note that the method of deriving the shape information of the subject from the captured images is not limited to this. The representation of the subject's shape information is also not limited to a 3D point cloud and may be a mesh or voxels. Furthermore, a texture (color information) is determined for each point of the generated 3D point cloud of the subject using the plurality of captured images. Therefore, the 3D model data representing the 3D model of the subject includes shape information indicating the shape and color information indicating the color. In this embodiment, the 3D model is generated from shape information and color information, but this is not limiting; for example, the 3D model may include the shape information while the color information is managed as separate data. The method of setting the first virtual camera group in the virtual space corresponding to the physical camera group in real space is a well-known technique in the field of CG and the like, and its description is therefore omitted.
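 To illustrate the idea behind the volume intersection (visual hull) method mentioned above, the following sketch carves a grid of candidate 3D points against binary silhouette masks: a candidate point is kept only if its projection falls inside the foreground silhouette in every camera. The function names, the toy orthographic projections, and the masks below are assumptions for illustration only; a practical implementation would use calibrated perspective projections and a much finer grid.

import numpy as np

def carve_visual_hull(grid_points, cameras, silhouettes):
    """Keep only the candidate points whose projection lies inside every silhouette.

    grid_points : (N, 3) candidate 3D points (e.g. voxel centers).
    cameras     : list of functions mapping a 3D point to (u, v) pixel coordinates.
    silhouettes : list of binary masks (H, W), True where the foreground object is.
    """
    kept = []
    for p in grid_points:
        inside_all = True
        for proj, mask in zip(cameras, silhouettes):
            u, v = proj(p)
            h, w = mask.shape
            # A point outside the image or outside the silhouette is carved away.
            if not (0 <= int(v) < h and 0 <= int(u) < w) or not mask[int(v), int(u)]:
                inside_all = False
                break
        if inside_all:
            kept.append(p)
    return np.array(kept)

# Toy usage: two cameras with orthographic "projections" and square silhouettes.
mask = np.zeros((100, 100), dtype=bool)
mask[40:60, 40:60] = True
cams = [lambda p: (p[0] + 50, p[1] + 50), lambda p: (p[2] + 50, p[1] + 50)]
grid = np.stack(np.meshgrid(*[np.arange(-20, 21, 2)] * 3), -1).reshape(-1, 3)
hull = carve_visual_hull(grid, cams, [mask, mask])
print(hull.shape)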

 Next, the first image processing device 20 generates viewpoint information (position and orientation) of a plurality of second virtual cameras used to generate the depth images and texture images to be distributed. Hereinafter, the plurality of second virtual cameras are referred to as a second virtual camera group. The positions of the second virtual camera group are, for example, set on the optical axes of the first virtual camera group, and their orientations are set to be the same as those of the first virtual cameras. Since the viewpoint information of the first virtual camera group corresponds to the viewpoint information of the physical camera group in real space, the viewpoint information of the second virtual camera group is consequently generated based on the viewpoint information of the physical camera group. The method of generating the viewpoint information of the second virtual camera group is described in detail later. Hereinafter, the viewpoint information of the second virtual camera group is referred to as the second viewpoint information. The first image processing device 20 generates depth images of the 3D model based on the second viewpoint information of the second virtual camera group. One depth image is generated for each second virtual camera. Specifically, each point in the 3D point cloud of the subject is projected onto the plane of the imaging plane of each second virtual camera. For each second virtual camera, the distance (depth) from the second virtual camera to the subject is calculated for each projected pixel, and a depth value is set for each pixel of the depth image. The first image processing device 20 also captures the subject based on the second viewpoint information of the second virtual camera group and generates texture images. A texture image is generated by blending the colors of the plurality of captured images while giving higher priority to the pixel values of images captured by physical cameras whose viewing direction is close to that of the second virtual camera. This method of assigning a higher priority (weight), when blending the plurality of captured images, to the pixel values of images captured by physical cameras whose viewing direction is close to that of the second virtual camera is referred to as the virtual-viewpoint-dependent texture image generation method, and is described in detail later. Because the captured image of the physical camera used to determine the pixel values of the subject is selected depending on the position and orientation of the virtual camera, the color of the subject changes as the virtual camera moves; a texture image generated by this method is therefore called a virtual-viewpoint-dependent texture image. In contrast, a virtual-viewpoint-independent texture image is generated by a method in which the pixel values of the subject do not change with the position and orientation of the virtual camera. In this embodiment, texture image generation is described using the virtual-viewpoint-dependent method, but texture images may also be generated in a virtual-viewpoint-independent manner. Examples of the virtual-viewpoint-dependent and virtual-viewpoint-independent texture image generation processes are given below.

 The virtual-viewpoint-dependent texture image generation process includes, for example, a process of determining the visibility of the points of the 3D point cloud constituting the subject and a process of deriving colors based on the position and orientation of the virtual camera.

 In the visibility determination process, the physical cameras capable of capturing each point are identified from the positional relationship between each point in the 3D point cloud and the plurality of physical cameras included in the physical camera group of the imaging system 10. In the color derivation process, a point in the 3D point cloud is taken as a point of interest and its color is derived, with the following processing performed for each point of interest. A point of interest included in the imaging range of the second virtual camera is selected. Then, a first virtual camera that has a viewing direction close to that of the second virtual camera whose imaging range includes the point of interest, and that can capture the point of interest, is selected. Because a first virtual camera reproduces a physical camera in the virtual space, selecting a first virtual camera is equivalent to selecting a physical camera. The selected point of interest is then projected onto the image captured by the selected physical camera, and the color of the pixel at the projection destination is taken as the color of the point of interest. A physical camera is selected, for example, according to whether the angle between the viewing direction from the second virtual camera to the point of interest and the viewing direction from the physical camera to the point of interest is at most a certain angle. If the point of interest can be captured by a plurality of physical cameras, a plurality of physical cameras whose viewing directions are close to that of the second virtual camera are selected, and the point of interest is projected onto each of the images captured by those physical cameras. The pixel values at the projection destinations are then obtained, and a weighted average is calculated so that the pixel values of the physical cameras whose viewing directions are closer to that of the second virtual camera are used preferentially, thereby determining the color of the point of interest.

 By performing this process while changing the point of interest and projecting the colors of the points of interest onto the plane of the imaging plane of the second virtual camera, a virtual-viewpoint-dependent texture image can be generated. Although the description here follows this embodiment, it is not limited to the above; when generating a texture image that depends on a virtual camera other than the second virtual camera, the above processing is performed with that other virtual camera in place of the second virtual camera.
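 The color derivation described above can be illustrated by the following sketch, which blends the colors sampled from the physical-camera images for one point of interest, weighting the cameras whose viewing direction toward the point is within a threshold angle of the second virtual camera's viewing direction. The angle threshold, the weighting function, and all names are illustrative assumptions; the description above does not fix a specific weighting.

import numpy as np

def view_dependent_color(point, virt_cam_pos, phys_cam_positions, sampled_colors,
                         max_angle_deg=30.0):
    """Blend colors sampled from physical-camera images for one 3D point,
    weighting cameras whose viewing direction is close to the virtual camera's.

    sampled_colors[i] is the RGB value obtained by projecting `point` into
    physical camera i's image (visibility is assumed to be already checked).
    """
    v_dir = point - virt_cam_pos
    v_dir = v_dir / np.linalg.norm(v_dir)
    weights, colors = [], []
    for cam_pos, color in zip(phys_cam_positions, sampled_colors):
        c_dir = point - cam_pos
        c_dir = c_dir / np.linalg.norm(c_dir)
        angle = np.degrees(np.arccos(np.clip(np.dot(v_dir, c_dir), -1.0, 1.0)))
        if angle <= max_angle_deg:
            # Smaller angle gives a larger weight, so nearby views dominate.
            weights.append(1.0 / (angle + 1e-3))
            colors.append(color)
    if not weights:
        return None  # no physical camera close enough to the virtual view
    w = np.array(weights) / np.sum(weights)
    return np.average(np.array(colors, dtype=float), axis=0, weights=w)

# Toy usage: two candidate cameras, the first almost aligned with the virtual view.
print(view_dependent_color(np.array([0.0, 0.0, 5.0]), np.array([0.0, 0.0, 0.0]),
                           [np.array([0.2, 0.0, 0.0]), np.array([4.0, 0.0, 0.0])],
                           [np.array([200, 50, 50]), np.array([100, 100, 100])]))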

 The virtual-viewpoint-independent texture image generation process includes, for example, the visibility determination process described above and a color derivation process that does not depend on the position and orientation of the virtual camera. After the visibility determination process, for example, a point in the 3D point cloud is taken as a point of interest, the point of interest is projected onto the image captured by the physical camera corresponding to a first virtual camera capable of capturing it, and the color of the pixel at the projection destination is taken as the color of the point of interest.

 If the point of interest can be captured by a plurality of first virtual cameras, the point of interest is projected onto each of the images captured by the plurality of physical cameras corresponding to those first virtual cameras. The pixel values at the projection destinations are then obtained and averaged to determine the color of the point of interest. By performing this process while changing the point of interest and projecting the colors of the points of interest onto the plane of the imaging plane of the second virtual camera, a virtual-viewpoint-independent texture image can be generated.

 In virtual viewpoint image generation technology, the virtual-viewpoint-dependent method of generating color information is known to produce higher-quality images than the virtual-viewpoint-independent method, because the pixel values of images captured by physical cameras whose viewing directions are close to that of the second virtual camera are used preferentially. In addition, a virtual-viewpoint-dependent virtual viewpoint image (texture image) generated by a virtual camera whose position and orientation are close to those of a physical camera is strongly influenced by the image captured by that physical camera, making it possible to generate a texture image whose quality is close to that of the captured image. On the other hand, a virtual-viewpoint-dependent texture image generated from the viewpoint of a virtual camera whose position and orientation differ greatly from those of the physical cameras, or a texture image generated in a virtual-viewpoint-independent manner, is generated by interpolating pixel values from multiple physical cameras. As a result, such an image may contain pixel values that differ from the actual color of the subject or may have low contrast.

 Finally, the first image processing device 20 compresses (encodes) the depth images and texture images using a video compression method such as H.264 or H.265. The compression method is not limited to video compression; any method that encodes the data to a size smaller than the original, such as file compression, may be used. The first image processing device 20 need not output depth images and texture images in which no subject appears to the second image processing device 30. The second image processing device 30 can reconstruct the shape information of the 3D model by back-projecting the depth images into the virtual space based on the second viewpoint information of the second virtual cameras. Furthermore, by using the pixel value of the texture image at the same coordinates as each depth value of the depth image as the color information of the point obtained by back-projecting that depth value, color information can be added to the shape information and the 3D model can be reconstructed.
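 The reconstruction by back-projection described above might look as follows: each depth pixel is un-projected through the pinhole model into the virtual camera's coordinate frame, transformed into world coordinates using the camera's viewpoint information, and given the color of the texture pixel at the same coordinates. The function name and the camera convention (position vector plus rotation matrix) are assumptions for illustration, not a prescribed implementation.

import numpy as np

def backproject(depth, texture, f, cx, cy, R, cam_pos):
    """Reconstruct a colored point cloud from one depth/texture image pair.

    depth   : (H, W) metric depth values, 0 where there is no subject.
    texture : (H, W, 3) color image aligned pixel-for-pixel with the depth image.
    R, cam_pos : the virtual camera's rotation matrix and position (its viewpoint
                 information), used to move the points back into world coordinates.
    """
    v, u = np.nonzero(depth > 0)
    z = depth[v, u]
    # Invert the pinhole projection to get camera-frame coordinates.
    x = (u - cx) * z / f
    y = (v - cy) * z / f
    pts_cam = np.stack([x, y, z], axis=1)
    # Camera frame -> world frame (inverse of p_cam = R @ (p_world - cam_pos)).
    pts_world = pts_cam @ R + cam_pos
    colors = texture[v, u]
    return pts_world, colors

# Toy usage with a flat 2 m depth plane and a uniform gray texture.
d = np.full((4, 4), 2.0)
t = np.full((4, 4, 3), 128, dtype=np.uint8)
pts, cols = backproject(d, t, f=100.0, cx=2.0, cy=2.0, R=np.eye(3), cam_pos=np.zeros(3))
print(pts.shape, cols.shape)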

 The first image processing device 20 may also cut out a rectangle covering the captured region of the subject from the uncompressed depth image and texture image, and compress and distribute this rectangular image (ROI image).

 In this case, the metadata may include the coordinate information of the cut-out rectangular images. The first image processing device 20 may also arrange the rectangular images into a single image before compressing and distributing it. Distributing rectangular images instead of the entire image makes it possible to reduce the amount of data.

 Furthermore, the first image processing device 20 may generate depth images containing high-precision depth values, such as single-precision floating-point numbers (32 bits), which cannot be video-compressed with H.264 or H.265. In this case, the depth information is converted to a precision that allows video compression (8 or 10 bits) before the video compression is performed. For example, scalar quantization may be applied, and the quantized depth image may then be compressed and distributed. In this case, the metadata may include the minimum and maximum values of the depth range before quantization. By performing scalar quantization, the client can reconstruct a 3D model of imaging targets and subjects within the capture range for which the precision directly supported by video compression would be insufficient.
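 A minimal sketch of the scalar quantization described above is shown below: floating-point depth is mapped linearly onto a 10-bit integer range for video compression, and the minimum and maximum of the original range are carried as metadata so that the client can dequantize. The names and the linear mapping are illustrative assumptions.

import numpy as np

def quantize_depth(depth_f32, bits=10):
    """Scalar-quantize floating-point depth into an integer image plus metadata."""
    d_min, d_max = float(depth_f32.min()), float(depth_f32.max())
    levels = (1 << bits) - 1
    scale = levels / (d_max - d_min) if d_max > d_min else 1.0
    q = np.round((depth_f32 - d_min) * scale).astype(np.uint16)
    return q, {"d_min": d_min, "d_max": d_max, "bits": bits}  # metadata to distribute

def dequantize_depth(q, meta):
    """Recover approximate depth values on the receiving side."""
    levels = (1 << meta["bits"]) - 1
    return meta["d_min"] + q.astype(np.float32) * (meta["d_max"] - meta["d_min"]) / levels

depth = np.random.default_rng(1).uniform(1.0, 3.0, size=(240, 320)).astype(np.float32)
q, meta = quantize_depth(depth)
print(np.abs(dequantize_depth(q, meta) - depth).max())  # maximum reconstruction error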

 The second image processing device 30 receives and decodes the depth images, texture images, and metadata from the first image processing device 20 and restores the 3D model. As described above, the 3D model is restored by back-projecting the depth images and texture images into the virtual space based on the second viewpoint information of the second virtual camera group contained in the metadata. The second image processing device 30 also calculates viewpoint information of a third virtual camera for generating the virtual viewpoint image viewed by the user, based on input values received from the input device 40 described later. It then generates a virtual viewpoint image based on the calculated viewpoint information of the third virtual camera and the restored 3D model. Furthermore, the second image processing device 30 outputs the generated virtual viewpoint image to the display device 50.

 The input device 40 accepts input values with which the user sets the third virtual camera and transmits the input values to the second image processing device 30. For example, the input device 40 has input units such as a joystick, a jog dial, a touch panel, a keyboard, and a mouse. The user who sets the third virtual camera sets the position and orientation of the third virtual camera by operating an input unit. In this embodiment, the user sets the position and orientation of the third virtual camera, but this is not limiting, and the position and orientation of the third virtual camera may also be set using the position information of the 3D model. The position information of the 3D model here may be generated by the first image processing device 20 on the distribution side or by the second image processing device 30 on the receiving side.

 The display device 50 displays the virtual viewpoint image generated and output by the second image processing device 30. The user views the virtual viewpoint image displayed on the display device 50 and sets the position and orientation of the virtual camera for the next frame via the input device 40.

 FIG. 1B is a diagram showing an example of the hardware configuration of the first image processing device 20 of this embodiment. The hardware configuration of the second image processing device 30 is similar to that of the first image processing device 20 described below. The first image processing device 20 of this embodiment includes a CPU 101, a RAM 102, a ROM 103, and a communication unit 104.

 The CPU 101 controls the entire first image processing device 20 using computer programs and data stored in the RAM 102 and the ROM 103, thereby realizing the functions of the first image processing device 20 shown in FIGS. 1A and 1B. The first image processing device 20 may have one or more pieces of dedicated hardware separate from the CPU 101, and at least part of the processing performed by the CPU 101 may be executed by the dedicated hardware. Examples of dedicated hardware include an ASIC (application-specific integrated circuit), an FPGA (field-programmable gate array), and a DSP (digital signal processor). The RAM 102 temporarily stores programs and data supplied from the auxiliary storage device 214 and data supplied from the outside via the communication unit 104. The ROM 103 stores programs and the like that do not require modification.

 The communication unit 104 is used for communication between the first image processing device 20 and external devices. For example, when the first image processing device 20 is connected to an external device by wire, a communication cable is connected to the communication unit 104. When the first image processing device 20 has a function of communicating wirelessly with an external device, the communication unit 104 includes an antenna.

 <Functional configuration of the image processing devices>
 FIG. 2 is a diagram showing an example of the functional configuration of the first image processing device 20 and the second image processing device 30.

 The first image processing device 20 includes a shape information generation unit 201, a viewpoint determination unit 202, a depth image generation unit 203, a texture image generation unit 204, an encoding unit 205, and a distribution unit 206.

 The shape information generation unit 201 estimates the shape information of the subject using the plurality of captured images and the viewpoint information of the physical camera group received from the imaging system 10 via the communication unit 104. The shape information is estimated using, for example, the volume intersection method described above. The shape information generation unit 201 therefore also generates, from the acquired viewpoint information of the physical camera group, the viewpoint information of the first virtual camera group in the virtual space corresponding to the physical camera group in real space. The shape information generation unit 201 outputs the estimated shape information to the viewpoint determination unit 202, the depth image generation unit 203, and the texture image generation unit 204. It also outputs the received viewpoint information of the physical camera group and the plurality of captured images to the viewpoint determination unit 202 and the texture image generation unit 204.

 Based on the input data from the shape information generation unit 201, the viewpoint determination unit 202 determines the viewpoints of the second virtual camera group and generates their viewpoint information, with the aim of improving the quality of the 3D model restored by the second image processing device 30. The viewpoint determination unit 202 outputs the generated viewpoint information of the second virtual camera group to the depth image generation unit 203 and, via the depth image generation unit 203, to the texture image generation unit 204.

 The viewpoint information of the second virtual camera group is generated, for example, by matching each second virtual camera to the position and orientation of the first virtual camera corresponding to a physical camera and then dolly-zooming the second virtual camera along the viewing direction of that first virtual camera. Specifically, a second virtual camera placed at the position of the first virtual camera corresponding to a physical camera approaches the subject along the viewing direction of the first virtual camera while its focal length (zoom) is adjusted so that the size of the subject in the virtual viewpoint image generated from the second virtual camera is maintained. This process of deriving the viewpoint information of the second virtual cameras is described in detail later with reference to FIGS. 3A to 3C and FIG. 4. By performing this process while changing the physical camera, and while changing the subject when a plurality of subjects appear in the captured images, viewpoint information of second virtual cameras can be generated for each subject, one for each physical camera that captures that subject.
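 Under a pinhole model, the on-screen size of a subject is roughly proportional to the focal length divided by the camera-to-subject distance, so a dolly zoom that keeps the subject's size constant keeps that ratio constant. The following sketch (hypothetical names; a simplification that ignores perspective effects on off-axis points) computes the focal length for a new distance.

def dolly_zoom_focal_length(f_initial, d_initial, d_new):
    """Focal length that keeps the subject's on-screen size constant while the
    camera moves from distance d_initial to d_new (on-screen size ~ f / d)."""
    return f_initial * d_new / d_initial

# Example: starting at 20 m with a 2000 px focal length and moving in to 5 m.
print(dolly_zoom_focal_length(2000.0, 20.0, 5.0))  # -> 500.0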

 The viewpoint determination unit 202 may also change the image size (width and height) of a second virtual camera. The image size of the second virtual camera is first set equal to that of the physical camera, and after the position and orientation of the second virtual camera are determined as described above, its image size is changed. For example, when the width and height of the virtual viewpoint image of the second virtual camera are each changed to 1/N of the width and height of the image captured by the physical camera, the focal length and image center among the intrinsic parameters of the second virtual camera are also multiplied by 1/N. That is, when the image size of the physical camera is 4K (width 3840, height 2160) and the image size of the second virtual camera is changed to full HD (width 1920, height 1080), i.e., the width and height are each halved, the focal length and image center among the intrinsic parameters of the second virtual camera are multiplied by 1/2. In this way, the image size can be changed without changing the angle of view, and the amount of data transmitted from the server to the client can be reduced.
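 A sketch of the image-size change described above is shown below: when the image is scaled to 1/N of its width and height, the focal length and image center are scaled by the same factor so that the angle of view is unchanged. The function name and parameter values are hypothetical.

def resize_intrinsics(f, cx, cy, width, height, n):
    """Scale the image size to 1/n while keeping the angle of view unchanged by
    scaling the focal length and image center by the same factor."""
    return f / n, cx / n, cy / n, width // n, height // n

# Example from the text: 4K (3840x2160) down to full HD (1920x1080), i.e. n = 2.
print(resize_intrinsics(f=2000.0, cx=1920.0, cy=1080.0, width=3840, height=2160, n=2))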

 The depth image generation unit 203 performs the depth image generation process described above, based on the shape information of the subject input from the shape information generation unit 201 and the viewpoint information of the second virtual camera group input from the viewpoint determination unit 202. The depth image generation unit 203 outputs the generated depth images and the viewpoint information of the second virtual camera group to the encoding unit 205.
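 The depth image generation described above (projecting each point of the subject's 3D point cloud onto the second virtual camera's imaging plane and recording the camera-to-subject distance per pixel) might be sketched as follows, assuming the points have already been transformed into the virtual camera's coordinate frame and using a simple nearest-depth (z-buffer) rule. All names are hypothetical.

import numpy as np

def render_depth_image(points_cam, f, cx, cy, width, height):
    """Render a depth image from 3D points already expressed in the virtual
    camera's coordinate frame (Z along the viewing direction)."""
    depth = np.full((height, width), np.inf, dtype=np.float32)
    for x, y, z in points_cam:
        if z <= 0:  # behind the camera
            continue
        u = int(round(f * x / z + cx))
        v = int(round(f * y / z + cy))
        if 0 <= u < width and 0 <= v < height:
            # Keep the nearest surface per pixel (simple z-buffer).
            depth[v, u] = min(depth[v, u], z)
    depth[np.isinf(depth)] = 0.0  # 0 marks pixels with no subject
    return depth

# Toy usage: a small cluster of points two metres in front of the camera.
pts = np.random.default_rng(0).normal([0.0, 0.0, 2.0], 0.05, size=(500, 3))
d = render_depth_image(pts, f=800.0, cx=160.0, cy=120.0, width=320, height=240)
print(d[d > 0].min(), d[d > 0].max())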

 The texture image generation unit 204 performs the virtual-viewpoint-dependent texture image generation process described above, based on the input data from the shape information generation unit 201 and the viewpoint information of the second virtual camera group input from the viewpoint determination unit 202 via the depth image generation unit 203. The texture image generation unit 204 outputs the generated texture images to the encoding unit 205. As described above, a virtual-viewpoint-dependent texture image is generated by preferentially referring to the pixel values of images captured by physical cameras whose position and orientation are close to those of the second virtual camera. This makes it possible to generate high-quality texture images, so that when the client restores the 3D model, a 3D model containing high-quality color information can be reproduced.

 The texture image generation unit 204 may also generate an effective pixel map based on the input data from the shape information generation unit 201 and the viewpoint information of the second virtual camera group input from the viewpoint determination unit 202 via the depth image generation unit 203, and may output the generated effective pixel map to the encoding unit 205. The effective pixel map is described in detail in Example 2.

 The encoding unit 205 acquires the depth images and the viewpoint information of the second virtual camera group input from the depth image generation unit 203, and the texture images input from the texture image generation unit 204. The encoding unit 205 compresses the depth images and texture images using the compression method described above, and outputs the compressed images and the viewpoint information (metadata) of the second virtual camera group to the distribution unit 206.

 The encoding unit 205 may also compress not only the depth images and texture images but also the metadata, using the file compression or the like described above.

 Furthermore, the encoding unit 205 may compress the effective pixel map input from the texture image generation unit 204 and output it to the distribution unit 206.

 The distribution unit 206 uses the communication unit 104 to transmit the compressed depth images, the compressed texture images, and the viewpoint information of the second virtual camera group input from the encoding unit 205 to the receiving unit 207 described later.

 The second image processing device 30 includes a receiving unit 207, a decoding unit 208, a 3D model restoration unit 209, a virtual camera control unit 210, and a virtual viewpoint image generation unit 211.

 The receiving unit 207 uses the communication unit 104 to receive the compressed depth images, the compressed texture images, and the viewpoint information (metadata) of the second virtual camera group from the distribution unit 206, and outputs them to the decoding unit 208.

 The decoding unit 208 decodes the compressed depth images and compressed texture images acquired from the receiving unit 207 and outputs them to the 3D model restoration unit 209 together with the viewpoint information of the second virtual camera group. The decoding unit 208 may also decode the metadata in addition to the depth images and texture images. Furthermore, the decoding unit 208 may decode the effective pixel map and output it to the 3D model restoration unit 209.

 The 3D model restoration unit 209 restores the 3D model using the restoration method described above, based on the decoded depth images and texture images acquired from the decoding unit 208 and the viewpoint information of the second virtual camera group, and outputs the restored 3D model to the virtual viewpoint image generation unit 211. Furthermore, the 3D model restoration unit 209 may generate the color information of the 3D model using the effective pixel map acquired from the decoding unit 208. The use of the effective pixel map is described in detail in Example 2.

 The virtual camera control unit 210 uses the communication unit 104 to generate, from the input values entered by the user via the input device 40, the viewpoint information of the third virtual camera for generating the virtual viewpoint image, and outputs the viewpoint information of the third virtual camera to the virtual viewpoint image generation unit 211. The virtual camera control unit 210 may also output the generated user-specified viewpoint information of the third virtual camera to the 3D model restoration unit 209.

 The virtual viewpoint image generation unit 211 generates a virtual viewpoint image based on the 3D model acquired from the 3D model restoration unit 209 and the viewpoint information of the third virtual camera acquired from the virtual camera control unit 210. The virtual viewpoint image is generated by placing the 3D model of the subject, the 3D model of the background object, and the third virtual camera in the virtual space and generating the image seen from the third virtual camera. The 3D model of the background object is, for example, a CG (computer graphics) model created separately to be combined with the subject, and is created in advance and stored in the second image processing device 30 (for example, in the ROM 103 of FIG. 1B).

 The 3D model of the subject and the 3D model of the background object are rendered by an existing CG rendering method. The virtual viewpoint image generation unit 211 transmits the generated virtual viewpoint image to the display device 50.

 <Example of generating viewpoint information of the second virtual camera group based on viewpoint information of the physical camera group>
 Here, a method for generating the viewpoint information of the second virtual camera group using the viewpoint information of the physical camera group is described with reference to FIGS. 3A to 3C and FIG. 4. FIGS. 3A to 3C are schematic diagrams illustrating an example of the method for generating the viewpoint information of the second virtual camera group.

 FIG. 3A is a diagram showing a physical camera group 302 and a subject 301 arranged in real space. The subject 301 is captured by the physical camera group 302. The straight lines 303 indicate the optical axes of the cameras of the physical camera group 302.

 FIG. 3B is a diagram showing a first virtual camera group 305 set so as to correspond to the physical camera group 302, and a 3D model 304 of the subject generated using the first virtual camera group 305. The first image processing device 20 generates the 3D model 304 of the subject using the plurality of captured images acquired from the physical camera group 302. To generate the 3D model 304 in the virtual space for the subject 301 in real space, the first virtual camera group 305 is generated in the virtual space so as to correspond to the physical camera group 302 in real space. In other words, the first virtual camera group 305 reproduces, in the virtual space, the physical camera group 302 in real space. Accordingly, the optical axes 306 of the first virtual camera group 305 correspond to the optical axes 303 of the physical camera group 302. The first image processing device 20 then generates the 3D model 304 of the subject 301 using the viewpoint information of the generated first virtual camera group 305.

 FIG. 3C is a diagram showing a second virtual camera group 307 set based on the viewpoint information of the first virtual camera group 305. The second virtual camera group 307 is set on the optical axes 306 of the first virtual camera group 305; however, this is not limiting, and the second virtual cameras 307 may be set at positions close to the optical axes 306. The second virtual camera group 307 is also set at positions closer to the 3D model 304 than the first virtual camera group 305. In general, a second virtual camera group 307 for generating depth images and texture images for distribution is set by placing a bounding box that encloses the 3D model 304 and arranging the cameras on a sphere surrounding that bounding box so that they face the subject; this determines the positions and orientations (extrinsic parameters) of the second virtual camera group 307, after which a focal length (intrinsic parameter) is determined that allows the entire 3D model 304 to be captured by the second virtual camera group 307. These are set manually, and the number and spacing of the second virtual cameras are often determined heuristically. When the 3D model 304, that is, the subject 301, has a complex shape, occlusion occurs in which a part of the 3D model 304 is hidden behind another part as seen from a given second virtual camera, so the number and arrangement of the second virtual camera group 307 must be determined appropriately. For a subject with a complex shape, if the number of second virtual cameras is small and their arrangement is inappropriate so that much occlusion occurs, the 3D model restored by the second image processing device 30 may differ greatly in shape from the 3D model 304 before distribution.
 However, if the number of second virtual cameras is increased in order to restore the 3D model 304 accurately, the number of depth images and texture images to be distributed increases. As a result, when a user attempts to restore the 3D model in an environment with a low transmission bandwidth or on a local terminal with low processing performance, the frame rate may drop. To give the user a high sense of realism, the 3D model displayed on the display device 50 should have high image quality and a smooth frame rate such as 60 fps. It is therefore necessary to set the number and arrangement of the second virtual camera group 307 appropriately and to distribute depth images from which the 3D model 304 can be accurately restored, together with high-quality texture images, using a small amount of data. The first image processing device 20 therefore generates the viewpoint information of the second virtual camera group 307 based on the viewpoint information of the physical camera group 302 described above. That is, the second virtual cameras 307 are placed at the positions of the first virtual cameras 305 corresponding to the physical camera group 302. The positions of the second virtual camera group 307 are then set by dolly-zooming each second virtual camera along the optical axis 306 of the corresponding first virtual camera 305 toward the 3D model 304 of the subject 301 while maintaining the size of the subject on the screen. It is assumed that the position to which each second virtual camera 307 is dolly-zoomed is determined in advance; for example, the dolly zoom may continue until the distance to the 3D model 304 reaches a predetermined value, or until the size of the 3D model 304 projected onto the imaging plane of the second virtual camera 307 reaches a predetermined size. The method of determining the viewpoint information of the second virtual camera group 307 is not limited to this; the second virtual cameras 307 may simply be placed near the optical axes 306 of the first virtual camera group 305, and intrinsic parameters such as the focal length and the image size may be determined manually. That is, it suffices that the viewpoint information of the second virtual camera group 307 is determined based on the viewpoint information of the physical camera group 302. In virtual viewpoint generation technology, the physical camera group 302 is arranged so as to surround the subject 301 with the number of cameras required to estimate the shape of the subject 301, for example by the volume intersection method described above. Therefore, if depth images as observed from the physical camera group 302 are sent, the user side can accurately restore the shape information of the shape-estimated 3D model 304. In addition, by distributing virtual-viewpoint-dependent texture images whose quality is close to that of the images from the physical camera group 302, high-quality color information can be restored on the user side, and the user can play back a high-quality 3D model.

 The method of determining the viewpoint information of each virtual camera will be described with reference to FIG. 4. FIG. 4 is a schematic diagram in which a first virtual camera 401 captures a 3D model 402 of the subject and the viewpoint information of a second virtual camera 403 is generated based on the viewpoint information of the first virtual camera 401. The second virtual camera 403 moves along the optical axis of the first virtual camera 401 toward the 3D model 402 while dolly-zooming. It is then placed at a position from which it can capture the 3D model 402 within a region 404 that encloses the 3D model 402 and within which the distance information (depth information) between the 3D model and the second virtual camera 403 can be represented in 10 bits. The region 404 is not limited to a cube and may be a sphere centered on the 3D model 402. The first virtual camera 401 captures the 3D model 402 and generates a captured image 405. The first image processing device 20 estimates the 3D model 402 based on the captured images 405 captured by the physical cameras of the imaging system 10. The second virtual camera 403 then captures the estimated 3D model 402 to generate a depth image 406 and a texture image 407. The size of the 3D model 402 appearing in the depth image 406 and the texture image 407 is comparable to that of the 3D model 402 appearing in the captured image 405. As described above, the determination of the viewpoint information of the second virtual camera 403 is not limited to a dolly zoom, and the size of the 3D model 402 appearing in the depth image 406 may differ from its size in the captured image 405. Although the second virtual camera 403 has been described as moving along the optical axis of the first virtual camera 401, the position and orientation of the second virtual camera 403 need only be close to those of the first virtual camera 401, and the camera does not necessarily have to move along that optical axis. Furthermore, the viewpoint information of the second virtual camera 403 may be determined once at initialization of the second image processing device 30, or may be determined every frame in accordance with the movement of the 3D model 402.

 <Control of the compression and distribution of 3D model data and the generation of virtual viewpoint images>
 FIG. 5 is a flowchart showing the flow of processing for controlling the compression and distribution of 3D model data in the first image processing device 20 according to this embodiment. The flow shown in FIG. 5 is realized by the CPU 101 reading the control program stored in the ROM 103 into the RAM 102 and executing it. The flow of FIG. 5 is executed when triggered by the shape information generation unit 201 receiving the plurality of captured images and the viewpoint information of the physical camera group from the imaging system 10.

 In S501, the shape information generation unit 201 estimates and generates the shape information of the subject based on the plurality of captured images. The generated shape information, the viewpoint information of the physical camera group, and the plurality of captured images are output to the viewpoint determination unit 202 and the texture image generation unit 204, and the generated shape information is also output to the depth image generation unit 203. The viewpoint information of the first virtual camera group corresponding to the physical camera group, used for generating the subject's shape, is assumed to have been generated in advance. For example, as preparation before capture begins, when the physical camera group is installed, a first virtual camera group corresponding to the installed physical camera group is generated, and the shape information of the subject is generated using the viewpoint information of this first virtual camera group and the plurality of captured images.

 In S502, the viewpoint determination unit 202 generates the viewpoint information of the second virtual camera group based on the viewpoint information of the physical camera group. The generated viewpoint information of the second virtual camera group is output to the depth image generation unit 203 and, via the depth image generation unit 203, to the texture image generation unit 204.

 The process of generating the viewpoint information of the second virtual camera group is described with reference to FIG. 7. The generated viewpoint information of the second virtual camera group may also be output to the texture image generation unit 204 without passing through the depth image generation unit 203.

 In S503, the depth image generation unit 203 generates depth images of the 3D model based on the data acquired from the shape information generation unit 201 and the viewpoint determination unit 202. The generated depth images are output to the encoding unit 205.

 In S504, the texture image generation unit 204 generates texture images of the 3D model based on the data acquired from the shape information generation unit 201 and the viewpoint determination unit 202. The generated texture images are output to the encoding unit 205.

 In S505, the encoding unit 205 encodes the depth images and texture images acquired from the depth image generation unit 203 and the texture image generation unit 204. The encoded data is output to the distribution unit 206.

 In S506, the distribution unit 206 distributes 3D model data including the data acquired from the depth image generation unit 203 and the encoding unit 205 and the viewpoint information of the second virtual camera group, and this flow ends. In the image processing system 1, the 3D model data is distributed to the second image processing device 30, but this is not limiting; the 3D model data may instead be distributed to a separate server that stores 3D model data.

 FIG. 6 is a flowchart showing the flow of processing for generating a virtual viewpoint image using the depth images and texture images in the second image processing device 30 according to this embodiment. The flow of FIG. 6 is executed when triggered by the virtual camera control unit 210 receiving input values from the input device 40.

 In S601, the virtual camera control unit 210 generates the viewpoint information of the user-specified third virtual camera based on the input values from the input device 40. The generated viewpoint information of the third virtual camera is output to the virtual viewpoint image generation unit 211.

In S602, the receiving unit 207 receives the 3D model data distributed by the distribution unit 206. The received 3D model data is output to the decoding unit 208.

In S603, the decoding unit 208 decodes the depth images and texture images included in the 3D model data acquired from the receiving unit 207. If the viewpoint information of the second virtual camera group has also been encoded, it is decoded as well. The decoded depth images and texture images and the viewpoint information of the second virtual camera group are then output to the 3D model restoration unit 209.

In S604, the 3D model restoration unit 209 restores the 3D model of the subject based on the decoded depth images and texture images acquired from the decoding unit 208 and the viewpoint information of the second virtual camera group. The restored 3D model is output to the virtual viewpoint image generation unit 211. The 3D model restoration process is described with reference to FIG. 8.

In S605, the virtual viewpoint image generation unit 211 generates a virtual viewpoint image based on the viewpoint information of the third virtual camera acquired from the virtual camera control unit 210 and the 3D model acquired from the 3D model restoration unit 209, and this flow ends.

After this flow is completed, the generated virtual viewpoint image is sent to the display device 50 and displayed on the display device 50.

The above describes the control of compressing and distributing 3D model data and of generating virtual viewpoint images according to this embodiment.

<Description of the Process for Generating Viewpoint Information of the Second Virtual Camera Group>
FIG. 7 is an example of a flowchart showing the flow of the process of generating the viewpoint information of the second virtual camera group according to this embodiment. Here, the viewpoint information of the second virtual camera group is generated by the dolly-zoom method of the second virtual camera described with reference to FIG. 4. That is, the second virtual camera moves from the position of the first virtual camera, which corresponds to the position of a physical camera, along the optical axis of the first virtual camera in the direction approaching the 3D model of the subject, and the second virtual camera then adjusts its focal length so that the size of the subject on its imaging plane is maintained.
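A minimal sketch of this dolly-zoom placement follows, assuming a pinhole camera whose focal length is given in pixels and using the centroid of the subject's 3D model as the reference point: because the projected size is s = f * X / d, keeping s constant while the axial distance changes from d1 to d2 requires f2 = f1 * d2 / d1. Choosing new_distance so that the subject's depth range fits the representable range (for example, 10 bits) is outside this sketch.

```python
import numpy as np

def dolly_zoom_camera(cam_pos, optical_axis, focal_length, subject_center, new_distance):
    """Place a second virtual camera on the first virtual camera's optical axis (dolly zoom).

    cam_pos, optical_axis, focal_length: position, unit viewing direction and focal
        length (in pixels) of the first virtual camera.
    subject_center: reference point of the subject, e.g. the centroid of its 3D model.
    new_distance: desired axial distance from the new camera to the subject.
    Returns (new_position, new_focal_length); the orientation is left unchanged.
    """
    axis = np.asarray(optical_axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    # axial distance from the first camera to the subject
    old_distance = float(np.dot(np.asarray(subject_center) - np.asarray(cam_pos), axis))
    new_pos = np.asarray(cam_pos) + (old_distance - new_distance) * axis
    # keeping the projected size s = f * X / d constant gives f2 = f1 * d2 / d1
    new_focal = focal_length * (new_distance / old_distance)
    return new_pos, new_focal
```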

The flow shown in FIG. 7 is executed by the viewpoint determination unit 202. Its execution is triggered by the reception of the 3D model of the subject, the viewpoint information of the physical camera group, and the multiple captured images from the shape information generation unit 201. The flow in FIG. 7 details the control, in S502 of FIG. 5, of determining the viewpoints of the second virtual camera group based on the viewpoint information of the physical camera group. When the flow shown in FIG. 7 is executed, it is assumed that the viewpoint information of the first virtual camera group and the 3D model of the subject have already been generated in the first image processing device 20.

In S701, the 3D model of the subject, the viewpoint information of the first virtual camera group, and the multiple captured images are acquired. Note that the viewpoint information of the physical camera group may be acquired instead of the viewpoint information of the first virtual camera group; in that case, the viewpoint information of the first virtual camera group is generated using the viewpoint information of the physical camera group.

In S702, S703 and S704 are repeated for each physical camera.

In S703, S704 is repeated for each subject included in the multiple captured images. Subjects are recognized based on the results of a face detection algorithm, a person detection algorithm, or the like. Alternatively, the region of each subject in a captured image can be identified by projecting the 3D model generated separately for that subject onto the same plane as the imaging surface of the physical camera, using the viewpoint information of the physical camera.

In S704, the viewpoint information of a second virtual camera is generated based on the 3D model and the viewpoint information of the first virtual camera, using the method described with reference to FIG. 4. Repeating this process as described in S702 and S703 generates the viewpoint information of the whole second virtual camera group.
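Combining the loops of S702 and S703 with the dolly_zoom_camera helper sketched earlier, the second virtual camera group could be assembled roughly as follows; the dictionary layout and the per-subject centroid and target distance are hypothetical placeholders, not names from the original text.

```python
def build_second_camera_group(first_virtual_cameras, subjects):
    """Assemble the second virtual camera group: one camera per (first camera, subject) pair.

    first_virtual_cameras: list of dicts with 'pos', 'axis', 'focal' (hypothetical layout).
    subjects: list of dicts with 'centroid' and 'target_distance' per detected subject.
    Uses dolly_zoom_camera() from the earlier sketch.
    """
    second_cameras = []
    for cam in first_virtual_cameras:        # loop of S702: one iteration per physical camera
        for subject in subjects:             # loop of S703: one iteration per subject
            pos, focal = dolly_zoom_camera(
                cam["pos"], cam["axis"], cam["focal"],
                subject["centroid"], subject["target_distance"])
            second_cameras.append({"pos": pos, "axis": cam["axis"], "focal": focal})
    return second_cameras
```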

In S705, the generated viewpoint information of the second virtual camera group is sent to the depth image generation unit 203 and the texture image generation unit 204.

<Explanation of 3D Model Restoration Control>
FIG. 8 is an example of a flowchart showing the flow of the 3D model restoration process according to this embodiment. Here, the restoration is described for the case in which a 3D model containing virtual-viewpoint-dependent color information is generated, that is, the case in which the pixel values of the texture images of the second virtual cameras whose positions and orientations are close to those of the user-specified third virtual camera are used preferentially as the color information of the 3D model. The flow shown in FIG. 8 is executed by the 3D model restoration unit 209 and details the control, in FIG. 6, of restoring the 3D model of the subject based on the decoded data.

In S801, the 3D model restoration unit 209 acquires the depth images, the texture images, and the viewpoint information of the second virtual camera group.

In S802, the 3D model restoration unit 209 projects the depth value of each pixel of a depth image into the virtual space based on the external and internal parameters of the virtual camera corresponding to that depth image, and thereby generates components of the shape information of the subject. For example, when the 3D model of the subject is represented by a 3D point cloud, each point is a component. Performing this process for all of the acquired depth images restores the shape information of the subject.
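Back-projection is the inverse of the depth rendering step: each valid pixel is lifted along its viewing ray by its depth value and transformed into world coordinates. The sketch below assumes the same pinhole convention as the earlier rendering sketch and that the depth image stores camera-space Z values.

```python
import numpy as np

def backproject_depth(depth, K, R, t):
    """Lift a depth image back into world space (the inverse of depth rendering).

    depth: HxW array of camera-space depth values, 0 where there is no geometry.
    K: 3x3 intrinsics; R, t: world -> camera extrinsics of the second virtual camera.
    Returns an N x 3 array of world-space points, one per valid depth pixel.
    """
    v, u = np.nonzero(depth > 0)
    z = depth[v, u].astype(np.float64)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T          # normalized camera rays through each pixel
    cam = rays * z[:, None]                  # scale by depth -> camera coordinates
    world = (cam - t) @ R                    # invert x_cam = R @ x_world + t
    return world
```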

In S803, when a component of the shape information projected into the virtual space from the depth images is visible in multiple texture images, the pixel value of the texture image whose virtual camera is close in position and orientation to the user-specified virtual camera is used preferentially as the color information of that component. This is repeated for all components to restore the color information corresponding to the shape information. For example, when the shape information is a point cloud and the components are its points, color information is restored for every point.
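One possible reading of this step is sketched below: each restored point is colored from the texture image of the second virtual camera that is nearest to the user-specified viewpoint and that projects the point inside its image bounds. For brevity the cameras are ranked by position only and the occlusion test is omitted; both are simplifying assumptions.

```python
import numpy as np

def color_points(points, textures, cams, user_cam_pos):
    """Assign view-dependent colors, preferring the camera closest to the user's viewpoint.

    points: N x 3 world points restored from the depth images.
    textures: list of HxWx3 uint8 texture images, one per second virtual camera.
    cams: list of dicts with 'K', 'R', 't', 'pos' per second virtual camera (assumed layout).
    user_cam_pos: position of the user-specified third virtual camera.
    """
    order = np.argsort([np.linalg.norm(c["pos"] - user_cam_pos) for c in cams])
    colors = np.zeros((len(points), 3), dtype=np.uint8)
    assigned = np.zeros(len(points), dtype=bool)

    for idx in order:                        # nearest camera first -> highest priority
        K, R, t = cams[idx]["K"], cams[idx]["R"], cams[idx]["t"]
        tex = textures[idx]
        h, w = tex.shape[:2]
        cam = points @ R.T + t
        z = cam[:, 2]
        pix = (cam / np.clip(z[:, None], 1e-6, None)) @ K.T
        u = np.round(pix[:, 0]).astype(int)
        v = np.round(pix[:, 1]).astype(int)
        hit = (~assigned) & (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        colors[hit] = tex[v[hit], u[hit]]
        assigned |= hit
    return colors
```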

In S804, the restored 3D model is output to the virtual viewpoint image generation unit 211.

As described above, in this embodiment, the viewpoint information of the second virtual camera group is generated based on the viewpoint information of the physical camera group, and depth images and texture images are generated using the generated viewpoint information of the second virtual camera group. With this processing, even a 3D model with a complex shape can be restored with high quality on the receiving side without increasing the amount of data. Note that although the compressed 3D model data is distributed in this embodiment, it may instead be stored, and the compressed 3D model data may be distributed in response to a request from a client. Also, although the example above generates the 3D model containing virtual-viewpoint-dependent color information at the time of restoring the 3D model, this is not limiting; after the shape information of the 3D model is restored, the color information of the 3D model may be generated at the same time as the virtual viewpoint image. In that case, color information does not need to be assigned to all of the shape information of the 3D model; it suffices to generate color information only for the imaging range (viewing angle) of the user-specified third virtual camera.

<Example 2>
Example 1 described processing in which the viewpoint information of the second virtual camera group is generated based on the viewpoint information of the physical camera group, and depth images and texture images are generated using the generated viewpoint information. Example 2 describes an aspect in which a valid pixel map is generated in addition to the depth images and texture images in order to handle cases in which a subject is occluded by another subject or object. Descriptions of the parts in common with Example 1, such as the hardware configuration and functional configuration of the image processing devices, are omitted or simplified.

FIG. 9 is a schematic diagram showing an example of a method for generating a valid pixel map according to this embodiment. A subject is photographed by a physical camera, and a captured image 903 is generated. In the captured image 903, part of the subject's body is hidden by an occluding object. The occluding object is also treated as a subject.

The first image processing device 20 estimates the shapes of the subject and the occluding object from the multiple captured images taken by the physical cameras of the imaging system 10 and generates shape information for each, that is, a 3D model 901 of the subject and a 3D model 904 of the occluding object. Next, the first image processing device 20 generates the viewpoint information of the second virtual camera group described in Example 1 based on data such as the viewpoint information of the physical camera group and the shape information. The second virtual camera 905 is placed on the optical axis of the first virtual camera 902, inside a region 906 that contains the 3D model 901 of the subject and within which the depth of the subject can be expressed in 10 bits. The focal length of the second virtual camera 905 is adjusted so that the size of the subject as seen from the first virtual camera 902 is maintained. After generating the viewpoint information of the second virtual camera 905, the first image processing device 20 generates a depth image 907 and a texture image 908.

For the pixels of the subject appearing in the generated texture image 908, the pixel values of the captured image corresponding to the first virtual camera 902 are used preferentially in the region that is not occluded by the 3D model 904 of the occluding object (the left side of the subject). In the region that is occluded by the 3D model 904 (the right side of the subject), the pixel values of a captured image corresponding to a first virtual camera with a viewpoint different from that of the first virtual camera 902 are used instead. As a result, the image quality of the right and left sides of the subject in the texture image 908 may differ significantly. If the color information of the 3D model is restored from this texture image 908, the image quality of part of the 3D model is degraded, and that part may stand out and appear unnatural to the user.

A valid pixel map 909 indicating the high-quality regions of the texture image is therefore generated. For example, the valid pixel map assigns a pixel value of 1 to unoccluded regions and 0 to occluded regions; its image size is the same as that of the texture image 908. Whether a region is occluded is determined, for example, from whether each point constituting the shape information of the subject's 3D model 901 is visible from the first virtual camera 902 according to the visibility determination process described earlier. In other words, because of the visibility determination, the pixel values of the region of the occluding object 904 appearing in the captured image 903 are not used as color information of the subject's 3D model 901. The region of the texture image whose values were determined using the pixel values of the captured image of the physical camera from which the viewpoint information of the second virtual camera 905 was generated is therefore identified, and in the valid pixel map the pixels of that region are set to 1 and all others to 0. The method of determining the occluded region is not limited to this.

When restoring the color information of the 3D model, the second image processing device 30 preferentially uses, as the color information of the 3D model, the pixel values of the texture-image regions corresponding to regions whose value is 1 in the valid pixel map. Specifically, when generating virtual-viewpoint-dependent color information for the restored shape information, the second image processing device 30 uses the texture pixels corresponding to 1 in the valid pixel map and does not use the texture pixels corresponding to 0. However, when the color of a piece of shape information is stored only in texture pixels corresponding to 0 in the valid pixel map, those pixel values are used.

The valid pixel map has been described above with binary pixel values of 0 or 1, but it may also be multi-valued. In that case, the pixel values of the valid pixel map may be used as weights when generating the color information of the 3D model; that is, the priority of the texture-image pixel values is determined according to the values of the valid pixel map. When the valid pixel map is generated with 255 levels, for example, the pixel values at the subject's outline and at the boundary with the occluding object may be set to 0 and increased linearly up to 255 over a fixed distance (for example, 5 px) toward the interior of the subject. This reduces the influence of unreliable texture pixels at the outline and at the boundary with the occluding object when restoring the color information of the 3D model, so that the color information of the 3D model is generated from highly reliable pixel values.
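A sketch of generating the valid pixel map for one texture image follows, assuming the per-pixel occlusion decision from the visibility determination is already available as a boolean mask. The binary variant is a direct cast of the mask; the multi-valued variant ramps linearly from 0 at the subject outline or occlusion boundary up to 255 over a fixed number of pixels, using a distance transform.

```python
import numpy as np
import cv2

def make_valid_pixel_map(unoccluded_mask, ramp_px=5):
    """Build a valid pixel map for one texture image.

    unoccluded_mask: HxW bool, True where the texture pixel was taken from the
        captured image of the physical camera that the second virtual camera was
        generated from (i.e. the subject is not occluded in that view).
    Returns an HxW uint8 map: 0 at occluded pixels and at boundaries, ramping
    linearly up to 255 over ramp_px pixels toward the interior of the subject.
    """
    mask_u8 = unoccluded_mask.astype(np.uint8)
    dist = cv2.distanceTransform(mask_u8, cv2.DIST_L2, 3)   # pixels to nearest boundary
    ramp = np.clip(dist / float(ramp_px), 0.0, 1.0)
    return (ramp * 255).astype(np.uint8)

# Binary variant: valid_map = unoccluded_mask.astype(np.uint8)   # 1 = usable, 0 = occluded
```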

FIG. 10 is a flowchart showing the flow of processing for controlling the compression and distribution of 3D model data in the first image processing device 20 according to this embodiment. Execution of the flow in FIG. 10 starts when the shape information generation unit 201 receives the multiple captured images and the viewpoint information of the physical camera group from the imaging system 10.

S1001 to S1003 are the same as S501 to S503 in FIG. 5.

In S1004, the texture image generation unit 204 generates a texture image and a valid pixel map of the foreground model based on the data acquired from the shape information generation unit 201 and the viewpoint determination unit 202. The generated texture image and valid pixel map are output to the encoding unit 205.

In S1005, the encoding unit 205 encodes the depth image, texture image, and valid pixel map acquired from the depth image generation unit 203 and the texture image generation unit 204. The encoded depth image, texture image, and valid pixel map are output to the distribution unit 206.

In S1006, the distribution unit 206 transmits 3D model data including the depth image, texture image, and valid pixel map acquired from the depth image generation unit 203 and the encoding unit 205, together with the viewpoint information of the second virtual camera group, to the receiving unit 207, and this flow ends.

FIG. 11 is an example of a flowchart showing the flow of the 3D model restoration process according to this embodiment. The flow in FIG. 11 is executed by the 3D model restoration unit 209 and details the control of restoring the 3D model of the subject based on the data decoded in S604 of FIG. 6.

In S1101, the depth images, texture images, valid pixel maps, and the viewpoint information of the second virtual camera group are acquired.

In S1102, the depth value of each pixel of a depth image is projected into the virtual space based on the external and internal parameters of the second virtual camera corresponding to that depth image, and the shape information of the subject is restored.

In S1103, when a component of the shape information is visible in multiple texture images, the pixel value of the texture image of the second virtual camera whose position and orientation are close to those of the user-specified third virtual camera is used preferentially as the color information. At that time, the priority of a texture-image pixel value is determined according to the corresponding pixel value of the valid pixel map. When the shape information is represented by a 3D point cloud, this is repeated for every point of the point cloud to generate the color information corresponding to the shape information.
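The text specifies a priority rather than an explicit formula, so the following is only one way to realize the weighting: each candidate texture pixel is weighted by its valid-pixel-map value and by the proximity of its second virtual camera to the third virtual camera. The inverse-distance weighting and the fallback rule are illustrative assumptions.

```python
import numpy as np

def blend_color(candidate_colors, valid_weights, view_distances, eps=1e-6):
    """Blend the color candidates of one 3D point using the valid pixel map as a weight.

    candidate_colors: M x 3 texture pixel values, one per second virtual camera seeing the point.
    valid_weights: M values in [0, 1] read from each camera's valid pixel map.
    view_distances: M distances between each second virtual camera and the third virtual camera.
    """
    proximity = 1.0 / (np.asarray(view_distances, dtype=float) + eps)  # closer cameras count more
    w = np.asarray(valid_weights, dtype=float) * proximity
    if w.sum() < eps:            # only low-confidence candidates: fall back to proximity alone
        w = proximity
    w = w / w.sum()
    return (np.asarray(candidate_colors, dtype=float) * w[:, None]).sum(axis=0)
```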

S1104 is the same as S804 in FIG. 8.

After this flow is completed, the virtual viewpoint image generation unit 211 generates a virtual viewpoint image based on the restored 3D model and the viewpoint information of the user-specified third virtual camera, and the generated virtual viewpoint image is displayed on the display device 50.

The above describes the control of compressing and distributing 3D model data including valid pixel maps and of restoring the 3D model. With this processing, a 3D model can be restored using highly reliable texture-image pixel values even when the subject is partly hidden by an occluding object.

<Other Examples>
The present disclosure can also be realized by a process in which a program that implements one or more functions of the above-described examples is supplied to a system or device via a network or a storage medium and one or more processors in a computer of that system or device read and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more of those functions.

The disclosure of this embodiment includes the following configurations, methods, systems, and programs.

(Configuration 1)
An image processing device comprising: a setting means for setting the positions and orientations of a plurality of second virtual cameras based on the optical axes of a plurality of first virtual cameras in a virtual space corresponding to the positions and orientations of a plurality of imaging devices in real space; and a generation means for generating a plurality of depth images indicating the distance between each of the plurality of second virtual cameras and a 3D model of a subject generated based on a plurality of captured images acquired by the plurality of imaging devices.

(Configuration 2)
The image processing device according to Configuration 1, wherein the plurality of second virtual cameras are set on the optical axes of the plurality of first virtual cameras.

(Configuration 3)
The image processing device according to Configuration 1, further comprising an encoding means for encoding the depth images.

(Configuration 4)
The image processing device according to Configuration 3, further comprising an output means for outputting the plurality of encoded depth images and viewpoint information indicating the positions and orientations of the plurality of second virtual cameras to another device that reconstructs the 3D model based on the plurality of encoded depth images and the viewpoint information.

(Configuration 5)
The image processing device according to Configuration 4, wherein the generation means generates, for each of the plurality of second virtual cameras, a virtual viewpoint image including the 3D model of the subject, the encoding means encodes the plurality of virtual viewpoint images, and the output means outputs the plurality of encoded virtual viewpoint images to the other device.

(Configuration 6)
The image processing device according to Configuration 5, wherein the generation means generates, for each of the plurality of second virtual cameras, correspondence information indicating whether each pixel in the virtual viewpoint image corresponds to a component of the 3D model of the subject, the encoding means encodes the plurality of pieces of correspondence information, and the output means outputs the plurality of encoded pieces of correspondence information to the other device.

(Configuration 7)
The image processing device according to Configuration 1, wherein the plurality of second virtual cameras are set at positions closer to the 3D model of the subject than the plurality of first virtual cameras.

(Configuration 8)
The image processing device according to Configuration 1, wherein the generation means generates one depth image for one second virtual camera.

(Configuration 9)
An image processing device comprising: an acquisition means for acquiring a plurality of encoded depth images indicating the distance between each of a plurality of second virtual cameras, set based on the optical axes of a plurality of first virtual cameras in a virtual space corresponding to the positions and orientations of a plurality of imaging devices in real space, and a 3D model of a subject generated based on a plurality of captured images acquired by the imaging devices, and first viewpoint information indicating the positions and orientations of the plurality of second virtual cameras; a decoding means for decoding the plurality of encoded depth images; and a generation means for generating the 3D model of the subject based on the plurality of decoded depth images and the first viewpoint information.

(Configuration 10)
The image processing device according to Configuration 9, wherein second viewpoint information indicating the position and orientation of a third virtual camera different from the first virtual cameras and the second virtual cameras is acquired, and the generation means generates a virtual viewpoint image based on the second viewpoint information and the 3D model of the subject.

(Method 1)
An image processing method comprising: a setting step of setting the positions and orientations of a plurality of second virtual cameras based on the optical axes of a plurality of first virtual cameras in a virtual space corresponding to the positions and orientations of a plurality of imaging devices in real space; and a generation step of generating a plurality of depth images indicating the distance between each of the plurality of second virtual cameras and a 3D model of a subject generated based on a plurality of captured images acquired by the plurality of imaging devices.

(Method 2)
An image processing method comprising: an acquisition step of acquiring a plurality of encoded depth images indicating the distance between each of a plurality of second virtual cameras, set based on the optical axes of a plurality of first virtual cameras in a virtual space corresponding to the positions and orientations of a plurality of imaging devices in real space, and a 3D model of a subject generated based on a plurality of captured images acquired by the imaging devices, and first viewpoint information indicating the positions and orientations of the plurality of second virtual cameras; a decoding step of decoding the plurality of encoded depth images; and a generation step of generating the 3D model of the subject based on the plurality of decoded depth images and the first viewpoint information.

(Program)
A program for causing a computer to function as each of the means of the image processing device according to any one of Configurations 1 to 10.

The present invention is not limited to the above-described embodiments, and various changes and modifications are possible without departing from the spirit and scope of the present invention. Therefore, the following claims are appended to publicly define the scope of the present invention.

This application claims priority based on Japanese Patent Application No. 2024-073062 filed on April 26, 2024, the entire contents of which are incorporated herein by reference.

202 viewpoint determination unit
203 depth image generation unit
204 texture image generation unit
205 encoding unit

Claims (13)

1. An image processing device comprising: a setting means for setting the positions and orientations of a plurality of second virtual cameras based on the optical axes of a plurality of first virtual cameras in a virtual space corresponding to the positions and orientations of a plurality of imaging devices in real space; and a generation means for generating a plurality of depth images indicating the distance between each of the plurality of second virtual cameras and a 3D model of a subject generated based on a plurality of captured images acquired by the plurality of imaging devices.

2. The image processing device according to claim 1, wherein the plurality of second virtual cameras are set on the optical axes of the plurality of first virtual cameras.

3. The image processing device according to claim 1, further comprising an encoding means for encoding the depth images.

4. The image processing device according to claim 3, further comprising an output means for outputting the plurality of encoded depth images and viewpoint information indicating the positions and orientations of the plurality of second virtual cameras to another device that reconstructs the 3D model based on the plurality of encoded depth images and the viewpoint information.

5. The image processing device according to claim 4, wherein the generation means generates, for each of the plurality of second virtual cameras, a virtual viewpoint image including the 3D model of the subject, the encoding means encodes the plurality of virtual viewpoint images, and the output means outputs the plurality of encoded virtual viewpoint images to the other device.

6. The image processing device according to claim 5, wherein the generation means generates, for each of the plurality of second virtual cameras, correspondence information indicating whether each pixel in the virtual viewpoint image corresponds to a component of the 3D model of the subject, the encoding means encodes the plurality of pieces of correspondence information, and the output means outputs the plurality of encoded pieces of correspondence information to the other device.

7. The image processing device according to claim 1, wherein the plurality of second virtual cameras are set at positions closer to the 3D model of the subject than the plurality of first virtual cameras.

8. The image processing device according to claim 1, wherein the generation means generates one depth image for one second virtual camera.

9. An image processing device comprising: an acquisition means for acquiring a plurality of encoded depth images indicating the distance between each of a plurality of second virtual cameras, set based on the optical axes of a plurality of first virtual cameras in a virtual space corresponding to the positions and orientations of a plurality of imaging devices in real space, and a 3D model of a subject generated based on a plurality of captured images acquired by the imaging devices, and first viewpoint information indicating the positions and orientations of the plurality of second virtual cameras; a decoding means for decoding the plurality of encoded depth images; and a generation means for generating the 3D model of the subject based on the plurality of decoded depth images and the first viewpoint information.

10. The image processing device according to claim 9, wherein second viewpoint information indicating the position and orientation of a third virtual camera different from the first virtual cameras and the second virtual cameras is acquired, and the generation means generates a virtual viewpoint image based on the second viewpoint information and the 3D model of the subject.

11. An image processing method comprising: a setting step of setting the positions and orientations of a plurality of second virtual cameras based on the optical axes of a plurality of first virtual cameras in a virtual space corresponding to the positions and orientations of a plurality of imaging devices in real space; and a generation step of generating a plurality of depth images indicating the distance between each of the plurality of second virtual cameras and a 3D model of a subject generated based on a plurality of captured images acquired by the plurality of imaging devices.

12. An image processing method comprising: an acquisition step of acquiring a plurality of encoded depth images indicating the distance between each of a plurality of second virtual cameras, set based on the optical axes of a plurality of first virtual cameras in a virtual space corresponding to the positions and orientations of a plurality of imaging devices in real space, and a 3D model of a subject generated based on a plurality of captured images acquired by the imaging devices, and first viewpoint information indicating the positions and orientations of the plurality of second virtual cameras; a decoding step of decoding the plurality of encoded depth images; and a generation step of generating the 3D model of the subject based on the plurality of decoded depth images and the first viewpoint information.

13. A program for causing a computer to function as each of the means of the image processing device according to any one of claims 1 to 10.
