
WO2025225378A1 - Image processing device, image processing method, and program - Google Patents

Image processing device, image processing method, and program

Info

Publication number
WO2025225378A1
WO2025225378A1 (PCT/JP2025/014112, JP2025014112W)
Authority
WO
WIPO (PCT)
Prior art keywords
virtual
image
model
viewpoint
virtual camera
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/JP2025/014112
Other languages
French (fr)
Japanese (ja)
Inventor
圭吾 米田
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Canon Inc filed Critical Canon Inc
Publication of WO2025225378A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00: 3D [Three Dimensional] image rendering
    • G06T19/00: Manipulating 3D models or images for computer graphics
    • G06T7/00: Image analysis
    • G06T7/50: Depth or shape recovery
    • G06T7/55: Depth or shape recovery from multiple images
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00: Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20: Image signal generators
    • H04N13/271: Image signal generators wherein the generated image signals comprise depth maps or disparity maps
    • H04N23/00: Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60: Control of cameras or camera modules

Definitions

  • the present disclosure relates to an image processing device for compressing and distributing 3D model data representing a 3D model.
  • the distributing device arranges multiple virtual cameras (or a group of virtual cameras) to surround the 3D model and generates depth images and texture images from each of the multiple virtual cameras.
  • the generated depth and texture images are then compressed, and the compressed depth and texture images, along with information indicating the positions and orientations of the multiple virtual cameras, are distributed to the user.
  • the receiving device decodes the compressed depth and texture images, and reconstructs the 3D model based on the decoded depth and texture images and information indicating the positions and orientations of the multiple virtual cameras.
  • Patent Document 1 describes a method for determining the viewpoints of a group of virtual cameras so as to minimize fluctuations in the position of key objects in depth images between frames, with the aim of improving the compression rate of depth images. By reducing the number of motion vectors included in the encoded stream when encoding depth images, it is expected that the compression rate of depth images will improve.
  • depending on how the virtual cameras are set, however, the quality of the 3D model reconstructed on the device on the receiving side may be reduced.
  • if the shape of the 3D model generated on the distribution side is complex, then depending on the positions and orientations of the virtual cameras, that complex shape may not be sufficiently represented in the depth images, and the shape accuracy of the 3D model reconstructed on the user side may be reduced.
  • as a result, the shape of the 3D model generated on the distribution side may differ significantly from the shape of the 3D model reconstructed on the receiving side. Therefore, to reconstruct the 3D model more accurately, it is necessary to appropriately determine the positions and orientations of the virtual cameras.
  • the present disclosure therefore aims to reduce the possibility of a decrease in the quality of the 3D model generated on the receiving device.
  • the image processing device has the following configuration: a setting means for setting the positions and orientations of multiple second virtual cameras based on the optical axes of multiple first virtual cameras in virtual space that correspond to the positions and orientations of multiple image capture devices in real space; and a generation means for generating multiple depth images that indicate the distance between each of the multiple second virtual cameras and a 3D model of a subject that is generated based on the multiple captured images acquired by the multiple image capture devices.
  • This disclosure reduces the possibility of a decrease in the quality of the 3D model generated on the receiving device.
  • FIG. 1A is a block diagram showing an example of the configuration of an image processing system 1.
  • FIG. 1B illustrates an example of a hardware configuration of an image processing apparatus.
  • FIG. 2 is a diagram illustrating an example of the functional configuration of a first image processing device 20 and a second image processing device 30.
  • FIG. 1 is a diagram illustrating an example of the configuration of an imaging system 10.
  • FIG. 3B is a diagram showing a first virtual camera group 305 that corresponds to the physical camera group 302 and a 3D model 304 of a subject that is generated using the first virtual camera group 305.
  • FIG. 3C is a diagram showing a second virtual camera group 307 set based on viewpoint information of the first virtual camera group 305.
  • 10A and 10B are diagrams illustrating generation of viewpoint information of a virtual camera according to the first embodiment.
  • FIG. 5 is a flowchart illustrating an example of a process for compressing and distributing 3D model data according to the first embodiment.
  • 10 is a flowchart illustrating an example of a process for receiving 3D model data and generating a virtual viewpoint image according to the first embodiment.
  • 10 is a flowchart illustrating an example of a process for generating viewpoint information of a group of virtual cameras according to the first embodiment.
  • 10 is a flowchart illustrating an example of a restoration process of a 3D model according to the first embodiment.
  • 10A and 10B are diagrams illustrating generation of viewpoint information of a group of virtual cameras according to the second embodiment.
  • 10 is a flowchart illustrating an example of a process for compressing and distributing 3D model data according to a second embodiment.
  • 10 is a flowchart illustrating an example of a restoration process of a 3D model according to a second embodiment.
  • the image processing device has a setting means for setting the positions and orientations of multiple second virtual cameras based on the optical axes of multiple first virtual cameras in virtual space that correspond to the positions and orientations of multiple imaging devices in real space.
  • the image processing device also has a generation means for generating multiple depth images that indicate the distance between each of the multiple second virtual cameras and a 3D model of the subject generated based on multiple captured images acquired by the multiple imaging devices.
  • the multiple imaging devices are aligned with each other.
  • the multiple first virtual cameras are set based on the positional relationships between the multiple imaging devices in real space. In other words, the multiple first virtual cameras in virtual space are reproductions in virtual space of the multiple imaging devices in real space.
  • the positions and orientations of the multiple second virtual cameras are set based on the positions and orientations of the multiple imaging devices. Therefore, the subject depicted in the depth images of the multiple second virtual cameras is similar to the subject included in the captured images acquired from the multiple imaging devices. Specifically, the shape of the area representing the subject included in the depth images generated from the second virtual cameras is similar to the shape of the area representing the subject included in the captured images.
  • the 3D model of the subject generated by the distribution device is generated based on the multiple captured images acquired from the multiple imaging devices. Therefore, if the depth images used to reconstruct (restore) the 3D model of the subject on the receiving device are similar to the captured images, the 3D model reconstructed on the receiving device will also be similar to the 3D model generated on the distribution device. In other words, it is possible to reduce the possibility that the shape of the 3D model generated on the receiving device will differ from that of the 3D model generated on the distribution device, i.e., the reproducibility of the 3D model will be degraded.
  • the image processing device also has an encoding means for encoding the depth images.
  • the image processing device also has an output means for outputting the encoded depth images and viewpoint information indicating the positions and orientations of the second virtual cameras to another device that reconstructs the 3D model.
  • the other device is a device that reconstructs the 3D model based on the encoded depth images and viewpoint information.
  • the generation means generates a virtual viewpoint image including a 3D model of the subject for each of the plurality of second virtual cameras.
  • the encoding means then encodes the plurality of virtual viewpoint images, and the output means outputs the encoded plurality of virtual viewpoint images to the other device.
  • the colors of each component of the subject included in the virtual viewpoint image (texture image) generated from the second virtual camera are close to the colors of each component of the subject included in the captured image.
  • the generation means generates correspondence information indicating whether each pixel in the texture image corresponds to a component of the 3D model of the subject, for each of the second virtual cameras.
  • the encoding means then encodes the plurality of pieces of correspondence information, and the output means outputs the encoded plurality of pieces of correspondence information to the other device.
  • This aspect allows the receiving device to identify, among the pixels of the texture image, the pixels to be used to color the 3D model of the subject. For example, if a subject is occluded by another subject in an image captured by an imaging device and that captured image is used to generate a texture image for the second virtual camera, a texture image with color information that differs from the color information of that subject could be generated. Therefore, by generating correspondence information that indicates areas where occlusion is not considered to occur, the possibility of colors differing when the 3D model is reconstructed can be reduced.
  • the second virtual cameras are set at positions closer to the 3D model of the subject than the first virtual cameras.
  • the generation means generates one depth image for one second virtual camera.
  • one depth image is generated for one second virtual camera at each of a plurality of time points.
  • the image processing device has an acquisition means for acquiring a plurality of encoded depth images and first viewpoint information indicating the positions and orientations of a plurality of second virtual cameras.
  • a depth image is an image indicating the distance between each of the plurality of second virtual cameras and a 3D model of the subject generated based on a plurality of captured images acquired by the plurality of imaging devices.
  • the plurality of second virtual cameras are virtual cameras set based on the optical axes of a plurality of first virtual cameras in virtual space corresponding to the positions and orientations of the plurality of imaging devices in real space.
  • the image processing device also has a decoding means for decoding the plurality of encoded depth images.
  • the image processing device also has a generation means for generating a 3D model of the subject based on the plurality of decoded depth images and the first viewpoint information.
  • the image processing device also acquires second viewpoint information indicating the position and orientation of a third virtual camera different from the first virtual camera and the second virtual camera.
  • the image processing device also generates a virtual viewpoint image based on the second viewpoint information and a 3D model of the subject.
  • the third virtual camera may be set by a user operating an input device such as a joystick, or may be set based on the position of the generated 3D model.
  • a program causes a computer to execute the functions of the image processing device described above.
  • by executing this program, the computer functions as the image processing device described above.
  • a virtual viewpoint image is an image generated by a user freely manipulating the position and orientation of a virtual camera, and is also called a free viewpoint image or arbitrary viewpoint image.
  • the term "image" will be used to refer to both moving images and still images.
  • Image processing system 1 is a system that generates a virtual viewpoint image that represents a scene from a specified virtual viewpoint, based on multiple images captured by multiple imaging devices and a specified virtual viewpoint.
  • the virtual viewpoint image in this embodiment is also called a free viewpoint video, but is not limited to an image corresponding to a viewpoint freely (arbitrarily) specified by the user; for example, the virtual viewpoint image also includes an image corresponding to a viewpoint selected by the user from multiple candidates.
  • this embodiment will mainly describe a case where the virtual viewpoint is specified by user operation, but the virtual viewpoint may also be specified automatically based on the results of image analysis, etc.
  • this embodiment will mainly describe a case where the virtual viewpoint image is a video, but the virtual viewpoint image may also be a still image.
  • the viewpoint information used to generate a virtual viewpoint image is information that indicates the position and orientation (line of sight) of the virtual viewpoint.
  • the viewpoint information is a parameter set that includes parameters that indicate the three-dimensional position of the virtual viewpoint and parameters that indicate the orientation of the virtual viewpoint in the pan, tilt, and roll directions.
  • the content of the viewpoint information is not limited to the above.
  • the parameter set serving as viewpoint information may include a parameter that indicates the size of the field of view (angle of view) of the virtual viewpoint.
  • the viewpoint information may have multiple parameter sets.
  • the viewpoint information may have multiple parameter sets that respectively correspond to multiple frames that make up a video of the virtual viewpoint image, and may be information that indicates the position and orientation of the virtual viewpoint at each of multiple consecutive points in time.
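For illustration only, the viewpoint information described in the items above could be represented by a structure like the following Python sketch; the class and field names are assumptions for this example and do not appear in the disclosure.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ViewpointParameterSet:
    """One parameter set: the 3D position of the virtual viewpoint and its
    orientation in the pan, tilt, and roll directions."""
    position_xyz: tuple                         # (x, y, z) in virtual-space coordinates
    pan_deg: float
    tilt_deg: float
    roll_deg: float
    angle_of_view_deg: Optional[float] = None   # optional field-of-view parameter

@dataclass
class ViewpointInfo:
    """Viewpoint information: one parameter set per frame of the virtual viewpoint video,
    i.e. the position and orientation at each of multiple consecutive points in time."""
    per_frame: List[ViewpointParameterSet]
```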
  • the image processing system 1 has multiple imaging devices that capture images of an imaging area from multiple directions.
  • the imaging area may be, for example, a stadium where sports such as soccer or karate are held, or a stage where concerts or plays are held.
  • the multiple imaging devices are installed in different positions surrounding such an imaging area and capture images in sync. Note that the multiple imaging devices do not have to be installed around the entire perimeter of the imaging area; depending on installation location restrictions, they may be installed only around a portion of the perimeter of the imaging area.
  • the number of imaging devices is not limited to the example shown in the figure; for example, if the imaging area is a soccer stadium, around 30 imaging devices may be installed around the stadium. Imaging devices with different functions, such as telephoto cameras and wide-angle cameras, may also be installed.
  • the multiple imaging devices are each assumed to be cameras having an independent housing and capable of capturing images from a single viewpoint.
  • however, the configuration is not limited to this, and two or more imaging devices may be configured within the same housing.
  • a single camera equipped with multiple lens groups and multiple sensors and capable of capturing images from multiple viewpoints may be installed as the multiple imaging devices.
  • a virtual viewpoint image can be generated, for example, using the following method. First, multiple images (multiple captured images) are obtained by capturing images from different directions using multiple imaging devices. Next, a foreground image is obtained from the multiple captured images, extracting a foreground area corresponding to a specific object, such as a person or a ball, and a background image is obtained from the multiple captured images, extracting a background area other than the foreground area. A foreground model representing the three-dimensional shape of the specific object and texture data for coloring the foreground model are generated based on the foreground image, and texture data for coloring a background model representing the three-dimensional shape of the background, such as a stadium, is generated based on the background image.
  • the texture data is then mapped to the foreground model and background model, and rendering is performed according to the virtual viewpoint indicated by the viewpoint information, thereby generating a virtual viewpoint image.
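The foreground extraction step in the method above is commonly realized with background subtraction. The following is a minimal Python sketch under that assumption; the function names and the fixed threshold are illustrative and not taken from the disclosure.

```python
import numpy as np

def extract_foreground_mask(captured_bgr: np.ndarray,
                            background_bgr: np.ndarray,
                            threshold: float = 30.0) -> np.ndarray:
    """Return a boolean silhouette mask of the foreground objects (e.g. players, the ball).

    A pixel is treated as foreground when its color differs sufficiently from a
    pre-captured background image of the same physical camera.
    """
    diff = np.linalg.norm(captured_bgr.astype(np.float32)
                          - background_bgr.astype(np.float32), axis=2)
    return diff > threshold

def extract_foreground_image(captured_bgr: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Foreground image: keep the foreground pixels and zero out the background area."""
    return np.where(mask[..., None], captured_bgr, 0)
```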
  • the method for generating a virtual viewpoint image is not limited to this, and various methods can be used, such as generating a virtual viewpoint image by projective transformation of captured images without using a three-dimensional model.
  • a foreground image is an image in which an object's area (foreground area) has been extracted from an image captured by an imaging device.
  • An object extracted as a foreground area is a dynamic object (moving body) that moves (its absolute position and shape can change) when images are captured from the same direction in chronological order. Examples of objects include players, referees, and other people on the field where a sport is taking place, such as the ball in a ball game, or singers, musicians, performers, and presenters in a concert or entertainment event.
  • a background image is an image of at least an area (background area) that is different from the foreground object.
  • a background image is an image in which the foreground object has been removed from the captured image.
  • the background refers to an object that remains stationary or nearly stationary when images are taken from the same direction in chronological order. Examples of such objects include a stage for a concert, a stadium where an event such as a sport is held, a structure such as a goal used in a ball game, or a field.
  • the background is at least an area that is different from the foreground object, and the captured object may include other objects in addition to the object and background.
  • a virtual camera is a virtual camera that is different from the multiple imaging devices actually installed around the imaging area, and is a concept used to conveniently explain the virtual viewpoint involved in generating a virtual viewpoint image.
  • a virtual viewpoint image can be considered to be an image captured from a virtual viewpoint set in a virtual space associated with the imaging area.
  • the position and orientation of the viewpoint in this virtual image can be expressed as the position and orientation of the virtual camera.
  • a virtual viewpoint image can be said to be an image that simulates the captured image obtained by a camera, assuming that the camera exists at the position of the virtual viewpoint set in space.
  • In Example 1, a process for determining the viewpoints of a group of virtual cameras for generating depth images and texture images to be distributed from a server to a client, based on viewpoint information of the physical cameras, will be described.
  • the receiving device will be referred to as the client.
  • the imaging device arranged in real space will be referred to as the physical camera.
  • FIG. 1A is a diagram showing an example of the overall configuration of an image processing system 1 according to this embodiment.
  • the image processing system 1 has a shooting system 10, a first image processing device 20, a second image processing device 30, an input device 40, and a display device 50.
  • the image processing system 1 generates shape information of a 3D model of an object using images (plurality of captured images) taken by multiple physical cameras.
  • the image processing system 1 then generates viewpoint information (first viewpoint information) of a second virtual camera group for generating depth images and texture images to be distributed from the server to the client.
  • the image processing system 1 generates and compresses depth images and texture images based on the generated viewpoint information of the second virtual camera group, and distributes these images and the information necessary to reconstruct (restore) the 3D model to the user as 3D model data.
  • the photography system 10 places multiple physical cameras in different positions and synchronously photographs a subject (object) from multiple viewpoints.
  • the multiple captured images obtained through synchronous photography, as well as viewpoint information (external/internal parameters, image size) for each physical camera of the photography system 10, are then transmitted to the first image processing device 20.
  • the external parameters of a camera are parameters that indicate the position and orientation of the camera (e.g., rotation matrix and position vector).
  • the internal parameters of a camera are internal parameters specific to the camera, such as focal length, image center, and lens distortion parameters.
  • the external and internal parameters of a camera are collectively referred to as camera parameters.
  • Image size refers to the width and height of the image.
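As an illustration of the camera parameters listed above, the Python sketch below assembles the internal parameters (focal length and image center) into a matrix and projects a 3D point using the external parameters (rotation matrix and position vector); lens distortion is omitted and all function and variable names are assumptions.

```python
import numpy as np

def make_intrinsics(fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Internal parameters as a 3x3 matrix (focal lengths and image center; distortion omitted)."""
    return np.array([[fx, 0.0, cx],
                     [0.0, fy, cy],
                     [0.0, 0.0, 1.0]])

def project_point(p_world: np.ndarray, R: np.ndarray, t: np.ndarray, K: np.ndarray):
    """Project a 3D world point with external parameters (R, t) and internal parameters K.

    Returns pixel coordinates (u, v) and the depth along the camera's optical axis.
    """
    p_cam = R @ p_world + t          # world coordinates -> camera coordinates
    depth = p_cam[2]
    uvw = K @ p_cam                  # camera coordinates -> image plane
    return uvw[0] / uvw[2], uvw[1] / uvw[2], depth
```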
  • the viewpoint information of this group of physical cameras is used when determining the viewpoint information (camera parameters and image size) of the second group of virtual cameras used to generate depth images and texture images.
  • the first image processing device 20 generates a 3D model of the foreground object based on the multiple captured images input from the imaging system 10 and the viewpoint information for each physical camera.
  • the foreground object (subject) is, for example, a person or moving object within the imaging range of the imaging system 10.
  • the first image processing device 20 then generates viewpoint information for a second virtual camera group for generating depth images and texture images.
  • the first image processing device 20 generates and compresses depth images and texture images based on the generated viewpoint information for the second virtual camera group, and outputs these compressed images and the information (metadata) required to restore the 3D model to the second image processing device 30.
  • the metadata is, for example, the generated viewpoint information for the second virtual camera group.
  • the first image processing device 20 estimates shape information of the subject based on the multiple captured images and the viewpoint information for each physical camera. The shape information of the subject is estimated using, for example, the volume intersection method (visual hull).
  • Multiple virtual cameras (multiple first virtual cameras) are set in virtual space corresponding to the imaging system 10 existing in real space. These multiple first virtual cameras are referred to as a first virtual camera group.
  • This first virtual camera group is a reproduction of a physical camera group in virtual space.
  • the viewpoint information of the first virtual camera group corresponds to the viewpoint information of the physical camera group in real space.
  • shape information of the subject is generated by using the viewpoint information of the first virtual camera group and the multiple captured images in the visual hull intersection method.
  • a 3D point cloud (a set of points with three-dimensional coordinates) that represents the shape information of the subject is obtained.
  • This 3D point cloud is also called a visual hull.
  • the method of deriving shape information of the subject from captured images is not limited to this.
  • the method of expressing the shape information of the subject is not limited to a 3D point cloud, but can also be a mesh or voxel.
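For reference, the volume intersection (visual hull) computation described above can be sketched as carving a candidate 3D grid against the silhouettes of the subject. The sketch below is a simplified illustration, not the disclosed implementation; it assumes the hypothetical project_point helper from the earlier camera-parameter sketch and an illustrative candidate grid.

```python
import numpy as np

def carve_visual_hull(silhouettes, cameras, grid_points):
    """Keep only the candidate 3D points whose projection falls inside every silhouette.

    silhouettes : list of boolean foreground masks (H, W), one per first virtual camera
    cameras     : list of (R, t, K) tuples for the corresponding cameras
    grid_points : (N, 3) array of candidate points covering the capture volume
    Returns the surviving points as an (M, 3) array, i.e. the 3D point cloud (visual hull).
    """
    keep = np.ones(len(grid_points), dtype=bool)
    for mask, (R, t, K) in zip(silhouettes, cameras):
        h, w = mask.shape
        for i, p in enumerate(grid_points):
            if not keep[i]:
                continue
            u, v, depth = project_point(p, R, t, K)   # helper from the earlier sketch
            ui, vi = int(round(u)), int(round(v))
            # Carve away points that project behind the camera, outside the image,
            # or outside the silhouette.
            if depth <= 0 or not (0 <= ui < w and 0 <= vi < h) or not mask[vi, ui]:
                keep[i] = False
    return np.asarray(grid_points)[keep]
```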
  • a texture image (color information) is determined for each point in the generated 3D point cloud of the subject using multiple captured images. Therefore, 3D model data representing a 3D model of the subject includes shape information indicating the shape and color information indicating the color.
  • in this embodiment, the 3D model is generated from shape information and color information, but the configuration is not limited to this.
  • the 3D model includes shape information and may be managed as data separate from color information. Note that the method of setting a first virtual camera group in virtual space corresponding to a physical camera group in real space is well-known technology in the field of CG, etc., and therefore will not be described here.
  • the first image processing device 20 generates viewpoint information (position and orientation) of multiple second virtual cameras for generating depth images and texture images to be distributed.
  • the multiple second virtual cameras will be referred to as a second virtual camera group.
  • the position of this second virtual camera group is, for example, arranged on the optical axis of the first virtual camera group, and the orientation is set to be the same as the orientation of the first virtual camera. Since the viewpoint information of the first virtual camera group corresponds to the viewpoint information of the physical camera group in real space, the viewpoint information of the second virtual camera group is generated based on the viewpoint information of the physical camera group. The method for generating the viewpoint information of the second virtual camera group will be explained in detail later.
  • the first image processing device 20 generates depth images of the 3D model based on the viewpoint information of this second virtual camera group. One depth image is generated for each second virtual camera. Specifically, each point in the 3D point cloud of the subject is projected onto the same plane as the imaging plane of each second virtual camera. For each second virtual camera, the distance (depth) from that second virtual camera to the subject is calculated for each projected pixel, and a depth value is set for each pixel of the depth image. Furthermore, the first image processing device 20 captures images of the subject based on the viewpoint information of the second virtual camera group and generates texture images.
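The per-camera depth image generation just described can be sketched as a simple z-buffer over the 3D point cloud. The code below is an illustration only; it reuses the hypothetical project_point helper from the earlier sketch and treats 0 as the value for pixels where no subject appears.

```python
import numpy as np

def render_depth_image(points_3d, R, t, K, width, height):
    """Depth image for one second virtual camera: nearest camera-to-point depth per pixel."""
    depth_img = np.full((height, width), np.inf, dtype=np.float32)
    for p in points_3d:
        u, v, depth = project_point(p, R, t, K)
        ui, vi = int(round(u)), int(round(v))
        if depth > 0 and 0 <= ui < width and 0 <= vi < height:
            depth_img[vi, ui] = min(depth_img[vi, ui], depth)  # keep the closest point
    depth_img[np.isinf(depth_img)] = 0.0   # pixels onto which no point projects
    return depth_img
```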
  • the texture image is generated by blending the colors of multiple captured images while increasing the priority of pixel values of images captured by a physical camera with a line of sight close to the line of sight of the second virtual camera.
  • This method of setting a high priority (weight) for pixel values of images captured by a physical camera with a line of sight close to the line of sight of the second virtual camera when blending multiple captured images is referred to as a virtual viewpoint-dependent texture image generation method. Details of the virtual viewpoint-dependent texture image generation method will be described later.
  • this virtual viewpoint-dependent texture image is generated by selecting, depending on the position and orientation of the virtual camera, the physical camera image that determines the pixel values of the subject, so the color of the subject can change when the virtual camera moves.
  • a texture image generated using this method is referred to as a virtual viewpoint-dependent texture image.
  • virtual-viewpoint-independent texture images are generated in a way that the pixel values of the subject do not change depending on the position and orientation of the virtual camera.
  • texture image generation deals with virtual-viewpoint-dependent texture images, but texture images may also be generated in a virtual-viewpoint-independent manner.
  • An example of the process for generating virtual-viewpoint-dependent and virtual-viewpoint-independent texture images is shown below.
  • the process of generating a virtual viewpoint-dependent texture image includes, for example, a process of determining the visibility of points in the 3D point cloud that constitutes the subject, and a process of deriving colors based on the position and orientation of the virtual camera.
  • the physical camera that can capture each point is identified based on the positional relationship between each point in the 3D point cloud and the multiple physical cameras included in the group of physical cameras that the imaging system 10 has.
  • a point in the 3D point cloud is set as a focus point, and the color of that focus point is derived.
  • the following process is performed for each focus point. A focus point that is included in the imaging range of the second virtual camera is selected. Then, a first virtual camera that can capture that focus point and whose line of sight is close to the line of sight of the second virtual camera is selected.
  • since the first virtual camera is a reproduction of a physical camera in virtual space, selecting a first virtual camera is equivalent to selecting the corresponding physical camera.
  • the selected focus point is then projected onto the image captured by the selected physical camera.
  • the color of the pixel at the projection destination is set as the color of that focus point.
  • a physical camera is selected, for example, based on whether the angle between the line of sight from the second virtual camera to the focus point and the line of sight from the physical camera to the focus point is within a certain angle. If the point of interest can be captured by multiple physical cameras, multiple physical cameras with viewing directions close to the viewing direction of the second virtual camera are selected, and the point of interest is projected onto each of the images captured by those physical cameras.
  • the pixel values of the projection destination are then obtained, and a weighted average is calculated so that pixel values from physical cameras with viewing directions close to the second virtual camera are used preferentially, thereby determining the color of the point of interest.
  • this process is performed while changing the focus point, and the colors of the focus points are projected onto the same plane as the imaging plane of the second virtual camera, thereby generating a texture image that is dependent on the virtual viewpoint. Although the explanation here uses the second virtual camera, the process is not limited to it; when generating a texture image dependent on a virtual camera different from the second virtual camera, the above process is performed with that virtual camera in place of the second virtual camera.
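A minimal sketch of the viewpoint-dependent color derivation described above is given below. The angle threshold and the weighting scheme are illustrative assumptions; the colors of the physical cameras are assumed to have been sampled beforehand by projecting the focus point into each captured image.

```python
import numpy as np

def blend_viewpoint_dependent_color(point, virtual_cam_pos, physical_cams,
                                    max_angle_deg=30.0):
    """Color a focus point, preferring physical cameras whose line of sight to the point
    is close to the line of sight of the second virtual camera.

    physical_cams: list of (camera_position, sampled_color) pairs for the cameras that
    can capture the point, where sampled_color is the pixel at the point's projection.
    """
    v_dir = point - virtual_cam_pos
    v_dir = v_dir / np.linalg.norm(v_dir)
    colors, weights = [], []
    for cam_pos, color in physical_cams:
        c_dir = point - cam_pos
        c_dir = c_dir / np.linalg.norm(c_dir)
        angle = np.degrees(np.arccos(np.clip(np.dot(v_dir, c_dir), -1.0, 1.0)))
        if angle <= max_angle_deg:
            colors.append(np.asarray(color, dtype=np.float32))
            weights.append(max_angle_deg - angle + 1e-6)  # closer lines of sight weigh more
    if not colors:
        return None                                        # no suitable physical camera
    return np.average(np.stack(colors), axis=0, weights=np.asarray(weights))
```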
  • the process of generating a texture image independent of the virtual viewpoint includes, for example, the visibility determination process described above and a process of deriving a color that is independent of the position and orientation of the virtual camera.
  • a point in the 3D point cloud is set as the focus point, and that focus point is projected onto an image captured by a physical camera corresponding to a first virtual camera that can capture images, and the color of the pixel at the projection destination is set as the color of that focus point.
  • if the point of interest can be captured by multiple first virtual cameras, the point of interest is projected onto each of the images captured by the multiple physical cameras corresponding to those first virtual cameras.
  • the pixel values at the projection destination are then obtained and the average of the pixel values is calculated to determine the color of the point of interest. This process is performed while changing the point of interest, and the color of the point of interest is projected onto the same surface as the imaging surface of the second virtual camera, thereby generating a texture image that is independent of the virtual viewpoint.
  • the method of generating color information dependent on a virtual viewpoint is known as a method of generating images with higher image quality than those independent of a virtual viewpoint, because priority is given to the pixel values of the image captured by a physical camera with a line of sight close to that of the second virtual camera.
  • a virtual viewpoint-dependent virtual viewpoint image (texture image) generated by a virtual camera with a position and orientation similar to that of a physical camera is heavily influenced by the image captured by that physical camera, making it possible to generate a texture image with image quality close to that of the captured image.
  • a virtual viewpoint-dependent texture image generated from the viewpoint of a virtual camera whose position and orientation differ significantly from those of the physical cameras, or a texture image generated independently of the virtual viewpoint, is generated by interpolating pixel values from multiple physical cameras.
  • as a result, the image may contain pixel values that differ from the color information of the actual subject, or may have low contrast.
  • the first image processing device 20 compresses (encodes) the depth image and texture image using a video compression method such as H.264 or H.265.
  • the compression method is not limited to video compression, and any method that can encode to a size smaller than the original data volume, such as file compression, may be used.
  • the first image processing device 20 does not need to output depth images and texture images that do not show the subject to the second image processing device 30.
  • the second image processing device 30 back-projects the depth images into virtual space based on the viewpoint information of the second virtual camera group, making it possible to reconstruct the shape information of the 3D model.
  • by using the pixel value of the texture image at the same coordinates as each depth value of the depth image as the color information of the point to which that depth value is back-projected, it is possible to add color information to the shape information and reconstruct the 3D model.
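The back-projection just described might look like the following sketch, which converts each valid depth pixel of one second virtual camera back into a world-space point and attaches the texture color at the same coordinates. The camera convention matches the earlier projection sketch, and the names are assumptions.

```python
import numpy as np

def backproject_depth_and_texture(depth_img, texture_img, R, t, K):
    """Reconstruct a colored point cloud from one second virtual camera's depth/texture pair.

    R, t, K are that camera's external and internal parameters, with the same convention
    as the projection sketch (p_cam = R @ p_world + t). Returns (points, colors).
    """
    K_inv = np.linalg.inv(K)
    points, colors = [], []
    h, w = depth_img.shape
    for v in range(h):
        for u in range(w):
            d = depth_img[v, u]
            if d <= 0:                                   # 0 marks pixels with no subject
                continue
            p_cam = K_inv @ np.array([u, v, 1.0]) * d    # pixel -> camera coordinates
            p_world = R.T @ (p_cam - t)                  # camera -> world coordinates
            points.append(p_world)
            colors.append(texture_img[v, u])
    return np.asarray(points), np.asarray(colors)
```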
  • the first image processing device 20 may also extract a rectangular image of the subject's shooting range from the pre-compression depth image and texture image, compress this rectangular image (ROI image), and distribute it.
  • this rectangular image is referred to as an ROI (Region of Interest) image.
  • the metadata may include coordinate information for the cut-out rectangular image.
  • the first image processing device 20 may also arrange the rectangular images to form a single image, compress it, and distribute it. By distributing the rectangular image instead of the entire image, it is possible to reduce the amount of data.
  • the first image processing device 20 may generate depth images containing high-precision depth values, such as single-precision floating-point numbers (32 bits), which cannot be used for video compression using H.264 or H.265.
  • in that case, the depth information is converted to a bit depth that allows video compression (8 bits or 10 bits) before the video is compressed.
  • the conversion method may involve, for example, performing scalar quantization processing, and compressing and distributing the quantized depth image.
  • the minimum and maximum values of the depth value range before quantization may be included as metadata.
  • by using this metadata, the client can reconstruct the 3D model of the subject within the shooting range with a precision that could not be achieved using only the bit depth required for video compression.
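The scalar quantization and the client-side inverse described above could be sketched as follows. The 10-bit depth, the reserved code 0 for pixels without a subject, and the function names are assumptions made for illustration.

```python
import numpy as np

def quantize_depth(depth_f32: np.ndarray, bits: int = 10):
    """Quantize float depth values to 'bits'-bit integers so they fit a video codec.

    Returns the quantized image and the (min, max) depth range to be sent as metadata.
    Pixels with depth 0 (no subject) keep the reserved code 0.
    """
    valid = depth_f32 > 0
    d_min, d_max = float(depth_f32[valid].min()), float(depth_f32[valid].max())
    levels = (1 << bits) - 1                    # e.g. 1023 for 10 bits
    q = np.zeros(depth_f32.shape, dtype=np.uint16)
    q[valid] = np.round((depth_f32[valid] - d_min) / (d_max - d_min)
                        * (levels - 1)).astype(np.uint16) + 1
    return q, (d_min, d_max)

def dequantize_depth(q: np.ndarray, d_min: float, d_max: float, bits: int = 10):
    """Client-side inverse: recover approximate float depth using the metadata range."""
    levels = (1 << bits) - 1
    depth = np.zeros(q.shape, dtype=np.float32)
    valid = q > 0
    depth[valid] = (q[valid].astype(np.float32) - 1) / (levels - 1) * (d_max - d_min) + d_min
    return depth
```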
  • the second image processing device 30 receives and decodes the depth image, texture image, and metadata from the first image processing device 20 and restores the 3D model. As described above, the 3D model is restored by back-projecting the depth image and texture image into virtual space based on the second viewpoint information of the second virtual camera group contained in the metadata.
  • the second image processing device 30 also calculates viewpoint information (second viewpoint information) of a third virtual camera for generating a virtual viewpoint image viewed by the user based on input values received from the input device 40, which will be described later.
  • the second image processing device 30 then generates a virtual viewpoint image based on the calculated second viewpoint information and the restored 3D model.
  • the second image processing device 30 then outputs the generated virtual viewpoint image to the display device 50.
  • the input device 40 accepts input values used by the user to set the third virtual camera and transmits the input values to the second image processing device 30.
  • the input device 40 has input units such as a joystick, jog dial, touch panel, keyboard, and mouse.
  • the user setting the third virtual camera sets the position and orientation of the third virtual camera by operating the input unit.
  • the user sets the position and orientation of the third virtual camera, but this is not limited to this, and the position and orientation of the third virtual camera may also be set using position information of the 3D model.
  • the position information of the 3D model here may be generated by the first image processing device 20 on the distribution side, or may be generated by the second image processing device 30 on the receiving side.
  • the display device 50 displays the virtual viewpoint image generated and output by the second image processing device 30.
  • the user views the virtual viewpoint image displayed on the display device 50 and sets the position and orientation of the virtual camera for the next frame via the input device 40.
  • FIG. 1B is a diagram showing an example of the hardware configuration of the first image processing device 20 of this embodiment.
  • the hardware configuration of the second image processing device 30 is also similar to the configuration of the first image processing device 20 described below.
  • the first image processing device 20 of this embodiment is composed of a CPU 101, RAM 102, ROM 103, and communication unit 104.
  • the CPU 101 controls the entire first image processing device 20 using computer programs and data stored in the RAM 102 and ROM 103, thereby realizing each function of the first image processing device 20 shown in Figures 1A and 1B.
  • the first image processing device 20 may have one or more pieces of dedicated hardware different from the CPU 101, and at least some of the processing by the CPU 101 may be performed by the dedicated hardware. Examples of dedicated hardware include an ASIC (application-specific integrated circuit), an FPGA (field-programmable gate array), and a DSP (digital signal processor).
  • the RAM 102 temporarily stores programs and data supplied from the auxiliary storage device 214, as well as data supplied from the outside via the communication unit 104.
  • the ROM 103 stores programs that do not require modification.
  • the communication unit 104 is used for communication between the first image processing device 20 and an external device. For example, if the first image processing device 20 is connected to an external device via a wired connection, a communication cable is connected to the communication unit 104. If the first image processing device 20 has the function of communicating wirelessly with an external device, the communication unit 104 is equipped with an antenna.
  • FIG. 2 is a diagram showing an example of the functional configuration of the first image processing device 20 and the second image processing device 30. As shown in FIG.
  • the first image processing device 20 is composed of a shape information generation unit 201, a viewpoint determination unit 202, a depth image generation unit 203, a texture image generation unit 204, an encoding unit 205, and a distribution unit 206.
  • the shape information generation unit 201 uses the communication unit 104 to estimate shape information of the subject using the multiple captured images and viewpoint information of the physical cameras received from the imaging system 10.
  • the shape information is estimated using the volume intersection method described above. Therefore, the shape information generation unit 201 also generates viewpoint information of a first virtual camera group in virtual space that corresponds to the physical camera group in real space from the acquired viewpoint information of the physical cameras.
  • the shape information generation unit 201 outputs the estimated shape information to the viewpoint determination unit 202, depth image generation unit 203, and texture image generation unit 204.
  • the shape information generation unit 201 also outputs the received viewpoint information of the physical cameras and multiple captured images to the viewpoint determination unit 202 and texture image generation unit 204.
  • the viewpoint determination unit 202 determines the viewpoints of the second virtual camera group and generates viewpoint information, with the aim of improving the quality of the 3D model restored by the second image processing device 30.
  • the viewpoint determination unit 202 outputs the generated viewpoint information of the second virtual camera group to the depth image generation unit 203 and, via the depth image generation unit 203, to the texture image generation unit 204.
  • the viewpoint information of the second virtual camera group is generated, for example, by matching the position and orientation of the second virtual camera to those of the first virtual camera corresponding to the physical camera and then dolly-zooming the second virtual camera along the line of sight of the first virtual camera.
  • the second virtual camera, placed at the position of the first virtual camera corresponding to the physical camera, approaches the subject along the line of sight of the first virtual camera while adjusting the focal length (zoom) so that the size of the subject in the virtual viewpoint image generated from the second virtual camera is maintained.
  • This process of deriving the viewpoint information of the second virtual camera will be described in detail below using Figures 3A to 3C and 4. By performing this process while changing the physical camera, or while changing the subjects if there are multiple subjects in the captured image, it is possible to generate viewpoint information of the second virtual camera for each subject, for the number of physical cameras that captured the subject.
  • the viewpoint determination unit 202 may also change the image size (width and height) of the second virtual camera.
  • the image size of the second virtual camera is set to the same as the image size of the physical camera, and after determining the position and orientation of the second virtual camera as described above, the image size of the second virtual camera is changed. For example, if the width and height of the virtual viewpoint image of the second virtual camera are changed to 1/N of the width and height of the image captured by the physical camera, the focal length and image center internal parameters of the second virtual camera are also multiplied by 1/N.
  • if the image size of the physical camera is 4K (width 3840, height 2160) and the image size of the second virtual camera is changed to full HD (width 1920, height 1080), the width and height are each halved, so the focal length and image center internal parameters of the second virtual camera are also multiplied by 1/2. This allows the image size to be changed without changing the angle of view, reducing the amount of data sent from the server to the client.
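The 1/N scaling of the internal parameters described above amounts to the following small sketch (names are illustrative). For example, scale_intrinsics(f, cx, cy, 3840, 2160, 2) would give the full-HD equivalent of a 4K camera with an unchanged angle of view.

```python
def scale_intrinsics(focal_length: float, cx: float, cy: float,
                     width: int, height: int, n: float):
    """Change the image size to 1/n of the original without changing the angle of view.

    The focal length and image center are scaled by the same 1/n factor as the width
    and height.
    """
    return (focal_length / n, cx / n, cy / n, int(width / n), int(height / n))
```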
  • the depth image generation unit 203 performs the aforementioned depth image generation process based on the shape information of the subject input from the shape information generation unit 201 and the viewpoint information of the second virtual camera group input from the viewpoint determination unit 202.
  • the depth image generation unit 203 outputs the generated depth image and the viewpoint information of the second virtual camera group to the encoding unit 205.
  • the texture image generation unit 204 performs processing to generate the virtual viewpoint-dependent texture image described above based on input data from the shape information generation unit 201 and viewpoint information of the second virtual camera group input from the viewpoint determination unit 202 via the depth image generation unit 203.
  • the texture image generation unit 204 outputs the generated texture image to the encoding unit 205.
  • the virtual viewpoint-dependent texture image is generated by preferentially referencing the pixel values of the image captured by the physical camera that is closest in position and orientation to the second virtual camera. This makes it possible to generate high-quality texture images, and when the 3D model is restored on the client, a 3D model containing high-quality color information can be reproduced.
  • the texture image generation unit 204 may also generate a valid pixel map based on input data from the shape information generation unit 201 and viewpoint information of the second virtual camera group input from the viewpoint determination unit 202 via the depth image generation unit 203.
  • the texture image generation unit 204 may output the generated valid pixel map to the encoding unit 205.
  • the valid pixel map will be described in detail in Example 2.
  • the encoding unit 205 acquires the depth images and viewpoint information of the second virtual camera group input from the depth image generation unit 203, and the texture images input from the texture image generation unit 204.
  • the encoding unit 205 compresses the depth images and texture images using the compression method described above, and outputs the compressed image group and viewpoint information (metadata) of the second virtual camera group to the distribution unit 206.
  • the encoding unit 205 may compress not only depth images and texture images, but also metadata. Compression methods include the file compression methods described above.
  • the encoding unit 205 may compress the effective pixel map input from the texture image generation unit 204 and output it to the distribution unit 206.
  • the distribution unit 206 uses the communication unit 104 to transmit the compressed depth image, texture image, and viewpoint information of the second virtual camera group input from the encoding unit 205 to the receiving unit 207, which will be described later.
  • the second image processing device 30 is composed of a receiving unit 207, a decoding unit 208, a 3D model restoration unit 209, a virtual camera control unit 210, and a virtual viewpoint image generation unit 211.
  • the receiving unit 207 uses the communication unit 104 to receive the compressed depth image, compressed texture image, and viewpoint information (metadata) of the second virtual camera group from the distribution unit 206, and outputs them to the decoding unit 208.
  • the decoding unit 208 decodes the compressed depth images and compressed texture images acquired from the receiving unit 207 and outputs them to the 3D model restoration unit 209 along with viewpoint information of the second virtual camera group.
  • the decoding unit 208 may also decode metadata in addition to the depth images and texture images.
  • the decoding unit 208 may decode an effective pixel map and output it to the 3D model restoration unit 209.
  • the 3D model restoration unit 209 restores a 3D model using the restoration method described above, based on the decoded depth image and decoded texture image acquired from the decoding unit 208, and the viewpoint information of the second virtual camera group.
  • the restored 3D model is output to the virtual viewpoint image generation unit 211.
  • the 3D model restoration unit 209 may generate color information for the 3D model using the effective pixel map acquired from the decoding unit 208. The method for using the effective pixel map will be explained in detail in Example 2.
  • the virtual camera control unit 210 generates viewpoint information of a third virtual camera for generating a virtual viewpoint image from input values that the user enters via the input device 40 and that are received through the communication unit 104, and outputs the viewpoint information of the third virtual camera to the virtual viewpoint image generation unit 211.
  • the virtual camera control unit 210 may also output the generated viewpoint information of the user-specified third virtual camera to the 3D model restoration unit 209.
  • the virtual viewpoint image generation unit 211 generates a virtual viewpoint image based on the 3D model acquired from the 3D model restoration unit 209 and the viewpoint information of the third virtual camera acquired from the virtual camera control unit 210.
  • the virtual viewpoint image is generated by placing a 3D model of the subject, a 3D model of the background object, and the third virtual camera in a virtual space, and generating an image seen from the third virtual camera.
  • the 3D model of the background object is, for example, a CG (Computer Graphics) model created separately to be combined with the subject, and is created in advance and stored in the second image processing device 30 (for example, stored in ROM 103 in Figure 1B).
  • the 3D model of the subject and the 3D model of the background object are rendered using an existing CG rendering method.
  • the virtual viewpoint image generation unit 211 transmits the generated virtual viewpoint image to the display device 50.
  • FIGS. 3A to 3C are schematic diagrams for explaining an example of a method for generating viewpoint information of a second virtual camera group.
  • FIG. 3A shows a group of physical cameras 302 and a subject 301 arranged in real space.
  • the subject 301 is photographed by the group of physical cameras 302.
  • a straight line 303 indicates the optical axis of each of the group of physical cameras 302.
  • FIG. 3B shows a diagram in which a first virtual camera group 305 corresponding to the physical camera group 302 is set and a 3D model 304 of the subject is generated by the first virtual camera group 305.
  • the first image processing device 20 generates the 3D model 304 of the subject using multiple captured images acquired from the physical camera group 302.
  • a first virtual camera group 305 is generated in virtual space corresponding to the physical camera group 302 in real space.
  • the first virtual camera group 305 is a virtual space reproduction of the physical camera group 302 in real space. Therefore, the optical axis 306 of the first virtual camera group 305 corresponds to the optical axis 303 of the physical camera group 302.
  • the first image processing device 20 then generates the 3D model 304 of the subject 301 using viewpoint information from the generated first virtual camera group 305.
  • 3C is a diagram showing the second virtual camera group 307 set based on the viewpoint information of the first virtual camera group 305.
  • the second virtual camera group 307 is set on the optical axis 306 of the first virtual camera group 305. However, this is not limited to this, and the second virtual camera group 307 may be set at a position near the optical axis 306.
  • the second virtual camera group 307 is also set at a position closer to the 3D model 304 than the first virtual camera group 305.
  • one possible way to set the second virtual camera group 307 used to generate depth images and texture images for distribution is to place a bounding box that encompasses the 3D model 304 and to arrange the second virtual camera group 307 on a spherical surface surrounding the bounding box so that the cameras face the subject.
  • in that case, however, the 3D model restored by the second image processing device 30 may differ significantly in shape from the 3D model 304 before distribution.
  • increasing the number of second virtual cameras in an attempt to accurately restore the 3D model 304 increases the number of depth images and texture images to be distributed. Therefore, if a user attempts to restore the 3D model in an environment with low transmission bandwidth or on a local terminal with low processing performance, the frame rate may decrease. It is desirable for the 3D model displayed on the display device 50 to have higher image quality and a smooth frame rate, such as 60 fps, to provide the user with a high sense of realism.
  • the first image processing device 20 therefore generates viewpoint information for the second virtual camera group 307 based on the viewpoint information of the physical camera group 302 described above. That is, the second virtual camera group 307 is first arranged at the positions of the first virtual camera group 305 corresponding to the physical camera group 302. The positions of the second virtual camera group 307 are then set by dolly-zooming each second virtual camera along the optical axis 306 of the corresponding first virtual camera so as to approach the 3D model 304 of the subject 301 while maintaining the size of the subject on the screen.
  • the position to which the second virtual camera group 307 is dolly-zoomed is predetermined. For example, dolly-zooming may be performed until the distance from the 3D model 304 reaches a predetermined value. Alternatively, dolly-zooming may be performed until the size of the 3D model 304 projected on the imaging surface of the second virtual camera group 307 reaches a predetermined size. Note that the method for determining the viewpoint information of the second virtual camera group 307 is not limited to this method. Alternatively, the second virtual camera group 307 may simply be positioned near the optical axis 306 of the first virtual camera group 305, and internal parameters such as focal length and image size may be determined manually.
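A dolly-zoom placement of a second virtual camera, as described above, can be sketched with the pinhole relation that the projected size of the subject is proportional to the focal length divided by the camera-to-subject distance. The code below is an illustration under that assumption, not the disclosed procedure; the target distance corresponds to the predetermined value mentioned above.

```python
import numpy as np

def dolly_zoom_along_axis(cam_pos, optical_axis_dir, subject_center,
                          focal_length, target_distance):
    """Move a virtual camera along its optical axis toward the subject while keeping the
    subject's size on the image plane constant.

    Projected size is proportional to focal_length / distance, so scaling the focal
    length by (new_distance / old_distance) preserves the subject's on-screen size.
    """
    axis = optical_axis_dir / np.linalg.norm(optical_axis_dir)    # unit vector toward subject
    old_distance = float(np.dot(subject_center - cam_pos, axis))  # distance along the axis
    new_pos = subject_center - axis * target_distance             # stay on the same optical axis
    new_focal_length = focal_length * (target_distance / old_distance)
    return new_pos, new_focal_length
```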
  • the viewpoint information of the second virtual camera group 307 may be determined based on the viewpoint information of the physical camera group 302.
  • the physical camera group 302 is arranged around the subject 301 in the number required to estimate the shape of the subject 301 using, for example, the volume intersection method described above. Therefore, by sending depth images as observed from the viewpoints of the physical camera group 302, the user can accurately restore the shape information of the shape-estimated 3D model 304. Furthermore, by delivering virtual viewpoint-dependent texture images with image quality similar to that of the physical camera group 302, the user can restore high-quality color information, allowing the user to play back a high-quality 3D model.
  • Figure 4 is a schematic diagram in which a first virtual camera 401 captures a 3D model 402 of a subject and generates viewpoint information for a second virtual camera 403 based on the viewpoint information of the first virtual camera 401.
  • the second virtual camera 403 moves along the optical axis of the first virtual camera 401 while dolly zooming to approach the 3D model 402.
  • the second virtual camera 403 is then positioned at a position where it can capture the 3D model 402 within an area 404 that encompasses the 3D model 402 and allows the distance information (depth information) between the 3D model and the second virtual camera 403 to be expressed in 10 bits.
  • the area 404 is not limited to a cube, and may be a sphere centered on the 3D model 402.
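The statement that the distance information within the area 404 can be expressed in 10 bits can be pictured with the following hedged sketch of depth quantization. The linear mapping, the function names, and the use of 0 as a background value are assumptions for illustration, not details taken from the embodiment.

```python
import numpy as np

def quantize_depth(depth_map, near, far, bits=10):
    """Quantize metric depth values inside [near, far] into integer levels.

    depth_map : 2D array of distances from the second virtual camera to the 3D model,
                with values outside [near, far] treated as "no subject".
    near, far : extent of the area (e.g. area 404) along the viewing direction.
    bits      : bit depth of the delivered depth image (10 bits -> 1024 levels).
    """
    levels = (1 << bits) - 1
    inside = (depth_map >= near) & (depth_map <= far)
    normalized = (depth_map - near) / (far - near)            # 0.0 at near plane, 1.0 at far plane
    quantized = np.zeros_like(depth_map, dtype=np.uint16)
    quantized[inside] = np.round(normalized[inside] * levels).astype(np.uint16)
    return quantized

def dequantize_depth(quantized, near, far, bits=10):
    """Inverse mapping used on the receiving side before back-projection."""
    levels = (1 << bits) - 1
    return near + (quantized.astype(np.float32) / levels) * (far - near)
```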
  • the first virtual camera 401 captures the 3D model 402 and generates a captured image 405.
  • the first image processing device 20 estimates the 3D model 402 based on the captured images 405 captured by the physical cameras of the imaging system 10.
  • the estimated 3D model 402 is then photographed by a second virtual camera 403 to generate a depth image 406 and a texture image 407.
  • the size of the 3D model 402 depicted in the depth image 406 and the texture image 407 is the same as the size of the 3D model 402 depicted in the photographed image 405.
  • the determination of the viewpoint information of the second virtual camera 403 is not limited to dolly zoom, and the size of the 3D model 402 depicted in the depth image 406 may be different from the size of the 3D model 402 depicted in the photographed image 405.
  • Although the second virtual camera 403 has been described as moving on the optical axis of the first virtual camera 401, the position and orientation of the second virtual camera 403 need only be close to the position and orientation of the first virtual camera 401, and the second virtual camera 403 does not necessarily have to move on the optical axis of the first virtual camera 401.
  • the determination of the viewpoint information of the second virtual camera 403 may be performed once when the second image processing device 30 is initialized, or may be performed for each frame in accordance with the movement of the 3D model 402.
  • Fig. 5 is a flowchart showing the flow of processing for controlling the compression and distribution of 3D model data in the first image processing device 20 according to this embodiment.
  • the flow shown in Fig. 5 is realized by reading a control program stored in the ROM 103 into the RAM 102 and executing it by the CPU 101. Execution of the flow in Fig. 5 is triggered when the shape information generation unit 201 receives a plurality of captured images and viewpoint information of the physical cameras from the imaging system 10.
  • the shape information generation unit 201 estimates and generates shape information of the subject based on multiple captured images.
  • the generated shape information, viewpoint information of the physical camera group, and multiple captured images are output to the viewpoint determination unit 202 and texture image generation unit 204.
  • the generated shape information is also output to the depth image generation unit 203.
  • The viewpoint information of the first virtual camera group corresponding to the physical camera group used to generate the 3D model of the subject is assumed to have been generated in advance. For example, when the physical camera group is placed in preparation before filming begins, a first virtual camera group corresponding to the placed physical camera group is generated. Shape information of the subject is generated using the viewpoint information of this first virtual camera group and the multiple captured images.
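Since the first virtual camera group simply reproduces the calibrated physical cameras in virtual space, its viewpoint information can be thought of as a copy of the physical cameras' camera parameters and image sizes. The following data-structure sketch is purely illustrative; the field names are assumptions and are reused by the later sketches in this section.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ViewpointInfo:
    """Viewpoint information for one camera (physical or virtual)."""
    rotation: np.ndarray      # (3, 3) extrinsic rotation matrix (world -> camera)
    position: np.ndarray      # (3,) camera position in world coordinates
    focal_length: float       # intrinsic focal length in pixels
    principal_point: tuple    # (cx, cy) image center in pixels
    image_size: tuple         # (width, height) in pixels

def first_virtual_cameras_from_physical(physical_viewpoints):
    """The first virtual camera group reproduces the calibrated physical cameras
    in virtual space, so the viewpoint information is simply copied."""
    return [ViewpointInfo(v.rotation.copy(), v.position.copy(),
                          v.focal_length, v.principal_point, v.image_size)
            for v in physical_viewpoints]
```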
  • the viewpoint determination unit 202 generates viewpoint information for the second virtual camera group based on the viewpoint information for the physical camera group.
  • the generated viewpoint information for the second virtual camera group is output to the depth image generation unit 203 and to the texture image generation unit 204 via the depth image generation unit 203.
  • the process of generating the viewpoint information of the second virtual camera group is explained in Figure 7. Note that the generated viewpoint information of the second virtual camera group may be output to the texture image generation unit 204 without going through the depth image generation unit 203.
  • the depth image generation unit 203 generates a depth image of the 3D model based on the data acquired from the shape information generation unit 201 and the viewpoint determination unit 202.
  • the generated depth image is output to the encoding unit 205.
  • the texture image generation unit 204 generates a texture image of the 3D model based on the data acquired from the shape information generation unit 201 and the viewpoint determination unit 202.
  • the generated texture image is output to the encoding unit 205.
  • the encoding unit 205 encodes the depth image and texture image acquired from the depth image generation unit 203 and texture image generation unit 204.
  • the encoded data is output to the distribution unit 206.
  • the distribution unit 206 distributes 3D model data including the data acquired from the depth image generation unit 203 and the encoding unit 205 and the viewpoint information of the second virtual camera group, and this flow ends.
  • In this embodiment, the 3D model data is distributed to the second image processing device 30, but the distribution destination is not limited to this.
  • the 3D model data may also be distributed to a separate server that stores it.
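The per-frame flow of Fig. 5 described above can be summarized with the following hedged pseudocode sketch of the first image processing device 20. All callables are placeholders standing in for the units described in the text, not actual APIs.

```python
def compress_and_distribute_frame(captured_images, physical_viewpoints,
                                  estimate_shape, determine_viewpoints,
                                  render_depth, render_texture, encode, distribute):
    """One frame of the server-side flow sketched in Fig. 5 (all callables are placeholders)."""
    # Estimate the subject's shape information from the multiple captured images.
    shape_info = estimate_shape(captured_images, physical_viewpoints)

    # Derive the second virtual camera group from the physical camera viewpoints.
    second_viewpoints = determine_viewpoints(shape_info, physical_viewpoints)

    # Generate one depth image and one texture image per second virtual camera.
    depth_images = [render_depth(shape_info, vp) for vp in second_viewpoints]
    texture_images = [render_texture(shape_info, captured_images, physical_viewpoints, vp)
                      for vp in second_viewpoints]

    # Encode the images.
    encoded = encode(depth_images, texture_images)

    # Distribute the encoded images together with the second virtual cameras' viewpoint information.
    distribute(encoded, second_viewpoints)
```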
  • FIG. 6 is a flowchart showing the flow of processing for generating a virtual viewpoint image using a depth image and a texture image in the second image processing device 30 according to this embodiment. Execution of the flow in FIG. 6 is triggered when the virtual camera control unit 210 receives an input value from the input device 40.
  • the virtual camera control unit 210 generates viewpoint information of a third virtual camera designated by the user based on input values from the input device 40.
  • the generated viewpoint information of the third virtual camera is output to the virtual viewpoint image generation unit 211.
  • the receiving unit 207 receives the 3D model data distributed from the distribution unit 206.
  • the received 3D model data is output to the decoding unit 208.
  • the decoding unit 208 decodes the depth images and texture images included in the 3D model data acquired from the receiving unit 207. Furthermore, if the viewpoint information of the second virtual camera group has also been encoded, the viewpoint information of the second virtual camera group is also decoded. The decoded depth images and texture images, as well as the viewpoint information of the second virtual camera group, are then output to the 3D model restoration unit 209.
  • the 3D model restoration unit 209 restores a 3D model of the subject based on the decoded depth image and texture image acquired from the decoding unit 208 and the viewpoint information of the second virtual camera group.
  • the restored 3D model is output to the virtual viewpoint image generation unit 211.
  • the 3D model restoration process is explained in Figure 8.
  • the virtual viewpoint image generation unit 211 generates a virtual viewpoint image based on the viewpoint information of the third virtual camera acquired from the virtual camera control unit 210 and the 3D model acquired from the 3D model restoration unit 209, and this flow ends.
  • the generated virtual viewpoint image is sent to the display device 50 and displayed on the display device 50.
  • <Description of the Process for Generating Viewpoint Information of the Second Virtual Camera Group> FIG. 7 is an example of a flowchart showing the flow of processing for generating viewpoint information of the second virtual camera group according to this embodiment.
  • Here, the method for generating the viewpoint information of the second virtual camera group is described as one based on the dolly zoom of the second virtual camera explained with reference to FIG. 4. That is, the second virtual camera moves along the optical axis of the first virtual camera, from the position of the first virtual camera corresponding to the position of the physical camera, in a direction approaching the 3D model of the subject. Furthermore, the second virtual camera adjusts its focal length while maintaining the size of the subject on its imaging plane.
  • the flow shown in FIG. 7 is executed by the viewpoint determination unit 202. Execution of FIG. 7 is triggered by the reception of a 3D model of the subject, viewpoint information of the physical camera group, and multiple captured images from the shape information generation unit 201.
  • the flow in FIG. 7 provides a detailed explanation of the control for determining the viewpoint of the second virtual camera group based on the viewpoint information of the physical camera group in S502 of FIG. 5.
  • viewpoint information of the first virtual camera group and a 3D model of the subject have been generated by the first image processing device 20.
  • a 3D model of the subject, viewpoint information of the first virtual camera group, and multiple captured images are acquired.
  • viewpoint information of the physical camera group may be acquired without acquiring viewpoint information of the first virtual camera group. In that case, viewpoint information of the first virtual camera group is generated using the viewpoint information of the physical camera group.
  • S703 and S704 are repeated for each physical camera.
  • S704 is repeated for each subject included in the multiple captured images.
  • Subject recognition is performed based on the results of a face detection algorithm, person detection algorithm, etc.
  • the area of each subject in the captured images can be identified by projecting a 3D model generated separately for each subject onto the same plane as the imaging surface of the physical camera using the viewpoint information of the physical camera.
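A minimal sketch of the projection mentioned above (projecting a subject's 3D model onto the imaging plane of a physical camera to identify the subject's area) is shown below. It assumes a simple pinhole model without lens distortion and reuses the illustrative ViewpointInfo fields sketched earlier; the helper names are assumptions.

```python
import numpy as np

def project_points(points, rotation, position, focal_length, principal_point):
    """Project 3D points (N, 3) onto the imaging plane with a pinhole model
    (lens distortion ignored). Returns pixel coordinates (N, 2) and depths (N,)."""
    cam = (rotation @ (points - position).T).T            # world -> camera coordinates
    depths = cam[:, 2]
    u = focal_length * cam[:, 0] / depths + principal_point[0]
    v = focal_length * cam[:, 1] / depths + principal_point[1]
    return np.stack([u, v], axis=1), depths

def subject_mask(points, viewpoint, image_size):
    """Rasterize the projected points of one subject's 3D model into a binary mask
    marking the subject's area in the captured image."""
    (w, h) = image_size
    mask = np.zeros((h, w), dtype=np.uint8)
    pixels, depths = project_points(points, viewpoint.rotation, viewpoint.position,
                                    viewpoint.focal_length, viewpoint.principal_point)
    valid = depths > 0
    px = np.round(pixels[valid]).astype(int)
    inside = (px[:, 0] >= 0) & (px[:, 0] < w) & (px[:, 1] >= 0) & (px[:, 1] < h)
    mask[px[inside, 1], px[inside, 0]] = 1
    return mask
```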
  • viewpoint information for the second virtual camera is generated based on the 3D model and viewpoint information for the first virtual camera.
  • the viewpoint information for the second virtual camera is generated using the method described with reference to Figure 4. The above process is repeated as described in S702 and S703 to generate viewpoint information for the second virtual camera group.
  • the viewpoint information of the generated second virtual camera group is sent to the depth image generation unit 203 and the texture image generation unit 204.
  • FIG. 8 is an example of a flowchart showing the flow of a 3D model restoration process according to this embodiment.
  • the 3D model restoration process is described based on a method for generating a 3D model including virtual viewpoint-dependent color information.
  • the pixel values of the texture image of the second virtual camera group, which are closest to the position and orientation of the user-specified third virtual camera, are given priority as the color information for the 3D model.
  • the flow shown in FIG. 8 is executed by the 3D model restoration unit 209.
  • the flow in FIG. 8 provides a detailed description of the control for restoring a 3D model of a subject based on the decoded data in FIG. 6.
  • the 3D model restoration unit 209 acquires a depth image, a texture image, and viewpoint information of the second virtual camera group.
  • the 3D model restoration unit 209 projects the depth values of each pixel in the depth image into virtual space based on the external and internal parameters of the virtual camera corresponding to the depth image, and generates components of the shape information of the subject. For example, if the 3D model of the subject is represented by a 3D point cloud, each point becomes a component. By performing the above process on all of the multiple acquired depth images, the shape information of the subject is restored.
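The back-projection of a depth image into components of the shape information can be sketched as follows. The camera convention (x_cam = R (X - C)), the use of 0 as a background depth value, and the function name are assumptions for illustration only.

```python
import numpy as np

def backproject_depth_image(depth_map, rotation, position, focal_length, principal_point):
    """Turn each valid pixel of a depth image into a 3D point (a component of the
    subject's shape information). depth_map holds metric distances along the optical
    axis; pixels with value 0 are treated as background."""
    h, w = depth_map.shape
    v, u = np.mgrid[0:h, 0:w]
    valid = depth_map > 0
    z = depth_map[valid]
    x = (u[valid] - principal_point[0]) * z / focal_length
    y = (v[valid] - principal_point[1]) * z / focal_length
    cam_points = np.stack([x, y, z], axis=1)               # points in camera coordinates
    # camera -> world: X = R^T x_cam + C (row-vector form: x_cam @ R + C)
    world_points = cam_points @ rotation + position
    return world_points
```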
  • the pixel values of the texture image of the virtual camera with the closest position and orientation to the user-specified virtual camera are prioritized and determined as the color information for the components. This process is repeated for all components to restore the color information corresponding to the shape information. For example, if the shape information is a point cloud and the components are each point in the point cloud, the color information corresponding to all points is restored.
  • the restored 3D model is output to the virtual viewpoint image generation unit 211.
  • viewpoint information for the second virtual camera group is generated based on viewpoint information from the physical camera group, and a depth image and a texture image are generated using the generated viewpoint information for the second virtual camera group.
  • This type of processing makes it possible to restore a high-quality 3D model on the receiving side without increasing the amount of data, even for 3D models with complex shapes.
  • the compressed 3D model data may be stored, or may be delivered in response to a client request.
  • Although the example described uses a method in which a 3D model containing virtual-viewpoint-dependent color information is generated when the 3D model is restored, this is not limiting.
  • color information for the 3D model may be generated simultaneously with the generation of a virtual viewpoint image. In this case, it is not necessary to assign color information to all of the 3D model's shape information; color information only needs to be generated for the shooting range (viewing angle) of the user-specified third virtual camera.
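Restricting color generation to the shooting range of the third virtual camera could, for example, amount to keeping only the components that project inside that camera's image, as in the following hedged sketch (reusing the illustrative project_points helper and ViewpointInfo fields from the earlier sketches).

```python
import numpy as np

def components_in_view(points, third_viewpoint):
    """Return a boolean mask of components that project inside the third virtual
    camera's image (i.e. lie within its shooting range), so that color information
    only needs to be generated for these components."""
    (w, h) = third_viewpoint.image_size
    pixels, depths = project_points(points, third_viewpoint.rotation,
                                    third_viewpoint.position,
                                    third_viewpoint.focal_length,
                                    third_viewpoint.principal_point)
    return (depths > 0) & (pixels[:, 0] >= 0) & (pixels[:, 0] < w) \
                        & (pixels[:, 1] >= 0) & (pixels[:, 1] < h)
```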
  • <Example 2> In Example 1, a process was described in which viewpoint information of a second virtual camera group is generated based on viewpoint information of a physical camera group, and a depth image and a texture image are generated using the generated viewpoint information of the second virtual camera group. Next, Example 2 will be described, which illustrates an aspect in which an effective pixel map is generated in addition to a depth image and a texture image to deal with cases in which a subject is occluded by another subject or object. Note that the description of parts common to Example 1, such as the hardware configuration and functional configuration of the image processing device, will be omitted or simplified.
  • FIG. 9 is a schematic diagram showing an example of a method for generating an effective pixel map according to this embodiment.
  • a subject is photographed using a physical camera, and a captured image 903 is generated.
  • the subject in the captured image 903 has part of its body hidden by an obstruction.
  • Here, the obstructing object is also treated as a subject.
  • the first image processing device 20 estimates the shapes of the subject and obstructing object using multiple captured images taken by the physical cameras of the imaging system 10, and generates shape information for each. That is, it generates a 3D model 901 of the subject and a 3D model 904 of the obstructing object.
  • the first image processing device 20 generates viewpoint information for the second virtual camera group described in Example 1 based on data such as viewpoint information and shape information from the physical camera group.
  • the second virtual camera 905 is positioned on the optical axis of the first virtual camera 902, within an area 906 that encompasses the 3D model 901 of the subject and is within a range in which the depth of the subject can be expressed in 10 bits.
  • the focal length of the second virtual camera 905 is adjusted so that the size of the subject 901 as viewed from the first virtual camera 902 is maintained.
  • After generating the viewpoint information for the second virtual camera 905, the first image processing device 20 generates a depth image 907 and a texture image 908.
  • When the texture image 908 is generated, the pixel values of the captured image corresponding to the first virtual camera 902 are used preferentially for the area that is not occluded by the 3D model 904 of the obstructing object (the left side of the subject).
  • For the area occluded by the 3D model 904 of the obstructing object, the pixel values of a captured image corresponding to a first virtual camera with a viewpoint different from that of the first virtual camera 902 are used. Therefore, there is a possibility that the image quality of the right and left sides of the subject appearing in the texture image 908 will differ significantly. If the color information of the 3D model is restored using this texture image 908, the image quality of parts of the 3D model will be reduced, and those parts will stand out, which may cause the user to feel uncomfortable. Therefore, an effective pixel map 909 is generated that indicates the high-quality areas of the texture image.
  • the effective pixel map assigns pixel values of 1 to unobstructed areas and 0 to obstructed areas.
  • the image size of the effective pixel map 909 is the same as the image size of the texture image 908.
  • the determination of whether or not an area is an occluded area is made, for example, based on whether each point constituting the shape information of the 3D model 901 of the subject is visible to the first virtual camera 902 using the visibility determination process described above. In other words, due to the visibility determination process, pixel values of the area of the occluding object 904 captured in the captured image 903 are not used as color information of the 3D model 901 of the subject.
  • Then, the region of the texture image that was determined using pixel values of the captured image of the physical camera from which the viewpoint information of the second virtual camera 905 was generated is identified, and in the effective pixel map, the pixel values of the identified region are set to 1 and those of the other regions are set to 0.
  • the method of determining an occluded area is not limited to this.
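A hedged sketch of building such a binary effective pixel map from a per-point visibility flag is shown below. It reuses the illustrative project_points helper from the earlier sketch and assumes that the visibility determination result for the paired first virtual camera is available as a boolean array; all names are assumptions.

```python
import numpy as np

def effective_pixel_map(points, visible_from_first_camera, second_viewpoint, image_size):
    """Build a binary effective pixel map (same size as the texture image).

    points                    : (N, 3) components of the subject's 3D model
    visible_from_first_camera : (N,) boolean result of the visibility determination
                                for the first virtual camera paired with this second camera
    second_viewpoint          : viewpoint information of the second virtual camera
    """
    (w, h) = image_size
    emap = np.zeros((h, w), dtype=np.uint8)
    pixels, depths = project_points(points, second_viewpoint.rotation,
                                    second_viewpoint.position,
                                    second_viewpoint.focal_length,
                                    second_viewpoint.principal_point)
    px = np.round(pixels).astype(int)
    inside = (depths > 0) & (px[:, 0] >= 0) & (px[:, 0] < w) & (px[:, 1] >= 0) & (px[:, 1] < h)
    keep = inside & visible_from_first_camera
    emap[px[keep, 1], px[keep, 0]] = 1     # 1 = texture pixel backed by an unoccluded view
    return emap
```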
  • the second image processing device 30 prioritizes pixel values of areas of the texture image corresponding to areas with a value of 1 in the effective pixel map and uses them as color information of the 3D model.
  • the second image processing device 30 uses pixel values of the texture image corresponding to 1 in the effective pixel map and does not use pixel values of the texture image corresponding to 0 in the effective pixel map.
  • However, if the shape information to which color information is to be assigned is captured only in texture images corresponding to 0 in the effective pixel map, the pixel values of those texture images are used.
  • Although the effective pixel map has been described so far as having binary pixel values of 0 or 1, it can also be multi-valued. In this case, the pixel values of the effective pixel map can be used to weight the color information generated for the 3D model.
  • the priority of the pixel values of the texture image is determined according to the pixel values of the effective pixel map.
  • For example, the pixel values at the subject's outline or at the boundary with an obstructing object can be set to 0 and made to increase linearly up to 255 over a certain distance (e.g., 5 px) inward from the outline or boundary toward the interior of the subject. This reduces the influence of unreliable pixel values of the texture image at the outline or at the boundary with an obstructing object when restoring the color information of the 3D model, allowing the color information of the 3D model to be generated from highly reliable pixel values.
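One way to realize the linear ramp described above is a distance transform over the binary map, as in the following sketch. The use of scipy's Euclidean distance transform and the exact ramp shape are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def graded_effective_pixel_map(binary_map, ramp_px=5):
    """Convert a 0/1 effective pixel map into an 8-bit weight map.

    Valid pixels on the outline of the valid region (next to background or to an
    occluded area) get weight 0; the weight rises linearly to 255 once a pixel is
    about `ramp_px` pixels inside the valid region."""
    # Distance (in pixels) from each valid pixel to the nearest invalid pixel.
    dist = distance_transform_edt(binary_map.astype(bool))
    weights = np.clip((dist - 1.0) / ramp_px, 0.0, 1.0) * 255.0
    return weights.astype(np.uint8)
```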
  • FIG. 10 is a flowchart showing the flow of processing for controlling the compression and distribution of 3D model data in the first image processing device 20 according to this embodiment. Execution of the flow in FIG. 10 begins when the shape information generation unit 201 receives multiple captured images and viewpoint information from the physical cameras from the imaging system 10.
  • S1001 to S1003 are the same as S501 to S503 in Figure 5.
  • The texture image generation unit 204 generates a texture image and an effective pixel map of the foreground model based on the data acquired from the shape information generation unit 201 and the viewpoint determination unit 202.
  • The generated texture image and effective pixel map are output to the encoding unit 205.
  • The encoding unit 205 encodes the depth image, texture image, and effective pixel map acquired from the depth image generation unit 203 and the texture image generation unit 204.
  • The encoded depth image, texture image, and effective pixel map are output to the distribution unit 206.
  • the distribution unit 206 transmits 3D model data including the depth image, texture image, effective pixel map, and viewpoint information of the second virtual camera group acquired from the depth image generation unit 203 and encoding unit 205 to the reception unit 207, and this flow ends.
  • FIG. 11 is an example of a flowchart showing the flow of the 3D model restoration process according to this embodiment.
  • the flow in FIG. 11 is executed by the 3D model restoration unit 209.
  • the flow in FIG. 11 provides a detailed explanation of the control for restoring a 3D model of a subject based on the data decoded in S604 in FIG. 6.
  • the depth value of each pixel in the depth image is projected into virtual space based on the external parameters and internal parameters of the second virtual camera corresponding to the depth image, and shape information of the subject is restored.
  • The pixel values of the texture image of the second virtual camera that is closest in position and orientation to the user-specified third virtual camera are given priority as color information.
  • the priority of the pixel values of the texture image is determined according to the pixel values of the effective pixel map that correspond to the pixel values of the texture image. If the shape information is represented by a 3D point cloud, this process is repeated for all points in the point cloud to generate color information corresponding to the shape information.
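Combining the two criteria described above (closeness of the second virtual camera to the user-specified third virtual camera, and the reliability recorded in the effective pixel map), color selection for one component could look like the following sketch. Here the map is treated as a simple reliability flag; with a multi-valued map, its value could instead weight a blend of candidate colors. The function and variable names are illustrative, and project_points is the helper from the earlier sketch.

```python
import numpy as np

def pick_component_color(point, third_cam_position, second_viewpoints,
                         texture_images, effective_maps):
    """Choose color information for one component (3D point) of the restored model.

    Cameras are ranked by closeness of their position to the user-specified third
    virtual camera; a candidate is skipped when the effective pixel map marks its
    texture pixel as unreliable (value 0), unless no reliable pixel exists at all."""
    order = np.argsort([np.linalg.norm(vp.position - third_cam_position)
                        for vp in second_viewpoints])
    fallback = None
    for idx in order:
        vp = second_viewpoints[idx]
        (w, h) = vp.image_size
        pix, depth = project_points(point[None, :], vp.rotation, vp.position,
                                    vp.focal_length, vp.principal_point)
        u, v = int(round(pix[0, 0])), int(round(pix[0, 1]))
        if depth[0] <= 0 or not (0 <= u < w and 0 <= v < h):
            continue
        color = texture_images[idx][v, u]
        if effective_maps[idx][v, u] > 0:
            return color                       # reliable pixel from the closest usable camera
        if fallback is None:
            fallback = color                   # keep in case only unreliable pixels exist
    return fallback
```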
  • S1104 is the same as S804 in Figure 8.
  • After this flow is completed, the virtual viewpoint image generation unit 211 generates a virtual viewpoint image based on the restored 3D model and the viewpoint information of the third virtual camera specified by the user, and the generated virtual viewpoint image is displayed on the display device 50.
  • the present disclosure can also be realized by a process in which a program that realizes one or more functions of the above-described embodiments is supplied to a system or device via a network or a storage medium, and one or more processors in the computer of the system or device read and execute the program.
  • the present disclosure can also be realized by a circuit (e.g., an ASIC) that realizes one or more functions.
  • the disclosure of this embodiment includes the following configurations, methods, systems, and programs.
  • (Configuration 1) An image processing device comprising: a setting means for setting positions and orientations of a plurality of second virtual cameras based on optical axes of a plurality of first virtual cameras in a virtual space corresponding to positions and orientations of a plurality of imaging devices in a real space; and a generation means for generating a plurality of depth images indicating the distance between each of the plurality of second virtual cameras and a 3D model of a subject generated based on a plurality of captured images acquired by the plurality of imaging devices.
  • (Configuration 2) The image processing device according to configuration 1, wherein the plurality of second virtual cameras are set on the optical axes of the plurality of first virtual cameras.
  • the image processing device further comprises an output means for outputting the encoded depth images and viewpoint information indicating the positions and orientations of the second virtual cameras to another device that reconstructs the 3D model based on the encoded depth images and the viewpoint information.
  • (Configuration 5) The image processing device, wherein the generating means generates a virtual viewpoint image including a 3D model of the subject for each of the second virtual cameras, the encoding means encodes the plurality of virtual viewpoint images, and the output means outputs the encoded virtual viewpoint images to the other device.
  • (Configuration 6) The image processing apparatus according to configuration 5, wherein the generating means generates correspondence information indicating whether each pixel in the virtual viewpoint image corresponds to a component of a 3D model of the subject, for each of the plurality of second virtual cameras, the encoding means encodes a plurality of pieces of correspondence information, and the output means outputs the encoded correspondence information to the other apparatus.
  • An image processing device comprising: an acquisition means for acquiring a plurality of encoded depth images indicating the distance between each of a plurality of second virtual cameras set based on the optical axes of a plurality of first virtual cameras in a virtual space corresponding to the positions and orientations of a plurality of image capturing devices in real space and a 3D model of a subject generated based on a plurality of captured images acquired by the image capturing devices, and first viewpoint information indicating the positions and orientations of the plurality of second virtual cameras; a decoding means for decoding the encoded depth images; and a generating means for generating a 3D model of the subject based on the decoded depth images and the first viewpoint information.
  • (Method 1) An image processing method comprising: a setting step of setting positions and orientations of a plurality of second virtual cameras based on optical axes of a plurality of first virtual cameras in a virtual space corresponding to positions and orientations of a plurality of imaging devices in a real space; and a generation step of generating a plurality of depth images indicating the distance between each of the plurality of second virtual cameras and a 3D model of the subject generated based on a plurality of captured images acquired by the plurality of imaging devices.
  • (Method 2) An image processing method comprising: an acquisition step of acquiring a plurality of encoded depth images indicating the distance between each of a plurality of second virtual cameras set based on the optical axes of a plurality of first virtual cameras in a virtual space corresponding to the positions and orientations of a plurality of imaging devices in real space and a 3D model of a subject generated based on a plurality of captured images acquired by the imaging devices, and first viewpoint information indicating the positions and orientations of the plurality of second virtual cameras; a decoding step of decoding the encoded depth images; and a generation step of generating a 3D model of the subject based on the decoded depth images and the first viewpoint information.
  • (Program 1) A program for causing a computer to function as each of the means of the image processing device according to any one of configurations 1 to 10.


Abstract

A first image processing device 20 sets the positions and orientations of a plurality of second virtual cameras on the basis of the optical axes of the plurality of first virtual cameras in a virtual space corresponding to the positions and orientations of a plurality of imaging devices in a real space, and generates a plurality of depth images indicating the distance between each of the plurality of second virtual cameras and the 3D model of the subject generated on the basis of the plurality of captured images acquired by the plurality of imaging devices.

Description

Image processing device, image processing method, and program

The present disclosure relates to an image processing device for compressing and distributing 3D model data representing a 3D model.

There is a technology that generates 3D model data showing a 3D model of a subject (object) using multiple images captured by multiple imaging devices, and then generates a virtual viewpoint image as seen from a camera (virtual camera) virtually placed in a virtual space in which the 3D model exists. Recently, technology that generates 3D model data of a subject on a server, compresses and distributes the 3D model data, and allows users to operate a virtual camera on their own local terminal (client) such as a PC or tablet to display a virtual viewpoint image has been attracting attention.

In a technology for compressing and distributing 3D model data, the distributing device arranges multiple virtual cameras (a group of virtual cameras) to surround the 3D model and generates depth images and texture images from each of the multiple virtual cameras. The generated depth and texture images are then compressed, and the compressed depth and texture images, along with information indicating the positions and orientations of the multiple virtual cameras, are distributed to the user. The receiving device decodes the compressed depth and texture images, and reconstructs the 3D model based on the decoded depth and texture images and information indicating the positions and orientations of the multiple virtual cameras.

Patent Document 1 describes a method for determining the viewpoints of a group of virtual cameras so as to minimize fluctuations in the position of key objects in depth images between frames, with the aim of improving the compression rate of depth images. By reducing the motion vectors included in the encoded stream when encoding depth images, it is expected that the compression rate of depth images will improve.

International Publication No. 2018/079260

However, depending on the position and orientation of the virtual cameras set on the device on the distribution side, there is a risk that the quality of the 3D model reconstructed on the device on the receiving side may be reduced. For example, if the shape of the 3D model generated on the device on the distribution side is complex, depending on the position and orientation of the virtual cameras, the complex shape of the 3D model may not be sufficiently represented in the depth image, and the shape accuracy of the 3D model reconstructed on the user side may be reduced. In other words, the shape of the 3D model generated on the distribution side may differ significantly from the shape of the 3D model reconstructed on the receiving side. Therefore, to reconstruct a 3D model more accurately, it is necessary to appropriately determine the position and orientation of the virtual cameras.

The present disclosure therefore aims to reduce the possibility of a decrease in the quality of the 3D model generated on the receiving device.

In order to solve the above problems, the image processing device according to the present disclosure has the following configuration: a setting means for setting the positions and orientations of multiple second virtual cameras based on the optical axes of multiple first virtual cameras in virtual space that correspond to the positions and orientations of multiple image capture devices in real space; and a generation means for generating multiple depth images that indicate the distance between each of the multiple second virtual cameras and a 3D model of a subject that is generated based on the multiple captured images acquired by the multiple image capture devices.

This disclosure reduces the possibility of a decrease in the quality of the 3D model generated on the receiving device.

A block diagram showing an example of the configuration of the image processing system 1.
A diagram showing an example of the hardware configuration of an image processing device.
A diagram showing an example of the functional configuration of the first image processing device 20 and the second image processing device 30.
A diagram showing an example of the configuration of the imaging system 10.
A diagram showing a first virtual camera group 305 that corresponds to the physical camera group 302 and a 3D model 304 of a subject that is generated using the first virtual camera group 305.
A diagram showing a second virtual camera group 307 set based on the viewpoint information of the first virtual camera group 305.
A diagram illustrating the generation of viewpoint information of a virtual camera according to Example 1.
A flowchart illustrating an example of a process for compressing and distributing 3D model data according to Example 1.
A flowchart illustrating an example of a process for receiving 3D model data and generating a virtual viewpoint image according to Example 1.
A flowchart illustrating an example of a process for generating viewpoint information of a group of virtual cameras according to Example 1.
A flowchart illustrating an example of a 3D model restoration process according to Example 1.
A diagram illustrating the generation of viewpoint information of a group of virtual cameras according to Example 2.
A flowchart illustrating an example of a process for compressing and distributing 3D model data according to Example 2.
A flowchart illustrating an example of a 3D model restoration process according to Example 2.

According to a preferred embodiment of the present invention, the image processing device has a setting means for setting the positions and orientations of multiple second virtual cameras based on the optical axes of multiple first virtual cameras in virtual space that correspond to the positions and orientations of multiple imaging devices in real space. The image processing device also has a generation means for generating multiple depth images that indicate the distance between each of the multiple second virtual cameras and a 3D model of the subject generated based on multiple captured images acquired by the multiple imaging devices. Here, it is assumed that the multiple imaging devices are aligned with each other. The multiple first virtual cameras are set based on the positional relationships between the multiple imaging devices in real space. In other words, the multiple first virtual cameras in virtual space are reproductions in virtual space of the multiple imaging devices in real space.

In this manner, the positions and orientations of the multiple second virtual cameras are set based on the positions and orientations of the multiple imaging devices. Therefore, the subject depicted in the depth images of the multiple second virtual cameras is similar to the subject included in the captured images acquired from the multiple imaging devices. Specifically, the shape of the area representing the subject included in the depth images generated from the second virtual cameras is similar to the shape of the area representing the subject included in the captured images. The 3D model of the subject generated by the distribution device is generated based on the multiple captured images acquired from the multiple imaging devices. Therefore, if the depth images used to reconstruct (restore) the 3D model of the subject on the receiving device are similar to the captured images, the 3D model reconstructed on the receiving device will also be similar to the 3D model generated on the distribution device. In other words, it is possible to reduce the possibility that the shape of the 3D model generated on the receiving device will differ from that of the 3D model generated on the distribution device, i.e., the reproducibility of the 3D model will be degraded.

The image processing device also has an encoding means for encoding the depth images. The image processing device also has an output means for outputting the encoded depth images and viewpoint information indicating the positions and orientations of the second virtual cameras to another device that reconstructs the 3D model. Here, the other device is a device that reconstructs the 3D model based on the encoded depth images and viewpoint information.

Furthermore, in the above-mentioned image processing device, the generation means generates a virtual viewpoint image including a 3D model of the subject for each of the plurality of second virtual cameras. The encoding means then encodes the plurality of virtual viewpoint images, and the output means outputs the encoded plurality of virtual viewpoint images to the other device.

In this manner, the colors of each component of the subject included in the virtual viewpoint image (texture image) generated from the second virtual camera are close to the colors of each component of the subject included in the captured image. In other words, it is possible to reduce the possibility that the 3D model generated on the receiving device will have different colors than the 3D model generated on the transmitting device, i.e., the possibility that the reproducibility of the 3D model will be degraded.

Furthermore, the generation means generates correspondence information indicating whether each pixel in the texture image corresponds to a component of the 3D model of the subject, for each of the second virtual cameras. The encoding means then encodes the plurality of pieces of correspondence information, and the output means outputs the encoded plurality of pieces of correspondence information to the other device.

This aspect allows the receiving device to identify, among the pixels of the texture image, the pixels to be used to color the 3D model of the subject. For example, if a subject is occluded by another subject in an image captured by the imaging device, that captured image is used to generate a texture image for the second virtual camera, which could result in the generation of a texture image with color information that differs from the color information of that subject. Therefore, by generating correspondence information that indicates areas where occlusion is not thought to occur, the possibility of colors differing when the 3D model is reconstructed can be reduced.

Furthermore, the second virtual cameras are set at positions closer to the 3D model of the subject than the first virtual cameras.

Furthermore, the generation means generates one depth image for one second virtual camera. When distributing a 3D model showing movement over time, one depth image is generated for one second virtual camera at each of a plurality of time points.

According to another preferred embodiment of this embodiment, the image processing device has an acquisition means for acquiring a plurality of encoded depth images and first viewpoint information indicating the positions and orientations of a plurality of second virtual cameras. Here, a depth image is an image indicating the distance between each of the plurality of second virtual cameras and a 3D model of the subject generated based on a plurality of captured images acquired by the imaging device. The plurality of second virtual cameras are virtual cameras set based on the optical axes of a plurality of first virtual cameras in virtual space corresponding to the positions and orientations of the plurality of imaging devices in real space. The image processing device also has a decoding means for decoding the plurality of encoded depth images. The image processing device also has a generation means for generating a 3D model of the subject based on the plurality of decoded depth images and the first viewpoint information.

The image processing device also acquires second viewpoint information indicating the position and orientation of a third virtual camera different from the first virtual camera and the second virtual camera. The image processing device also generates a virtual viewpoint image based on the second viewpoint information and a 3D model of the subject. The third virtual camera may be set by a user operating an input device such as a joystick, or may be set based on the position of the generated 3D model.

This allows a virtual viewpoint image to be generated using a 3D model generated on the receiving device.

According to another preferred embodiment of this embodiment, a program causes a computer to execute the functions of the image processing device described above. By executing this program, the computer preferably functions as the image processing device described above.

<Example>
Preferred embodiments of the present disclosure will be described in detail below with reference to the drawings. Note that the following embodiments do not limit the present disclosure, and not all of the combinations of features described in the embodiments are necessarily essential to the solutions of the present disclosure. Furthermore, in the accompanying drawings, the same reference numerals are used to designate the same or similar components, and redundant descriptions will be omitted.

Note that a virtual viewpoint image is an image generated by a user freely manipulating the position and orientation of a virtual camera, and is also called a free viewpoint image or arbitrary viewpoint image. Furthermore, unless otherwise specified, the term "image" will be used to refer to both moving and still images.

Image processing system 1 is a system that generates a virtual viewpoint image that represents a scene from a specified virtual viewpoint, based on multiple images captured by multiple imaging devices and a specified virtual viewpoint. The virtual viewpoint image in this embodiment is also called a free viewpoint video, but is not limited to an image corresponding to a viewpoint freely (arbitrarily) specified by the user; for example, the virtual viewpoint image also includes an image corresponding to a viewpoint selected by the user from multiple candidates. Furthermore, this embodiment will mainly describe a case where the virtual viewpoint is specified by user operation, but the virtual viewpoint may also be specified automatically based on the results of image analysis, etc. Furthermore, this embodiment will mainly describe a case where the virtual viewpoint image is a video, but the virtual viewpoint image may also be a still image.

The viewpoint information used to generate a virtual viewpoint image is information that indicates the position and orientation (line of sight) of the virtual viewpoint. Specifically, the viewpoint information is a parameter set that includes parameters that indicate the three-dimensional position of the virtual viewpoint and parameters that indicate the orientation of the virtual viewpoint in the pan, tilt, and roll directions. Note that the content of the viewpoint information is not limited to the above. For example, the parameter set serving as viewpoint information may include a parameter that indicates the size of the field of view (angle of view) of the virtual viewpoint. Furthermore, the viewpoint information may have multiple parameter sets. For example, the viewpoint information may have multiple parameter sets that respectively correspond to multiple frames that make up a video of the virtual viewpoint image, and may be information that indicates the position and orientation of the virtual viewpoint at each of multiple consecutive points in time.

The image processing system 1 has multiple imaging devices that capture images of an imaging area from multiple directions. The imaging area may be, for example, a stadium where sports such as soccer or karate are held, or a stage where concerts or plays are held. The multiple imaging devices are installed in different positions surrounding such an imaging area and capture images in sync. Note that the multiple imaging devices do not have to be installed around the entire perimeter of the imaging area; depending on installation location restrictions, they may be installed only around a portion of the perimeter of the imaging area. The number of imaging devices is not limited to the example shown in the figure; for example, if the imaging area is a soccer stadium, around 30 imaging devices may be installed around the stadium. Imaging devices with different functions, such as telephoto cameras and wide-angle cameras, may also be installed.

Note that in this embodiment, the multiple imaging devices are each assumed to be cameras having an independent housing and capable of capturing images from a single viewpoint. However, this is not limited to this, and two or more imaging devices may be configured within the same housing. For example, a single camera equipped with multiple lens groups and multiple sensors and capable of capturing images from multiple viewpoints may be installed as the multiple imaging devices.

A virtual viewpoint image is generated, for example, by the following method. First, multiple images (multiple captured images) are obtained by capturing images from different directions using multiple imaging devices. Next, a foreground image is obtained from the multiple captured images, extracting a foreground area corresponding to a specific object, such as a person or a ball, and a background image is obtained from the multiple captured images, extracting a background area other than the foreground area. A foreground model representing the three-dimensional shape of the specific object and texture data for coloring the foreground model are generated based on the foreground image, and texture data for coloring a background model representing the three-dimensional shape of the background, such as a stadium, is generated based on the background image. The texture data is then mapped to the foreground model and background model, and rendering is performed according to the virtual viewpoint indicated by the viewpoint information, thereby generating a virtual viewpoint image. However, the method for generating a virtual viewpoint image is not limited to this, and various methods can be used, such as generating a virtual viewpoint image by projective transformation of captured images without using a three-dimensional model.

A foreground image is an image in which an object's area (foreground area) has been extracted from an image captured by an imaging device. An object extracted as a foreground area is a dynamic object (moving body) that moves (its absolute position and shape can change) when images are captured from the same direction in chronological order. Examples of objects include players, referees, and other people on the field where a sport is taking place, such as the ball in a ball game, or singers, musicians, performers, and presenters in a concert or entertainment event.

A background image is an image of at least an area (background area) that is different from the foreground object. Specifically, a background image is an image in which the foreground object has been removed from the captured image. Furthermore, the background refers to an object that remains stationary or nearly stationary when images are taken from the same direction in chronological order. Examples of such objects include a stage for a concert, a stadium where an event such as a sport is held, a structure such as a goal used in a ball game, or a field. However, the background is at least an area that is different from the foreground object, and the captured object may include other objects in addition to the object and background.

A virtual camera is a virtual camera that is different from the multiple imaging devices actually installed around the imaging area, and is a concept used to conveniently explain the virtual viewpoint involved in generating a virtual viewpoint image. In other words, a virtual viewpoint image can be considered to be an image captured from a virtual viewpoint set in a virtual space associated with the imaging area. The position and orientation of the viewpoint in this virtual image can be expressed as the position and orientation of the virtual camera. In other words, a virtual viewpoint image can be said to be an image that simulates the captured image obtained by a camera, assuming that the camera exists at the position of the virtual viewpoint set in space.

<Example 1>
In this embodiment, a process for determining the viewpoints of a group of virtual cameras for generating depth images and texture images to be distributed from a server to a client based on viewpoint information of a physical camera will be described. Note that in this embodiment, the receiving device will be referred to as the client. Also, in this embodiment, the imaging device arranged in real space will be referred to as the physical camera.

<Hardware configuration of image processing system>
FIG. 1A is a diagram showing an example of the overall configuration of an image processing system 1 according to this embodiment.

The image processing system 1 has a shooting system 10, a first image processing device 20, a second image processing device 30, an input device 40, and a display device 50. The image processing system 1 generates shape information of a 3D model of an object using images (plurality of captured images) taken by multiple physical cameras. The image processing system 1 then generates viewpoint information (first viewpoint information) of a second virtual camera group for generating depth images and texture images to be distributed from the server to the client. Furthermore, the image processing system 1 generates and compresses depth images and texture images based on the generated viewpoint information of the second virtual camera group, and distributes these images and the information necessary to reconstruct (restore) the 3D model to the user as 3D model data.

 The imaging system 10 places a plurality of physical cameras at different positions and synchronously captures a subject (object) from multiple viewpoints. The plurality of captured images obtained through this synchronous capture, together with the viewpoint information of each physical camera of the imaging system 10 (extrinsic/intrinsic parameters and image size), are transmitted to the first image processing device 20. The extrinsic parameters of a camera are parameters that indicate the position and orientation of the camera (for example, a rotation matrix and a position vector). The intrinsic parameters of a camera are parameters specific to that camera, such as the focal length, image center, and lens distortion parameters. The extrinsic and intrinsic parameters of a camera are collectively referred to as camera parameters. The image size is the width and height of the image. The viewpoint information of this group of physical cameras is used when determining the viewpoint information (camera parameters and image size) of the second virtual camera group used to generate the depth images and texture images.
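 As a concrete illustration of the camera parameters described above, the following Python sketch shows one possible way to hold the extrinsic parameters (rotation matrix and position vector), the intrinsic parameters (focal length and image center), and the image size, together with a pinhole projection of a 3D point into pixel coordinates. This is only an illustrative sketch; the class and function names are hypothetical, lens distortion is omitted, and the described system does not prescribe any particular data layout.

import numpy as np
from dataclasses import dataclass

@dataclass
class CameraParams:
    """Hypothetical container for one camera's viewpoint information."""
    R: np.ndarray   # 3x3 rotation matrix (extrinsic: orientation)
    t: np.ndarray   # 3-vector camera position (extrinsic: position)
    f: float        # focal length in pixels (intrinsic)
    cx: float       # image center x (intrinsic)
    cy: float       # image center y (intrinsic)
    width: int      # image width
    height: int     # image height

def project(point_world: np.ndarray, cam: CameraParams):
    """Project a 3D world point to pixel coordinates; also return its depth."""
    # Transform into the camera coordinate frame.
    p_cam = cam.R @ (point_world - cam.t)
    depth = p_cam[2]
    # Pinhole projection with a single focal length and principal point.
    u = cam.f * p_cam[0] / depth + cam.cx
    v = cam.f * p_cam[1] / depth + cam.cy
    return u, v, depth

# Example: one camera at the origin looking down the world +Z axis (4K image).
cam = CameraParams(R=np.eye(3), t=np.zeros(3), f=1000.0,
                   cx=1920.0, cy=1080.0, width=3840, height=2160)
print(project(np.array([0.1, 0.0, 5.0]), cam))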

 The first image processing device 20 generates a 3D model of the foreground object based on the plurality of captured images input from the imaging system 10 and the viewpoint information of each physical camera. The foreground object (subject) is, for example, a person or a moving body within the capture range of the imaging system 10. The first image processing device 20 then generates viewpoint information of the second virtual camera group used to generate the depth images and texture images. Furthermore, based on the generated viewpoint information of the second virtual camera group, the first image processing device 20 generates and compresses depth images and texture images, and outputs the compressed images to the second image processing device 30 together with the information (metadata) necessary to restore the 3D model. The metadata is, for example, the generated viewpoint information of the second virtual camera group.

 First, the first image processing device 20 estimates the shape information of the subject based on the plurality of captured images and the viewpoint information of each physical camera. The shape information of the subject is estimated using, for example, the volume intersection method (shape from silhouette). A plurality of virtual cameras (a plurality of first virtual cameras) are set in the virtual space so as to correspond to the imaging system 10 existing in real space. These first virtual cameras are referred to as a first virtual camera group. The first virtual camera group reproduces the physical camera group in the virtual space; that is, the viewpoint information of the first virtual camera group corresponds to the viewpoint information of the physical camera group in real space. The shape information of the subject is then generated by applying the volume intersection method to the viewpoint information of the first virtual camera group and the plurality of captured images. As a result of this processing, a 3D point cloud (a set of points having three-dimensional coordinates) representing the shape information of the subject is obtained. This 3D point cloud is also called a visual hull. Note that the method of deriving the shape information of the subject from the captured images is not limited to this. The representation of the subject's shape information is also not limited to a 3D point cloud and may be a mesh or voxels. Furthermore, a texture (color information) is determined for each point of the generated 3D point cloud of the subject using the plurality of captured images. Therefore, the 3D model data representing the 3D model of the subject includes shape information indicating the shape and color information indicating the color. In this embodiment, the 3D model is generated from shape information and color information, but this is not limiting; for example, the 3D model may include the shape information while the color information is managed as separate data. The method of setting the first virtual camera group in the virtual space corresponding to the physical camera group in real space is a well-known technique in the field of CG and the like, and its description is therefore omitted.
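 To illustrate the idea behind the volume intersection (visual hull) method mentioned above, the following sketch carves a grid of candidate 3D points against binary silhouette masks: a candidate point is kept only if its projection falls inside the foreground silhouette in every camera. The function names, the toy orthographic projections, and the masks below are assumptions for illustration only; a practical implementation would use calibrated perspective projections and a much finer grid.

import numpy as np

def carve_visual_hull(grid_points, cameras, silhouettes):
    """Keep only the candidate points whose projection lies inside every silhouette.

    grid_points : (N, 3) candidate 3D points (e.g. voxel centers).
    cameras     : list of functions mapping a 3D point to (u, v) pixel coordinates.
    silhouettes : list of binary masks (H, W), True where the foreground object is.
    """
    kept = []
    for p in grid_points:
        inside_all = True
        for proj, mask in zip(cameras, silhouettes):
            u, v = proj(p)
            h, w = mask.shape
            # A point outside the image or outside the silhouette is carved away.
            if not (0 <= int(v) < h and 0 <= int(u) < w) or not mask[int(v), int(u)]:
                inside_all = False
                break
        if inside_all:
            kept.append(p)
    return np.array(kept)

# Toy usage: two cameras with orthographic "projections" and square silhouettes.
mask = np.zeros((100, 100), dtype=bool)
mask[40:60, 40:60] = True
cams = [lambda p: (p[0] + 50, p[1] + 50), lambda p: (p[2] + 50, p[1] + 50)]
grid = np.stack(np.meshgrid(*[np.arange(-20, 21, 2)] * 3), -1).reshape(-1, 3)
hull = carve_visual_hull(grid, cams, [mask, mask])
print(hull.shape)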

 Next, the first image processing device 20 generates viewpoint information (position and orientation) of a plurality of second virtual cameras used to generate the depth images and texture images to be distributed. Hereinafter, the plurality of second virtual cameras are referred to as a second virtual camera group. The positions of the second virtual camera group are, for example, set on the optical axes of the first virtual camera group, and their orientations are set to be the same as those of the first virtual cameras. Since the viewpoint information of the first virtual camera group corresponds to the viewpoint information of the physical camera group in real space, the viewpoint information of the second virtual camera group is consequently generated based on the viewpoint information of the physical camera group. The method of generating the viewpoint information of the second virtual camera group is described in detail later. Hereinafter, the viewpoint information of the second virtual camera group is referred to as the second viewpoint information. The first image processing device 20 generates depth images of the 3D model based on the second viewpoint information of the second virtual camera group. One depth image is generated for each second virtual camera. Specifically, each point in the 3D point cloud of the subject is projected onto the plane of the imaging plane of each second virtual camera. For each second virtual camera, the distance (depth) from the second virtual camera to the subject is calculated for each projected pixel, and a depth value is set for each pixel of the depth image. The first image processing device 20 also captures the subject based on the second viewpoint information of the second virtual camera group and generates texture images. A texture image is generated by blending the colors of the plurality of captured images while giving higher priority to the pixel values of images captured by physical cameras whose viewing direction is close to that of the second virtual camera. This method of assigning a higher priority (weight), when blending the plurality of captured images, to the pixel values of images captured by physical cameras whose viewing direction is close to that of the second virtual camera is referred to as the virtual-viewpoint-dependent texture image generation method, and is described in detail later. Because the captured image of the physical camera used to determine the pixel values of the subject is selected depending on the position and orientation of the virtual camera, the color of the subject changes as the virtual camera moves; a texture image generated by this method is therefore called a virtual-viewpoint-dependent texture image. In contrast, a virtual-viewpoint-independent texture image is generated by a method in which the pixel values of the subject do not change with the position and orientation of the virtual camera. In this embodiment, texture image generation is described using the virtual-viewpoint-dependent method, but texture images may also be generated in a virtual-viewpoint-independent manner. Examples of the virtual-viewpoint-dependent and virtual-viewpoint-independent texture image generation processes are given below.

 The virtual-viewpoint-dependent texture image generation process includes, for example, a process of determining the visibility of the points of the 3D point cloud constituting the subject and a process of deriving colors based on the position and orientation of the virtual camera.

 In the visibility determination process, the physical cameras capable of capturing each point are identified from the positional relationship between each point in the 3D point cloud and the plurality of physical cameras included in the physical camera group of the imaging system 10. In the color derivation process, a point in the 3D point cloud is taken as a point of interest and its color is derived, with the following processing performed for each point of interest. A point of interest included in the imaging range of the second virtual camera is selected. Then, a first virtual camera that has a viewing direction close to that of the second virtual camera whose imaging range includes the point of interest, and that can capture the point of interest, is selected. Because a first virtual camera reproduces a physical camera in the virtual space, selecting a first virtual camera is equivalent to selecting a physical camera. The selected point of interest is then projected onto the image captured by the selected physical camera, and the color of the pixel at the projection destination is taken as the color of the point of interest. A physical camera is selected, for example, according to whether the angle between the viewing direction from the second virtual camera to the point of interest and the viewing direction from the physical camera to the point of interest is at most a certain angle. If the point of interest can be captured by a plurality of physical cameras, a plurality of physical cameras whose viewing directions are close to that of the second virtual camera are selected, and the point of interest is projected onto each of the images captured by those physical cameras. The pixel values at the projection destinations are then obtained, and a weighted average is calculated so that the pixel values of the physical cameras whose viewing directions are closer to that of the second virtual camera are used preferentially, thereby determining the color of the point of interest.

 By performing this process while changing the point of interest and projecting the colors of the points of interest onto the plane of the imaging plane of the second virtual camera, a virtual-viewpoint-dependent texture image can be generated. Although the description here follows this embodiment, it is not limited to the above; when generating a texture image that depends on a virtual camera other than the second virtual camera, the above processing is performed with that other virtual camera in place of the second virtual camera.
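 The color derivation described above can be illustrated by the following sketch, which blends the colors sampled from the physical-camera images for one point of interest, weighting the cameras whose viewing direction toward the point is within a threshold angle of the second virtual camera's viewing direction. The angle threshold, the weighting function, and all names are illustrative assumptions; the description above does not fix a specific weighting.

import numpy as np

def view_dependent_color(point, virt_cam_pos, phys_cam_positions, sampled_colors,
                         max_angle_deg=30.0):
    """Blend colors sampled from physical-camera images for one 3D point,
    weighting cameras whose viewing direction is close to the virtual camera's.

    sampled_colors[i] is the RGB value obtained by projecting `point` into
    physical camera i's image (visibility is assumed to be already checked).
    """
    v_dir = point - virt_cam_pos
    v_dir = v_dir / np.linalg.norm(v_dir)
    weights, colors = [], []
    for cam_pos, color in zip(phys_cam_positions, sampled_colors):
        c_dir = point - cam_pos
        c_dir = c_dir / np.linalg.norm(c_dir)
        angle = np.degrees(np.arccos(np.clip(np.dot(v_dir, c_dir), -1.0, 1.0)))
        if angle <= max_angle_deg:
            # Smaller angle gives a larger weight, so nearby views dominate.
            weights.append(1.0 / (angle + 1e-3))
            colors.append(color)
    if not weights:
        return None  # no physical camera close enough to the virtual view
    w = np.array(weights) / np.sum(weights)
    return np.average(np.array(colors, dtype=float), axis=0, weights=w)

# Toy usage: two candidate cameras, the first almost aligned with the virtual view.
print(view_dependent_color(np.array([0.0, 0.0, 5.0]), np.array([0.0, 0.0, 0.0]),
                           [np.array([0.2, 0.0, 0.0]), np.array([4.0, 0.0, 0.0])],
                           [np.array([200, 50, 50]), np.array([100, 100, 100])]))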

 The virtual-viewpoint-independent texture image generation process includes, for example, the visibility determination process described above and a color derivation process that does not depend on the position and orientation of the virtual camera. After the visibility determination process, for example, a point in the 3D point cloud is taken as a point of interest, the point of interest is projected onto the image captured by the physical camera corresponding to a first virtual camera capable of capturing it, and the color of the pixel at the projection destination is taken as the color of the point of interest.

 If the point of interest can be captured by a plurality of first virtual cameras, the point of interest is projected onto each of the images captured by the plurality of physical cameras corresponding to those first virtual cameras. The pixel values at the projection destinations are then obtained and averaged to determine the color of the point of interest. By performing this process while changing the point of interest and projecting the colors of the points of interest onto the plane of the imaging plane of the second virtual camera, a virtual-viewpoint-independent texture image can be generated.

 In virtual viewpoint image generation technology, the virtual-viewpoint-dependent method of generating color information is known to produce higher-quality images than the virtual-viewpoint-independent method, because the pixel values of images captured by physical cameras whose viewing directions are close to that of the second virtual camera are used preferentially. In addition, a virtual-viewpoint-dependent virtual viewpoint image (texture image) generated by a virtual camera whose position and orientation are close to those of a physical camera is strongly influenced by the image captured by that physical camera, making it possible to generate a texture image whose quality is close to that of the captured image. On the other hand, a virtual-viewpoint-dependent texture image generated from the viewpoint of a virtual camera whose position and orientation differ greatly from those of the physical cameras, or a texture image generated in a virtual-viewpoint-independent manner, is generated by interpolating pixel values from multiple physical cameras. As a result, such an image may contain pixel values that differ from the actual color of the subject or may have low contrast.

 Finally, the first image processing device 20 compresses (encodes) the depth images and texture images using a video compression method such as H.264 or H.265. The compression method is not limited to video compression; any method that encodes the data to a size smaller than the original, such as file compression, may be used. The first image processing device 20 need not output depth images and texture images in which no subject appears to the second image processing device 30. The second image processing device 30 can reconstruct the shape information of the 3D model by back-projecting the depth images into the virtual space based on the second viewpoint information of the second virtual cameras. Furthermore, by using the pixel value of the texture image at the same coordinates as each depth value of the depth image as the color information of the point obtained by back-projecting that depth value, color information can be added to the shape information and the 3D model can be reconstructed.
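 The reconstruction by back-projection described above might look as follows: each depth pixel is un-projected through the pinhole model into the virtual camera's coordinate frame, transformed into world coordinates using the camera's viewpoint information, and given the color of the texture pixel at the same coordinates. The function name and the camera convention (position vector plus rotation matrix) are assumptions for illustration, not a prescribed implementation.

import numpy as np

def backproject(depth, texture, f, cx, cy, R, cam_pos):
    """Reconstruct a colored point cloud from one depth/texture image pair.

    depth   : (H, W) metric depth values, 0 where there is no subject.
    texture : (H, W, 3) color image aligned pixel-for-pixel with the depth image.
    R, cam_pos : the virtual camera's rotation matrix and position (its viewpoint
                 information), used to move the points back into world coordinates.
    """
    v, u = np.nonzero(depth > 0)
    z = depth[v, u]
    # Invert the pinhole projection to get camera-frame coordinates.
    x = (u - cx) * z / f
    y = (v - cy) * z / f
    pts_cam = np.stack([x, y, z], axis=1)
    # Camera frame -> world frame (inverse of p_cam = R @ (p_world - cam_pos)).
    pts_world = pts_cam @ R + cam_pos
    colors = texture[v, u]
    return pts_world, colors

# Toy usage with a flat 2 m depth plane and a uniform gray texture.
d = np.full((4, 4), 2.0)
t = np.full((4, 4, 3), 128, dtype=np.uint8)
pts, cols = backproject(d, t, f=100.0, cx=2.0, cy=2.0, R=np.eye(3), cam_pos=np.zeros(3))
print(pts.shape, cols.shape)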

 The first image processing device 20 may also cut out a rectangle covering the captured region of the subject from the uncompressed depth image and texture image, and compress and distribute this rectangular image (ROI image).

 In this case, the metadata may include the coordinate information of the cut-out rectangular images. The first image processing device 20 may also arrange the rectangular images into a single image before compressing and distributing it. Distributing rectangular images instead of the entire image makes it possible to reduce the amount of data.

 Furthermore, the first image processing device 20 may generate depth images containing high-precision depth values, such as single-precision floating-point numbers (32 bits), which cannot be video-compressed with H.264 or H.265. In this case, the depth information is converted to a precision that allows video compression (8 or 10 bits) before the video compression is performed. For example, scalar quantization may be applied, and the quantized depth image may then be compressed and distributed. In this case, the metadata may include the minimum and maximum values of the depth range before quantization. By performing scalar quantization, the client can reconstruct a 3D model of imaging targets and subjects within the capture range for which the precision directly supported by video compression would be insufficient.
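 A minimal sketch of the scalar quantization described above is shown below: floating-point depth is mapped linearly onto a 10-bit integer range for video compression, and the minimum and maximum of the original range are carried as metadata so that the client can dequantize. The names and the linear mapping are illustrative assumptions.

import numpy as np

def quantize_depth(depth_f32, bits=10):
    """Scalar-quantize floating-point depth into an integer image plus metadata."""
    d_min, d_max = float(depth_f32.min()), float(depth_f32.max())
    levels = (1 << bits) - 1
    scale = levels / (d_max - d_min) if d_max > d_min else 1.0
    q = np.round((depth_f32 - d_min) * scale).astype(np.uint16)
    return q, {"d_min": d_min, "d_max": d_max, "bits": bits}  # metadata to distribute

def dequantize_depth(q, meta):
    """Recover approximate depth values on the receiving side."""
    levels = (1 << meta["bits"]) - 1
    return meta["d_min"] + q.astype(np.float32) * (meta["d_max"] - meta["d_min"]) / levels

depth = np.random.default_rng(1).uniform(1.0, 3.0, size=(240, 320)).astype(np.float32)
q, meta = quantize_depth(depth)
print(np.abs(dequantize_depth(q, meta) - depth).max())  # maximum reconstruction error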

 The second image processing device 30 receives and decodes the depth images, texture images, and metadata from the first image processing device 20 and restores the 3D model. As described above, the 3D model is restored by back-projecting the depth images and texture images into the virtual space based on the second viewpoint information of the second virtual camera group contained in the metadata. The second image processing device 30 also calculates viewpoint information of a third virtual camera for generating the virtual viewpoint image viewed by the user, based on input values received from the input device 40 described later. It then generates a virtual viewpoint image based on the calculated viewpoint information of the third virtual camera and the restored 3D model. Furthermore, the second image processing device 30 outputs the generated virtual viewpoint image to the display device 50.

 The input device 40 accepts input values with which the user sets the third virtual camera and transmits the input values to the second image processing device 30. For example, the input device 40 has input units such as a joystick, a jog dial, a touch panel, a keyboard, and a mouse. The user who sets the third virtual camera sets the position and orientation of the third virtual camera by operating an input unit. In this embodiment, the user sets the position and orientation of the third virtual camera, but this is not limiting, and the position and orientation of the third virtual camera may also be set using the position information of the 3D model. The position information of the 3D model here may be generated by the first image processing device 20 on the distribution side or by the second image processing device 30 on the receiving side.

 The display device 50 displays the virtual viewpoint image generated and output by the second image processing device 30. The user views the virtual viewpoint image displayed on the display device 50 and sets the position and orientation of the virtual camera for the next frame via the input device 40.

 FIG. 1B is a diagram showing an example of the hardware configuration of the first image processing device 20 of this embodiment. The hardware configuration of the second image processing device 30 is similar to that of the first image processing device 20 described below. The first image processing device 20 of this embodiment includes a CPU 101, a RAM 102, a ROM 103, and a communication unit 104.

 The CPU 101 controls the entire first image processing device 20 using computer programs and data stored in the RAM 102 and the ROM 103, thereby realizing the functions of the first image processing device 20 shown in FIGS. 1A and 1B. The first image processing device 20 may have one or more pieces of dedicated hardware separate from the CPU 101, and at least part of the processing performed by the CPU 101 may be executed by the dedicated hardware. Examples of dedicated hardware include an ASIC (application-specific integrated circuit), an FPGA (field-programmable gate array), and a DSP (digital signal processor). The RAM 102 temporarily stores programs and data supplied from the auxiliary storage device 214 and data supplied from the outside via the communication unit 104. The ROM 103 stores programs and the like that do not require modification.

 The communication unit 104 is used for communication between the first image processing device 20 and external devices. For example, when the first image processing device 20 is connected to an external device by wire, a communication cable is connected to the communication unit 104. When the first image processing device 20 has a function of communicating wirelessly with an external device, the communication unit 104 includes an antenna.

 <Functional configuration of the image processing devices>
 FIG. 2 is a diagram showing an example of the functional configuration of the first image processing device 20 and the second image processing device 30.

 The first image processing device 20 includes a shape information generation unit 201, a viewpoint determination unit 202, a depth image generation unit 203, a texture image generation unit 204, an encoding unit 205, and a distribution unit 206.

 The shape information generation unit 201 estimates the shape information of the subject using the plurality of captured images and the viewpoint information of the physical camera group received from the imaging system 10 via the communication unit 104. The shape information is estimated using, for example, the volume intersection method described above. The shape information generation unit 201 therefore also generates, from the acquired viewpoint information of the physical camera group, the viewpoint information of the first virtual camera group in the virtual space corresponding to the physical camera group in real space. The shape information generation unit 201 outputs the estimated shape information to the viewpoint determination unit 202, the depth image generation unit 203, and the texture image generation unit 204. It also outputs the received viewpoint information of the physical camera group and the plurality of captured images to the viewpoint determination unit 202 and the texture image generation unit 204.

 Based on the input data from the shape information generation unit 201, the viewpoint determination unit 202 determines the viewpoints of the second virtual camera group and generates their viewpoint information, with the aim of improving the quality of the 3D model restored by the second image processing device 30. The viewpoint determination unit 202 outputs the generated viewpoint information of the second virtual camera group to the depth image generation unit 203 and, via the depth image generation unit 203, to the texture image generation unit 204.

 The viewpoint information of the second virtual camera group is generated, for example, by matching each second virtual camera to the position and orientation of the first virtual camera corresponding to a physical camera and then dolly-zooming the second virtual camera along the viewing direction of that first virtual camera. Specifically, a second virtual camera placed at the position of the first virtual camera corresponding to a physical camera approaches the subject along the viewing direction of the first virtual camera while its focal length (zoom) is adjusted so that the size of the subject in the virtual viewpoint image generated from the second virtual camera is maintained. This process of deriving the viewpoint information of the second virtual cameras is described in detail later with reference to FIGS. 3A to 3C and FIG. 4. By performing this process while changing the physical camera, and while changing the subject when a plurality of subjects appear in the captured images, viewpoint information of second virtual cameras can be generated for each subject, one for each physical camera that captures that subject.
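 Under a pinhole model, the on-screen size of a subject is roughly proportional to the focal length divided by the camera-to-subject distance, so a dolly zoom that keeps the subject's size constant keeps that ratio constant. The following sketch (hypothetical names; a simplification that ignores perspective effects on off-axis points) computes the focal length for a new distance.

def dolly_zoom_focal_length(f_initial, d_initial, d_new):
    """Focal length that keeps the subject's on-screen size constant while the
    camera moves from distance d_initial to d_new (on-screen size ~ f / d)."""
    return f_initial * d_new / d_initial

# Example: starting at 20 m with a 2000 px focal length and moving in to 5 m.
print(dolly_zoom_focal_length(2000.0, 20.0, 5.0))  # -> 500.0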

 The viewpoint determination unit 202 may also change the image size (width and height) of a second virtual camera. The image size of the second virtual camera is first set equal to that of the physical camera, and after the position and orientation of the second virtual camera are determined as described above, its image size is changed. For example, when the width and height of the virtual viewpoint image of the second virtual camera are each changed to 1/N of the width and height of the image captured by the physical camera, the focal length and image center among the intrinsic parameters of the second virtual camera are also multiplied by 1/N. That is, when the image size of the physical camera is 4K (width 3840, height 2160) and the image size of the second virtual camera is changed to full HD (width 1920, height 1080), i.e., the width and height are each halved, the focal length and image center among the intrinsic parameters of the second virtual camera are multiplied by 1/2. In this way, the image size can be changed without changing the angle of view, and the amount of data transmitted from the server to the client can be reduced.
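 A sketch of the image-size change described above is shown below: when the image is scaled to 1/N of its width and height, the focal length and image center are scaled by the same factor so that the angle of view is unchanged. The function name and parameter values are hypothetical.

def resize_intrinsics(f, cx, cy, width, height, n):
    """Scale the image size to 1/n while keeping the angle of view unchanged by
    scaling the focal length and image center by the same factor."""
    return f / n, cx / n, cy / n, width // n, height // n

# Example from the text: 4K (3840x2160) down to full HD (1920x1080), i.e. n = 2.
print(resize_intrinsics(f=2000.0, cx=1920.0, cy=1080.0, width=3840, height=2160, n=2))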

 The depth image generation unit 203 performs the depth image generation process described above, based on the shape information of the subject input from the shape information generation unit 201 and the viewpoint information of the second virtual camera group input from the viewpoint determination unit 202. The depth image generation unit 203 outputs the generated depth images and the viewpoint information of the second virtual camera group to the encoding unit 205.
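 The depth image generation described above (projecting each point of the subject's 3D point cloud onto the second virtual camera's imaging plane and recording the camera-to-subject distance per pixel) might be sketched as follows, assuming the points have already been transformed into the virtual camera's coordinate frame and using a simple nearest-depth (z-buffer) rule. All names are hypothetical.

import numpy as np

def render_depth_image(points_cam, f, cx, cy, width, height):
    """Render a depth image from 3D points already expressed in the virtual
    camera's coordinate frame (Z along the viewing direction)."""
    depth = np.full((height, width), np.inf, dtype=np.float32)
    for x, y, z in points_cam:
        if z <= 0:  # behind the camera
            continue
        u = int(round(f * x / z + cx))
        v = int(round(f * y / z + cy))
        if 0 <= u < width and 0 <= v < height:
            # Keep the nearest surface per pixel (simple z-buffer).
            depth[v, u] = min(depth[v, u], z)
    depth[np.isinf(depth)] = 0.0  # 0 marks pixels with no subject
    return depth

# Toy usage: a small cluster of points two metres in front of the camera.
pts = np.random.default_rng(0).normal([0.0, 0.0, 2.0], 0.05, size=(500, 3))
d = render_depth_image(pts, f=800.0, cx=160.0, cy=120.0, width=320, height=240)
print(d[d > 0].min(), d[d > 0].max())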

 The texture image generation unit 204 performs the virtual-viewpoint-dependent texture image generation process described above, based on the input data from the shape information generation unit 201 and the viewpoint information of the second virtual camera group input from the viewpoint determination unit 202 via the depth image generation unit 203. The texture image generation unit 204 outputs the generated texture images to the encoding unit 205. As described above, a virtual-viewpoint-dependent texture image is generated by preferentially referring to the pixel values of images captured by physical cameras whose position and orientation are close to those of the second virtual camera. This makes it possible to generate high-quality texture images, so that when the client restores the 3D model, a 3D model containing high-quality color information can be reproduced.

 The texture image generation unit 204 may also generate an effective pixel map based on the input data from the shape information generation unit 201 and the viewpoint information of the second virtual camera group input from the viewpoint determination unit 202 via the depth image generation unit 203, and may output the generated effective pixel map to the encoding unit 205. The effective pixel map is described in detail in Example 2.

 The encoding unit 205 acquires the depth images and the viewpoint information of the second virtual camera group input from the depth image generation unit 203, and the texture images input from the texture image generation unit 204. The encoding unit 205 compresses the depth images and texture images using the compression method described above, and outputs the compressed images and the viewpoint information (metadata) of the second virtual camera group to the distribution unit 206.

 The encoding unit 205 may also compress not only the depth images and texture images but also the metadata, using the file compression or the like described above.

 Furthermore, the encoding unit 205 may compress the effective pixel map input from the texture image generation unit 204 and output it to the distribution unit 206.

 The distribution unit 206 uses the communication unit 104 to transmit the compressed depth images, the compressed texture images, and the viewpoint information of the second virtual camera group input from the encoding unit 205 to the receiving unit 207 described later.

 The second image processing device 30 includes a receiving unit 207, a decoding unit 208, a 3D model restoration unit 209, a virtual camera control unit 210, and a virtual viewpoint image generation unit 211.

 The receiving unit 207 uses the communication unit 104 to receive the compressed depth images, the compressed texture images, and the viewpoint information (metadata) of the second virtual camera group from the distribution unit 206, and outputs them to the decoding unit 208.

 The decoding unit 208 decodes the compressed depth images and compressed texture images acquired from the receiving unit 207 and outputs them to the 3D model restoration unit 209 together with the viewpoint information of the second virtual camera group. The decoding unit 208 may also decode the metadata in addition to the depth images and texture images. Furthermore, the decoding unit 208 may decode the effective pixel map and output it to the 3D model restoration unit 209.

 The 3D model restoration unit 209 restores the 3D model using the restoration method described above, based on the decoded depth images and texture images acquired from the decoding unit 208 and the viewpoint information of the second virtual camera group, and outputs the restored 3D model to the virtual viewpoint image generation unit 211. Furthermore, the 3D model restoration unit 209 may generate the color information of the 3D model using the effective pixel map acquired from the decoding unit 208. The use of the effective pixel map is described in detail in Example 2.

 The virtual camera control unit 210 uses the communication unit 104 to generate, from the input values entered by the user via the input device 40, the viewpoint information of the third virtual camera for generating the virtual viewpoint image, and outputs the viewpoint information of the third virtual camera to the virtual viewpoint image generation unit 211. The virtual camera control unit 210 may also output the generated user-specified viewpoint information of the third virtual camera to the 3D model restoration unit 209.

 The virtual viewpoint image generation unit 211 generates a virtual viewpoint image based on the 3D model acquired from the 3D model restoration unit 209 and the viewpoint information of the third virtual camera acquired from the virtual camera control unit 210. The virtual viewpoint image is generated by placing the 3D model of the subject, the 3D model of the background object, and the third virtual camera in the virtual space and generating the image seen from the third virtual camera. The 3D model of the background object is, for example, a CG (computer graphics) model created separately to be combined with the subject, and is created in advance and stored in the second image processing device 30 (for example, in the ROM 103 of FIG. 1B).

 The 3D model of the subject and the 3D model of the background object are rendered by an existing CG rendering method. The virtual viewpoint image generation unit 211 transmits the generated virtual viewpoint image to the display device 50.

 <Example of generating viewpoint information of the second virtual camera group based on viewpoint information of the physical camera group>
 Here, a method for generating the viewpoint information of the second virtual camera group using the viewpoint information of the physical camera group is described with reference to FIGS. 3A to 3C and FIG. 4. FIGS. 3A to 3C are schematic diagrams illustrating an example of the method for generating the viewpoint information of the second virtual camera group.

 FIG. 3A is a diagram showing a physical camera group 302 and a subject 301 arranged in real space. The subject 301 is captured by the physical camera group 302. The straight lines 303 indicate the optical axes of the cameras of the physical camera group 302.

 FIG. 3B is a diagram showing a first virtual camera group 305 set so as to correspond to the physical camera group 302, and a 3D model 304 of the subject generated using the first virtual camera group 305. The first image processing device 20 generates the 3D model 304 of the subject using the plurality of captured images acquired from the physical camera group 302. To generate the 3D model 304 in the virtual space for the subject 301 in real space, the first virtual camera group 305 is generated in the virtual space so as to correspond to the physical camera group 302 in real space. In other words, the first virtual camera group 305 reproduces, in the virtual space, the physical camera group 302 in real space. Accordingly, the optical axes 306 of the first virtual camera group 305 correspond to the optical axes 303 of the physical camera group 302. The first image processing device 20 then generates the 3D model 304 of the subject 301 using the viewpoint information of the generated first virtual camera group 305.

 FIG. 3C is a diagram showing a second virtual camera group 307 set based on the viewpoint information of the first virtual camera group 305. The second virtual camera group 307 is set on the optical axes 306 of the first virtual camera group 305; however, this is not limiting, and the second virtual cameras 307 may be set at positions close to the optical axes 306. The second virtual camera group 307 is also set at positions closer to the 3D model 304 than the first virtual camera group 305. In general, a second virtual camera group 307 for generating depth images and texture images for distribution is set by placing a bounding box that encloses the 3D model 304 and arranging the cameras on a sphere surrounding that bounding box so that they face the subject; this determines the positions and orientations (extrinsic parameters) of the second virtual camera group 307, after which a focal length (intrinsic parameter) is determined that allows the entire 3D model 304 to be captured by the second virtual camera group 307. These are set manually, and the number and spacing of the second virtual cameras are often determined heuristically. When the 3D model 304, that is, the subject 301, has a complex shape, occlusion occurs in which a part of the 3D model 304 is hidden behind another part as seen from a given second virtual camera, so the number and arrangement of the second virtual camera group 307 must be determined appropriately. For a subject with a complex shape, if the number of second virtual cameras is small and their arrangement is inappropriate so that much occlusion occurs, the 3D model restored by the second image processing device 30 may differ greatly in shape from the 3D model 304 before distribution.
 However, if the number of second virtual cameras is increased in order to restore the 3D model 304 accurately, the number of depth images and texture images to be distributed increases. As a result, when a user attempts to restore the 3D model in an environment with a low transmission bandwidth or on a local terminal with low processing performance, the frame rate may drop. To give the user a high sense of realism, the 3D model displayed on the display device 50 should have high image quality and a smooth frame rate such as 60 fps. It is therefore necessary to set the number and arrangement of the second virtual camera group 307 appropriately and to distribute depth images from which the 3D model 304 can be accurately restored, together with high-quality texture images, using a small amount of data. The first image processing device 20 therefore generates the viewpoint information of the second virtual camera group 307 based on the viewpoint information of the physical camera group 302 described above. That is, the second virtual cameras 307 are placed at the positions of the first virtual cameras 305 corresponding to the physical camera group 302. The positions of the second virtual camera group 307 are then set by dolly-zooming each second virtual camera along the optical axis 306 of the corresponding first virtual camera 305 toward the 3D model 304 of the subject 301 while maintaining the size of the subject on the screen. It is assumed that the position to which each second virtual camera 307 is dolly-zoomed is determined in advance; for example, the dolly zoom may continue until the distance to the 3D model 304 reaches a predetermined value, or until the size of the 3D model 304 projected onto the imaging plane of the second virtual camera 307 reaches a predetermined size. The method of determining the viewpoint information of the second virtual camera group 307 is not limited to this; the second virtual cameras 307 may simply be placed near the optical axes 306 of the first virtual camera group 305, and intrinsic parameters such as the focal length and the image size may be determined manually. That is, it suffices that the viewpoint information of the second virtual camera group 307 is determined based on the viewpoint information of the physical camera group 302. In virtual viewpoint generation technology, the physical camera group 302 is arranged so as to surround the subject 301 with the number of cameras required to estimate the shape of the subject 301, for example by the volume intersection method described above. Therefore, if depth images as observed from the physical camera group 302 are sent, the user side can accurately restore the shape information of the shape-estimated 3D model 304. In addition, by distributing virtual-viewpoint-dependent texture images whose quality is close to that of the images from the physical camera group 302, high-quality color information can be restored on the user side, and the user can play back a high-quality 3D model.

 The method of determining the viewpoint information of each virtual camera will be described with reference to FIG. 4. FIG. 4 is a schematic diagram in which a first virtual camera 401 captures a 3D model 402 of the subject and the viewpoint information of a second virtual camera 403 is generated based on the viewpoint information of the first virtual camera 401. The second virtual camera 403 moves along the optical axis of the first virtual camera 401 toward the 3D model 402 while dolly-zooming. It is then placed at a position from which it can capture the 3D model 402 within a region 404 that encloses the 3D model 402 and within which the distance information (depth information) between the 3D model and the second virtual camera 403 can be represented in 10 bits. The region 404 is not limited to a cube and may be a sphere centered on the 3D model 402. The first virtual camera 401 captures the 3D model 402 and generates a captured image 405. The first image processing device 20 estimates the 3D model 402 based on the captured images 405 captured by the physical cameras of the imaging system 10. The second virtual camera 403 then captures the estimated 3D model 402 to generate a depth image 406 and a texture image 407. The size of the 3D model 402 appearing in the depth image 406 and the texture image 407 is comparable to that of the 3D model 402 appearing in the captured image 405. As described above, the determination of the viewpoint information of the second virtual camera 403 is not limited to a dolly zoom, and the size of the 3D model 402 appearing in the depth image 406 may differ from its size in the captured image 405. Although the second virtual camera 403 has been described as moving along the optical axis of the first virtual camera 401, the position and orientation of the second virtual camera 403 need only be close to those of the first virtual camera 401, and the camera does not necessarily have to move along that optical axis. Furthermore, the viewpoint information of the second virtual camera 403 may be determined once at initialization of the second image processing device 30, or may be determined every frame in accordance with the movement of the 3D model 402.

 <Control of the compression and distribution of 3D model data and the generation of virtual viewpoint images>
 FIG. 5 is a flowchart showing the flow of processing for controlling the compression and distribution of 3D model data in the first image processing device 20 according to this embodiment. The flow shown in FIG. 5 is realized by the CPU 101 reading the control program stored in the ROM 103 into the RAM 102 and executing it. The flow of FIG. 5 is executed when triggered by the shape information generation unit 201 receiving the plurality of captured images and the viewpoint information of the physical camera group from the imaging system 10.

 In S501, the shape information generation unit 201 estimates and generates the shape information of the subject based on the plurality of captured images. The generated shape information, the viewpoint information of the physical camera group, and the plurality of captured images are output to the viewpoint determination unit 202 and the texture image generation unit 204, and the generated shape information is also output to the depth image generation unit 203. The viewpoint information of the first virtual camera group corresponding to the physical camera group, used for generating the subject's shape, is assumed to have been generated in advance. For example, as preparation before capture begins, when the physical camera group is installed, a first virtual camera group corresponding to the installed physical camera group is generated, and the shape information of the subject is generated using the viewpoint information of this first virtual camera group and the plurality of captured images.

 In S502, the viewpoint determination unit 202 generates the viewpoint information of the second virtual camera group based on the viewpoint information of the physical camera group. The generated viewpoint information of the second virtual camera group is output to the depth image generation unit 203 and, via the depth image generation unit 203, to the texture image generation unit 204.

 The process of generating the viewpoint information of the second virtual camera group is described with reference to FIG. 7. The generated viewpoint information of the second virtual camera group may also be output to the texture image generation unit 204 without passing through the depth image generation unit 203.

 In S503, the depth image generation unit 203 generates depth images of the 3D model based on the data acquired from the shape information generation unit 201 and the viewpoint determination unit 202. The generated depth images are output to the encoding unit 205.

 In S504, the texture image generation unit 204 generates texture images of the 3D model based on the data acquired from the shape information generation unit 201 and the viewpoint determination unit 202. The generated texture images are output to the encoding unit 205.

 In S505, the encoding unit 205 encodes the depth images and texture images acquired from the depth image generation unit 203 and the texture image generation unit 204. The encoded data is output to the distribution unit 206.

 In S506, the distribution unit 206 distributes 3D model data including the data acquired from the depth image generation unit 203 and the encoding unit 205 and the viewpoint information of the second virtual camera group, and this flow ends. In the image processing system 1, the 3D model data is distributed to the second image processing device 30, but this is not limiting; the 3D model data may instead be distributed to a separate server that stores 3D model data.

 FIG. 6 is a flowchart showing the flow of processing for generating a virtual viewpoint image using the depth images and texture images in the second image processing device 30 according to this embodiment. The flow of FIG. 6 is executed when triggered by the virtual camera control unit 210 receiving input values from the input device 40.

 In S601, the virtual camera control unit 210 generates the viewpoint information of the user-specified third virtual camera based on the input values from the input device 40. The generated viewpoint information of the third virtual camera is output to the virtual viewpoint image generation unit 211.

In S602, the receiving unit 207 receives the 3D model data distributed by the distribution unit 206. The received 3D model data is output to the decoding unit 208.

In S603, the decoding unit 208 decodes the depth images and texture images included in the 3D model data acquired from the receiving unit 207. If the viewpoint information of the second virtual camera group has also been encoded, it is decoded as well. The decoded depth images and texture images and the viewpoint information of the second virtual camera group are then output to the 3D model restoration unit 209.

In S604, the 3D model restoration unit 209 restores the 3D model of the subject based on the decoded depth images and texture images acquired from the decoding unit 208 and the viewpoint information of the second virtual camera group. The restored 3D model is output to the virtual viewpoint image generation unit 211. The 3D model restoration process is described with reference to FIG. 8.

In S605, the virtual viewpoint image generation unit 211 generates a virtual viewpoint image based on the viewpoint information of the third virtual camera acquired from the virtual camera control unit 210 and the 3D model acquired from the 3D model restoration unit 209, and this flow ends.

After this flow is completed, the generated virtual viewpoint image is sent to the display device 50 and displayed on the display device 50.

The above describes the control of compressing and distributing 3D model data and of generating virtual viewpoint images according to this embodiment.

<Description of the Process for Generating Viewpoint Information of the Second Virtual Camera Group>
FIG. 7 is an example of a flowchart showing the flow of the process of generating the viewpoint information of the second virtual camera group according to this embodiment. Here, the viewpoint information of the second virtual camera group is generated by the dolly-zoom method of the second virtual camera described with reference to FIG. 4. That is, the second virtual camera moves from the position of the first virtual camera, which corresponds to the position of a physical camera, along the optical axis of the first virtual camera in the direction approaching the 3D model of the subject, and the second virtual camera then adjusts its focal length so that the size of the subject on its imaging plane is maintained.
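A minimal sketch of this dolly-zoom placement follows, assuming a pinhole camera whose focal length is given in pixels and using the centroid of the subject's 3D model as the reference point: because the projected size is s = f * X / d, keeping s constant while the axial distance changes from d1 to d2 requires f2 = f1 * d2 / d1. Choosing new_distance so that the subject's depth range fits the representable range (for example, 10 bits) is outside this sketch.

```python
import numpy as np

def dolly_zoom_camera(cam_pos, optical_axis, focal_length, subject_center, new_distance):
    """Place a second virtual camera on the first virtual camera's optical axis (dolly zoom).

    cam_pos, optical_axis, focal_length: position, unit viewing direction and focal
        length (in pixels) of the first virtual camera.
    subject_center: reference point of the subject, e.g. the centroid of its 3D model.
    new_distance: desired axial distance from the new camera to the subject.
    Returns (new_position, new_focal_length); the orientation is left unchanged.
    """
    axis = np.asarray(optical_axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    # axial distance from the first camera to the subject
    old_distance = float(np.dot(np.asarray(subject_center) - np.asarray(cam_pos), axis))
    new_pos = np.asarray(cam_pos) + (old_distance - new_distance) * axis
    # keeping the projected size s = f * X / d constant gives f2 = f1 * d2 / d1
    new_focal = focal_length * (new_distance / old_distance)
    return new_pos, new_focal
```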

The flow shown in FIG. 7 is executed by the viewpoint determination unit 202. Its execution is triggered by the reception of the 3D model of the subject, the viewpoint information of the physical camera group, and the multiple captured images from the shape information generation unit 201. The flow in FIG. 7 details the control, in S502 of FIG. 5, of determining the viewpoints of the second virtual camera group based on the viewpoint information of the physical camera group. When the flow shown in FIG. 7 is executed, it is assumed that the viewpoint information of the first virtual camera group and the 3D model of the subject have already been generated in the first image processing device 20.

In S701, the 3D model of the subject, the viewpoint information of the first virtual camera group, and the multiple captured images are acquired. Note that the viewpoint information of the physical camera group may be acquired instead of the viewpoint information of the first virtual camera group; in that case, the viewpoint information of the first virtual camera group is generated using the viewpoint information of the physical camera group.

In S702, S703 and S704 are repeated for each physical camera.

In S703, S704 is repeated for each subject included in the multiple captured images. Subjects are recognized based on the results of a face detection algorithm, a person detection algorithm, or the like. Alternatively, the region of each subject in a captured image can be identified by projecting the 3D model generated separately for that subject onto the same plane as the imaging surface of the physical camera, using the viewpoint information of the physical camera.

In S704, the viewpoint information of a second virtual camera is generated based on the 3D model and the viewpoint information of the first virtual camera, using the method described with reference to FIG. 4. Repeating this process as described in S702 and S703 generates the viewpoint information of the whole second virtual camera group.
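Combining the loops of S702 and S703 with the dolly_zoom_camera helper sketched earlier, the second virtual camera group could be assembled roughly as follows; the dictionary layout and the per-subject centroid and target distance are hypothetical placeholders, not names from the original text.

```python
def build_second_camera_group(first_virtual_cameras, subjects):
    """Assemble the second virtual camera group: one camera per (first camera, subject) pair.

    first_virtual_cameras: list of dicts with 'pos', 'axis', 'focal' (hypothetical layout).
    subjects: list of dicts with 'centroid' and 'target_distance' per detected subject.
    Uses dolly_zoom_camera() from the earlier sketch.
    """
    second_cameras = []
    for cam in first_virtual_cameras:        # loop of S702: one iteration per physical camera
        for subject in subjects:             # loop of S703: one iteration per subject
            pos, focal = dolly_zoom_camera(
                cam["pos"], cam["axis"], cam["focal"],
                subject["centroid"], subject["target_distance"])
            second_cameras.append({"pos": pos, "axis": cam["axis"], "focal": focal})
    return second_cameras
```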

In S705, the generated viewpoint information of the second virtual camera group is sent to the depth image generation unit 203 and the texture image generation unit 204.

<Explanation of 3D Model Restoration Control>
FIG. 8 is an example of a flowchart showing the flow of the 3D model restoration process according to this embodiment. Here, the restoration is described for the case in which a 3D model containing virtual-viewpoint-dependent color information is generated, that is, the case in which the pixel values of the texture images of the second virtual cameras whose positions and orientations are close to those of the user-specified third virtual camera are used preferentially as the color information of the 3D model. The flow shown in FIG. 8 is executed by the 3D model restoration unit 209 and details the control, in FIG. 6, of restoring the 3D model of the subject based on the decoded data.

In S801, the 3D model restoration unit 209 acquires the depth images, the texture images, and the viewpoint information of the second virtual camera group.

In S802, the 3D model restoration unit 209 projects the depth value of each pixel of a depth image into the virtual space based on the external and internal parameters of the virtual camera corresponding to that depth image, and thereby generates components of the shape information of the subject. For example, when the 3D model of the subject is represented by a 3D point cloud, each point is a component. Performing this process for all of the acquired depth images restores the shape information of the subject.
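Back-projection is the inverse of the depth rendering step: each valid pixel is lifted along its viewing ray by its depth value and transformed into world coordinates. The sketch below assumes the same pinhole convention as the earlier rendering sketch and that the depth image stores camera-space Z values.

```python
import numpy as np

def backproject_depth(depth, K, R, t):
    """Lift a depth image back into world space (the inverse of depth rendering).

    depth: HxW array of camera-space depth values, 0 where there is no geometry.
    K: 3x3 intrinsics; R, t: world -> camera extrinsics of the second virtual camera.
    Returns an N x 3 array of world-space points, one per valid depth pixel.
    """
    v, u = np.nonzero(depth > 0)
    z = depth[v, u].astype(np.float64)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T          # normalized camera rays through each pixel
    cam = rays * z[:, None]                  # scale by depth -> camera coordinates
    world = (cam - t) @ R                    # invert x_cam = R @ x_world + t
    return world
```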

In S803, when a component of the shape information projected into the virtual space from the depth images is visible in multiple texture images, the pixel value of the texture image whose virtual camera is close in position and orientation to the user-specified virtual camera is used preferentially as the color information of that component. This is repeated for all components to restore the color information corresponding to the shape information. For example, when the shape information is a point cloud and the components are its points, color information is restored for every point.
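One possible reading of this step is sketched below: each restored point is colored from the texture image of the second virtual camera that is nearest to the user-specified viewpoint and that projects the point inside its image bounds. For brevity the cameras are ranked by position only and the occlusion test is omitted; both are simplifying assumptions.

```python
import numpy as np

def color_points(points, textures, cams, user_cam_pos):
    """Assign view-dependent colors, preferring the camera closest to the user's viewpoint.

    points: N x 3 world points restored from the depth images.
    textures: list of HxWx3 uint8 texture images, one per second virtual camera.
    cams: list of dicts with 'K', 'R', 't', 'pos' per second virtual camera (assumed layout).
    user_cam_pos: position of the user-specified third virtual camera.
    """
    order = np.argsort([np.linalg.norm(c["pos"] - user_cam_pos) for c in cams])
    colors = np.zeros((len(points), 3), dtype=np.uint8)
    assigned = np.zeros(len(points), dtype=bool)

    for idx in order:                        # nearest camera first -> highest priority
        K, R, t = cams[idx]["K"], cams[idx]["R"], cams[idx]["t"]
        tex = textures[idx]
        h, w = tex.shape[:2]
        cam = points @ R.T + t
        z = cam[:, 2]
        pix = (cam / np.clip(z[:, None], 1e-6, None)) @ K.T
        u = np.round(pix[:, 0]).astype(int)
        v = np.round(pix[:, 1]).astype(int)
        hit = (~assigned) & (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        colors[hit] = tex[v[hit], u[hit]]
        assigned |= hit
    return colors
```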

In S804, the restored 3D model is output to the virtual viewpoint image generation unit 211.

As described above, in this embodiment, the viewpoint information of the second virtual camera group is generated based on the viewpoint information of the physical camera group, and depth images and texture images are generated using the generated viewpoint information of the second virtual camera group. With this processing, even a 3D model with a complex shape can be restored with high quality on the receiving side without increasing the amount of data. Note that although the compressed 3D model data is distributed in this embodiment, it may instead be stored, and the compressed 3D model data may be distributed in response to a request from a client. Also, although the example above generates the 3D model containing virtual-viewpoint-dependent color information at the time of restoring the 3D model, this is not limiting; after the shape information of the 3D model is restored, the color information of the 3D model may be generated at the same time as the virtual viewpoint image. In that case, color information does not need to be assigned to all of the shape information of the 3D model; it suffices to generate color information only for the imaging range (viewing angle) of the user-specified third virtual camera.

<Example 2>
Example 1 described processing in which the viewpoint information of the second virtual camera group is generated based on the viewpoint information of the physical camera group, and depth images and texture images are generated using the generated viewpoint information. Example 2 describes an aspect in which a valid pixel map is generated in addition to the depth images and texture images in order to handle cases in which a subject is occluded by another subject or object. Descriptions of the parts in common with Example 1, such as the hardware configuration and functional configuration of the image processing devices, are omitted or simplified.

FIG. 9 is a schematic diagram showing an example of a method for generating a valid pixel map according to this embodiment. A subject is photographed by a physical camera, and a captured image 903 is generated. In the captured image 903, part of the subject's body is hidden by an occluding object. The occluding object is also treated as a subject.

The first image processing device 20 estimates the shapes of the subject and the occluding object from the multiple captured images taken by the physical cameras of the imaging system 10 and generates shape information for each, that is, a 3D model 901 of the subject and a 3D model 904 of the occluding object. Next, the first image processing device 20 generates the viewpoint information of the second virtual camera group described in Example 1 based on data such as the viewpoint information of the physical camera group and the shape information. The second virtual camera 905 is placed on the optical axis of the first virtual camera 902, inside a region 906 that contains the 3D model 901 of the subject and within which the depth of the subject can be expressed in 10 bits. The focal length of the second virtual camera 905 is adjusted so that the size of the subject as seen from the first virtual camera 902 is maintained. After generating the viewpoint information of the second virtual camera 905, the first image processing device 20 generates a depth image 907 and a texture image 908.

For the pixels of the subject appearing in the generated texture image 908, the pixel values of the captured image corresponding to the first virtual camera 902 are used preferentially in the region that is not occluded by the 3D model 904 of the occluding object (the left side of the subject). In the region that is occluded by the 3D model 904 (the right side of the subject), the pixel values of a captured image corresponding to a first virtual camera with a viewpoint different from that of the first virtual camera 902 are used instead. As a result, the image quality of the right and left sides of the subject in the texture image 908 may differ significantly. If the color information of the 3D model is restored from this texture image 908, the image quality of part of the 3D model is degraded, and that part may stand out and appear unnatural to the user.

A valid pixel map 909 indicating the high-quality regions of the texture image is therefore generated. For example, the valid pixel map assigns a pixel value of 1 to unoccluded regions and 0 to occluded regions; its image size is the same as that of the texture image 908. Whether a region is occluded is determined, for example, from whether each point constituting the shape information of the subject's 3D model 901 is visible from the first virtual camera 902 according to the visibility determination process described earlier. In other words, because of the visibility determination, the pixel values of the region of the occluding object 904 appearing in the captured image 903 are not used as color information of the subject's 3D model 901. The region of the texture image whose values were determined using the pixel values of the captured image of the physical camera from which the viewpoint information of the second virtual camera 905 was generated is therefore identified, and in the valid pixel map the pixels of that region are set to 1 and all others to 0. The method of determining the occluded region is not limited to this.

When restoring the color information of the 3D model, the second image processing device 30 preferentially uses, as the color information of the 3D model, the pixel values of the texture-image regions corresponding to regions whose value is 1 in the valid pixel map. Specifically, when generating virtual-viewpoint-dependent color information for the restored shape information, the second image processing device 30 uses the texture pixels corresponding to 1 in the valid pixel map and does not use the texture pixels corresponding to 0. However, when the color of a piece of shape information is stored only in texture pixels corresponding to 0 in the valid pixel map, those pixel values are used.

The valid pixel map has been described above with binary pixel values of 0 or 1, but it may also be multi-valued. In that case, the pixel values of the valid pixel map may be used as weights when generating the color information of the 3D model; that is, the priority of the texture-image pixel values is determined according to the values of the valid pixel map. When the valid pixel map is generated with 255 levels, for example, the pixel values at the subject's outline and at the boundary with the occluding object may be set to 0 and increased linearly up to 255 over a fixed distance (for example, 5 px) toward the interior of the subject. This reduces the influence of unreliable texture pixels at the outline and at the boundary with the occluding object when restoring the color information of the 3D model, so that the color information of the 3D model is generated from highly reliable pixel values.
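A sketch of generating the valid pixel map for one texture image follows, assuming the per-pixel occlusion decision from the visibility determination is already available as a boolean mask. The binary variant is a direct cast of the mask; the multi-valued variant ramps linearly from 0 at the subject outline or occlusion boundary up to 255 over a fixed number of pixels, using a distance transform.

```python
import numpy as np
import cv2

def make_valid_pixel_map(unoccluded_mask, ramp_px=5):
    """Build a valid pixel map for one texture image.

    unoccluded_mask: HxW bool, True where the texture pixel was taken from the
        captured image of the physical camera that the second virtual camera was
        generated from (i.e. the subject is not occluded in that view).
    Returns an HxW uint8 map: 0 at occluded pixels and at boundaries, ramping
    linearly up to 255 over ramp_px pixels toward the interior of the subject.
    """
    mask_u8 = unoccluded_mask.astype(np.uint8)
    dist = cv2.distanceTransform(mask_u8, cv2.DIST_L2, 3)   # pixels to nearest boundary
    ramp = np.clip(dist / float(ramp_px), 0.0, 1.0)
    return (ramp * 255).astype(np.uint8)

# Binary variant: valid_map = unoccluded_mask.astype(np.uint8)   # 1 = usable, 0 = occluded
```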

FIG. 10 is a flowchart showing the flow of processing for controlling the compression and distribution of 3D model data in the first image processing device 20 according to this embodiment. Execution of the flow in FIG. 10 starts when the shape information generation unit 201 receives the multiple captured images and the viewpoint information of the physical camera group from the imaging system 10.

S1001 to S1003 are the same as S501 to S503 in FIG. 5.

In S1004, the texture image generation unit 204 generates a texture image and a valid pixel map of the foreground model based on the data acquired from the shape information generation unit 201 and the viewpoint determination unit 202. The generated texture image and valid pixel map are output to the encoding unit 205.

In S1005, the encoding unit 205 encodes the depth image, texture image, and valid pixel map acquired from the depth image generation unit 203 and the texture image generation unit 204. The encoded depth image, texture image, and valid pixel map are output to the distribution unit 206.

In S1006, the distribution unit 206 transmits 3D model data including the depth image, texture image, and valid pixel map acquired from the depth image generation unit 203 and the encoding unit 205, together with the viewpoint information of the second virtual camera group, to the receiving unit 207, and this flow ends.

FIG. 11 is an example of a flowchart showing the flow of the 3D model restoration process according to this embodiment. The flow in FIG. 11 is executed by the 3D model restoration unit 209 and details the control of restoring the 3D model of the subject based on the data decoded in S604 of FIG. 6.

In S1101, the depth images, texture images, valid pixel maps, and the viewpoint information of the second virtual camera group are acquired.

In S1102, the depth value of each pixel of a depth image is projected into the virtual space based on the external and internal parameters of the second virtual camera corresponding to that depth image, and the shape information of the subject is restored.

In S1103, when a component of the shape information is visible in multiple texture images, the pixel value of the texture image of the second virtual camera whose position and orientation are close to those of the user-specified third virtual camera is used preferentially as the color information. At that time, the priority of a texture-image pixel value is determined according to the corresponding pixel value of the valid pixel map. When the shape information is represented by a 3D point cloud, this is repeated for every point of the point cloud to generate the color information corresponding to the shape information.
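The text specifies a priority rather than an explicit formula, so the following is only one way to realize the weighting: each candidate texture pixel is weighted by its valid-pixel-map value and by the proximity of its second virtual camera to the third virtual camera. The inverse-distance weighting and the fallback rule are illustrative assumptions.

```python
import numpy as np

def blend_color(candidate_colors, valid_weights, view_distances, eps=1e-6):
    """Blend the color candidates of one 3D point using the valid pixel map as a weight.

    candidate_colors: M x 3 texture pixel values, one per second virtual camera seeing the point.
    valid_weights: M values in [0, 1] read from each camera's valid pixel map.
    view_distances: M distances between each second virtual camera and the third virtual camera.
    """
    proximity = 1.0 / (np.asarray(view_distances, dtype=float) + eps)  # closer cameras count more
    w = np.asarray(valid_weights, dtype=float) * proximity
    if w.sum() < eps:            # only low-confidence candidates: fall back to proximity alone
        w = proximity
    w = w / w.sum()
    return (np.asarray(candidate_colors, dtype=float) * w[:, None]).sum(axis=0)
```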

S1104 is the same as S804 in FIG. 8.

After this flow is completed, the virtual viewpoint image generation unit 211 generates a virtual viewpoint image based on the restored 3D model and the viewpoint information of the user-specified third virtual camera, and the generated virtual viewpoint image is displayed on the display device 50.

The above describes the control of compressing and distributing 3D model data including valid pixel maps and of restoring the 3D model. With this processing, a 3D model can be restored using highly reliable texture-image pixel values even when the subject is partly hidden by an occluding object.

<Other Examples>
The present disclosure can also be realized by a process in which a program that implements one or more functions of the above-described examples is supplied to a system or device via a network or a storage medium and one or more processors in a computer of that system or device read and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more of those functions.

The disclosure of this embodiment includes the following configurations, methods, systems, and programs.

(Configuration 1)
An image processing device comprising: a setting means for setting the positions and orientations of a plurality of second virtual cameras based on the optical axes of a plurality of first virtual cameras in a virtual space corresponding to the positions and orientations of a plurality of imaging devices in real space; and a generation means for generating a plurality of depth images indicating the distance between each of the plurality of second virtual cameras and a 3D model of a subject generated based on a plurality of captured images acquired by the plurality of imaging devices.

(Configuration 2)
The image processing device according to Configuration 1, wherein the plurality of second virtual cameras are set on the optical axes of the plurality of first virtual cameras.

(Configuration 3)
The image processing device according to Configuration 1, further comprising an encoding means for encoding the depth images.

(Configuration 4)
The image processing device according to Configuration 3, further comprising an output means for outputting the plurality of encoded depth images and viewpoint information indicating the positions and orientations of the plurality of second virtual cameras to another device that reconstructs the 3D model based on the plurality of encoded depth images and the viewpoint information.

(Configuration 5)
The image processing device according to Configuration 4, wherein the generation means generates, for each of the plurality of second virtual cameras, a virtual viewpoint image including the 3D model of the subject, the encoding means encodes the plurality of virtual viewpoint images, and the output means outputs the plurality of encoded virtual viewpoint images to the other device.

(Configuration 6)
The image processing device according to Configuration 5, wherein the generation means generates, for each of the plurality of second virtual cameras, correspondence information indicating whether each pixel in the virtual viewpoint image corresponds to a component of the 3D model of the subject, the encoding means encodes the plurality of pieces of correspondence information, and the output means outputs the plurality of encoded pieces of correspondence information to the other device.

(Configuration 7)
The image processing device according to Configuration 1, wherein the plurality of second virtual cameras are set at positions closer to the 3D model of the subject than the plurality of first virtual cameras.

(Configuration 8)
The image processing device according to Configuration 1, wherein the generation means generates one depth image for one second virtual camera.

(Configuration 9)
An image processing device comprising: an acquisition means for acquiring a plurality of encoded depth images indicating the distance between each of a plurality of second virtual cameras, set based on the optical axes of a plurality of first virtual cameras in a virtual space corresponding to the positions and orientations of a plurality of imaging devices in real space, and a 3D model of a subject generated based on a plurality of captured images acquired by the imaging devices, and first viewpoint information indicating the positions and orientations of the plurality of second virtual cameras; a decoding means for decoding the plurality of encoded depth images; and a generation means for generating the 3D model of the subject based on the plurality of decoded depth images and the first viewpoint information.

(Configuration 10)
The image processing device according to Configuration 9, wherein second viewpoint information indicating the position and orientation of a third virtual camera different from the first virtual cameras and the second virtual cameras is acquired, and the generation means generates a virtual viewpoint image based on the second viewpoint information and the 3D model of the subject.

(Method 1)
An image processing method comprising: a setting step of setting the positions and orientations of a plurality of second virtual cameras based on the optical axes of a plurality of first virtual cameras in a virtual space corresponding to the positions and orientations of a plurality of imaging devices in real space; and a generation step of generating a plurality of depth images indicating the distance between each of the plurality of second virtual cameras and a 3D model of a subject generated based on a plurality of captured images acquired by the plurality of imaging devices.

(Method 2)
An image processing method comprising: an acquisition step of acquiring a plurality of encoded depth images indicating the distance between each of a plurality of second virtual cameras, set based on the optical axes of a plurality of first virtual cameras in a virtual space corresponding to the positions and orientations of a plurality of imaging devices in real space, and a 3D model of a subject generated based on a plurality of captured images acquired by the imaging devices, and first viewpoint information indicating the positions and orientations of the plurality of second virtual cameras; a decoding step of decoding the plurality of encoded depth images; and a generation step of generating the 3D model of the subject based on the plurality of decoded depth images and the first viewpoint information.

(Program)
A program for causing a computer to function as each of the means of the image processing device according to any one of Configurations 1 to 10.

The present invention is not limited to the above-described embodiments, and various changes and modifications are possible without departing from the spirit and scope of the present invention. Therefore, the following claims are appended to publicly define the scope of the present invention.

This application claims priority based on Japanese Patent Application No. 2024-073062 filed on April 26, 2024, the entire contents of which are incorporated herein by reference.

202 viewpoint determination unit
203 depth image generation unit
204 texture image generation unit
205 encoding unit

Claims (13)

1. An image processing device comprising: a setting means for setting the positions and orientations of a plurality of second virtual cameras based on the optical axes of a plurality of first virtual cameras in a virtual space corresponding to the positions and orientations of a plurality of imaging devices in real space; and a generation means for generating a plurality of depth images indicating the distance between each of the plurality of second virtual cameras and a 3D model of a subject generated based on a plurality of captured images acquired by the plurality of imaging devices.

2. The image processing device according to claim 1, wherein the plurality of second virtual cameras are set on the optical axes of the plurality of first virtual cameras.

3. The image processing device according to claim 1, further comprising an encoding means for encoding the depth images.

4. The image processing device according to claim 3, further comprising an output means for outputting the plurality of encoded depth images and viewpoint information indicating the positions and orientations of the plurality of second virtual cameras to another device that reconstructs the 3D model based on the plurality of encoded depth images and the viewpoint information.

5. The image processing device according to claim 4, wherein the generation means generates, for each of the plurality of second virtual cameras, a virtual viewpoint image including the 3D model of the subject, the encoding means encodes the plurality of virtual viewpoint images, and the output means outputs the plurality of encoded virtual viewpoint images to the other device.

6. The image processing device according to claim 5, wherein the generation means generates, for each of the plurality of second virtual cameras, correspondence information indicating whether each pixel in the virtual viewpoint image corresponds to a component of the 3D model of the subject, the encoding means encodes the plurality of pieces of correspondence information, and the output means outputs the plurality of encoded pieces of correspondence information to the other device.

7. The image processing device according to claim 1, wherein the plurality of second virtual cameras are set at positions closer to the 3D model of the subject than the plurality of first virtual cameras.

8. The image processing device according to claim 1, wherein the generation means generates one depth image for one second virtual camera.

9. An image processing device comprising: an acquisition means for acquiring a plurality of encoded depth images indicating the distance between each of a plurality of second virtual cameras, set based on the optical axes of a plurality of first virtual cameras in a virtual space corresponding to the positions and orientations of a plurality of imaging devices in real space, and a 3D model of a subject generated based on a plurality of captured images acquired by the imaging devices, and first viewpoint information indicating the positions and orientations of the plurality of second virtual cameras; a decoding means for decoding the plurality of encoded depth images; and a generation means for generating the 3D model of the subject based on the plurality of decoded depth images and the first viewpoint information.

10. The image processing device according to claim 9, wherein second viewpoint information indicating the position and orientation of a third virtual camera different from the first virtual cameras and the second virtual cameras is acquired, and the generation means generates a virtual viewpoint image based on the second viewpoint information and the 3D model of the subject.

11. An image processing method comprising: a setting step of setting the positions and orientations of a plurality of second virtual cameras based on the optical axes of a plurality of first virtual cameras in a virtual space corresponding to the positions and orientations of a plurality of imaging devices in real space; and a generation step of generating a plurality of depth images indicating the distance between each of the plurality of second virtual cameras and a 3D model of a subject generated based on a plurality of captured images acquired by the plurality of imaging devices.

12. An image processing method comprising: an acquisition step of acquiring a plurality of encoded depth images indicating the distance between each of a plurality of second virtual cameras, set based on the optical axes of a plurality of first virtual cameras in a virtual space corresponding to the positions and orientations of a plurality of imaging devices in real space, and a 3D model of a subject generated based on a plurality of captured images acquired by the imaging devices, and first viewpoint information indicating the positions and orientations of the plurality of second virtual cameras; a decoding step of decoding the plurality of encoded depth images; and a generation step of generating the 3D model of the subject based on the plurality of decoded depth images and the first viewpoint information.

13. A program for causing a computer to function as each of the means of the image processing device according to any one of claims 1 to 10.
