WO2024025668A1 - Computing images of controllable dynamic scenes - Google Patents
Computing images of controllable dynamic scenes Download PDFInfo
- Publication number
- WO2024025668A1 (PCT/US2023/025095)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- cage
- image
- computing
- samples
- scene
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/40—3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
- G06T15/08—Volume rendering
Definitions
- a dynamic scene is an environment in which one or more objects are moving, in contrast to a static scene where all objects are stationary.
- An example of a dynamic scene is a person’s face which moves as the person talks.
- Another example of a dynamic scene is a propeller of an aircraft which is rotating.
- Another example of a dynamic scene is a standing person with moving arms.
- computing synthetic images of dynamic scenes is a complex task since a rigged three dimensional (3D) model of the scene and its dynamics is needed. Obtaining such a rigged 3D model is complex and time consuming and involves manual work.
- Synthetic images of dynamic scenes are used for a variety of purposes such as computer games, films, video communications and more.
- the images are computed in real time (such as at 30 frames per second or more) and are photorealistic, that is the images have characteristics generally matching those of empirical images and/or video.
- a computer-implemented method of computing an image of a dynamic 3D scene comprising a 3D object.
- the method comprises receiving a description of a deformation of the 3D object, the description comprising a cage of primitive 3D elements and associated animation data from a physics engine or an articulated object model.
- For a pixel of the image the method computes a ray from a virtual camera through the pixel into the cage animated according to the animation data and computes a plurality of samples on the ray. Each sample is a 3D position and view direction in one of the 3D elements.
- the method computes a transformation of the samples into a canonical version of the cage to produce transformed samples.
- for each transformed sample, the method queries a learnt radiance field parameterization of the 3D scene to obtain a color value and an opacity value.
- a volume rendering method is applied to the color and opacity values to produce a pixel value of the image.
- FIG. 1 is a schematic diagram of an image animator for computing images of controllable dynamic scenes
- FIG. 2 shows a deformation description and three images of a person’s head computed using the image animator of FIG. 1;
- FIG. 3 shows a chair and an image of the chair shattering computed using the image animator of FIG. 1;
- FIG. 4 is a flow diagram of an example method performed by the image animator of FIG. 1;
- FIG. 5 is a schematic diagram of a ray in a deformed cage, the ray transformed to a canonical cage, a volume lookup and volume rendering;
- FIG. 6 is a flow diagram of a method of sampling
- FIG. 7 is a flow diagram of a method of computing an image of a person depicting their mouth open
- FIG. 8 is a flow diagram of a method of training a machine learning model and computing a cache
- FIG. 9 illustrates an exemplary computing-based device in which embodiments of an animator for computing images of controllable dynamic scenes is implemented.
- Radiance field parameterizations represent a radiance field which is a function from five dimensional (5D) space to four dimensional (4D) space (referred to as a field) where values of radiance are known for each pair of 3D point and 2D view direction in the field.
- a radiance value is made up of a color value and an opacity value.
- a radiance field parameterization may be a trained machine learning model such as a neural network, support vector machine, random decision forest or other machine learning model which learns an association between radiance values and pairs of 3D points and view directions.
- a radiance field parametrization is a cache of associations between radiance values and 3D points, where the associations are obtained from a trained machine learning model.
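- expressed compactly (using illustrative notation, not symbols taken from the claims), the radiance field described above is a mapping from a 3D position and a 2D view direction to a color and an opacity:

```latex
% Radiance field as a 5D-to-4D mapping (illustrative notation)
F_{\Theta} : (\mathbf{x}, \mathbf{d}) \mapsto (\mathbf{c}, \sigma),
\qquad \mathbf{x} = (x, y, z),\quad \mathbf{d} = (\theta, \phi),\quad \mathbf{c} = (r, g, b)
```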
- Volume rendering methods compute an image from a radiance field for a particular camera viewpoint by examining radiance values of points along rays which form the image.
- Volume rendering software is well known and commercially available.
- synthetic images of dynamic scenes are used for a variety of purposes such as computer games, films, video communications, telepresence and others.
- Precise control is desired for many applications such as where synthetic images of an avatar of a person in a video call are to accurately depict the facial expression of the real person.
- Precise control is also desired for video game applications where an image of a particular chair is to be made to shatter in a realistic manner.
- These examples of the video call and video game are not intended to be limiting but rather to illustrate uses of the present technology.
- the technology can be used to capture any scene which is static or dynamic such as objects, vegetation, environments, humans or other scenes.
- Enrollment is another problem that arises when generating synthetic images of dynamic scenes. Enrollment is where a radiance field parameterization is created for a particular 3D scene, such as a particular person or a particular chair. Some approaches to enrollment use large quantities of training images depicting the particular 3D scene over time and from different viewpoints. Where enrollment is time consuming and computationally burdensome difficulties arise.
- the present technology provides a precise way to control how images of dynamic scenes animate.
- a user or an automated process, is able to specify parameter values such as volumetric blendshapes and skeleton values which are applied to a cage of primitive 3D elements.
- the user or automated process is able to precisely control deformation of a 3D object to be depicted in a synthetic image.
- a user or an automated process is able to use animation data from a physics engine to precisely control deformation of the 3D object to be depicted in the synthetic image.
- a blendshape is a mathematical function which when applied to a parameterized 3D model changes parameter values of the 3D model. In an example, where the 3D model is of a person’s head there may be several hundred blendshapes, each blendshape changing the 3D model according to a facial expression or an identity characteristic.
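- as a minimal sketch of how such blendshapes could be applied (assuming a numpy representation; the array names and shapes are illustrative, not the patent's own data structures):

```python
import numpy as np

def apply_blendshapes(neutral_vertices, blendshape_deltas, weights):
    """Deform a neutral mesh by a weighted sum of blendshape offsets.

    neutral_vertices:  (V, 3) rest-pose vertex positions.
    blendshape_deltas: (B, V, 3) per-blendshape vertex offsets.
    weights:           (B,) blendshape activations, e.g. expression strengths.
    """
    return neutral_vertices + np.tensordot(weights, blendshape_deltas, axes=1)
```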
- the present technology reduces the burden of enrollment in some examples. Enrollment burden is reduced by using a reduced amount of training images, such as training image frames from only one or only two time instants.
- the present technology is able to operate in real time (such as at 30 frames per second or more) in some examples. This is achieved by using optimizations when computing a transform of sample points to a canonical space used by the radiance field parameterization.
- the present technology operates with good generalization ability in some cases.
- the technology can use the model dynamics from the face model or physics engine to animate the scene beyond the training data in a physically meaningful way to generalize well.
- FIG. 1 is a schematic diagram of an image animator 100 for computing synthetic images of dynamic scenes.
- the image animator 100 is deployed as a web service.
- the image animator 100 is deployed at a personal computer or other computing device which is in communication with a head worn computer 114 such as a head mounted display device.
- the image animator 100 is deployed in a companion computing device of head worn computer 114.
- the image animator 100 comprises a radiance field parametrization 102, at least one processor 104, a memory 106 and a volume renderer 108.
- the radiance field parametrization 102 is a neural network, or a random decision forest, or a support vector machine or other type of machine learning model. It has been trained to predict pairs of color and opacity values for three dimensional points and view directions in a canonical space of a dynamic scene, and more detail about the training process is given later in this document.
- the radiance field parametrization 102 is a cache storing associations between three dimensional points in the canonical space and color and opacity values.
- the volume renderer 108 is a well-known computer graphics volume renderer which takes pairs of color and opacity values of three dimensional points along rays and computes an output image 116.
- the image animator 100 is configured to receive queries from client devices such as smart phone 122, computer game apparatus 110, head worn computer 114, film creation apparatus 120 or other client device. The queries are sent from the client devices over a communications network 124 to the image animator 100.
- a query from a client device comprises a specified viewpoint of a virtual camera, specified values of intrinsic parameters of the virtual camera and a deformation description 118.
- a synthetic image is to be computed by the image animator 100 as if it had been captured by the virtual camera.
- the deformation description describes desired dynamic content of the scene in the output image 116.
- the image animator 100 receives a query and in response generates a synthetic output image 116 which it sends to the client device.
- the client device uses the output image 116 for one of a variety of useful purposes including but not limited to: generating a virtual webcam stream, generating video of a computer video game, generating a hologram for display by a mixed-reality head worn computing device, generating a film.
- the image animator 100 is able to compute synthetic images of a dynamic 3D scene, for particular specified desired dynamic content and particular specified viewpoints, on demand.
- the dynamic scene is a face of a talking person.
- the image animator 100 is able to compute synthetic images of the face from a plurality of viewpoints and with any specified dynamic content.
- examples of specified viewpoints and dynamic content are: plan view, eyes shut, face tilted upwards, smile; perspective view, eyes open, mouth open, angry expression.
- the image animator 100 is able to compute synthetic images for viewpoints and deformation descriptions which were not present in training data used to train the radiance field parameterization 102 since the machine learning used to create the radiance field parameterization 102 is able to generalize.
- the deformation description is obtained using a physics engine 126 in some cases so that a user or an automated process is able to apply physics rules to shatter a 3D object depicted in the synthetic output image 116 or to apply other physics rules to depict animations such as bouncing, waving, rocking, dancing, rotating, spinning or other animations. It is possible to use a Finite Element Method to apply physical simulations to a cage of 3D primitive elements to create the deformation description such as to produce elastic deformation or shattering.
- the deformation description is obtained using a face or body tracker 124 in some cases such as where an avatar of a person is being created. By selecting the viewpoint and the intrinsic camera parameter values it is possible to control characteristics of the synthetic output image.
- the image animator operates in an unconventional manner to enable synthetic images of dynamic scenes to be generated in a controllable manner.
- Many alternative methods of using machine learning to generate synthetic images have little or no ability to control content depicted in the synthetic images which are generated.
- the image animator 100 improves the functioning of the underlying computing device by enabling synthetic images of dynamic scenes to be computed in a manner whereby the content and viewpoint of the dynamic scene is controllable.
- the functionality of the image animator 100 is performed, at least in part, by one or more hardware logic components.
- illustrative types of hardware logic components include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), Graphics Processing Units (GPUs).
- the functionality of the image animator 100 is located at a client device, or is shared between a client device and the cloud.
- FIG. 2 shows a deformation description 200 and three images 204, 206, 208 of a person’s head computed using the image animator 100 of FIG. 1, each image showing the person’s head animated in a different way such as with the mouth open or closed.
- the deformation description 200 is a cage of primitive 3D elements which in the example of FIG. 2 are tetrahedra although other primitive 3D elements are used in some examples such as spheres or cuboids.
- the cage of tetrahedra extends from a surface mesh of the person’s head so as to include a volume around the head which is useful to represent hair of the person and any headgear worn by the person.
- the volume around the object in the cage is useful because modelling the volume with volume rendering methods results in more photorealistic images and the cage only needs to approximate the mesh; this reduces the complexity of the cage for objects with many parts (the cage for a plant does not need a separate part for each leaf, it just needs to cover all foliage) and allows the same cage to be used for objects of the same type that have a similar shape (different chairs can use the same cage).
- the cage can be intuitively deformed and controlled by users, physics-based simulation, or traditional automated animation techniques like blendshapes. Human faces are a particularly difficult case due to a non-trivial combination of rigid and (visco)elastic motion and yet the present technology performs well for human faces as described in more detail below.
- volumetric three dimensional morphable model (Vol3DMM) which is a parametric 3D face model which animates a surface mesh of a person’s head and the volume around the mesh using a skeleton and blendshapes.
- a user or an automated process is able to specify values of parameters of the Vol3DMM model which are used to animate the Vol3DMM model in order to create the images 204 to 208 as described in more detail below. Different values of the parameters of the Vol3DMM model are used to produce each of the three images 204 to 208.
- the Vol3DMM model together with parameter values is an example of a deformation description.
- Vol3DMM animates a volumetric mesh with a sequence of volumetric blendshapes and a skeleton. It is a generalization of parametric three dimensional morphable models (3DMM) models, which animate a mesh with a skeleton and blendshapes, to a parametric model to animate a volume around a mesh.
- the skeleton has four bones: a root bone controlling rotation, a neck bone, a left eye bone, and a right eye bone.
- skinning weights for the tetrahedral mesh are taken from the 3DMM mesh; that is, a tetrahedron vertex has the skinning weights of the closest vertex in the 3DMM mesh.
- the volumetric blendshapes are created by extending the 224 expression blendshapes and the 256 identity blendshapes of the 3DMM model to the volume surrounding its template mesh: the i-th volumetric blendshape of Vol3DMM is created as a tetrahedral embedding of the mesh of the i-th 3DMM blendshape.
- the tetrahedral embedding creates a single volumetric structure from a generic mesh and creates an accurate embedding that accounts for face geometry and face deformations: it avoids tetrahedral inter-penetrations between upper and lower lips, it defines a volumetric support that covers hair, and it has higher resolution in areas subject to more deformation.
- the exact number of bones or blendshapes is inherited from the specific instance of 3DMM model chosen, but the technique can be applied to different 3DMM models using blendshapes and/or skeletons to model faces, bodies, or other objects.
- Vol3DMM is controlled and posed with the same identity, expression, and pose parameters α, β, θ as a 3DMM face model.
- the parameters α, β, θ are used to pose the tetrahedral mesh of Vol3DMM to define the physical space, while a canonical space is defined for each subject by posing Vol3DMM with identity parameters α and setting β, θ to zero for a neutral pose.
- the decomposition into identity, expression, and pose is inherited from the specific instance of 3DMM model chosen.
- the technology to train and/or animate adapts to different decompositions by constructing a corresponding Vol3DMM model for the specific 3DMM model chosen.
- FIG. 3 shows a chair 300 and a synthetic image 302 of the chair shattering computed using the image animator of FIG. 1.
- the deformation description comprises a cage around the chair 300 where the cage is formed of primitive 3D elements such as tetrahedra, spheres or cuboids.
- the deformation description also comprises information such as rules from a physics engine about how objects behave when they shatter.
- FIG. 4 is a flow diagram of an example method performed by the image animator of FIG. 1.
- Inputs 400 to the method comprise a deformation description, camera viewpoint and camera parameters.
- the camera viewpoint is a viewpoint of a virtual camera for which a synthetic image is to be generated.
- the camera parameters are lens and sensor parameters such as image resolution, field of view, focal length.
- the type and format of the deformation description depends on the type and format of the deformation description used in the training data when the radiance field parameterization was trained. The training process is described later with respect to FIG. 7.
- FIG. 4 is concerned with test time operation after the radiance field parameterization has been learnt.
- the deformation description is a vector of concatenated parameter values of a parameterized 3D model of an object in the dynamic scene such as a Vol3DMM model.
- the deformation description is one or more physics based rules from a physics engine to be applied to a cage of primitive 3D elements encapsulating the 3D object to be depicted and extending into a volume around the 3D object.
- the inputs 400 comprise default values for some or all of the deformation description, the viewpoint, the intrinsic camera parameters.
- the inputs 400 are from a user or from a game apparatus or other automated process.
- the inputs 400 are made according to game state from a computer game or according to state received from a mixed- reality computing device.
- a face or body tracker 420 provides values of the deformation description.
- the face or body tracker is a trained machine learning model which takes as input captured sensor data depicting at least part of a person’s face or body and predicts values of parameters of a 3D face model or 3D body model of the person.
- the parameters are shape parameters, pose parameters or other parameters.
- the deformation description comprises a cage 418 of primitive 3D elements.
- the cage of primitive 3D elements represents the 3D object to be depicted in the image and a volume extending from the 3D object.
- the cage comprises a volumetric mesh with a plurality of volumetric blendshapes and a skeleton.
- the cage is computed from the learnt radiance field parameterization by computing a mesh from the density of the learnt radiance field volume using Marching Cubes and computing a tetrahedral embedding of the mesh.
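- a minimal sketch of this mesh extraction step, assuming scikit-image's Marching Cubes and a `field(points, dirs) -> (rgb, sigma)` query function; the grid and density threshold are illustrative choices, and mapping the vertices from voxel units back to world space is omitted:

```python
import numpy as np
from skimage import measure

def mesh_from_density(field, grid, level=10.0):
    """Extract a surface mesh from the learnt density with Marching Cubes."""
    # Query the radiance field for density only; the view direction is irrelevant here.
    points = grid.reshape(-1, 3)
    dirs = np.broadcast_to(np.array([0.0, 0.0, 1.0]), points.shape)
    _, sigma = field(points, dirs)
    volume = sigma.reshape(grid.shape[:3])
    verts, faces, _, _ = measure.marching_cubes(volume, level=level)
    return verts, faces  # vertices are in voxel units
```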
- the cage 418 of primitive 3D elements is a deformed version of a canonical cage. That is, to produce a modified version of the scene the method begins by deforming a canonical cage to a desired shape which is the deformation description. The method is agnostic to the way in which the deformed cage is generated and what kind of an object is deformed.
- using a cage to control and parametrize volume deformation enables deformation to be represented and applied to the scene in real time; it is capable of representing both smooth and discontinuous functions and allows for intuitive control by changing the geometry of the cage.
- This geometric control is compatible with machine learning models, physics engines, and artist generation software thereby allowing good extrapolation or generalization to configurations not observed in training.
- the dynamic scene image generator computes a plurality of rays, each ray associated with a pixel of an output image 116 to be generated by the image animator. For a given pixel (x, y position in the output image) the image animator computes a ray that goes from the virtual camera through the pixel into the deformation description comprising the cage. To compute the ray the image animator uses geometry and the selected values of the intrinsic camera parameters as well as the camera viewpoint. The rays are computed in parallel where possible in order to give efficiencies since there is one ray to be computed per pixel.
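- a minimal sketch of this ray computation for a pinhole virtual camera, assuming numpy and a 4x4 camera-to-world matrix; the parameter names are illustrative:

```python
import numpy as np

def pixel_ray(x, y, fx, fy, cx, cy, cam_to_world):
    """Ray origin and direction through pixel (x, y) of a pinhole virtual camera.

    fx, fy, cx, cy: intrinsic parameters (focal lengths, principal point).
    cam_to_world:   4x4 matrix encoding the camera viewpoint.
    """
    # Direction in camera space; the camera looks down -z by convention here.
    d_cam = np.array([(x - cx) / fx, (y - cy) / fy, -1.0])
    d_world = cam_to_world[:3, :3] @ d_cam
    d_world = d_world / np.linalg.norm(d_world)
    origin = cam_to_world[:3, 3]
    return origin, d_world
```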
- the image animator samples a plurality of points along the ray. Generally speaking, the more points sampled, the better the quality of the output image.
- a ray is selected at random and samples are drawn within specified bounds obtained from scene knowledge 416.
- the specified bounds are computed from training data which has been used to train the machine learning system.
- the bounds indicate a size of the dynamic scene so that the one or more samples are taken from regions of the rays which are in the dynamic scene.
- To compute the bounds from the training data standard image processing techniques are used to examine training images. It is also possible for the bounds of the dynamic scene to be manually specified by an operator or for the bounds to be measured automatically using a depth camera, global positioning system (GPS) sensor or other position sensor.
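- a minimal sketch of stratified sampling within such bounds, assuming numpy; t_near and t_far stand for the scene bounds along the ray, and the assignment of each sample to a 3D primitive element is left to a separate lookup:

```python
import numpy as np

def sample_ray(origin, direction, t_near, t_far, n_samples, rng):
    """Draw one stratified depth sample per bin within the scene bounds."""
    edges = np.linspace(t_near, t_far, n_samples + 1)
    t = rng.uniform(edges[:-1], edges[1:])                 # one sample per bin
    points = origin[None, :] + t[:, None] * direction[None, :]
    return t, points
```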
- Each sample is assigned an index of a 3D primitive element of the deformed cage that the sample falls within.
- the image animator transforms the samples from the deformation description cage to a canonical cage.
- a canonical cage is a version of the cage representing the 3D object in a rest state or other specified origin state, such as where the parameter values are zero.
- the canonical cage represents the head of the person looking straight at the virtual camera, with eyes open and mouth shut and a neutral expression.
- the transform of the samples to the canonical cage is computed using barycentric coordinates as described below.
- using barycentric coordinates is a particularly efficient way of computing the transform.
- a tetrahedron, one fundamental building block, is a triangular pyramid with four vertices and four triangular faces. Define the undeformed ‘rest’ positions of its four constituent vertices as v1, v2, v3 and v4.
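- a minimal sketch of the barycentric mapping for a single sample, assuming numpy; the (4, 3) vertex arrays and function names are illustrative:

```python
import numpy as np

def to_canonical(p, deformed_tet, canonical_tet):
    """Map sample p from a deformed tetrahedron into the canonical cage.

    deformed_tet, canonical_tet: (4, 3) arrays of tetrahedron vertex positions.
    """
    v0, v1, v2, v3 = deformed_tet
    # Barycentric coordinates b1..b3 with respect to the deformed tetrahedron.
    edges = np.stack([v1 - v0, v2 - v0, v3 - v0], axis=1)   # 3x3
    b123 = np.linalg.solve(edges, p - v0)
    b = np.concatenate([[1.0 - b123.sum()], b123])
    inside = bool(np.all(b >= 0.0) and np.all(b <= 1.0))    # point-in-tetrahedron test
    # The same weights applied to the canonical vertices give the transformed sample.
    return b @ canonical_tet, inside
```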
- the transform of the samples to the canonical cage is computed using affine transformations instead, which are expressive enough for large rigidly moving sections of the motion field.
- an optimization is optionally used to compute the transform at operation 406 by optimizing primitive point lookups.
- the optimization comprises computing the transformation P of a sample by setting P equal to a normalized distance between a previous and a next intersection of a tetrahedron on a ray, times the sum, at the previous intersection, over four vertices of a tetrahedron of a barycentric coordinate of the vertex times a canonical coordinate of the vertex, plus one minus the normalized distance, times the sum, at the next intersection, over four vertices of the tetrahedron of the barycentric coordinate of a vertex times the canonical coordinate of the vertex.
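- written out (with s denoting the normalized distance along the ray between the previous and next tetrahedron intersections, b_v the barycentric coordinate of vertex v and c_v its canonical coordinate; the symbols are chosen here for readability rather than taken from the claims), the transformation reads:

```latex
P \;=\; s \sum_{v=1}^{4} b_{v}^{\mathrm{prev}}\,\mathbf{c}_{v}
\;+\; \left(1 - s\right) \sum_{v=1}^{4} b_{v}^{\mathrm{next}}\,\mathbf{c}_{v}
```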
- This optimization is found to give significant improvement in processing time such that real time operation of the process of FIG. 4 is possible at over 30 frames per second (i.e. to compute more than 30 images per second where the processor is a single RTX 3090 (trade mark) graphics processing unit).
- Operation 407 is optional and comprises rotating a view direction of at least one of the rays.
- rotating a view direction of a ray of the sample is done prior to querying the learnt radiance field.
- Computing a rotation R of the view direction for a small fraction of the primitive 3D elements and propagating the value of R to remaining tetrahedra via nearest neighbor interpolation is found to give good results in practice.
- the dynamic scene image generator queries 408 the radiance field parametrization 102.
- the radiance field parametrization has already been trained to produce color and density values, given a point in the canonical 3D cage and an associated viewing direction.
- the radiance field parameterization produces a pair of values comprising a color and an opacity at the sampled point in the canonical cage.
- the method computes a plurality of color and opacity values 410 of 3D points and view directions in the canonical cage with the deformation description applied.
- the learnt radiance field parametrization 102 is a cache of associations between 3D points and view directions in the canonical version of the cage and color and opacity values, obtained by querying a machine learning model trained using training data comprising images of the dynamic scene from a plurality of viewpoints.
- by using a cache of values rather than querying a machine learning model directly, significant speed ups are achieved.
- the radiance field is a function, in some examples parameterized by a multi-layer perceptron (MLP), which is queried to obtain the colour c as well as the density σ at a position in space.
- the function is also conditioned on the direction of the ray, which allows it to model view-dependent effects such as specular reflections.
- For each ray, a volume rendering 412 method is applied to the color and opacity values computed along that ray, to produce a pixel value of the output image. Any well-known computer graphics method for volume ray tracing is used. Where real time operation is desired, hardware-accelerated volume rendering is used.
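- a minimal sketch of such a volume rendering step for one ray, assuming numpy; this is the standard emission-absorption quadrature rather than the patent's specific implementation:

```python
import numpy as np

def composite(rgb, sigma, t):
    """Integrate per-sample colors and densities along one ray into a pixel color.

    rgb: (N, 3) colors, sigma: (N,) densities, t: (N,) sample depths.
    """
    delta = np.diff(t, append=t[-1] + 1e10)            # spacing between samples
    alpha = 1.0 - np.exp(-sigma * delta)               # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1] + 1e-10]))
    weights = alpha * trans                            # contribution of each sample
    return (weights[:, None] * rgb).sum(axis=0)
```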
- the output image is stored 414 or inserted into a virtual webcam stream or used for telepresence, a game or other applications.
- FIG. 5 is a schematic diagram of a ray in a deformed cage 500, the ray transformed to a canonical cage 502, a volume lookup 504 and volume rendering 506.
- a ray is cast from the camera center, through the pixel into the scene in its deformed state.
- a number of samples are generated along the ray and then each sample is mapped to the canonical space using the deformation Mj of the corresponding tetrahedron j.
- the volumetric representation of the scene is then queried with the transformed sample position p'_j and the direction of the ray rotated based on the rotation of the j-th tetrahedron.
- the resulting per-sample opacity and color values are then integrated using volume rendering as in equation one.
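- pulling the earlier sketches together, an end-to-end per-pixel rendering loop could look as follows; `find_tetrahedron`, the per-tetrahedron `rotations` and the `cam` attributes are assumed helpers introduced here for illustration only:

```python
import numpy as np

def render_pixel(x, y, cam, deformed_cage, canonical_cage, field, rng):
    """Sketch of FIG. 5: deformed-space ray, canonical-space lookup, compositing."""
    origin, direction = pixel_ray(x, y, *cam.intrinsics, cam.cam_to_world)
    t, points = sample_ray(origin, direction, cam.t_near, cam.t_far, 64, rng)
    rgb, sigma = [], []
    for p in points:
        j = deformed_cage.find_tetrahedron(p)                     # assumed lookup
        p_canon, _ = to_canonical(p, deformed_cage.tets[j], canonical_cage.tets[j])
        d_canon = deformed_cage.rotations[j] @ direction          # rotated view direction
        c, s = field(p_canon[None, :], d_canon[None, :])
        rgb.append(c[0])
        sigma.append(s[0])
    return composite(np.array(rgb), np.array(sigma), np.array(t))
```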
- FIG. 6 is a flow diagram of a method of sampling. The method comprises querying the learnt radiance field of the 3D scene to obtain a color value and an opacity value, using only one radiance field network 600 and increasing a size 602 of sampling bins.
- Volumetric rendering typically involves sampling the depth along each ray.
- a sampling strategy is used which enables capturing thin structures and fine details as well as improving sampling bounds.
- the method gives improved quality at a fixed sample count.
- Some approaches represent the scene with two Multi-Layer Perceptrons (MLPs): a ’coarse’ and a ’fine’ one.
- Nc samples are evaluated by the coarse network to obtain a coarse estimate of the opacities along the ray. These estimates then guide a second round of Nf samples, placed around the locations where opacity values are the largest. The fine network is then queried at both coarse and fine sample locations, leading to Nc evaluations in the coarse network and Nc + Nf evaluations in the fine network.
- both MLPs are optimized independently, but only the samples from the fine one contribute to the final pixel color. The inventors have recognized that the first Nc samples evaluated in the coarse MLP are not used in rendering the output image, therefore being effectively wasted.
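- a minimal sketch of the single-network strategy of FIG. 6, reusing the sampling and compositing sketches above; `importance_sample` is an assumed helper that places extra samples where the coarse opacities are largest, and the sample counts are illustrative:

```python
import numpy as np

def render_ray_single_network(field, origin, direction, t_near, t_far,
                              n_coarse, n_fine, rng):
    """Both sampling rounds query the same radiance field network (sketch)."""
    t_c, p_c = sample_ray(origin, direction, t_near, t_far, n_coarse, rng)
    rgb_c, sigma_c = field(p_c, np.broadcast_to(direction, p_c.shape))
    # Reuse the coarse opacities to place fine samples; no separate coarse MLP
    # is trained, so the coarse evaluations also contribute to the final color.
    t_f = importance_sample(t_c, sigma_c, n_fine, rng)            # assumed helper
    p_f = origin[None, :] + t_f[:, None] * direction[None, :]
    rgb_f, sigma_f = field(p_f, np.broadcast_to(direction, p_f.shape))
    t_all = np.concatenate([t_c, t_f])
    order = np.argsort(t_all)
    rgb_all = np.concatenate([rgb_c, rgb_f])[order]
    sigma_all = np.concatenate([sigma_c, sigma_f])[order]
    return composite(rgb_all, sigma_all, t_all[order])
```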
- FIG. 7 is a flow diagram of a method of computing an image of a person depicting their mouth open.
- the method of FIG. 7 is optionally used where only one or two time instances are used in the training images. If many time instances are available in the training images the process of FIG. 7 is not needed.
- the cage represents a person’s face and comprises a mesh of a mouth interior 700, a first plane to represent an upper set of teeth of the person and a second plane 702 to represent a lower set of teeth of the person.
- the method comprises checking 704 whether one of the samples falls in an interior of the mouth and computing the transform 708 of the sample using information about the first and second planes.
- the transformed sample is used to query 710 the radiance field and the method proceeds as in FIG. 4.
- a canonical pose is one with the mouth closed, i.e., with the teeth overlapping (the top of the bottom teeth is below the bottom of the upper teeth).
- upper and lower mouth regions partially overlap in canonical space.
- the color and density learnt in the canonical space have to be the average of the corresponding regions in the upper and lower mouth.
- FIG. 8 is a flow diagram of a method of training a machine learning model and computing a cache for use in an image animator 100.
- Training data 800 is accessed comprising images of a scene (either static or dynamic) taken from many viewpoints. Training is possible using sets of images of a static scene as training data. It is also possible using sequences where each image represents the scene in a different state.
- FIG. 8 is first described for the case where the images are of the scene from a plurality of different viewpoints obtained at the same time instant or at two time instants so that the amount of training data needed for enrollment is relatively low.
- accuracy is improved where a face tracker is used to compute the ground truth parameter values of the deformation description. This is because the face tracker introduces error and if it is used for frames at many time instances there is more error.
- the training data images are real images such as photographs or video frames. It is also possible for the training data images to be synthetic images. From the training data images, tuples of values are extracted 601 where each tuple is a deformation description, a camera viewpoint, camera intrinsic parameters and a color of a given pixel.
- the training data comprises images of the chair taken from many different known viewpoints at the same time instant.
- the images are synthetic images generated using computer graphics technology.
- From each training image a tuple of values is extracted where each tuple is a deformation description, a camera viewpoint, camera intrinsic parameters and a color of a given pixel.
- the deformation description is a cage which is determined 802 by using known image processing techniques to place a cage of primitive 3D elements around and extending from the chair.
- a user or an automated process such as a computer game, triggers a physics engine to deform the cage using physics rules, such as to shatter the chair when it falls under gravity, or to crush the chair when it experiences pressure from another object.
- samples are taken along rays in the cage 804 by shooting rays from a viewpoint of a camera which captured the training image, into the cage. Samples are taken along the rays as described with reference to FIG. 4. Each sample is assigned an index of one of the 3D primitive elements of the cage according to the element the sample falls within.
- the samples are then transformed 806 to a canonical cage, which is a version of the cage in a rest position.
- the transformed samples are used to compute an output pixel color by using volume rendering.
- the output pixel color is compared with the ground truth output pixel color of the training image and the difference or error is assessed using a loss function.
- the loss function output is used to carry out backpropagation so as to train 808 the machine learning model and output a trained machine learning model 810.
- the training process is repeated for many samples until convergence is reached.
- the resulting trained machine learning model 810 is used to compute and store a cache 812 of associations between 3D positions and view directions in the canonical cage and color and opacity values. This is done by querying the trained machine learning model 810 for ranges of 3D positions and storing the results in a cache.
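- a minimal sketch of building such a cache, assuming the trained model is exposed as `field(points, dirs) -> (rgb, sigma)`; the dictionary layout and input grids are illustrative, not the cache format used by the patent:

```python
import numpy as np

def build_cache(field, grid_points, view_dirs):
    """Precompute color/opacity for canonical-space positions and view directions.

    grid_points: (P, 3) canonical 3D positions; view_dirs: (D, 3) unit directions.
    """
    cache = {}
    for i, p in enumerate(grid_points):
        for j, d in enumerate(view_dirs):
            rgb, sigma = field(p[None, :], d[None, :])
            cache[(i, j)] = (rgb[0], sigma[0])
    return cache
```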
- the training data comprises images of the person’s face taken from many different known viewpoints at the same time.
- values of parameters of a Vol3DMM model of the person’s face and head are associated with each training data image.
- the parameters include pose (position and orientation) of the eyes, and bones of the neck and jaw, as well as blendshape parameters which specify characteristics of human facial expressions such as eyes shut/open, mouth shut/open, smile/no smile and others.
- the images are real images of a person captured using one or more cameras with known viewpoints.
- a 3D model is fitted to each image using any well-known model fitting process whereby values of parameters of the 3DMM model used to generate Vol3DMM are searched to find a set of values which enable the 3D model to describe the observed real image. The values of the parameters which are found are then used to label the real image and are a value of the deformation description.
- Each real image is also labelled with a known camera viewpoint of a camera used to capture the image.
- the machine learning model is trained 808 with a training objective that seeks to minimize the difference between color produced by the machine learning model and color given in the ground truth training data.
- a sparsity loss is optionally applied in the volume surrounding the head and in the mouth interior.
- Sparsity losses make it possible to deal with incorrect background reconstruction, as well as to mitigate issues arising from disocclusions in the mouth interior region.
- use a Cauchy loss $\frac{1}{N}\sum_{i}\sum_{k}\log\left(1 + 2\,\sigma\!\left(r_i(t_k)\right)^{2}\right)$ (6), where i indexes rays $r_i$ shot from training cameras and k indexes samples $t_k$ along each of the rays.
- N is the number of samples to which the loss is applied.
- σ is the opacity returned by the radiance field parameterization.
- Other sparsity-inducing losses like L1 or weighted least-squares also work.
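- a minimal sketch of the Cauchy sparsity loss of equation (6), assuming numpy and that `sigma` holds the opacities of the samples selected as described in the next paragraph:

```python
import numpy as np

def cauchy_sparsity_loss(sigma):
    """Cauchy sparsity penalty on the opacities of the selected samples (eq. 6)."""
    return np.mean(np.log1p(2.0 * sigma ** 2))
```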
- the sparsity loss is applied in two regions: in the volume surrounding the head and in the mouth interior. Applied to the volume surrounding the head, the sparsity loss prevents opaque regions appearing in areas where there is not enough multi-view information to disentangle foreground from background in 3D. To detect these regions, apply the loss to (1) samples which fall in the tetrahedral primitives as this is the region rendered at test-time, and (2) samples which belong to rays which fall in the background in the training images as detected by a 2D face segmentation network applied to the training images. Also apply the sparsity loss to the coarse samples that fall inside the mouth interior volume. This prevents the creation of opaque regions inside the mouth cavity in areas that are not seen at training, and therefore have no supervision, but become disoccluded at test time.
- a solution here is to override the color and density of the last sample along each ray that falls in the mouth interior, which makes it possible to set the color of disoccluded regions at test time to match the learnt color in the visible region between the teeth at training time.
- the present technology has been tested empirically for a first application and a second application.
- physics-based simulation is used to control the deformation of a static object (an aircraft propeller) undergoing complex topological changes and to render photo-realistic images of the process for every step of the simulation.
- This experiment shows the representation power of the deformation description and the ability to render images from physical deformations difficult to capture with a camera.
- a dataset of a propeller undergoing a continuous compression and rotation was synthesized. For both types of deformation, render 48 temporal frames for 100 cameras.
- For the present technology train only on the first frame, which can be considered the rest state, but supply a coarse tetrahedral mesh describing the motion of the sequence.
- the mean peak signal to noise ratio of the present technology on interpolation of every other frame was 27.72 as compared with 16.63 for an alternative approach without using a cage and using positional encoding on the time signal.
- the peak signal to noise ratio of the present technology on extrapolation over time was 29.87 for the present technology as compared with 12.78 for the alternative technology.
- the present technology in the first application computed images at around 6ms a frame with resolution 512 * 512, as opposed to around 30s for the alternative technology.
- multi-view face data is acquired with a camera rig that captures synchronized videos from 31 cameras at 30 frames per second. These cameras are located 0.75-1 m from the subject, with viewpoints spanning 270° around their head and focusing mostly on frontal views within ±60°. Illumination is not uniform. All the images are down-sampled to 512x512 pixels and color corrected to have consistent color features across cameras. Estimate camera poses and intrinsic parameters with a standard structure-from-motion pipeline.
- the training data comprises the face tracking result from a face tracker and images from multiple cameras at a single time instant (frame).
- the frame is chosen to satisfy the following criteria: 1) a significant area of the teeth is visible and the bottom of the upper teeth is above the top of the lower teeth to place a plane between them, 2) the subject looks forward and some of the eye white is visible on both sides of the iris, 3) the face fit for the frame is accurate, 4) the texture of the face is not too wrinkled (e.g. in the nasolabial fold) due to the mouth opening.
- the present technology is found to give a better PSNR than a baseline technology by 0.1 dB and to offer a 10% improvement in learned perceptual image patch similarity (LPIPS).
- the baseline technology uses an explicit mesh and does not have a cage extending beyond the face.
- FIG. 9 illustrates various components of an exemplary computing-based device 900 which are implemented as any form of a computing and/or electronic device, and in which embodiments of an image animator are implemented in some examples.
- Computing-based device 900 comprises one or more processors 914 which are microprocessors, controllers or any other suitable type of processors for processing computer executable instructions to control the operation of the device in order to generate synthetic images of a dynamic scene in a controllable manner.
- the processors 914 include one or more fixed function blocks (also referred to as accelerators) which implement a part of the method of any of FIGs. 4 to 8 in hardware (rather than software or firmware).
- Platform software comprising an operating system 908 or any other suitable platform software is provided at the computing-based device to enable application software 910 to be executed on the device.
- a data store 922 holds output images, values of face tracker parameters, values of physics engine rules, intrinsic camera parameter values, viewpoints and other data.
- An animator 902 comprising a radiance field parameterization 904 and a volume renderer 906 is present at the computing-based device 900.
- Computer-readable media includes, for example, computer storage media such as memory 912 and communications media.
- Computer storage media, such as memory 912 includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or the like.
- Computer storage media includes, but is not limited to, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM), electronic erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that is used to store information for access by a computing device.
- communication media embody computer readable instructions, data structures, program modules, or the like in a modulated data signal, such as a carrier wave, or other transport mechanism. As defined herein, computer storage media does not include communication media.
- a computer storage medium should not be interpreted to be a propagating signal per se.
- the computer storage media memory 912
- the storage is, in some examples, distributed or located remotely and accessed via a network or other communication link (e.g. using communication interface 916).
- the computing-based device 900 has an optional capture device 918 to enable the device to capture sensor data such as images and videos.
- the computing-based device 900 has an optional display device 920 to display output images and/or values of parameters.
- a computer-implemented method of computing an image of a dynamic 3D scene comprising a 3D object comprising: receiving a description of a deformation of the 3D object, the description comprising a cage of primitive 3D elements and associated animation data from a physics engine or an articulated object model; for a pixel of the image, computing a ray from a virtual camera through the pixel into the cage animated according to the animation data and computing a plurality of samples on the ray, each sample being a 3D position and view direction in one of the 3D elements; computing a transformation of the samples into a canonical version of the cage to produce transformed samples; for each transformed sample, querying a learnt radiance field parameterization of the 3D scene to obtain a color value and an opacity value; applying a volume rendering method to the color and opacity values to produce a pixel value of the image.
- Clause B The method of clause A wherein the cage of primitive 3D elements represents the 3D object and a volume extending from the 3D object.
- Clause D The method of clause B wherein the cage is computed from the learnt radiance field parameterization by computing a mesh from the density of the learnt radiance field parameterization using Marching Cubes and computing a tetrahedral embedding of the mesh.
- Clause E The method of any preceding clause further comprising computing the transformation P of a sample by setting P equal to a normalized distance between a previous and a next intersection of a tetrahedron on a ray, times the sum, at the previous intersection, over four vertices of a tetrahedron of a barycentric coordinate of the vertex times a canonical coordinate of the vertex, plus one minus the normalized distance, times the sum, at the next intersection, over four vertices of the tetrahedron of the barycentric coordinate of a vertex times the canonical coordinate of the vertex.
- Clause F The method of clause A further comprising, for one of the transformed samples, rotating a view direction of a ray of the sample prior to querying the learnt radiance field parameterization.
- Clause G The method of clause F comprising computing a rotation R of the view direction for a small fraction of the primitive 3D elements and propagating the value of R to remaining tetrahedra via nearest neighbor interpolation.
- the learnt radiance field parameterization is a cache of associations between 3D points in the canonical version of the cage and color and opacity values, obtained by querying a machine learning model trained using training data comprising images of the dynamic scene from a plurality of viewpoints.
- Clause M The method of clause L comprising checking whether one of the samples falls in an interior of the mouth and computing the transform of the sample using information about the first and second planes.
- Clause N The method of any preceding clause comprising, during the process of, for each transformed sample, querying the learnt radiance field parameterization of the 3D scene to obtain a color value and an opacity value, using only one radiance field network and increasing a number of sampling bins.
- An apparatus comprising: at least one processor; a memory storing instructions that, when executed by the at least one processor, perform a method for computing an image of a dynamic 3D scene comprising a 3D object, comprising: receiving a description of a deformation of the 3D object, the description comprising a cage of primitive 3D elements and associated animation data from a physics engine or an articulated object model; for a pixel of the image, computing a ray from a virtual camera through the pixel into the cage animated according to the animation data and computing a plurality of samples on the ray, each sample being a 3D position and view direction in one of the 3D elements; computing a transformation of the samples into a canonical version of the cage to produce transformed samples; for each transformed sample, querying a learnt radiance field parameterization of the 3D scene to obtain a color value and an opacity value; applying a volume rendering method to the color and opacity values to produce a pixel value of the image.
- a computer-implemented method of computing an image of a dynamic 3D scene comprising a 3D object comprising: receiving a description of a deformation of the 3D object; for a pixel of the image, computing a ray from a virtual camera through the pixel into the description and computing a plurality of samples on the ray, each sample being a 3D position and view direction in one of the 3D elements; computing a transformation of the samples into a canonical space to produce transformed samples; for each transformed sample, querying a cache of associations between 3D points in the canonical space and color and opacity values; applying a volume rendering method to the color and opacity values to produce a pixel value of the image.
- Clause Q The method of clause P further comprising one or more of: storing the image, transmitting the image to a computer game application, transmitting the image to a telepresence application, inserting the image into a virtual webcam stream, transmitting the image to a head mounted display.
- Clause R The method of clause P or Q comprising using an object tracker to detect parameter values of a model of a 3D object depicted in a video and using the detected parameter values and the model to compute the description of the deformation of the 3D object.
- Clause S The method of any of clause P to R comprising using a physics engine to specify the description.
- 'computer' or 'computing-based device' is used herein to refer to any device with processing capability such that it executes instructions.
- processing capabilities are incorporated into many different devices and therefore the terms 'computer' and 'computing-based device' each include personal computers (PCs), servers, mobile telephones (including smart phones), tablet computers, set-top boxes, media players, games consoles, personal digital assistants, wearable computers, and many other devices.
- the methods described herein are performed, in some examples, by software in machine readable form on a tangible storage medium, e.g. in the form of a computer program comprising computer program code means adapted to perform all the operations of one or more of the methods described herein when the program is run on a computer and where the computer program may be embodied on a computer readable medium.
- the software is suitable for execution on a parallel processor or a serial processor such that the method operations may be carried out in any suitable order, or simultaneously.
- a remote computer is able to store an example of the process described as software.
- a local or terminal computer is able to access the remote computer and download a part or all of the software to run the program.
- the local computer may download pieces of the software as needed, or execute some software instructions at the local terminal and some at the remote computer (or computer network).
- a dedicated circuit such as a digital signal processor (DSP), programmable logic array, or the like.
- subset is used herein to refer to a proper subset such that a subset of a set does not comprise all the elements of the set (i.e. at least one of the elements of the set is missing from the subset).
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Computer Graphics (AREA)
- Processing Or Creating Images (AREA)
Abstract
Description
Claims
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202380052012.6A CN119497876A (en) | 2022-07-26 | 2023-06-12 | Computing images of controllable dynamic scenes |
| EP23739720.3A EP4562608A1 (en) | 2022-07-26 | 2023-06-12 | Computing images of controllable dynamic scenes |
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GBGB2210930.0A GB202210930D0 (en) | 2022-07-26 | 2022-07-26 | Computing images of controllable dynamic scenes |
| GB2210930.0 | 2022-07-26 | ||
| US17/933,453 | 2022-09-19 | ||
| US17/933,453 US12182922B2 (en) | 2022-07-26 | 2022-09-19 | Computing images of controllable dynamic scenes |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024025668A1 true WO2024025668A1 (en) | 2024-02-01 |
Family
ID=87202014
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2023/025095 Ceased WO2024025668A1 (en) | 2022-07-26 | 2023-06-12 | Computing images of controllable dynamic scenes |
Country Status (3)
| Country | Link |
|---|---|
| EP (1) | EP4562608A1 (en) |
| CN (1) | CN119497876A (en) |
| WO (1) | WO2024025668A1 (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119006687A (en) * | 2024-07-29 | 2024-11-22 | 中国矿业大学 | 4D scene characterization method combining pose and radiation field optimization under complex mine environment |
-
2023
- 2023-06-12 EP EP23739720.3A patent/EP4562608A1/en active Pending
- 2023-06-12 WO PCT/US2023/025095 patent/WO2024025668A1/en not_active Ceased
- 2023-06-12 CN CN202380052012.6A patent/CN119497876A/en active Pending
Non-Patent Citations (6)
| Title |
|---|
| CHUHUA XIAN ET AL: "Automatic generation of coarse bounding cages from dense meshes", SHAPE MODELING AND APPLICATIONS, 2009. SMI 2009. IEEE INTERNATIONAL CONFERENCE ON, IEEE, PISCATAWAY, NJ, USA, 26 June 2009 (2009-06-26), pages 21 - 27, XP031493721, ISBN: 978-1-4244-4069-6 * |
| DINEV DIMITAR ET AL: "Solving for muscle blending using data", COMPUTERS AND GRAPHICS, ELSEVIER, GB, vol. 92, 30 September 2020 (2020-09-30), pages 67 - 75, XP086324298, ISSN: 0097-8493, [retrieved on 20200930], DOI: 10.1016/J.CAG.2020.09.005 * |
| TIANHAN XU ET AL: "Deforming Radiance Fields with Cages", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 25 July 2022 (2022-07-25), XP091279660 * |
| WOOD ERROLL ET AL: "Fake it till you make it: face analysis in the wild using synthetic data alone", 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), IEEE, 10 October 2021 (2021-10-10), pages 3661 - 3671, XP034093771, DOI: 10.1109/ICCV48922.2021.00366 * |
| YUAN YU-JIE ET AL: "NeRF-Editing: Geometry Editing of Neural Radiance Fields - Supplementary Material", 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE, 18 June 2022 (2022-06-18), pages 1 - 9, XP093076321 * |
| YUAN YU-JIE ET AL: "NeRF-Editing: Geometry Editing of Neural Radiance Fields", 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE, 18 June 2022 (2022-06-18) - 24 June 2022 (2022-06-24), pages 18332 - 18343, XP034195177, DOI: 10.1109/CVPR52688.2022.01781 * |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119006687A (en) * | 2024-07-29 | 2024-11-22 | 中国矿业大学 | 4D scene characterization method combining pose and radiation field optimization under complex mine environment |
Also Published As
| Publication number | Publication date |
|---|---|
| EP4562608A1 (en) | 2025-06-04 |
| CN119497876A (en) | 2025-02-21 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Tewari et al. | Advances in neural rendering | |
| Gao et al. | Nerf: Neural radiance field in 3d vision, a comprehensive review | |
| Lassner et al. | Pulsar: Efficient sphere-based neural rendering | |
| Ost et al. | Neural scene graphs for dynamic scenes | |
| Tewari et al. | State of the art on neural rendering | |
| Hasselgren et al. | Appearance-Driven Automatic 3D Model Simplification. | |
| Dalal et al. | Gaussian splatting: 3D reconstruction and novel view synthesis: A review | |
| US9792725B2 (en) | Method for image and video virtual hairstyle modeling | |
| CN113706714A (en) | New visual angle synthesis method based on depth image and nerve radiation field | |
| US6559849B1 (en) | Animation of linear items | |
| Garbin et al. | VolTeMorph: Real‐time, Controllable and Generalizable Animation of Volumetric Representations | |
| US5758046A (en) | Method and apparatus for creating lifelike digital representations of hair and other fine-grained images | |
| Cole et al. | Differentiable surface rendering via non-differentiable sampling | |
| CN118736108B (en) | High-fidelity and drivable reconstruction method of human face based on 3D Gaussian splattering | |
| CN115205463B (en) | New perspective image generation method, device and equipment based on multi-spherical scene expression | |
| Li et al. | Eyenerf: a hybrid representation for photorealistic synthesis, animation and relighting of human eyes | |
| Yang et al. | Reconstructing objects in-the-wild for realistic sensor simulation | |
| Baudron et al. | E3d: event-based 3d shape reconstruction | |
| US12182922B2 (en) | Computing images of controllable dynamic scenes | |
| Hani et al. | Continuous object representation networks: novel view synthesis without target view supervision | |
| Maxim et al. | A survey on the current state of the art on deep learning 3D reconstruction | |
| Zhao et al. | Surfel-based Gaussian inverse rendering for fast and relightable dynamic human reconstruction from monocular videos | |
| CN118864762A (en) | A reconstruction and driving method based on multi-view clothing human motion video | |
| Osman Ulusoy et al. | Dynamic probabilistic volumetric models | |
| Martin-Brualla et al. | Gelato: Generative latent textured objects |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 23739720 Country of ref document: EP Kind code of ref document: A1 |
|
| WWE | Wipo information: entry into national phase |
Ref document number: 202380052012.6 Country of ref document: CN |
|
| WWP | Wipo information: published in national office |
Ref document number: 202380052012.6 Country of ref document: CN |
|
| WWE | Wipo information: entry into national phase |
Ref document number: 2023739720 Country of ref document: EP |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| ENP | Entry into the national phase |
Ref document number: 2023739720 Country of ref document: EP Effective date: 20250226 |
|
| WWP | Wipo information: published in national office |
Ref document number: 2023739720 Country of ref document: EP |