
WO2024118464A1 - Systems and methods for automatic generation of three-dimensional models from two-dimensional images - Google Patents

Info

Publication number
WO2024118464A1
WO2024118464A1 (PCT/US2023/081079)
Authority
WO
WIPO (PCT)
Prior art keywords
image
model
training
given
decoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2023/081079
Other languages
French (fr)
Inventor
Deqing Sun
Pratul SRINIVASAN
Kyle SARGENT
Jing Yu KOH
Huiwen Chang
Han Zhang
Charles Herrmann
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Google LLC filed Critical Google LLC
Publication of WO2024118464A1 publication Critical patent/WO2024118464A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/067 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using optical means
    • G06N3/0675 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using optical means using electro-optical, acousto-optical or opto-electronic means
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/64 Three-dimensional objects
    • G06V20/647 Three-dimensional objects by matching two-dimensional images to three-dimensional objects
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/094 Adversarial learning

Definitions

  • Three-dimensional (“3D”) models are an important part of many popular media formats such as video games, movies, and computer graphics.
  • As it can be time-consuming to create 3D content manually, efforts have been made to leverage machine learning techniques to automatically generate 3D content from two-dimensional (“2D”) images.
  • However, such approaches may require specialized training data, such as training sets that link a given 2D image with corresponding ground truth pose data.
  • As such training data may be difficult to obtain and/or expensive to generate, models trained using such data can be limited in terms of the image classes on which they may be trained and the domains on which they may be put to effective use.
  • a conditional Neural Radiance Field (“NeRF”) decoder and modified triplane representation may be introduced in a vector-quantized autoencoder, and may be trained in two stages.
  • the first stage learns to encode and reconstruct the dataset, and encodes images into a learned discrete latent codebook, which can reconstruct a training dataset and output a reconstruction corresponding to the input image.
  • the second stage learns a generative model as an autoregressive transformer over the sequences of discrete latent codes predicted from the first stage encoder.
  • a NeRF approach is able to create 3D representations of a scene or object within a scene from 2D imagery. It employs encoding of the entire scene or the object into a neural network trained to predict light intensity (or radiance) for any point within the 2D image to create or otherwise generate one or more novel 3D views at different angles.
  • the NeRF-based decoder may be trained to reconstruct an input image from a given latent vector (e.g., one produced by a pretrained vision transformer encoder), and to predict the depth of each image.
  • the parameters of the NeRF-based decoder may be frozen, and a generative transformer (e.g., a generative autoregressive transformer) may be trained to generate latent vectors based on the latent vectors produced by the stage 1 encoder.
  • the present technology may allow a model to be trained using pseudo-ground truth depth values generated automatically from a separate, pretrained, off-the-shelf model (e.g., a dense prediction transformer (“DPT”) model), rather than requiring exact ground truth pose data (e.g., ground truth depth values, tuning pose hyperparameters, etc.).
  • this may thus allow the model to be trained on a large and diverse 2D image collection (e.g., the ImageNet benchmark image database, such as described by Deng et al., “Imagenet: A large-scale hierarchical image database” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, which is incorporated herein by reference).
  • once trained, the model may be used to generate a manipulable NeRF in a single forward pass, without relying on an inversion optimization.
  • models configured and trained according to the present technology may provide state-of-the-art generation results (e.g., up to a three-times improvement in Frechet Inception Distance scores) relative to other models.
  • the disclosure describes a computer-implemented method of training a model, comprising: (1) for each given first training example of a plurality of first training examples, the given first training example including a first image, a vector based on the first image, and a first plurality of depth values, each given depth value of the first plurality of depth values corresponding to a distance between a reference point and a given portion of the first image: generating, using a decoder of the model, a second image based on the vector; generating, using the decoder of the model, a second plurality of depth values, each given depth value of the second plurality of depth values corresponding to a distance between a reference point and a given portion of the second image; generating, using the decoder of the model, a third image based on the vector, the third image being different than the second image; comparing, using one or more processors of a processing system, the second image to the first image to generate a reconstruction loss value for the given first training example; comparing, using the one or more processors, the second image to the first image to generate a first real-fake loss value for the given first training example; comparing, using the one or more processors, the second image to the third image to generate a second real-fake loss value for the given first training example; and comparing, using the one or more processors, the first plurality of depth values to the second plurality of depth values to generate a depth loss value for the given first training example; and (2) modifying, using the one or more processors, one or more parameters of the decoder of the model based at least in part on the reconstruction loss values, first real-fake loss values, second real-fake loss values, and depth loss values generated for the plurality of first training examples.
  • the method may further comprise: (1) for each given second training example of a plurality of second training examples, the given second training example including an input image and a target vector based on the input image: generating, using an encoder of the model, an output vector based on the input image; and comparing, using the one or more processors, the output vector to the target vector to generate a loss value for the given second training example; and (2) modifying, using the one or more processors, one or more parameters of the encoder of the model based at least in part on the loss values generated for the plurality of second training examples.
  • the output vector generated based on the input image may comprise a learned latent codebook.
  • the learned latent codebook may be processed, during a first stage of training the model, to generate a set of triplanes.
  • the reconstruction loss may be based on both a mean-squared error and a perceptual loss.
  • the first real-fake loss value may distinguish between real and reconstructed images at a canonical viewpoint, while the second real-fake loss value may distinguish between reconstructed images at the canonical viewpoint and novel views.
  • the method may further comprise using the trained model to generate one or more novel 3D views at different angles.
  • the disclosure describes a non-transitory computer program product comprising computer readable instructions that, when executed by a processing system, cause the processing system to perform any of the methods described in the preceding paragraph.
  • the disclosure describes a processing system comprising: (1) a memory storing a model; and (2) one or more processors coupled to the memory and configured to train the model according to a training method comprising: (a) for each given first training example of a plurality of first training examples, the given first training example including a first image, a vector based on the first image, and a first plurality of depth values, each given depth value of the first plurality of depth values corresponding to a distance between a reference point and a given portion of the first image: generating, using a decoder of the model, a second image based on the vector; generating, using the decoder of the model, a second plurality of depth values, each given depth value of the second plurality of depth values corresponding to a distance between a reference point and a given portion of the second image; generating, using the decoder of the model, a third image based on the vector, the third image being different than the second image; comparing the second image to the first image to generate a reconstruction loss value for the given first training example; comparing the second image to the first image to generate a first real-fake loss value for the given first training example; comparing the second image to the third image to generate a second real-fake loss value for the given first training example; and comparing the first plurality of depth values to the second plurality of depth values to generate a depth loss value for the given first training example; and (b) modifying one or more parameters of the decoder of the model based at least in part on the reconstruction loss values, first real-fake loss values, second real-fake loss values, and depth loss values generated for the plurality of first training examples.
  • the one or more processors may be further configured to train the model according to a further training method comprising: (1) for each given second training example of a plurality of second training examples, the given second training example including an input image and a target vector based on the input image: generating, using an encoder of the model, an output vector based on the input image; and comparing, using the one or more processors, the output vector to the target vector to generate a loss value for the given second training example; and (2) modifying one or more parameters of the encoder of the model based at least in part on the loss values generated for the plurality of second training examples.
  • the decoder of the model is a NeRF-based decoder.
  • the encoder of the model is an autoregressive transformer.
  • the output vector generated based on the input image may comprise a learned latent codebook.
  • the learned latent codebook may be processed, during a first stage of training the model, to generate a set of triplanes.
  • the reconstruction loss may be based on both a mean-squared error and a perceptual loss.
  • the first real-fake loss value may be configured to distinguish between real and reconstructed images at a canonical viewpoint, while the second real-fake loss value may be configured to distinguish between reconstructed images at the canonical viewpoint and novel views.
  • according to another aspect of the technology, a computer-implemented method is provided for training a 3D-aware model.
  • the method comprises, in a first stage of training the model: applying, by one or more processors of a computing system, an input image to an encoder, the encoder generating a learned latent codebook; applying, by the one or more processors, the learned latent codebook to a triplane generator of a decoder to generate a set of triplanes each having a defined spatial resolution and a defined number of channels; performing, by the one or more processors via the decoder, contraction according to a contraction function; and generating, by the one or more processors via the decoder, a set of novel views based on the contraction and the set of triplanes.
  • in a second stage of training the model, the method comprises: predicting, by an autoregressive transformer, a latent vector comprising one or more sequences of latent tokens based on the encoder generating the learned latent codebook in the first stage of training; and applying, by the one or more processors, the latent vector to the decoder to generate imagery of a 3D scene.
  • the decoder in the first stage may generate a set of losses.
  • the set of losses may include one or more of a reconstruction loss, a depth loss, or a real/fake loss.
  • the reconstruction loss may be based on both a mean-squared error and a perceptual loss.
  • the real/fake loss may include: a first real-fake loss value that distinguishes between real and reconstructed images at a canonical viewpoint; and a second real-fake loss value that distinguishes between reconstructed images at the canonical viewpoint and novel views.
  • the method in any of the above configurations, may further comprise using the trained model to generate one or more novel 3D views at different angles.
  • a system comprising memory configured to store a set of imagery, and one or more processors operatively coupled to the memory.
  • the one or more processors are configured to train a 3D-aware image model in two stages.
  • the one or more processors are configured to: apply an input image to an encoder, the encoder generating a learned latent codebook; apply the learned latent codebook to a triplane generator of a decoder to generate a set of triplanes each having a defined spatial resolution and a defined number of channels; perform, via the decoder, contraction according to a contraction function; and generate, via the decoder, a set of novel views based on the contraction and the set of triplanes.
  • the one or more processors are configured to: predict, using an autoregressive transformer, a latent vector comprising one or more sequences of latent tokens based on the encoder generating the learned latent codebook in the first stage of training; and apply the latent vector to the decoder to generate imagery of a 3D scene.
  • FIG. 1 is a functional diagram of an example system in accordance with aspects of the disclosure.
  • FIG. 2 is a functional diagram of an example system in accordance with aspects of the disclosure.
  • FIG. 3 is a flow chart illustrating an exemplary process flow for training a 3D-aware generative model in two stages, in accordance with aspects of the disclosure.
  • FIG. 4 is a flow chart illustrating exemplary process flows for generating loss values that may be used in stage 1 of the training set forth in FIG. 3, in accordance with aspects of the disclosure.
  • FIG. 5 is a table comparing FID scores of 3D generative models according to aspects of the disclosure.
  • FIG. 6 illustrates a set of imagery with generated samples and disparity from trained models according to aspects of the disclosure.
  • FIGS. 7A-B illustrate sets of input images, reconstruction images and disparities in accordance with aspects of the disclosure.
  • FIG. 8 illustrates an example of camera manipulations of a reconstructed scene in accordance with aspects of the disclosure.
  • FIG. 9 illustrates a table comparing different models to evaluate depth losses in accordance with aspects of the disclosure.
  • FIG. 10 illustrates a reconstruction table according to an ablation study in accordance with aspects of the disclosure.
  • FIG. 11 illustrates a FID score comparison table in accordance with aspects of the disclosure.
  • FIG. 12 illustrates FID scores of various 3D generative models in accordance with aspects of the disclosure.
  • FIG. 13 illustrates a flow diagram of a method in accordance with aspects of the disclosure.
  • FIG. 14 illustrates a flow diagram of another method in accordance with aspects of the disclosure.
DESCRIPTION
  • FIG. 1 shows a high-level system diagram 100 of an exemplary processing system 102 for performing the methods described herein.
  • the processing system 102 may include one or more hardware processors 104 and memory 106 storing instructions 108 and data 110.
  • the instructions 108 and data 110 may include a model (e.g., a 3D-aware generative model).
  • the data 110 may store training examples to be used in training the model, outputs from the model produced during training, training signals and/or loss values generated during such training, and/or outputs from the model generated during inference.
  • Processing system 102 may be resident on a single computing device.
  • processing system 102 may be a server, personal computer, or mobile device, and the model may thus be local to that single computing device.
  • processing system 102 may be resident on a cloud computing system or other distributed system.
  • the model may be distributed across two or more different physical computing devices.
  • the processing system may comprise a first computing device storing layers 1 through n of a model having m layers, and a second computing device storing layers n through m of the model.
  • the first computing device may be one with less memory and/or processing power (e.g., a personal computer, mobile phone, tablet, etc.) compared to that of the second computing device, or vice versa.
  • the processing system may comprise one or more computing devices storing one or more parts of a model, and one or more separate computing devices storing other parts of the model.
  • the processing system may comprise one or more computing devices storing the model’s decoder, and one or more separate computing devices storing the model’s generative transformer.
  • data used and/or generated during training or inference of the model (e.g., training examples, model outputs, loss values, etc.) may be stored on a different computing device than the model.
  • FIG. 2 shows a high-level system diagram 200 in which the exemplary processing system 102 just described is distributed across two computing devices 102a and 102b, each of which may include one or more processors (104a, 104b) and memory (106a, 106b) storing instructions (108a, 108b) and data (110a, 110b).
  • the processing system 102 comprising computing devices 102a and 102b is shown being in communication with one or more websites and/or remote storage systems over one or more networks 202, including website 204 and remote storage system 212.
  • website 204 includes one or more servers 206a-206n.
  • Each of the servers 206a-206n may have one or more processors (e.g., 208), and associated memory (e.g., 210) storing instructions and data, including the content of one or more webpages.
  • remote storage system 212 may also include one or more processors and memory storing instructions and data.
  • the processing system 102 comprising computing devices 102a and 102b may be configured to retrieve data from one or more of website 204 and/or remote storage system 212, for use during training of the model.
  • the first computing device 102a may be configured to retrieve training images from the remote storage system 212. Those training images may then be fed to an encoder housed on the first computing device 102a to generate latent vectors which will in turn be fed into the model’s decoder, which may be housed on a second computing device 102b.
  • the training images may be fed to a pretrained off-the-shelf model (e.g., a DPT model) housed on the first computing device 102a in order to generate depth value estimates, which may then be used as pseudo-ground truth depth values in training the decoder housed on the second computing device 102b.
  • the processing systems described herein may be implemented on any type of computing device(s), such as any type of general computing device, server, or set thereof, and may further include other components typically present in general purpose computing devices or servers.
  • the memory of such processing systems may be of any non-transitory type capable of storing information accessible by the hardware processor(s) of the processing systems.
  • the memory may include a non-transitory medium such as a hard-drive, memory card, optical disk, solid-state, tape memory, or the like.
  • Computing devices suitable for the roles described herein may include different combinations of the foregoing, whereby different portions of the instructions and data are stored on different types of media.
  • the computing devices described herein may further include any other components normally used in connection with a computing device such as a user interface subsystem.
  • the user interface subsystem may include one or more user inputs (e.g., a mouse, keyboard, stylus, touch screen, and/or microphone) and one or more electronic displays (e.g., a monitor having a screen or any other electrical device that is operable to display information).
  • Output devices besides an electronic display, such as speakers, lights, and vibrating, pulsing, or haptic elements, may also be included in the computing devices described herein.
  • the one or more processors included in each computing device may be any conventional processors, such as commercially available central processing units (“CPUs”), graphics processing units (“GPUs”), tensor processing units (“TPUs”), etc.
  • the one or more processors may be a dedicated device such as an ASIC or other hardware-based processor.
  • Each processor may have multiple cores that are able to operate in parallel.
  • the processor(s), memory, and other elements of a single computing device may be stored within a single physical housing, or may be distributed between two or more housings.
  • the memory of a computing device may include a hard drive or other storage media located in a housing different from that of the processor(s), such as in an external database or networked storage device. Accordingly, references to a processor or computing device will be understood to include references to a collection of processors or computing devices or memories that may or may not operate in parallel, as well as one or more servers of a load-balanced server farm or cloud-based system.
  • the computing devices described herein may store instructions capable of being executed directly (such as machine code) or indirectly (such as scripts) by the processor(s).
  • the computing devices may also store data, which may be retrieved, stored, or modified by one or more processors in accordance with the instructions.
  • Instructions may be stored as computing device code on a computing device-readable medium.
  • the terms “instructions” and “programs” may be used interchangeably herein.
  • Instructions may also be stored in object code format for direct processing by the processor(s), or in any other computing device language including scripts or collections of independent source code modules that are interpreted on demand or compiled in advance.
  • the programming language may be C#, C++, JAVA or another computer programming language.
  • any components of the instructions or programs may be implemented in a computer scripting language, such as JavaScript, PHP, ASP, or any other computer scripting language.
  • any one of these components may be implemented using a combination of computer programming languages and computer scripting languages.
  • FIG. 3 is a flow chart illustrating an exemplary process flow 300 for training a 3D-aware generative model in two stages, in accordance with aspects of the disclosure.
  • the model, identified as “VQ3D” in certain figures, comprises a vector-quantized autoencoder, which is trained in two stages.
  • Stage 1 of the model receives an input image 302, and employs an encoder 304 and decoder 306.
  • the encoder encodes RGB images into a learned latent codebook 308, and the decoder reconstructs them.
  • a diagram of the inputs, outputs, and architecture of the first stage is shown in the top half of FIG. 3.
  • the encoder 304 is shown as a ViT-S type, while the decoder 306 is a conditional NeRF configured with a ViT-L triplane generator 310.
  • the triplane generator 310 decodes the latent features from the learned latent codebook 308 and outputs a set of triplanes 311.
  • triplanes can align explicit features according to three orthogonal axis-aligned feature planes, in which each feature plane can have a resolution of R x R x C, where R is the spatial resolution and C represents the number of channels.
  • the system contracts points according to a contraction function before looking up their values in the triplanes.
  • the first stage can be trained end-to-end by encoding and reconstructing RGB training images while minimizing reconstruction and adversarial losses. Because the decoder is a NeRF, the system is able to supervise the NeRF geometry with an additional training loss using pseudo-GT disparity.
  • the system can also render novel views of decoded images and critique them with an additional adversarial loss.
  • a diagram of the key losses used in Stage 1 training is shown in FIG. 4, discussed below.
  • the first stage can be used to encode unseen single RGB images and then reconstruct them in 3D, which enables novel view synthesis, image editing and manipulations.
  • the output of the encoder-decoder process includes generation of a 3D model 312 corresponding to image 314.
  • this process is configured to generate at least one novel view 316 with a corresponding 3D model 318.
  • Stage 2 implements a generative autoregressive transformer, which is configured to predict sequences of latent tokens.
  • the autoregressive transformer is trained to predict sequences of latent tokens from the encoder from stage 1 training. These tokens are then passed into the stage 1 decoder to produce fully generated 2D images.
  • the stage 2 training inherits the 3D-aware properties from the stage 1 decoder and optimization.
  • The system trains an autoregressive transformer to generate sequences of latent tokens.
  • the 3D-aware decoder is configured to produce fully generated imagery of 3D scenes with consistent geometry and plausible novel views.
  • random noise 320 is an input to an autoregressive transformer 322, which may be trained to generate a latent vector 324 that the trained NeRF-based decoder 306 may then use to generate multiple views of a given subject (326a, 328a, 330a) and corresponding 3D models (326b, 328b, 330b).
  • the transformer 322 can be trained on the sequences of latent codes 308 produced by the stage 1 encoder. After training, the autoregressive transformer 322 can be used to generate totally new 3D images by first sampling a sequence of latent tokens and then applying the NeRF-based decoder 306. Note that the Stage 2 model inherits the properties optimized in Stage 1, so the fully generated images have high quality geometry and plausible novel views.
  • FIG. 4 is a flow chart illustrating exemplary process flows 400-1, 400-2, 400-3, and 400-4 for generating loss values that may be used in stage 1 of the training set forth in FIG. 3, in accordance with aspects of the disclosure.
  • process flow 400-1 of FIG. 4 illustrates the generation of a reconstruction loss value based on a comparison of the input image (e.g., 302 of FIG. 3) applied to the model and the reconstructed image generated during stage 1.
  • the reconstruction loss of process flow 400-1 can be based on both a mean-squared error and perceptual loss.
  • the reconstruction loss may also be based on other suitable loss calculations, such as logit-Laplace loss.
  • Process flows 400-2 and 400-4 illustrate the generation of real-fake loss values based on comparisons of two images.
  • process flow 400-2 illustrates the generation of a real-fake loss value based on a comparison of the input image and the reconstructed image.
  • process flow 400-4 illustrates the generation of a real-fake loss value based on a comparison of the reconstructed image and an image of a novel view.
  • all of the training dataset images are reconstructed from the same “canonical” camera viewpoint (see the rightmost portion of block 306 of FIG. 3), which simplifies the task for the decoder.
  • High quality novel views are desired within a neighborhood of this canonical camera viewpoint.
  • the system may render novel views during training and critique them with a novel view discriminator.
  • the system may concatenate the disparity as a fourth channel of input to the novel view discriminator, to ensure the geometry does not change depending on the camera viewpoint.
  • Process flow 400-3 illustrates the generation of depth loss values based on comparisons of the NeRF decoder’s predicted depth values and the pseudo-ground truth depth values generated by a separate, pretrained, off-the-shelf DPT model.
  • the depth loss values will be weighted shift- and scale-invariant loss values calculated according to the equations set forth below.
  • the predicted depth from the NeRF decoder may be additionally supervised with pseudo ground truth depth.
  • the DPT depth network can be used to generate pseudo-GT.
  • the predicted depth is supervised with a depth loss. This loss is adapted to the NeRF setting and supervises the volumetric rendering weights of every NeRF sample location rather than the accumulated disparity.
  • stage 1 can accept an input RGB image and reconstruct it in 3D, generating suitable reconstruction at the canonical view and plausible novel views.
  • the goal of the first stage is to learn a model which can compress image pixels into a sequence of discrete indices corresponding to a learnt latent codebook (e.g., 308 in FIG. 3). Since it is beneficial for the model to be 3D-aware, several additional criteria are imposed.
  • One goal is good reconstruction from a canonical view. For instance, on ImageNet, ground truth camera extrinsics are unknown and may not be well-defined due to the presence of deformable and ambiguous object categories and scenes without salient objects. Therefore, a single ‘canonical pose’ may be fixed for reconstruction, and the criterion here is that the conditional NeRF-based autoencoder should successfully reconstruct the dataset from this view.
  • Another criterion is reasonable novel views. One may expect that images decoded at novel views within a specified range of the canonical view will have similar quality to images decoded at the canonical view.
  • a further criterion is correct geometry. The geometry of the scene as represented by the NeRF should correspond to the unknown ground truth geometry of the RGB image up to scale and shift.
  • the system can leverage main and auxiliary discriminators.
  • the first (main) discriminator distinguishes between real and reconstructed images at the canonical viewpoint, while the second (auxiliary) distinguishes between reconstructed images at the canonical viewpoint and novel views.
  • the generator may slightly corrupt the main view in order to collaborate with the novel view branch to fool the discriminator; thus, the system may add a stop-grad between the main view and the novel view discriminator. It may be unnecessary to tune a separate distribution of novel views for each dataset, and instead sample novel views uniformly in a disc tangent to a sphere at the canonical camera pose.
  • the system may use a nonsaturating GAN objective L_gan, such as described by Goodfellow et al. in “Generative adversarial nets”, found in Advances in Neural Information Processing Systems (2014), which is incorporated herein by reference, for both discriminators.
  • the system may additionally concatenate the predicted depth as input to the auxiliary discriminator to ensure the distribution of depths does not change depending on the camera viewpoint.
  • the system may supervise the NeRF depth with pseudo-GT geometry at the main viewpoint.
  • the system may employ the pretrained depth prediction transformer model DPT, described by Ranftl et al. in “Vision transformers for dense prediction” (2021), incorporated herein by reference, which produces pseudo-GT disparity estimates for the images in our training datasets.
  • the model can be limited to some extent by the quality of the depth estimator chosen.
  • the system may use a shift- and scale-invariant loss such as generally described by Ranftl et al.
  • let $i$ and $k$ be indices which range over the image plane and ray samples respectively, let $D_{i,k}$ be the pointwise disparities of the NeRF sample locations, let $w_{i,k}$ be the corresponding NeRF weights from volumetric rendering (see, e.g., Mildenhall et al., “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis”, 2020, which is incorporated herein by reference), and let $d_i$ be the pseudo-GT disparity from DPT. Then define $s^*, t^*$ to be the closed-form solution of the weighted least squares problem $\min_{s,t} \sum_i \sum_k w_{i,k}\left(s\,D_{i,k} + t - d_i\right)^2$; the depth loss is this weighted squared error evaluated at $s^*, t^*$ (a minimal sketch of this loss appears in the code example following this list).
  • this loss is minimized when the NeRF allocates 0 weight to all but one sample location along each ray, and the expectation with respect to the weights of the disparity is equal to the GT disparity map up to a scale and shift. In this way it penalizes weight distributions which are too spread out, but also encourages the weights to be concentrated near the correct geometry.
  • this formulation still allows for more than one surface along each ray and thus for occlusion and disocclusion, because the penalty is applied to the volumetric rendering weights and not the predicted density.
  • This depth loss formulation is highly beneficial for good performance. In particular, supervising the accumulated disparity rather than the pointwise disparities may lead to poor performance. Two penalties on the scale determined by this alignment are introduced as follows:
  • the first penalty has a small weight and prevents the sign of the disparity scale from flipping negative.
  • the second penalty is weighted to prevent the disparity maps from becoming too flat, which encourages perceptually pleasing novel views.
  • The goal of Stage 2 is to learn an autoregressive model over the discrete encodings produced by the Stage 1 encoder, so that completely new 3D scenes can be generated as output of the trained model. It has been verified experimentally that the fully generative stage 2 model inherits the properties optimized in stage 1; namely, 3D-consistent novel views and high-quality geometry. Moreover, top-k and top-p filtering can also be applied.
  • the architecture of FIG. 3 leverages the powerful vision transformer architecture in both the encoder and decoder.
  • the decoder is configured to employ 3D inductive bias to facilitate the learning of 3D representations.
  • for the encoder, the system uses a ViT-S model.
  • a ViT-L model is used to decode the latent codes into 3 triplanes of size 512x512 with feature dimension 32. It has been found that the triplane construction stage of the decoder benefits from the increased capacity of the ViT-L model. Note that the triplane size and feature dimension may each be varied, e.g., depending on the type of imagery or application of interest.
  • MLPs are lightweight, e.g., with 2 layers and 32 hidden units each, although more layers and/or different numbers of hidden units may be employed.
  • RGB color may be directly rendered rather than using a neural upsampler, as it has been found that neural upsampling can be a source of myriad and confusing artifacts not fixable via dual discriminators or consistency losses.
  • the system is able to train the transformer to autoregressively predict the next image token.
  • the autoregressive transformer may follow the hyperparameters of the base model of VIM described by Yu et al. in “Vector-quantized Image Modeling with Improved VQGAN”.
  • for some datasets a conditional model can be trained, while for other datasets unconditional generative models are trained.
  • the performance of the above-described architecture and method was studied relative to the baseline methods on ImageNet.
  • the ImageNet dataset is a well-known classification benchmark which includes 1.28M images of 1000 object classes. It is a standard benchmark for 2D image generation, for both conditional and unconditional generation.
  • the testing compared against pi-GAN (see Chan et al., “pi-gan: Periodic implicit generative adversarial networks for 3D-aware image synthesis”, 2020, the entire disclosure of which is incorporated by reference herein) and GIRAFFE (see Niemeyer et al., “Giraffe: Representing scenes as compositional generative neural feature fields”, in Proc. CVPR 2021, the entire disclosure of which is incorporated by reference herein), among other baseline methods.
  • Stage 1 of the instant method can also be used for single-view 3D reconstruction and manipulation.
  • FIGS. 7A-B each show single RGB images reconstructed by Stage 1 with estimated geometry.
  • the network performs well at reconstruction and needs only a single forward pass to compute a NeRF for an input image, without employing an inversion optimization.
  • the reconstructed NeRFs can be manipulated, for instance to render novel views.
  • An example 800 of a novel view is illustrated in FIG. 8. This figure shows example camera manipulations of a reconstructed scene, which includes a person riding a motorcycle on one wheel.
  • the above-described approach naturally handles sharp occlusions (left spyglass inset 802) and inpainting of dis-occluded pixels (right spyglass inset 804) without supervision of novel views.
  • Table 2 in FIG. 9 gives the results for generative models with and without depth losses. It can be seen that while adding depth loss can improve the quality of geometry, it did not significantly improve FID by itself. For the GAN methods, it was found that the pointwise disparity loss worked relatively poorly but the original scale- and shift-invariant MSE loss improved geometry. For the instant method, the Stage 2 performance is shown with and without our novel pointwise weighted depth loss. While performance on the depth accuracy metric can improve when various depth losses are incorporated into training, the effect on FID was mixed. In this way, it can be seen that incorporating pseudo-GT depth is unlikely to meaningfully improve the FID for the baseline methods without substantial changes.
  • there are other 3D benchmark datasets that can be considered.
  • One such prominent 3D-aware benchmark dataset is CompCars (see Yang et al., “A large-scale car dataset for fine-grained categorization and verification”, 2015, the entire disclosure of which is incorporated herein by reference).
  • As shown by Table 5 of FIG. 12, the instant model is competitive with the state of the art on CompCars.
  • This table presents the FID scores of a number of 3D generative models.
  • GIRAFFE HD is described by Xue et al. in “GIRAFFE HD: A high-resolution 3D-aware generative model”, in CVPR 2022, the entire disclosure of which is incorporated herein by reference.
  • * indicates numbers taken from the respective references mentioned here; the other models were trained for the testing. Note that although the baseline models used separate, tuned pose hyperparameters for each dataset, the identical simple pose sampling scheme worked well on both ImageNet and CompCars.
  • FIG. 13 illustrates an example flow diagram 1300 for a method of training a model in accordance with aspects of the technology described herein.
  • the method begins as follows: for each given first training example of a plurality of first training examples, the given first training example including a first image, a vector based on the first image, and a first plurality of depth values, each given depth value of first plurality of depth values corresponding to a distance between a reference point and a given portion of the first image. For each given first training example, the following operations are performed. At block 1304, generating, using a decoder of the model, a second image based on the vector. At block 1306, generating, using the decoder of the model, a second plurality of depth values.
  • each given depth value of the second plurality of depth values corresponds to a distance between a reference point and a given portion of the second image.
  • the method includes generating, using the decoder of the model, a third image based on the vector, in which the third image is different than the second image.
  • the method includes comparing, using one or more processors of a processing system, the second image to the first image to generate a reconstruction loss value for the given first training example, and at block 1312 comparing, using the one or more processors, the second image to the first image to generate a first real-fake loss value for the given first training example.
  • the method includes comparing, using the one or more processors, the second image to the third image to generate a second real-fake loss value for the given first training example.
  • the method includes comparing, using the one or more processors, the first plurality of depth values to the second plurality of depth values to generate a depth loss value for the given first training example.
  • the method includes modifying, using the one or more processors, one or more parameters of the decoder of the model based at least in part on the reconstruction loss values, first real-fake loss values, second real-fake loss values, and depth loss values generated for the plurality of first training examples.
  • FIG. 14 illustrates an example flow diagram 1400 for a two-stage method of training a 3D-aware model in accordance with aspects of the technology described herein.
  • Block 1402 indicates a first stage of training the model.
  • the method includes applying, by one or more processors of a computing system, an input image to an encoder, the encoder generating a learned latent codebook.
  • the method includes applying, by the one or more processors, the learned latent codebook to a triplane generator of a decoder to generate a set of triplanes each having a defined spatial resolution and a defined number of channels.
  • the method includes performing, by the one or more processors via the decoder, contraction according to a contraction function, and at block 1410 the method includes generating, by the one or more processors via the decoder, a set of novel views based on the contraction and the set of triplanes.
  • Block 1412 indicates a second stage of training the model upon training of the decoder in the first stage.
  • the method includes predicting, by an autoregressive transformer, a latent vector comprising one or more sequences of latent tokens based on the encoder generating the learned latent codebook in the first stage of training.
  • the method includes applying, by the one or more processors, the latent vector to the decoder to generate imagery of a 3D scene.
  • stage 1 of the model enables 3D-aware image editing and manipulation.
  • One forward pass through our network converts a single RGB image into a manipulable NeRF, without relying on a computationally expensive inversion optimization.
  • the fully generative stage 2 model allows the system to sample totally new 3D scenes from the many object classes of ImageNet.
  • the stage 2 model inherits the properties which were optimized in stage 1, namely plausible novel views and high-quality geometry.
  • the model can use identical pose sampling hyperparameters for each dataset.
  • the two-stage formulation is simpler and more reliable than known techniques for training 3D-aware generative models.
  • the described approach does not need to use progressive growing, a neural upsampler, pose conditioning, or patch-wise discriminators, but still learns meaningful 3D representations.
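The following is a minimal NumPy sketch of the weighted shift- and scale-invariant depth loss discussed in the list above: the scale and shift aligning the pointwise NeRF disparities to the pseudo-ground-truth disparity are obtained by closed-form weighted least squares, and the weighted squared residuals are then averaged. The function name, the normalization, and the omission of the additional scale penalties are simplifying assumptions; this is an illustrative reading of the loss, not the exact formulation of the disclosure.

```python
import numpy as np

def pointwise_depth_loss(disparity_samples, weights, disparity_gt):
    """Weighted shift- and scale-invariant depth loss (illustrative sketch).

    disparity_samples: (n_rays, n_samples) pointwise disparities D[i, k] of
        the NeRF sample locations along each ray.
    weights: (n_rays, n_samples) volumetric rendering weights w[i, k].
    disparity_gt: (n_rays,) pseudo-ground-truth disparities d[i] from DPT.

    A scale s* and shift t* aligning the samples to the pseudo-GT are found
    in closed form by weighted least squares; the loss is the (normalized)
    weighted squared residual at that alignment.
    """
    w = weights.ravel()
    x = disparity_samples.ravel()
    y = np.repeat(disparity_gt, disparity_samples.shape[1])
    # Closed-form weighted least squares for y ~ s * x + t.
    A = np.stack([x, np.ones_like(x)], axis=1)
    s, t = np.linalg.solve(A.T @ (w[:, None] * A), A.T @ (w * y))
    residual = s * x + t - y
    loss = float((w * residual ** 2).sum() / max(w.sum(), 1e-8))
    return loss, float(s), float(t)

rng = np.random.default_rng(0)
D = rng.uniform(0.1, 1.0, size=(128, 32))   # pointwise disparities
W = rng.dirichlet(np.ones(32), size=128)    # rendering weights (sum to 1 per ray)
loss, s_star, t_star = pointwise_depth_loss(D, W, rng.uniform(0.1, 1.0, size=128))
```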

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Multimedia (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Neurology (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

Systems and methods for automatic generation of 3D models from 2D images. In some examples, a conditional NeRF decoder and modified triplane representation may be introduced in a vector-quantized autoencoder, and may be trained in two stages. In stage 1, the NeRF-based decoder may be trained to reconstruct an input image, to predict the depth of each image, and to generate a shifted view of the image. In stage 2, the parameters of the NeRF-based decoder may be frozen, and a generative transformer may be trained to generate latent vectors based on the latent vectors produced by the stage 1 encoder. Using the architecture and loss functions of the present technology may allow a model to be trained using pseudo-ground truth depth values generated automatically from a separate, pretrained, off-the-shelf dense prediction transformer, rather than requiring exact ground truth data such as ground truth depth values or tuning pose hyperparameters.

Description

SYSTEMS AND METHODS FOR AUTOMATIC GENERATION OF THREE-DIMENSIONAL MODELS FROM TWO-DIMENSIONAL IMAGES
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to and the benefit of the filing date of U.S. Provisional Patent Application No. 63/428,860, filed November 30, 2022, the entire disclosure of which is expressly incorporated by reference herein.
BACKGROUND
[0002] Three-dimensional (“3D”) models are an important part of many popular media formats such as video games, movies, and computer graphics. As it can be time-consuming to create 3D content manually, efforts have been made to leverage machine learning techniques to automatically generate 3D content from two-dimensional (“2D”) images. However, such approaches may require specialized training data, such as training sets that link a given 2D image with corresponding ground truth pose data. As such training data may be difficult to obtain and/or expensive to generate, models trained using such data can be limited in terms of the image classes on which they may be trained and the domains on which they may be put to effective use.
SUMMARY
[0003] The present technology is related to systems and methods for automatic generation of 3D models from 2D image data. In some aspects of the technology, a conditional Neural Radiance Field (“NeRF”) decoder and modified triplane representation may be introduced in a vector-quantized autoencoder, and may be trained in two stages. The first stage learns to encode and reconstruct the dataset, and encodes images into a learned discrete latent codebook, which can reconstruct a training dataset and output a reconstruction corresponding to the input image. The second stage learns a generative model as an autoregressive transformer over the sequences of discrete latent codes predicted from the first stage encoder. By way of example, a NeRF approach is able to create 3D representations of a scene or object within a scene from 2D imagery. It employs encoding of the entire scene or the object into a neural network trained to predict light intensity (or radiance) for any point within the 2D image to create or otherwise generate one or more novel 3D views at different angles.
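To make the NeRF idea above concrete, the following is a minimal NumPy sketch of volumetric rendering along a single camera ray: a radiance field predicts color and density at sample points, which are alpha-composited into a pixel color, an expected depth, and per-sample rendering weights. The function names, sample counts, and the toy field are illustrative assumptions standing in for the decoder described in this disclosure, not its implementation.

```python
import numpy as np

def render_ray(radiance_field, origin, direction, near=0.1, far=6.0, n_samples=64):
    """Alpha-composite color and depth along one camera ray (NeRF-style).

    radiance_field(points) -> (rgb, sigma): per-point color in [0, 1] and
    non-negative volume density; here it stands in for the decoder's
    triplane-conditioned MLP.
    """
    t = np.linspace(near, far, n_samples)                       # sample depths along the ray
    points = origin[None, :] + t[:, None] * direction[None, :]  # (n_samples, 3)
    rgb, sigma = radiance_field(points)

    delta = np.diff(t, append=t[-1] + (t[-1] - t[-2]))          # spacing between samples
    alpha = 1.0 - np.exp(-sigma * delta)                        # per-segment opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1] + 1e-10]))
    weights = alpha * trans                                     # volumetric rendering weights

    color = (weights[:, None] * rgb).sum(axis=0)                # composited pixel color
    depth = (weights * t).sum()                                 # expected ray depth
    return color, depth, weights

def toy_field(points):
    """Toy radiance field: a soft reddish sphere of radius 1 at the origin."""
    dist = np.linalg.norm(points, axis=-1)
    sigma = 5.0 * np.clip(1.0 - dist, 0.0, None)
    rgb = np.tile([0.8, 0.3, 0.2], (points.shape[0], 1))
    return rgb, sigma

color, depth, weights = render_ray(toy_field, np.array([0.0, 0.0, -3.0]),
                                   np.array([0.0, 0.0, 1.0]))
```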
[0004] In such a case, in stage 1, the NeRF-based decoder may be trained to reconstruct an input image from a given latent vector (e.g., one produced by a pretrained vision transformer encoder), and to predict the depth of each image. In stage 2, the parameters of the NeRF-based decoder may be frozen, and a generative transformer (e.g., a generative autoregressive transformer) may be trained to generate latent vectors based on the latent vectors produced by the stage 1 encoder. [0005] Advantageously, the present technology may allow a model to be trained using pseudo-ground truth depth values generated automatically from a separate, pretrained, off-the-shelf model (e.g., a dense prediction transformer (“DPT”) model), rather than requiring exact ground truth pose data (e.g., ground truth depth values, tuning pose hyperparameters, etc.). As will be appreciated, this may thus allow the model to be trained on a large and diverse 2D image collection (e.g., the ImageNet benchmark image database, such as described by Deng et al., “Imagenet: A large-scale hierarchical image database” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, which is incorporated herein by reference). Once trained, the model may be used to generate a manipulable NeRF in a single forward pass, without relying on an inversion optimization. Further, models configured and trained according to the present technology may provide state-of-the-art generation results (e.g., up to a three-times improvement in Frechet Inception Distance scores) relative to other models.
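The stage 2 procedure described above can be illustrated with a hedged PyTorch sketch: the stage 1 decoder is frozen, and an autoregressive transformer is trained with a next-token objective over sequences of discrete latent codes from the stage 1 encoder. The class name LatentTokenTransformer, the module sizes, the vocabulary, and the random token data below are placeholder assumptions, not the configuration used in the disclosure.

```python
import torch
import torch.nn as nn

class LatentTokenTransformer(nn.Module):
    """Minimal autoregressive transformer over discrete latent tokens."""

    def __init__(self, vocab_size=1024, max_len=256, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                          # tokens: (batch, seq) of int64
        seq = tokens.shape[1]
        x = self.tok(tokens) + self.pos(torch.arange(seq, device=tokens.device))
        causal = nn.Transformer.generate_square_subsequent_mask(seq).to(tokens.device)
        return self.head(self.blocks(x, mask=causal))   # (batch, seq, vocab_size)

# Stage 2: the stage-1 decoder is frozen; only the transformer is optimized.
decoder = nn.Linear(8, 8)                   # stand-in for the frozen NeRF-based decoder
for p in decoder.parameters():
    p.requires_grad_(False)

model = LatentTokenTransformer()
opt = torch.optim.Adam(model.parameters(), lr=3e-4)
tokens = torch.randint(0, 1024, (2, 256))   # stand-in for stage-1 encoder code sequences

opt.zero_grad()
logits = model(tokens[:, :-1])              # predict each token from the ones before it
loss = nn.functional.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   tokens[:, 1:].reshape(-1))
loss.backward()
opt.step()
```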
[0006] In one aspect, the disclosure describes a computer-implemented method of training a model, comprising: (1) for each given first training example of a plurality of first training examples, the given first training example including a first image, a vector based on the first image, and a first plurality of depth values, each given depth value of first plurality of depth values corresponding to a distance between a reference point and a given portion of the first image: generating, using a decoder of the model, a second image based on the vector; generating, using the decoder of the model, a second plurality of depth values, each given depth value of second plurality of depth values corresponding to a distance between a reference point and a given portion of the second image; generating, using the decoder of the model, a third image based on the vector, the third image being different than the second image; comparing, using one or more processors of a processing system, the second image to the first image to generate a reconstruction loss value for the given first training example; comparing, using the one or more processors, the second image to the first image to generate a first real-fake loss value for the given first training example; comparing, using the one or more processors, the second image to the third image to generate a second real-fake loss value for the given first training example; and comparing, using the one or more processors, the first plurality of depth values to the second plurality of depth values to generate a depth loss value for the given first training example; and (2) modifying, using the one or more processors, one or more parameters of the decoder of the model based at least in part on the reconstruction loss values, first real-fake loss values, second real-fake loss values, and depth loss values generated for the plurality of first training examples.
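As an illustration of how the four loss terms enumerated in this method might be combined for a single training example, the following PyTorch sketch assembles a decoder loss from a reconstruction term, two nonsaturating real-fake (generator-side) terms computed from discriminator logits, and a depth term. The function name stage1_decoder_loss, the loss weights, the softplus-based generator formulation, the plain MSE stand-in for the depth loss, and the omission of a perceptual term are all simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def stage1_decoder_loss(first_image, second_image,
                        depth_first, depth_second,
                        disc_main_logit, disc_aux_logit,
                        w_recon=1.0, w_gan=0.1, w_depth=1.0):
    """Combine the per-example loss terms used to update the decoder.

    second_image is the canonical-view reconstruction of first_image;
    disc_main_logit is the main discriminator's logit on that reconstruction,
    and disc_aux_logit is the auxiliary discriminator's logit on the novel
    view (the 'third image'). Nonsaturating generator losses are used for
    both real-fake terms; a plain MSE stands in for the shift- and
    scale-invariant depth loss described elsewhere in this disclosure.
    """
    recon_loss = F.mse_loss(second_image, first_image)        # reconstruction loss
    gan_main = F.softplus(-disc_main_logit).mean()            # first real-fake loss
    gan_aux = F.softplus(-disc_aux_logit).mean()              # second real-fake loss
    depth_loss = F.mse_loss(depth_second, depth_first)        # simplified depth loss
    return w_recon * recon_loss + w_gan * (gan_main + gan_aux) + w_depth * depth_loss

# Illustrative usage with random tensors standing in for model outputs.
img = torch.rand(1, 3, 64, 64)
loss = stage1_decoder_loss(img, torch.rand_like(img),
                           torch.rand(1, 64, 64), torch.rand(1, 64, 64),
                           torch.randn(1), torch.randn(1))
```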
[0007] In some aspects, the method may further comprise: (1) for each given second training example of a plurality of second training examples, the given second training example including an input image and a target vector based on the input image: generating, using an encoder of the model, an output vector based on the input image; and comparing, using the one or more processors, the output vector to the target vector to generate a loss value for the given second training example; and (2) modifying, using the one or more processors, one or more parameters of the encoder of the model based at least in part on the loss values generated for the plurality of second training examples.
[0008] The output vector generated based on the input image may comprise a learned latent codebook. In this case, the learned latent codebook may be processed, during a first stage of training the model, to generate a set of triplanes. The reconstruction loss may be based on both a mean-squared error and a perceptual loss. Moreover, in an example the first real-fake loss value may distinguish between real and reconstructed images at a canonical viewpoint, while the second real-fake loss value may distinguish between reconstructed images at the canonical viewpoint and novel views.
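The learned latent codebook mentioned above implies a vector-quantization step in which each continuous encoder feature is snapped to its nearest codebook entry and represented by that entry's index (the discrete latent token). A small NumPy sketch with illustrative shapes and names follows; it is a generic VQ lookup, not the disclosure's implementation.

```python
import numpy as np

def quantize(features, codebook):
    """Snap continuous encoder features to their nearest codebook entries.

    features: (n, d) continuous encoder outputs; codebook: (k, d) learned
    embeddings. Returns the discrete indices (latent tokens) and the
    corresponding quantized vectors.
    """
    # Squared Euclidean distance between every feature and every codebook entry.
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (n, k)
    indices = d2.argmin(axis=1)
    return indices, codebook[indices]

rng = np.random.default_rng(0)
tokens, quantized = quantize(rng.normal(size=(16, 32)), rng.normal(size=(1024, 32)))
```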
[0009] Alternatively or additionally to the above, the method may further comprise using the trained model to generate one or more novel 3D views at different angles.
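One way to obtain novel views at different angles, consistent with the sampling scheme described elsewhere in this disclosure (novel views sampled uniformly in a disc tangent to a sphere at the canonical camera pose), is sketched below in NumPy. The function name, the disc radius, and the convention of pointing the camera at the origin are assumptions for illustration.

```python
import numpy as np

def sample_novel_camera(canonical_pos, disc_radius, rng):
    """Sample a camera position uniformly in a disc tangent to the sphere of
    radius |canonical_pos| at the canonical camera pose, looking at the origin.
    """
    canonical_pos = np.asarray(canonical_pos, dtype=float)
    normal = canonical_pos / np.linalg.norm(canonical_pos)        # outward sphere normal
    helper = np.array([1.0, 0.0, 0.0]) if abs(normal[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    u = np.cross(normal, helper)
    u /= np.linalg.norm(u)                                        # tangent-plane basis vector
    v = np.cross(normal, u)
    radius = disc_radius * np.sqrt(rng.uniform())                 # uniform over the disc area
    theta = rng.uniform(0.0, 2.0 * np.pi)
    position = canonical_pos + radius * (np.cos(theta) * u + np.sin(theta) * v)
    view_dir = -position / np.linalg.norm(position)               # point the camera at the origin
    return position, view_dir

rng = np.random.default_rng(0)
position, view_dir = sample_novel_camera([0.0, 0.0, 2.7], disc_radius=0.4, rng=rng)
```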
[0010] In another aspect, the disclosure describes a non-transitory computer program product comprising computer readable instructions that, when executed by a processing system, cause the processing system to perform any of the methods described in the preceding paragraph.
[0011] In another aspect, the disclosure describes a processing system comprising: (1) a memory storing a model; and (2) one or more processors coupled to the memory and configured to train the model according to a training method comprising: (a) for each given first training example of a plurality of first training examples, the given first training example including a first image, a vector based on the first image, and a first plurality of depth values, each given depth value of first plurality of depth values corresponding to a distance between a reference point and a given portion of the first image: generating, using a decoder of the model, a second image based on the vector; generating, using the decoder of the model, a second plurality of depth values, each given depth value of second plurality of depth values corresponding to a distance between a reference point and a given portion of the second image; generating, using the decoder of the model, a third image based on the vector, the third image being different than the second image; comparing the second image to the first image to generate a reconstruction loss value for the given first training example; comparing the second image to the first image to generate a first real-fake loss value for the given first training example; comparing the second image to the third image to generate a second real-fake loss value for the given first training example; and comparing the first plurality of depth values to the second plurality of depth values to generate a depth loss value for the given first training example; and (b) modifying one or more parameters of the decoder of the model based at least in part on the reconstruction loss values, first real-fake loss values, second real-fake loss values, and depth loss values generated for the plurality of first training examples. [0012] In some aspects, the one or more processors may be further configured to train the model according to a further training method comprising: (1) for each given second training example of a plurality of second training examples, the given second training example including an input image and a target vector based on the input image: generating, using an encoder of the model, an output vector based on the input image; and comparing, using the one or more processors, the output vector to the target vector to generate a loss value for the given second training example; and (2) modifying one or more parameters of the encoder of the model based at least in part on the loss values generated for the plurality of second training examples. In some aspects, the decoder of the model is a NeRF -based decoder. In some aspects, the encoder of the model is an autoregressive transformer.
[0013] By way of example, the output vector generated based on the input image may comprise a learned latent codebook. Here, the learned latent codebook may be processed, during a first stage of training the model, to generate a set of triplanes. The reconstruction loss may be based on both a mean-squared error and a perceptual loss. Moreover, the first real-fake loss value may be configured to distinguish between real and reconstructed images at a canonical viewpoint, while the second real-fake loss value may be configured to distinguish between reconstructed images at the canonical viewpoint and novel views.
[0014] According to another aspect of the technology, a computer-implemented method is provided for training a 3D-aware model. The method comprises, in a first stage of training the model: applying, by one or more processors of a computing system, an input image to an encoder, the encoder generating a learned latent codebook; applying, by the one or more processors, the learned latent codebook to a triplane generator of a decoder to generate a set of triplanes each having a defined spatial resolution and a defined number of channels; performing, by the one or more processors via the decoder, contraction according to a contraction function; and generating, by the one or more processors via the decoder, a set of novel views based on the contraction and the set of triplanes. And in a second stage of training the model: predicting, by an autoregressive transformer, a latent vector comprising one or more sequences of latent tokens based on the encoder generating the learned latent codebook in the first stage of training; and applying, by the one or more processors, the latent vector to the decoder to generate imagery of a 3D scene.
[0015] The decoder in the first stage may generate a set of losses. The set of losses may include one or more of a reconstruction loss, a depth loss, or a real/fake loss. Here, the reconstruction loss may be based on both a mean-squared error and a perceptual loss. And the real/fake loss may include: a first real-fake loss value that distinguishes between real and reconstructed images at a canonical viewpoint; and a second real-fake loss value that distinguishes between reconstructed images at the canonical viewpoint and novel views.
[0016] The method, in any of the above configurations, may further comprise using the trained model to generate one or more novel 3D views at different angles.
[0017] According to yet another aspect of the technology, a system is provided that comprises memory configured to store a set of imagery, and one or more processors operatively coupled to the memory. The one or more processors are configured to train a 3D-aware image model in two stages. In the first stage to train the model, the one or more processors are configured to: apply an input image to an encoder, the encoder generating a learned latent codebook; apply the learned latent codebook to a triplane generator of a decoder to generate a set of triplanes each having a defined spatial resolution and a defined number of channels; perform, via the decoder, contraction according to a contraction function; and generate, via the decoder, a set of novel views based on the contraction and the set of triplanes. In the second stage to train the model, the one or more processors are configured to: predict, using an autoregressive transformer, a latent vector comprising one or more sequences of latent tokens based on the encoder generating the learned latent codebook in the first stage of training; and apply the latent vector to the decoder to generate imagery of a 3D scene.
BRIEF DESCRIPTION OF THE DRAWINGS
[0018] The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
[0019] FIG. 1 is a functional diagram of an example system in accordance with aspects of the disclosure.
[0020] FIG. 2 is a functional diagram of an example system in accordance with aspects of the disclosure.
[0021] FIG. 3 is a flow chart illustrating an exemplary process flow for training a 3D-aware generative model in two stages, in accordance with aspects of the disclosure.
[0022] FIG. 4 is a flow chart illustrating exemplary process flows for generating loss values that may be used in stage 1 of the training set forth in FIG. 3, in accordance with aspects of the disclosure.
[0023] FIG. 5 is a table comparing FID scores of 3D generative models according to aspects of the disclosure.
[0024] FIG. 6 illustrates a set of imagery with generated samples and disparity from trained models according to aspects of the disclosure.
[0025] FIGS. 7A-B illustrate sets of input images, reconstruction images and disparities in accordance with aspects of the disclosure.
[0026] FIG. 8 illustrates an example of camera manipulations of a reconstructed scene in accordance with aspects of the disclosure.
[0027] FIG. 9 illustrates a table comparing different models to evaluate depth losses in accordance with aspects of the disclosure.
[0028] FIG. 10 illustrates a reconstruction table according to an ablation study in accordance with aspects of the disclosure.
[0029] FIG. 11 illustrates a FID score comparison table in accordance with aspects of the disclosure.
[0030] FIG. 12 illustrates FID scores of various 3D generative models in accordance with aspects of the disclosure.
[0031] FIG. 13 illustrates a flow diagram of a method in accordance with aspects of the disclosure.
[0032] FIG. 14 illustrates a flow diagram of another method in accordance with aspects of the disclosure.
DESCRIPTION
[0033] The present technology will now be described with respect to the following exemplary systems and methods. Reference numbers in common between the figures depicted and described below are meant to identify the same features.
Example Systems
[0034] FIG. 1 shows a high-level system diagram 100 of an exemplary processing system 102 for performing the methods described herein. The processing system 102 may include one or more hardware processors 104 and memory 106 storing instructions 108 and data 110. The instructions 108 and data 110 may include a model (e.g., a 3D-aware generative model). In addition, the data 110 may store training examples to be used in training the model, outputs from the model produced during training, training signals and/or loss values generated during such training, and/or outputs from the model generated during inference.
[0035] Processing system 102 may be resident on a single computing device. For example, processing system 102 may be a server, personal computer, or mobile device, and the model may thus be local to that single computing device. Similarly, processing system 102 may be resident on a cloud computing system or other distributed system. In such a case, the model may be distributed across two or more different physical computing devices. For example, the processing system may comprise a first computing device storing layers 1-n of a model having m layers, and a second computing device storing layers n-m of the model. In such cases, the first computing device may be one with less memory and/or processing power (e.g., a personal computer, mobile phone, tablet, etc.) compared to that of the second computing device, or vice versa.
[0036] Likewise, in some aspects of the technology, the processing system may comprise one or more computing devices storing one or more parts of a model, and one or more separate computing devices storing other parts of the model. For example, in some aspects, the processing system may comprise one or more computing devices storing the model’s decoder, and one or more separate computing devices storing the model’s generative transformer. Further, in some aspects of the technology, data used and/or generated during training or inference of the model (e.g., training examples, model outputs, loss values, etc.) may be stored on a different computing device than the model.
[0037] Further in this regard, FIG. 2 shows a high-level system diagram 200 in which the exemplary processing system 102 just described is distributed across two computing devices 102a and 102b, each of which may include one or more processors (104a, 104b) and memory (106a, 106b) storing instructions (108a, 108b) and data (110a, 110b). The processing system 102 comprising computing devices 102a and 102b is shown being in communication with one or more websites and/or remote storage systems over one or more networks 202, including website 204 and remote storage system 212. In this example, website 204 includes one or more servers 206a-206n. Each of the servers 206a-206n may have one or more processors (e.g., 208), and associated memory (e.g., 210) storing instructions and data, including the content of one or more webpages.
[0038] Likewise, although not shown, remote storage system 212 may also include one or more processors and memory storing instructions and data. In some aspects of the technology, the processing system 102 comprising computing devices 102a and 102b may be configured to retrieve data from one or more of website 204 and/or remote storage system 212, for use during training of the model. For example, in some aspects, the first computing device 102a may be configured to retrieve training images from the remote storage system 212. Those training images may then be fed to an encoder housed on the first computing device 102a to generate latent vectors which will in turn be fed into the model’s decoder, which may be housed on a second computing device 102b. Likewise, in some aspects, the training images may be fed to a pretrained off-the-shelf model (e.g., a DPT model) housed on the first computing device 102a in order to generate depth value estimates, which may then be used as pseudo-ground truth depth values in training the decoder housed on the second computing device 102b.
[0039] The processing systems described herein may be implemented on any type of computing device(s), such as any type of general computing device, server, or set thereof, and may further include other components typically present in general purpose computing devices or servers. Likewise, the memory of such processing systems may be of any non-transitory type capable of storing information accessible by the hardware processor(s) of the processing systems. For instance, the memory may include a non-transitory medium such as a hard-drive, memory card, optical disk, solid-state, tape memory, or the like. Computing devices suitable for the roles described herein may include different combinations of the foregoing, whereby different portions of the instructions and data are stored on different types of media.
[0040] In all cases, the computing devices described herein may further include any other components normally used in connection with a computing device such as a user interface subsystem. The user interface subsystem may include one or more user inputs (e.g., a mouse, keyboard, stylus, touch screen, and/or microphone) and one or more electronic displays (e.g., a monitor having a screen or any other electrical device that is operable to display information). Output devices besides an electronic display, such as speakers, lights, and vibrating, pulsing, or haptic elements, may also be included in the computing devices described herein.
[0041] The one or more processors included in each computing device may be any conventional processors, such as commercially available central processing units (“CPUs”), graphics processing units (“GPUs”), tensor processing units (“TPUs”), etc. Alternatively, the one or more processors may be a dedicated device such as an ASIC or other hardware-based processor. Each processor may have multiple cores that are able to operate in parallel. The processor(s), memory, and other elements of a single computing device may be stored within a single physical housing, or may be distributed between two or more housings. Similarly, the memory of a computing device may include a hard drive or other storage media located in a housing different from that of the processor(s), such as in an external database or networked storage device. Accordingly, references to a processor or computing device will be understood to include references to a collection of processors or computing devices or memories that may or may not operate in parallel, as well as one or more servers of a load-balanced server farm or cloud-based system.
[0042] The computing devices described herein may store instructions capable of being executed directly (such as machine code) or indirectly (such as scripts) by the processor(s). The computing devices may also store data, which may be retrieved, stored, or modified by one or more processors in accordance with the instructions. Instructions may be stored as computing device code on a computing device-readable medium. In that regard, the terms “instructions” and “programs” may be used interchangeably herein. Instructions may also be stored in object code format for direct processing by the processor(s), or in any other computing device language including scripts or collections of independent source code modules that are interpreted on demand or compiled in advance. By way of example, the programming language may be C#, C++, JAVA or another computer programming language. Similarly, any components of the instructions or programs may be implemented in a computer scripting language, such as JavaScript, PHP, ASP, or any other computer scripting language. Furthermore, any one of these components may be implemented using a combination of computer programming languages and computer scripting languages.
Model Configuration
[0043] FIG. 3 is a flow chart illustrating an exemplary process flow 300 for training a 3D-aware generative model in two stages, in accordance with aspects of the disclosure.
[0044] The model, identified as “VQ3D” in certain figures, comprises a vector-quantized autoencoder, which is trained in two stages. Stage 1 of the model receives an input image 302, and employs an encoder 304 and decoder 306. The encoder encodes RGB images into a learned latent codebook 308, and the decoder reconstructs them. A diagram of the inputs, outputs, and architecture of the first stage is shown in the top half of FIG. 3. In this example, the encoder 304 is shown as a ViT-S type, while the decoder 306 is a conditional NeRF configured with a ViT-L triplane generator 310.
[0045] As shown, the triplane generator 310 decodes the latent features from the learned latent codebook 308 and outputs a set of triplanes 311. By way of example, triplanes can align explicit features according to three orthogonal axis-aligned feature planes, in which each feature plane can have a resolution of R x R x C, where R is the spatial resolution and C represents the number of channels. In order to render potentially unbounded ImageNet scenes, the system contracts points according to a contraction function before looking up their values in the triplanes.
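To make the triplane lookup and contraction concrete, the following is a minimal sketch of how such an operation could be implemented. It assumes a MipNeRF360-style contraction function, three axis-aligned feature planes stored as [1, C, R, R] tensors, and concatenation of the three sampled features; the tensor shapes, the plane/axis pairing, and the choice to concatenate rather than sum are illustrative assumptions, not details taken from the disclosure.

```python
import torch
import torch.nn.functional as F

def contract(x, eps=1e-6):
    # MipNeRF360-style contraction: identity inside the unit ball,
    # maps all remaining space into a ball of radius 2.
    norm = x.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.where(norm <= 1.0, x, (2.0 - 1.0 / norm) * (x / norm))

def sample_triplanes(planes, points):
    # planes: dict with 'xy', 'xz', 'yz' feature planes of shape [1, C, R, R]
    # points: [N, 3] world-space points (possibly unbounded)
    coords = contract(points) / 2.0  # contracted points lie in [-2, 2]; rescale to [-1, 1]
    feats = []
    for key, dims in (("xy", (0, 1)), ("xz", (0, 2)), ("yz", (1, 2))):
        grid = coords[:, dims].view(1, -1, 1, 2)                   # [1, N, 1, 2]
        f = F.grid_sample(planes[key], grid, mode="bilinear",
                          align_corners=False)                     # [1, C, N, 1]
        feats.append(f[0, :, :, 0].t())                            # [N, C]
    return torch.cat(feats, dim=-1)                                # [N, 3*C]
```

The sampled per-point features would then be passed to lightweight MLPs, discussed later, to produce density and color.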
[0046] The first stage can be trained end-to-end by encoding and reconstructing RGB training images while minimizing reconstruction and adversarial losses. Because the decoder is a NeRF, the system is able to supervise the NeRF geometry with an additional training loss using pseudo-GT disparity.
[0047] The system can also render novel views of decoded images and critique them with an additional adversarial loss. A diagram of the key losses used in Stage 1 training is shown in FIG. 4, discussed below. After training, the first stage can be used to encode unseen single RGB images and then reconstruct them in 3D, which enables novel view synthesis, image editing and manipulations. As shown, the output of the encoder-decoder process includes generation of a 3D model 312 corresponding to image 314. In addition, this process is configured to generate at least one novel view 316 with a corresponding 3D model 318.
[0048] Stage 2 implements a generative autoregressive transformer, which is configured to predict sequences of latent tokens. The autoregressive transformer is trained to predict sequences of latent tokens from the encoder from stage 1 training. These tokens are then passed into the stage 1 decoder to produce fully generated 2D images. The stage 2 training inherits the 3D-aware properties from the stage 1 decoder and optimization. The system thus trains an autoregressive transformer to generate sequences of latent tokens, and the 3D-aware decoder is configured to produce fully generated imagery of 3D scenes with consistent geometry and plausible novel views.
[0049] As shown in FIG. 3, random noise 320 is an input to an autoregressive transformer 322, which may be trained to generate a latent vector 324 that the trained NeRF-based decoder 306 may then use to generate multiple views of a given subject (326a, 328a, 330a) and corresponding 3D models (326b, 328b, 330b). In one scenario, the transformer 322 can be trained on the sequences of latent codes 308 produced by the stage 1 encoder. After training, the autoregressive transformer 322 can be used to generate totally new 3D images by first sampling a sequence of latent tokens and then applying the NeRF-based decoder 306. Note that the Stage 2 model inherits the properties optimized in Stage 1, so the fully generated images have high quality geometry and plausible novel views.
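As an illustration of the stage 2 inference path just described, the sketch below autoregressively samples a sequence of latent tokens and hands it to the stage 1 decoder. The names `transformer`, `decoder`, and `bos_id`, and the assumption that the transformer returns per-position next-token logits, are placeholders for illustration rather than details from the disclosure; filtering of the logits (e.g., the top-k/top-p filtering discussed later) could be applied before sampling.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate_3d_scene(transformer, decoder, seq_len, bos_id=0, device="cpu"):
    # Start from a (hypothetical) beginning-of-sequence token.
    tokens = torch.full((1, 1), bos_id, dtype=torch.long, device=device)
    for _ in range(seq_len):
        logits = transformer(tokens)[:, -1, :]            # next-token logits, [1, vocab]
        probs = F.softmax(logits, dim=-1)
        nxt = torch.multinomial(probs, num_samples=1)     # sample one latent token
        tokens = torch.cat([tokens, nxt], dim=1)
    # The NeRF-based stage 1 decoder turns the token sequence into imagery
    # (and geometry) of a 3D scene.
    return decoder(tokens[:, 1:])
```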
[0050] FIG. 4 is a flow chart illustrating exemplary process flows 400-1, 400-2, 400-3, and 400-4 for generating loss values that may be used in stage 1 of the training set forth in FIG. 3, in accordance with aspects of the disclosure.
[0051] In that regard, process flow 400-1 of FIG. 4 illustrates the generation of a reconstruction loss value based on comparisons of the input image (e.g., 302 of FIG. 3) applied to the model and the reconstructed image generated during stage 1. In this example, the reconstruction loss of process flow 400-1 can be based on both a mean-squared error and a perceptual loss. However, the reconstruction loss may also be based on other suitable loss calculations, such as logit-Laplace loss.
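A minimal sketch of such a combined reconstruction loss is shown below. The frozen `feature_extractor` used for the perceptual term and the 0.1 weighting are illustrative assumptions, and the logit-Laplace variant mentioned above is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(recon, target, feature_extractor, perceptual_weight=0.1):
    # Pixel-space term: mean-squared error between reconstruction and input.
    mse = F.mse_loss(recon, target)
    # Perceptual term: distance in the feature space of a frozen network.
    with torch.no_grad():
        target_feats = feature_extractor(target)
    perceptual = F.mse_loss(feature_extractor(recon), target_feats)
    return mse + perceptual_weight * perceptual
```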
[0052] Process flows 400-2 and 400-4 illustrate the generation of real-fake loss values based on comparisons of two images. In that regard, process flow 400-2 illustrates the generation of a real-fake loss value based on a comparison of the input image and the reconstructed image. Similarly, process flow 400-4 illustrates the generation of a real-fake loss value based on a comparison of the reconstructed image and an image of a novel view. According to one aspect of the technology, all of the training dataset images are reconstructed from the same “canonical” camera viewpoint (see the rightmost portion of block 306 of FIG. 3), which simplifies the task for the decoder. High quality novel views are desired within a neighborhood of this canonical camera viewpoint. To enforce this, the system may render novel views during training and critique them with a novel view discriminator. Additionally, the system may concatenate the disparity as a fourth channel of input to the novel view discriminator, to ensure the geometry does not change depending on the camera viewpoint.
[0053] Process flow 400-3 illustrates the generation of depth loss values based on comparisons of the NeRF decoder’s predicted depth values and the pseudo-ground truth depth values generated by a separate, pretrained, off-the-shelf DPT model. In this example, it is assumed that the depth loss values will be weighted shift- and scale-invariant loss values calculated according to the equations set forth below. By way of example, the predicted depth from the NeRF decoder may be additionally supervised with pseudo ground truth depth. For instance, the DPT depth network can be used to generate pseudo-GT. The predicted depth is supervised with a depth loss. This loss is adapted to the NeRF setting and supervises the volumetric rendering weights of every NeRF sample location rather than the accumulated disparity.
[0054] With this architecture and loss formulation, stage 1 can accept an input RGB image and reconstruct it in 3D, generating a suitable reconstruction at the canonical view and plausible novel views.
Training
[0055] The goal of the first stage is to learn a model which can compress image pixels into a sequence of discrete indices corresponding to a learnt latent codebook (e.g., 308 in FIG. 3). Since it is beneficial for the model to be 3D-aware, several additional criteria are imposed. One goal is good reconstruction from a canonical view. For instance, on ImageNet, ground truth camera extrinsics are unknown and may not be well-defined due to the presence of deformable and ambiguous object categories and scenes without salient objects. Therefore, a single ‘canonical pose’ may be fixed for reconstruction, and the criterion here is that the conditional NeRF-based autoencoder should successfully reconstruct the dataset from this view.
[0056] Another criterion is reasonable novel views. One may expect that images decoded at novel views within a specified range of the canonical view will have similar quality to images decoded at the canonical view. A further criterion is correct geometry. The geometry of the scene as represented by the NeRF should correspond to the unknown ground truth geometry of the RGB image up to scale and shift.
[0057] These criteria can be enforced by introducing several auxiliary models and losses, summarized in FIG. 3. To enforce (1) good reconstruction at the canonical view, the system can train with a combination of the MSE, perceptual, and logit-Laplace losses, the combination of which is termed herein “Lrec”.
[0058] To enforce (2) reasonable novel views, the system can leverage main and auxiliary discriminators. The first (main) discriminator distinguishes between real and reconstructed images at the canonical viewpoint, while the second (auxiliary) distinguishes between reconstructed images at the canonical viewpoint and novel views. In this way, the model cannot allocate all its capacity to reconstructing images at the canonical viewpoint without also having high-quality novel views. The generator may slightly corrupt the main view in order to collaborate with the novel view branch to fool the discriminator; thus, the system may add a stop-grad between the main view and the novel view discriminator. It may be unnecessary to tune a separate distribution of novel views for each dataset, and instead sample novel views uniformly in a disc tangent to a sphere at the canonical camera pose. A nonsaturating GAN objective Lgan, such as described by Goodfellow et al. in “Generative adversarial nets”, found in Advances in Neural Information Processing Systems (2014), which is incorporated herein by reference, may be used for both discriminators. The system may additionally concatenate the predicted depth as input to the auxiliary discriminator to ensure the distribution of depths does not change depending on the camera viewpoint.
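The sketch below shows one way the two discriminators and the non-saturating GAN objective could be wired together, including the stop-grad on the canonical branch and the disparity concatenated as a fourth input channel for the auxiliary discriminator. All module and tensor names (disc_main, disc_aux, recon_canonical, disp_canonical, and so on) are placeholders rather than names used in the disclosure, and in practice the discriminator and generator terms would be optimized in alternating steps.

```python
import torch
import torch.nn.functional as F

def d_step(disc, real_like, fake_like):
    # Non-saturating GAN discriminator objective.
    return (F.softplus(-disc(real_like)) + F.softplus(disc(fake_like.detach()))).mean()

def g_step(disc, fake_like):
    # Non-saturating GAN generator objective.
    return F.softplus(-disc(fake_like)).mean()

def gan_losses(disc_main, disc_aux, real_rgb,
               recon_canonical, disp_canonical, recon_novel, disp_novel):
    # Main discriminator: real images vs. reconstructions at the canonical view.
    loss_d = d_step(disc_main, real_rgb, recon_canonical)
    # Auxiliary discriminator: canonical reconstruction (stop-grad) vs. novel view,
    # each with predicted disparity appended as a fourth channel.
    canon_in = torch.cat([recon_canonical, disp_canonical], dim=1).detach()
    novel_in = torch.cat([recon_novel, disp_novel], dim=1)
    loss_d = loss_d + d_step(disc_aux, canon_in, novel_in)
    # Generator terms for the canonical reconstruction and the novel view.
    loss_g = g_step(disc_main, recon_canonical) + g_step(disc_aux, novel_in)
    return loss_d, loss_g
```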
[0059] To enforce (3) correct geometry, the system may supervise the NeRF depth with pseudo-GT geometry at the main viewpoint. The system may employ the pretrained depth prediction transformer model DPT, described by Ranftl et al. in “Vision transformers for dense prediction” (2021), incorporated herein by reference, which produces pseudo-GT disparity estimates for the images in our training datasets. Thus, the model can be limited to some extent by the quality of the depth estimator chosen. A shift- and scale-invariant loss, such as generally described by Ranftl et al. in “Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer” in the IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3) (2022), which is incorporated herein by reference, may be used for training monocular depth estimation in which the shift and scale are determined by solving a closed-form least squares alignment with the GT depth. Here, this shift- and scale-invariant loss is adapted to the NeRF setting, in which the system supervises the weight of every sample along each ray rather than the accumulated depth. For a given image, let i ∈ {1, ..., N} and k ∈ {1, ..., L} be indices which range over the image plane and ray samples respectively, let D_ik be the pointwise disparities of the NeRF sample locations, let w_ik be the corresponding NeRF weights from volumetric rendering (see, e.g., Mildenhall et al., “Nerf: Representing Scenes as Neural Radiance Fields for View Synthesis”, 2020, which is incorporated herein by reference), and let d_i be the pseudo-GT disparity from DPT. Then define s*, t* to be the closed-form solution of the weighted least squares problem:
$$ (s^{*}, t^{*}) \;=\; \operatorname*{arg\,min}_{s,\,t} \;\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{L} w_{ik}\,\big(s\,D_{ik} + t - d_i\big)^{2} $$
And set the depth loss to be the weighted scale- and shift-invariant loss:
$$ \mathcal{L}_{depth} \;=\; \frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{L} w_{ik}\,\big(s^{*} D_{ik} + t^{*} - d_i\big)^{2} $$
[0060] Assuming the weights sum to 1 along each ray, this loss is minimized when the NeRF allocates 0 weight to all but one sample location along each ray, and the expectation with respect to the weights of the disparity is equal to the GT disparity map up to a scale and shift. In this way it penalizes weight distributions which are too spread out, but also encourages the weights to be concentrated near the correct geometry. Importantly, this formulation still allows for more than one surface along each ray and thus for occlusion and disocclusion, because the penalty is applied to the volumetric rendering weights and not the predicted density. This depth loss formulation is highly beneficial for good performance. In particular, supervising the accumulated disparity rather than the pointwise disparities may lead to poor performance. Two penalties on the scale determined by this alignment are introduced as follows:
$$ \mathcal{L}_{scale} \;=\; \lambda_{s1}\cdot\big(\text{penalty on a negative disparity scale } s^{*}\big) \;+\; \lambda_{s2}\cdot\big(\text{penalty on overly flat disparity maps}\big) $$
[0061] Here, λs1 is the weight of a small penalty to prevent the sign of the disparity scale from flipping negative, and λs2 weights a penalty preventing the disparity maps from becoming too flat, which encourages perceptually pleasing novel views. One can additionally include the same vector-quantization loss Lvq as discussed by Yu et al. in “Vector-quantized image modeling with improved vqgan” (2021), the disclosure of which is incorporated herein by reference, and the distortion and interlevel losses of MipNeRF360 as discussed by Barron et al. in “Mip-nerf 360: Unbounded anti-aliased neural radiance fields” (2022), which is also incorporated herein by reference, given by Lnerf. The loss for the autoencoder would thus be:
$$ \mathcal{L}_{\text{stage 1}} \;=\; \mathcal{L}_{rec} + \mathcal{L}_{gan} + \mathcal{L}_{depth} + \mathcal{L}_{scale} + \mathcal{L}_{vq} + \mathcal{L}_{nerf} $$
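To make the depth term concrete, the following sketch solves the weighted least squares alignment in closed form and evaluates the weighted penalty at the aligned scale and shift. It is a simplification: the alignment is computed over a whole batch of rays rather than per image, and the use of a squared residual mirrors the least-squares formulation above rather than a detail confirmed by the disclosure.

```python
import torch

def weighted_depth_loss(w, D, d, eps=1e-8):
    # w: [N, L] NeRF volumetric rendering weights per ray sample
    # D: [N, L] pointwise disparities of the NeRF sample locations
    # d: [N]    pseudo-GT disparity per pixel (e.g., from a DPT model)
    d_full = d[:, None].expand_as(D)
    # Closed-form solution of argmin_{s,t} sum_ik w_ik (s*D_ik + t - d_i)^2.
    sw = w.sum()
    sw_D, sw_d = (w * D).sum(), (w * d_full).sum()
    sw_DD, sw_Dd = (w * D * D).sum(), (w * D * d_full).sum()
    det = (sw * sw_DD - sw_D ** 2).clamp_min(eps)
    s = (sw * sw_Dd - sw_D * sw_d) / det
    t = (sw_d - s * sw_D) / sw.clamp_min(eps)
    # Weighted scale- and shift-invariant penalty at the aligned (s*, t*).
    return (w * (s * D + t - d_full) ** 2).sum() / D.shape[0]
```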
[0062] The goal of Stage 2 is to learn an autoregressive model over the discrete encodings produced by the Stage 1 encoder, so that completely new 3D scenes can be generated as output of the trained model. It has been verified experimentally that the fully generative stage 2 model inherits the properties optimized in stage 1; namely, 3D-consistent novel views and high-quality geometry. Moreover, top-k and top-p filtering can also be applied.
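A generic sketch of top-k and top-p (nucleus) filtering of next-token logits is given below; the defaults mirror the top-k of 1000 and top-p of 1.0 reported later in the testing discussion, but the implementation details are otherwise standard assumptions rather than specifics of the disclosure.

```python
import torch
import torch.nn.functional as F

def filter_logits(logits, top_k=1000, top_p=1.0):
    # logits: [vocab_size] next-token logits from the autoregressive transformer.
    logits = logits.clone()
    if top_k is not None and 0 < top_k < logits.numel():
        kth_value = torch.topk(logits, top_k).values[-1]
        logits[logits < kth_value] = float("-inf")        # keep only the k largest
    if top_p is not None and top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        remove = cum_probs > top_p
        remove[1:] = remove[:-1].clone()                  # keep the first token past the threshold
        remove[0] = False
        logits[sorted_idx[remove]] = float("-inf")
    return logits

# Sampling a token from the filtered distribution:
# token = torch.multinomial(F.softmax(filter_logits(logits), dim=-1), num_samples=1)
```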
[0063] The architecture of FIG. 3 leverages the powerful vision transformer architecture in both the encoder and decoder. Here, the decoder is configured to employ 3D inductive bias to facilitate the learning of 3D representations. An overview of the individual components of the architecture is now presented.
[0064] For the encoder of FIG. 3, the system uses a ViT-S model. For the decoder, a ViT-L model is used to decode the latent codes into 3 triplanes of size 512x512 with feature dimension 32. It has been found that the triplane construction stage of the decoder benefits from the increased capacity of the ViT-L model. Note that the triplane size and feature dimension may each be varied, e.g., depending on the type of imagery or application of interest.
[0065] It may be important to reconstruct and generate potentially unbounded ImageNet scenes. It is possible to leverage the powerful triplane representation, such as noted by Chan et al. in “Efficient geometry-aware 3D generative adversarial networks” (2021), which is incorporated herein by reference. Therefore, an adapted triplane representation can be employed in which points are contracted before looking up their values in the triplanes. For instance, the contraction function of MipNeRF360 can be applied to bound coordinates within the triplanes before looking up their values, along with the linear-in-disparity sampling scheme with separate proposal and NeRF MLPs. The MLPs convert interpolated triplane features to density and, in the case of the NeRF MLP, RGB color. These MLPs are lightweight, e.g., with 2 layers and 32 hidden units each, although more layers and/or different numbers of hidden units may be employed. RGB color may be directly rendered rather than using a neural upsampler, as it has been found that neural upsampling can be a source of myriad confusing artifacts not fixable via dual discriminators or consistency losses.
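A minimal sketch of such a lightweight MLP head is shown below. It assumes the three 32-channel plane features are concatenated into a 96-dimensional input and that the NeRF MLP emits one density value plus three color channels; the dimensions and the choice of output activations are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class TriplaneNeRFHead(nn.Module):
    # Two-layer, 32-hidden-unit MLP mapping interpolated triplane features to
    # density and RGB (a proposal MLP of the same shape would emit density only).
    def __init__(self, in_dim=96, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),   # 1 density + 3 color channels
        )

    def forward(self, feats):
        out = self.net(feats)
        density = torch.relu(out[..., :1])    # non-negative density
        rgb = torch.sigmoid(out[..., 1:])     # colors in [0, 1]
        return density, rgb
```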
[0066] Furthermore, the system is able to train the transformer to autoregressively predict the next image token. The hyperparameters may follow those of the base model of VIM, described by Yu et al. in “Vector-quantized image modeling with improved vqgan”. For ImageNet, a conditional model can be trained, and for other datasets unconditional generative models can be trained.
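For context, training a transformer to autoregressively predict the next image token typically reduces to a teacher-forced cross-entropy objective of the following form; this generic sketch is not specific to the VIM hyperparameters referenced above.

```python
import torch.nn.functional as F

def next_token_loss(logits, tokens):
    # logits: [B, T, V] transformer outputs; tokens: [B, T] latent token ids.
    # Each position t is trained to predict token t+1 given tokens up to t.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),   # predictions for steps 1..T-1
        tokens[:, 1:].reshape(-1),                     # targets shifted by one
    )
```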
Testing and Results
[0067] The performance of the above-described architecture and method was studied relative to the baseline methods on ImageNet. The ImageNet dataset is a well-known classification benchmark which includes 1.28M images of 1000 object classes. It is a standard benchmark for 2D image generation, for both conditional and unconditional generation. The testing compared against pi-GAN (see Chan et al., “pi-gan: Periodic implicit generative adversarial networks for 3D-aware image synthesis”, 2020, the entire disclosure of which is incorporated by reference herein), GIRAFFE (see Niemeyer et al., “Giraffe: Representing scenes as compositional generative neural feature fields”, in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2021, the entire disclosure of which is incorporated by reference herein), EG3D (see Chan et al., “Efficient geometry-aware 3D generative adversarial networks”, 2021, the entire disclosure of which is incorporated by reference herein), and StyleNeRF (see Gu et al., “Stylenerf: A style-based 3D-aware generator for high resolution image synthesis”, 2021, the entire disclosure of which is incorporated by reference herein). The testing re-implemented pi-GAN and GIRAFFE using the system’s internal framework, and ran the provided code for EG3D and StyleNeRF. Since ImageNet does not have GT poses and pseudo-GT poses are not possible to compute, generator and discriminator pose conditioning was disabled for EG3D, and poses were sampled from a pre-defined pose distribution. The main results for generation on ImageNet compared against the benchmarks are given in Table 1, shown in FIG. 5. Notably, the FID score on ImageNet was the best by a wide margin, with a more than threefold improvement over the next best baseline score. Generated examples from the method and the benchmarks are reproduced in FIG. 6, where it can be seen that the method described herein generates superior samples.
[0068] In addition to generating high quality scenes, Stage 1 of the instant method can also be used for single-view 3D reconstruction and manipulation. FIGS. 7A-B each show single RGB images reconstructed by Stage 1 with estimated geometry. The network performs well at reconstruction and needs only a single forward pass to compute a NeRF for an input image, without employing an inversion optimization. Moreover, the reconstructed NeRFs can be manipulated, for instance to render novel views. An example 800 of a novel view is illustrated in FIG. 8. This figure shows example camera manipulations of a reconstructed scene, which includes a person riding a motorcycle on one wheel. The above-described approach naturally handles sharp occlusions (left spyglass inset 802) and inpainting of dis-occluded pixels (right spyglass inset 804) without supervision of novel views.
[0069] For results on ImageNet, the system was trained for the longest possible time and used optimal top-p and top-k sampling parameters. Additional analysis and experiments were conducted on the learning of geometry and model ablation, for which a consistent Stage 1 step, Stage 2 step, and top-p and top-k setting were used across each study.
[0070] First, the learning of good geometry was studied, both for the instant model and the baseline methods. One potential concern may be that the use of pseudo-GT depth limits the comparability of the present technique with the baseline GAN methods. This concern was addressed by analyzing both the FID score and the depth accuracy metric. This metric is defined as the mean- and variance-normalized MSE between the NeRF depth and the predicted depth of the generated image.
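A sketch of the depth accuracy metric as described, i.e., the MSE between mean- and variance-normalized depth maps, could look like the following; the exact normalization convention is an assumption.

```python
import torch

def depth_accuracy(nerf_depth, predicted_depth, eps=1e-8):
    # Lower is better: normalizing each depth map to zero mean and unit
    # variance removes the scale/shift ambiguity before comparison.
    def normalize(x):
        return (x - x.mean()) / (x.std() + eps)
    return torch.mean((normalize(nerf_depth) - normalize(predicted_depth)) ** 2)
```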
[0071] Table 2 in FIG. 9 gives the results for generative models with and without depth losses. It can be seen that while adding depth loss can improve the quality of geometry, it did not significantly improve FID by itself. For the GAN methods, it was found that the pointwise disparity loss worked relatively poorly but the original scale- and shift-invariant MSE loss improved geometry. For the instant method, the Stage 2 performance is shown with and without our novel pointwise weighted depth loss. While performance on the depth accuracy metric can improve when various depth losses are incorporated into training, the effect on FID was mixed. In this way, it can be seen that incorporating pseudo-GT depth is unlikely to meaningfully improve the FID for the baseline methods without substantial changes.
[0072] Better geometry does not imply better FID. Additionally, learning geometry without a depth loss may be unreliable. During the ImageNet experiments, it was observed that StyleNeRF was sensitive to hyperparameters, and can learn to produce flat depths. EG3D showed that removing GT poses as input to the discriminator is enough to cause the geometry to degenerate to a flat plane.
[0073] Next, experiments were conducted to analyze the key components of the model. An ablation of design choices was conducted for the Stage 1 training, the results of which are presented in Table 3 of FIG. 10. It can be seen that the model did not learn geometry in an unsupervised way. Here, removing the depth loss was sufficient to cause the depth MSE to rise because the depths collapsed to a flat plane. The importance of including the novel view discriminator can further be seen, as removing it is sufficient to worsen both FID and depth MSE. Note that without the depth loss, the FID is much better, but this is expected because forcing the NeRF geometry to be correct is a very strong constraint which greatly limits the expressive power of the network.
[0074] The performance of VQ3D with top-p and top-k sampling was analyzed, where the full codebook size was 8192, with the results presented in Table 4 of FIG. 11. It is noted that these sampling changes can give significant performance improvements analogous to truncation sampling for GANs. For the instant (VQ3D) approach, a top-k of 1000 and top-p of 1.0 gave the best FID results.
[0075] There are other 3D benchmark datasets that can be considered. One such prominent 3D-aware benchmark dataset is CompCars (see Yang et al., “A large-scale car dataset for fine-grained categorization and verification”, 2015, the entire disclosure of which is incorporated herein by reference). On CompCars, as shown by Table 5 of FIG. 12, the instant model is competitive with the state of the art. This table presents the FID scores of a number of 3D generative models. In addition to the models discussed above, for GIRAFFE HD, see Xue et al., “GIRAFFE HD: A high-resolution 3D-aware generative model”, in CVPR 2022, the entire disclosure of which is incorporated herein by reference. In the table, * indicates numbers taken from the respective references mentioned here; the other models were trained for the testing. Note that although the baseline models used separate, tuned pose hyperparameters for each dataset, the identical simple pose sampling scheme worked well on both ImageNet and CompCars.
Example Methods
[0076] Fig. 13 illustrates an example flow diagram 1300 for a method of training a model in accordance with aspects of the technology described herein.
[0077] At block 1302, the method begins as follows: for each given first training example of a plurality of first training examples, the given first training example including a first image, a vector based on the first image, and a first plurality of depth values, each given depth value of the first plurality of depth values corresponding to a distance between a reference point and a given portion of the first image. For each given first training example, the following operations are performed. At block 1304, generating, using a decoder of the model, a second image based on the vector. At block 1306, generating, using the decoder of the model, a second plurality of depth values. Here, each given depth value of the second plurality of depth values corresponds to a distance between a reference point and a given portion of the second image. At block 1308, the method includes generating, using the decoder of the model, a third image based on the vector, in which the third image is different than the second image. At block 1310, the method includes comparing, using one or more processors of a processing system, the second image to the first image to generate a reconstruction loss value for the given first training example, and at block 1312 comparing, using the one or more processors, the second image to the first image to generate a first real-fake loss value for the given first training example. At block 1314, the method includes comparing, using the one or more processors, the second image to the third image to generate a second real-fake loss value for the given first training example. At block 1316, the method includes comparing, using the one or more processors, the first plurality of depth values to the second plurality of depth values to generate a depth loss value for the given first training example. Then at block 1318, the method includes modifying, using the one or more processors, one or more parameters of the decoder of the model based at least in part on the reconstruction loss values, first real-fake loss values, second real-fake loss values, and depth loss values generated for the plurality of first training examples.
[0078] Fig. 14 illustrates an example flow diagram 1400 for a two-stage method of training a 3D-aware model in accordance with aspects of the technology described herein. Block 1402 indicates a first stage of training the model. Here, at block 1404 the method includes applying, by one or more processors of a computing system, an input image to an encoder, the encoder generating a learned latent codebook. Then, at block 1406, the method includes applying, by the one or more processors, the learned latent codebook to a triplane generator of a decoder to generate a set of triplanes each having a defined spatial resolution and a defined number of channels. At block 1408, the method includes performing, by the one or more processors via the decoder, contraction according to a contraction function, and at block 1410 the method includes generating, by the one or more processors via the decoder, a set of novel views based on the contraction and the set of triplanes. Block 1412 indicates a second stage of training the model upon training of the decoder in the first stage. Here, at block 1414, the method includes predicting, by an autoregressive transformer, a latent vector comprising one or more sequences of latent tokens based on the encoder generating the learned latent codebook in the first stage of training. Then, at block 1416, the method includes applying, by the one or more processors, the latent vector to the decoder to generate imagery of a 3D scene.
[0079] The above-described approach presents a novel 3D-aware generative model trainable on large and diverse 2D image collections. The model does not require tuning pose hyperparameters for each dataset or ground truth poses, and can leverage a pseudo-depth estimator during training. State-of-the-art generation results were obtained on ImageNet and competitive results on CompCars, demonstrating that the 3D-aware generative model is capable of fitting a dataset at the scale and diversity of ImageNet. The present model significantly outperforms the next best baseline. Stage 1 of the model enables 3D-aware image editing and manipulation. One forward pass through our network converts a single RGB image into a manipulable NeRF, without relying on a computationally expensive inversion optimization. The fully generative stage 2 model allows the system to sample totally new 3D scenes from the many object classes of ImageNet. The stage 2 model inherits the properties which were optimized in stage 1, namely plausible novel views and high-quality geometry.
[0080] It can be seen that this approach has several technical advantages, enabling it to scale well to ImageNet. First, separating the training into two stages (reconstruction and generation) enables direct supervision of the first stage training via a novel depth loss, using pseudo-GT depth. This is possible because in the first stage, as the conditional NeRF decoder learns to reconstruct the input, it also predicts the depth of each image. Second, the system does not require hand-tuning of pose sampling distributions or ground-truth pose data. Rather, the training objective simply enforces reconstruction from a canonical camera pose, and plausible novel views within a neighborhood of the canonical pose. This objective eliminates the need for excessive tuning of the pose distribution for each dataset, and allows the model to work out-of-the-box for multiple object categories. Thus, the model can use identical pose sampling hyperparameters for each dataset. Moreover, the two-stage formulation is simpler and more reliable than known techniques for training 3D-aware generative models. The described approach does not need to use progressive growing, a neural upsampler, pose conditioning, or patch-wise discriminators, but still learns meaningful 3D representations.
[0081] Unless otherwise stated, the foregoing alternative examples are not mutually exclusive, but may be implemented in various combinations to achieve unique advantages. As these and other variations and combinations of the features discussed above can be utilized without departing from the subject matter defined by the claims, the foregoing description of exemplary systems and methods should be taken by way of illustration rather than by way of limitation of the subject matter defined by the claims. In addition, the provision of the examples described herein, as well as clauses phrased as “such as,” “including,” “comprising,” and the like, should not be interpreted as limiting the subject matter of the claims to the specific examples; rather, the examples are intended to illustrate only some of the many possible embodiments. Further, the same reference numbers in different drawings can identify the same or similar elements.

Claims

1. A computer-implemented method of training a model, comprising: for each given first training example of a plurality of first training examples, the given first training example including a first image, a vector based on the first image, and a first plurality of depth values, each given depth value of first plurality of depth values corresponding to a distance between a reference point and a given portion of the first image: generating, using a decoder of the model, a second image based on the vector; generating, using the decoder of the model, a second plurality of depth values, each given depth value of the second plurality of depth values corresponding to a distance between a reference point and a given portion of the second image; generating, using the decoder of the model, a third image based on the vector, the third image being different than the second image; comparing, using one or more processors of a processing system, the second image to the first image to generate a reconstruction loss value for the given first training example; comparing, using the one or more processors, the second image to the first image to generate a first real-fake loss value for the given first training example; comparing, using the one or more processors, the second image to the third image to generate a second real-fake loss value for the given first training example; and comparing, using the one or more processors, the first plurality of depth values to the second plurality of depth values to generate a depth loss value for the given first training example; and modifying, using the one or more processors, one or more parameters of the decoder of the model based at least in part on the reconstruction loss values, first real-fake loss values, second real-fake loss values, and depth loss values generated for the plurality of first training examples.
2. The method of claim 1, further comprising: for each given second training example of a plurality of second training examples, the given second training example including an input image and a target vector based on the input image: generating, using an encoder of the model, an output vector based on the input image; and comparing, using the one or more processors, the output vector to the target vector to generate a loss value for the given second training example; and modifying, using the one or more processors, one or more parameters of the encoder of the model based at least in part on the loss values generated for the plurality of second training examples.
3. The method of claim 2, wherein the output vector generated based on the input image comprises a learned latent codebook.
4. The method of claim 3, wherein the learned latent codebook is processed, during a first stage of training the model, to generate a set of triplanes.
5. The method of claim 1, wherein the reconstruction loss is based on both a mean-squared error and a perceptual loss.
6. The method of claim 1, wherein: the first real-fake loss value distinguishes between real and reconstructed images at a canonical viewpoint; and the second real-fake loss value distinguishes between reconstructed images at the canonical viewpoint and novel views.
7. The method of claim 1, further comprising using the trained model to generate one or more novel 3D views at different angles.
8. A processing system comprising: a memory storing a model; and one or more processors coupled to the memory and configured to train the model according to a training method comprising: for each given first training example of a plurality of first training examples, the given first training example including a first image, a vector based on the first image, and a first plurality of depth values, each given depth value of the first plurality of depth values corresponding to a distance between a reference point and a given portion of the first image: generating, using a decoder of the model, a second image based on the vector; generating, using the decoder of the model, a second plurality of depth values, each given depth value of second plurality of depth values corresponding to a distance between a reference point and a given portion of the second image; generating, using the decoder of the model, a third image based on the vector, the third image being different than the second image; comparing the second image to the first image to generate a reconstruction loss value for the given first training example; comparing the second image to the first image to generate a first real-fake loss value for the given first training example; comparing the second image to the third image to generate a second real-fake loss value for the given first training example; and comparing the first plurality of depth values to the second plurality of depth values to generate a depth loss value for the given first training example; and modifying one or more parameters of the decoder of the model based at least in part on the reconstruction loss values, first real-fake loss values, second real-fake loss values, and depth loss values generated for the plurality of first training examples.
9. The system of claim 8, wherein the one or more processors are further configured to train the model according to a further training method comprising: for each given second training example of a plurality of second training examples, the given second training example including an input image and a target vector based on the input image: generating, using an encoder of the model, an output vector based on the input image; and comparing, using the one or more processors, the output vector to the target vector to generate a loss value for the given second training example; and modifying one or more parameters of the encoder of the model based at least in part on the loss values generated for the plurality of second training examples.
10. The system of claim 9, wherein the encoder of the model is an autoregressive transformer.
11. The system of claim 8, wherein the decoder of the model is a NeRF-based decoder.
12. The system of claim 9, wherein the output vector generated based on the input image comprises a learned latent codebook.
13. The system of claim 12, wherein the learned latent codebook is processed, during a first stage of training the model, to generate a set of triplanes.
14. The system of claim 8, wherein the reconstruction loss is based on both a mean-squared error and a perceptual loss.
15. The system of claim 8, wherein: the first real-fake loss value distinguishes between real and reconstructed images at a canonical viewpoint; and the second real-fake loss value distinguishes between reconstructed images at the canonical viewpoint and novel views.
16. A computer-implemented method of training a 3D-aware model, comprising: in a first stage of training the model: applying, by one or more processors of a computing system, an input image to an encoder, the encoder generating a learned latent codebook; applying, by the one or more processors, the learned latent codebook to a triplane generator of a decoder to generate a set of triplanes each having a defined spatial resolution and a defined number of channels; performing, by the one or more processors via the decoder, contraction according to a contraction function; and generating, by the one or more processors via the decoder, a set of novel views based on the contraction and the set of triplanes; and in a second stage of training the model: predicting, by an autoregressive transformer, a latent vector comprising one or more sequences of latent tokens based on the encoder generating the learned latent codebook in the first stage of training; and applying, by the one or more processors, the latent vector to the decoder to generate imagery of a 3D scene.
17. The method of claim 16, wherein the decoder in the first stage generates a set of losses.
18. The method of claim 17, wherein the set of losses includes one or more of a reconstruction loss, a depth loss, or a real/fake loss.
19. The method of claim 18, wherein the reconstruction loss is based on both a mean-squared error and a perceptual loss.
20. The method of claim 18, wherein the real/fake loss includes: a first real-fake loss value that distinguishes between real and reconstructed images at a canonical viewpoint; and a second real-fake loss value that distinguishes between reconstructed images at the canonical viewpoint and novel views.
21. The method of claim 16, further comprising using the trained model to generate one or more novel 3D views at different angles.
22. A system, comprising: memory configured to store a set of imagery; and one or more processors operatively coupled to the memory, the one or more processors being configured to train a 3D-aware image model: wherein, in a first stage to train the model, the one or more processors are configured to: apply an input image to an encoder, the encoder generating a learned latent codebook; apply the learned latent codebook to a triplane generator of a decoder to generate a set of triplanes each having a defined spatial resolution and a defined number of channels; perform, via the decoder, contraction according to a contraction function; and generate, via the decoder, a set of novel views based on the contraction and the set of triplanes; and in a second stage to train the model: predict, using an autoregressive transformer, a latent vector comprising one or more sequences of latent tokens based on the encoder generating the learned latent codebook in the first stage of training; and apply the latent vector to the decoder to generate imagery of a 3D scene.
PCT/US2023/081079 2022-11-30 2023-11-27 Systems and methods for automatic generation of three-dimensional models from two-dimensional images Ceased WO2024118464A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202263428860P 2022-11-30 2022-11-30
US63/428,860 2022-11-30

Publications (1)

Publication Number Publication Date
WO2024118464A1 true WO2024118464A1 (en) 2024-06-06

Family

ID=89385918

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/081079 Ceased WO2024118464A1 (en) 2022-11-30 2023-11-27 Systems and methods for automatic generation of three-dimensional models from two-dimensional images

Country Status (1)

Country Link
WO (1) WO2024118464A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119295523A (en) * 2024-09-19 2025-01-10 哈尔滨工程大学 A monocular depth estimation method, device and medium based on a bidirectional state space model
US12354576B2 (en) 2023-08-09 2025-07-08 Futureverse Ip Limited Artificial intelligence music generation model and method for configuring the same
US12456250B1 (en) * 2024-11-14 2025-10-28 Futureverse Ip Limited System and method for reconstructing 3D scene data from 2D image data

Non-Patent Citations (19)

* Cited by examiner, † Cited by third party
Title
BARRON ET AL., MIP-NERF 360: UNBOUNDED ANTI-ALIASED NEURAL RADIANCE FIELDS, 2022
CHAN ET AL., EFFICIENT GEOMETRY-AWARE 3D GENERATIVE ADVERSARIAL NETWORKS, 2021
CHAN ET AL., PI-GAN: PERIODIC IMPLICIT GENERATIVE ADVERSARIAL NETWORKS FOR 3D-AWARE IMAGE SYNTHESIS, 2020
DENG ET AL.: "Imagenet: A large-scale hierarchical image database", IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2009
GOODFELLOW ET AL.: "Generative adversarial nets", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS, 2014
GU ET AL., STYLENERF: A STYLE-BASED 3D-AWARE GENERATOR FOR HIGH RESOLUTION IMAGE SYNTHESIS, 2021
JING YU KOH ET AL: "Simple and Effective Synthesis of Indoor 3D Scenes", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 6 April 2022 (2022-04-06), XP091200738 *
LI ZHENGQI ET AL: "InfiniteNature-Zero: Learning Perpetual View Generation of Natural Scenes from Single Images", 22 July 2022, SPRINGER INTERNATIONAL PUBLISHING, PAGE(S) 515 - 534, XP047637778 *
MILDENHALL ET AL., NERF: REPRESENTING SCENES AS NEURAL RADIANCE FIELDS FOR VIEW SYNTHESIS, 2020
NIEMEYER ET AL.: "Giraffe: Representing scenes as compositional generative neural feature fields", PROC. IEEE CONF. ON COMPUTER VISION AND PATTERN RECOGNITION, 2021
RANFTL ET AL., VISION TRANSFORMERS FOR DENSE PREDICTION, 2021
RANFTL ET AL.: "Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, vol. 44, no. 3, 2022
ROBIN ROMBACH ET AL: "Geometry-Free View Synthesis: Transformers and no 3D Priors", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 15 April 2021 (2021-04-15), XP081938961 *
SARGENT KYLE ET AL: "VQ3D: Learning a 3D-Aware Generative Model on ImageNet", 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), IEEE, 1 October 2023 (2023-10-01), pages 4217 - 4227, XP034515215, DOI: 10.1109/ICCV51070.2023.00391 *
XUE ET AL.: "GIRAFFE HD: A high-resolution 3D-aware generative model", CVPR, 2022
YANG ET AL., A LARGE-SCALE CAR DATASET FOR FINE-GRAINED CATEGORIZATION AND VERIFICATION, 2015
YU ET AL., VECTOR-QUANTIZED IMAGE MODELING WITH IMPROVED VQGAN
YU ET AL., VECTOR-QUANTIZED IMAGE MODELING WITH IMPROVED VQGAN, 2021
ZUOYUE LI ET AL: "CompNVS: Novel View Synthesis with Scene Completion", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 23 July 2022 (2022-07-23), XP091278568 *


Similar Documents

Publication Publication Date Title
Pittaluga et al. Revealing scenes by inverting structure from motion reconstructions
US20220239844A1 (en) Neural 3D Video Synthesis
WO2024118464A1 (en) Systems and methods for automatic generation of three-dimensional models from two-dimensional images
Sargent et al. Vq3d: Learning a 3d-aware generative model on imagenet
CN116205962B (en) Monocular depth estimation method and system based on complete context information
Li et al. A systematic survey of deep learning-based single-image super-resolution
US20240096001A1 (en) Geometry-Free Neural Scene Representations Through Novel-View Synthesis
US12333431B2 (en) Multi-dimensional generative framework for video generation
CN113191495A (en) Training method and device for hyper-resolution model and face recognition method and device, medium and electronic equipment
Hwang et al. LiDAR depth completion using color-embedded information via knowledge distillation
WO2024258661A1 (en) Autodecoding latent 3d diffusion models
Elharrouss et al. Transformer-based image and video inpainting: current challenges and future directions
Khan et al. Sparse to dense depth completion using a generative adversarial network with intelligent sampling strategies
Ye et al. GFSCompNet: remote sensing image compression network based on global feature-assisted segmentation
US20250166125A1 (en) Method of generating image and electronic device for performing the same
CN118747726B (en) Training method, related device and medium of image generation model
Chen et al. Linear-ResNet GAN-based anime style transfer of face images
Mahara et al. Generative adversarial model equipped with contrastive learning in map synthesis
CN115439610B (en) Training method and training device for model, electronic equipment and readable storage medium
Peng et al. Large-scale single-pixel imaging and sensing
Zhao et al. Image and Graphics: 10th International Conference, ICIG 2019, Beijing, China, August 23–25, 2019, Proceedings, Part III
CN116912148A (en) Image enhancement method, device, computer equipment and computer readable storage medium
Wei et al. FRGAN: A blind face restoration with generative adversarial networks
CN119067868B (en) Image processing method, device and equipment, medium and product
CN119941816B (en) A Monocular Depth Estimation Method and System Based on Text-Guided and Multi-Scale Fusion

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23829231

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 23829231

Country of ref document: EP

Kind code of ref document: A1