
EP4085427A2 - Neural rendering - Google Patents

Neural rendering

Info

Publication number
EP4085427A2
Authority
EP
European Patent Office
Prior art keywords
training
image
representation
machine learning
view
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP21704078.1A
Other languages
German (de)
English (en)
Inventor
Qi SHAN
Joshua M. Susskind
Aditya Sankar
Robert Alex Colburn
Emilien Dupont
Miguel Angel BAUTISTA MARTIN
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Apple Inc
Original Assignee
Apple Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US17/145,232 (US11967015B2)
Application filed by Apple Inc
Publication of EP4085427A2

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/003D [Three Dimensional] image rendering
    • G06T15/10Geometric effects
    • G06T15/20Perspective computation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0895Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00Manipulating 3D models or images for computer graphics
    • G06T19/20Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/60Rotation of whole images or parts thereof
    • G06T3/608Rotation of whole images or parts thereof by skew deformation, e.g. two-pass or three-pass rotation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2219/00Indexing scheme for manipulating 3D models or images for computer graphics
    • G06T2219/20Indexing scheme for editing of 3D models
    • G06T2219/2016Rotation, translation, scaling

Definitions

  • FIG.1 illustrates an example network environment in accordance with one or more implementations.
  • FIG.2 illustrates an example computing architecture for a system providing machine learning models trained based on equivariance in accordance with one or more implementations.
  • FIG.3 illustrates various input images that can be provided to machine learning models trained based on equivariance in accordance with one or more implementations.
  • FIG.4 illustrates a schematic diagram of a machine learning model in accordance with one or more implementations.
  • FIG.5 illustrates features of a model architecture of a machine learning model in accordance with one or more implementations.
  • FIG.6 illustrates a flow diagram of an example process for operating a trained machine learning model in accordance with one or more implementations.
  • FIG.7 illustrates output images that may be generated by a trained machine learning model based on various input images in accordance with one or more implementations.
  • FIG.8 illustrates additional output images that may be generated by a trained machine learning model based on various additional input images in accordance with one or more implementations.
  • FIG.9 illustrates further output images that may be generated by a trained machine learning model based on various further input images in accordance with one or more implementations.
  • FIG.10 illustrates various aspects of explicit three-dimensional representations and implicit three-dimensional representations of an object in accordance with one or more implementations.
  • FIG.11 illustrates a process for training a machine learning model in accordance with one or more implementations.
  • FIG.12 illustrates a flow diagram of an example process for training a machine learning model in accordance with one or more implementations.
  • FIG.13 illustrates examples of shear rotation operations in accordance with one or more implementations.
  • FIG.14 illustrates additional details of an example shear rotation operation in accordance with one or more implementations.
  • FIG.15 illustrates an electronic system with which one or more implementations of the subject technology may be implemented.
  • Machine learning may utilize models that are executed to provide predictions in particular applications (e.g., analyzing images and videos, object detection and/or tracking, etc.) among many other types of applications.
  • neural rendering approaches produce photorealistic renderings given noisy or incomplete 3D or 2D observations.
  • incomplete 3D inputs have been converted to rich scene representations using neural textures, which fill in and regularize noisy measurements.
  • conventional methods for neural rendering either require 3D information during training, complicated rendering priors, or expensive runtime decoding schemes.
  • the subject technology provides techniques for training a machine learning model to extract three-dimensional information from a two-dimensional image.
  • the machine learning model may be trained to render an output image of an object based on an input image of the object, the output image depicting a different view of the object than is depicted in the input image.
  • the trained machine learning model can provide an output image of the same mug as it would be viewed from the bottom of the mug, from the right side of the mug, or from any other view of the mug in three dimensions.
  • the trained machine learning model can generate these output images even though the input image does not contain depth information for the mug, and even though the machine learning model is not provided with any depth information regarding the input image.
  • the subject technology does not require expensive sequential decoding steps and enforces 3D structure through equivariance.
  • the subject technology can be trained using only images and their relative poses, and can therefore extend more readily to real scenes with minimal assumptions about geometry.
  • Traditional neural networks may not be equivariant with respect to general transformation groups. Equivariance for discrete rotations can be achieved by replicating and rotating filters.
  • neural networks that achieve equivariance are provided by treating a latent representation as a geometric 3D data structure and applying rotations directly to this representation.
  • an implicit neural representation is encoded into a latent 3D tensor.
  • neural rendering using flow estimation for view synthesis predicts a flow field over the input image(s) conditioned on a camera viewpoint transformation.
  • the machine learning model may be trained to output an explicit representation of the mug in three dimensions in addition to, or in place of, a two-dimensional output image of the mug.
  • An explicit representation of the mug in three dimensions can be a point cloud, a mesh, or a voxel grid (as examples) that can be rendered so as to be recognizable as the object to a human viewer, and that can be manipulated (e.g., rotated, translated, re-sized, etc.) in three dimensions.
  • Implementations of the subject technology improve the computing functionality of a given electronic device by providing an equivariance constraint that, when applied during training of a machine learning model, allows the model to be (i) trained without 3D supervision, (ii) tested without providing pose information as input to the model, and/or (iii) operated to generate an implicit representation (also referred to herein as a “scene representation”) of a three-dimensional object from a single two-dimensional image of the object in a single forward pass.
  • Prior approaches may require an expensive optimization procedure to extract three-dimensional information from an image or a set of images, and typically may require 3D supervision and/or input pose information during training and/or at runtime.
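Equivariance, as used above, means that transforming an input and then applying a map gives the same result as applying the map and then transforming its output. The following is a minimal sketch of that property on a toy 2D grid, using a 90-degree rotation as the transformation. The helper names (`rot90`, `double_values`, `shift_rows`, `is_equivariant`) are hypothetical and are not taken from the disclosure, which applies the idea to learned three-dimensional representations.

```python
def rot90(grid):
    """Rotate a square grid (list of lists) 90 degrees counter-clockwise."""
    n = len(grid)
    return [[grid[c][n - 1 - r] for c in range(n)] for r in range(n)]


def double_values(grid):
    """A pointwise map: it commutes with rotation, so it is equivariant."""
    return [[2 * v for v in row] for row in grid]


def shift_rows(grid):
    """Cyclically shift every row right by one: not equivariant to rot90."""
    return [row[-1:] + row[:-1] for row in grid]


def is_equivariant(fn, transform, x):
    """Check fn(transform(x)) == transform(fn(x)) for one input x."""
    return fn(transform(x)) == transform(fn(x))


grid = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(is_equivariant(double_values, rot90, grid))
print(is_equivariant(shift_rows, rot90, grid))
```

The check covers a single input and a single transformation; an equivariance constraint used for training would instead be expressed as a loss over many pairs of views.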
  • FIG.1 illustrates an example network environment 100 in accordance with one or more implementations.
  • the network environment 100 includes an electronic device 110, and a server 120.
  • the network 106 may communicatively (directly or indirectly) couple the electronic device 110 and/or the server 120.
  • the network 106 may be an interconnected network of devices that may include, or may be communicatively coupled to, the Internet.
  • the network environment 100 is illustrated in FIG.
  • the electronic device 110 may be, for example, a desktop computer, a portable computing device such as a laptop computer, a smartphone, a peripheral device (e.g., a digital camera, headphones), a tablet device, a wearable device such as a watch, a band, and the like.
  • a mobile electronic device e.g., smartphone
  • the electronic device 110 may be, and/or may include all or part of, the electronic system discussed below with respect to FIG.15.
  • the electronic device 110 may provide a system for training a machine learning model using training data, where the trained machine learning model is subsequently deployed to the electronic device 110. Further, the electronic device 110 may provide one or more machine learning frameworks for training machine learning models and/or developing applications using such machine learning models. In an example, such machine learning frameworks can provide various machine learning algorithms and models for different problem domains in machine learning. In an example, the electronic device 110 may include a deployed machine learning model that provides an output of data corresponding to a prediction or some other type of machine learning output.
  • The server 120 may provide a system for training a machine learning model using training data, where the trained machine learning model is subsequently deployed to the server 120.
  • the server 120 may train a given machine learning model for deployment to a client electronic device (e.g., the electronic device 110).
  • the machine learning model deployed on the server 120 and/or the electronic device 110 can then perform one or more machine learning algorithms.
  • the server 120 provides a cloud service that utilizes the trained machine learning model and continually learns over time.
  • FIG.2 illustrates an example computing architecture for a system providing an equivariance constraint for machine learning models, in accordance with one or more implementations.
  • the computing architecture is described as being provided by the server 120, such as by a processor and/or memory of the server 120; however, the computing architecture may be implemented by any other electronic devices, such as the electronic device 110.
  • the server 120 includes training data 210 for training a machine learning model.
  • the server 120 may utilize one or more machine learning algorithms that use training data 210 for training a machine learning (ML) model 220.
  • ML model 220 may be trained based on at least two training images in training data 210, the two training images depicting different views of a training object.
  • ML model 220 may be trained using an equivariance constraint.
  • Training data 210 may include two-dimensional images of various objects, each image depicting one or more of the objects from a particular view.
  • the images may include sets of images of a particular object from various views that are rotated, translated, scaled, and/or otherwise different relative to the views depicted in the other image(s) of the particular object.
  • FIG.3 illustrates several sets of example training images that can be used for training ML model 220. Neural rendering and scene representation models are usually tested and benchmarked on a ShapeNet dataset.
  • Training data 210 may include three new datasets of posed images which can be used to train and/or test models with complex visual effects.
  • the first dataset, referred to herein as MugsHQ, is composed of photorealistic renders of colored mugs on a table with an ambient background.
  • the second dataset, referred to herein as the 3D mountains dataset, contains renders of more than five hundred mountains in the Alps using satellite and topography data.
  • training data 210 includes a first set of training images 300, a second set of training images 304, and a third set of training images 308.
  • the first set of training images 300 in this example includes multiple images 302, each including a particular view of a particular training object 303 (e.g., a mug).
  • Training data 210 may include several training images 300, from several views, of each of several mugs.
  • the first set of training images 300 may be referred to as the MugsHQ dataset and may be based on the mugs class from the ShapeNet dataset.
  • every scene is rendered with an environment map (e.g., lighting conditions) and a checkerboard disk platform 321.
  • the first set of training images 300 may include, for example, for each of two hundred fourteen mugs, one hundred and fifty viewpoints uniformly over the upper hemisphere.
  • the environment map and disk platform are the same for every mug. In this way, the scenes in each of the images 302 of the first set of training images 300 are much more complex and look more realistic than typical ShapeNet renders.
  • Although the MugsHQ dataset (e.g., first set of training images 300) contains photorealistic renders and complex background and lighting, the background scene is the same for every object.
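As an illustration of how viewpoints distributed uniformly over an upper hemisphere could be generated, the sketch below draws random camera positions on a unit hemisphere. It assumes random (rather than regularly spaced) sampling, which the text does not specify, and the function name `sample_upper_hemisphere` and its parameters are hypothetical. It relies on the standard fact that sampling the height z uniformly yields a uniform distribution over the sphere's surface area.

```python
import math
import random


def sample_upper_hemisphere(n, radius=1.0, seed=0):
    """Return n camera positions (x, y, z) drawn uniformly from the upper hemisphere."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    views = []
    for _ in range(n):
        azimuth = rng.uniform(0.0, 2.0 * math.pi)
        z = rng.uniform(0.0, 1.0)  # uniform height gives uniform surface area
        r_xy = math.sqrt(1.0 - z * z)  # radius of the horizontal circle at height z
        views.append((radius * r_xy * math.cos(azimuth),
                      radius * r_xy * math.sin(azimuth),
                      radius * z))
    return views


# For example, one hundred and fifty viewpoints for a single object.
viewpoints = sample_upper_hemisphere(150)
print(len(viewpoints))
```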
  • ML model 220 can also be trained and/or tested using the second set of training images 304, in which the depicted training objects 307 are mountains.
  • the second set of training images 304 may be a dataset of mountain landscapes where each scene shares no common structure with the others.
  • the second set of training images 304 may be generated based on the height, latitude and longitude of, for example, the five hundred sixty-three highest mountains in, for example, the Alps. Satellite images combined with topography data can be used to sample random views of each mountain at a fixed height for the second set of training images 304.
  • the second set of training images 304 in this example includes multiple training images 306, each including a particular view of a training object 307.
  • the training objects 307 depicted in training images 306 may be a different class of training objects (e.g., mountains) from training objects 303.
  • Training data 210 may include several training images 304, from several views, of each of several mountains.
  • the third set of training images 308 may be a dataset of real images (e.g., images of physical objects such as succulents, from several views, such as several viewing angles, distances, and/or positions).
  • the third set of training images 308 in this example consists of images of succulent plants observed from different views around a table (e.g., views varying the azimuth but keeping elevation constant).
  • the lighting and background in images 310 of the third set of training images 308 is approximately constant for all the scenes in the images, and there is some noise in the azimuth and elevation measurements.
  • the third set of training images 308 may include, for example, twenty distinct succulents, and, for example, sixteen views of each succulent. Some samples from the dataset are shown in FIG.3.
  • the third set of training images 308 in this example includes multiple images 310, each including a particular view of a particular training object 311.
  • the training objects 311 depicted in images 310 may be a different class of training objects (e.g., succulents) from training objects 303 and training objects 307.
  • the first set of training images 300, the second set of training images 304, and the third set of training images 308 provide three new challenging datasets that can be used to train ML model 220 and to test representations and neural rendering for complex, natural scenes. Compelling rendering results are shown for each dataset, highlighting the versatility of the disclosed system and methods.
  • Designing useful 3D scene representations for neural networks is a challenging task.
  • in the subject disclosure, a model includes an inverse renderer mapping an image to a neural scene representation and a forward neural renderer generating outputs such as images from representations.
  • the scene representations themselves can be three-dimensional tensors which can undergo the same transformations as an explicit 3D scene.
  • specific examples focus on 3D rotations, although the model can be generalized to other symmetry transformations such as translation and scaling.
  • FIG.4 schematically illustrates components that may be included in machine learning model 220.
  • machine learning model 220 may include an inverse renderer 402 and a forward renderer 406.
  • An adjustment module 404 may be provided between the inverse renderer 402 and the forward renderer 406.
  • Inverse renderer 402 is trained to generate an implicit representation 408 (also referred to as a scene representation) of an object from an input image 400 of the object (e.g., a two-dimensional input image) from a particular view.
  • Forward renderer 406 generates an output 412 based on the implicit representation 408 from the inverse renderer 402.
  • Output 412 may be, for example, an output two-dimensional image of the object from a different view that is rotated, translated, and/or scaled relative to the particular view of the input image 400.
  • the output may, as another example, include a three-dimensional representation of the object.
  • the three-dimensional representation of the object may be a mesh, a point cloud, or a voxel grid that would be visually recognizable as the object to a human user (e.g., if the explicit representation were to be rendered on a computer display such as a display of electronic device 110).
  • the ML model 220 may generate, based on the provided input image 400, at least one of an output image that depicts the object from a rotated view that is different from the view of the object in the input image 400, or a three-dimensional representation of the object.
  • the implicit representation 408 generated by inverse renderer 402 may be provided to adjustment module 404 before implicit representation 408 is provided to forward renderer 406.
  • Adjustment module 404 may adjust implicit representation 408 by rotating, translating, and/or scaling implicit representation 408 to generate an adjusted implicit representation 410.
  • adjustment module 404 may be a rotation module that rotates implicit representation 408.
  • Adjustment module 404 may, for example, be a shear rotation module.
  • a shear rotation module can be particularly helpful for facilitating machine learning models based on rotational equivariance.
  • the model can infer a scene representation, transform it and render it (see, e.g., FIGS.7-9 discussed hereinafter). Further, the model can infer scene representations in a single forward pass, in contrast with conventional scene representation algorithms that may require a computationally and/or resource expensive optimization procedure to extract scene representations from an image or a set of images.
  • the subject technology introduces a framework for learning scene representations and novel view synthesis without explicit 3D supervision, by enforcing equivariance between the change in viewpoint and change in the latent representation of a scene.
  • FIG.5 illustrates additional details of a model architecture that can be used for ML model 220.
  • the model architecture may be fully differentiable, including the shear rotation operations discussed in further detail hereinafter.
  • Providing a fully differentiable model architecture facilitates achieving a model that can be trained with back-propagation to update all learnable parameters in the neural network.
  • an input image 400 (e.g., an image depicting a car) is passed through one or more 2D convolutions 500, an inverse projection 502, and a set of 3D convolutions 504 (e.g., by inverse renderer 402) to generate implicit representation 408.
  • the inferred scene (e.g., implicit representation 408) is then rendered (e.g., by forward renderer 406) through a transpose 504’ of the 3D convolutions, a forward projection 506, and a transpose 500’ of the one or more 2D convolutions to render an output 412 (e.g., an output image) that, in this example, is a copy of the input image 400.
  • the implicit representation 408 is provided to the forward renderer 406 without rotation, resulting in an output 412 (e.g., an output image), that, in this example, is a copy of the input image 400.
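The data flow described for FIGS.4 and 5 (inverse renderer, optional adjustment, forward renderer) can be sketched as a simple composition of three callables. This is a structural sketch only: the function `render_novel_view` and the toy stand-ins below are hypothetical, the stand-ins operate on a 2D grid rather than a learned 3D tensor, and the in-plane 90-degree rotation merely stands in for the rotation applied by the adjustment module.

```python
def render_novel_view(image, inverse_renderer, adjust, forward_renderer):
    """Image -> scene representation -> adjusted representation -> output."""
    scene = inverse_renderer(image)      # analogous to inverse renderer 402
    adjusted = adjust(scene)             # analogous to adjustment module 404
    return forward_renderer(adjusted)    # analogous to forward renderer 406


# Toy stand-ins (assumptions for illustration, not trained networks).
def toy_inverse_renderer(image):
    return [row[:] for row in image]


def toy_forward_renderer(scene):
    return [row[:] for row in scene]


def identity(scene):
    return scene


def rot90(scene):
    """Rotate a square grid 90 degrees counter-clockwise."""
    n = len(scene)
    return [[scene[c][n - 1 - r] for c in range(n)] for r in range(n)]


image = [[1, 2], [3, 4]]
# Without rotation, the output reproduces the input image.
same_view = render_novel_view(image, toy_inverse_renderer, identity, toy_forward_renderer)
# With a rotation applied to the representation, the output shows a rotated view.
new_view = render_novel_view(image, toy_inverse_renderer, rot90, toy_forward_renderer)
print(same_view, new_view)
```

Keeping the adjustment as a separate step between the two renderers is what allows the same pair of renderers to produce either a copy of the input or a different view.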
  • FIG.6 illustrates a flow diagram of an example process for generating an output using a machine learning model in accordance with one or more implementations.
  • the process 600 is primarily described herein with reference to the server 120 of FIG.1.
  • the process 600 is not limited to the server 120 of FIG.1, and one or more blocks (or operations) of the process 600 may be performed by one or more other components of the server 120 and/or by other suitable devices such as electronic device 110.
  • the blocks of the process 600 are described herein as occurring in serial, or linearly. However, multiple blocks of the process 600 may occur in parallel.
  • server 120 provides an input image, such as input image 400 of FIG.4, depicting a view of an object to a machine learning model such as ML model 220 that has been trained based on a constraint of equivariance under rotations between a training object (e.g., one or more of the training objects 303, 307, and 311 of FIG.3) and a model-generated representation (e.g., an implicit representation) of the training object.
  • the machine learning model may include an inverse renderer such as inverse renderer 402 and a forward renderer such as forward renderer 406.
  • server 120 generates, using the machine learning model and based on the provided image, at least one of an output image that depicts the object from a rotated view that is different from the view of the object in the image, or a three-dimensional representation of the object.
  • the three-dimensional representation may include an explicit three-dimensional representation including at least one of a voxel grid, a mesh or a point cloud.
  • Generating, at block 604, the at least one of the output image that depicts the object from the rotated view that is different from the view of the object in the image or the three-dimensional representation of the object may include generating the at least one of the output image that depicts the object from the rotated view that is different from the view of the object in the image or the three-dimensional representation of the object with the forward renderer.
  • the generating operations of block 604 may also include generating an implicit representation, such as implicit representation 408, of the object with the inverse renderer based on the input image.
  • the forward renderer may generate the at least one of the output image that depicts the object from the rotated view that is different from the view of the object in the image or the three-dimensional representation of the object based on the implicit representation generated by the inverse renderer.
  • Generating, at block 604, the at least one of the output image that depicts the object from the rotated view that is different from the view of the object in the image or the three-dimensional representation of the object based on the implicit representation may include rotating the implicit representation of the object.
  • the machine learning model may include an adjustment module such as adjustment module 404 for performing the rotation of the implicit representation.
  • Rotating the implicit representation of the object may include performing a shear rotation of the implicit representation of the object.
  • the implicit representation of the object may be, for example, a tensor or a latent space of an autoencoder.
  • Adjustment module 404 may provide a new differentiable layer, for performing invertible shear rotations, which allows for the neural network to learn equivariant representations.
  • the subject technology shows that naive tensor rotations are not able to achieve equivariance, and introduces an invertible shearing operation that addresses this limitation within a differentiable neural architecture.
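To illustrate why a shear-based rotation can be exactly invertible on a discrete grid, the sketch below uses the standard three-shear decomposition of a 2D rotation, R(θ) = Sx(−tan(θ/2)) · Sy(sin θ) · Sx(−tan(θ/2)), with each shear rounded to whole grid steps. Each rounded shear only slides rows or columns by integer amounts, so it can be undone exactly. This is a 2D sketch under stated assumptions, not the disclosed layer: the function names are hypothetical, and how the operation is applied to a 3D latent tensor is not shown here.

```python
import math


def shear_x(point, a):
    """Slide each row horizontally by a whole number of cells."""
    x, y = point
    return (x + round(a * y), y)


def unshear_x(point, a):
    """Exact inverse of shear_x (y is unchanged, so the same offset is subtracted)."""
    x, y = point
    return (x - round(a * y), y)


def shear_y(point, b):
    """Slide each column vertically by a whole number of cells."""
    x, y = point
    return (x, y + round(b * x))


def unshear_y(point, b):
    """Exact inverse of shear_y."""
    x, y = point
    return (x, y - round(b * x))


def shear_rotate(point, theta):
    """Approximate rotation by theta on an integer grid using three shears."""
    a, b = -math.tan(theta / 2.0), math.sin(theta)
    return shear_x(shear_y(shear_x(point, a), b), a)


def shear_rotate_inverse(point, theta):
    """Undo shear_rotate exactly by inverting the shears in reverse order."""
    a, b = -math.tan(theta / 2.0), math.sin(theta)
    return unshear_x(unshear_y(unshear_x(point, a), b), a)


def three_shear_matrix(theta):
    """Product of the three unrounded shear matrices; equals the rotation matrix."""
    a, b = -math.tan(theta / 2.0), math.sin(theta)
    sx, sy = [[1.0, a], [0.0, 1.0]], [[1.0, 0.0], [b, 1.0]]

    def mul(p, q):
        return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
                for i in range(2)]

    return mul(sx, mul(sy, sx))


theta = math.radians(30)
p = (4, -2)
print(shear_rotate_inverse(shear_rotate(p, theta), theta) == p)
```

By contrast, a rotation that resamples grid values by interpolation generally cannot be undone exactly, which is consistent with the limitation of naive tensor rotations noted above.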
  • FIGS.7, 8, and 9 illustrate various example outputs of a ML model such as ML model 220 trained based on a constraint of equivariance under rotations between a training object and a model-generated representation of the training object.
  • the outputs are output images that each depicts the object shown in an input image from a rotated view that is different from the view of the object in the image.
  • FIG.7 illustrates two input images 400-1 and 400-2 respectively depicting two different views of two different mugs.
  • FIG.7 also shows five output images generated by ML model 220 based on each of the two input images 400-1 and 400-2.
  • For input image 400-1 five output images 800-1, 800-2, 800-3, 800-4, and 800-5 are shown, each depicting the mug in input image 400-1, but from a different view than the view depicted in input image 400-1.
  • For input image 400-2 five output images 802-1, 802-2, 802-3, 802-4, and 802-5 are shown, each depicting the mug in input image 400-2, but from a different view than the view depicted in input image 400-2.
  • The results in FIG.7 for single-shot novel view synthesis show that ML model 220 successfully infers the shape of previously unseen mugs from a single image and is able to perform large viewpoint transformations around the scene.
  • FIG.8 illustrates three input images 400-3, 400-4, and 400-5 respectively depicting three different views of three different mountains.
  • FIG.8 also shows five output images generated by ML model 220 based on each of the three input images 400-3, 400-4, and 400-5.
  • For input image 400-3 five output images 803-1, 803-2, 803-3, 803-4, and 803-5 are shown, each depicting the mountain in input image 400-3, but from a different view than the view depicted in input image 400-3.
  • For input image 400-4 five output images 804-1, 804-2, 804-3, 804-4, and 804-5 are shown, each depicting the mountain in input image 400-4, but from a different view than the view depicted in input image 400-4.
  • For input image 400-5 five output images 805-1, 805-2, 805-3, 805-4, and 805-5 are shown, each depicting the mountain in input image 400-5, but from a different view than the view depicted in input image 400-5.
  • FIG.9 illustrates three input images 400-6, 400-7, and 400-8 respectively depicting three different views of three different plants (e.g., succulent plants).
  • FIG.9 also shows five output images generated by ML model 220 based on each of the three input images 400-6, 400-7, and 400-8.
  • For input image 400-6 five output images 806-1, 806-2, 806-3, 806-4, and 806-5 are shown, each depicting the succulent in input image 400-6, but from a different view than the view depicted in input image 400-6.
  • For input image 400-7 five output images 807-1, 807-2, 807-3, 807-4, and 807-5 are shown, each depicting the succulent in input image 400-7, but from a different view than the view depicted in input image 400-7.
  • For input image 400-8 five output images 808-1, 808-2, 808-3, 808-4, and 808-5 are shown, each depicting the succulent in input image 400-8, but from a different view than the view depicted in input image 400-8.
  • ML model 220 is able to generate plausible other views of plants depicted in an input image, and in particular creates remarkably consistent shadows. Additional plant input training images can be used to further train ML model 220 to learn a strong prior over the shapes of plants to reduce or avoid the network overfitting the input data at runtime.
  • different ML models can be trained using training images of different categories of training objects (e.g., mugs, mountains, plants, etc.) to perform neural rendering for input images of objects in that category, or a single ML model can be trained using training images of various categories of objects to train the single ML model to perform neural rendering for input images of substantially any object or scene.
  • FIG.10 illustrates a three-dimensional representation 1000 of an object (e.g., a rabbit) that may be generated by the trained ML model 220 based on a depiction of the object in a two-dimensional input image.
  • three-dimensional representation 1000 is an explicit three-dimensional representation (e.g., a mesh) that is recognizable to a human viewer as a rabbit and/or that can be rendered into a form that is recognizable to a human viewer as a rabbit.
  • the three-dimensional representation generated by ML model 220 can be rendered as a point cloud, a voxel grid, or any other three-dimensional representation.
  • FIG.10 also illustrates an implicit representation 408 of the same object (e.g., the rabbit) that is not recognizable to a human viewer, but that is rotationally equivariant with the explicit three-dimensional representation 1000.
  • explicit three-dimensional representation 1000 can be viewed from a view 1004 from a location 1006 on a sphere 1003 (e.g., for rendering an output image such as the output images shown in FIGS.7, 8, and 9), or from views associated with other locations such as locations 1008 and 1010 on sphere 1003.
  • Implicit representation 408 can also be viewed from any of the locations 1006, 1008, 1010 on the sphere 1003 (or any other suitable location).
  • FIG.11 illustrates a training operation for ML model 220 using an equivariance constraint.
  • training an ML model using an equivariance constraint can be performed using two input training images 1100 and 1102 depicting the same input training object 1101 from two different views (e.g., rotated views).
  • training ML model 220 based on a constraint of equivariance under rotations between the training object 1101 and a model-generated representation of the training object may include providing a first input training image 1100 (also referred to as $x_1$) depicting a first view of the training object 1101 to the machine learning model, and providing a second input training image 1102 (also referred to as $x_2$) depicting a second view of the training object 1101 to the machine learning model.
  • the training may also include generating a first implicit representation 1104 (also referred to as $z_1$) of the training object 1101 based on the first input training image 1100 (e.g., in an operation $f(x_1)$) and generating a second implicit representation 1106 (also referred to as $z_2$) of the training object 1101 based on the second input training image 1102 (e.g., in an operation $f(x_2)$).
  • the training may also include rotating the first implicit representation 1104 of the training object 1101 as indicated by arrow 1107 (e.g., to form a rotated first implicit representation 1108, also referred to as $\tilde{z}_1$) and rotating the second implicit representation 1106 of the training object 1101 as indicated by arrow 1109 (e.g., to form a rotated second implicit representation 1110, also referred to as $\tilde{z}_2$).
  • the training may also include generating a first output training image 1116 (also referred to as $g(\tilde{z}_1)$) based on the rotated first implicit representation 1108 of the training object 1101 and generating a second output training image 1114 (also referred to as $g(\tilde{z}_2)$) based on the rotated second implicit representation 1110 of the training object 1101.
  • FIG.12 illustrates a flow diagram of an example process for training a model such as ML model 220 based on the constraint of equivariance under rotations between the training object and the model-generated representation in accordance with one or more implementations.
  • the process 1200 is primarily described herein with reference to the server 120 of FIG.1. However, the process 1200 is not limited to the server 120 of FIG.1, and one or more blocks (or operations) of the process 1200 may be performed by one or more other components of the server 120 and/or by other suitable devices such as electronic device 110.
  • server 120 may provide a first input training image such as input training image 1100 of FIG.11, depicting a first view of a training object such as training object 1101 to the machine learning model.
  • server 120 may provide a second input training image such as input training image 1102 of FIG.11, depicting a second view of a training object such as training object 1101 to the machine learning model.
  • server 120 may generate a first implicit representation, such as implicit representation 1104, of the training object based on the first input training image.
  • server 120 may generate a second implicit representation, such as implicit representation 1106, of the training object based on the second input training image.
  • server 120 may rotate the first implicit representation of the training object (e.g., to form a rotated first implicit representation 1108).
  • server 120 may rotate the second implicit representation of the training object (e.g., to form a rotated second implicit representation 1110).
  • server 120 may generate a first output training image, such as output training image 1116, based on the rotated first implicit representation 1108 of the training object.
  • At block 1216, server 120 may generate a second output training image, such as output training image 1114, based on the rotated second implicit representation 1110 of the training object.
  • At block 1218, server 120 may compare the first input training image to the second output training image.
  • At block 1220, server 120 may compare the second input training image to the first output training image.
  • the training may include minimizing a loss function based on the comparison of the first input training image to the second output training image and the comparison of the second input training image to the first output training image.
  • the framework for ML model 220 may be composed of two models: an inverse renderer $f: X \to Z$ (see also, inverse renderer 402 of FIG.4) that maps an image $x \in X$ into a scene representation $z \in Z$, and a forward renderer $g: Z \to X$ (see also, forward renderer 406 of FIG.4) that maps scene representations to images.
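  • Equivariance of the inverse renderer (or encoder) f and forward renderer (or decoder) g is then given by Equation (1), where $R^X_\theta$ denotes a camera viewpoint change by an angle $\theta$ in image space and $R^Z_\theta$ denotes the equivalent rotation in scene space:
    $$\begin{aligned} R^Z_\theta f(x) &= f(R^X_\theta x) \\ R^X_\theta g(z) &= g(R^Z_\theta z) \end{aligned} \tag{1}$$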
  • the top equation in Equation (1) implies that if a camera viewpoint change is performed in image space, the scene representation encoded by f should undergo an equivalent rotation.
  • the second equation implies that if the scene representation is rotated, the images rendered by g should undergo an equivalent rotation.
  • the server 120 (e.g., adjustment module 404) then rotates each encoded representation (e.g., first and second implicit representations 1104 and 1106, written $z_1$ and $z_2$) by its relative transformation, such that $\tilde{z}_1 = R^Z_\theta z_1$ and $\tilde{z}_2 = (R^Z_\theta)^{-1} z_2$, where $R^Z_\theta$ is the scene-space rotation that takes the first view to the second view.
  • because $z_1$ and $z_2$ represent the same scene in different poses, the rotated $\tilde{z}_1$ can be expected to be rendered as the image $x_2$ and the rotated $\tilde{z}_2$ as $x_1$, as is illustrated in FIG.11.
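  • the training can then ensure that the model obeys these transformations by minimizing a rendering loss function $\mathcal{L}_{\text{render}}$, where:
    $$\mathcal{L}_{\text{render}} = \lVert x_2 - g(\tilde{z}_1) \rVert + \lVert x_1 - g(\tilde{z}_2) \rVert$$
    with $\tilde{z}_1$ and $\tilde{z}_2$ denoting the rotated first and second implicit representations.
  • As the second input training image $x_2$ is the first input training image $x_1$ under the camera viewpoint change, minimizing this loss then corresponds to satisfying the equivariance property for the forward renderer g. While this loss enforces equivariance of g, in practice it has been discovered that this does not in general enforce equivariance of the inverse renderer f.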
  • training the machine learning model can also include comparing the first implicit representation 1104 to the rotated second implicit representation 1110, and comparing the second implicit representation 1106 to the rotated first implicit representation 1108.
  • the loss function can be further based on the comparison of the first implicit representation to the rotated second implicit representation and the comparison of the second implicit representation to the rotated first implicit representation.
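  • a loss function that enforces equivariance of the inverse renderer with respect to rotations can be defined as:
    $$\mathcal{L}_{\text{scene}} = \lVert z_2 - \tilde{z}_1 \rVert + \lVert z_1 - \tilde{z}_2 \rVert$$
    where $z_1$ and $z_2$ are the first and second implicit representations and $\tilde{z}_1$ and $\tilde{z}_2$ are their rotated counterparts.
  • the total loss function can be a weighted sum of the rendering loss $\mathcal{L}_{\text{render}}$ and the scene loss $\mathcal{L}_{\text{scene}}$, ensuring that both the inverse and forward renderer in the trained machine learning model 220 are equivariant with respect to rotations of the viewpoint or camera.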
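  • The two comparisons described above (output images against the opposite input images, and each implicit representation against the rotated other one) can be sketched as a single loss computation. This is a minimal illustration, not the disclosed implementation: the L1 distance, the `weight` parameter, the toy identity encoder/decoder, and the use of a 90 degree grid rotation as the relative transformation are all assumptions chosen so the example runs with the standard library alone.

```python
def rot90(grid):
    """Rotate a square 2D grid (list of rows) by 90 degrees counter-clockwise."""
    n = len(grid)
    return [[grid[c][n - 1 - r] for c in range(n)] for r in range(n)]


def rot90_inv(grid):
    """Inverse of rot90: three further quarter turns."""
    for _ in range(3):
        grid = rot90(grid)
    return grid


def l1(a, b):
    """Sum of absolute differences between two grids (assumed distance)."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def total_loss(f, g, x1, x2, rotate, rotate_inv, weight=1.0):
    """Weighted sum of a rendering term and a scene-representation term."""
    z1, z2 = f(x1), f(x2)        # implicit representations of both views
    z1_rot = rotate(z1)          # should describe the scene as seen in x2
    z2_rot = rotate_inv(z2)      # should describe the scene as seen in x1
    loss_render = l1(x2, g(z1_rot)) + l1(x1, g(z2_rot))
    loss_scene = l1(z2, z1_rot) + l1(z1, z2_rot)
    return loss_render + weight * loss_scene


# Toy data: the second view is the first view rotated by a quarter turn.
x1 = [[1, 2], [3, 4]]
x2 = rot90(x1)

identity = lambda grid: grid                      # trivially equivariant f and g
flip = lambda grid: [row[::-1] for row in grid]   # an encoder that is not equivariant

loss_equivariant = total_loss(identity, identity, x1, x2, rot90, rot90_inv)
loss_broken = total_loss(flip, identity, x1, x2, rot90, rot90_inv)
```

  • In this toy setting the identity encoder and decoder satisfy both equivariance conditions, so the loss vanishes, while the flipping encoder is penalized.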
  • any or all of the operations described above in connection with blocks 1202-1220 for training the model may be performed based on at least two images without three-dimensional supervision of the training.
  • the trained machine learning model may be tested without providing pose information to the trained machine learning model.
  • It has been discovered that defining the rotation operation in scene space is particularly helpful. Indeed, naive tensor rotations are ill suited for this task due to spatial aliasing; that is, rotating points on a discrete grid generally results in the rotated points not aligning with the grid, requiring some form of sampling to reconstruct their values.
  • To illustrate this point, the following describes rotations of 2D images (since the effects for 3D rotations of tensors are the same).
  • an image can be rotated by an angle $\theta$ and then the resulting image can be rotated by an angle $-\theta$ (sampling with bilinear interpolation to obtain the values at the grid points). If rotations on the grid were invertible, the final image would be exactly the same as the original image. To test whether this holds in practice, one thousand images were sampled from the CIFAR10 dataset, each was rotated back and forth by every angle in [0, 360] degrees, and the error was recorded. In this exemplary scenario, the mean pixel value error is on the order of 3%, which is significant.
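  • The back-and-forth experiment can be illustrated on a small scale. The sketch below is not the CIFAR10 experiment and does not reproduce the 3% figure; it rotates a hypothetical 8 × 8 synthetic image with a hand-written bilinear sampler (zero padding outside the grid is an assumption) and measures the mean absolute round-trip error.

```python
import math


def rotate_bilinear(img, theta_deg):
    """Rotate a square image about its center, sampling with bilinear interpolation."""
    n = len(img)
    c = (n - 1) / 2.0
    t = math.radians(theta_deg)
    cos_t, sin_t = math.cos(t), math.sin(t)
    out = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for col in range(n):
            # Map the output pixel back to a (generally off-grid) source location.
            x, y = col - c, r - c
            sx = cos_t * x + sin_t * y + c
            sy = -sin_t * x + cos_t * y + c
            x0, y0 = math.floor(sx), math.floor(sy)
            fx, fy = sx - x0, sy - y0
            val = 0.0
            for dy, wy in ((0, 1 - fy), (1, fy)):
                for dx, wx in ((0, 1 - fx), (1, fx)):
                    yy, xx = y0 + dy, x0 + dx
                    if 0 <= yy < n and 0 <= xx < n:  # zero padding outside the grid
                        val += wy * wx * img[yy][xx]
            out[r][col] = val
    return out


def round_trip_error(img, theta_deg):
    """Mean absolute pixel error after rotating by theta and then by -theta."""
    back = rotate_bilinear(rotate_bilinear(img, theta_deg), -theta_deg)
    n = len(img)
    return sum(abs(back[r][c] - img[r][c]) for r in range(n) for c in range(n)) / (n * n)


# Hypothetical test image with values in [0, 1].
image = [[((7 * r + 13 * c) % 11) / 10.0 for c in range(8)] for r in range(8)]

error_at_30_degrees = round_trip_error(image, 30)
error_at_0_degrees = round_trip_error(image, 0)
```

  • A rotation by zero degrees leaves the image unchanged, whereas any rotation that moves points off the grid leaves a residual error after the round trip, which is the non-invertibility the passage describes.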
  • shear rotations can be used to define invertible tensor rotations that can be used in neural networks.
  • Rotating an image corresponds to rotating pixel values at given (x, y) coordinates in the image by applying a rotation matrix to the coordinate vector.
  • Shear rotations instead rotate images by performing a sequence of shearing operations.
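  • the rotation matrix can be factorized as:
    $$\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} = \begin{pmatrix} 1 & -\tan(\theta/2) \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ \sin\theta & 1 \end{pmatrix} \begin{pmatrix} 1 & -\tan(\theta/2) \\ 0 & 1 \end{pmatrix}, \tag{5}$$
    so the rotation is performed with three shearing operations as opposed to a single matrix multiplication.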
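  • The three-shear factorization of a 2D rotation can be checked numerically. The sketch below assumes the standard form of that factorization (two horizontal shears by $-\tan(\theta/2)$ around one vertical shear by $\sin\theta$); the function names are illustrative only.

```python
import math


def matmul2(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def shear_factors(theta):
    """Return the three shear matrices whose product is a rotation by theta (radians)."""
    t = -math.tan(theta / 2.0)
    shear_x = [[1.0, t], [0.0, 1.0]]                 # shifts x in proportion to y
    shear_y = [[1.0, 0.0], [math.sin(theta), 1.0]]   # shifts y in proportion to x
    return shear_x, shear_y, shear_x


def rotation_from_shears(theta):
    first, second, third = shear_factors(theta)
    return matmul2(matmul2(first, second), third)


theta = 0.3
product = rotation_from_shears(theta)
expected = [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]
max_abs_diff = max(abs(product[i][j] - expected[i][j])
                   for i in range(2) for j in range(2))
```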
  • an input image 1304 can be sheared three times to obtain a rotated image 1312.
  • FIG.13 also illustrates how a nearest neighbor shear rotation operation 1302, which rotates an input image 1314 to a rotated output image 1322 with three nearest neighbor shear operations, can allow adjustment module 404 to perform invertible rotations on a grid.
  • although the shear operations themselves will not align with the grid coordinates and so also require a form of interpolation, the following shows how these operations can be made invertible by using a nearest neighbor approach (e.g., with adjustment module 404).
  • Applying a shear transformation involves shifting either columns or rows of the image but not both. Therefore, for each shifted point there is a unique nearest neighbor on the grid. In contrast, for regular rotations, two shifted points may get mapped to the same grid point by a nearest neighbor operation.
  • FIG.14 illustrates a regular rotation 1400 of an image 1401 to a rotated image 1406 in which two shifted points 1410 and 1414 can get mapped to the same grid point 1408 by a nearest neighbor operation.
  • FIG.14 also illustrates a shear rotation operation 1402 of an image 1404 to a rotated image 1420 in which, for each shifted point 1424, there is a unique nearest neighbor 1422 on the grid.
  • because server 120 can find a unique nearest neighbor for each grid point, shearing with nearest neighbors is an invertible operation.
  • the composition of three shearing operations is also invertible, implying that an invertible rotation on the grid can be defined and performed by adjustment module 404 in some implementations.
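  • The invertibility argument can be made concrete with a small sketch. It assumes that each nearest neighbor shear is realized as a whole-pixel shift of a row or column, rounded from the ideal shear offset, and that pixels shifted past the border wrap around; the wrap-around is a simplification chosen here so that every shear is an exact permutation of grid cells, and is not taken from the disclosure.

```python
import math


def shear_rows(grid, amount, sign=1):
    """Shift each row by a rounded offset proportional to its distance from the center row."""
    n = len(grid)
    center = (n - 1) / 2.0
    out = []
    for r, row in enumerate(grid):
        shift = sign * int(round(amount * (r - center)))
        out.append([row[(col - shift) % n] for col in range(n)])  # cyclic shift
    return out


def transpose(grid):
    return [list(col) for col in zip(*grid)]


def shear_cols(grid, amount, sign=1):
    """Column shear, implemented as a row shear on the transposed grid."""
    return transpose(shear_rows(transpose(grid), amount, sign))


def shear_rotate(grid, theta_deg):
    """Rotate by three nearest neighbor shears: rows, then columns, then rows."""
    t = math.radians(theta_deg)
    a, b = -math.tan(t / 2.0), math.sin(t)
    return shear_rows(shear_cols(shear_rows(grid, a), b), a)


def shear_rotate_inverse(grid, theta_deg):
    """Undo shear_rotate by applying the opposite shifts in reverse order."""
    t = math.radians(theta_deg)
    a, b = -math.tan(t / 2.0), math.sin(t)
    return shear_rows(shear_cols(shear_rows(grid, a, -1), b, -1), a, -1)


# Every cell holds a distinct value so that any loss of information would be visible.
grid = [[8 * r + c for c in range(8)] for r in range(8)]
rotated = shear_rotate(grid, 30)
restored = shear_rotate_inverse(rotated, 30)
```

  • Because each shear only moves whole rows or whole columns by integer amounts, no two cells collide, and the round trip restores the grid exactly.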
  • While defining tensor rotations with shearing allows for invertibility, there can be a trade-off in angle resolution. Indeed, the smallest rotation that can be represented with invertible shear rotations depends on the grid size n. This implies that the model may not be equivariant with respect to continuous rotation, but only equivariant up to a finite angle resolution. However, for the grid sizes used in practice, the angle resolution is sharp enough to model most rotations. For example, for a 32 x 32 grid, the angle resolution is less than 2 degrees.
  • the shear rotation matrix factorization involves a tan( ⁇ /2) term.
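  • to avoid numerical problems with this term, adjustment module 404 may perform shear rotations only over a restricted range of angles.
  • angles can be decomposed as $\theta = \theta_{90} + \theta_{\text{small}}$, where $\theta_{90} \in \{0, 90, 180, 270\}$ and $\theta_{\text{small}} \in [-45, 45]$.
  • because image rotations on the grid are invertible for multiples of 90 degrees, the 90, 180, and 270 degree rotations can first be performed by flipping and transposing the image, followed by a shear rotation for the small angle $\theta_{\text{small}}$. This results in only performing shear rotations for angles in [-45, 45], avoiding any numerical problems.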
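  • The split of an arbitrary angle into a multiple of 90 degrees plus a small remainder in [-45, 45] can be sketched as follows. The function name and the tie-breaking at exactly 45 degrees are assumptions made for illustration.

```python
import math


def decompose_angle(theta_deg):
    """Split an angle into (multiple of 90 in {0, 90, 180, 270}, remainder in [-45, 45])."""
    theta = theta_deg % 360.0
    quarter_turns = int(math.floor(theta / 90.0 + 0.5))  # nearest multiple of 90
    theta_small = theta - 90.0 * quarter_turns
    theta_90 = (90 * quarter_turns) % 360
    return theta_90, theta_small


examples = {angle: decompose_angle(angle) for angle in (100, 350, -30, 200)}
```

  • For example, 100 degrees splits into a 90 degree flip-and-transpose rotation plus a 10 degree shear rotation, and 350 degrees splits into no quarter turn plus a -10 degree shear rotation.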
  • although the shear rotation operation has been defined above for 2D grids, it can be extended to 3D grids by performing two 2D rotations.
  • the full invertible shear rotation operation can be defined as performing an elevation rotation by an elevation angle around the width axis, followed by an azimuth rotation by an azimuth angle around the height axis of the scene representation.
  • the shear rotation operation is discontinuous in the angles. However, this does not matter in practice as it is not necessary to calculate gradients with respect to the angles. Indeed, the shear rotation layer of machine learning model 220 can correspond to shuffling the positions of voxels in the scene representation tensor, allowing back propagation through the operation.
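  • Why a voxel shuffle permits back propagation can be shown with a tiny sketch. It assumes that, for a fixed angle, the shear rotation reduces to a fixed permutation of positions in a flattened tensor; the helper names and the example permutation are hypothetical.

```python
def shuffle_forward(values, perm):
    """Forward pass: output position i takes the value stored at input position perm[i]."""
    return [values[p] for p in perm]


def shuffle_backward(grad_out, perm):
    """Backward pass: route each output gradient back to the input position it came from."""
    grad_in = [0.0] * len(perm)
    for i, p in enumerate(perm):
        grad_in[p] = grad_out[i]
    return grad_in


perm = [2, 0, 1]                      # hypothetical shuffle of three voxels
voxels = [10.0, 20.0, 30.0]
shuffled = shuffle_forward(voxels, perm)
grads = shuffle_backward([1.0, 2.0, 3.0], perm)
```

  • The gradient with respect to the voxel values is simply the inverse shuffle of the incoming gradient, so no derivative with respect to the angle is needed.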
  • the images may include personal information data that uniquely identifies or can be used to identify a specific person.
  • personal information data can include images of a user’s face or portions of the user’s body, video data, demographic data, location-based data, online identifiers, printed information such as telephone numbers, email addresses, home addresses, data or records relating to a user’s health or level of fitness (e.g., vital signs measurements, medication information, exercise information), date of birth, or any other personal information.
  • the present disclosure recognizes that the use of such personal information data, in the present technology, can be used to the benefit of users.
  • the personal information data can be used for neural rendering of images of people.
  • the present disclosure contemplates that those entities responsible for the collection, analysis, disclosure, transfer, storage, or other use of such personal information data will comply with well-established privacy policies and/or privacy practices. In particular, such entities would be expected to implement and consistently apply privacy practices that are generally recognized as meeting or exceeding industry or governmental requirements for maintaining the privacy of users.
  • Such information regarding the use of personal data should be prominently and easily accessible by users, and should be updated as the collection and/or use of data changes.
  • Personal information from users should be collected for legitimate uses only. Further, such collection/sharing should occur only after receiving the consent of the users or other legitimate basis specified in applicable law.
  • policies and practices should be adapted for the particular types of personal information data being collected and/or accessed and adapted to applicable laws and standards, including jurisdiction-specific considerations which may serve to impose a higher standard. For instance, in the US, collection of or access to certain health data may be governed by federal and/or state laws, such as the Health Insurance Portability and Accountability Act (HIPAA); whereas health data in other countries may be subject to other regulations and policies and should be handled accordingly.
  • the present disclosure also contemplates embodiments in which users selectively block the use of, or access to, personal information data. That is, the present disclosure contemplates that hardware and/or software elements can be provided to prevent or block access to such personal information data.
  • the present technology can be configured to allow users to select to “opt in” or “opt out” of participation in the collection and/or sharing of personal information data during registration for services or anytime thereafter.
  • the present disclosure contemplates providing notifications relating to the access or use of personal information.
  • a user may be notified upon downloading an app that their personal information data will be accessed and then reminded again just before personal information data is accessed by the app.
  • personal information data should be managed and handled in a way to minimize risks of unintentional or unauthorized access or use. Risk can be minimized by limiting the collection of data and deleting data once it is no longer needed.
  • data de-identification can be used to protect a user’s privacy.
  • De-identification may be facilitated, when appropriate, by removing identifiers, controlling the amount or specificity of data stored (e.g., collecting location data at city level rather than at an address level or at a scale that is insufficient for facial recognition), controlling how data is stored (e.g., aggregating data across users), and/or other methods such as differential privacy.
  • FIG.15 illustrates an electronic system 1500 with which one or more implementations of the subject technology may be implemented.
  • the electronic system 1500 can be, and/or can be a part of, the electronic device 110, and/or the server 120 shown in FIG. 1.
  • the electronic system 1500 may include various types of computer readable media and interfaces for various other types of computer readable media.
  • the electronic system 1500 includes a bus 1508, one or more processing unit(s) 1512, a system memory 1504 (and/or buffer), a ROM 1510, a permanent storage device 1502, an input device interface 1514, an output device interface 1506, and one or more network interfaces 1516, or subsets and variations thereof.
  • the bus 1508 collectively represents all system, peripheral, and chipset buses that communicatively connect the numerous internal devices of the electronic system 1500.
  • the bus 1508 communicatively connects the one or more processing unit(s) 1512 with the ROM 1510, the system memory 1504, and the permanent storage device 1502. From these various memory units, the one or more processing unit(s) 1512 retrieves instructions to execute and data to process in order to execute the processes of the subject disclosure.
  • the one or more processing unit(s) 1512 can be a single processor or a multi-core processor in different implementations.
  • the ROM 1510 stores static data and instructions that are needed by the one or more processing unit(s) 1512 and other modules of the electronic system 1500.
  • the permanent storage device 1502 may be a read-and-write memory device.
  • the permanent storage device 1502 may be a non-volatile memory unit that stores instructions and data even when the electronic system 1500 is off.
  • a mass-storage device such as a magnetic or optical disk and its corresponding disk drive
  • a removable storage device such as a floppy disk, flash drive, and its corresponding disk drive
  • the system memory 1504 may be a read-and-write memory device.
  • the system memory 1504 may be a volatile read-and-write memory, such as random access memory.
  • the system memory 1504 may store any of the instructions and data that one or more processing unit(s) 1512 may need at runtime.
  • the processes of the subject disclosure are stored in the system memory 1504, the permanent storage device 1502, and/or the ROM 1510. From these various memory units, the one or more processing unit(s) 1512 retrieves instructions to execute and data to process in order to execute the processes of one or more implementations.
  • the bus 1508 also connects to the input and output device interfaces 1514 and 1506.
  • the input device interface 1514 enables a user to communicate information and select commands to the electronic system 1500.
  • Input devices that may be used with the input device interface 1514 may include, for example, alphanumeric keyboards and pointing devices (also called “cursor control devices”).
  • the output device interface 1506 may enable, for example, the display of images generated by electronic system 1500.
  • Output devices that may be used with the output device interface 1506 may include, for example, printers and display devices, such as a liquid crystal display (LCD), a light emitting diode (LED) display, an organic light emitting diode (OLED) display, a flexible display, a flat panel display, a solid state display, a projector, or any other device for outputting information.
  • One or more implementations may include devices that function as both input and output devices
  • the bus 1508 also couples the electronic system 1500 to one or more networks and/or to one or more network nodes, such as the electronic device 110 shown in FIG.1, through the one or more network interface(s) 1516.
  • the electronic system 1500 can be a part of a network of computers (such as a LAN, a wide area network (“WAN”), or an Intranet, or a network of networks, such as the Internet). Any or all components of the electronic system 1500 can be used in conjunction with the subject disclosure.
  • the disclosed systems and methods provide advantages for neural rendering, including providing a machine learning model that makes very few assumptions about the scene representation and rendering process. Indeed, the disclosed machine learning model learns representations simply by enforcing equivariance with respect to 3D rotations. As such, material, texture and lighting in a scene can be encoded into the model. The simplicity of the disclosed model also means that it can be trained purely from posed 2D images with no 3D supervision.
  • As described herein, these advantages facilitate other advantages including allowing the model to be applied to interesting data where obtaining 3D geometry is difficult. In contrast with other methods, the disclosed machine learning model does not require pose information at test time.
  • operating the disclosed machine learning model is fast: inferring a scene representation simply corresponds to performing a forward pass of a neural network. This is in contrast to other methods that require solving an expensive optimization problem at inference time for every new observed image.
  • rendering is also performed in a single forward pass, making it faster than other methods that often require recurrence to produce an image.
  • when training data is sparse (e.g., the number of views per scene is small), novel view synthesis models can exhibit a tendency to “snap” to fixed views instead of smoothly rotating around the scene.
  • the disclosed systems and methods contemplate additional training data and/or training operations to reduce this type of undesirable snapping.
  • equivariance is described in various examples as being enforced during training with respect to 3D rotations.
  • real scenes have other symmetries like translation and scale.
  • translation equivariance and scale equivariance can also be applied as constraints for model training.
  • scene representations are sometimes described as being used to render images, additional structure can be enforced on the latent space to make the representations more interpretable or even editable.
  • adding inductive biases from the rendering process such as explicitly handling occlusion, has been shown to improve performance of other models and could also be applied to the disclosed model.
  • the learned scene representation can be used to generate a 3D reconstruction.
  • the disclosed systems and methods can be used to train a model to learn a distribution over scenes p(scene
  • each view pair in the training images is treated the same.
  • views that are close to each other should be easier to reconstruct, while views that are far from each other may not be reconstructed exactly due to the inherent uncertainty caused by occlusion.
  • the training operations described herein can be modified to reflect this, for example, by weighting the reconstruction loss by how far scenes are from each other.
  • pairs of views of a training object are provided to train the machine learning model.
  • larger number of views of a training object can be provided to the machine-learning model, which would also reduce the entropy of the p(scene
  • systems and methods are provided to learn scene representations by ensuring that the scene representations transform like real 3D scenes.
  • the model may include invertible shear rotations which allow the model to learn equivariant scene representations by gradient descent.
  • the disclosed machine learning models can be trained without 3D supervision and can be trained using only posed 2D images.
  • systems and methods are provided to infer a scene representation directly from an image using a single forward pass of an inverse renderer.
  • the learned scene representation can easily be manipulated and rendered to produce new viewpoints of the scene.
  • Three challenging new datasets for neural rendering and scene representations are also provided. It has been shown that the disclosed systems and methods perform well on these datasets, as well as on standard ShapeNet tasks.
  • a method includes providing an input image depicting a view of an object to a machine learning model that has been trained based on a constraint of equivariance under rotations between a training object and a model-generated representation of the training object; and generating, using the machine learning model and based on the provided image, at least one of an output image that depicts the object from a rotated view that is different from the view of the object in the image or a three-dimensional representation of the object.
  • a system includes a processor; and a memory device containing instructions, which when executed by the processor cause the processor to: provide an input image depicting a view of an object to a machine learning model that has been trained based on a constraint of equivariance under rotations between a training object and a model-generated representation of the training object; and generate, using the machine learning model and based on the provided image, at least one of an output image that depicts the object from a rotated view that is different from the view of the object in the image or a three-dimensional representation of the object.
  • a non-transitory machine-readable medium including code that, when executed by a processor, causes the processor to: provide an input image depicting a view of an object to a machine learning model that has been trained based on at least two training images depicting different views of a training object; and generate, using the machine learning model and based on the provided image, at least one of an output image that depicts the object from a rotated view that is different from the view of the object in the image or a three-dimensional representation of the object.
  • a non-transitory machine-readable medium including code that, when executed by a processor, causes the processor to: provide an input image depicting a view of an object to a machine learning model that has been trained based on a constraint of equivariance under rotations between a training object and a model-generated representation of the training object; and generate, using the machine learning model and based on the provided image, at least one of an output image that depicts the object from a rotated view that is different from the view of the object in the image or a three-dimensional representation of the object.
  • Implementations within the scope of the present disclosure can be partially or entirely realized using a tangible computer-readable storage medium (or multiple tangible computer-readable storage media of one or more types) encoding one or more instructions.
  • the tangible computer-readable storage medium also can be non-transitory in nature.
  • the computer-readable storage medium can be any storage medium that can be read, written, or otherwise accessed by a general purpose or special purpose computing device, including any processing electronics and/or processing circuitry capable of executing instructions.
  • the computer-readable medium can include any volatile semiconductor memory, such as RAM, DRAM, SRAM, T-RAM, Z-RAM, and TTRAM.
  • the computer-readable medium also can include any non-volatile semiconductor memory, such as ROM, PROM, EPROM, EEPROM, NVRAM, flash, nvSRAM, FeRAM, FeTRAM, MRAM, PRAM, CBRAM, SONOS, RRAM, NRAM, racetrack memory, FJG, and Millipede memory.
  • the computer-readable storage medium can include any non-semiconductor memory, such as optical disk storage, magnetic disk storage, magnetic tape, other magnetic storage devices, or any other medium capable of storing one or more instructions.
  • the tangible computer-readable storage medium can be directly coupled to a computing device, while in other implementations, the tangible computer-readable storage medium can be indirectly coupled to a computing device, e.g., via one or more wired connections, one or more wireless connections, or any combination thereof.
  • Instructions can be directly executable or can be used to develop executable instructions.
  • instructions can be realized as executable or non-executable machine code or as instructions in a high-level language that can be compiled to produce executable or non-executable machine code. Further, instructions also can be realized as or can include data.
  • Computer-executable instructions also can be organized in any format, including routines, subroutines, programs, data structures, objects, modules, applications, applets, functions, etc. As recognized by those of skill in the art, details including, but not limited to, the number, structure, sequence, and organization of instructions can vary significantly without varying the underlying logic, function, processing, and output.
  • While the above discussion primarily refers to microprocessor or multi-core processors that execute software, one or more implementations are performed by one or more integrated circuits, such as ASICs or FPGAs. In one or more implementations, such integrated circuits execute instructions that are stored on the circuit itself.
  • Any specific order or hierarchy of blocks in the processes disclosed is an illustration of example approaches. Based upon design preferences, it is understood that the specific order or hierarchy of blocks in the processes may be rearranged, or that all illustrated blocks be performed. Any of the blocks may be performed simultaneously. In one or more implementations, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the implementations described above should not be understood as requiring such separation in all implementations, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
  • The terms “base station”, “receiver”, “computer”, “server”, “processor”, and “memory” all refer to electronic or other technological devices. These terms exclude people or groups of people.
  • The terms “display” or “displaying” mean displaying on an electronic device.
  • The phrase “at least one of” preceding a series of items, with the term “and” or “or” to separate any of the items, modifies the list as a whole, rather than each member of the list (i.e., each item).
  • The phrase “at least one of” does not require selection of at least one of each item listed; rather, the phrase allows a meaning that includes at least one of any one of the items, and/or at least one of any combination of the items, and/or at least one of each of the items.
  • The phrases “at least one of A, B, and C” or “at least one of A, B, or C” each refer to only A, only B, or only C; any combination of A, B, and C; and/or at least one of each of A, B, and C.
  • The predicate words “configured to”, “operable to”, and “programmed to” do not imply any particular tangible or intangible modification of a subject, but, rather, are intended to be used interchangeably.
  • A processor configured to monitor and control an operation or a component may also mean the processor being programmed to monitor and control the operation or the processor being operable to monitor and control the operation.
  • A processor configured to execute code can be construed as a processor programmed to execute code or operable to execute code.
  • Phrases such as an aspect, the aspect, another aspect, some aspects, one or more aspects, an implementation, the implementation, another implementation, some implementations, one or more implementations, an embodiment, the embodiment, another embodiment, some embodiments, one or more embodiments, a configuration, the configuration, another configuration, some configurations, one or more configurations, the subject technology, the disclosure, the present disclosure, other variations thereof and the like are for convenience and do not imply that a disclosure relating to such phrase(s) is essential to the subject technology or that such disclosure applies to all configurations of the subject technology. A disclosure relating to such phrase(s) may apply to all configurations, or one or more configurations. A disclosure relating to such phrase(s) may provide one or more examples.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Graphics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Geometry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Architecture (AREA)
  • Computer Hardware Design (AREA)
  • Image Analysis (AREA)

Abstract

The subject technology provides a framework for learning neural scene representations directly from images, without three-dimensional (3D) supervision, by a machine learning model. In the disclosed systems and methods, 3D structure can be imposed by ensuring that the learned representation transforms like a real 3D scene. For example, a loss function can be provided that enforces equivariance of the scene representation with respect to 3D rotations. Because naive tensor rotations cannot be used to define models that are equivariant with respect to 3D rotations, a new operation called an invertible shear rotation is disclosed, which has the desired equivariance property. In some implementations, the model can be used to generate a 3D representation, such as a mesh, of an object from an image of the object.
EP21704078.1A 2020-02-06 2021-01-12 Neural rendering Withdrawn EP4085427A2 (fr)
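
The abstract names an invertible shear rotation but does not define it. The following is a minimal sketch, assuming the operation builds on the classical factorization of a planar rotation into three shears, applied here to 2D integer grid coordinates rather than to a 3D scene representation. The function names (`shear_x`, `shear_y`, `shear_rotate`, `shear_rotate_inverse`) are hypothetical and chosen only for illustration.

```python
import math

# Assumption: a rotation by theta factors as shear_x(a) * shear_y(b) * shear_x(a)
# with a = -tan(theta / 2) and b = sin(theta). Each shear shifts one coordinate
# by a rounded integer offset that depends only on the other coordinate.

def shear_x(points, a):
    """Shift x by round(a * y); y is left unchanged."""
    return [(x + round(a * y), y) for x, y in points]

def shear_y(points, b):
    """Shift y by round(b * x); x is left unchanged."""
    return [(x, y + round(b * x)) for x, y in points]

def shear_rotate(points, theta):
    """Approximately rotate integer grid points by theta using three shears."""
    a = -math.tan(theta / 2.0)
    b = math.sin(theta)
    return shear_x(shear_y(shear_x(points, a), b), a)

def shear_rotate_inverse(points, theta):
    """Undo shear_rotate exactly by subtracting the same offsets in reverse order."""
    a = -math.tan(theta / 2.0)
    b = math.sin(theta)
    pts = [(x - round(a * y), y) for x, y in points]   # undo last x-shear
    pts = [(x, y - round(b * x)) for x, y in pts]      # undo y-shear
    return [(x - round(a * y), y) for x, y in pts]     # undo first x-shear

# Round trip on a small grid: every point returns to its original position.
grid = [(x, y) for x in range(-2, 3) for y in range(-2, 3)]
rotated = shear_rotate(grid, math.pi / 6)
assert shear_rotate_inverse(rotated, math.pi / 6) == grid
```

In this sketch the round trip is exact because each shear leaves untouched the coordinate that determines its offset, so the inverse can recompute and subtract the identical integer shift. A rotation that resamples the grid by interpolation generally does not have this property.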

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202062971198P 2020-02-06 2020-02-06
US202063018434P 2020-04-30 2020-04-30
US17/145,232 US11967015B2 (en) 2020-02-06 2021-01-08 Neural rendering
PCT/US2021/013073 WO2021158337A2 (fr) 2020-02-06 2021-01-12 Neural rendering

Publications (1)

Publication Number Publication Date
EP4085427A2 (fr) 2022-11-09

Family

ID=74562021

Family Applications (1)

Application Number Title Priority Date Filing Date
EP21704078.1A Withdrawn EP4085427A2 (fr) 2020-02-06 2021-01-12 Neural rendering

Country Status (2)

Country Link
EP (1) EP4085427A2 (fr)
CN (1) CN113222137B (fr)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
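CN114119849B (zh) * 2022-01-24 2022-06-24 阿里巴巴(中国)有限公司 Three-dimensional scene rendering method, device, and storage medium
CN119783751B (zh) * 2024-12-11 2025-09-23 西安交通大学 Method and device for relaxing rotation-equivariance constraints based on a convolutional network framework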

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8971612B2 (en) * 2011-12-15 2015-03-03 Microsoft Corporation Learning image processing tasks from scene reconstructions
CN102930302B (zh) * 2012-10-18 2016-01-13 山东大学 Incremental human behavior recognition method based on an online sequential extreme learning machine
NL2016285B1 (en) * 2016-02-19 2017-09-20 Scyfer B V Device and method for generating a group equivariant convolutional neural network.
US10331975B2 (en) * 2016-11-29 2019-06-25 Google Llc Training and/or using neural network models to generate intermediary output of a spectral image
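CN108230277B (zh) * 2018-02-09 2020-09-11 中国人民解放军战略支援部队信息工程大学 Dual-energy CT image decomposition method based on a convolutional neural network
CN109191255B (zh) * 2018-09-04 2022-04-15 中山大学 Commodity alignment method based on unsupervised feature point detection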

Also Published As

Publication number Publication date
CN113222137A (zh) 2021-08-06
CN113222137B (zh) 2024-07-26

Similar Documents

Publication Publication Date Title
US11967015B2 (en) Neural rendering
US12243273B2 (en) Neural 3D video synthesis
CN111386536B (zh) Method and system for semantically consistent image style transfer
US11748940B1 (en) Space-time representation of dynamic scenes
US11055910B1 (en) Method and system for generating models from multiple views
AU2017248506B2 (en) Implementation of an advanced image formation process as a network layer and its applications
US11989846B2 (en) Mixture of volumetric primitives for efficient neural rendering
US12198275B2 (en) Generative scene networks
US12249025B2 (en) High-quality object-space dynamic ambient occlusion
CN116824092B (zh) Three-dimensional model generation method and apparatus, computer device, and storage medium
US11922575B2 (en) Depth hull for rendering three-dimensional models
US11138789B1 (en) Enhanced point cloud for three-dimensional models
WO2022164895A2 (fr) Neural 3D video synthesis
Ma et al. A blendshape model that incorporates physical interaction
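CN113222137B (zh) Neural rendering
CN116977548A (zh) Three-dimensional reconstruction method, apparatus, device, and computer-readable storage medium
CN117541703B (zh) Data rendering method, apparatus, device, and computer-readable storage medium
EP4643306A1 (fr) Unsupervised volumetric animation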
Mirbauer et al. SkyGAN: Realistic Cloud Imagery for Image‐based Lighting
Kunert et al. Neural network adaption for depth sensor replication
CN120318391B (zh) MViT-based dense visual field reconstruction method and system
Jin et al. Research on 3D Visualization of Drone Scenes Based on Neural Radiance Fields
US20240257449A1 (en) Generating hard object shadows for general shadow receivers within digital images utilizing height maps
US20240371072A1 (en) Stochastic texture filtering
US20250232526A1 (en) Generating three-dimensional point clouds and depth maps of objects within digital images utilizing height maps and perspective field representations

Legal Events

Date Code Title Description
STAA Information on the status of an EP patent application or granted EP patent

Free format text: STATUS: UNKNOWN

STAA Information on the status of an EP patent application or granted EP patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under Article 153(3) EPC to a published international application that has entered the European phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an EP patent application or granted EP patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20220803

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

DAV Request for validation of the European patent (deleted)
DAX Request for extension of the European patent (deleted)
STAA Information on the status of an EP patent application or granted EP patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20230612

STAA Information on the status of an EP patent application or granted EP patent

Free format text: STATUS: THE APPLICATION HAS BEEN WITHDRAWN

18W Application withdrawn

Effective date: 20241107