
WO2023172257A1 - Photometric stereo for dynamic surface with motion field - Google Patents

Photometric stereo for dynamic surface with motion field

Info

Publication number
WO2023172257A1
Authority
WO
WIPO (PCT)
Prior art keywords
images
sequence
image
pixel
motion field
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2022/019458
Other languages
French (fr)
Inventor
Liangchen SONG
Yi Xu
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Innopeak Technology Inc
Original Assignee
Innopeak Technology Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innopeak Technology Inc filed Critical Innopeak Technology Inc
Priority to PCT/US2022/019458 priority Critical patent/WO2023172257A1/en
Publication of WO2023172257A1 publication Critical patent/WO2023172257A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • G06T7/586Depth or shape recovery from multiple images from multiple light sources, e.g. photometric stereo
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10141Special mode during image acquisition
    • G06T2207/10152Varying illumination
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Definitions

  • This application relates generally to image processing including, but not limited to, methods, systems, and non-transitory computer-readable media for encoding information associated with a sequence of images captured at a scene in a predefined data format that facilitates extraction of related information.
  • Virtual content is oftentimes rendered based on image information collected from a scene or object in real life.
  • Three dimensional (3D) image information of the scene or object can be represented by surface normals of objects that are obtained by photometric stereo in which the scene or object is observed under different lighting conditions.
  • Photometric stereo is optionally applied to static or dynamic surfaces.
  • When photometric stereo is applied to dynamic surfaces, various techniques (e.g., rigid transformation, color multiplexing, and time multiplexing) are used; however, these techniques have specific requirements and can introduce significant errors or fail to apply in many situations. It would be beneficial to have an image encoding mechanism that encodes the image information of a real scene or object and facilitates virtual content rendering more accurately and efficiently than the current practice that communicates 3D image information of the scene using the images or related surface normals.
  • Various embodiments of this application are directed to encoding information associated with a sequence of images concerning a real scene or object in a predefined data format, e.g., a motion field represented by a neural network model.
  • the encoded image information provides at least a depth value and an interframe depth variation of each pixel in the sequence of images.
  • an electronic system can regenerate a normal map of each image in the sequence of images, and the normal map of each image includes a normal of each pixel of the respective image.
  • a virtual scene or object can be reconstructed in a 3D virtual scene based on at least a subset of the normal maps of the sequence of images.
  • a method is implemented at an electronic system for processing information of one or more objects.
  • the method includes obtaining a sequence of images that are captured sequentially in a field of view including the one or more objects, determining a resolution of each image and a number of images in the sequence of images, and generating a motion field of the sequence of images.
  • the motion field is represented with a neural network model and configured to encode at least a depth value and an interframe depth variation of each pixel in the sequence of images.
  • the method further includes encoding the sequence of images to image encoding information including the resolution of each image, the number of images in the sequence of images, and the motion field.
  • the neural network model representing the motion field is configured to, for each pixel in the sequence of images, convert a pixel location of the respective pixel in a respective image and a respective temporal location of the respective image in the sequence of images to the depth value and the interframe depth variation of the respective pixel.
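  • As a hedged illustration of such a mapping, the sketch below shows one way a small multilayer perceptron could convert a normalized position vector (x, y, t) into the depth value and interframe depth variation; the framework, layer sizes, and names are illustrative assumptions and are not specified by this description.

```python
# Minimal sketch (PyTorch assumed) of a motion-field network that maps a
# pixel's normalized location (x, y) and temporal location t to a depth value
# d and an interframe depth variation delta_d. Layer sizes and names are
# illustrative, not taken from this description.
import torch
import torch.nn as nn

class MotionFieldMLP(nn.Module):
    def __init__(self, hidden: int = 256, num_layers: int = 4):
        super().__init__()
        layers, in_dim = [], 3                      # input is (x, y, t)
        for _ in range(num_layers):
            layers += [nn.Linear(in_dim, hidden), nn.ReLU()]
            in_dim = hidden
        layers += [nn.Linear(hidden, 2)]            # outputs (d, delta_d)
        self.net = nn.Sequential(*layers)

    def forward(self, xyt: torch.Tensor) -> torch.Tensor:
        # xyt: (N, 3) batch of pixel/temporal locations, each scaled to [0, 1]
        return self.net(xyt)

# Example query: depth and depth variation for one pixel at one time stamp.
model = MotionFieldMLP()
d, delta_d = model(torch.tensor([[0.25, 0.50, 0.10]])).unbind(-1)
```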
  • a method is implemented at an electronic system to decode information of one or more objects.
  • the method includes obtaining image encoding information of a sequence of images that are captured sequentially in a field of view including the one or more objects.
  • the image encoding information includes a resolution of each image, a number of images in the sequence of images, and a motion field represented by a neural network model.
  • the method further includes generating a normal map of each image in the sequence of images from the image encoding information.
  • Generation of the normal map of each image further includes, for each pixel in the respective image, in accordance with the neural network model representing the motion field, converting a pixel location in the respective image and a respective temporal location of the respective image in the sequence of images to a depth value and an interframe depth variation of the respective pixel.
  • Generation of the normal map of each image further includes determining a normal of the respective pixel based on the depth value and interframe depth variation of the respective pixel.
  • the method further includes reconstructing the one or more objects in a 3D virtual scene based on at least a subset of the normal maps of the sequence of images.
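  • The decoding side can be pictured as a per-pixel query of the trained model followed by a depth-to-normal step. The sketch below assumes a network like the one sketched above; because the exact depth-to-normal relation is not reproduced in this text, spatial finite differences of the recovered depth map are used here as one plausible, assumed choice.

```python
# Sketch of decoding one normal map for the image at temporal location t.
# Assumes a model(xyt) -> (d, delta_d) network as sketched above; the
# depth-to-normal step below (finite differences of the depth map) is an
# assumption, not the specific relation used in this description.
import torch
import torch.nn.functional as F

def decode_normal_map(model, width: int, height: int, t: float) -> torch.Tensor:
    ys, xs = torch.meshgrid(torch.linspace(0, 1, height),
                            torch.linspace(0, 1, width), indexing="ij")
    xyt = torch.stack([xs, ys, torch.full_like(xs, t)], dim=-1).reshape(-1, 3)
    with torch.no_grad():
        d, _delta_d = model(xyt).unbind(-1)
    depth = d.reshape(height, width)

    # Assumed normal model: gradients of the depth map via finite differences.
    dz_dx = F.pad(depth[:, 1:] - depth[:, :-1], (0, 1))        # H x W
    dz_dy = F.pad(depth[1:, :] - depth[:-1, :], (0, 0, 0, 1))  # H x W
    n = torch.stack([-dz_dx, -dz_dy, torch.ones_like(depth)], dim=-1)
    return n / n.norm(dim=-1, keepdim=True)                    # unit normals
```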
  • some implementations include an electronic system that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.
  • some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.
  • Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
  • FIG. 2 is a block diagram illustrating an electronic system configured to process content data (e.g., image data), in accordance with some embodiments.
  • Figure 3 is an example data processing environment for training and applying a neural network-based data processing model for processing visual and/or audio data, in accordance with some embodiments.
  • Figure 4A is an example neural network applied to process content data in an NN-based data processing model, in accordance with some embodiments
  • Figure 4B is an example node in the neural network, in accordance with some embodiments.
  • Figure 5 is a flowchart of a process for encoding a sequence of images to a motion field or decoding the motion field, in accordance with some embodiments.
  • Figure 6A is a sequence of images captured under different lighting conditions, in accordance with some embodiments.
  • Figures 6B and 6C are surface normal maps and relative depth maps corresponding to the sequence of images in Figure 6A, in accordance with some embodiments.
  • Figure 7 is a flowchart of a process for training a neural network model for encoding a sequence of images to a motion field, in accordance with some embodiments.
  • Figure 8A is a flow diagram of a method for processing information of one or more objects, in accordance with some embodiments
  • Figure 8B is a flow diagram of a method for decoding information of one or more objects, in accordance with some embodiments.
  • Photometric Stereo is a computer vision technique applied to reconstruct surface normals of objects by observing the objects under different lighting conditions.
  • Surfaces of the objects comply with a Lambertian reflectance assumption in which luminance of the surfaces is isotropic and appears to be the same independently of an angle of view.
  • a sequence of successive images are captured for the objects that are under rigid movement or deformation while different light conditions are applied. For example, successive images are captured to record an object that rotates on a turntable or a human face having changing expressions. These successive images are associated with different light conditions and different object poses, and therefore, include three dimension (3D) information of surfaces of the objects.
  • This application focuses on encoding the 3D information of surfaces of the objects in a motion field represented by one or more neural network models, which can be decoded to reconstruct models of dynamic surfaces of the objects including a series of surface reconstructions (e.g., a form of 3D meshes) that can be used in gaming, animation, and telepresence applications.
  • FIG. 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments.
  • the one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, head-mounted display (HMD) (also called augmented reality (AR) glasses) 104D, or intelligent, multi-sensing, network- connected home devices (e.g., a surveillance camera 104E, a smart television device, a drone).
  • Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface.
  • the collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102.
  • the one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, processes the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104.
  • the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
  • the one or more servers 102 are configured to enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 are configured to implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104.
  • the client devices 104 include a game console (e.g., the HMD 104D) that executes an interactive online gaming application.
  • the game console receives a user instruction and sends it to a game server 102 with user data.
  • the game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console.
  • the stream of video data includes a motion field represented by a neural network model, and when provided to the game console, can be decoded to depth information and surface normals of a gaming scene.
  • the client devices 104 include a networked surveillance camera 104E and a mobile phone 104C.
  • the networked surveillance camera 104E collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera 104E, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera 104E remotely and in real time.
  • the one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100.
  • the one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof.
  • the one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol.
  • a connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof.
  • the one or more communication networks 108 can represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another.
  • At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other computer systems that route data and messages.
  • deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data.
  • the content data may broadly include inertial sensor data captured by inertial sensor(s) of a client device 104.
  • data processing models are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data. Subsequently to model training, the mobile phone 104C or HMD 104D obtains the content data (e.g., captures video data via an internal camera) and processes the content data using the data processing models locally.
  • both model training and data processing are implemented locally at each individual client device 104 (e.g., the mobile phone 104C and HMD 104D).
  • the client device 104 obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models.
  • both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104 (e.g., the client device 104A and HMD 104D).
  • the server 102A obtains the training data from itself, another server 102, or the storage 106, and applies the training data to train the data processing models.
  • the client device 104 obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results (e.g., recognized hand gestures) from the server 102A, presents the results on a user interface (e.g., associated with the application), renders virtual objects in a field of view based on the poses, or implements some other functions based on the results.
  • the client device 104 itself implements no or little data processing on the content data prior to sending it to the server 102A.
  • data processing is implemented locally at a client device 104 (e.g., the client device 104B and HMD 104D), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104.
  • the server 102B obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models.
  • the trained data processing models are optionally stored in the server 102B or storage 106.
  • the client device 104 imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface or used to initiate some functions (e.g., rendering virtual objects based on device poses) locally.
  • a pair of AR glasses 104D are communicatively coupled in the data processing environment 100.
  • the AR glasses 104D includes a camera, a microphone, a speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display.
  • the camera and microphone are configured to capture video and audio data from a scene of the AR glasses 104D, while the one or more inertial sensors are configured to capture inertial sensor data.
  • the camera captures hand gestures of a user wearing the AR glasses 104D, and recognizes the hand gestures locally and in real time using a two-stage hand gesture recognition model.
  • the microphone records ambient sound, including user’s voice commands.
  • both video or static visual data captured by the camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses.
  • the video, static image, audio, or inertial sensor data captured by the AR glasses 104D is processed by the AR glasses 104D, server(s) 102, or both to recognize the device poses.
  • deep learning techniques are applied by the server(s) 102 and AR glasses 104D jointly to recognize and predict the device poses.
  • the device poses are used to control the AR glasses 104D itself or interact with an application (e.g., a gaming application) executed by the AR glasses 104D.
  • the display of the AR glasses 104D displays a user interface, and the recognized or predicted device poses are used to render or interact with user selectable display items (e.g., an avatar) on the user interface.
  • deep learning techniques are applied in the data processing environment 100 to process video data, static image data, or inertial sensor data captured by the AR glasses 104D.
  • 2D or 3D device poses are recognized and predicted based on such video, static image, and/or inertial sensor data using a first data processing model.
  • Visual content is optionally generated using a second data processing model.
  • Training of the first and second data processing models is optionally implemented by the server 102 or AR glasses 104D.
  • Inference of the device poses and visual content is implemented by each of the server 102 and AR glasses 104D independently or by both of the server 102 and AR glasses 104D jointly.
  • This application is directed to encoding object information captured in a sequence of images to a motion field represented by a neural network model.
  • the sequence of images are captured by a camera of a client device 104 under different lighting conditions while an object is moving.
  • a server 102 obtains the sequence of images from the client device 104 or has its own camera capturing a sequence of images by itself.
  • the server 102 encodes the sequence of images to the motion field using a training process, and returns the motion field to the client device 104 or provides the motion fields to other client devices.
  • Each client device 104 decodes the motion field to depth field maps, depth variation maps, and/or surface normal maps corresponding to the sequence of images, and can enable an extended reality environment based on the motion field.
  • information of the object is stored and communicated in a distinct data format (i.e., a motion field represented by a neural network model), which is more reliable and utilizes computational and storage resources of the data processing environment 100 more efficiently.
  • FIG 2 is a block diagram illustrating an electronic system 200 configured to process content data (e.g., image data), in accordance with some embodiments.
  • the electronic system 200 includes a server 102, a client device 104 (e.g., AR glasses 104D in Figure 1), a storage 106, or a combination thereof.
  • the electronic system 200 typically, includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset).
  • the electronic system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls.
  • the client device 104 of the electronic system 200 uses a microphone for voice recognition or a camera 260 for gesture recognition to supplement or replace the keyboard.
  • the client device 104 includes one or more optical cameras 260 (e.g., an RGB camera), scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices.
  • the electronic system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays.
  • Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non- transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
  • Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
  • User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
  • Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
  • Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
  • One or more user applications 224 for execution by the electronic system 200 e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices;
  • Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104, where in some embodiments, the model training module 226 of a server 102 trains a neural network model based on a sequence of images using a reconstruction loss Lrec, e.g., in Figure 8A;
  • Data processing module 228 for processing content data using data processing models 250, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224, and in an example, the data processing module 228 of a client device 104 is applied to recover a normal map of each image in the sequence of images using a neural network model, e.g., in Figure 8B; and
  • One or more databases 240 for storing at least data including one or more of:
    o Device settings 242 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;
    o User account information 244 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings;
    o Network parameters 246 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server, and host name;
    o Training data 248 for training one or more data processing models 250;
    o Data processing model(s) 250 for processing content data (e.g., video, image, audio, or textual data) using deep learning techniques, where the data processing models 250 include a neural network model that encodes image information of a sequence of images; and
    o Content data and results 254 that are obtained by and outputted to the client device 104 of the electronic system 200, respectively.
  • the one or more databases 240 are stored in one of the server 102, client device 104, and storage 106 of the electronic system 200.
  • the one or more databases 240 are distributed in more than one of the server 102, client device 104, and storage 106 of the electronic system 200.
  • more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 250 are stored at the server 102 and storage 106, respectively.
  • Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above.
  • The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules, or data structures, and thus various subsets of these modules may be combined or otherwise rearranged in various embodiments.
  • memory 206, optionally, stores a subset of the modules and data structures identified above.
  • memory 206, optionally, stores additional modules and data structures not described above.
  • FIG. 3 is an example data processing system 300 for training and applying a neural network based (NN-based) data processing model 250 for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments.
  • the data processing system 300 includes a model training module 226 for establishing the data processing model 250 and a data processing module 228 for processing the content data using the data processing model 250.
  • both of the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 306 to the client device 104.
  • the training data source 304 is optionally a server 102 or storage 106.
  • both of the model training module 226 and the data processing module 228 are located on a server 102 of the data processing system 300.
  • the training data source 304 providing the training data 306 is optionally the server 102 itself, another server 102, or the storage 106.
  • the model training module 226 and the data processing module 228 are separately located on a server 102 and client device 104, and the server 102 provides the trained data processing model 250 to the client device 104.
  • the model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312.
  • the data processing model 250 is trained according to a type of the content data to be processed.
  • the training data 306 is consistent with the type of the content data, and so is the data pre-processing module 308 that is applied to process the training data 306.
  • an image pre-processing module 308A is configured to process image training data 306 to a predefined image format, e.g., extract a region of interest (ROI) in each training image, and crop each training image to a predefined image size.
  • an audio pre-processing module 308B is configured to process audio training data 306 to a predefined audio format, e.g., converting each training sequence to a frequency domain using a Fourier transform.
  • the model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing data processing model 250, and generates an output from each training data item.
  • the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item.
  • the model training engine 310 modifies the data processing model 250 to reduce the loss function, until the loss function satisfies a loss criteria (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold).
  • the modified data processing model 250 is provided to the data processing module 228 to process the content data.
  • the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.
  • the data processing module 228 includes a data pre-processing module 314, a model-based processing module 316, and a data post-processing module 318.
  • the data pre-processing module 314 pre-processes the content data based on the type of the content data. Functions of the data pre-processing module 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of: video, image, audio, textual, and other types of data.
  • each image is pre-processed to extract an ROI or cropped to a predefined image size
  • an audio clip is pre-processed to convert to a frequency domain using a Fourier transform.
  • the content data includes two or more types, e.g., video data and textual data.
  • the model-based processing module 316 applies the trained data processing model 250 provided by the model training module 226 to process the pre-processed content data.
  • the model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing model 250.
  • the processed content data is further processed by the data postprocessing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.
  • Figure 4A is an example neural network (NN) 400 applied to process content data in an NN-based data processing model 250, in accordance with some embodiments
  • Figure 4B is an example node 420 in the neural network (NN) 400, in accordance with some embodiments.
  • the data processing model 250 is established based on the neural network 400.
  • a corresponding model-based processing module 316 applies the data processing model 250 including the neural network 400 to process content data that has been converted to a predefined content format.
  • the neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the one or more node inputs.
  • the node output is provided via one or more links 412 to one or more other nodes 420
  • a weight w associated with each link 412 is applied to the node output.
  • the one or more node inputs are combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function.
  • the propagation function is a product of a non-linear activation function and a linear weighted combination of the one or more node inputs.
  • the collection of nodes 420 is organized into one or more layers in the neural network 400.
  • the one or more layers includes a single layer acting as both an input layer and an output layer.
  • the one or more layers includes an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input and output layers 402 and 406.
  • a deep neural network has more than one hidden layer 404 between the input and output layers 402 and 406. In the neural network 400, each layer is only connected with its immediately preceding and/or immediately following layer.
  • a layer 402 or 404B is a fully connected layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer.
  • one of the one or more hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for down sampling or pooling the nodes 420 between these two layers.
  • max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.
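  • A toy, hedged illustration of this pooling step (the values and the 2×2 window are arbitrary):

```python
# Max pooling over a small group of nodes: the output node keeps the maximum.
import torch
import torch.nn.functional as F

nodes = torch.tensor([[1.0, 3.0], [2.0, 0.5]]).reshape(1, 1, 2, 2)  # N, C, H, W
print(F.max_pool2d(nodes, kernel_size=2))  # tensor([[[[3.]]]])
```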
  • a convolutional neural network is applied in a data processing model 250 to process content data (particularly, video and image data).
  • the CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that only moves data forward from the input layer 402 through the hidden layers to the output layer 406.
  • the one or more hidden layers of the CNN are convolutional layers convolving with a multiplication or dot product.
  • Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolution layer in the convolutional neural network.
  • Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN.
  • the pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map.
  • a recurrent neural network (RNN) is applied in the data processing model 250 to process content data (particularly, textual and audio data). Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior.
  • each node 420 of the RNN has a time-varying real-valued activation.
  • the RNN examples include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM network), an echo state network, an independently RNN (IndRNN), a recursive neural network, and a neural history compressor.
  • the RNN can be used for handwriting or speech recognition.
  • the training process is a process for calibrating all of the weights wi for each layer of the learning model using a training data set which is provided in the input layer 402.
  • the training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied.
  • In forward propagation, the set of weights for different layers are applied to the input data and intermediate results from the previous layers.
  • In backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error.
  • the activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types.
  • a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied.
  • the network bias b provides a perturbation that helps the NN 400 avoid overfitting the training data.
  • the result of the training includes the network bias parameter b for each layer.
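  • Combining the elements above, the computation at a node 420 can be written in the standard compact form below, where φ is the activation function, wi are the link weights, xi are the node inputs, and b is the network bias; this restates the description rather than adding to it.

```latex
y \;=\; \varphi\!\left(\sum_{i} w_i \, x_i \;+\; b\right)
```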
  • FIG. 5 is a flowchart of a process 500 for encoding a sequence of images 502 to a motion field and decoding the motion field, in accordance with some embodiments.
  • a camera 260 is disposed and has a fixed camera location in a scene.
  • the camera 260 is configured to capture the sequence of images 502 in the scene.
  • the sequence of images correspond to a frame rate (e.g., 30 frames per second (FPS)) and have a number of images, and each image has a predefined resolution.
  • An object 506 is located in a field of view of the camera 260 and recorded in each of the sequence of images 502.
  • the camera 260 consecutively captures the sequence of images 502, and the object 506 is moving in the field of view.
  • Lighting conditions are varied in synchronization with image capturing by the camera 260.
  • the images 502 are separately and independently captured to include the same object 506 under the different lighting conditions. Independently of how the images 502 are captured, each image 502 corresponds to a distinct object pose, a distinct lighting condition, or both.
  • Each image 502 has a temporal location within the sequence of images 502, and each pixel in the respective image 502 has a two-dimensional (2D) pixel location within the respective image.
  • the information of the pixels of the sequence of images 502 is encoded into the motion field, which can be provided to different electronic devices to be decoded to recover the 3D information of the object 506 in the scene.
  • Pixels in each image 502 are represented by the 2D pixel location (x, y) within the respective image and the temporal location t.
  • each image 502 has a resolution of 2048×1536 pixels.
  • the pixel locations x and y are located in a first range of 0-2047 and a second range of 0-1535, respectively.
  • the temporal location t represents an order n of this image 502 in the sequence of images, and corresponds to a duration of time td that has passed since the first image of the sequence of images 502 has been captured.
  • the duration of time td is equal to (n−1)/F, where F is the frame rate of the sequence of images.
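  • As a worked example of this relation (the frame index and frame rate below are illustrative values only):

```latex
t_d \;=\; \frac{n-1}{F}, \qquad n = 31,\; F = 30\ \text{FPS} \;\;\Rightarrow\;\; t_d = \frac{31-1}{30} = 1\ \text{s}.
```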
  • a motion field of the sequence of images 502 is represented by a neural network model 504, and is generated to encode at least a depth value 508 and an interframe depth variation (Δd) 510 of each pixel in the sequence of images 502.
  • the neural network model includes a plurality of layers, and each layer includes a plurality of filters each of which is associated with a plurality of weights w.
  • the motion field provides a set of weights w of the neural network model.
  • the client device 104 can automatically recover the depth value 508 and interframe depth variation 510 of each pixel in the sequence of images 502 based on the temporal location (t) and pixel location (x, y) of the respective pixel, without obtaining any of the sequence of images 502. Stated another way, information of the scene captured by the sequence of images 502 is encoded in the neural network model 504.
  • the neural network model 504 representing the motion field includes a motion field model 504A and a depth albedo field model 504B.
  • the motion field model 504A receives the temporal location (t) and pixel location (x, y) of the respective pixel, and generates at least an interframe pixel displacement (Δx, Δy) and the interframe depth variation (Δd) 510 of the respective pixel, i.e., a motion vector (Δx, Δy, Δd) of the respective pixel.
  • the motion vector (Δx, Δy, Δd) corresponds to displacements of a vector (x, y, d), and represents how a point on a surface of the object 506 moves at the pixel location (x, y) and at a time t.
  • the depth albedo field model 504B is coupled to the motion field model 504A, and converts the pixel location (x, y) and interframe pixel displacement (Δx, Δy) of the respective pixel to a respective albedo value (a) 512 and the respective depth value 508. More specifically, a new pixel location (x+Δx, y+Δy) is fed to the depth albedo field model 504B to determine the respective albedo value (a) 512 and the respective depth value 508.
  • the motion vector (Δx, Δy, Δd) aligns all pixels from different time instances to a common one, allowing the depth and albedo values 508 and 512 of the surface point (x, y) in the aligned time instance t to be predicted by the depth albedo field model 504B.
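  • The two-model decomposition just described can be sketched as follows; it refines the single-network sketch given earlier, and all names, layer sizes, and the chaining helper are illustrative assumptions.

```python
# Sketch of chaining a motion field model (504A) with a depth albedo field
# model (504B): 504A maps (x, y, t) to a motion vector (dx, dy, dd), and 504B
# maps the aligned location (x + dx, y + dy) to (albedo, depth). Names and
# sizes are illustrative assumptions.
import torch
import torch.nn as nn

def mlp(in_dim: int, out_dim: int, hidden: int = 256) -> nn.Sequential:
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

motion_field = mlp(3, 3)        # 504A: (x, y, t) -> (dx, dy, dd)
depth_albedo = mlp(2, 2)        # 504B: (x + dx, y + dy) -> (albedo, depth)

def query(motion_field, depth_albedo, xyt: torch.Tensor):
    dx, dy, dd = motion_field(xyt).unbind(-1)                # motion vector
    xy_aligned = xyt[..., :2] + torch.stack([dx, dy], dim=-1)
    albedo, depth = depth_albedo(xy_aligned).unbind(-1)
    return depth, dd, albedo                                  # d, delta_d, a
```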
  • Each pixel has a normal n, which is a vector perpendicular to a surface of the object 506 at the respective pixel.
  • a client device 104 obtains, e.g., from a server 102, the motion field including a set of weights w of the neural network model 504.
  • the client device 104 does not need to obtain any of the sequence of images 502; the client device 104 automatically recovers the depth value 508 and interframe depth variation 510 of each pixel in the sequence of images 502 from the neural network model 504 based on the temporal location (t) and pixel location (x, y) of the respective pixel.
  • the client device 104 executes a user application (e.g., a gaming application) configured to reconstruct a gaming scene including the object 506 based on the depth value 508 and interframe depth variation 510 of each pixel in the sequence of images 502.
  • the neural network model 504 includes a first neural network model and is specific to a first scene.
  • the sequence of images 502 are captured in the first scene, and a first object 506 is located in the field of view of the camera.
  • a second sequence of images 502 captured in a distinct second scene are encoded into a distinct second neural network model.
  • a third sequence of images 502 captured in the same first scene and directed to a distinct second object are encoded into a distinct third neural network model.
  • the first, second, and third neural network models are distinct from each other.
  • An input to the neural network model 504 is the pixel and temporal locations (x, y, t), where (x, y) is the pixel location in each image and t is a temporal location or time stamp in the sequence of images 502.
  • Each pixel in the sequence of images 502 is parameterized as a unique location vector (x, y, t), and corresponds to an output of the normal at the pixel location (x, y) and at the time stamp t.
  • the neural network model 504 can predict the surface normal at the pixel location (x, y), thereby creating a dynamic 3D reconstruction of the scene and object 506.
  • each location variable in the pixel and temporal locations (x, y, t) has a respective range.
  • the ranges for the pixel location (x, y) are defined by a resolution of each image 502, and the pixel location (x, y) does not go beyond the range defined by the resolution.
  • the range for the temporal location t is defined by the number of images in the sequence of images.
  • the object 506 undergoes a deformable motion when the sequence of images 502 are captured.
  • Although the deformable motion is not rigid, the neural network model 504 that encodes the sequence of images 502 including the deformable motion of the object 506 can be decoded to provide information of the object 506, e.g., a surface normal at each pixel of a surface of the object 506.
  • the object 506 has non-uniform color or material.
  • the object 506 has uniform color or material.
  • the neural network model 504 that encodes the sequence of images 502 including the color or material of the object 506 can be decoded to provide information of the object 506, e.g., a surface normal at each pixel of a surface of the object 506.
  • a high-speed camera 260 is applied to capture the sequence of images 502.
  • No motion compensation is applied between two images if the two images 502 are captured within a predefined duration of time (e.g., within 10 milliseconds). Stated another way, these two images 502 are close in time, such that the object 506 remains substantially still and a corresponding variation of the object 506 between these two images 502 is negligible.
  • Figure 6A is a sequence of images 502 captured under different lighting conditions, in accordance with some embodiments.
  • Figures 6B and 6C are surface normal maps 600 and relative depth maps 620 corresponding to the sequence of images in Figure 6A, in accordance with some embodiments.
  • the sequence of images 502 include five images 502A, 502B, 502C, 502D, and 502E captured at times t0−2, t0−1, t0, t0+1, and t0+2, where t0 is a temporal position of the third image 502C.
  • each image of the sequence of images 502 includes one of an RGB color image, a CMYK color image, a Lab color image, a monochromatic color image, and a grayscale image.
  • the sequence of images 502 are successively captured by a camera at a frame rate. In some embodiments, the sequence of images 502 are five images selected from another sequence of images 502 that are successively captured by a camera at a frame rate. In some embodiments, the sequence of images 502 are five independently captured images including the same object exposed to the different lighting conditions.
  • Each image 502 corresponds to a distinct lighting condition in which the object 506 is illuminated from a different light direction.
  • a set of light sources are fixed on a capturing rig, and the set of light sources are controlled in synchronization with capturing the sequence of images 502A-502E.
  • the sequence of images 502 are captured sequentially and in synchronization with illumination created by the set of fixed light sources that are cyclically enabled to illuminate the field of view for respective shortened durations.
  • a plurality of fixed light sources include a first number of fixed light sources placed at the first number of fixed locations with respect to the camera 260 configured to capture the sequence of images 502, and the first number of fixed sources are enabled sequentially and in synchronization with the first number of consecutive images in the sequence of images 502.
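  • The frame-to-light assignment implied by this cyclic scheme can be sketched as below; the hardware triggering is abstracted away, and the modulo schedule and light count are assumptions consistent with the description.

```python
# Sketch of cycling a fixed set of light sources in sync with frame capture.
# Only the frame-to-light assignment is shown; triggering hardware is omitted.
NUM_LIGHTS = 5  # e.g., the five lighting conditions of images 502A-502E

def light_for_frame(frame_index: int) -> int:
    """Return the index of the fixed light source enabled for a given frame."""
    return frame_index % NUM_LIGHTS

for n in range(10):
    print(f"frame {n}: enable light {light_for_frame(n)}")
```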
  • a neural network model 504 is generated from and applied to encode the sequence of images 502.
  • Each pixel in the sequence of images 502 corresponds to a position vector (x, y, t), where (x, y) corresponds to a pixel position of the respective pixel in a corresponding image 502, and t is a temporal position of the corresponding image 502 in the sequence of images 502.
  • For each pixel in the sequence of images 502, the neural network model 504 generates the depth value 508 and interframe depth variation 510 based on the position vector (x, y, t).
  • the depth values 508 of all pixels of the sequence of images 502 are visualized in the five depth maps 620 corresponding to the five images 502A-502E, respectively.
  • the depth value 508 and interframe depth variation 510 of each pixel is converted to a normal vector n at the respective pixel, and the normal vector n has a surface normal value.
  • the surface normal values of all pixels of the sequence of images 502 are visualized in the five surface normal maps 600 corresponding to the five images 502A-502E, respectively.
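  • Surface normal maps such as those in Figure 6B are commonly visualized by mapping each component of the unit normal from [−1, 1] into an 8-bit color channel; the snippet below shows that common convention, which is an assumption for illustration rather than something prescribed here.

```python
# Common visualization of a normal map: map unit-normal components from
# [-1, 1] to 8-bit RGB. This convention is assumed for illustration only.
import numpy as np

def normals_to_rgb(normals: np.ndarray) -> np.ndarray:
    # normals: H x W x 3 array of unit normal vectors
    return ((normals + 1.0) * 0.5 * 255.0).astype(np.uint8)
```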
  • information of the object 506 is captured by the sequence of images 502.
  • the neural network model 504 is generated to encode the sequence of images 502, i.e., encode the information of the object.
  • the client device 104 can conveniently reconstruct the information of the object 506 based on the neural network model 504, e.g., according to the process 500.
  • the reconstructed information of the object 506 includes the surface normal maps 600 and relative depth maps 620, which can be applied to render the object 506 in a virtual reality environment if needed.
  • FIG. 7 is a flowchart of a process 700 for training a neural network model 504 for encoding a sequence of images 502 to a motion field, in accordance with some embodiments.
  • the process 700 is implemented at an electronic system 200 including a camera 260 and a light source system 702 that provides different lighting conditions needed by photometric stereo.
  • the light source system 702 enables five different lighting conditions applied in five images 502A-502E in Figure 6A.
  • the light source system 702 operates in synchronization with the camera 260, such that the sequence of images 502 are captured in the predefined different lighting conditions.
  • the different lighting conditions are set up in a studio, and the electronic system 200 is a server 102 that can operate with the studio to capture the sequence of images 502 for an object 506.
  • the electronic system 200 encodes the sequence of images 502 to the neural network model 504 based on the training process 700 that minimizes a reconstruction loss (Lrec) 704.
  • the neural network model 504 receives a position vector (x, y, t) of each pixel of the sequence of images 502 as inputs.
  • the neural network model 504 includes a motion field model 504A and a depth albedo field model 504B.
  • the motion field model 504A generates a motion vector (Δx, Δy, Δd) from the position vector (x, y, t) for each pixel, and the depth albedo field model 504B generates an albedo value (a) 512 and depth value (d) 508 for each pixel.
  • a normal n is generated from the depth value (d) 508 and depth variation (Δd) 510 using equation (1).
  • the electronic system 200 predicts a color value cpred based on the normal n, the albedo value (a) 512, and a direction vector lt of a light source in the light source system 702.
  • the reconstruction loss Lrec (704) is determined between the predicted color value cpred and an actual color value cgt of the respective pixel, where the actual color value cgt acts as a ground truth of the respective pixel.
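  • The corresponding equations are not reproduced in this text; under the Lambertian reflectance assumption stated earlier, a plausible reconstruction of the predicted color and the reconstruction loss is the following (the exact forms used may differ):

```latex
c_{\mathrm{pred}} \;=\; a\,\big(\mathbf{n}\cdot\mathbf{l}_t\big),
\qquad
L_{\mathrm{rec}} \;=\; \big\lVert\, c_{\mathrm{pred}} - c_{gt} \,\big\rVert^{2}
```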
  • the neural network model 504 representing the motion field is trained based on the reconstruction loss (Lrec) 704. Stated another way, the motion field model 504A and depth albedo field model 504B are trained jointly based on the reconstruction loss (Lrec) 704.
  • Once trained, the neural network model 504 can predict the surface normal n at each pixel location. This allows the surface normal n to be obtained for pixels of the sequence of images 502, thereby creating a dynamic 3D reconstruction of the object 506 captured in the sequence of images 502.
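  • An end-to-end sketch of training process 700 is given below; it reuses the query helper sketched earlier, and the shading model, loss form, and normal_from_depth helper are assumptions labeled as such in the comments.

```python
# Sketch of training process 700: jointly optimize the motion field model
# (504A) and depth albedo field model (504B) by minimizing a reconstruction
# loss between predicted and captured pixel colors. The Lambertian shading,
# squared-error loss, and normal_from_depth helper are assumptions.
import torch

def train(motion_field, depth_albedo, batches, steps=10_000, lr=1e-4):
    # batches: iterable of (xyt, light_dir, c_gt), with xyt (N, 3),
    # light_dir (N, 3), and c_gt the captured pixel color (N,).
    params = list(motion_field.parameters()) + list(depth_albedo.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _, (xyt, light_dir, c_gt) in zip(range(steps), batches):
        depth, delta_d, albedo = query(motion_field, depth_albedo, xyt)
        normal = normal_from_depth(xyt, depth, delta_d)     # hypothetical helper
        c_pred = albedo * (normal * light_dir).sum(dim=-1)  # assumed Lambertian
        loss = ((c_pred - c_gt) ** 2).mean()                # reconstruction loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return motion_field, depth_albedo
```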
  • Figure 8A is a flow diagram of a method 800 for processing information of one or more objects, in accordance with some embodiments.
  • Figure 8B is a flow diagram of a method 850 for decoding information of one or more objects 506, in accordance with some embodiments.
  • the methods 800 and 850 are described as being implemented by an electronic system 200 (e.g., a server 102, a client device 104, or a combination thereof).
  • Methods 800 and 850 are, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system.
  • Each of the operations shown in Figures 8A and 8B may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 in Figure 2).
  • the computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other nonvolatile memory device or devices.
  • the instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in methods 800 and 850 may be combined and/or the order of some operations may be changed.
  • an electronic device (e.g., a server 102) obtains (802) a sequence of images 502 that are captured sequentially in a field of view including the one or more objects 506, determines (804) a resolution of each image and a number of images in the sequence of images 502, and generates (806) a motion field of the sequence of images 502.
  • the motion field is represented (808) with a neural network model 504 and configured to encode at least a depth value 508 and an interframe depth variation 510 of each pixel in the sequence of images 502.
  • the neural network model 504 representing the motion field is configured (810) to, for each pixel in the sequence of images 502, convert a pixel location of the respective pixel in a respective image and a respective temporal location of the respective image in the sequence of images 502 to the depth value 508 and the interframe depth variation 510 of the respective pixel.
  • the electronic device encodes (812) the sequence of images 502 to image encoding information including the resolution of each image, the number of images in the sequence of images 502, and the motion field. It is noted that the resolution of each image is applied to define a location range for the pixel location and the number of images defines a temporal range for the temporal location of the respective image in the sequence of images 502.
  • the neural network model 504 representing the motion field includes (814) a motion field model 504A and a depth albedo field model 504B.
  • the motion field is generated by determining (816) the motion field model 504A for each pixel in the sequence of images 502 and determining (818) the depth albedo field model 504B for each pixel in the sequence of images 502.
  • the motion field model 504A is configured to convert the pixel location of the respective pixel in the respective image and the respective temporal location of the respective image in the sequence of images 502 to at least an interframe pixel displacement and the interframe depth variation 510 of the respective pixel.
  • the depth albedo field model 504B is configured to convert the pixel location and interframe pixel displacement of the respective pixel to a respective albedo value 512 and a respective depth value 508.
  • the electronic device predicts a color value from a pixel location of the respective pixel in a respective image and a respective temporal location of the respective image in the sequence of images 502 using the neural network model 504 representing the motion field, determines a reconstruction loss between the predicted color value and an actual color value of the respective pixel, and trains the neural network model 504 representing the motion field based on the reconstruction loss.
  • the electronic device predicts the color values by, for each pixel in the sequence of images 502, generating the depth value 508, the interframe depth variation 510, and an albedo value 512 from the pixel location of the respective pixel in the respective image and the respective temporal location of the respective image in the sequence of images 502 using the neural network model 504, generating a normal from the depth value 508 and interframe depth variation 510, and predicting the color value based on the normal, the albedo value 512, and a location of a corresponding light source.
  • the normal n, the predicted color value c_pred, and the reconstruction loss L_rec are represented as follows: $n = \left(\frac{\partial (d+\Delta d)}{\partial x}, \frac{\partial (d+\Delta d)}{\partial y}, -1\right) / \left\lVert \left(\frac{\partial (d+\Delta d)}{\partial x}, \frac{\partial (d+\Delta d)}{\partial y}, -1\right) \right\rVert$, $c_{pred} = a\,(n \cdot l_t)$, and $L_{rec} = \lVert c_{pred} - c_{gt} \rVert^2$, where d and Δd are the depth value 508 and interframe depth variation 510, respectively, x and y correspond to the two coordinates of the pixel location of the respective pixel in the respective image, a is the albedo value 512, l_t is the direction of the corresponding light source, and c_gt is the actual color value of the respective pixel.
  • the electronic device includes a first electronic device (e.g., a server 102).
  • a second electronic device (e.g., a client device 104) obtains (820) the image encoding information of the sequence of images 502, and regenerates (822) a normal map of each image in the sequence of images 502 from the image encoding information.
  • the normal map of each image includes a normal of each pixel of the respective image.
  • for each pixel in the respective image, in accordance with the neural network model 504 representing the motion field, the second electronic device converts a pixel location in the respective image and a respective temporal location of the respective image in the sequence of images 502 to the depth value 508 and the interframe depth variation 510 of the respective pixel, and determines the normal of the respective pixel based on the depth value 508 and depth variation 510 of the respective pixel. Further, in some embodiments, the second electronic device reconstructs the one or more objects 506 in a three-dimensional (3D) virtual scene based on at least a subset of the normal maps of the sequence of images 502. In some situations, the image encoding information is encoded in a server 102 and provided to a client device 104, and the client device 104 regenerates the normal map from the image encoding information.
  • the sequence of images 502 are captured sequentially and in synchronization with illumination that is created with a plurality of fixed light sources that are cyclically enabled to illuminate the field of view for respective shortened durations.
  • the one or more objects 506 are moving in the field of view during a duration of time in which the sequence of images 502 are captured.
  • a plurality of fixed light sources includes a first number of fixed light sources placed at the first number of fixed locations with respect to a camera configured to capture the sequence of images 502, and the first number of fixed sources are enabled sequentially and in synchronization with the first number of consecutive images in the sequence of images 502.
  • each image of the sequence of images 502 includes one of an RGB color image, a CMYA color image, a Lab color image, a monochromatic color image, and a gray scale image.
  • an electronic device obtains (852) image encoding information of a sequence of images 502 that are captured sequentially in a field of view including the one or more objects 506.
  • the image encoding information includes (854) a resolution of each image, a number of images in the sequence of images 502, and a motion field represented by a neural network model 504.
  • the electronic device generates (856) a normal map of each image in the sequence of images 502 from the image encoding information.
  • for each pixel in the respective image, in accordance with the neural network model 504 representing the motion field, the electronic device converts (858) a pixel location in the respective image and a respective temporal location of the respective image in the sequence of images 502 to a depth value 508 and an interframe depth variation 510 of the respective pixel, and determines (860) a normal of the respective pixel based on the depth value 508 and interframe depth variation 510 of the respective pixel.
  • the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context.
  • the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.
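For illustration, the training step summarized above can be sketched in code. The following is a minimal sketch assuming PyTorch, a squared-error reconstruction loss, a clamped Lambertian shading term, and the two-network interface described later in the description (a motion field network mapping (x, y, t) to (Δx, Δy, Δd), and a depth albedo network mapping the aligned pixel location to an albedo a and a depth d); helper names such as render_lambertian and training_step are hypothetical and are not taken from the disclosure.

```python
import torch
import torch.nn.functional as F

def render_lambertian(normal, albedo, light_dir):
    # Predicted color c_pred = a * max(0, n . l_t) for each pixel in the batch.
    return albedo * torch.clamp((normal * light_dir).sum(dim=1), min=0.0)

def training_step(motion, depth_albedo, optimizer, xyt, light_dir, c_gt):
    # xyt: (N, 3) pixel/temporal locations; light_dir: (3,) or (N, 3); c_gt: (N,) colors.
    xyt = xyt.clone().requires_grad_(True)
    delta = motion(xyt)                                  # (dx, dy, dd) per pixel
    out = depth_albedo(xyt[:, :2] + delta[:, :2])        # aligned location -> (a, d)
    albedo, depth = out[:, 0], out[:, 1] + delta[:, 2]   # total depth d + dd

    # Surface normal from the spatial gradients of depth (orthographic assumption).
    grad = torch.autograd.grad(depth.sum(), xyt, create_graph=True)[0][:, :2]
    normal = torch.cat([grad, -torch.ones_like(grad[:, :1])], dim=1)
    normal = normal / normal.norm(dim=1, keepdim=True)

    c_pred = render_lambertian(normal, albedo, light_dir)
    loss = F.mse_loss(c_pred, c_gt)                      # reconstruction loss L_rec
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a typical run, the optimizer would be constructed over the parameters of both sub-networks (e.g., torch.optim.Adam over the motion field and depth albedo parameters together), so that the two models are trained jointly as described above.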

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

This application is directed to processing information of one or more objects existing in a field of view of a camera. An electronic system obtains a sequence of images that are captured sequentially in the field of view including the one or more objects, determines a resolution of each image and a number of images in the sequence of images, and generates a motion field of the sequence of images. The motion field is represented with a neural network model and configured to encode at least a depth value and an interframe depth variation of each pixel in the sequence of images, e.g., based on a pixel location and a temporal location of the respective pixel. The sequence of images are encoded to image encoding information including the resolution of each image, the number of images in the sequence of images, and the motion field.

Description

Photometric Stereo for Dynamic Surface with Motion Field
TECHNICAL FIELD
[0001] This application relates generally to image processing including, but not limited to, methods, systems, and non-transitory computer-readable media for encoding information associated with a sequence of images captured at a scene in a predefined data format that facilitates extraction of related information.
BACKGROUND
[0002] Virtual content is oftentimes rendered based on image information collected from a scene or object in real life. Three dimensional (3D) image information of the scene or object can be represented by surface normals of objects that are obtained by photometric stereo in which the scene or object is observed under different lighting conditions.
Photometric stereo is optionally applied to static or dynamic surfaces. Particularly, various techniques (e.g., rigid transformation, color multiplexing, and time multiplexing) are used to capture dynamic surfaces using fast changing light sources that provide illumination from different directions. However, these techniques have specific requirements and can introduce significant errors or fail to apply in many situations. It would be beneficial to have an image encoding mechanism that encodes the image information of a real scene or object and facilitates virtual content rendering more accurately and efficiently than the current practice of communicating 3D image information of the scene using the images or related surface normals.
SUMMARY
[0003] Various embodiments of this application are directed to encoding information associated with a sequence of images concerning a real scene or object in a predefined data format, e.g., a motion field represented by a neural network model. The encoded image information provides at least a depth value and an interframe depth variation of each pixel in the sequence of images. Upon receiving the encoded image information, an electronic system can regenerate a normal map of each image in the sequence of images, and the normal map of each image includes a normal of each pixel of the respective image. A virtual scene or object can be reconstructed in a 3D virtual scene based on at least a subset of the normal maps of the sequence of images. By these means, the predefined data format provides an accurate and efficient image encoding mechanism to encode image information of the real scene or object and facilitate virtual content rendering.
[0004] In one aspect, a method is implemented at an electronic system for processing information of one or more objects. The method includes obtaining a sequence of images that are captured sequentially in a field of view including the one or more objects, determining a resolution of each image and a number of images in the sequence of images, and generating a motion field of the sequence of images. The motion field is represented with a neural network model and configured to encode at least a depth value and an interframe depth variation of each pixel in the sequence of images. The method further includes encoding the sequence of images to image encoding information including the resolution of each image, the number of images in the sequence of images, and the motion field. In some embodiments, the neural network model representing the motion field is configured to for each pixel in the sequence of images, convert a pixel location of the respective pixel in a respective image and a respective temporal location of the respective image in the sequence of images to the depth value and the interframe depth variation of the respective pixel.
[0005] In another aspect, a method is implemented at an electronic system to decode information of one or more objects. The method includes obtaining image encoding information of a sequence of images that are captured sequentially in a field of view including the one or more objects. The image encoding information includes a resolution of each image, a number of images in the sequence of images, and a motion field represented by a neural network model. The method further includes generating a normal map of each image in the sequence of images from the image encoding information. Generation of the normal map of each image further includes, for each pixel in the respective image, in accordance with the neural network model representing the motion field, converting a pixel location in the respective image and a respective temporal location of the respective image in the sequence of images to a depth value and an interframe depth variation of the respective pixel. Generation of the normal map of each image further includes determining a normal of the respective pixel based on the depth value and interframe depth variation of the respective pixel. In some embodiments, the method further includes reconstructing the one or more objects in a 3D virtual scene based on at least a subset of the normal maps of the sequence of images.
[0006] In another aspect, some implementations include an electronic system that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.
[0007] In yet another aspect, some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.
[0008] These illustrative embodiments and implementations are mentioned not to limit or define the disclosure, but to provide examples to aid understanding thereof. Additional embodiments are discussed in the Detailed Description, and further description is provided there.
BRIEF DESCRIPTION OF THE DRAWINGS
[0009] For a better understanding of the various described implementations, reference should be made to the Detailed Description below, in conjunction with the following drawings in which like reference numerals refer to corresponding parts throughout the figures.
[0010] Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
[0011] Figure 2 is a block diagram illustrating an electronic system configured to process content data (e.g., image data), in accordance with some embodiments.
[0012] Figure 3 is an example data processing environment for training and applying a neural network-based data processing model for processing visual and/or audio data, in accordance with some embodiments.
[0013] Figure 4A is an example neural network applied to process content data in an NN-based data processing model, in accordance with some embodiments, and Figure 4B is an example node in the neural network, in accordance with some embodiments.
[0014] Figure 5 is a flowchart of a process for encoding a sequence of images to a motion field or decoding the motion field, in accordance with some embodiments.
[0015] Figure 6A is a sequence of images captured under different lighting conditions, in accordance with some embodiments.
[0016] Figures 6B and 6C are surface normal maps and relative depth maps corresponding to the sequence of images in Figure 6A, in accordance with some embodiments.
[0017] Figure 7 is a flowchart of a process for training a neural network model for encoding a sequence of images to a motion field, in accordance with some embodiments.
[0018] Figure 8A is a flow diagram of a method for processing information of one or more objects, in accordance with some embodiments, and Figure 8B is a flow diagram of a method decoding information of one or more objects, in accordance with some embodiments.
[0019] Like reference numerals refer to corresponding parts throughout the several views of the drawings.
DETAILED DESCRIPTION
[0020] Reference will now be made in detail to specific embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous non-limiting specific details are set forth in order to assist in understanding the subject matter presented herein. But it will be apparent to one of ordinary skill in the art that various alternatives may be used without departing from the scope of claims and the subject matter may be practiced without these specific details. For example, it will be apparent to one of ordinary skill in the art that the subject matter presented herein can be implemented on many types of electronic devices with digital video capabilities.
[0021] In various embodiments of this application, surfaces of objects are reconstructed dynamically using photometric stereo. Photometric Stereo is a computer vision technique applied to reconstruct surface normals of objects by observing the objects under different lighting conditions. Surfaces of the objects comply with a Lambertian reflectance assumption in which luminance of the surfaces is isotropic and appears to be the same independently of an angle of view. A sequence of successive images are captured for the objects that are under rigid movement or deformation while different light conditions are applied. For example, successive images are captured to record an object that rotates on a turntable or a human face having changing expressions. These successive images are associated with different light conditions and different object poses, and therefore, include three dimension (3D) information of surfaces of the objects. This application focuses on encoding the 3D information of surfaces of the objects in a motion field represented by one or more neural network models, which can be decoded to reconstruct models of dynamic surfaces of the objects including a series of surface reconstructions (e.g., a form of 3D meshes) that can be used in gaming, animation, and telepresence applications.
[0022] Figure 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments. The one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, head-mounted display (HMD) (also called augmented reality (AR) glasses) 104D, or intelligent, multi-sensing, network- connected home devices (e.g., a surveillance camera 104E, a smart television device, a drone). Each client device 104 can collect data or user inputs, executes user applications, and present outputs on its user interface. The collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102. The one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, processes the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104. In some embodiments, the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
[0023] The one or more servers 102 are configured to enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 are configured to implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104. For example, the client devices 104 include a game console (e.g., the HMD 104D) that executes an interactive online gaming application. The game console receives a user instruction and sends it to a game server 102 with user data. The game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console. In various embodiments of this application, the stream of video data includes a motion field represented by a neural network model, and when provided to the game console, can be decoded to depth information and surface normals of a gaming scene.
[0024] In another example, the client devices 104 include a networked surveillance camera 104E and a mobile phone 104C. The networked surveillance camera 104E collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera 104E, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera 104E in real time and remotely.
[0025] The one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100. The one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof. The one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol. A connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof. As such, the one or more communication networks 108 can represent the Internet of a worldwide collection of networks and gateways that use the Transmission Control Protocol/Intemet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other computer systems that route data and messages.
[0026] In some embodiments, deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data. The content data may broadly include inertial sensor data captured by inertial sensor(s) of a client device 104. In these deep learning techniques, data processing models are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data. Subsequently to model training, the mobile phone 104C or HMD 104D obtains the content data (e.g., captures video data via an internal camera) and processes the content data using the data processing models locally.
[0027] In some embodiments, both model training and data processing are implemented locally at each individual client device 104 (e.g., the mobile phone 104C and HMD 104D). The client device 104 obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models. Alternatively, in some embodiments, both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104 (e.g. the client device 104A and HMD 104D). The server 102A obtains the training data from itself, another server 102 or the storage 106 applies the training data to train the data processing models. The client device 104 obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results (e.g., recognized hand gestures) from the server 102A, presents the results on a user interface (e.g., associated with the application), renders virtual objects in a field of view based on the poses, or implements some other functions based on the results. The client device 104 itself implements no or little data processing on the content data prior to sending them to the server 102 A. Additionally, in some embodiments, data processing is implemented locally at a client device 104 (e.g., the client device 104B and HMD 104D), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104. The server 102B obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models. The trained data processing models are optionally stored in the server 102B or storage 106. The client device 104 imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface or used to initiate some functions (e.g., rendering virtual objects based on device poses) locally.
[0028] In some embodiments, a pair of AR glasses 104D (also called an HMD) are communicatively coupled in the data processing environment 100. The AR glasses 104D includes a camera, a microphone, a speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display. The camera and microphone are configured to capture video and audio data from a scene of the AR glasses 104D, while the one or more inertial sensors are configured to capture inertial sensor data. In some situations, the camera captures hand gestures of a user wearing the AR glasses 104D, and recognizes the hand gestures locally and in real time using a two-stage hand gesture recognition model. In some situations, the microphone records ambient sound, including user’s voice commands. In some situations, both video or static visual data captured by the camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses. The video, static image, audio, or inertial sensor data captured by the AR glasses 104D is processed by the AR glasses 104D, server(s) 102, or both to recognize the device poses. Optionally, deep learning techniques are applied by the server(s) 102 and AR glasses 104D jointly to recognize and predict the device poses. The device poses are used to control the AR glasses 104D itself or interact with an application (e.g., a gaming application) executed by the AR glasses 104D. In some embodiments, the display of the AR glasses 104D displays a user interface, and the recognized or predicted device poses are used to render or interact with user selectable display items (e.g., an avatar) on the user interface.
[0029] As explained above, in some embodiments, deep learning techniques are applied in the data processing environment 100 to process video data, static image data, or inertial sensor data captured by the AR glasses 104D. 2D or 3D device poses are recognized and predicted based on such video, static image, and/or inertial sensor data using a first data processing model. Visual content is optionally generated using a second data processing model. Training of the first and second data processing models is optionally implemented by the server 102 or AR glasses 104D. Inference of the device poses and visual content is implemented by each of the server 102 and AR glasses 104D independently or by both of the server 102 and AR glasses 104D jointly.
[0030] This application is directed to encoding object information captured in a sequence of images to a motion field represented by a neural network model. The sequence of images are captured by a camera of a client device 104 under different lighting conditions while an object is moving. A server 102 obtains the sequence of images from the client device 104 or has its own camera capturing a sequence of images by itself. The server 102 encodes the sequence of images to the motion field using a training process, and returns the motion field to the client device 104 or provides the motion fields to other client devices. Each client device 104 decodes the motion field to depth field maps, depth variation maps, and/or surface normal maps corresponding to the sequence of images, and can enable an extended reality environment based on the motion field. As such, information of the object is stored and communicated in a distinct data format (i.e., a motion field represented by a neural network model), which is more reliable and utilizes computational and storage resources of the data processing environment 100 more efficiently.
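As one possible illustration of the data format exchanged in this flow, the sketch below shows an assumed container for the image encoding information (the per-image resolution, the number of images, and the weights of the neural network model representing the motion field). The name ImageEncodingInfo and the use of torch.save/torch.load for serialization are illustrative choices, not details of the disclosure.

```python
from dataclasses import dataclass
import torch

@dataclass
class ImageEncodingInfo:
    width: int          # pixel-location range for x: [0, width - 1]
    height: int         # pixel-location range for y: [0, height - 1]
    num_images: int     # temporal-location range for t
    model_state: dict   # weights of the neural network model representing the motion field

def encode_to_payload(info: ImageEncodingInfo, path: str) -> None:
    # Server side: serialize the encoding information instead of the raw images.
    torch.save({"width": info.width, "height": info.height,
                "num_images": info.num_images, "model_state": info.model_state}, path)

def decode_payload(path: str) -> ImageEncodingInfo:
    # Client side: restore the ranges and the network weights before decoding.
    blob = torch.load(path, map_location="cpu")
    return ImageEncodingInfo(blob["width"], blob["height"],
                             blob["num_images"], blob["model_state"])
```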
[0031] Figure 2 is a block diagram illustrating an electronic system 200 configured to process content data (e.g., image data), in accordance with some embodiments. The electronic system 200 includes a server 102, a client device 104 (e.g., AR glasses 104D in Figure 1), a storage 106, or a combination thereof. The electronic system 200 , typically, includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset). The electronic system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls. Furthermore, in some embodiments, the client device 104 of the electronic system 200 uses a microphone for voice recognition or a camera 260 for gesture recognition to supplement or replace the keyboard. In some embodiments, the client device 104 includes one or more optical cameras 260 (e.g., an RGB camera), scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices. The electronic system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays.
[0032] Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non- transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
  • Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks;
  • Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
• User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
• Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
• Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
• One or more user applications 224 for execution by the electronic system 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices);
• Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104, where in some embodiments, the model training module 226 of a server 102 trains a neural network model based on a sequence of images using a reconstruction loss Lrec, e.g., in Figure 8A;
  • Data processing module 228 for processing content data using data processing models 250, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224, and in an example, the data processing module 228 of a client device 104 is applied to recover a normal map of each image in the sequence of images using a neural network model, e.g., in Figure 8B; and
  • One or more databases 240 for storing at least data including one or more of:
    o Device settings 242 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;
    o User account information 244 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings;
    o Network parameters 246 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server and host name;
    o Training data 248 for training one or more data processing models 250;
    o Data processing model(s) 250 for processing content data (e.g., video, image, audio, or textual data) using deep learning techniques, where the data processing models 250 includes a neural network model that encodes image information of a sequence of images;
    o Content data and results 254 that are obtained by and outputted to the client device 104 of the electronic system 200, respectively.
[0033] Optionally, the one or more databases 240 are stored in one of the server 102, client device 104, and storage 106 of the electronic system 200 . Optionally, the one or more databases 240 are distributed in more than one of the server 102, client device 104, and storage 106 of the electronic system 200 . In some embodiments, more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 250 are stored at the server 102 and storage 106, respectively.
[0034] Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules or data structures, and thus various subsets of these modules may be combined or otherwise re-arranged in various embodiments. In some embodiments, memory 206, optionally, stores a subset of the modules and data structures identified above. Furthermore, memory 206, optionally, stores additional modules and data structures not described above.
[0035] Figure 3 is an example data processing system 300 for training and applying a neural network based (NN-based) data processing model 250 for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments. The data processing system 300 includes a model training module 226 for establishing the data processing model 250 and a data processing module 228 for processing the content data using the data processing model 250. In some embodiments, both of the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 306 to the client device 104. The training data source 304 is optionally a server 102 or storage 106. Alternatively, in some embodiments, both of the model training module 226 and the data processing module 228 are located on a server 102 of the data processing system 300. The training data source 304 providing the training data 306 is optionally the server 102 itself, another server 102, or the storage 106. Additionally, in some embodiments, the model training module 226 and the data processing module 228 are separately located on a server 102 and client device 104, and the server 102 provides the trained data processing model 250 to the client device 104.
[0036] The model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312. The data processing model 250 is trained according to a type of the content data to be processed. The training data 306 is consistent with the type of the content data, so is a data pre-processing module 308 applied to process the training data 306 consistent with the type of the content data. For example, an image pre-processing module 308A is configured to process image training data 306 to a predefined image format, e.g., extract a region of interest (ROI) in each training image, and crop each training image to a predefined image size. Alternatively, an audio pre-processing module 308B is configured to process audio training data 306 to a predefined audio format, e.g., converting each training sequence to a frequency domain using a Fourier transform. The model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing data processing model 250, and generates an output from each training data item. During this course, the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item. The model training engine 310 modifies the data processing model 250 to reduce the loss function, until the loss function satisfies a loss criteria (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold). The modified data processing model 250 is provided to the data processing module 228 to process the content data.
[0037] In some embodiments, the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.
[0038] The data processing module 228 includes a data pre-processing module 314, a model-based processing module 316, and a data post-processing module 318. The data pre-processing module 314 pre-processes the content data based on the type of the content data. Functions of the data pre-processing module 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of: video, image, audio, textual, and other types of data. For example, each image is pre-processed to extract an ROI or cropped to a predefined image size, and an audio clip is pre-processed to convert to a frequency domain using a Fourier transform. In some situations, the content data includes two or more types, e.g., video data and textual data. The model-based processing module 316 applies the trained data processing model 250 provided by the model training module 226 to process the pre-processed content data. The model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing model 250. In some embodiments, the processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.
[0039] Figure 4A is an example neural network (NN) 400 applied to process content data in an NN-based data processing model 250, in accordance with some embodiments, and Figure 4B is an example node 420 in the neural network (NN) 400, in accordance with some embodiments. The data processing model 250 is established based on the neural network 400. A corresponding model-based processing module 316 applies the data processing model 250 including the neural network 400 to process content data that has been converted to a predefined content format. The neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the one or more node inputs. As the node output is provided via one or more links 412 to one or more other nodes 420, a weight w associated with each link 412 is applied to the node output. Likewise, the one or more node inputs are combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function. In an example, the propagation function is a product of a non-linear activation function and a linear weighted combination of the one or more node inputs.
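The node propagation just described can be illustrated with a minimal sketch; the rectified linear activation and the helper name node_output are assumptions made for illustration.

```python
import numpy as np

def node_output(inputs: np.ndarray, weights: np.ndarray, bias: float = 0.0) -> float:
    # Linear weighted combination of the node inputs (weights w1..w4), plus an
    # optional network bias term b, followed by a non-linear activation.
    z = float(np.dot(weights, inputs) + bias)
    return max(0.0, z)  # e.g., a rectified linear activation

# A node 420 with four inputs combined by weights w1..w4.
print(node_output(np.array([0.2, 0.5, 0.1, 0.9]), np.array([0.4, -0.3, 0.8, 0.1])))
```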
[0040] The collection of nodes 420 is organized into one or more layers in the neural network 400. Optionally, the one or more layers includes a single layer acting as both an input layer and an output layer. Optionally, the one or more layers includes an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input and output layers 402 and 406. A deep neural network has more than one hidden layers 404 between the input and output layers 402 and 406. In the neural network 400, each layer is only connected with its immediately preceding and/or immediately following layer. In some embodiments, a layer 402 or 404B is a fully connected layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer. In some embodiments, one of the one or more hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for down sampling or pooling the nodes 420 between these two layers. Particularly, max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.
[0041] In some embodiments, a convolutional neural network (CNN) is applied in a data processing model 250 to process content data (particularly, video and image data). The CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that only moves data forward from the input layer 402 through the hidden layers to the output layer 406. The one or more hidden layers of the CNN are convolutional layers convolving with a multiplication or dot product. Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolution layer in the convolutional neural network. Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN. The pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map. By these means, video and image data can be processed by the CNN for video and image recognition, classification, analysis, imprinting, or synthesis. [0042] Alternatively and additionally, in some embodiments, a recurrent neural network (RNN) is applied in the data processing model 250 to process content data (particularly, textual and audio data). Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior. In an example, each node 420 of the RNN has a time-varying real-valued activation. Examples of the RNN include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM network), an echo state network, an independently RNN (IndRNN), a recursive neural network, and a neural history compressor. In some embodiments, the RNN can be used for handwriting or speech recognition. It is noted that in some embodiments, two or more types of content data are processed by the data processing module 228, and two or more types of neural networks (e.g., both CNN and RNN) are applied to process the content data jointly.
[0043] The training process is a process for calibrating all of the weights w, for each layer of the learning model using a training data set which is provided in the input layer 402. The training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied. In the forward propagation, the set of weights for different layers are applied to the input data and intermediate results from the previous layers. In the backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error. The activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types. In some embodiments, a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied. The network bias b provides a perturbation that helps the NN 400 avoid over fitting the training data. The result of the training includes the network bias parameter b for each layer.
[0044] Figure 5 is a flowchart of a process 500 for encoding a sequence of images 502 to a motion field and decoding the motion field, in accordance with some embodiments. A camera 260 is disposed and has a fixed camera location in a scene. The camera 260 is configured to capture the sequence of images 502 in the scene. The sequence of images correspond to a frame rate (e.g., 30 frames per second (FPS)) and have a number of images, and each image has a predefined resolution. An object 506 is located in a field of view of the camera 260 and recorded in each of the sequence of images 502. In some embodiments, the camera 260 consecutively captures the sequence of images 502, and the object 506 is moving in the field of view. Lighting conditions are varied in synchronization with image capturing by the camera 260. Alternatively, in some embodiments, the images 502 are separately and independently captured to include the same object 506 under the different lighting conditions. Independently of how the images 502 are captured, each image 502 corresponds to a distinct object pose, a distinct lighting condition, or both. Each image 502 has a temporal location within the sequence of images 502, and each pixel in the respective image 502 has a two- dimensional (2D) pixel location within the respective image. 3D information of the object (e.g., surface normals) are embedded in information of pixels of the sequence of images 502. In accordance with the process 500, the information of the pixels of the sequence of images 502 is encoded into the motion field, which can be provided to different electronic devices to be decoded to recover the 3D information of the object 506 in the scene.
[0045] Pixels in each image 502 are represented by the 2D pixel location (x, y) within the respective image and the temporal location t. In an example, each image 502 has a resolution of 2048×1536 pixels. The pixel locations x and y are located in a first range of 0-2047 and a second range of 0-1535, respectively. The temporal location t represents an order n of this image 502 in the sequence of images, and corresponds to a duration of time t_d that has passed since the first image of the sequence of images 502 has been captured. The duration of time t_d is equal to (n-1)/F, where F is the frame rate of the sequence of images.
[0046] A motion field of the sequence of images 502 is represented by a neural network model 504, and is generated to encode at least a depth value 508 and an interframe depth variation (Δd) 510 of each pixel in the sequence of images 502. The neural network model includes a plurality of layers, and each layer includes a plurality of filters each of which is associated with a plurality of weights w. The motion field provides a set of weights w of the neural network model. During decoding, if a client device 104 obtains the motion field, the client device 104 can automatically recover the depth value 508 and interframe depth variation 510 of each pixel in the sequence of images 502 based on the temporal location (t) and pixel location (x, y) of the respective pixel, without obtaining any of the sequence of images 502. Stated another way, information of the scene captured by the sequence of images 502 is encoded in the neural network model 504.
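A small sketch of the per-pixel parameterization and ranges described in paragraphs [0045] and [0046] is given below; the helper name pixel_input, the default resolution and frame rate, and the choice to pass the elapsed time (n-1)/F as the temporal coordinate are assumptions made for illustration.

```python
def pixel_input(x: int, y: int, frame_index: int,
                width: int = 2048, height: int = 1536, fps: float = 30.0):
    """Build the (x, y, t) location vector fed to the neural network model 504."""
    assert 0 <= x < width and 0 <= y < height, "pixel location outside the resolution"
    assert frame_index >= 1, "the frame order n is 1-based in this sketch"
    t_d = (frame_index - 1) / fps       # time elapsed since the first image, (n - 1) / F
    return (float(x), float(y), t_d)

print(pixel_input(100, 200, frame_index=3))   # third image of a 30 FPS sequence
```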
[0047] The neural network model 504 representing the motion field includes a motion field model 504A and a depth albedo field model 504B. For each pixel, when the motion field model 504A receives the temporal location (t) and pixel location (x, y) of the respective pixel, the motion field model 504A generates at least an interframe pixel displacement (Δx, Δy) and the interframe depth variation (Δd) 510 of the respective pixel, i.e., a motion vector (Δx, Δy, Δd) of the respective pixel. The motion vector (Δx, Δy, Δd) corresponds to displacements of a vector (x, y, d), and represents how the point on a surface of the object 506 moves at the pixel location (x, y) and at a time t. The depth albedo field model 504B is coupled to the motion field model 504A, and converts the pixel location (x, y) and interframe pixel displacement (Δx, Δy) of the respective pixel to a respective albedo value (a) 512 and the respective depth value 508. More specifically, a new pixel location (x+Δx, y+Δy) is fed to the depth albedo field model 504B to determine the respective albedo value (a) 512 and the respective depth value 508. In other words, the motion vector (Δx, Δy, Δd) aligns all pixels from different time instances to a common one, allowing the depth and albedo values 508 and 512 of the surface point (x, y) in the aligned time instance t to be predicted by the depth albedo field model 504B.
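A minimal PyTorch sketch, under assumptions, of the two sub-networks just described is given below: a motion field model 504A that maps (x, y, t) to (Δx, Δy, Δd), and a depth albedo field model 504B that maps the aligned location (x+Δx, y+Δy) to an albedo a and a depth d. The plain multilayer-perceptron architecture, the layer sizes, and the helper names are illustrative choices rather than details of the disclosure.

```python
import torch
import torch.nn as nn

def mlp(in_dim: int, out_dim: int, hidden: int = 256, depth: int = 4) -> nn.Sequential:
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

class MotionFieldModel(nn.Module):           # plays the role of model 504A
    def __init__(self):
        super().__init__()
        self.net = mlp(3, 3)                 # (x, y, t) -> (dx, dy, dd)
    def forward(self, xyt: torch.Tensor) -> torch.Tensor:
        return self.net(xyt)

class DepthAlbedoFieldModel(nn.Module):      # plays the role of model 504B
    def __init__(self):
        super().__init__()
        self.net = mlp(2, 2)                 # aligned (x, y) -> (albedo a, depth d)
    def forward(self, xy_aligned: torch.Tensor) -> torch.Tensor:
        return self.net(xy_aligned)

def motion_field_forward(motion: MotionFieldModel,
                         depth_albedo: DepthAlbedoFieldModel,
                         xyt: torch.Tensor):
    # xyt: (N, 3) pixel/temporal locations. Returns albedo, depth, and depth variation.
    delta = motion(xyt)                              # (dx, dy, dd)
    xy_aligned = xyt[:, :2] + delta[:, :2]           # align pixels across time instances
    out = depth_albedo(xy_aligned)
    return out[:, 0], out[:, 1], delta[:, 2]         # a, d, dd
```

The forward pass mirrors Figure 5: the motion vector aligns each pixel to a common time instance before the depth and albedo of the corresponding surface point are queried.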
[0048] Each pixel has a normal n, which is a vector perpendicular to a surface of the object 506 at the respective pixel. For each pixel, after the respective depth value (d) 508 and interframe depth variation (Δd) 510 are determined from the neural network model 504, the normal n is represented by an orthogonal projection equation as follows:
$n = \left(\frac{\partial (d+\Delta d)}{\partial x}, \frac{\partial (d+\Delta d)}{\partial y}, -1\right) / \left\lVert \left(\frac{\partial (d+\Delta d)}{\partial x}, \frac{\partial (d+\Delta d)}{\partial y}, -1\right) \right\rVert$ (1)
In some embodiments, a client device 104 obtains, e.g., from a server 102, the motion field including a set of weights w of the neural network model 504. The client device 104 does not need to obtain any of the sequence of images 502. During decoding, the client device 104 automatically recovers the depth value 508 and interframe depth variation 510 of each pixel in the sequence of images 502 from the neural network model 504 based on the temporal location (t) and pixel location (x, y) of the respective pixel. In some embodiments, the client device 104 executes a user application (e.g., a gaming application) configured to reconstruct a gaming scene including the object 506 based on the depth value 508 and interframe depth variation 510 of each pixel in the sequence of images 502.
[0049] The neural network model 504 includes a first neural network model and is specific to a first scene. The sequence of images 502 are captured in the first scene, and a first object 506 is located in the field of view of the camera. A second sequence of images 502 captured in a distinct second scene are encoded into a distinct second neural network model. A third sequence of images 502 captured in the same first scene and directed to a distinct second object are encoded into a distinct third neural network model. The first, second, and third neural network models are distinct from each other. An input to the neural network model 504 is the pixel and temporal locations (x, y, t), where (x, y) is the pixel location in each image and t is a temporal location or time stamp in the sequence of images 502. Each pixel in the sequence of images 502 is parameterized as a unique location vector (x, y, t), and corresponds to an output of the normal at the pixel location (x, y) and at the time stamp t. Given any pixel location and any time stamp, the neural network model 504 can predict the surface normal at the pixel location (x, y), thereby creating a dynamic 3D reconstruction of the scene and object 506.
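A hedged sketch of the client-side decoding just described is given below: the networks are evaluated at every pixel location of a chosen frame, and the recovered depth is converted into a per-pixel surface normal. Using automatic differentiation for the spatial depth gradients, the sign convention of the normal, and the helper name decode_normal_map are assumptions of this sketch, consistent with the orthographic relation of equation (1).

```python
import torch

def decode_normal_map(motion, depth_albedo, width: int, height: int, t: float):
    # Build the (x, y, t) inputs for every pixel of the frame at temporal location t.
    # (In practice the grid would be evaluated in chunks to bound memory use.)
    ys, xs = torch.meshgrid(torch.arange(height, dtype=torch.float32),
                            torch.arange(width, dtype=torch.float32), indexing="ij")
    xyt = torch.stack([xs.flatten(), ys.flatten(),
                       torch.full((width * height,), t)], dim=1).requires_grad_(True)

    delta = motion(xyt)                                     # (dx, dy, dd) per pixel
    xy_aligned = xyt[:, :2] + delta[:, :2]
    depth = depth_albedo(xy_aligned)[:, 1] + delta[:, 2]    # total depth d + dd

    # Spatial gradients of depth with respect to the pixel location (x, y).
    grad = torch.autograd.grad(depth.sum(), xyt)[0][:, :2]
    n = torch.cat([grad, -torch.ones_like(grad[:, :1])], dim=1)
    n = n / n.norm(dim=1, keepdim=True)                     # unit surface normal
    return n.reshape(height, width, 3)                      # one normal per pixel
```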
[0050] It is noted that each location variable in the pixel and temporal locations (x, y, t) has a respective range. The ranges for the pixel location (x, y) are defined by a resolution of each image 502, and the pixel location (x, y) does not go beyond the range defined by the resolution. The range for the temporal location t is defined by the number of images in the sequence of images. When the sequence of images 502 are encoded and distributed, the neural network model 504 is communicated jointly with the resolution of each image 502 and the number of images in the sequence of images 502, thereby allowing a client device 104 to set the ranges for the pixel and temporal locations (x, y, t) during decoding.
[0051] In some embodiments, the object 506 undergoes a deformable motion when the sequence of images 502 are captured. Although the deformable motion is not rigid, the neural network model 504 that encodes the sequence of images 502 including the deformable motion of the object 506 can be decoded to provide information of the object 506, e.g., a surface normal at each pixel of a surface of the object 506. In some embodiments, the object 506 has non-uniform color or material. Alternatively, in some embodiments, the object 506 has uniform color or material. Independently of whether the color or material is uniform, the neural network model 504 that encodes the sequence of images 502 including the color or material of the object 506 can be decoded to provide information of the object 506, e.g., a surface normal at each pixel of a surface of the object 506. In some embodiments, a highspeed camera 260 is applied to capture the sequence of images 502. No motion compensation is applied between two images if the two images 502 are captured within a predefined duration of time (e.g., within 10 milliseconds). Stated another way, these two images 502 are close in time, such that the object 506 remains substantially still and a corresponding variation of the object 506 between these two images 502 are negligible.
[0052] Figure 6A is a sequence of images 502 captured under different lighting conditions, in accordance with some embodiments. Figures 6B and 6C are surface normal maps 600 and relative depth maps 620 corresponding to the sequence of images in Figure 6A, in accordance with some embodiments. The sequence of images 502 include five images 502A, 502B, 502C, 502D, and 502E captured at times t0-2, t0-1, t0, t0+1, and t0+2, where t0 is a temporal position of the third image 502C. In some embodiments, each image of the sequence of images 502 includes one of an RGB color image, a CMYA color image, a Lab color image, a monochromatic color image, and a gray scale image. In some embodiments, the sequence of images 502 are successively captured by a camera at a frame rate. In some embodiments, the sequence of images 502 are five images selected from another sequence of images 502 that are successively captured by a camera at a frame rate. In some embodiments, the sequence of images 502 are five independently captured images including the same object exposed to the different lighting conditions.
[0053] Each image 502 corresponds to a distinct lighting condition in which the object 506 is illuminated from a different light direction. In some embodiments, a set of light sources are fixed on a capturing rig, and the set of light sources are controlled in synchronization with capturing the sequence of images 502A-502E. The sequence of images 502 are captured sequentially and in synchronization with illumination created by the set of fixed light sources that are cyclically enabled to illuminate the field of view for respective shortened durations. In an example, a plurality of fixed light sources include a first number of fixed light sources placed at the first number of fixed locations with respect to the camera 260 configured to capture the sequence of images 502, and the first number of fixed sources are enabled sequentially and in synchronization with the first number of consecutive images in the sequence of images 502.
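As a non-limiting sketch of the synchronization described above, the loop below cycles a first number of fixed light sources so that each captured frame is lit from a single direction. The callables enable_light, disable_light, and capture_frame are hypothetical placeholders for whatever rig and camera interface is used; they are not part of the disclosed embodiments.

```python
def capture_photometric_sequence(num_lights: int, enable_light, disable_light, capture_frame):
    """Cycle the fixed light sources so that frame i is lit only from direction i."""
    frames = []
    for i in range(num_lights):
        enable_light(i)                 # turn on the i-th fixed light source
        frames.append(capture_frame())  # capture one image under that single direction
        disable_light(i)                # keep each illumination to a short duration
    return frames
```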
[0054] A neural network model 504 is generated from and applied to encode the sequence of images 502. Each pixel in the sequence of images 502 corresponds to a position vector (x, y, t), where (x, y) corresponds to a pixel position of the respective pixel in a corresponding image 502, and t is a temporal position of the corresponding image 502 in the sequence of images 502. For each pixel in the sequence of images 502, the neural network model 504 generates the depth value 508 and interframe depth variation 510 based on the position vector (x, y, t). The depth values 508 of all pixels of the sequence of images 502 are visualized in the five depth maps 620 corresponding to the five images 502A-502E, respectively. Additionally, in accordance with equation (1), the depth value 508 and interframe depth variation 510 of each pixel is converted to a normal vector n at the respective pixel, and the normal vector n has a surface normal value. The surface normal values of all pixels of the sequence of images 502 are visualized in the five surface normal maps 600 corresponding to the five images 502A-502E, respectively.
[0055] In some embodiments, information of the object 506 (e.g., surface normals) is captured by the sequence of images 502. The neural network model 504 is generated to encode the sequence of images 502, i.e., encode the information of the object. When the neural network model 504 is provided to a client device 104, the client device 104 can conveniently reconstruct the information of the object 506 based on the neural network model 504, e.g., according to the process 500. The reconstructed information of the object 506 includes the surface normal maps 600 and relative depth maps 620, which can be applied to render the object 506 in a virtual reality environment if needed.
[0056] Figure 7 is a flowchart of a process 700 for training a neural network model 504 for encoding a sequence of images 502 to a motion field, in accordance with some embodiments. The process 700 is implemented at an electronic system 200 including a camera 260 and a light source system 702 that provides different lighting conditions needed by photometric stereo. For example, the light source system 702 enables five different lighting conditions applied in five images 502A-502E in Figure 6A. The light source system 702 operates in synchronization with the camera 260, such that the sequence of images 502 are captured in the predefined different lighting conditions. In some embodiments, the different lighting conditions are set up in a studio, and the electronic system 200 is a server 102 that can operate with the studio to capture the sequence of images 502 for an object 506. Upon obtaining the sequence of images 502, the electronic system 200 encodes the sequence of images 502 to the neural network model 504 based on the training process 700 that minimizes a reconstruction loss (Lrec) 704.
[0057] The neural network model 504 receives a position vector (x, y, t) of each pixel of the sequence of images 502 as inputs. The neural network model 504 includes a motion field model 504A and a depth albedo field model 504B. The motion field model 504A generates a motion vector (Δx, Δy, Δd) from the position vector (x, y, t) for each pixel, and the depth albedo field model 504B generates an albedo value (a) 512 and a depth value (d) 508 for each pixel. For each pixel, a normal n is generated from the depth value (d) 508 and depth variation (Δd) 510 using equation (1). For each pixel in the sequence of images 502, the electronic system 200 predicts a color value cpred based on the normal n, albedo value (a) 512, and a direction vector lt of a light source in the light source system 702 as follows:
cpred = a (n · lt)
The reconstruction loss Lrec (704) is determined between the predicted color value cpred and an actual color value cgt of the respective pixel as follows:
Lrec = ||cpred - cgt||²
where the actual color value cgt acts as a ground truth of the respective pixel. The neural network model 504 representing the motion field is trained based on the reconstruction loss (Lrec) 704. That is, the motion field model 504A and depth albedo field model 504B are trained jointly based on the reconstruction loss (Lrec) 704. Once the neural network model 504 is trained, given any pixel location in a pixel range and any time instance in a temporal range, the neural network model 504 can predict the surface normal n at the pixel location. This allows the surface normal n to be obtained for pixels of the sequence of images 502, thereby creating a dynamic 3D reconstruction of the object 506 captured in the sequence of images 502.
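A non-limiting training sketch of the process 700 is given below, assuming PyTorch. The network sizes, learning rate, and variable names are illustrative assumptions; the normal computation uses spatial depth gradients as a stand-in because equation (1) is not reproduced here, while the Lambertian prediction cpred = a (n · lt) and the reconstruction loss between cpred and cgt follow the text above.

```python
import torch

# Two sub-models with illustrative sizes: the motion field model maps
# (x, y, t) to (dx, dy, dd); the depth albedo field model maps a displaced
# pixel location to (albedo, depth).
motion_field = torch.nn.Sequential(
    torch.nn.Linear(3, 256), torch.nn.ReLU(), torch.nn.Linear(256, 3))
depth_albedo_field = torch.nn.Sequential(
    torch.nn.Linear(2, 256), torch.nn.ReLU(), torch.nn.Linear(256, 2))
optimizer = torch.optim.Adam(
    list(motion_field.parameters()) + list(depth_albedo_field.parameters()), lr=1e-4)

def train_step(xyt: torch.Tensor, light_dir: torch.Tensor, c_gt: torch.Tensor) -> float:
    """One joint update; xyt is (N, 3), light_dir is (N, 3), c_gt is (N,)."""
    xyt = xyt.clone().requires_grad_(True)
    motion = motion_field(xyt)                    # (dx, dy, dd) per pixel
    displaced_xy = xyt[:, :2] + motion[:, :2]     # pixel location plus displacement
    delta_d = motion[:, 2]                        # interframe depth variation (used by equation (1) in the disclosure)
    albedo, depth = depth_albedo_field(displaced_xy).unbind(dim=-1)
    # Stand-in for equation (1): normal from depth gradients, z component -1.
    grad_xy = torch.autograd.grad(depth.sum(), xyt, create_graph=True)[0][:, :2]
    normal = torch.nn.functional.normalize(
        torch.cat([grad_xy, -torch.ones_like(depth).unsqueeze(-1)], dim=-1), dim=-1)
    c_pred = albedo * (normal * light_dir).sum(dim=-1)  # Lambertian shading
    loss = ((c_pred - c_gt) ** 2).mean()                # reconstruction loss Lrec
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```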
[0058] Figure 8A is a flow diagram of a method 800 for processing information of one or more objects, in accordance with some embodiments, and Figure 8B is a flow diagram of a method 850 for decoding information of one or more objects 506, in accordance with some embodiments. The methods 800 and 850 are described as being implemented by an electronic system 200 (e.g., a server 102, a client device 104, or a combination thereof). Methods 800 and 850 are, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system. Each of the operations shown in Figures 8A and 8B may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 in Figure 2). The computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other nonvolatile memory device or devices. The instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in methods 800 and 850 may be combined and/or the order of some operations may be changed.
[0059] Referring to Figure 8A, in accordance with the method 800, an electronic device (e.g., a server 102) obtains (802) a sequence of images 502 that are captured sequentially in a field of view including the one or more objects 506, determines (804) a resolution of each image and a number of images in the sequence of images 502, and generates (806) a motion field of the sequence of images 502. The motion field is represented (808) with a neural network model 504 and configured to encode at least a depth value 508 and an interframe depth variation 510 of each pixel in the sequence of images 502. In some embodiments, the neural network model 504 representing the motion field is configured (810) to, for each pixel in the sequence of images 502, convert a pixel location of the respective pixel in a respective image and a respective temporal location of the respective image in the sequence of images 502 to the depth value 508 and the interframe depth variation 510 of the respective pixel. The electronic device encodes (812) the sequence of images 502 to image encoding information including the resolution of each image, the number of images in the sequence of images 502, and the motion field. It is noted that the resolution of each image is applied to define a location range for the pixel location and the number of images defines a temporal range for the temporal location of the respective image in the sequence of images 502.
[0060] In some embodiments, the neural network model 504 representing the motion field includes (814) a motion field model 504A and a depth albedo field model 504B. The motion field is generated by determining (816) the motion field model 504A for each pixel in the sequence of images 502 and determining (818) the depth albedo field model 504B for each pixel in the sequence of images 502. The motion field model 504A is configured to convert the pixel location of the respective pixel in the respective image and the respective temporal location of the respective image in the sequence of images 502 to at least an interframe pixel displacement and the interframe depth variation 510 of the respective pixel. The depth albedo field model 504B is configured to convert the pixel location and interframe pixel displacement of the respective pixel to a respective albedo value 512 and a respective depth value 508.
[0061] In some embodiments, for each pixel in the sequence of images 502, the electronic device predicts a color value from a pixel location of the respective pixel in a respective image and a respective temporal location of the respective image in the sequence of images 502 using the neural network model 504 representing the motion field, determines a reconstruction loss between the predicted color value and an actual color value of the respective pixel, and trains the neural network model 504 representing the motion field based on the reconstruction loss. Further, in some embodiments, the electronic device predicts the color values by, for each pixel in the sequence of images 502, generating the depth value 508, the interframe depth variation 510, and an albedo value 512 from the pixel location of the respective pixel in the respective image and the respective temporal location of the respective image in the sequence of images 502 using the neural network model 504, generating a normal from the depth value 508 and interframe depth variation 510, and predicting the color value based on the normal, the albedo value 512, and a location of a corresponding light source. Additionally, in some embodiments, the normal n, the predicted color value cpred, and the reconstruction loss Lrec are represented as follows:
n is determined from d, Δd, x, and y according to equation (1); cpred = a (n · lt); and Lrec = ||cpred - cgt||²,
where d and Δd are the depth value 508 and interframe depth variation 510, respectively, x and y correspond to two coordinates of the pixel location of the respective pixel in the respective image, a is the albedo value 512, lt is the direction of the corresponding light source, and cgt is the actual color value of the respective pixel.
[0062] In some embodiments, the electronic device includes a first electronic device (e.g., a server 102). A second electronic device (e.g., a client device 104) obtains (820) the image encoding information of the sequence of images 502, and regenerates (822) a normal map of each image in the sequence of images 502 from the image encoding information. The normal map of each image includes a normal of each pixel of the respective image. For each pixel in the respective image, in accordance with the neural network model 504 representing the motion field, the second electronic device converts a pixel location in the respective image and a respective temporal location of the respective image in the sequence of images 502 to the depth value 508 and the interframe depth variation 510 of the respective pixel, and determines the normal of the respective pixel based on the depth value 508 and depth variation 510 of the respective pixel. Further, in some embodiments, the second electronic device reconstructs the one or more objects 506 in a three-dimensional (3D) virtual scene based on at least a subset of the normal maps of the sequence of images 502. In some situations, the image encoding information is encoded in a server 102 and provided to a client device 104, and the client device 104 regenerates the normal map from the image encoding information.
[0063] In some embodiments, the sequence of images 502 are captured sequentially and in synchronization with illumination that is created with a plurality of fixed light sources that are cyclically enabled to illuminate the field of view for respective shortened durations. Further, in some embodiments, the one or more objects 506 are moving in the field of view during a duration of time in which the sequence of images 502 are captured. Additionally, in some embodiments, a plurality of fixed light sources includes a first number of fixed light sources placed at the first number of fixed locations with respect to a camera configured to capture the sequence of images 502, and the first number of fixed sources are enabled sequentially and in synchronization with the first number of consecutive images in the sequence of images 502.
[0064] In some embodiments, each image of the sequence of images 502 includes one of an RGB color image, a CMYA color image, a Lab color image, a monochromatic color image, and a gray scale image.
[0065] Referring to Figure 8B, in accordance with the method 850, an electronic device (e.g., a client device 104) obtains (852) image encoding information of a sequence of images 502 that are captured sequentially in a field of view including the one or more objects 506. The image encoding information includes (854) a resolution of each image, a number of images in the sequence of images 502, and a motion field represented by a neural network model 504. The electronic device generates (856) a normal map of each image in the sequence of images 502 from the image encoding information. For each pixel in the respective image, in accordance with the neural network model 504 representing the motion field, the electronic device converts (858) a pixel location in the respective image and a respective temporal location of the respective image in the sequence of images 502 to a depth value 508 and an interframe depth variation 510 of the respective pixel, and determines (860) a normal of the respective pixel based on the depth value 508 and interframe depth variation 510 of the respective pixel.
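For illustration, a decoder-side sketch of the method 850 is given below, assuming the hypothetical MotionFieldMLP interface from the earlier sketch (a PyTorch model mapping (x, y, t) to a depth value and an interframe depth variation). The finite-difference normal is a stand-in for equation (1), which also involves the interframe depth variation; the exact conversion is not reproduced here.

```python
import torch

def regenerate_normal_map(model: torch.nn.Module, width: int, height: int, t: int) -> torch.Tensor:
    """Query the motion field at every pixel of frame t and build an (H, W, 3) normal map."""
    ys, xs = torch.meshgrid(
        torch.arange(height, dtype=torch.float32),
        torch.arange(width, dtype=torch.float32),
        indexing="ij")
    coords = torch.stack([xs, ys, torch.full_like(xs, float(t))], dim=-1).reshape(-1, 3)
    with torch.no_grad():
        depth, delta_d = model(coords).unbind(dim=-1)  # per-pixel depth and interframe variation
    depth = depth.reshape(height, width)
    # Stand-in for equation (1): finite-difference depth gradients with z = -1.
    # delta_d also enters the disclosed conversion; it is unused in this simplification.
    dz_dx = torch.gradient(depth, dim=1)[0]
    dz_dy = torch.gradient(depth, dim=0)[0]
    normal = torch.stack([dz_dx, dz_dy, -torch.ones_like(depth)], dim=-1)
    return torch.nn.functional.normalize(normal, dim=-1)
```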
[0066] It should be understood that the particular order in which the operations in Figures 8A and 8B have been described is merely exemplary and is not intended to indicate that the described order is the only order in which the operations could be performed. One of ordinary skill in the art would recognize various ways to encode a sequence of images 502 and decode an associated motion field as described herein. Details of some processes described above with respect to Figures 8A and 8B are applicable to both methods 800 and 850. Additionally, it should be noted that details of other processes described above with respect to Figures 5-7 are also applicable in an analogous manner to methods 800 and 850 described above with respect to Figures 8A and 8B. For brevity, these details are not repeated here.
[0067] The terminology used in the description of the various described implementations herein is for the purpose of describing particular implementations only and is not intended to be limiting. As used in the description of the various described implementations and the appended claims, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “includes,” “including,” “comprises,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Additionally, it will be understood that, although the terms “first,” “second,” etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another.
[0068] As used herein, the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context. Similarly, the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.
[0069] The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the claims to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain principles of operation and practical applications, to thereby enable others skilled in the art.
[0070] Although various drawings illustrate a number of logical stages in a particular order, stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.

Claims

What is claimed is:
1. A method for processing information of one or more objects, implemented at an electronic system, comprising: obtaining a sequence of images that are captured sequentially in a field of view including the one or more objects; determining a resolution of each image and a number of images in the sequence of images; generating a motion field of the sequence of images, wherein the motion field is represented with a neural network model and configured to encode at least a depth value and an interframe depth variation of each pixel in the sequence of images; and encoding the sequence of images to image encoding information including the resolution of each image, the number of images in the sequence of images, and the motion field.
2. The method of claim 1, wherein the neural network model representing the motion field is configured to, for each pixel in the sequence of images, convert a pixel location of the respective pixel in a respective image and a respective temporal location of the respective image in the sequence of images to the depth value and the interframe depth variation of the respective pixel.
3. The method of claim 2, wherein the neural network model representing the motion field includes a motion field model and a depth albedo field model, determining the motion field further comprising: determining the motion field model for each pixel in the sequence of images, wherein the motion field model is configured to convert the pixel location of the respective pixel in the respective image and the respective temporal location of the respective image in the sequence of images to at least an interframe pixel displacement and the interframe depth variation of the respective pixel; and determining the depth albedo field model for each pixel in the sequence of images, wherein the depth albedo field model is configured to convert the pixel location and interframe pixel displacement of the respective pixel to a respective albedo value and the respective depth value.
4. The method of any of the preceding claims, generating the motion field further comprising, for each pixel in the sequence of images: predicting a color value from a pixel location of the respective pixel in a respective image and a respective temporal location of the respective image in the sequence of images using the neural network model representing the motion field; determining a reconstruction loss between the predicted color value and an actual color value of the respective pixel; and training the neural network model representing the motion field based on the reconstruction loss.
5. The method of claim 4, predicting the color values further comprising, for each pixel in the sequence of images: generating the depth value, the interframe depth variation, and an albedo value from the pixel location of the respective pixel in the respective image and the respective temporal location of the respective image in the sequence of images using the neural network model; generating a normal from the depth value and interframe depth variation; and predicting the color value based on the normal, the albedo value, and a location of a corresponding light source.
6. The method of claim 5, wherein the normal n, the predicted color value cpred, and the reconstruction loss Lrec are represented as follows:
n is determined from d, Δd, x, and y; cpred = a (n · lt); and Lrec = ||cpred - cgt||²,
where d and Δd are the depth value and interframe depth variation, respectively, x and y correspond to two coordinates of the pixel location of the respective pixel in the respective image, a is the albedo value, lt is the direction of the corresponding light source, and cgt is the actual color value of the respective pixel.
7. The method of any of the preceding claims, further comprising: obtaining the image encoding information of the sequence of images; regenerating a normal map of each image in the sequence of images from the image encoding information, wherein the normal map of each image includes a normal of each pixel of the respective image, including for each pixel in the respective image: in accordance with the neural network model representing the motion field, converting a pixel location in the respective image and a respective temporal location of the respective image in the sequence of images to the depth value and the interframe depth variation of the respective pixel; and determining the normal of the respective pixel based on the depth value and depth variation of the respective pixel.
8. The method of claim 7, further comprising: reconstructing the one or more objects in a three-dimensional (3D) virtual scene based on at least a subset of the normal maps of the sequence of images.
9. The method of claim 7, wherein the image encoding information is encoded in a server and provided to a client device, and the client device regenerates the normal map from the image encoding information.
10. The method of any of the preceding claims, wherein the sequence of images are captured sequentially and in synchronization with illumination that is created with a plurality of fixed light sources that are cyclically enabled to illuminate the field of view for respective shortened durations.
11. The method of claim 10, wherein the one or more objects are moving in the field of view during a duration of time in which the sequence of images are captured.
12. The method of claim 10, wherein a plurality of fixed light sources includes a first number of fixed light sources placed at the first number of fixed locations with respect to a camera configured to capture the sequence of images, and the first number of fixed sources are enabled sequentially and in synchronization with the first number of consecutive images in the sequence of images.
13. The method of any of the preceding claims, wherein each image of the sequence of images includes one of an RGB color image, a CMYA color image, a Lab color image, a monochromatic color image, and a gray scale image.
14. A method for decoding information of one or more objects, implemented at an electronic device, comprising: obtaining image encoding information of a sequence of images that are captured sequentially in a field of view including the one or more objects, wherein the image encoding information includes a resolution of each image, a number of images in the sequence of images, and a motion field represented by a neural network model; generating a normal map of each image in the sequence of images from the image encoding information, including for each pixel in the respective image: in accordance with the neural network model representing the motion field, converting a pixel location in the respective image and a respective temporal location of the respective image in the sequence of images to a depth value and an interframe depth variation of the respective pixel; and determining a normal of the respective pixel based on the depth value and interframe depth variation of the respective pixel.
15. An electronic system, comprising: one or more processors; and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform a method of any of claims 1-14.
16. A non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the one or more processors to perform a method of any of claims 1-14.
PCT/US2022/019458 2022-03-09 2022-03-09 Photometic stereo for dynamic surface with motion field Ceased WO2023172257A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2022/019458 WO2023172257A1 (en) 2022-03-09 2022-03-09 Photometic stereo for dynamic surface with motion field

Publications (1)

Publication Number Publication Date
WO2023172257A1

Family

ID=87935697

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/019458 Ceased WO2023172257A1 (en) 2022-03-09 2022-03-09 Photometic stereo for dynamic surface with motion field

Country Status (1)

Country Link
WO (1) WO2023172257A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130050413A1 (en) * 2011-08-22 2013-02-28 Sony Corporation Video signal processing apparatus, video signal processing method, and computer program
US20180293711A1 (en) * 2017-04-06 2018-10-11 Disney Enterprises, Inc. Kernel-predicting convolutional neural networks for denoising
US20200051206A1 (en) * 2018-08-13 2020-02-13 Nvidia Corporation Motion blur and depth of field reconstruction through temporally stable neural networks
US20210150747A1 (en) * 2019-11-14 2021-05-20 Samsung Electronics Co., Ltd. Depth image generation method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22931184

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22931184

Country of ref document: EP

Kind code of ref document: A1