
WO2025172714A1 - Generating a representation of a scene - Google Patents

Generating a representation of a scene

Info

Publication number
WO2025172714A1
Authority
WO
WIPO (PCT)
Prior art keywords
scene
base representation
representation
viewing zone
quality
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/GB2025/050280
Other languages
French (fr)
Inventor
Tristan SALOME
Guido MEARDI
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
V Nova International Ltd
Original Assignee
V Nova International Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by V Nova International Ltd filed Critical V Nova International Ltd
Publication of WO2025172714A1
Pending legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 19/00 Manipulating 3D models or images for computer graphics
    • G06T 19/006 Mixed reality
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B 3/00 Apparatus for testing the eyes; Instruments for examining the eyes
    • A61B 3/02 Subjective types, i.e. testing apparatus requiring the active assistance of the patient
    • A61B 3/028 Subjective types, i.e. testing apparatus requiring the active assistance of the patient for testing visual acuity; for determination of refraction, e.g. phoropters
    • A61B 3/032 Devices for presenting test symbols or characters, e.g. test chart projectors
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00 3D [Three Dimensional] image rendering
    • G06T 15/06 Ray-tracing
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00 3D [Three Dimensional] image rendering
    • G06T 15/10 Geometric effects
    • G06T 15/20 Perspective computation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00 3D [Three Dimensional] image rendering
    • G06T 15/10 Geometric effects
    • G06T 15/20 Perspective computation
    • G06T 15/205 Image-based rendering
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/81 Monomedia components thereof
    • H04N 21/816 Monomedia components thereof involving special video data, e.g. 3D video
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B 3/00 Apparatus for testing the eyes; Instruments for examining the eyes
    • A61B 3/10 Objective types, i.e. instruments for examining the eyes independent of the patients' perceptions or reactions
    • A61B 3/113 Objective types, i.e. instruments for examining the eyes independent of the patients' perceptions or reactions for determining or recording eye movement
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B 5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B 5/103 Measuring devices for testing the shape, pattern, colour, size or movement of the body or parts thereof, for diagnostic purposes
    • A61B 5/11 Measuring movement of the entire body or parts thereof, e.g. head or hand tremor or mobility of a limb
    • A61B 5/1103 Detecting muscular movement of the eye, e.g. eyelid movement
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B 5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B 5/103 Measuring devices for testing the shape, pattern, colour, size or movement of the body or parts thereof, for diagnostic purposes
    • A61B 5/11 Measuring movement of the entire body or parts thereof, e.g. head or hand tremor or mobility of a limb
    • A61B 5/1113 Local tracking of patients, e.g. in a hospital or private home
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2210/00 Indexing scheme for image generation or computer graphics
    • G06T 2210/36 Level of detail

Definitions

  • the present disclosure relates to methods, systems, and apparatuses for generating a representation of a scene.
  • the disclosure relates to methods, systems, and apparatuses for generating a three-dimensional representation of a scene that comprises a base representation and a scene element.
  • XR extended reality
  • providing a high-quality scene typically requires a powerful computer device, which may limit accessibility, in particular with lightweight devices such as mobile devices, headsets, or smart glasses.
  • Providing a high-quality scene - e.g. to provide a scene suitable for industrial pre-visualisation or cinematic computer graphics - typically requires a long generation time (or rendering time), which can make it difficult to provide an interactive scene that is updated in real-time based on the actions of a viewer.
  • a method of generating a three-dimensional representation of a scene comprising: generating (e.g. rendering and/or presenting) a base representation of the scene, the base representation having a first quality; generating (e.g. rendering and/or presenting) a scene element, the scene element having a second quality; wherein the scene element is arranged to be combined with the base representation; and/or the method comprises combining the base representation with the scene element.
  • the second quality is lower than the first quality.
  • the method may comprise generating a multi-dimensional representation of a scene, e.g. a four-dimensional, five-dimensional, or six-dimensional representation.
  • the first quality and the second quality are associated with one or more of: a first resolution and a second resolution (e.g. wherein the second resolution is different to and/or lower than the first resolution); a first frame rate and a second frame rate (e.g. where the second frame rate is different to and/or lower than the first frame rate); and/or a first colour range and/or a second colour range (e.g. where the second colour range is different to and/or lower than the first colour range).
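  • As an illustration only, the sketch below parameterises the two quality levels in terms of the resolution, frame rate, and colour range mentioned above; the field names and example values are assumptions for illustration, not taken from the disclosure.

```python
# Minimal sketch of two quality levels (base representation vs scene element).
# Field names and example values are illustrative assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class QualityProfile:
    width: int          # horizontal resolution in pixels
    height: int         # vertical resolution in pixels
    frame_rate: float   # frames per second
    colour_depth: int   # bits per colour channel, as a proxy for colour range

base_quality = QualityProfile(width=7680, height=4320, frame_rate=60.0, colour_depth=10)
element_quality = QualityProfile(width=1920, height=1080, frame_rate=30.0, colour_depth=8)

def is_lower_quality(a: QualityProfile, b: QualityProfile) -> bool:
    """True if `a` is no higher than `b` on every axis and strictly lower on at least one."""
    axes = [(a.width * a.height, b.width * b.height),
            (a.frame_rate, b.frame_rate),
            (a.colour_depth, b.colour_depth)]
    return all(x <= y for x, y in axes) and any(x < y for x, y in axes)

assert is_lower_quality(element_quality, base_quality)
```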
  • the scene element is upsampled prior to the combining of the scene element with the base representation.
  • the scene element is subjected to a motion interpolation process prior to the combining of the scene element with the base representation.
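  • The two pre-combination steps above could, for example, look like the following sketch, which uses nearest-neighbour upsampling and simple linear blending between element frames; the specific schemes and sizes are assumptions made for illustration.

```python
# Hedged sketch: upsample a low-resolution scene element towards the base
# representation's resolution, and synthesise an in-between element frame by
# linear interpolation before combining with the base representation.
import numpy as np

def upsample_nearest(frame: np.ndarray, scale: int) -> np.ndarray:
    """Repeat each pixel `scale` times in both spatial dimensions."""
    return np.repeat(np.repeat(frame, scale, axis=0), scale, axis=1)

def interpolate_frames(prev: np.ndarray, nxt: np.ndarray, t: float) -> np.ndarray:
    """Blend two scene-element frames; t=0 returns `prev`, t=1 returns `nxt`."""
    return ((1.0 - t) * prev + t * nxt).astype(prev.dtype)

element_frame_a = np.random.randint(0, 256, (540, 960, 3), dtype=np.uint8)
element_frame_b = np.random.randint(0, 256, (540, 960, 3), dtype=np.uint8)

upsampled = upsample_nearest(element_frame_a, scale=4)            # -> (2160, 3840, 3)
halfway = interpolate_frames(element_frame_a, element_frame_b, t=0.5)
```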
  • the base representation has a resolution of at least 4K, at least 8K, and/or at least 16K (preferably, this resolution is a resolution per eye).
  • the scene element has a resolution of no more than 8K, no more than 4K, and/or no more than 2K.
  • the base representation is generated using a ray-tracing process.
  • the scene element is generated using a rasterization process, preferably a real-time or near real-time rasterization process.
  • the base representation enables a user to move about the scene so as to view the scene, preferably to move about the scene with six degrees of freedom (6DoF).
  • the method is carried out at a first device and generating the base representation comprises generating the base representation based on a transmission received from a second device, preferably wherein the transmission comprises the base representation in an encoded format, more preferably a layered format and/or a low complexity enhanced video codec (LCEVC) format.
  • LCEVC low complexity enhanced video codec
  • generating the base representation comprises streaming the base representation based on a transmission from the second device, preferably wherein streaming the base representation comprises simultaneously receiving the transmission and presenting the scene.
  • generating the scene element comprises streaming the scene element based on a transmission from the further device.
  • the second device and the further device are different devices.
  • movement out of the viewing zone causes presentation of the scene element in an altered quality and/or movement out of the viewing zone pauses an animation or playback of the scene element.
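  • One possible playback-time check for the viewing-zone behaviour just described is sketched below; the axis-aligned box test and the specific responses (a "reduced" quality, a paused animation) are illustrative assumptions.

```python
# Hedged sketch: when the viewer leaves the viewing zone, present the scene
# element at an altered (reduced) quality and pause its animation/playback.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class ViewingZone:
    min_corner: Tuple[float, float, float]
    max_corner: Tuple[float, float, float]

    def contains(self, position: Tuple[float, float, float]) -> bool:
        return all(lo <= p <= hi for p, lo, hi in
                   zip(position, self.min_corner, self.max_corner))

def update_scene_element(zone: ViewingZone, viewer_position, element_state: dict) -> dict:
    inside = zone.contains(viewer_position)
    element_state["quality"] = "full" if inside else "reduced"
    element_state["animation_playing"] = inside     # pause animation outside the zone
    return element_state

zone = ViewingZone(min_corner=(-2.0, 0.0, -2.0), max_corner=(2.0, 2.0, 2.0))
state = update_scene_element(zone, viewer_position=(3.5, 1.0, 0.0),
                             element_state={"quality": "full", "animation_playing": True})
# state -> {'quality': 'reduced', 'animation_playing': False}
```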
  • the method comprises inserting a trigger into the base representation, the trigger being associated with the generation and/or display of a scene element, the scene element having a second quality and the scene element being arranged to be combined with the base representation.
  • a method of generating a base representation of a three-dimensional scene comprising: generating the base representation of a scene, the base representation having a first quality; inserting a trigger into the base representation, the trigger being associated with the display and/or generation of a scene element, the scene element having a second quality and the scene element being arranged to be combined with the base representation.
  • the method is performed at an image generating device, and the method further comprises transmitting the base representation to a display device.
  • the method further comprises displaying the base representation at the display device.
  • the method comprises generating (e.g. rendering and/or displaying) the scene element at the display device, preferably comprising transmitting the scene element from a further device to the display device and combining the base representation and the scene element at the display device.
  • the method comprises associating a trigger with the scene element, wherein the scene element is arranged to be combined with the base representation based on the triggering of the trigger.
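  • A trigger carried with the base representation could prompt the combination in the manner sketched below; the trigger fields (a playback timestamp plus an optional spherical region) and the callback are assumptions made purely for illustration.

```python
# Hedged sketch: triggers associated with scene elements fire during playback
# and cause the associated scene element to be generated and combined.
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class SphericalRegion:
    centre: Tuple[float, float, float]
    radius: float

    def contains(self, position: Tuple[float, float, float]) -> bool:
        return sum((p - c) ** 2 for p, c in zip(position, self.centre)) <= self.radius ** 2

@dataclass
class Trigger:
    element_id: str
    fire_at_seconds: float                     # playback time at which the trigger fires
    region: Optional[SphericalRegion] = None   # optionally also require the viewer nearby
    fired: bool = False

def process_triggers(triggers: List[Trigger], playback_time: float, viewer_position,
                     combine_element: Callable[[str], None]) -> None:
    for trig in triggers:
        if trig.fired or playback_time < trig.fire_at_seconds:
            continue
        if trig.region is not None and not trig.region.contains(viewer_position):
            continue
        combine_element(trig.element_id)       # generate/combine the scene element
        trig.fired = True

triggers = [Trigger("avatar_greeting", fire_at_seconds=12.0,
                    region=SphericalRegion(centre=(0.0, 0.0, 0.0), radius=3.0))]
process_triggers(triggers, playback_time=12.5, viewer_position=(1.0, 0.0, 1.0),
                 combine_element=lambda element_id: print("combining", element_id))
```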
  • a method of presenting a representation of a three-dimensional scene comprising: receiving a base representation of a scene, the base representation having a first quality; receiving and/or generating a scene element, the scene element having a second quality; and combining the base representation with the scene element.
  • a system and/or apparatus for generating a base representation of a three-dimensional scene comprising: means for (e.g. a processor for) generating the base representation of a scene, the base representation having a first quality; means for (e.g. a processor for) inserting a trigger into the base representation, the trigger being associated with the generation of a scene element, the scene element having a second quality and the scene element being arranged to be combined with the base representation.
  • a system and/or apparatus for presenting a representation of a three-dimensional scene comprising: means for (e.g. a processor for) receiving a base representation of a scene, the base representation having a first quality; means for (e.g. a processor for) receiving and/or generating a scene element, the scene element having a second quality; and means for (e.g. a processor for) combining the base representation with the scene element.
  • a system and/or apparatus for generating a representation of a three-dimensional scene comprising: means for (e.g. a processor for) identifying a base representation of a scene, the base representation having a first quality; means for (e.g. a processor for) determining a scene element for combining with the base representation, the scene element having a second quality.
  • the system and/or apparatus comprises means for (e.g. a processor for) associating a trigger with the scene element, wherein the scene element is arranged to be combined with the base representation based on the triggering of the trigger.
  • the means for generating the base representation comprises an image generator device.
  • the means for receiving and/or generating the scene element comprises a display device, preferably a virtual reality headset.
  • the means for receiving and/or generating the scene element comprises a computer device connected to a display device.
  • the means for generating the scene element comprises a third-party device, preferably a server comprising a database of available scene elements.
  • the means for combining the base representation with the scene element comprises a display device, preferably a virtual reality headset and/or a computer device connected to the display device.
  • the means for generating the base representation is arranged to transmit the base representation to a further device, preferably to a display device.
  • the means for combining the base representation and the scene element comprises the display device.
  • the system comprises a display device.
  • a method comprising: generating (e.g. rendering) a base representation of the scene with a first rendering process at a first level of quality, the base representation including data to produce multiple points of view of the scene within a range of points of view (e.g. a “zone of view”, or “viewing zone”); and generating (e.g. rendering and/or presenting) a specific point of view of the scene at a second level of quality based on real-time information on the motion and orientation of the viewer within the zone of view.
  • the method may comprise processing the first rendering process at a first frame rate and the second rendering process at a second frame rate, the second frame rate being different from the first frame rate.
  • the second rendering process may produce a video of the evolution of the viewer’s point of view of the scene.
  • the video may be a stereoscopic video.
  • the second rendering process may produce video plus depth information.
  • the base representation is produced based on an off-line (e.g. not necessarily real-time) rendering process.
  • the view-point rendition of the base representation is generated using a two-pass rendering process including a first rendering process of multiple points of view producing a pre-rendered data set by means of a ray-tracing or a path-tracing process, not necessarily real-time or near real-time, and a second rendering process of the instantaneous point of view of the viewer by means of a real-time rendering process, such as a real-time rasterization and/or real-time ray-tracing process.
  • the scene element at the second level of quality is generated using a single real-time rendering process, such as a rasterization process or a real-time ray-tracing process.
  • the scene element may be generated by the same real-time rendering process (e.g. a second base representation rendering process) that produces the rendering of the view-point of the base representation.
  • the second rendering process includes a step of motion interpolation to produce the final rendering of the view-point of the scene at a frame rate that is different (e.g., higher) than the frame rate at which the pre-rendered data set was computed.
  • the pre-rendered data set includes data (e.g., motion information of specific scene elements) to support more accurate motion interpolation during the second rendering process.
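  • The two-pass idea described above can be pictured with the minimal sketch below: an offline first pass pre-renders data for candidate viewpoints within the viewing zone, and a real-time second pass resolves the viewer's instantaneous viewpoint from that pre-rendered data. The grid layout and nearest-viewpoint lookup are simplifying assumptions standing in for the ray-/path-traced first pass and the real-time second pass.

```python
# Hedged sketch of a two-pass rendering pipeline (not the disclosed algorithm):
# pass 1 runs offline over many candidate viewpoints; pass 2 runs per display
# frame and resolves the viewer's instantaneous point of view.
import math

def first_pass_offline(candidate_viewpoints):
    """Stand-in for an offline ray-/path-traced pass: viewpoint -> pre-rendered data."""
    return {vp: f"pre-rendered data for viewpoint {vp}" for vp in candidate_viewpoints}

def second_pass_realtime(pre_rendered, viewer_position):
    """Stand-in for the real-time pass: finish the nearest pre-rendered viewpoint."""
    nearest = min(pre_rendered, key=lambda vp: math.dist(vp, viewer_position))
    return f"final frame derived from: {pre_rendered[nearest]}"

# Offline: a coarse grid of candidate viewpoints covering the viewing zone.
grid = [(x * 0.5, 1.6, z * 0.5) for x in range(-2, 3) for z in range(-2, 3)]
pre_rendered = first_pass_offline(grid)

# Runtime: resolve the instantaneous viewpoint for each display frame.
frame = second_pass_realtime(pre_rendered, viewer_position=(0.3, 1.6, -0.2))
```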
  • the base representation enables a user to move about the scene so as to view the scene from different points of view (e.g., location and orientation in space at any one time), preferably to move about the scene with six degrees of freedom (6DoF).
  • 6DoF six degrees of freedom
  • the method of producing the real-time rendering of the view-point is carried out at a first device and generating the final rendition of the view-point of the scene comprises generating the video frames of the scene based on a transmission received from a second device, preferably wherein the transmission comprises the use of the base representation in an encoded format, more preferably a layered video encoding format, such as a video encoding enhanced with the MPEG-5 Part 2 Low Complexity Enhanced Video Coding (LCEVC) format.
  • LCEVC Low Complexity Enhanced Video Coding
  • generating the final rendition of the view-point of the scene comprises streaming to the first device the final rendering of the view-point based on a transmission from a second device, preferably wherein presenting the view-point of the scene comprises receiving the transmission, decoding the video frames, adjusting the video frames for presentation and presenting the view-point.
  • adjusting the video frames for presentation includes applying a reprojection to the decoded video frames, to account for latency in the transmission and to adapt the decoded video to the display frame rate.
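  • A deliberately simple form of the reprojection mentioned above is sketched below: the decoded frame is shifted horizontally to compensate for the change in the viewer's yaw since the frame was rendered. Real systems use depth-aware or full 3D reprojection; the yaw-only shift and the field-of-view value are simplifying assumptions.

```python
# Hedged sketch: yaw-only reprojection of a decoded video frame to account for
# head motion between render time and display time.
import numpy as np

def reproject_yaw(frame: np.ndarray, yaw_delta_deg: float,
                  horizontal_fov_deg: float = 90.0) -> np.ndarray:
    width = frame.shape[1]
    shift = int(round(yaw_delta_deg * width / horizontal_fov_deg))
    out = np.zeros_like(frame)
    if shift > 0:
        out[:, shift:] = frame[:, :width - shift]
    elif shift < 0:
        out[:, :width + shift] = frame[:, -shift:]
    else:
        out = frame.copy()
    return out

decoded = np.random.randint(0, 256, (1080, 1920, 3), dtype=np.uint8)
adjusted = reproject_yaw(decoded, yaw_delta_deg=2.0)   # yaw changed by ~2 degrees since render
```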
  • the method is carried out at a first device and generating the scene element comprises generating the scene element based on a transmission received from a further device, preferably wherein the transmission comprises an encoded version of the scene element.
  • the scene element is responsive to actions of another user in a different location, remotely interacting with the viewer of the scene.
  • generating the scene element comprises streaming the scene element based on a transmission from the further device.
  • the scene element comprises a video feed streamed from the further device.
  • the second device and the further device are different devices.
  • generating the view-point of the base representation comprises processing an initial version of the base representation data so as to generate the view-point of the base representation of the scene based on a perspective of a viewer of the scene.
  • generating the base representation comprises receiving a transmission containing the base representation, preferably an encoded version of the base representation, and generating the view-point of the base representation based on the received transmission.
  • generating the scene element comprises receiving data representing one or more actions of the viewer and generating the scene element based on the received data.
  • generating the scene element comprises receiving a transmission and decoding the transmission so as to obtain the scene element.
  • generating the scene element comprises selecting the scene element from a database, preferably a third-party database.
  • the base representation comprises image data, preferably encoded image data.
  • the base representation is encoded based on a low-complexity software encoding process; and/or the base representation comprises layered (e.g., tier-based, or hierarchical) data so that the base representation can be generated in different levels of quality.
  • the scene comprises one or more of: a part of a movie, a music video scene, a game, a shopping experience, a digital double, an industrial pre-visualisation, a design review, a sports experience.
  • a method of generating a base representation of a three-dimensional scene comprising: generating the base representation of a scene, the base representation having a first quality; inserting a trigger into the base representation, the trigger being associated with the display and/or generation of a scene element, the scene element having a second quality and the view-point of the scene element being arranged to be combined with the view-point of the base representation.
  • the method is performed at an image generating device, and the method further comprises transmitting the base representation to a display device.
  • the method further comprises processing and displaying the view-point of the base representation at the display device.
  • the method comprises generating (e.g. rendering and/or displaying) the scene element at a view-point rendering device, preferably comprising transmitting the scene element from a further device to the display device and combining the base representation and the scene element at the view-point rendering device.
  • Any apparatus feature as described herein may also be provided as a method feature, and vice versa.
  • means plus function features may be expressed alternatively in terms of their corresponding structure, such as a suitably programmed processor and associated memory.
  • the disclosure also provides a computer program and a computer program product comprising software code adapted, when executed on a data processing apparatus, to perform any of the methods described herein, including any or all of their component steps.
  • the disclosure also provides a computer program and a computer program product comprising software code which, when executed on a data processing apparatus, comprises any of the apparatus features described herein.
  • the disclosure also provides a computer program and a computer program product having an operating system which supports a computer program for carrying out any of the methods described herein and/or for embodying any of the apparatus features described herein.
  • the disclosure also provides a signal carrying the computer program as aforesaid, and a method of transmitting such a signal.
  • Figure 2 shows a computer device on which components of the system of Figure 1 may be implemented.
  • Figure 3 shows a method of generating and combining a base representation of a scene and a scene element.
  • Figure 5 shows a method of generating a scene element based on a trigger associated with the scene.
  • the system comprises an image generator 11, an encoder 12, a transmitter 13, a network 14, a receiver 15, a decoder 16 and a display device 17.
  • these components may each be implemented on separate apparatuses. Equally, various combinations of these components may be implemented on a shared apparatus; for example, the image generator 11, the encoder 12, and the transmitter 13 may all be part of a single image data generation device. Similarly, the receiver 15, the decoder 16, and the display device 17 may all be a part of a single image rendering device.
  • the system comprises at least one encoding computer device (e.g. a server of a content provider) and at least one rendering computer device (e.g. a VR headset).
  • Each computer device comprises one or more of: a processor 21 for executing instructions (e.g. so as to perform one or more of the steps of the various methods described below); a communication interface 22 for facilitating communication between computer devices (e.g. an Ethernet interface, a Bluetooth® interface, or a universal serial bus (USB) interface); a memory 23 and/or storage 24 for storing information and instructions (e.g. a random access memory (RAM), a read only memory (ROM), a hard disk drive (HDD), a solid state drive (SSD), and/or a flash memory); and a user interface 25 (e.g. a display, a mouse, and/or a keyboard) for enabling a user to interact with the computer device.
  • the computer device 20 may comprise further (or fewer) components.
  • the computer device e.g. the display device 17
  • the computer device may comprise one or more sensors, such as an accelerometer, a GPS sensor, or a light sensor. These sensors typically enable the computer device to identify an environmental condition and/or an action of a wearer of the display device.
  • the image generator 11 is configured to generate a sequence of image data (e.g. a sequence of image frames) to enable the display device 17 to use this image data to display a plurality of images.
  • the image data may comprise one or more digital objects and the image data may be generated or encoded in any format.
  • the image data may comprise point cloud data, where each point has a 3D position and one or more attributes. These attributes may, for example, include a surface colour, a transparency value, a point size and a surface normal direction. Each attribute may have a value chosen from a continuous range or may have a value chosen from a discrete set.
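  • The point cloud data just described could be represented along the lines of the sketch below; the concrete types and value ranges for each attribute are illustrative assumptions.

```python
# Hedged sketch of a point-cloud point carrying the attributes mentioned above:
# surface colour, transparency, point size and surface normal direction.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class CloudPoint:
    position: Tuple[float, float, float]    # 3D position
    colour: Tuple[int, int, int]            # discrete 8-bit RGB surface colour
    transparency: float                     # continuous value in [0.0, 1.0]
    point_size: float                       # continuous, in scene units
    normal: Tuple[float, float, float]      # surface normal direction

points = [
    CloudPoint(position=(0.1, 1.2, -3.0), colour=(200, 180, 150),
               transparency=0.0, point_size=0.01, normal=(0.0, 1.0, 0.0)),
]
```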
  • the image data enables the later rendering of images. This image data may enable a direct rendering (e.g.
  • the image data may comprise depth map data, where one or more pixels or objects in the image is associated with a depth that is specified by the depth map data.
  • the depth map data may be provided as a depth map layer, separate from an image layer.
  • the image layer may instead be described as a texture layer.
  • the depth map layer may instead be described as a geometry layer.
  • the image data for each image may include further information, which may be provided as a part of an image, e.g. as part of the point cloud data, or as separate layers.
  • the image data may include audio information or haptic feedback information indicating audio or haptics which can accompany displayed visual data.
  • An audio layer or haptic layer may accompany each image, and may be omitted for images where no accompanying audio or haptics are required.
  • the image data may indicate, or may be combinable with, a state of the virtual environment, a position of a user, or a viewing direction of the user.
  • the position and viewing direction may be physical properties of the user in the real world, or the position and viewing direction may be purely virtual, for example being controlled using a handheld controller.
  • the image generator 11 may, for example, obtain information from the display device 17 that indicates the position, viewing direction, or motion of the user. Equally, the image generator may generate image data such that it can later be combined with this position, viewing direction, or motion, where the image generator may generate a full scene which is only partially viewed by a user depending on the position of that user.
  • the encoder 12 is configured to encode frames to be transmitted to the display device 17.
  • the encoder may be implemented using executable software or may be implemented on specific hardware such as an ASIC.
  • the image generator 11 may transmit raw, unencoded, data through the network 14. However, such transmission typically leads to a high file size and requires a high bandwidth so that it is typically desirable to encode the data prior to the transmission.
  • the encoder 12 may encode the image data in a lossless manner or may encode the data in a lossy manner.
  • the encoder may apply inter-frame or intra-frame compression based on a currently-encoded frame and optionally one or more previously encoded frames.
  • the encoder may be a multi-layer encoder, such as a low complexity enhancement video codec (LCEVC)-enabled encoder.
  • LCEVC low complexity enhancement video codec
  • the encoder 12 may perform layered encoding on each instance of image data (e.g. each frame) to generate an encoded frame comprising a base depth map layer and an enhancement depth map layer. Encoding a depth map in this way may improve compression.
  • depth maps are desirably highly detailed with a bit depth of up to twelve or fourteen bits, which is a significant increase in the data to be transmitted.
  • providing ways to improve compression of the depth map can make more realistic depth map-based displays viable when performing rendering or transmission of rendered data in real-time.
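  • In the spirit of the layered depth-map encoding described above (and only in its spirit: this is not the LCEVC or VC-6 algorithm), the sketch below splits a high bit-depth depth map into a downsampled base layer and a residual enhancement layer from which the original can be reconstructed; a receiver that drops the enhancement layer can still upsample the base layer alone.

```python
# Hedged sketch of base + enhancement layering for a depth map. Illustrative
# only; the real layered codecs referenced in the text work differently.
import numpy as np

def encode_layers(depth: np.ndarray, factor: int = 2):
    base = depth[::factor, ::factor]                                   # base depth map layer
    upsampled = np.repeat(np.repeat(base, factor, axis=0), factor, axis=1)
    upsampled = upsampled[:depth.shape[0], :depth.shape[1]]
    enhancement = depth.astype(np.int32) - upsampled.astype(np.int32)  # residuals
    return base, enhancement

def decode_layers(base: np.ndarray, enhancement: np.ndarray, factor: int = 2) -> np.ndarray:
    upsampled = np.repeat(np.repeat(base, factor, axis=0), factor, axis=1)
    upsampled = upsampled[:enhancement.shape[0], :enhancement.shape[1]]
    return (upsampled.astype(np.int32) + enhancement).astype(base.dtype)

depth = np.random.randint(0, 4096, (1080, 1920), dtype=np.uint16)     # a 12-bit depth map
base, enhancement = encode_layers(depth)
assert np.array_equal(decode_layers(base, enhancement), depth)
```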
  • this type of layered encoding makes it easy to drop (and then pick back up) one or more of the layers, which provides flexibility and tools for bandwidth management.
  • Layered encoding is also helpful as the final decoder/user device (such as a user display device) can choose whether to process these extra layers.
  • otherwise, the best the end device (i.e. the receiver, decoder or display device associated with a user that will view the images) can do is to inform the controller/renderer/encoder that it does not have enough resources.
  • the controller then will send future images at a lower quality.
  • the end device still unfortunately has to process the higher quality data until the lower quality data arrives, if it can process the received images at all.
  • depth map data may be embedded in image data.
  • the base depth map layer may be a base image layer with embedded depth map data
  • the enhancement depth map layer may be an enhancement image layer with embedded depth map data.
  • the encoded depth map layers may be separate from the encoded image layers.
  • the encoded depth map layers can be dropped under some conditions while still retaining image layers that can be displayed (albeit with a lower level of realism).
  • the encoded depth map layers can be dropped by a transmitter or encoder when available communication resources are reduced, or can be dropped by an end device which lacks the processing resources to handle the highest level of quality.
  • the image data for some images comprises an audio base layer, a haptic feedback base layer, an audio enhancement layer or a haptic feedback enhancement layer, these can be processed or dropped flexibly.
  • the encoder may apply a point cloud data encoding technique such as described in European patent application EP21386059.6, which is incorporated herein by reference.
  • a point cloud encoder may act as a base encoder for a layered encoding technique such as LCEVC or VC-6.
  • LCEVC and VC-6 techniques encode and decode a layered signal, but are agnostic about the content type of data encoded in the signal.
  • the signal can include textures, video frames, geometry or depth data, meshes, point clouds, rendering attributes or physics engine attributes.
  • the transmitter 13 may be any known type of transmitter for wired or wireless communications, including an Ethernet transmitter or a Bluetooth transmitter.
  • the transmitter 13 may be configured to make decisions about how to transmit the image data, and/or may provide feedback to the encoder 12 or the image generator 11.
  • the transmitter may determine available communication resources (e.g. bandwidth) for transmitting image data, and may drop one or more layers from an encoded frame, or indicate to the image generator and/or encoder that image data should be generated and encoded with fewer layers, when insufficient bandwidth is available for transmission of all generated data.
  • the transmitter may be configured to drop a depth map layer, an LCEVC enhancement layer, or a VC-6 enhancement layer from a frame when insufficient communication resources are available.
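  • The transmitter-side decision described above could be approximated as in the sketch below: encoded layers are kept in priority order until an estimated per-frame bandwidth budget is exhausted, so optional enhancement and depth layers are the first to be dropped. The layer names, sizes, and priorities are illustrative assumptions.

```python
# Hedged sketch: greedy selection of encoded layers under a bandwidth budget.
def select_layers(layers, available_kbit_per_frame):
    """`layers` is a list of (name, size_kbit, priority); lower priority = more essential."""
    kept, budget = [], available_kbit_per_frame
    for name, size_kbit, _priority in sorted(layers, key=lambda layer: layer[2]):
        if size_kbit <= budget:
            kept.append(name)
            budget -= size_kbit
    return kept

frame_layers = [
    ("base image layer",        300, 0),
    ("image enhancement layer", 250, 1),
    ("base depth map layer",    100, 2),
    ("depth enhancement layer", 150, 3),
]
print(select_layers(frame_layers, available_kbit_per_frame=900))  # everything fits
print(select_layers(frame_layers, available_kbit_per_frame=450))  # enhancement layers dropped
```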
  • the network 14 provides a channel for communication between the transmitter 13 and the receiver 15, and may be any known type of network such as a WAN or LAN or a wireless Wi-Fi or Bluetooth network.
  • the network may further be a composite of several networks of different types. Many users only have access to a network with a bandwidth of 30 MBps, which can lead to latency jitter when streaming. The required bandwidth and the observed latency can be reduced by means of tactics such as forward-looking rendering and last-millisecond reprojection, which are enabled by improved compression.
  • the receiver 15 may be any known type of receiver for wired or wireless communications, including an Ethernet receiver or a Bluetooth receiver.
  • the decoder 16 is configured to receive and decode image data (e.g. to decode an encoded frame).
  • the decoder may be implemented using executable software or may be implemented on specific hardware such as an ASIC.
  • the image data is typically arranged to provide a warpable image for which a portion of the image that is displayed at the display device 17 is dependent on a position or orientation of a viewer.
  • the warpable image may then be rendered before the most up-to-date viewing direction of the user is known.
  • the warpable image may be transmitted to the display device, or the warpable image may be transmitted to a rendering node which is near to the display device, and the display device or rendering node may perform time warping to generate a displayed image portion based on the warpable image and the most up-to-date viewing direction and position of the user.
  • multiple rendering nodes may each provide separate image data to an image data assembling node; for example, each rendering node may provide a part of a sequence of frames to a frame assembling node.
  • the receiver 15, decoder 16 or display device 17 may be configured to assemble parts of image data from multiple sources to generate a sequence of images for display on the display device.
  • the image data assembling node may be separate from the receiver 15, decoder 16 and display device 17.
  • multiple rendering nodes may be chained.
  • successive rendering nodes may add to a sequence of image data as it passes from rendering node to rendering node, and eventually a complete sequence of image data is then provided to the receiver 15.
  • each rendering node may obtain components of a render from multiple upstream rendering nodes and/or distribute components of a render to multiple downstream rendering nodes.
  • a chain of rendering nodes may be useful for performing different rendering tasks that require different quantities of processing resources, or different frame rates.
  • a company may provide distributed processing in the form of a centralised hub which has abundant processing resources but is distant from users, and peripheral locations which have more scarce processing resources but are closer to users.
  • Expensive but fairly static rendering features such as background lighting or environmental impact on sound may be generated at the central hub (for example using ray tracing), while features that require fewer resources but faster responses or higher frame rates may be generated closer to the user.
  • the more responsive a rendering feature needs to be, the lower the latency required between the rendering node which generates the feature and the user display; in a chain of rendering nodes, the node which generates each rendering feature can therefore be chosen based on a required maximum latency of that feature.
  • a set of surfaces may be constructed where each surface has different sound reflection and absorption properties depending upon material and shape.
  • the frame rates may be matched by creating multiple frames with features generated at the lower frame rate, and combining them with the frames with features generated at the higher frame rate.
  • a preliminary rendering generates volumetric object data including motion vectors at a first (lowest) frame rate, then produces 2D rendered frames plus depth information for a specific user at a second (higher) frame rate, then transmits video plus depth data to the user device, which produces final frames for display via space warping (depth-based reprojections) at a third (highest) frame rate.
  • One or more of these steps may be performed in combination with the other described embodiments.
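  • The three frame rates just mentioned can be matched in the manner sketched below: for every display frame, the most recent output of each slower stage is reused (and, in a real system, reprojected). The specific rates of 15, 30, and 90 fps are illustrative assumptions.

```python
# Hedged sketch of frame-rate matching across a three-stage pipeline:
# volumetric pass (lowest rate) -> per-user 2D+depth pass -> display warp (highest rate).
def source_frame_index(display_frame: int, display_fps: float, source_fps: float) -> int:
    """Index of the latest source frame available when this display frame is shown."""
    return int(display_frame * source_fps / display_fps)

display_fps, user_pass_fps, volumetric_fps = 90.0, 30.0, 15.0
for display_frame in range(6):
    print("display frame", display_frame,
          "reuses 2D+depth frame", source_frame_index(display_frame, display_fps, user_pass_fps),
          "from volumetric frame", source_frame_index(display_frame, display_fps, volumetric_fps))
```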
  • the viewing position of the user may change as additional rendering tasks are performed at different rendering nodes in the chain. Each or any rendering node may obtain an updated viewing position before performing its respective rendering task.
  • the system may simultaneously generate multiple sequences of image data for different respective users or different respective display devices.
  • each user or display device may view a different 3D environment, or may view different parts of a same 3D environment.
  • each node may serve multiple users or just one user.
  • a starting rendering node (e.g. at a centralised hub) may serve a large group of users.
  • the group of users may be viewing nearby parts of a same 3D environment.
  • the starting node may render a wide field of view which is relevant for all users in the large group.
  • the starting node may send this wide field of view to a first middle rendering node which renders additional aspects of the 3D environment. These additional aspects may for example be aspects which require less processing power to render, or may be aspects which are specific to individual users of the group. Additionally, the middle rendering node may render features in a smaller field of view than the starting node - this smaller field of view may be relevant to each user rather than the group of users.
  • the first middle rendering node may additionally only serve a smaller number of users (e.g. half of the large group of users), with the remaining users being served by a second middle rendering node which also receives the wide field of view from the starting node.
  • the middle rendering node(s) may then send sequences of second partially or fully rendered frames to an end device for each user.
  • the end device may perform further processes such as warping or focal distance adjustments, optionally using depth map data.
  • each rendering node encodes the partially or fully rendered frames before transmitting them on to a next rendering node or to the receiver 15.
  • the required communication resources can be reduced when the rendering nodes are separated by one or more networks, or more generally are implemented in a distributed system such as a cloud.
  • each rendering node in a chain is encoding a different partially or fully rendered frame, with different data. Therefore, it may be advantageous for different rendering nodes to use different rendering formats and/or encoding formats.
  • the output from a first rendering node may be point cloud data which logically describes a 3D scene. This point cloud data can be encoded using the techniques of EP21386059.6.
  • a second rendering node may then operate on the point cloud data to generate image data that is more readily displayed by a generic display device, without requiring the display device to model the 3D environment. This image data may be encoded using video coding techniques.
  • the chaining of rendering nodes may be extended to arbitrary tree structures, where a rendering node obtains partially rendered frames from more than one preceding rendering node, and generates further partially or fully rendered frames based on the multiple obtained sequences of partially rendered frames.
  • a content rendering network comprising numerous rendering nodes may be used to serve a volumetric event to a large number of same-time users, such as users participating in a shared virtual environment. Rendering the same event separately for each user is far more expensive in terms of computation time and power consumption than rendering the volumetric event once and performing the rendering equivalent of multicasting it to multiple users.
  • each user may have a second rendering node (such as a VR headset), and the network may comprise a central first rendering node.
  • the first rendering node may render the volumetric event, and distribute partially rendered frames depicting the volumetric event to the different second rendering nodes.
  • the second rendering node for each user may then integrate the partially rendered frames depicting the volumetric event into a view of the virtual environment which is currently being shown to each user, based on parameters such as the user’s virtual position.
  • the receiver 15, decoder 16 and display device 17 may be consolidated into a single device, or may be separated into two or more devices.
  • some VR headset systems comprise a base unit and a headset unit which communicate with each other.
  • the receiver 15 and decoder 16 may be incorporated into such a base unit.
  • the network 14 may be omitted.
  • a home display system may comprise a base unit configured as an image data source, and a portable display unit comprising the display device 17.
  • the receiver 15 or another transmitter associated with the decoder or display device may send a corresponding layer drop indication back through the network 14.
  • the layer drop indication may be received by each rendering node.
  • a rendering node which generates partially or fully rendered frames for that specific decoder or display device may cease generating the dropped layer.
  • a rendering node which generates partially or fully rendered frames for multiple end devices may disregard a layer drop indication received from one end device (as the dropped layer is still needed for other devices).
  • rendering nodes which serve multiple end devices may record received layer drop indications, and may cease generating the dropped layer only when all end devices served by the rendering node indicate that the layer is to be dropped.
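  • The book-keeping implied above could be as simple as the following sketch: a rendering node serving several end devices records which devices have asked for a layer to be dropped and only stops generating that layer once every served device has done so. The class and method names are assumptions for illustration.

```python
# Hedged sketch: per-device tracking of layer drop indications at a rendering node.
from collections import defaultdict

class LayerDropTracker:
    def __init__(self, served_devices):
        self.served_devices = set(served_devices)
        self.drops = defaultdict(set)       # layer name -> devices that dropped it

    def record_drop(self, device_id: str, layer: str) -> None:
        if device_id in self.served_devices:
            self.drops[layer].add(device_id)

    def should_generate(self, layer: str) -> bool:
        # Keep generating unless *all* served devices have dropped this layer.
        return self.drops[layer] != self.served_devices

tracker = LayerDropTracker(served_devices={"headset-A", "headset-B"})
tracker.record_drop("headset-A", "depth enhancement layer")
assert tracker.should_generate("depth enhancement layer")       # headset-B still needs it
tracker.record_drop("headset-B", "depth enhancement layer")
assert not tracker.should_generate("depth enhancement layer")   # now safe to stop generating
```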
  • the encoders or decoders are part of a tier-based hierarchical coding scheme or format.
  • Hierarchical coding enables frames to be communicated with higher resolution and/or higher frame rate than is possible in single-tier coding schemes.
  • one or more enhancement layers are communicated with base data, where the enhancement layers can be used to up-sample the base data at the decoder, for example providing up-sampling in a spatial or temporal dimension.
  • hierarchical coding can overall provide lossless compression of data, with higher resolution and/or higher frame rate for a given transmission bit rate.
  • Examples of a tier-based hierarchical coding scheme include LCEVC: MPEG-5 Part 2 LCEVC (“Low Complexity Enhancement Video Coding”) and VC-6: SMPTE VC-6 ST-2117, the former being described in PCT/GB2020/050695, published as WO 2020/188273, (and the associated standard document) and the latter being described in PCT/GB2018/053552, published as WO 2019/111010, (and the associated standard document), all of which are incorporated by reference herein.
  • LCEVC MPEG-5 Part 2 LCEVC (“Low Complexity Enhancement Video Coding”)
  • VC-6 SMPTE VC-6 ST-2117
  • A further example is described in WO2018/046940, which is incorporated by reference herein.
  • a set of residuals are encoded relative to the residuals stored in a temporal buffer.
  • LCEVC Low-Complexity Enhancement Video Coding
  • Low-Complexity Enhancement Video Coding is a standardised coding method set out in standard specification documents including the Text of ISO/IEC 23094-2 Ed 1 Low Complexity Enhancement Video Coding published in November 2021, which is incorporated by reference herein.
  • the system above is suitable for generating a representation of a scene (e.g. using the image generator 11) and presenting this representation to a user (e.g. using the display device 17).
  • the scene typically comprises an environment, where the user is able to move (e.g. to move their head and/or to turn their head) to look around the environment and/or to move around the environment.
  • the scene may be a scene of a room in a building, where the user is able to move around the room (e.g. by moving in the real-world and/or by providing an input to a user interface) in order to inspect various parts of the room.
  • the scene is arranged to be viewed and/or experienced using XR (e.g. VR) technology, such as a virtual reality headset, where the user is then able to move about the scene in three degrees of freedom (3DoF) or six degrees of freedom (6DoF) so as to experience the scene.
  • XR e.g. VR
  • 3DoF three degrees of freedom
  • a ‘scene’ as described herein relates to an environment and/or an event that can be represented to a viewer to enable a viewer to experience the scene.
  • the scene may comprise a real or virtual location
  • the scene may comprise an event, such as a concert
  • the scene may comprise a scene of a movie or a TV show.
  • the methods described herein relate to the generation and presentation of a representation of the scene.
  • via a virtual reality headset, it is possible to provide a more immersive experience to a viewer than is possible with a two-dimensional display (that may, for example, be viewed on a television).
  • generating and presenting this three-dimensional representation typically requires an increased amount of hardware or software as compared to a two-dimensional representation.
  • the present disclosures decrease the burden placed on a viewer of a three-dimensional representation.
  • Presenting a representation of the scene may comprise presenting a representation directly or processing a received representation prior to the presenting of a processed representation.
  • the representation may be generated offline at the image generator 11 before being transmitted to the display device 17.
  • the representation may be processed at the image generator prior to transmission (e.g. to provide to the display device an image that is suitable for displaying to the user) or the representation may be processed at the display device to determine this image for display.
  • Presenting the representation encompasses embodiments in which the representation is suitable for being shown to the user without further processing and also embodiments in which the representation is processed prior to being displayed to the user (e.g. presenting the representation may comprise determining a rendition of the representation and the presenting of this rendition).
  • references herein to the playback and modification of a ‘scene’ should be understood to encompass playback and modification of a representation of this scene, which representation comprises at least a base representation of the scene and, optionally, further elements such as a scene element that may be combined with the base representation to provide the representation.
  • There may be a plurality of different representations of the scene (e.g. viewed by different viewers or devices), so that the playing of the scene may be viewed via these representations; but the representations may not show every aspect of the scene (e.g. the representations may relate to an obscured view).
  • playing the scene and playing a representation of the scene are essentially equivalent for a viewer of the representation.
  • an aspect of the present disclosure provides a method of combining a base representation of a scene with a scene element (or a further element, or a further object).
  • this comprises providing a high-quality, immersive, three-dimensional, base representation of the scene (that is generated using a process that requires substantial time and/or a powerful computer device) that is combinable with a lower quality scene element (that is generated in a shorter time, e.g. in real-time, or that is generated using a less powerful computer device).
  • a computer device, such as the image generator 11, generates a base representation of a scene.
  • the scene may be a depiction of a real scene (e.g. a depiction of a particular building, or a particular event).
  • the scene may be a depiction of a virtual scene (e.g. a computer-generated scene of another planet or an animated scene).
  • the base representation enables a user to view the scene and/or to move about the scene so as to experience the scene.
  • the base representation may comprise a high quality volumetric video. Therefore, a user is able to move around the (base representation of the) scene so as to view the scene from different viewpoints/perspectives.
  • the 'base representation’ is a representation of the scene (e.g. a video showing the scene) that can be rendered or displayed by the display device.
  • the base representation may be a video file or may be a sequence of images.
  • the base representation comprises an alternate format that can be converted into a video file.
  • the base representation may comprise a point cloud of the scene, which point cloud enables images to be generated in dependence on a location and/or orientation of the user.
  • the base representation comprises an (e.g. encoded) video file that can be decoded and rendered to display the scene to a viewer using the display device 17.
  • the rendering of this base representation may depend on a characteristic of the display device 17, e.g. the base representation may be rendered in dependence on a display frame rate of the display device.
  • Generating the base representation may comprise rendering the base representation and/or may comprise identifying or decoding a file associated with the base representation.
  • Various devices may generate a version of the base representation where, for example, the image generator 11 may initially generate the base representation based on an input model of a scene, with this initially generated base representation then being rendered and encoded.
  • the display device 17 may then receive this encoded representation and generate the base representation by decoding the encoded version of the base representation.
  • the base representation has a first quality, which is typically a high quality.
  • the base representation may be generated using a ray-tracing process or a machine learning process (e.g. as described in WO 2016/061640 A1, which is incorporated herein by reference).
  • the base representation enables a user to view the scene at a high frame rate and/or at a high resolution in order to provide an immersive experience.
  • the base representation may have a resolution of at least 4K, at least 8K, and/or at least 16K (overall or per eye).
  • the generation of the base representation typically comprises generating image data that enables a user to view the base representation, where this image data can then be transmitted, via the network 14, to the display device 17.
  • the base representation (and more generally the scene) may be played in order for a user to see and/or interact with the scene.
  • This typically comprises a user selecting the scene for playback on the display device 17.
  • the display device typically enables the user to control this playback and/or to interact with the scene (e.g. to pause, speed up, slow down, or skip through the playback; to choose a scene to view; or to select an item from within a scene being played).
  • a computer device, such as the image generator 11 or the display device 17, generates a scene element that can be combined with the base representation in order to provide a combined representation of the scene.
  • the scene element may, for example, comprise a modification to the base representation or a supplement to the base representation.
  • the scene element may comprise a filter or an overlay, which may be used to modify the base representation to depict, for example, fog passing through the scene or to modify an atmosphere of the scene (e.g. to lighten or darken the scene, or to provide a sepia filter).
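  • As a concrete, hedged example of such a filter-type scene element, the sketch below applies a sepia tone to a frame of the base representation at combination time; the sepia matrix is a common approximation and its use here is an assumption for illustration only.

```python
# Hedged sketch: a sepia filter applied as a scene element over a base frame.
import numpy as np

SEPIA = np.array([[0.393, 0.769, 0.189],
                  [0.349, 0.686, 0.168],
                  [0.272, 0.534, 0.131]])

def apply_sepia(frame: np.ndarray, strength: float = 1.0) -> np.ndarray:
    """Blend the original frame with its sepia-toned version; strength in [0, 1]."""
    toned = np.clip(frame.astype(np.float32) @ SEPIA.T, 0, 255)
    blended = (1.0 - strength) * frame.astype(np.float32) + strength * toned
    return blended.astype(np.uint8)

base_frame = np.random.randint(0, 256, (1080, 1920, 3), dtype=np.uint8)
combined = apply_sepia(base_frame, strength=0.8)
```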
  • the scene element may comprise an interactive element, such as a target, where a viewer of the scene is able to click on the target to perform an action. More generally, the scene element may be able to provide an output in dependence on an interaction with the scene element, e.g. to change an attribute of the scene element, of the base representation, or of another scene element, or to alter a feature of the playback of the scene.
  • the scene element may comprise an object that may be either a virtual object or a representation of a real object; for example, a real object may be imaged using a camera and the resultant image may be used as a basis for the scene element.
  • the scene element may comprise an avatar and/or a personalisation, where the avatar may be an avatar of a viewer of the scene or of a further person - and this avatar may be selected in dependence on a viewer of the scene (e.g. based on a user profile of this viewer that is accessed by the display device).
  • Such embodiments enable a viewer of a scene to picture themselves within the scene and/or to interact with an avatar within the scene.
  • the scene element may comprise a static object, where the object does not change after rendering.
  • the scene element may be arranged to move (e.g. rotate or translate) after rendering.
  • the scene element may be a time-varying object or an animated object where an attribute of the object (e.g. a size, shape or colour) changes after the rendering of the object.
  • the scene element typically has a second quality, which second quality is typically lower than the first quality of the base representation.
  • the scene element may be generated using a rasterization process, which rasterization process involves transforming and modifying a three-dimensional model before converting this model into a two-dimensional image for display.
  • the scene element has a lower frame rate and/or resolution than the base representation.
  • the scene element may, for example, have a resolution of no more than 8K, no more than 4K, and/or no more than 2k.
  • displaying the representation of the scene may comprise altering a position/feature of the scene element for non-adjacent frames (e.g. if the base representation has a frame rate of 60fps and the scene element has a frame rate of 30fps, the scene element may only be updated once every two frames of the base representation).
  • the scene element may be upsampled, upscaled, or processed for display within the base representation, where this may involve displaying a value for a single ‘pixel’ of the scene element on a plurality of pixels of the display.
  • the combining of the scene element with the base representation may comprise a step of motion interpolation so as to produce the final rendering of the scene element, where this enables a low frame rate scene element to be combined with a higher frame rate base representation.
  • the scene element comprises a real-time element, which real-time element is generated during the playback of the scene.
  • the real-time element may be generated and displayed (effectively) in real-time and/or may be generated based on a current situation of the viewer or the display device at the time of generation of the real-time element.
  • the scene element may be generated during the playback of the (base representation of the) scene as opposed to being generated prior to the playback of the (base representation of the) scene.
  • the base representation may be displayed based on a two-pass rendering process that includes a first rendering step of rendering multiple points of view prior to the displaying of the base representation and then a second rendering step that occurs at or near the time of display so as to render the scene for viewing by the viewer.
  • the scene element may be combined with the base representation at this second rendering step so as to provide a representation for viewing by the viewer.
  • this scene element may be generated and/or presented less than five seconds, less than one second, less than a tenth of a second and/or less than a hundredth of a second after the triggering of a trigger that prompts the generation of the scene element (where the time taken may depend on an available processing power).
  • the base representation may be generated greater than one day, greater than one week, and/or greater than one month before the presenting of the scene.
  • the base representation of the scene is generated prior to the playback of the scene and/or the scene element is generated during the playback of the scene.
  • the scene element may be arranged to be generated at a rate that is equal to (or greater than) a rate of playback of the scene (e.g. so that the scene element may be generated and presented in real time).
  • the scene element may be generated at 60 frames a second (so as to be displayed in real time)
  • the base representation may be arranged to be generated at a rate that is slower than the rate of playback of the scene (e.g. so that the base representation must be generated prior to the playback of the scene).
  • the base representation and the scene element being ‘arranged’ to be generated at certain rates may comprise a quality of these components being chosen to enable this generation.
  • the scene element comprises a live or near-live element, where the scene element may be a streamed video or the scene element may be formed based on a video of an event with a slight delay, which delay provides time for processing the video to form the scene element.
  • the base representation and/or the scene element is streamed at the display device.
  • Streaming the scene element, or streaming the base representation typically connotes presenting the scene element/base representation shortly after receipt of a transmission containing the scene element/base representation. This may involve receiving an encoded version of the scene element/base representation in a transmission, then decoding this encoded version of the scene element/base representation at the display device, and then presenting the scene element/base representation.
  • Streaming enables the scene element/base representation to be transmitted and displayed essentially simultaneously (with some small delay to allow for the decoding to occur and also, optionally, some buffering being performed) so that the base representation/scene element does not need to be downloaded prior to the playback of the scene.
  • the base representation is arranged to be downloaded prior to the playback of the scene with the scene element being streamed at the time of playback.
  • the base representation is downloaded to the display device 17 (or a connected device) prior to the playback of the scene and the scene element is streamed to the display device from a further device; or the scene element is downloaded to the display device prior to the playback of the scene and, optionally, combined with the base representation at the time of presenting the scene (e.g. the scene element may be selected from a database of possible scene elements, which database is stored on the display device and/or a connected device).
  • any one or more of the base representation and the scene element may be streamed to the display device 17 and any one or more of the base representation and the scene element may be downloaded to the display device prior to playback of the scene (e.g. so that the base representation may be streamed with the scene element being downloaded and/or generated in real time, or the scene element may be streamed with the base representation being downloaded prior to the presentation of the scene, or each of the base representation and the scene element may be streamed).
  • the scene element may provide a third-party modification to the base representation
  • the base representation may provide a background on which third-parties (that were not involved in the generation of the base representation) are able to supplement this background with scene elements.
  • This enables a viewer of the scene to personalise the scene in a desired way. For example, the viewer may be able to insert an avatar into the scene where this avatar mirrors the real-time movements of the user.
  • the base representation may provide a background onto which numerous third parties are able to impose different scene elements. This provides a versatile representation of a scene, where a single image generating party (that has access to powerful computer devices) is able to provide a base representation that can be used by numerous modifying parties (that may not have access to such powerful computer devices).
  • the scene element may be an interactive element, where the user is able to interact with this element in real time in order to add functionality to the base representation.
  • Handling real-time interactions with an element of an XR scene typically requires large amounts of computing power and so providing a real-time interaction with a high-quality scene element may require an amount of computing power that is not available to many devices/users.
  • By providing the scene element in lower quality than the base representation it becomes possible to provide a high quality background (e.g. generated using an off- site server) and to combine this with a lower-quality scene element (e.g. generated using a personal computer or a smartphone) to provide a user with an immersive experience that is still personalised and updated in real-time.
  • the scene element may be generated so that a user is able to interact with the scene element via a user input, e.g. using one or more of: eye tracking, gesture tracking, and speech.
  • the display device 17 may comprise a tracking sensor, a camera, and/or a microphone.
  • the scene element may be generated for only a portion of the scene, e.g. based on a current viewpoint and/or a current perspective of a user of the display device 17. This enables the scene element to be generated only for a portion of a scene that is being viewed by the user. Such generation may reduce the processing power required to generate and render the scene element.
  • the portion of the scene may be determined automatically, e.g. based on a sensor of the display device. Equally, the portion of the scene may be determined based on a user input or based on a feature of the base representation where, for example, a generator of the base representation is able to define an area onto which a scene may be imposed.
  • the first step 31 and the second step 32 of the method of Figure 3 are typically carried out at different computer devices, where the computer device (or computer devices) used to generate the base representation are typically more powerful than the computer device (or computer devices) used to generate the scene element.
  • the base representation comprises (or is associated with) a trigger, which trigger is arranged to prompt the generation or display of the scene element.
  • This trigger may, for example, be a time, a condition, or a viewing perspective that is contained in the image data that forms the base representation.
  • a generator of the base representation is able to (at least partly) control the generation of the scene element and to ensure that the scene element and the base representation of the scene are synchronized.
  • the base representation comprises a trigger that is associated with a user action where, for example, an input being provided via a user interface results in the generation or rendering of the scene element.
  • This input may be associated with a specific area of the scene, an interaction with an object in the scene, or an input in a menu associated with the scene.
  • the scene may be associated with various different elements of interactive media content, where the user is able to select an option from a list of possible operations and the scene element is rendered based on the selected option (e.g. the user may select a scene element to be rendered based on a list of possible scene elements).
  • the base representation comprises a trigger (e.g. a trigger point) that activates at a certain point (e.g. a certain frame) of the base representation, e.g. one minute into the scene, and that initiates the generation or rendering the scene element.
  • the scene may comprise a mirror that comes into view at this point, and then there may be superimposed onto the mirror, in real-time, a representation of a viewer of the scene.
  • the base representation may also comprise a trigger so that, if a user turns their head to view the mirror at any point during the playback of a scene, then the real-time representation of the viewer is generated and superimposed over the mirror.
  • the second quality being lower than the first quality comprises one or more of: the second quality having a lower resolution than the first quality (e.g. the scene element having a lower resolution than the base representation); the second quality having a lower frame rate than the first quality; and the second quality having a lower colour range than the first quality.
  • the first quality is associated with a ray tracing process (e.g. the base representation is generated using a ray tracing process).
  • the second quality is associated with a rasterization process (e.g. the scene element is generated using a rasterization process).
  • the base representation is generated for a first range of viewpoints of a scene (e.g. the base representation may be generated for each point within a viewing zone, as described further below) so that a user is able to view the base representation from a plurality of viewpoints (e.g. the user is able to view the base representation as the user moves through the scene).
  • the scene element is generated for only a limited range of viewpoints of the scene (this range typically being less than a range of viewpoints for which the base representation is generated). This may lead to the scene element being visible from only a limited range of viewpoints, or the scene element being distorted when the user moves away from the second range of viewpoints.
  • the scene element (and/or the base representation) is generated so as to be viewed from an expected viewpoint or perspective.
  • the scene element may be visible only when the user is facing a predetermined portion of the base representation.
  • the base representation may be formed of portions of different quality, where the base representation may have a higher quality at an expected viewpoint or an expected perspective, where this encourages the user to move to this viewpoint/perspective (e.g. the base representation may have a higher quality when the user is looking in a ‘forwards’ direction as compared to when the user is looking in a ‘backwards’ direction).
  • the scene element comprises a real element that is captured using a camera, e.g. a camera of the display device 17. This enables the personalisation of a scene based on a real element, e.g. a real element that is in the vicinity of a viewer of a scene.
  • the base representation and the scene element may be generated at the same time or may be generated using the same device (e.g. the image generator 11), where this still enables the presentation of a relatively high quality base representation with a relatively low quality scene element. This reduces the amount of processing power and/or bandwidth required by the display device to display the scene as compared to an implementation in which each element of the scene is generated with a high quality.
  • the scene element may be included in a layer and/or an enhancement scene that is sent in association with the base representation. That is, the image generator 11 may generate both the base representation and the scene element and send these two features as part of a transmission to the display device 17 where, depending on the capabilities of the display device, a scene element of an appropriate quality is rendered (e.g. for less capable display devices a lower quality scene element may be rendered at the display device).
  • the scene element is an object that is in an environment of a wearer of the display device 17.
  • where the object is another person in the environment of the wearer, an image of this person may be captured using a camera and the scene element may be generated based on this image. This enables, for example, two users that are each wearing display devices to view a scene together and to interact with each other while viewing the scene.
  • the scene element may be generated based on an object in the environment, where this is particularly beneficial for augmented reality implementations.
  • the base representation may be combined with a scene element that reflects a current environment of a user to provide an immersive experience in a scene that includes features of the real-world.
  • a chair that is in the vicinity of the wearer of the display device 17 may be captured by a camera of the display device and a scene element may then be generated to represent this chair, with the scene element being combined with the base representation (e.g. to place an overlay over a real chair to present an augmented version of the chair, such as a throne).
  • a user is then able to sit on the scene element in the virtual scene in order to sit on the chair in the real world and so the user can take part in an immersive experience while interacting with real-world objects in a way that does not break the immersion.
  • the scene element relates to a further wearer of a further display device, where this further wearer may (or may not) also be viewing the scene (or the base representation).
  • This further wearer may be in the same real-world environment as the wearer of the display device or may be in a different real-world environment (e.g. where the two wearers are both dialled in to a meeting from different real-world locations).
  • the (base representation of the) scene comprises a three-dimensional scene.
  • the (base representation of the) scene may provide further dimensions (e.g. four, five, or more dimensions), where these other dimensions may relate to physical effects, time effects, etc.
  • the base representation is generated so as to enable movement of a viewer around the scene.
  • the base representation may enable a user to walk around the room so as to view the room from different angles.
  • the base representation may be generated in order to enable six degree-of-freedom (6DoF) movement through the scene, where this aids in the provision of an immersive experience for a viewer (and where this reduces any motion sickness effect that may occur for a user of a VR scene).
  • generating a base representation that enables such movement requires the base representation to enable viewing from each point within that scene and so requires a substantial file size.
  • the base representation is associated with one or more viewing zones (or zones of view, or viewing volumes), where the base representation enables a user to view the scene in high quality only from within the viewing zones and/or enables a user to move freely (e.g. with six degrees of freedom) only within the viewing zones.
  • the viewing zones have a limited volume, which volume is less than a volume of the scene (e.g. the volume may be less than 50% of the volume of the scene, less than 20% of the volume of the scene, and/or less than 10% of the volume of the scene).
  • An example of such a viewing zone is illustrated in Figure 4, which shows a scene 41 that contains a (first) viewing zone 42.
  • the viewing zone enables a viewer to move from a first position 43 in the viewing zone to a second position 44 in the viewing zone so as to view the scene from these different viewpoints/perspectives.
  • Figure 4 further shows a second viewing zone 45, where a user may be able to move between the first and second viewing zones.
  • viewing zones are typically implemented as three-dimensional volumes (and viewing zones may also be four-dimensional, where a three-dimensional location of the viewing zone changes over time). It will be appreciated that viewing zones may be formed in any size or shape, with different sizes and shapes being suitable for different scenes.
  • the use of the viewing zones enables a base representation to be generated without the requirement to see or consider every single point within the scene.
  • the portion 46 of the scene 41 of Figure 4 is obscured or occluded behind a wall for all of the points within this viewing zone 42. Therefore, the image data that forms the base representation does not need to include rendered data for this occluded portion of the scene. This reduction in the amount of data that is required enables the generation of a high-quality immersive scene with a much lower processing and storage requirement than if the entirety of the scene were to be rendered in high quality while allowing full movement.
  • the base representation may be generated so as to show nearby objects (that are near to the boundaries of the viewing zone) in greater detail than distant objects (that are far from the boundaries of the viewing zone). This may involve the base representation being generated using (real or virtual) scanners that are set up at the boundaries of the viewing zone, with point field data being obtained based on beams emitted from these scanners (and with the scanners being arranged to emit beams at regular angles).
  • the base representation is generated based on a three-dimensional model, where the model may comprise the occluded portion but the generation of the base representation is such that the base representation does not include the occluded portion. Therefore, there is a loss of information moving from the model to the base representation (and a corresponding reduction in file information and size), but since the occluded portion cannot be seen from within the viewing zone the viewer of the base representation is unable to identify this loss of information.
  • the base representation is streamed to the display device (or a computer device that is in the vicinity of the display device) from a separate server (e.g. over the Internet), with the scene element being generated on the display device or the proximate computer device. This reduces the amount of information that needs to be transmitted to the display device to present the scene since a portion of the scene (the scene element) is not being streamed.
  • Present VR systems typically require the use of a specialised computer device, e.g. a gaming computer, where viewing an immersive scene requires this scene to be pre-downloaded to the computer, sent to a VR headset over a wired connection, and then continuously processed by the computer in order to warp the scene appropriately based on the actions of a viewer.
  • the present disclosure opens the possibility of the base representation being generated on a server with substantial processing power before being streamed, over the Internet, to a display device that is relatively low-powered (e.g. a smartphone or a standalone VR headset). This display device may then generate the (lower quality) scene element and combine this element with the base representation.
  • alternatively, another specialised computer device may generate the scene element and stream this scene element to the display device.
  • the base representation typically comprises one or more viewing zones 42, 45, which viewing zones provide a high quality representation of the scene while enabling movement in 6DoF within these viewing zones (so that the viewer can experience the scene from a plurality of different viewpoints). Enabling this movement may comprise generating a point field (as has been described above) where this enables an image to be generated for each possible position and orientation of a user within the viewing zone.
  • the display device 17 is able to determine a location and orientation of the user and to present an appropriate image to the user based on this location and orientation.
  • the display device may transmit the location and orientation to the image generator 11 so that the image generator can generate the appropriate image and transmit this image to the display device for presentation at the display device.
  • the base representation may comprise image data and/or point cloud data that is useable to render images for a plurality of different locations and orientations, where the display device is then able to render the appropriate image from this image data based on the location and orientation.
  • the volume of the viewing zone is such that a user is able to move within the viewing zone in order to view the scene, while still only enabling a limited amount of movement (where this leads to a smaller file size as compared to an implementation where a user is able to fully move about the scene).
  • the (or each) viewing zone 42, 45 is typically arranged to be of a limited size, where this provides an immersive experience within this limited-size viewing zone while reducing the amount of processing power/bandwidth required to provide this experience (as compared to a scene in which free movement is possible throughout the scene).
  • the viewing zones are arranged to enable a user to move their head while they are sitting or standing, but not to freely roam around a room.
  • the or each viewing zone may also have a minimum size, e.g. the or each viewing zone may have a volume of at least 1% of the volume of the scene, at least 5% of the volume of the scene, and/or at least 10% of the volume of the scene. Similarly, the or each viewing zone may have a volume of at least one-thousandth of a cubic metre (0.001 m3); at least one-hundredth of a cubic metre (0.01 m3); and/or at least one cubic metre (1 m3).
  • the ‘size’ of the viewing zone typically relates to a size in the real world, wherein if the viewing zone has a length of one metre this means that a user is able to move one metre in the real world while staying within the viewing zone.
  • the size of the viewing zone in the scene may be greater than, equal to, or less than the size of the viewing zone in the real world.
  • the viewing zone may scale a real world distance so that moving one metre in the real world moves the user less than (or more than) one metre in the scene. This enables the scene to provide different perceptions to the user (e.g. to make the user feel larger or smaller than they are in real life).
  • the viewing zone may scale a real world angle so that rotating one degree in the real world rotates the user less than (or more than) one degree in the scene.
  • while the viewing zones shown in Figure 4 are rectangular (in two dimensions), more generally the viewing zones may be any shape, e.g. cuboid, spherical, ovoid, etc.
  • the scene is associated with a plurality of viewing zones 42,45, where each viewing zone is associated with a different base representation.
  • These different base representations may have different qualities, where the quality may be associated with a size of the viewing zone and/or a perspective of the viewing zone. Equally, the quality may be associated with a perceived importance of the viewing zone. This enables a balance to be struck between the processing power/bandwidth required to provide the scene and the immersivity of the experience, where certain perspectives of a scene may benefit more from a higher quality (e.g. a higher resolution or frame rate) than other perspectives.
  • the quality of a scene element that may be generated for combination with the base representations may depend on the quality of that base representation so that different viewing zones may enable the generation of different scene elements and/or of scene elements of different qualities.
  • the generation and/or the presentation of the scene element is dependent on the triggering of a trigger, where the trigger may be a part of the base representation and/or may be associated with the base representation (e.g. the trigger may be a part of the image data that forms the base representation).
  • a computer device such as the display device 17 begins presentation of the scene based on the base representation of the scene.
  • the computer determines the triggering of a trigger.
  • this trigger is a part of the base representation, which trigger defines a condition for the generation of a scene element.
  • a computer device such as the display device or the image generator 11 generates the scene element.
  • the scene element is generated in real-time and is combined with the base representation with the combined representation then being presented to a user of the display device.
  • the trigger comprises a contextual trigger, where the trigger is dependent on a context of a viewer of the base representation and/or of the display device 17 that is displaying the base representation.
  • the contextual trigger may be dependent on a location, a time, an environmental condition, the weather, a condition of a viewer of the base representation, a number of viewers that are presently viewing the base representation, etc.
  • the trigger comprises an activated trigger, where a viewer is able to interact with the base representation (e.g. via a user interface) to activate the trigger. For example, the viewer may be able to click on a specific section of the base representation in order to activate the trigger (and generate the scene element).
  • the triggering of the trigger may be determined by a sensor of the display device 17, for example a light sensor, a temperature sensor, an accelerometer, and/or a GPS sensor. Equally, the triggering of the trigger may be determined by a user interface of the display device (e.g. based on a user input). Equally, the triggering of the trigger may be determined by a processor. For example, the trigger may be associated with a frame of the base representation being displayed (e.g. the video being 20% complete) or based on a user looking at and/or interacting with a portion of the scene; the processor may determine that such an in-scene trigger condition has been triggered.
  • the scene element may be generated in dependence on the trigger and/or on a condition at a time when the trigger is triggered.
  • the scene element may be selected from among a database of possible scene elements based on the environment of the display device at the time of generating the scene element and/or based on an active user profile of the display device at the time of generating the scene element.
  • the base representation is associated with a plurality of triggers, where the scene element is generated in dependence on which of these triggers is triggered.
  • the base representation may comprise a plurality of triggering areas, where a viewer looking at, or interacting with, any of these triggering areas triggers the generation of a scene element.
  • the scene element that is generated may depend on the triggering area with which the viewer has interacted.
  • each triggering area may be associated with a different object, where the user is then able to generate a desired object (as the scene element) by interacting with the corresponding triggering area.
  • the scene element may be dependent on a condition at the time of the triggering of a trigger where, for example, the trigger may depend on a progress of the viewer through the scene and the scene element may then be generated in dependence on a sensor reading. For example, when a viewer is 20% of the way through the playback of a scene, the scene element may be generated (automatically) based on an environmental condition that is sensed at the time of triggering the trigger.
  • the scene element may comprise a weather filter that is generated at a certain time in a scene (e.g. to depict rain, clouds, or sunshine within the scene) where this enables a scene to be modified based on a current condition of a user.
  • the scene element (e.g. the quality of the scene element) is dependent on a capability of the display device 17, e.g. the hardware, software, and/or a condition of the display device.
  • the scene element may be generated in dependence on an available bandwidth at a communication interface of the display device.
  • the capability of the display device may be determined prior to the displaying of the scene and/or at the time of triggering the trigger, where, for example, the available bandwidth of the communication interface may be determined when the trigger is triggered.
  • This enables more powerful or capable devices to render scene elements with higher qualities and in this way enables balancing between accessibility and quality.
  • Devices with low processing power are able to view the scene with a lower-quality scene element with comparatively high processing power devices being able to view the scene - e.g. with the same base representation - with a higher-quality scene element.
  • the base representation and/or the scene element may comprise layered image data, where this representation and element may be generated or presented in dependence on a capability of the display device. For example, the quality of each of the base representation and the scene element may be determined in dependence on this capability, where a base level of the representation and/or scene element may be combined with one or more enhancement layers depending on the capability of the device (a minimal, non-limiting sketch of such capability-dependent selection is given after this list).
Multiple viewers
  • the scene element generated for a first viewer may be associated with one or more other viewers.
  • the base representation may provide a high-quality meeting environment, where each of the viewers of the scene is a participant in the meeting.
  • the scene element generated for the first viewer may then be an avatar of another participant in the meeting.
  • each of the viewers may be shown the same base representation or each of these viewers may be shown a different base representation.
  • each of the viewers may be shown the same base representation (e.g. a schematic being discussed in the meeting) so as to see this schematic from the same perspective.
  • equally, each of the viewers may be shown a different base representation so as to see this schematic from a different perspective.
  • the base representation may provide an immersive background for a meeting, where a scene element can then be generated based on an input from a first viewer with each viewer thereafter being able to see the scene element.
  • the scene element comprises an interactive element
  • the scene element may be arranged so that each viewer is able to interact with the scene element or so that only a subset of the viewers are able to interact with the scene element.
  • the base representation may be associated with an object to be discussed, where each user may view a meeting room in low quality (e.g. in real-time, where the avatars of other users are shown in the meeting room) and then may be able to step into the base representation to view the object before returning to the meeting room.
  • the base representation may be arranged to be transmitted to a plurality of users and/or a plurality of display devices 17.
  • Each base representation may be arranged to be combinable with the same scene element(s).
  • the base representation may be arranged to be combinable with different possible scene elements. This can enable the personalisation of the scene, where different scene elements are shown to different users.
  • the scene is typically arranged to be viewed using an extended reality (XR) technology, where the user may be presented with a representation of a real scene or a digital scene that contains one or more digital elements.
  • extended reality covers each of virtual reality (VR), augmented reality (AR), and mixed reality (MR) and it will be appreciated that the disclosures herein are applicable to any of these technologies.
  • the disclosures herein may be applied to numerous contexts.
  • the scene may comprise or be a part of a movie, a music video, a game, a shopping experience, a sports experience, etc.
  • the scene, and either or both of the base representation and the scene element may be computer generated. Equally, the scene, and either or both of the base representation and the scene element, may be captured using a sensor such as a camera.
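
By way of illustration only, the following is a minimal, non-limiting Python sketch of the capability-dependent selection referred to in the list above: a scene-element quality tier is chosen from the (assumed) bandwidth, processing power, and battery state of the display device, while the base representation is left untouched. The tier thresholds, field names, and functions are assumptions made for this example and are not taken from the disclosure.

```python
# Minimal sketch (illustrative only): choosing a scene-element quality tier
# from the capabilities of a hypothetical display device. The tier names,
# thresholds, and DeviceCapability fields are assumptions for the example.
from dataclasses import dataclass

@dataclass
class DeviceCapability:
    bandwidth_mbps: float      # e.g. measured when the trigger fires
    gpu_score: float           # abstract processing-power figure
    battery_fraction: float    # 0.0 .. 1.0

@dataclass
class ElementQuality:
    resolution: tuple          # (width, height) of the rendered scene element
    frame_rate: int            # frames per second

QUALITY_TIERS = [
    # (minimum bandwidth, minimum GPU score) -> quality offered
    ((50.0, 8.0), ElementQuality((3840, 2160), 60)),
    ((20.0, 4.0), ElementQuality((1920, 1080), 30)),
    ((0.0, 0.0), ElementQuality((960, 540), 15)),   # fallback for any device
]

def select_element_quality(cap: DeviceCapability) -> ElementQuality:
    """Return the highest tier the device can sustain; low-power devices still
    see the same high-quality base representation, only the element degrades."""
    for (min_bw, min_gpu), quality in QUALITY_TIERS:
        if (cap.bandwidth_mbps >= min_bw and cap.gpu_score >= min_gpu
                and cap.battery_fraction > 0.05):
            return quality
    return QUALITY_TIERS[-1][1]

print(select_element_quality(DeviceCapability(25.0, 5.0, 0.6)))  # -> 1080p @ 30 fps
```

In a practical system the same check could be re-run whenever a trigger fires, so that the quality of a newly generated scene element tracks the instantaneous capability of the device.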

Abstract

There is described a method of presenting a three-dimensional representation of a scene, the method comprising: presenting a base representation of the scene, the base representation having a first quality; generating a scene element, the scene element having a second quality; and combining the base representation with the scene element.

Description

Generating a representation of a scene
Field of the Disclosure
The present disclosure relates to methods, systems, and apparatuses for generating a representation of a scene. In particular, the disclosure relates to methods, systems, and apparatuses for generating a three-dimensional representation of a scene that comprises a base representation and a scene element.
Background to the Disclosure
In extended reality (XR) applications, it is often desirable to provide a high-quality, immersive scene. It is also often desirable to provide a personalised scene, an interactive scene, and also an accessible scene. In practical applications, it can be difficult to meet each of these desires - for example, providing a high-quality scene typically requires a powerful computer device, which may limit accessibility, in particular with lightweight devices such as mobile devices, headsets, or smart glasses. Providing a high-quality scene - e.g. to provide a scene suitable for industrial pre-visualisation or cinematic computer graphics - typically requires a long generation time (or rendering time), which can make it difficult to provide an interactive scene that is updated in real-time based on the actions of a viewer.
Improved methods of providing an extended reality scene are therefore desired.
Summary of the Disclosure
According to an aspect of the present disclosure, there is described: a method of generating a three-dimensional representation of a scene, the method comprising: generating (e.g. rendering and/or presenting) a base representation of the scene, the base representation having a first quality; generating (e.g. rendering and/or presenting) a scene element, the scene element having a second quality; wherein the scene element is arranged to be combined with the base representation; and/or the method comprises combining the base representation with the scene element.
Preferably, the second quality is lower than the first quality.
The method may comprise generating a multi-dimensional representation of a scene, e.g. a four-dimensional, five-dimensional, or six-dimensional representation.
Preferably, the first quality and the second quality are associated with one or more of: a first resolution and a second resolution (e.g. wherein the second resolution is different to and/or lower than the first resolution); a first frame rate and a second frame rate (e.g. where the second frame rate is different to and/or lower than the first frame rate); and/or a first colour range and/or a second colour range (e.g. where the second colour range is different to and/or lower than the first colour range).
Preferably, the scene element is upsampled prior to the combining of the scene element with the base representation.
Preferably, the scene element is subjected to a motion interpolation process prior to the combining of the scene element with the base representation.
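By way of illustration, the sketch below shows one simple way a lower-resolution, lower-frame-rate scene element might be combined with higher-quality base frames: a nearest-neighbour upscale for the spatial step, and a frame hold (each element frame reused across several base frames) standing in for the motion interpolation mentioned above. The array shapes, scale factor, and 2:1 frame-rate ratio are assumptions chosen for this example.

```python
# Illustrative sketch only: compositing a low-resolution, low-frame-rate scene
# element onto higher-resolution base frames. Nearest-neighbour upscaling and a
# frame hold are used; shapes and ratios are assumptions, not disclosure values.
import numpy as np

def upscale_nearest(element: np.ndarray, factor: int) -> np.ndarray:
    """Repeat each element 'pixel' over a factor x factor block of display pixels."""
    return np.repeat(np.repeat(element, factor, axis=0), factor, axis=1)

def composite(base_frames, element_frames, alpha, scale, rate_ratio, top_left):
    """Blend element frames into base frames; the element updates every rate_ratio base frames."""
    y, x = top_left
    out = []
    for i, base in enumerate(base_frames):
        elem = upscale_nearest(element_frames[i // rate_ratio], scale)
        h, w, _ = elem.shape
        frame = base.copy()
        frame[y:y + h, x:x + w] = (1 - alpha) * frame[y:y + h, x:x + w] + alpha * elem
        out.append(frame)
    return out

# 60 fps base at 384x216 (a stand-in for a full-resolution frame) and a 30 fps
# element at 48x27, upscaled 4x and held for 2 base frames.
base = [np.zeros((216, 384, 3)) for _ in range(4)]
elem = [np.full((27, 48, 3), 0.8) for _ in range(2)]
frames = composite(base, elem, alpha=1.0, scale=4, rate_ratio=2, top_left=(10, 10))
```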
Preferably, the base representation has a resolution of at least 4K, at least 8K, and/or at least 16K (preferably, this resolution is a resolution per eye). Preferably, the scene element has a resolution of no more than 8K, no more than 4K, and/or no more than 2K.
Preferably, the base representation is generated using a ray-tracing process. Preferably, the scene element is generated using a rasterization process, preferably a real-time or near real-time rasterization process.
Preferably, the base representation enables a user to move about the scene so as to view the scene, preferably to move about the scene with six degrees of freedom (6DoF). Preferably, the method is carried out at a first device and generating the base representation comprises generating the base representation based on a transmission received from a second device, preferably wherein the transmission comprises the base representation in an encoded format, more preferably a layered format and/or a low complexity enhancement video coding (LCEVC) format.
Preferably, generating the base representation comprises streaming the base representation based on a transmission from the second device, preferably wherein streaming the base representation comprises simultaneously receiving the transmission and presenting the scene.
Preferably, the method is carried out at a first device and generating the scene element comprises generating the scene element based on a transmission received from a further device, preferably wherein the transmission comprises an encoded version of the scene element.
Preferably, generating the scene element comprises streaming the scene element based on a transmission from the further device.
Preferably, the second device and the further device are different devices.
Preferably, the first device comprises a display device and/or the second device comprises an image generator device and/or the further device comprises a third-party database.
Preferably, generating the base representation comprises processing an initial version of the base representation so as to generate the base representation of the scene based on a perspective of a viewer of the scene.
Preferably, the scene element comprises a real-time element and/or wherein the scene element is generated during the playback of the scene (e.g. while the playback of the scene is ongoing).
Preferably, the scene element is generated at a rate of playback of the scene and/or wherein the scene element is generated no more than one hour prior to the presentation of the scene element, preferably no more than one minute, more preferably no more than one second.
Preferably, the scene element is generated in dependence on one or more of: a context of a viewer of the scene, preferably a context at the time of generation of the scene element and/or at the time of triggering a trigger associated with the generation of the scene element; a feature of an environment of the viewer; an object and/or a person in the environment of the viewer; a user profile of the viewer; a communication from a further device; a feature of the base representation; an origin of the base representation (e.g. a generator of the base representation); a current viewpoint and/or perspective of the viewer; a current viewing zone of the user; and an input of the viewer.
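A minimal, non-limiting sketch of such context-dependent generation is given below: a scene element is selected from a database of available elements by matching the (assumed) context fields of the viewer at the moment of generation. The element names, context fields, and matching rule are illustrative assumptions made for this example.

```python
# Hedged sketch: picking scene elements from a database of available elements
# using the viewer's context at the moment of generation. Fields, element
# names, and the matching rule are assumptions for the example.
from dataclasses import dataclass, field

@dataclass
class ViewerContext:
    weather: str                 # e.g. reported by a sensor or connected service
    viewing_zone: str            # current viewing zone of the user
    profile_tags: set = field(default_factory=set)

ELEMENT_DATABASE = [
    {"name": "rain_overlay",    "requires": {"weather": "rain"}},
    {"name": "sun_flare",       "requires": {"weather": "sun"}},
    {"name": "personal_avatar", "requires": {"profile_tag": "avatar_enabled"}},
]

def select_scene_elements(ctx: ViewerContext):
    """Return every element whose requirements match the current context."""
    selected = []
    for element in ELEMENT_DATABASE:
        req = element["requires"]
        if "weather" in req and req["weather"] != ctx.weather:
            continue
        if "profile_tag" in req and req["profile_tag"] not in ctx.profile_tags:
            continue
        selected.append(element["name"])
    return selected

print(select_scene_elements(ViewerContext("rain", "zone_42", {"avatar_enabled"})))
# -> ['rain_overlay', 'personal_avatar']
```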
Preferably, the scene element is selected based on a capability of a device displaying the scene. Preferably, the scene element is selected based on a processing power and/or bandwidth of the device.
Preferably, the scene element is selected from a database of available scene elements.
Preferably, the scene element comprises a real object that is captured with a camera.
Preferably, the scene element comprises one or more of: an object; a real object; a virtual object; a filter; an overlay; a weather effect; a personalisation; an avatar; an animated object and/or an animation; and an interactive element.
Preferably, the method comprises: generating (e.g. rendering) the base representation prior to the playback of the scene; and/or generating (e.g. rendering) the scene element during the playback of the scene.
Preferably, generating the base representation comprises receiving a transmission containing the base representation, preferably an encoded version of the base representation, and generating the base representation based on the received transmission. Preferably, generating the base representation comprises decoding the transmission so as to obtain the base representation.
Preferably, generating the base representation comprises generating the base representation from a model of the scene.
Preferably, generating the scene element comprises receiving a transmission containing the scene element, preferably an encoded version of the scene element, and generating the scene element based on the received transmission.
Preferably, generating the scene element comprises decoding the transmission so as to obtain the scene element.
Preferably, generating the scene element comprises selecting the scene element from a database, preferably a third-party database.
Preferably, the combining of the scene element with the base representation is dependent on a feature of the base representation, preferably wherein the feature defines a location onto which the scene element may be imposed and/or a time at which the scene element may be combined with the base representation.
Preferably, the method comprises detecting an interaction between a viewer of the scene and the scene element, preferably comprising modifying the scene element and/or generating a further scene element in dependence on the interaction.
Preferably, the method comprises: generating the base representation at a first computer device; and generating the scene element at a second computer device.
Trigger
Preferably, the scene and/or the base representation and/or the scene element is associated with a trigger, wherein the generation of the scene element is dependent on the triggering of the trigger.
Preferably, the triggering of the trigger is associated with one or more of: a context of a viewer of the scene and/or a change in the context; an environment of the viewer of the scene, preferably an object in an environment of the viewer of the scene; a current viewpoint and/or perspective of the viewer; a playback progress and/or a frame of the scene being displayed; an input of the viewer, preferably an input associated with a location in the scene; and an input from a third-party that is not viewing the scene, preferably wherein the scene is being viewed on a first device and the triggering of the trigger is associated with a transmission being received at the first device from a further device.
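By way of illustration, the following sketch evaluates a small set of triggers of the kinds listed above (playback progress, gaze at a region, and an interaction with a region); the trigger structure and field names are assumptions made for this example rather than part of the disclosure.

```python
# Illustrative sketch of evaluating the triggers associated with a base
# representation during playback. Trigger kinds and field names are assumptions;
# the disclosure only requires that some condition prompts element generation.
def trigger_fired(trigger: dict, playback_fraction: float,
                  gaze_target: str, user_clicks: set) -> bool:
    """Return True when the trigger's condition is met for the current frame."""
    kind = trigger["kind"]
    if kind == "progress":                       # e.g. fires at 20% of playback
        return playback_fraction >= trigger["at_fraction"]
    if kind == "gaze":                           # viewer looks at a tagged region
        return gaze_target == trigger["region"]
    if kind == "interaction":                    # viewer clicked a tagged region
        return trigger["region"] in user_clicks
    return False

triggers = [
    {"kind": "progress", "at_fraction": 0.2, "element": "weather_filter"},
    {"kind": "gaze", "region": "mirror", "element": "viewer_reflection"},
]

fired = [t["element"] for t in triggers
         if trigger_fired(t, playback_fraction=0.25, gaze_target="mirror", user_clicks=set())]
print(fired)  # -> ['weather_filter', 'viewer_reflection']
```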
Preferably, the trigger is a part of the base representation and/or wherein the trigger comprises a pointer to the base representation.
Preferably, the trigger is received separately to the base representation, preferably wherein the base representation is received from a first device and the trigger is received from a second device, more preferably wherein the scene element is also received from the second device.
Viewing zones
Preferably, the base representation is associated with a viewing zone, wherein the viewing zone comprises a subset of the scene and/or wherein the viewing zone enables a user to move through a subset of the scene; wherein a viewer is able to move within the viewing zone while viewing the base representation.
Preferably, the viewing zone comprises a subset of the scene; and/or allows movement through only a subset or portion of the scene; and/or provides a limited or restricted volume in which a user is able to view the scene (in the first quality); and/or comprises a bounded volume that enables the viewing of the scene (in the first quality), this volume being less than a volume of the scene. Preferably, the viewer is able to move within the viewing zone with six degrees of freedom (6DoF).
Preferably, the viewing zone has a volume of less than 50% of the volume of the scene, less than 20% of the volume of the scene, and/or less than 10% of the volume of the scene.
Preferably, the viewing zone has, or is associated with, a volume (e.g. a real-world volume) of less than five cubic metres (5m3), less than one cubic metre (1 m3), less than one-tenth of a cubic metre (0.1 m3) and/or less than one-hundredth of a cubic metre (0.01 m3).
Preferably, the viewing zone has a volume of at least 1% of the volume of the scene, at least 5% of the volume of the scene, and/or at least 10% of the volume of the scene.
Preferably, the viewing zone has, or is associated with, a volume (e.g. a real-world volume) of at least one-thousandth of a cubic metre (0.001 m3); at least one-hundredth of a cubic metre (0.01 m3); and/or at least one cubic metre (1 m3).
Preferably, the scene comprises an obscured portion, the obscured portion not being visible from the viewing zone and the obscured portion not being rendered within the base representation.
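The following is an illustrative, two-dimensional sketch of how an obscured portion might be identified: viewpoints are sampled inside the viewing zone, and any scene point that is occluded (here by a single wall segment) from every sampled viewpoint is culled, so it need not be rendered into the base representation. The geometry, sampling density, and intersection test are assumptions chosen to keep the example short.

```python
# Minimal 2D sketch (illustrative only) of culling scene points that are
# occluded from every sampled viewpoint inside a viewing zone.
import itertools

def segments_intersect(p1, p2, p3, p4):
    """True if segment p1-p2 crosses segment p3-p4 (standard orientation test)."""
    def orient(a, b, c):
        v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
        return (v > 0) - (v < 0)
    return (orient(p1, p2, p3) != orient(p1, p2, p4) and
            orient(p3, p4, p1) != orient(p3, p4, p2))

def visible_points(zone_samples, scene_points, walls):
    """Keep only scene points visible from at least one viewpoint in the zone."""
    kept = []
    for point in scene_points:
        seen = any(not any(segments_intersect(view, point, w0, w1) for w0, w1 in walls)
                   for view in zone_samples)
        if seen:
            kept.append(point)
    return kept

# Viewing zone sampled on a coarse grid; one wall occludes the region behind it.
zone = [(x * 0.5, y * 0.5) for x, y in itertools.product(range(3), range(3))]
wall = [((3.0, -1.0), (3.0, 2.0))]
points = [(2.0, 0.5), (4.0, 0.5), (4.0, 5.0)]
print(visible_points(zone, points, wall))  # (4.0, 0.5) is culled; the others are kept
```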
Preferably, the scene is associated with a plurality of viewing zones, wherein each zone is associated with a corresponding base representation and/or wherein each viewing zone provides a different set of viewpoints or perspectives for viewing the scene.
Preferably, the method comprises generating (e.g. rendering, receiving, and/or generating) a plurality of base representations, each base representation being associated with a different viewing zone.
Preferably, the method comprises generating (e.g. presenting) the scene element so that the scene element is visible from a plurality of the viewing zones.
Preferably, different viewing zones are associated with one or more of: different sizes; base representations of different qualities; different scene elements; different sets of available scene elements; different viewers of the scene; and different qualities of scene elements.
Preferably, the viewing zone is arranged to resist and/or prevent movement out of the viewing zone. Preferably, the base representation is arranged so that feedback is provided to a viewer as that viewer moves towards a boundary of the viewing zone and/or wherein the base representation is arranged so that playback of the scene is altered (e.g. slowed or blurred) as the viewer moves towards a boundary of the viewing zone. Preferably, the method includes showing a pass-through view of the actual surroundings of the viewer when the viewer goes outside the limits of the viewing zone.
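A minimal, non-limiting sketch of such boundary behaviour is given below: as the (assumed axis-aligned) viewing zone boundary is approached, playback is progressively blurred and slowed, and once the viewer leaves the zone a pass-through mode is signalled. The margin width and the particular effects returned are illustrative assumptions.

```python
# Hedged sketch: feedback as a viewer nears the edge of an axis-aligned viewing
# zone, with a pass-through view of the real surroundings outside the zone.
def boundary_feedback(position, zone_min, zone_max, margin=0.3):
    """Return playback effects for the viewer's current head position (metres)."""
    inside = all(lo <= p <= hi for p, lo, hi in zip(position, zone_min, zone_max))
    if not inside:
        return {"mode": "pass_through"}          # show the real surroundings
    # Distance to the nearest face of the zone along any axis.
    dist = min(min(p - lo, hi - p) for p, lo, hi in zip(position, zone_min, zone_max))
    if dist >= margin:
        return {"mode": "normal", "blur": 0.0, "speed": 1.0}
    t = dist / margin                            # 1.0 at the margin, 0.0 at the boundary
    return {"mode": "feedback", "blur": 1.0 - t, "speed": 0.5 + 0.5 * t}

print(boundary_feedback((0.5, 1.0, 0.5), (0, 0, 0), (1, 2, 1)))   # well inside
print(boundary_feedback((0.05, 1.0, 0.5), (0, 0, 0), (1, 2, 1)))  # near a boundary
print(boundary_feedback((1.2, 1.0, 0.5), (0, 0, 0), (1, 2, 1)))   # outside -> pass-through
```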
According to another aspect of the present disclosure, there is described a method of generating a representation of a three-dimensional scene, the method comprising: generating (e.g. rendering and/or presenting) a base representation of the scene, wherein the base representation is associated with a viewing zone, wherein the viewing zone comprises a subset of the scene and/or wherein the viewing zone enables a user to move through a subset of the scene; wherein a viewer is able to move within the viewing zone while viewing the base representation; and wherein the viewing zone is arranged to resist and/or prevent movement out of the viewing zone, preferably wherein the base representation is arranged so that feedback is provided to a viewer as that viewer moves towards a boundary of the viewing zone and/or wherein the base representation is arranged so that playback of the scene is altered (e.g. slowed or blurred) as the viewer moves towards a boundary of the viewing zone.
Preferably, the method comprises generating (e.g. rendering and/or displaying) the scene element in dependence on a viewer moving towards a boundary of the viewing zone.
Preferably, the viewing zone is arranged to enable movement out of the viewing zone, preferably to enable movement out of the viewing zone in dependence on a user input. Preferably, the method comprises generating (e.g. rendering and/or displaying) the scene element in dependence on a viewer moving into and/or out of the viewing zone.
Preferably, movement out of the viewing zone causes one or more of: pausing of playback of the scene; presentation of an options menu associated with the scene; and display of one or more available viewing zones.
Preferably, movement out of the viewing zone causes presentation of the scene in an altered quality. Preferably, the altered quality is lower than the first quality and/or the altered quality is associated with a two-dimensional representation of the scene.
Preferably, the altered quality is the second quality.
Preferably, movement out of the viewing zone causes presentation of the scene element in an altered quality and/or movement out of the viewing zone pauses an animation or playback of the scene element.
Preferably, movement out of the viewing zone reduces a freedom of movement through the scene. Preferably, outside of the viewing zone the viewer is able to move through the scene in less than 6DoF, no more than 3DoF, and/or less than 3DoF.
General
Preferably, the base representation comprises image data, preferably encoded image data. Preferably, the base representation is encoded based on a low-complexity enhancement video codec (LCEVC) process; and/or the base representation comprises layered image data so that the base representation can be generated in different levels of quality.
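Purely by way of illustration, the sketch below shows a generic layered arrangement of the kind referred to above: a downsampled base layer plus a residual enhancement layer that more capable devices add back to recover full quality. This is a toy layering scheme written for the example; it is not the LCEVC bitstream format or API.

```python
# Illustrative layered image data: base layer + residual enhancement layer.
# A generic scheme for this sketch only; not any real codec's format.
import numpy as np

def encode_layers(frame: np.ndarray, factor: int = 2):
    """Split a frame into a downsampled base layer and a full-resolution residual."""
    base = frame[::factor, ::factor]                              # crude downsample
    upsampled = np.repeat(np.repeat(base, factor, axis=0), factor, axis=1)
    residual = frame - upsampled                                  # enhancement layer
    return base, residual

def decode(base: np.ndarray, residual=None, factor: int = 2):
    """Low-capability devices decode only the base; others add the residual."""
    frame = np.repeat(np.repeat(base, factor, axis=0), factor, axis=1)
    return frame if residual is None else frame + residual

frame = np.random.rand(4, 4)
base, residual = encode_layers(frame)
assert np.allclose(decode(base, residual), frame)   # full quality when enhanced
low_quality = decode(base)                          # base-only reconstruction
```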
Preferably, the scene comprises one or more of: an extended reality (XR) scene; a virtual reality (VR) scene; an augmented reality (AR) scene; and a mixed reality (MR) scene.
Preferably, the method comprises storing the three-dimensional representation and/or outputting the three-dimensional representation. Preferably, the method comprises outputting the three-dimensional representation to a further computer device.
Preferably, the method comprises generating an image and/or a video based on the three-dimensional representation.
Preferably, the scene comprises one or more of: a part of a movie, a music video, a game, a shopping experience, a sports experience.
Preferably, the method comprises inserting a trigger into the base representation, the trigger being associated with the generation and/or display of a scene element, the scene element having a second quality and the scene element being arranged to be combined with the base representation.
According to another aspect of the present disclosure, there is described a method of generating a base representation of a three-dimensional scene, the method comprising: generating the base representation of a scene, the base representation having a first quality; inserting a trigger into the base representation, the trigger being associated with the display and/or generation of a scene element, the scene element having a second quality and the scene element being arranged to be combined with the base representation.
Preferably, the method is performed at an image generating device, and the method further comprises transmitting the base representation to a display device. Preferably, the method further comprises displaying the base representation at the display device. Preferably, the method comprises generating (e.g. rendering and/or displaying) the scene element at the display device, preferably comprising transmitting the scene element from a further device to the display device and combining the base representation and the scene element at the display device.
According to another aspect of the present disclosure, there is described a method of generating a representation of a three-dimensional scene, the method comprising: identifying a base representation of a scene, the base representation having a first quality; and determining a scene element for combining with the base representation, the scene element having a second quality.
Preferably, the method comprises associating a trigger with the scene element, wherein the scene element is arranged to be combined with the base representation based on the triggering of the trigger.
According to another aspect of the present disclosure, there is described a method of presenting a representation of a three-dimensional scene, the method comprising: receiving a base representation of a scene, the base representation having a first quality; receiving and/or generating a scene element, the scene element having a second quality; and combining the base representation with the scene element.
According to another aspect of the present disclosure, there is described a system for generating a representation of a three-dimensional scene, the system comprising: means for (e.g. a processor for) generating (e.g. rendering and/or presenting) a base representation of the scene, wherein the base representation is associated with a viewing zone, wherein the viewing zone comprises a subset of the scene and/or wherein the viewing zone enables a user to move through a subset of the scene; wherein a viewer is able to move within the viewing zone while viewing the base representation; and wherein the viewing zone is arranged to resist and/or prevent movement out of the viewing zone, preferably wherein the base representation is arranged so that feedback is provided to a viewer as that viewer moves towards a boundary of the viewing zone and/or wherein the base representation is arranged so that playback of the scene is altered (e.g. slowed or blurred) as the viewer moves towards a boundary of the viewing zone.
According to another aspect of the present disclosure, there is described a system and/or apparatus for generating a representation of a three-dimensional scene, the system and/or apparatus comprising: means for (e.g. a processor for) generating (e.g. rendering and/or presenting) a base representation of a scene, the base representation having a first quality; means for (e.g. a processor for) generating (e.g. rendering and/or presenting) a scene element, the scene element having a second quality; and wherein: the scene element is arranged to be combined with the base representation; and/or the system and/or apparatus comprises means for (e.g. a processor for) combining the base representation with the scene element.
According to another aspect of the present disclosure, there is described a system and/or apparatus for generating a base representation of a three-dimensional scene, the system and/or apparatus comprising: means for (e.g. a processor for) generating the base representation of a scene, the base representation having a first quality; means for (e.g. a processor for) inserting a trigger into the base representation, the trigger being associated with the generation of a scene element, the scene element having a second quality and the scene element being arranged to be combined with the base representation.
According to another aspect of the present disclosure, there is described a system and/or apparatus for presenting a representation of a three-dimensional scene, the system and/or apparatus comprising: means for (e.g. a processor for) receiving a base representation of a scene, the base representation having a first quality; means for (e.g. a processor for) receiving and/or generating a scene element, the scene element having a second quality; and means for (e.g. a processor for) combining the base representation with the scene element.
According to another aspect of the present disclosure, there is described a system and/or apparatus for generating a representation of a three-dimensional scene, the system and/or apparatus comprising: means for (e.g. a processor for) identifying a base representation of a scene, the base representation having a first quality; means for (e.g. a processor for) determining a scene element for combining with the base representation, the scene element having a second quality.
Preferably, the system and/or apparatus comprises means for (e.g. a processor for) associating a trigger with the scene element, wherein the scene element is arranged to be combined with the base representation based on the triggering of the trigger.
Preferably, the means for generating the base representation comprises an image generator device.
Preferably, the means for receiving and/or generating the scene element comprises a display device, preferably a virtual reality headset.
Preferably, the means for receiving and/or generating the scene element comprises a computer device connected to a display device.
Preferably, the means for generating the scene element comprises a third-party device, preferably a server comprising a database of available scene elements.
Preferably, the means for combining the base representation with the scene element comprises a display device, preferably a virtual reality headset and/or a computer device connected to the display device.
Preferably, the means for generating the base representation is arranged to transmit the base representation to a further device, preferably to a display device. Preferably, the means for combining the base representation and the scene element comprises the display device.
Preferably, the system comprises a display device.
According to an aspect of the present disclosure, there is described a method of generating (e.g. rendering) a base representation of the scene with a first rendering process at a first level of quality, the base representation including data to produce multiple points of view of the scene within a range of points of view (e.g. a “zone of view”, or “viewing zone”); and generating (e.g. rendering and/or presenting) a specific point of view of the scene at a second level of quality based on real-time information on motion and orientation of the viewer within the zone of view.
The method may comprise processing the first rendering process at a first frame rate and the second rendering process at a second frame rate, the second frame rate being different from the first frame rate.
The second rendering process may produce a video of the evolution of the viewer’s point of view of the scene. The video may be a stereoscopic video. The second rendering process may produce video plus depth information.
Preferably, the base representation is produced based on an off-line (e.g. not necessarily real-time) rendering process.
Preferably, the view-point rendition of the base representation is generated using a two-pass rendering process including a first rendering process of multiple points of view producing a pre-rendered data set by means of a ray-tracing or a path-tracing process, not necessarily real-time or near real-time, and a second rendering process of the instantaneous point of view of the viewer by means of a real-time rendering process, such as a real-time rasterization and/or real-time ray-tracing process. Preferably, the scene element at the second level of quality is generated using a single real-time rendering process, such as a rasterization process or a real-time ray-tracing process. The scene element may be generated by the same real-time rendering process (e.g. a second base representation rendering process) that produces the rendering of the view-point of the base representation.
In a non-limiting embodiment, the second rendering process includes a step of motion interpolation to produce the final rendering of the view-point of the scene at a frame rate that is different from (e.g., higher than) the frame rate at which the pre-rendered data set was computed. In a non-limiting embodiment, the pre-rendered data set includes data (e.g., motion information of specific scene elements) to support more accurate motion interpolation during the second rendering process.
Preferably, the base representation enables a user to move about the scene so as to view the scene from different points of view (e.g., location and orientation in space at any one time), preferably to move about the scene with six degrees of freedom (6DoF).
Preferably, the method of producing the real-time rendering of the view-point is carried out at a first device and generating the final rendition of the view-point of the scene comprises generating the video frames of the scene based on a transmission received from a second device, preferably wherein the transmission comprises the use of the base representation in an encoded format, more preferably a layered video encoding format, such as a video encoding enhanced with the MPEG-5 Part 2 Low Complexity Enhanced Video Coding (LCEVC) format.
Preferably, generating the final rendition of the view-point of the scene comprises streaming to the first device the final rendering of the view-point based on a transmission from a second device, preferably wherein presenting the view-point of the scene comprises receiving the transmission, decoding the video frames, adjusting the video frames for presentation and presenting the view-point. Preferably, adjusting the video frames for presentation includes applying a reprojection to the decoded video frames, to account for latency in the transmission and to adapt the decoded video to the display frame rate.
Preferably, the method is carried out at a first device and generating the scene element comprises generating the scene element based on a transmission received from a further device, preferably wherein the transmission comprises an encoded version of the scene element. In a non-limiting embodiment, the scene element is responsive to actions of another user in a different location, remotely interacting with the viewer of the scene.
Preferably, generating the scene element comprises streaming the scene element based on a transmission from the further device. Preferably, the scene element comprises a video feed streamed from the further device.
Preferably, the second device and the further device are different devices.
Preferably, generating the view-point of the base representation comprises processing an initial version of the base representation data so as to generate the view-point of the base representation of the scene based on a perspective of a viewer of the scene.
Preferably, generating the base representation comprises receiving a transmission containing the base representation, preferably an encoded version of the base representation, and generating the view-point of the base representation based on the received transmission.
Preferably, generating the scene element comprises receiving data representing one or more actions of the viewer and generating the scene element based on the received data.
Preferably, generating the scene element comprises receiving a transmission and decoding the transmission so as to obtain the scene element.
Preferably, generating the scene element comprises selecting the scene element from a database, preferably a third-party database.
Preferably, the base representation comprises image data, preferably encoded image data. Preferably, the base representation is encoded based on a low-complexity software encoding process; and/or the base representation comprises layered (e.g., tier-based, or hierarchical) data so that the base representation can be generated in different levels of quality. Preferably, the scene comprises one or more of: a part of a movie, a music video scene, a game, a shopping experience, a digital double, an industrial pre-visualization, a design review, a sports experience.
According to another aspect of the present disclosure, there is described a method of generating a base representation of a three-dimensional scene, the method comprising: generating the base representation of a scene, the base representation having a first quality; inserting a trigger into the base representation, the trigger being associated with the display and/or generation of a scene element, the scene element having a second quality and the view-point of the scene element being arranged to be combined with the view-point of the base representation.
Preferably the method is performed at an image generating device, and the method further comprises transmitting the base representation to a display device. Preferably, the method further comprises processing and displaying the view-point of the base representation at the display device.
Preferably, the method comprises generating (e.g. rendering and/or displaying) the scene element at a view-point rendering device, preferably comprising transmitting the scene element from a further device to the display device and combining the base representation and the scene element at the view-point rendering device.
Any feature in one aspect of the disclosure may be applied to other aspects of the invention, in any appropriate combination. In particular, method aspects may be applied to apparatus aspects, and vice versa.
Furthermore, features implemented in hardware may be implemented in software, and vice versa. Any reference to software and hardware features herein should be construed accordingly.
Any apparatus feature as described herein may also be provided as a method feature, and vice versa. As used herein, means plus function features may be expressed alternatively in terms of their corresponding structure, such as a suitably programmed processor and associated memory.
It should also be appreciated that particular combinations of the various features described and defined in any aspects of the disclosure can be implemented and/or supplied and/or used independently.
The disclosure also provides a computer program and a computer program product comprising software code adapted, when executed on a data processing apparatus, to perform any of the methods described herein, including any or all of their component steps.
The disclosure also provides a computer program and a computer program product comprising software code which, when executed on a data processing apparatus, comprises any of the apparatus features described herein.
The disclosure also provides a computer program and a computer program product having an operating system which supports a computer program for carrying out any of the methods described herein and/or for embodying any of the apparatus features described herein.
The disclosure also provides a computer readable medium having stored thereon the computer program as aforesaid.
The disclosure also provides a signal carrying the computer program as aforesaid, and a method of transmitting such a signal.
The disclosure extends to methods and/or apparatus substantially as herein described with reference to the accompanying drawings.
The disclosure will now be described, by way of example, with reference to the accompanying drawings.

Description of the Drawings
Figure 1 shows a system for generating a sequence of images.
Figure 2 shows a computer device on which components of the system of Figure 1 may be implemented.
Figure 3 shows a method of generating and combining a base representation of a scene and a scene element.
Figure 4 shows a scene comprising a viewing zone.
Figure 5 shows a method of generating a scene element based on a trigger associated with the scene.
Description of the Preferred Embodiments
Referring to Figure 1, there is shown a system for generating a sequence of images. This system can be used to generate, and then display, a representation of a scene or an environment, where this may involve providing a VR experience (or an XR experience) to a user.
The system comprises an image generator 11, an encoder 12, a transmitter 13, a network 14, a receiver 15, a decoder 16 and a display device 17.
These components may each be implemented on separate apparatuses. Equally, various combinations of these components may be implemented on a shared apparatus; for example, the image generator 11, the encoder 12, and the transmitter 13 may all be part of a single image data generation device. Similarly, the receiver 15, the decoder 16, and the display device 17 may all be a part of a single image rendering device.
Typically, the system comprises at least one encoding computer device (e.g. a server of a content provider) and at least one rendering computer device (e.g. a VR headset).
Referring to Figure 2, each of the components, and in particular the image generator 11, the encoder 12, the transmitter 13, the receiver 15, the decoder 16 and the display device 17, is typically implemented on a computer device 20, where, as described above, a plurality of these components may be implemented on a shared computer device.
Each computer device comprises one or more of: a processor 21 for executing instructions (e.g. so as to perform one or more of the steps of the various methods described below), a communication interface 22 for facilitating communication between computer devices (e.g. an Ethernet interface, a Bluetooth® interface, or a universal serial bus (USB) interface), a memory 23 and/or storage 24 for storing information and instructions (e.g. a random access memory (RAM), a read only memory (ROM), a hard disk drive (HDD), a solid state drive (SSD), and/or a flash memory), and a user interface 25 (e.g. a display, a mouse, and/or a keyboard) for enabling a user to interact with the computer device. These components may be coupled to one another by a bus 26 of the computer device.
The computer device 20 may comprise further (or fewer) components. In particular, the computer device (e.g. the display device 17) may comprise one or more sensors, such as an accelerometer, a GPS sensor, or a light sensor. These sensors typically enable the computer device to identify an environmental condition and/or an action of a wearer of the display device.
Turning back to Figure 1, the image generator 11 is configured to generate a sequence of image data (e.g. a sequence of image frames) to enable the display device 17 to use this image data to display a plurality of images. The image data may comprise one or more digital objects and the image data may be generated or encoded in any format. For example, the image data may comprise point cloud data, where each point has a 3D position and one or more attributes. These attributes may, for example, include a surface colour, a transparency value, a point size and a surface normal direction. Each attribute may have a value chosen from a continuous range or may have a value chosen from a discrete set. The image data enables the later rendering of images. This image data may enable a direct rendering (e.g. the image data may directly represent an image). Equally, the image data may require further processing in order to enable rendering. For example, the image data may comprise three-dimensional point cloud data, where rendering a two-dimensional image using this data requires processing based on a viewpoint of this two-dimensional image. For example, a two-dimensional object may be rendered using a Gaussian splatting process that is performed on a three-dimensional point cloud (Gaussian splatting is described, for example, at https://huggingface.co/blog/gaussian-splatting).
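By way of purely illustrative, non-limiting example, the following Python sketch shows one possible in-memory structure for such point cloud image data, with a per-point 3D position and the attributes mentioned above; the class name, field names and default values are assumptions made for illustration and are not prescribed by this disclosure.

from dataclasses import dataclass
from typing import Tuple

@dataclass
class CloudPoint:
    # 3D position of the point (x, y, z)
    position: Tuple[float, float, float]
    # Surface colour as an RGB triple chosen from a discrete set
    colour: Tuple[int, int, int] = (255, 255, 255)
    # Transparency chosen from a continuous range [0.0, 1.0]
    transparency: float = 0.0
    # Point size chosen from a continuous range
    point_size: float = 1.0
    # Surface normal direction
    normal: Tuple[float, float, float] = (0.0, 0.0, 1.0)

# A point cloud is then simply a collection of such points that a later
# rendering step (e.g. a splatting process) projects to a 2D image for a viewpoint.
cloud = [
    CloudPoint(position=(0.0, 1.0, 2.0), colour=(200, 30, 30)),
    CloudPoint(position=(0.5, 1.2, 2.1), transparency=0.5),
]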
The image data may comprise depth map data, where one or more pixels or objects in the image is associated with a depth that is specified by the depth map data. The depth map data may be provided as a depth map layer, separate from an image layer. In some contexts, such as MPEG Immersive Video (MIV), the image layer may instead be described as a texture layer. Similarly, in some contexts, the depth map layer may instead be described as a geometry layer.
The image data may include a predicted display window location. The predicted display window location may indicate a portion of an image that is likely to be displayed by the display device 17. The predicted display window location may be based on a viewing position (such as a virtual position and/or orientation of the user in a 3D environment) of the user, where this viewing position may be obtained from the display device. The predicted display window location may be defined using one or more coordinates. For example, the predicted display window location may be defined using the coordinates of a corner or center of a predicted display window, and may be defined using a size of the predicted display window. The predicted display window location may be encoded as part of metadata included with the frame.
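By way of purely illustrative, non-limiting example, the following Python sketch packages a predicted display window location, defined by a corner coordinate and a window size as described above, into per-frame metadata; the dictionary keys and the viewing-position format are assumptions made for illustration and do not correspond to any standardised metadata syntax.

def make_window_metadata(corner_xy, window_size, viewing_position):
    # Package the predicted display window location as frame metadata.
    return {
        "predicted_window": {
            "corner": corner_xy,   # (x, y) of a corner of the predicted window, in pixels
            "size": window_size,   # (width, height) of the predicted window, in pixels
        },
        "viewing_position": viewing_position,  # e.g. virtual position and orientation of the user
    }

metadata = make_window_metadata(
    corner_xy=(640, 360),
    window_size=(1280, 720),
    viewing_position={"position": (0.0, 1.7, 0.0), "yaw_deg": 15.0, "pitch_deg": -5.0},
)
print(metadata["predicted_window"]["size"])  # (1280, 720)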
The image data for each image (e.g. each frame) may include further information, which may be provided as a part of an image, e.g. as part of the point cloud data, or as separate layers. In particular, the image data may include audio information or haptic feedback information indicating audio or haptics which can accompany displayed visual data. An audio layer or haptic layer may accompany each image, and may be omitted for images where no accompanying audio or haptics are required.
Similarly, the image data may comprise interactivity information, where the image data may contain or indicate elements with which a user can interact. The interactivity information may, for example, define a behaviour of an element, where a user is able to interact with the element based on this behaviour. The behaviour typically defines a change in an element that occurs as a result of a user interaction where this change may comprise a change in the attributes of the element or in the rendering of the element. As an example, where an image contains a target element, the target element may be arranged to disappear when a user interacts with this element, or to provide feedback indicating that the user has interacted with the target. This interactivity data may be provided as part of, or separately to, the image data.
The image data may indicate, or may be combinable with, a state of the virtual environment, a position of a user, or a viewing direction of the user. Here, the position and viewing direction may be physical properties of the user in the real world, or the position and viewing direction may also be purely virtual, for example being controlled using a handheld controller. The image generator 11 may, for example, obtain information from the display device 17 that indicates the position, viewing direction, or motion of the user. Equally, the image generator may generate image data such that it can later be combined with this position, viewing direction, or motion, where the image generator may generate a full scene which is only partially viewed by a user depending on the position of that user.
In some cases, the generated image may be independent of user position and viewing direction. This type of image generation typically requires significant computer resources such as a powerful GPU, and may be implemented in a cloud service, or on a local but powerful computer. For example, a cloud service (such as a Cloud Rendering Service (CRN)) may reduce the cost per-user and thereby make the image frame generation more accessible to a wider range of users. Here “rendering” refers at least to an initial stage of rendering to generate an image. Further rendering may occur at the display device 17 based on the generated image to produce a final image which is displayed.
The image generator 11 may, for example, comprise a rendering engine for initially rendering a virtual environment such as a game or a virtual meeting room.
The encoder 12 is configured to encode frames to be transmitted to the display device 17. The encoder may be implemented using executable software or may be implemented on specific hardware such as an ASIC. In some embodiments, the image generator 11 may transmit raw, unencoded, data through the network 14. However, such transmission typically leads to a high file size and requires a high bandwidth so that it is typically desirable to encode the data prior to the transmission.
The encoder 12 may encode the image data in a lossless manner or may encode the data in a lossy manner. The encoder may apply inter-frame or intra-frame compression based on a currently-encoded frame and optionally one or more previously encoded frames. The encoder may be a multi-layer encoder, such as a low complexity enhancement video coding (LCEVC) enabled encoder.
Where the generated frames comprise depth map data, the encoder 12 may perform layered encoding on each instance of image data (e.g. each frame) to generate an encoded frame comprising a base depth map layer and an enhancement depth map layer. Encoding a depth map in this way may improve compression. In some applications, such as HDR video, depth maps are desirably highly detailed with a bit depth of up to twelve or fourteen bits, which is a significant increase in the data to be transmitted. As a result, providing ways to improve compression of the depth map can make more realistic depth map-based displays viable when performing rendering or transmission of rendered data in real-time. Furthermore, this type of layered encoding makes it easy to drop (and then pick back up) one or more of the layers, which provides flexibility and tools for bandwidth management.
Layered encoding is also helpful as the final decoder/user device (such as a user display device) can choose whether to process these extra layers. For example, in a non-layered approach, the best the end device (i.e. the receiver, decoder or display device associated with a user that will view the images) can do is determine that it does not have enough resources for a given quality (be it resolution, frame rate, or inclusion of a depth map) and then signal to the controller/renderer/encoder that it does not have enough resources. The controller will then send future images at a lower quality. In that alternative scenario, the end device still unfortunately has to process the higher quality data until the lower quality data arrives, if it can process the received images at all.
In some of the described embodiments, this situation is improved upon because when/if the end device determines for example that it does not have the processing capabilities to handle the highest level of quality, then it can drop and/or choose not to process certain layers. The end device may also signal to the controller that it needs a lower level of quality, but in the meantime the end device can only process the number of layers that it can handle. Therefore, the end device can react to conditions much more quickly.
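By way of purely illustrative, non-limiting example, the following Python sketch shows how an end device might select which received layers to process for a frame, dropping the remainder when its processing budget is exceeded; the layer names, priority order and cost values are assumptions made for illustration only.

# Hypothetical per-frame processing cost of each layer, in arbitrary units.
LAYER_COST = {
    "base_image": 1.0,
    "enhancement_image": 2.0,
    "base_depth": 0.5,
    "enhancement_depth": 1.0,
}

def select_layers(available_budget, received_layers):
    """Process layers in priority order; drop the rest rather than stalling."""
    priority = ["base_image", "enhancement_image", "base_depth", "enhancement_depth"]
    selected, spent = [], 0.0
    for layer in priority:
        if layer in received_layers and spent + LAYER_COST[layer] <= available_budget:
            selected.append(layer)
            spent += LAYER_COST[layer]
    return selected  # layers not selected are simply not processed for this frame

# e.g. a constrained headset keeps only the base layers for this frame:
print(select_layers(1.6, ["base_image", "enhancement_image", "base_depth"]))
# ['base_image', 'base_depth']

The end device can apply such a selection immediately, on a frame-by-frame basis, while separately signalling to the controller that a lower level of quality is needed.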
In some cases, depth map data may be embedded in image data. In this case, the base depth map layer may be a base image layer with embedded depth map data, and the enhancement depth map layer may be an enhancement image layer with embedded depth map data.
Alternatively, when the image data for generated images comprises a depth map layer separate from an image layer and multi-layer encoding is applied, the encoded depth map layers may be separate from the encoded image layers. This has the advantage that the encoded depth map layers can be dropped under some conditions while still retaining image layers that can be displayed (albeit with a lower level of realism). For example, the encoded depth map layers can be dropped by a transmitter or encoder when available communication resources are reduced, or can be dropped by an end device which lacks the processing resources to handle the highest level of quality. Similarly, if the image data for some images comprises an audio base layer, a haptic feedback base layer, an audio enhancement layer or a haptic feedback enhancement layer, these can be processed or dropped flexibly.
Again similarly, if the image data for some images comprises an interactivity data base layer or an interactivity enhancement layer these can be processed or dropped flexibly. For example, certain interactions may only be possible where a threshold bandwidth is available, where complex interactions (e.g. those enabling a conversation with a digital object) may be disabled before less complex interactions (e.g. changing a pixel colour) are disabled.
Additionally or alternatively, where the image data comprises point cloud data, the encoder may apply a point cloud data encoding technique such as described in European patent application EP21386059.6, which is incorporated herein by reference. Such a point cloud encoder may act as a base encoder for a layered encoding technique such as LCEVC or VC-6. Notably LCEVC and VC-6 techniques encode and decode a layered signal, but are agnostic about the content type of data encoded in the signal. For example, the signal can include textures, video frames, geometry or depth data, meshes, point clouds, rendering attributes or physics engine attributes.
The transmitter 13 may be any known type of transmitter for wired or wireless communications, including an Ethernet transmitter or a Bluetooth transmitter.
The transmitter 13 may be configured to make decisions about how to transmit the image data, and/or may provide feedback to the encoder 12 or the image generator 11 . For example, the transmitter may determine available communication resources (e.g. bandwidth) for transmitting image data, and may drop one or more layers from an encoded frame, or indicate to the image generator and/or encoder that image data should be generated and encoded with fewer layers, when insufficient bandwidth is available for transmission of all generated data. As specific examples, the transmitter may be configured to drop a depth map layer, an LCEVC enhancement layer, or a VC-6 enhancement layer from a frame when insufficient communication resources are available.
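By way of purely illustrative, non-limiting example, the following Python sketch shows a transmitter-side decision of this kind, in which later layers are dropped once the estimated per-frame bit budget is exhausted; the layer names, sizes and bandwidth figure are assumptions made for illustration only.

def layers_to_send(encoded_layers, available_bits_per_frame):
    """encoded_layers: list of (name, size_in_bits), ordered base layer first."""
    sent, used = [], 0
    for name, size in encoded_layers:
        if used + size <= available_bits_per_frame:
            sent.append(name)
            used += size
        else:
            # Drop this layer (e.g. a depth map layer or an LCEVC/VC-6 enhancement
            # layer) and optionally signal upstream to generate fewer layers.
            break
    return sent

frame = [("base_image", 400_000), ("lcevc_enhancement", 600_000), ("depth_map", 300_000)]
print(layers_to_send(frame, available_bits_per_frame=1_100_000))
# ['base_image', 'lcevc_enhancement']  -> the depth map layer is dropped this frame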
The network 14 provides a channel for communication between the transmitter 13 and the receiver 15, and may be any known type of network such as a WAN or LAN or a wireless Wi-Fi or Bluetooth network. The network may further be a composite of several networks of different types. Many users only have access to a network with a bandwidth of 30 MBps, which can lead to latency jitter when streaming. The required bandwidth and the observed latency can be reduced by means of tactics such as forward-looking rendering and last-millisecond reprojection, which are enabled by improved compression.
The receiver 15 may be any known type of receiver for wired or wireless communications, including an Ethernet receiver or a Bluetooth receiver.
The decoder 16 is configured to receive and decode image data (e.g. to decode an encoded frame). The decoder may be implemented using executable software or may be implemented on specific hardware such as an ASIC.
The display device 17 may for example be a television screen or a VR headset. The timing of the display may be linked to a configured frame rate, such that the display device may wait before displaying the image. The display device may be configured to perform warping, that is, to obtain a final display window location, adjust a warpable image to obtain a final image corresponding to a final viewing direction and position of the user, and display the final image.
In this regard, the image data is typically arranged to provide a warpable image for which a portion of the image that is displayed at the display device 17 is dependent on a position or orientation of a viewer. The warpable image may then be rendered before a most up to date viewing direction of the user is known. The warpable image may be transmitted to the display device, or the warpable image may be transmitted to a rendering node which is near to the display device, and the display device or rendering node may perform time warping to generate a displayed image portion based on the warpable image and the most up to date viewing direction and position of the user.
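By way of purely illustrative, non-limiting example, the following Python sketch approximates such a time warp as a simple pixel shift proportional to the change in viewing direction since the warpable image was rendered; a practical implementation would instead apply a full reprojection (optionally per-pixel, using depth data), and the sign convention and field-of-view figures below are assumptions made for illustration only.

def simple_time_warp(image_w, image_h, fov_h_deg, fov_v_deg, rendered_dir, current_dir):
    """Return an approximate (dx, dy) pixel offset to apply before display."""
    px_per_deg_x = image_w / fov_h_deg
    px_per_deg_y = image_h / fov_v_deg
    d_yaw = current_dir["yaw"] - rendered_dir["yaw"]        # degrees turned since render
    d_pitch = current_dir["pitch"] - rendered_dir["pitch"]
    # Shift the warpable image opposite to the head turn (sign convention assumed).
    return (-d_yaw * px_per_deg_x, d_pitch * px_per_deg_y)

# e.g. the user turned 1.5 degrees to the right since the frame was rendered:
print(simple_time_warp(1920, 1080, 90.0, 60.0,
                       {"yaw": 10.0, "pitch": 0.0}, {"yaw": 11.5, "pitch": 0.0}))
# (-32.0, 0.0)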
As mentioned above, a single device may provide a plurality of the described components. For example, a first rendering node may comprise the image generator 11 , encoder 12 and transmitter 13. Additional similar rendering nodes may be included in the system, and may work together to generate the sequence of frames.
In one case, multiple rendering nodes may each provide separate image data to an image data assembling node; for example, each rendering node may provide a part of a sequence of frames to a frame assembling node. For example, the receiver 15, decoder 16 or display device 17 may be configured to assemble parts of image data from multiple sources to generate a sequence of images for display on the display device. Equally, the image data assembling node may be separate from the receiver 15, decoder 16 and display device 17.
Additionally or alternatively, multiple rendering nodes may be chained. In other words, successive rendering nodes may add to a sequence of image data as it passes from rendering node to rendering node, and eventually a complete sequence of image data is then provided to the receiver 15. Furthermore, each rendering node may obtain components of a render from multiple upstream rendering nodes and/or distribute components of a render to multiple downstream rendering nodes.
A chain of rendering nodes may be useful for performing different rendering tasks that require different quantities of processing resources, or different frame rates. For example, a company may provide distributed processing in the form of a centralised hub which has abundant processing resources but is distant from users, and peripheral locations which have more scarce processing resources but are closer to users. Expensive but fairly static rendering features such as background lighting or environmental impact on sound may be generated at the central hub (for example using ray tracing), while features that require fewer resources but faster responses or higher frame rates may be generated closer to the user. In other words, the more responsive a rendering feature needs to be, the lower the latency needed between the rendering node which generates the feature and the user display and, in a chain of rendering nodes, the node which generates each rendering feature can be chosen based on a required maximum latency of that feature. On the other hand, if it is expensive to generate a rendering feature, then it may be preferable to generate the feature less frequently and with a higher maximum latency. For example, a static, high-quality background feature may be generated early in the chain of rendering nodes and a dynamic, but potentially lower-quality, foreground feature may be generated later in the chain of rendering nodes, closer to the user device. Here, environmental impact on sound means, for example, that a set of surfaces may be constructed where each surface has different sound reflection and absorption properties depending upon material and shape. The frame rates may be matched by creating multiple frames with features generated at the lower frame rate, and combining them with the frames with features generated at the higher frame rate. In a non-limiting embodiment, a preliminary rendering generates volumetric object data including motion vectors at a first (lowest) frame rate, then produces 2D rendered frames plus depth information for a specific user at a second (higher) frame rate, then transmits video plus depth data to the user device, which produces final frames for display via space warping (depth-based reprojections) at a third (highest) frame rate. One or more of these steps may be performed in combination with the other described embodiments. The viewing position of the user may change as additional rendering tasks are performed at different rendering nodes in the chain. Each or any rendering node may obtain an updated viewing position before performing its respective rendering task.
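By way of purely illustrative, non-limiting example, the following Python sketch shows one way of matching frame rates between chained rendering nodes by reusing each slowly generated (e.g. ray-traced) background frame across several faster foreground frames; the frame rates and the index-pairing scheme are assumptions made for illustration only.

def pair_frames(background_fps, foreground_fps, duration_s):
    """Pair each foreground frame index with the background frame it should reuse."""
    pairs = []
    repeat = foreground_fps // background_fps          # e.g. 60 // 15 = 4 reuses per background frame
    for fg_index in range(int(duration_s * foreground_fps)):
        bg_index = fg_index // repeat                  # reuse the most recent background frame
        pairs.append((bg_index, fg_index))
    return pairs

# The first eight foreground frames reuse background frames 0 and 1:
print(pair_frames(background_fps=15, foreground_fps=60, duration_s=0.5)[:8])
# [(0, 0), (0, 1), (0, 2), (0, 3), (1, 4), (1, 5), (1, 6), (1, 7)]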
Additionally, the system may simultaneously generate multiple sequences of image data for different respective users or different respective display devices. For example, in the context of a VR or AR experience, each user or display device may view a different 3D environment, or may view different parts of a same 3D environment. When using a chain of rendering nodes, each node may serve multiple users or just one user.
For example, a starting rendering node (e.g. at a centralised hub) may serve a large group of users. For example, the group of users may be viewing nearby parts of a same 3D environment. In this case, the starting node may render a wide field of view which is relevant for all users in the large group.
The starting node may send this wide field of view to a first middle rendering node which renders additional aspects of the 3D environment. These additional aspects may for example be aspects which require less processing power to render, or may be aspects which are specific to individual users of the group. Additionally, the middle rendering node may render features in a smaller field of view than the starting node - this smaller field of view may be relevant to each user rather than the group of users. The first middle rendering node may additionally only serve a smaller number of users (e.g. half of the large group of users), with the remaining users being served by a second middle rendering node which also receives the wide field of view from the starting node.
The middle rendering node(s) may then send sequences of second partially or fully rendered frames to an end device for each user. The end device may perform further processes such as warping or focal distance adjustments, optionally using depth map data.
Preferably, each rendering node encodes the partially or fully rendered frames before transmitting them on to a next rendering node or to the receiver 15. This means that the required communication resources can be reduced when the rendering nodes are separated by one or more networks, or more generally are implemented in a distributed system such as a cloud.
However, each rendering node in a chain is encoding a different partially or fully rendered frame, with different data. Therefore, it may be advantageous for different rendering nodes to use different rendering formats and/or encoding formats. For example, the output from a first rendering node may be point cloud data which logically describes a 3D scene. This point cloud data can be encoded using the techniques of EP21386059.6. A second rendering node may then operate on the point cloud data to generate image data that is more readily displayed by a generic display device, without requiring the display device to model the 3D environment. This image data may be encoded using video coding techniques.
The chaining of rendering nodes may be extended to arbitrary tree structures, where a rendering node obtains partially rendered frames from more than one preceding rendering node, and generates further partially or fully rendered frames based on the multiple obtained sequences of partially rendered frames. For example, a content rendering network (CRN) comprising numerous rendering nodes may be used to serve a volumetric event to a large number of same-time users, such as users participating in a shared virtual environment. Rendering the same event for each user is far more expensive in terms of computation time and power consumption than rendering the volumetric effect once and performing the rendering equivalent of multicasting the volumetric effect for multiple users. For example, each user may have a second rendering node (such as a VR headset), and the network may comprise a central first rendering node. The first rendering node may render the volumetric event, and distribute partially rendered frames depicting the volumetric event to the different second rendering nodes. The second rendering node for each user may then integrate the partially rendered frames depicting the volumetric event into a view of the virtual environment which is currently being shown to each user, based on parameters such as the user’s virtual position.
The receiver 15, decoder 16 and display device 17 may be consolidated into a single device, or may be separated into two or more devices. For example, some VR headset systems comprise a base unit and a headset unit which communicate with each other. The receiver 15 and decoder 16 may be incorporated into such a base unit. In some embodiments, the network 14 may be omitted. For example, a home display system may comprise a base unit configured as an image data source, and a portable display unit comprising the display device 17.
In the event that the decoder 16 or the display device 17 does not or cannot handle one or more layers, the receiver 15 or another transmitter associated with the decoder or display device may send a corresponding layer drop indication back through the network 14. The layer drop indication may be received by each rendering node. A rendering node which generates partially or fully rendered frames for that specific decoder or display device may cease generating the dropped layer. On the other hand, a rendering node which generates partially or fully rendered frames for multiple end devices may disregard a layer drop indication received from one end device (as the dropped layer is still needed for other devices). Alternatively, rendering nodes which serve multiple end devices may record received layer drop indications, and may cease generating the dropped layer only when all end devices served by the rendering node indicate that the layer is to be dropped.
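By way of purely illustrative, non-limiting example, the following Python sketch shows the bookkeeping that a rendering node serving several end devices might perform so that a layer is only dropped once every served device has sent a layer drop indication; the class and method names are assumptions made for illustration only.

class LayerDropTracker:
    def __init__(self, served_devices):
        self.served = set(served_devices)
        self.drop_requests = {}            # layer name -> set of devices requesting the drop

    def record_drop_indication(self, device_id, layer):
        self.drop_requests.setdefault(layer, set()).add(device_id)

    def should_generate(self, layer):
        # Keep generating while at least one served device has not asked to drop the layer.
        return not self.served.issubset(self.drop_requests.get(layer, set()))

tracker = LayerDropTracker(["headset_a", "headset_b"])
tracker.record_drop_indication("headset_a", "enhancement_depth")
print(tracker.should_generate("enhancement_depth"))   # True: headset_b still needs the layer
tracker.record_drop_indication("headset_b", "enhancement_depth")
print(tracker.should_generate("enhancement_depth"))   # False: all served devices dropped it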
In preferred examples, the encoders or decoders are part of a tier-based hierarchical coding scheme or format. Hierarchical coding enables frames to be communicated with higher resolution and/or higher frame rate than is possible in single-tier coding schemes. In hierarchical coding, one or more enhancement layers are communicated with base data, where the enhancement layers can be used to up-sample the base data at the decoder, for example providing up-sampling in a spatial or temporal dimension. When combined with equivalent down-sampling of the original frames and generation of the enhancement layer at an encoder, hierarchical coding can overall provide lossless compression of data, with higher resolution and/or higher frame rate for a given transmission bit rate. Examples of a tier-based hierarchical coding scheme include LCEVC: MPEG-5 Part 2 LCEVC (“Low Complexity Enhancement Video Coding”) and VC-6: SMPTE VC-6 ST-2117, the former being described in PCT/GB2020/050695, published as WO 2020/188273, (and the associated standard document) and the latter being described in PCT/GB2018/053552, published as WO 2019/111010, (and the associated standard document), all of which are incorporated by reference herein. However, the concepts illustrated herein need not be limited to these specific hierarchical coding schemes.
A further example is described in WO2018/046940, which is incorporated by reference herein. In this example, a set of residuals are encoded relative to the residuals stored in a temporal buffer.
LCEVC (Low-Complexity Enhancement Video Coding) is a standardised coding method set out in standard specification documents including the Text of ISO/IEC 23094-2 Ed 1 Low Complexity Enhancement Video Coding published in November 2021, which is incorporated by reference herein.
The system above is suitable for generating a representation of a scene (e.g. using the image generator 11) and presenting this representation to a user (e.g. using the display device 17). The scene typically comprises an environment, where the user is able to move (e.g. to move their head and/or to turn their head) to look around the environment and/or to move around the environment. For example, the scene may be a scene of a room in a building, where the user is able to move around the room (e.g. by moving in the real-world and/or by providing an input to a user interface) in order to inspect various parts of the room. Typically, the scene is arranged to be viewed and/or experienced using XR (e.g. VR) technology, such as a virtual reality headset, where the user is then able to move about the scene in three degrees of freedom (3DoF) or six degrees of freedom (6DoF) so as to experience the scene.
A ‘scene’ as described herein relates to an environment and/or an event that can be represented to a viewer to enable a viewer to experience the scene. For example, the scene may comprise a real or virtual location, the scene may comprise an event, such as a concert, or the scene may comprise a scene of a movie or a TV show.
The methods described herein relate to the generation and presentation of a representation of the scene.
This representation is typically generated using a video file so that a user is able to view the scene by playing the video file. Typically, the representation is a three-dimensional (and/or greater than three-dimensional) representation of the scene, where this representation can then be viewed via a virtual reality or augmented reality display device. With this display device a user is able to play the scene (or the representation of the scene) to experience the scene. So, in practice, presenting a representation of the scene may comprise presenting a video file to a user at the display device, and playing/pausing the scene may comprise playing or pausing the video file to show the scene to the user. By providing a three-dimensional representation of the scene (e.g. via a virtual reality headset), it is possible to provide a more immersive experience to a viewer than is possible with a two-dimensional display (that may, for example, be viewed on a television). However, generating and presenting this three-dimensional representation typically requires an increased amount of hardware or software as compared to a two-dimensional representation. The present disclosure decreases the burden placed on a viewer of a three-dimensional representation.
Presenting a representation of the scene may comprise presenting a representation directly or processing a received representation prior to the presenting of a processed representation. For example, the representation may be generated offline at the image generator 11 before being transmitted to the display device 17. The representation may be processed at the image generator prior to transmission (e.g. to provide to the display device an image that is suitable for displaying to the user) or the representation may be processed at the display device to determine this image for display. Presenting the representation encompasses embodiments in which the representation is suitable for being shown to the user without further processing and also embodiments in which the representation is processed prior to being displayed to the user (e.g. presenting the representation may comprise the determination of a rendition of the representation and the presenting of this rendition).
References herein to the playback and modification of a ‘scene’ should be understood to encompass playback and modification of a representation of this scene, which representation comprises at least a base representation of the scene and, optionally, further elements such as a scene element that may be combined with the base representation to provide the representation. There may be a plurality of different representations of the scene (e.g. viewed by different viewers or devices), so that the playing of the scene may be viewed via these representations; but the representations may not show every aspect of the scene (e.g. the representations may relate to an obscured view). In any event, playing the scene and playing a representation of the scene are essentially equivalent for a viewer of the representation.
Referring to Figure 3, an aspect of the present disclosure provides a method of combining a base representation of a scene with a scene element (or a further element, or a further object). Typically, this comprises providing a high-quality, immersive, three-dimensional, base representation of the scene (that is generated using a process that requires substantial time and/or a powerful computer device) that is combinable with a lower quality scene element (that is generated in a shorter time, e.g. in real-time, or that is generated using a less powerful computer device).
In a first step 31, a computer device, such as the image generator 11, generates a base representation of a scene. The scene may be a depiction of a real scene (e.g. a depiction of a particular building, or a particular event). Equally, the scene may be a depiction of a virtual scene (e.g. a computer-generated scene of another planet or an animated scene). The base representation enables a user to view the scene and/or to move about the scene so as to experience the scene. For example, the base representation may comprise a high quality volumetric video. Therefore, a user is able to move around the (base representation of the) scene so as to view the scene from different viewpoints/perspectives.
The ‘base representation’ is a representation of the scene (e.g. a video showing the scene) that can be rendered or displayed by the display device. The base representation may be a video file or may be a sequence of images. In some embodiments, the base representation comprises an alternate format that can be converted into a video file. For example, the base representation may comprise a point cloud of the scene, which point cloud enables images to be generated in dependence on a location and/or orientation of the user. Typically, the base representation comprises an (e.g. encoded) video file that can be decoded and rendered to display the scene to a viewer using the display device 17. The rendering of this base representation may depend on a characteristic of the display device 17, e.g. the base representation may be rendered in dependence on a display frame rate of the display device.
Generating the base representation may comprise rendering the base representation and/or may comprise identifying or decoding a file associated with the base representation. Various devices may generate a version of the base representation where, for example, the image generator 11 may initially generate the base representation based on an input model of a scene, with this initially generated base representation then being rendered and encoded. The display device 17 may then receive this encoded representation and generate the base representation by decoding the encoded version of the base representation.
The base representation has a first quality, which is typically a high quality. For example, the base representation may be generated using a ray-tracing process or a machine learning process (e.g. as described in WO 2016/061640 A1, which is incorporated herein by reference). Typically, the base representation enables a user to view the scene at a high frame rate and/or at a high resolution in order to provide an immersive experience. The base representation may have a resolution of at least 4K, at least 8K, and/or at least 16K (overall or per eye).
The generation of the base representation typically comprises generating image data that enables a user to view the base representation, where this image data can then be transmitted, via the network 14, to the display device 17.
The base representation (and more generally the scene) may be played in order for a user to see and/or interact with the scene. This typically comprises a user selecting the scene for playback on the display device 17. The display device typically enables the user to control this playback and/or to interact with the scene (e.g. to pause, speed up, slow down, or skip through the playback; to choose a scene to view; or to select an item from within a scene being played).
In a second step 32, a computer device, such as the image generator 11 or the display device 17, generates a scene element that can be combined with the base representation in order to provide a combined representation of the scene. The scene element may, for example, comprise a modification to the base representation or a supplement to the base representation.
In various embodiments, the scene element may comprise a filter or an overlay, which may be used to modify the base representation to depict, for example, fog passing through the scene or to modify an atmosphere of the scene (e.g. to lighten or darken the scene, or to provide a sepia filter). In various embodiments, the scene element may comprise an interactive element, such as a target, where a viewer of the scene is able to click on the target to perform an action. More generally, the scene element may be able to provide an output in dependence on an interaction with the scene element, e.g. to change an attribute of the scene element, of the base representation, or of another scene element, or to alter a feature of the playback of the scene.
The scene element may comprise an object that may be either a virtual object or a representation of a real object; for example, a real object may be imaged using a camera and the resultant image may be used as a basis for the base representation. The scene element may comprise an avatar and/or a personalisation, where the avatar may be an avatar of a viewer of the scene or of a further person - and this avatar may be selected in dependence on a viewer of the scene (e.g. based on a user profile of this viewer that is accessed by the display device). Such embodiments enable a viewer of a scene to picture themselves within the scene and/or to interact with an avatar within the scene. The scene element may comprise a static object, where the object does not change after rendering. Equally, the scene element may be arranged to move (e.g. rotate or translate) after rendering. Yet further, the scene element may be a time-varying object or an animated object where an attribute of the object (e.g. a size, shape or colour) changes after the rendering of the object.
The scene element typically has a second quality, which second quality is typically lower than the first quality of the base representation. For example, the scene element may be generated using a rasterization process, which rasterization process involves transforming and modifying a three-dimensional model before converting this model into a two-dimensional image for display. Typically, the scene element has a lower frame rate and/or resolution than the base representation. The scene element may, for example, have a resolution of no more than 8K, no more than 4K, and/or no more than 2K.
Where the scene element has a lower frame rate than the base representation, displaying the representation of the scene may comprise altering a position/feature of the scene element for non-adjacent frames (e.g. if the base representation has a frame rate of 60fps and the scene element has a frame rate of 30fps, the scene element may only be updated once every two frames of the base representation). Similarly, where the scene element has a lower resolution than the base representation, the scene element may be upsampled, upscaled, or processed for display within the base representation, where this may involve displaying a value for a single ‘pixel’ of the scene element on a plurality of pixels of the display. The combining of the scene element with the base representation may comprise a step of motion interpolation so as to produce the final rendering of the scene element, where this enables a low frame rate scene element to be combined with a higher frame rate base representation.
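By way of purely illustrative, non-limiting example, the following Python sketch shows how a 30 fps scene element might be combined with a 60 fps base representation, either by holding the element’s state across two base frames or by linearly interpolating its position between element frames (a simple form of motion interpolation); the data layout is an assumption made for illustration only.

def element_position_for_frame(base_frame_idx, element_positions,
                               base_fps=60, element_fps=30, interpolate=True):
    """element_positions[i] is the element's (x, y) position at element frame i."""
    ratio = base_fps // element_fps                    # e.g. 2 base frames per element frame
    i = base_frame_idx // ratio
    if not interpolate or i + 1 >= len(element_positions):
        # Hold the most recent element frame (no interpolation).
        return element_positions[min(i, len(element_positions) - 1)]
    t = (base_frame_idx % ratio) / ratio               # 0.0 or 0.5 for a 60/30 combination
    (x0, y0), (x1, y1) = element_positions[i], element_positions[i + 1]
    return (x0 + t * (x1 - x0), y0 + t * (y1 - y0))    # simple linear motion interpolation

positions = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)]     # element positions sampled at 30 fps
print([element_position_for_frame(f, positions) for f in range(4)])
# [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0), (15.0, 0.0)]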
Typically, the scene element comprises a real-time element, which real-time element is generated during the playback of the scene. The real-time element may be generated and displayed (effectively) in real-time and/or may be generated based on a current situation of the viewer or the display device at the time of generation of the real-time element. In general, the scene element may be generated during the playback of the (base representation of the) scene as opposed to being generated prior to the playback of the (base representation of the) scene.
The base representation may be displayed based on a two-pass rendering process that includes a first rendering step of rendering multiple points of view prior to the displaying of the base representation and then a second rendering step that occurs at or near the time of display so as to render the scene for viewing by the viewer. The scene element may be combined with the base representation at this second rendering step so as to provide a representation for viewing by the viewer.
Generating the scene element at the display device 17 may comprise receiving the scene element from a further device (e.g. in a transmission from a further device) and then generating the scene element based on this transmission. Equally, generating the scene element at the display device may comprise generating the scene element at the display device for the first time. Regardless of whether the scene element is generated for the first time at the display device or at another computer device, the display device will need to generate (and/or render) the scene element in order to display the scene element, where this may involve re-generating the scene element based on the aforementioned transmission.
It will be appreciated that, in practice, it is not possible to generate and present an element in exactly real time (i.e. with absolutely no delay between an action that prompts the generation of an object and the display of the generated object following generation). Where the term ‘real-time’ is used in this disclosure, it will be appreciated that this relates to near real-time, where a real-time object is generated and presented very soon after an action that prompts the generation. For example, where the scene element is a real-time element, this scene element may be generated and/or presented less than five seconds, less than one second, less than a tenth of a second and/or less than a hundredth of a second after the triggering of a trigger that prompts the generation of the scene element (where the time taken may depend on an available processing power). In contrast, the base representation may be generated greater than one day, greater than one week, and/or greater than one month before the presenting of the scene.
In some embodiments, the base representation of the scene is generated prior to the playback of the scene and/or the scene element is generated during the playback of the scene. The scene element may be arranged to be generated at a rate that is equal to (or greater than) a rate of playback of the scene (e.g. so that the scene element may be generated and presented in real time). In an example, where the scene is played at 60 frames per second, the scene element may be generated at 60 frames a second (so as to be displayed in real time). The base representation may be arranged to be generated at a rate that is slower than the rate of playback of the scene (e.g. so that the base representation must be generated prior to the playback of the scene). The base representation and the scene element being ‘arranged’ to be generated at certain rates may comprise a quality of these components being chosen to enable this generation.
In practice, the base representation is typically generated prior to the playback of the scene, where each frame of the base representation may be associated with a generation time of at least one minute, at least ten minutes, and/or at least an hour. This enables the generation of a high-quality base representation. The scene element may then be generated at a rate that is the same as a playback rate of the base representation so that the scene element can be generated and displayed (within the base representation) in real time.
In some embodiments, the scene element comprises a pre-generated element, which may then be combined with the base representation based on a triggering condition and/or based on a context of a viewer of the scene (e.g. to combine a high-quality base representation provided by a first party with a lower quality scene element provided by a second party). The use of a pre-generated element for the scene element enables each component of the scene to be generated prior to the display of the scene, where the combination of the scene may still be performed in real-time to combine these components. In some embodiments, the scene element comprises a live or near-live element, where the scene element may be a streamed video or the scene element may be formed based on a video of an event with a slight delay, which delay provides time for processing the video to form the scene element.
Typically, the base representation and/or the scene element is streamed at the display device. Streaming the scene element, or streaming the base representation, typically connotes presenting the scene element/base representation shortly after receipt of a transmission containing the scene element/base representation. This may involve receiving an encoded version of the scene element/base representation in a transmission, then decoding this encoded version of the scene element/base representation at the display device, and then presenting the scene element/base representation. Streaming enables the scene element/base representation to be transmitted and displayed essentially simultaneously (with some small delay to allow the decoding to occur and also, optionally, some buffering to be performed) so that the base representation/scene element does not need to be downloaded prior to the playback of the scene. Equally, a user may be able to download the scene element or the base representation, where in some embodiments the scene element or base representation is downloaded to the display device 17 (or to a device attached to the display device) prior to the decoding and/or playback of the scene element/base representation. These different modes of playback may be desirable in different situations, where streaming the base representation/scene element typically requires an Internet connection with reasonable bandwidth, but downloading the base representation/scene element typically requires a device with substantial storage space as well as a degree of preparation. Since the scene element is typically of a lower quality than the base representation, it is typically easier to stream the scene element (e.g. it typically requires less bandwidth to stream the scene element than the base representation). Therefore, in many embodiments, the base representation is arranged to be downloaded prior to the playback of the scene with the scene element being streamed at the time of playback. In various embodiments: the base representation is downloaded to the display device 17 (or a connected device) prior to the playback of the scene and the scene element is streamed to the display device from a further device; or the scene element is downloaded to the display device prior to the playback of the scene and, optionally, combined with the base representation at the time of presenting the scene (e.g. the scene element may be selected from a database of possible scene elements, which database is stored on the display device and/or a connected device).
More generally, any one or more of the base representation and the scene element may be streamed to the display device 17 and any one or more of the base representation and the scene element may be downloaded to the display device prior to playback of the scene (e.g. so that the base representation may be streamed with the scene element being downloaded and/or generated in real time, or the scene element may be streamed with the base representation being downloaded prior to the presentation of the scene, or each of the base representation and the scene element may be streamed).
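By way of a non-limiting illustration only (the function names, thresholds, and figures below are assumptions of this sketch, not features of the disclosure), the choice between streaming and downloading a given component might be made from a measured link bandwidth and the storage available at the display device:

```python
from dataclasses import dataclass

@dataclass
class Component:
    name: str            # e.g. "base_representation" or "scene_element"
    size_bytes: int      # total size if downloaded ahead of playback
    bitrate_bps: float   # sustained bitrate needed to stream it

def delivery_mode(component: Component, link_bps: float, free_storage_bytes: int) -> str:
    """Pick 'stream' when the link can sustain the bitrate with some headroom,
    otherwise fall back to 'download' if there is room to store the component."""
    if link_bps >= 1.5 * component.bitrate_bps:     # headroom for jitter
        return "stream"
    if free_storage_bytes >= component.size_bytes:
        return "download"
    return "unavailable"

# Example: a large, high-quality base representation and a small scene element.
base = Component("base_representation", size_bytes=40_000_000_000, bitrate_bps=80e6)
element = Component("scene_element", size_bytes=200_000_000, bitrate_bps=5e6)
print(delivery_mode(base, link_bps=30e6, free_storage_bytes=100_000_000_000))     # 'download'
print(delivery_mode(element, link_bps=30e6, free_storage_bytes=100_000_000_000))  # 'stream'
```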
In a third step 33, a computer device, such as the display device 17, combines the scene element with the base representation. This may comprise modifying the base representation based on the scene element and/or superimposing the scene element onto the base representation.
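As a minimal, purely illustrative sketch of one way this combining step might be performed (assuming the scene element is delivered as an RGBA image to be superimposed onto an RGB frame of the base representation; the function and variable names are assumptions):

```python
import numpy as np

def composite(base_frame: np.ndarray, element_rgba: np.ndarray, top: int, left: int) -> np.ndarray:
    """Superimpose an RGBA scene element onto an RGB base frame by alpha blending."""
    out = base_frame.astype(np.float32).copy()
    h, w = element_rgba.shape[:2]
    region = out[top:top + h, left:left + w, :]
    rgb = element_rgba[..., :3].astype(np.float32)
    alpha = element_rgba[..., 3:4].astype(np.float32) / 255.0
    out[top:top + h, left:left + w, :] = alpha * rgb + (1.0 - alpha) * region
    return out.astype(np.uint8)

base = np.zeros((2160, 3840, 3), dtype=np.uint8)        # one frame of the base representation
element = np.full((540, 960, 4), 128, dtype=np.uint8)   # semi-transparent scene element
frame = composite(base, element, top=800, left=1400)
print(frame.shape)  # (2160, 3840, 3)
```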
This combination of a high-quality base representation and a lower-quality scene element enables the generation of a high-quality, immersive representation of a scene at a time prior to the viewing of the scene (e.g. where this generation can use a powerful computer and/or can be performed over an extended duration) while enabling on-the-fly modification of a scene element in order to modify the scene as it is being viewed.
In an example of the use of the scene element, the scene element may depend on an environmental condition or a context of the viewer. For example, the base representation of the scene may depict the scene on a clear day and the scene element may comprise an environmental filter that modifies the base representation to better reflect a current situation of a viewer (e.g. the scene element may provide a darkening filter if the scene is being viewed at night or while the viewer is in a cloudy location; and the scene element may comprise a depiction of rain if the user is in a rainy location). Similarly, the viewer may select the scene element, e.g. to select whether to view a clear scene, a rainy scene, or a cloudy scene. This enables the generation of a single, high quality, base representation that can then be modified by the scene element to personalise the scene and/or to provide a scene that is appropriate for a context of a viewer of the scene.
In another example, the scene element may provide a third-party modification to the base representation where, for example, the base representation may provide a background on which third parties (that were not involved in the generation of the base representation) are able to supplement this background with scene elements. This enables a viewer of the scene to personalise the scene in a desired way. For example, the viewer may be able to insert an avatar into the scene where this avatar mirrors the real-time movements of the user. Similarly, the base representation may provide a background onto which numerous third parties are able to impose different scene elements. This provides a versatile representation of a scene, where a single image generating party (that has access to powerful computer devices) is able to provide a base representation that can be used by numerous modifying parties (that may not have access to such powerful computer devices).
In yet another example, the scene element may be an interactive element, where the user is able to interact with this element in real time in order to add functionality to the base representation. Handling real-time interactions with an element of an XR scene typically requires large amounts of computing power and so providing a real-time interaction with a high-quality scene element may require an amount of computing power that is not available to many devices/users. By providing the scene element in lower quality than the base representation, it becomes possible to provide a high-quality background (e.g. generated using an off-site server) and to combine this with a lower-quality scene element (e.g. generated using a personal computer or a smartphone) to provide a user with an immersive experience that is still personalised and updated in real time.
The scene element may be generated so that a user is able to interact with the scene element via a user input, e.g. using one or more of: eye tracking, gesture tracking, and speech. To enable this tracking, the display device 17 may comprise a tracking sensor, a camera, and/or a microphone.
The scene element may be generated for only a portion of the scene, e.g. based on a current viewpoint and/or a current perspective of a user of the display device 17. This enables the scene element to be generated only for a portion of a scene that is being viewed by the user. Such generation may reduce the processing power required to generate and render the scene element. The portion of the scene may be determined automatically, e.g. based on a sensor of the display device. Equally, the portion of the scene may be determined based on a user input or based on a feature of the base representation where, for example, a generator of the base representation is able to define an area onto which a scene may be imposed.
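One simple, illustrative way of deciding whether the scene element falls within the portion of the scene currently being viewed is a field-of-view test against the tracked viewing direction; the function names and the 110-degree default below are assumptions of this sketch only:

```python
import numpy as np

def element_in_view(view_dir, element_dir, fov_degrees: float = 110.0) -> bool:
    """Return True when the direction from the viewer to the scene element lies
    within the viewer's field of view, so the element is worth generating."""
    v = np.asarray(view_dir, dtype=float); v /= np.linalg.norm(v)
    e = np.asarray(element_dir, dtype=float); e /= np.linalg.norm(e)
    angle = np.degrees(np.arccos(np.clip(np.dot(v, e), -1.0, 1.0)))
    return angle <= fov_degrees / 2.0

print(element_in_view((0, 0, -1), (0.2, 0.0, -1.0)))  # True: roughly ahead of the viewer
print(element_in_view((0, 0, -1), (0.0, 0.0, 1.0)))   # False: behind the viewer
```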
The first step 31 and the second step 32 of the method of Figure 3 are typically carried out at different computer devices, where the computer device (or computer devices) used to generate the base representation are typically more powerful than the computer device (or computer devices) used to generate the scene element.
In some embodiments, the base representation comprises (or is associated with) a trigger, which trigger is arranged to prompt the generation or display of the scene element. This trigger may, for example, be a time, a condition, or a viewing perspective that is contained in the image data that forms the base representation. In such a way, a generator of the base representation is able to (at least partly) control the generation of the scene element and to ensure that the scene element and the base representation of the scene are synchronized.
In some embodiments, the base representation comprises a trigger that is associated with a user action where, for example, an input being provided via a user interface results in the generation or rendering of the scene element. This input may be associated with a specific area of the scene, an interaction with an object in the scene, or an input in a menu associated with the scene. In a practical implementation, the scene may be associated with various different elements of interactive media content, where the user is able to select an option from a list of possible operations and the scene element is rendered based on the selected option (e.g. the user may select a scene element to be rendered based on a list of possible scene elements).
In some embodiments, the base representation comprises a trigger (e.g. a trigger point) that activates at a certain point (e.g. a certain frame) of the base representation, e.g. one minute into the scene, and that initiates the generation or rendering of the scene element. In a practical implementation, the scene may comprise a mirror that comes into view at this point, and a representation of a viewer of the scene may then be superimposed onto the mirror in real time. With such an implementation, the base representation may also comprise a trigger so that, if a user turns their head to view the mirror at any point during the playback of a scene, the real-time representation of the viewer is generated and superimposed over the mirror. Such an implementation enables a scene to be personalised based on a viewer of that scene.
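A minimal sketch of how such triggers might be represented and evaluated is given below; the trigger names, frame counts, and state keys are illustrative assumptions rather than part of the disclosure:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Trigger:
    name: str
    condition: Callable[[dict], bool]   # evaluated against the current playback state

def active_triggers(triggers, state):
    """Return the triggers whose conditions hold for the current playback state."""
    return [t for t in triggers if t.condition(state)]

triggers = [
    # Fires one minute (1800 frames at 30 fps) into the base representation.
    Trigger("mirror_cutscene", lambda s: s["frame"] >= 1800),
    # Fires whenever the viewer's gaze falls on the area tagged "mirror".
    Trigger("mirror_gaze", lambda s: s.get("gazed_area") == "mirror"),
]

state = {"frame": 2000, "gazed_area": "mirror"}
for t in active_triggers(triggers, state):
    print(f"trigger fired: {t.name}")  # would start generating/rendering the scene element
```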
In various embodiments, the second quality being lower than the first quality comprises one or more of: the second quality having a lower resolution than the first quality (e.g. the scene element having a lower resolution than the base representation); the second quality having a lower frame rate than the first quality; and the second quality having a lower colour range than the first quality. In some embodiments, the first quality is associated with a ray tracing process (e.g. the base representation is generated using a ray tracing process). In some embodiments, the second quality is associated with a rasterization process (e.g. the scene element is generated using a rasterization process).
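For example, where the second quality is a lower resolution, the scene element might be upsampled to the resolution of the base representation before the combining step; the following nearest-neighbour sketch is illustrative only and assumes RGBA image data and the resolutions shown:

```python
import numpy as np

def upsample_nearest(element: np.ndarray, target_h: int, target_w: int) -> np.ndarray:
    """Nearest-neighbour upsample of a lower-resolution scene element so that it
    can be composited onto a higher-resolution frame of the base representation."""
    h, w = element.shape[:2]
    rows = np.arange(target_h) * h // target_h
    cols = np.arange(target_w) * w // target_w
    return element[rows][:, cols]

element = np.random.randint(0, 255, (540, 960, 4), dtype=np.uint8)   # 540p RGBA element
upsampled = upsample_nearest(element, 2160, 3840)                    # match a 4K base frame
print(upsampled.shape)  # (2160, 3840, 4)
```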
In some embodiments, the base representation is generated for a first range of viewpoints of a scene (e.g. the base representation may be generated for each point within a viewing zone, as described further below) so that a user is able to view the base representation from a plurality of viewpoints (e.g. the user is able to view the base representation as the user moves through the scene). In some embodiments, the scene element is generated for only a limited, second range of viewpoints of the scene (this range typically being smaller than the range of viewpoints for which the base representation is generated). This may lead to the scene element being visible from only a limited range of viewpoints, or the scene element being distorted when the user moves away from this second range of viewpoints. In practice, the base representation may enable a user to move through a scene to experience the scene, while the scene element may be arranged to be viewed while the user remains substantially still (e.g. the base representation may show a forest that the user is able to explore, while the scene element may show a bird flying over the forest where the bird is only visible for a short period of time).
In some embodiments, the scene element (and/or the base representation) is generated so as to be viewed from an expected viewpoint or perspective. For example, the scene element may be visible only when the user is facing a predetermined portion of the base representation. The base representation may be formed of portions of different quality, where the base representation may have a higher quality at an expected viewpoint or an expected perspective, where this encourages the user to move to this viewpoint/perspective (e.g. the base representation may have a higher quality when the user is looking in a ‘forwards’ direction as compared to when the user is looking in a ‘backwards’ direction).
In some embodiments, the scene element comprises a real element that is captured using a camera, e.g. a camera of the display device 17. This enables the personalisation of a scene based on a real element, e.g. a real element that is in the vicinity of a viewer of a scene.
In some embodiments, the base representation and the scene element may be generated at the same time or may be generated using the same device (e.g. the image generator 11), where this still enables the presentation of a relatively high quality base representation with a relatively low quality scene element. This reduces the amount of processing power and/or bandwidth required by the display device to display the scene as compared to an implementation in which each element of the scene is generated with a high quality.
The scene element may be included in a layer and/or an enhancement scene that is sent in association with the base representation. That is, the image generator 11 may generate both the base representation and the scene element and send these two features as part of a transmission to the display device 17 where, depending on the capabilities of the display device, a scene element of an appropriate quality is rendered (e.g. for less capable display devices a lower quality scene element may be rendered at the display device).
In some embodiments, the scene element is an object that is in an environment of a wearer of the display device 17. For example, where the wearer is in the same environment as another person, an image of this other person may be captured using a camera and the scene element may be generated based on this image. This enables, for example, two users that are each wearing display devices to view a scene together and to interact with each other while viewing the scene.
More generally, the scene object may be generated based on an object in the environment, where this is particularly beneficial for augmented reality implementations. For example, the base representation may be combined with a scene element that reflects a current environment of a user to provide an immersive experience in a scene that includes features of the real-world. In a practical example, a chair that is in the vicinity of the wearer of the display device 17 may be captured by a camera of the display device and a scene element may then be generated to represent this chair, with the scene element being combined with the base representation (e.g. to place an overlay over a real chair to present an augmented version of the chair, such as a throne). A user is then able to sit on the scene element in the virtual scene in order to sit on the chair in the real world and so the user can take part in an immersive experience while interacting with real-world objects in a way that does not break the immersion.
In some embodiments, the scene element relates to a further wearer of a further display device, where this further wearer may (or may not) also be viewing the scene (or the base representation). This further wearer may be in the same real-world environment as the wearer of the display device or may be in a different real-world environment (e.g. where the two wearers are both dialled in to a meeting from different real-world locations).
Typically, the (base representation of the) scene comprises a three-dimensional scene. Equally, the (base representation of the) scene may provide further dimensions (e.g. four, five, or more dimensions), where these other dimensions may relate to physical effects, time effects, etc.
Viewing zones
Typically, the base representation is generated so as to enable movement of a viewer around the scene. For example, where the scene is a room, the base representation may enable a user to walk around the room so as to view the room from different angles.
In particular, the base representation may be generated in order to enable six degree-of-freedom (6DoF) movement through the scene, where this aids in the provision of an immersive experience for a viewer (and where this reduces any motion sickness effect that may occur for a user of a VR scene). Potentially problematically, generating a base representation that enables such movement requires the base representation to enable viewing from each point within that scene and so requires a substantial file size.
To reduce the file size of the base representation (as compared to a representation that enables unrestricted movement around a scene), in some embodiments the base representation is associated with one or more viewing zones (or zones of view, or viewing volumes), where the base representation enables a user to view the scene in high quality only from within the viewing zones and/or enables a user to move freely (e.g. with six degrees of freedom) only within the viewing zones. The viewing zones have a limited volume, which volume is less than a volume of the scene (e.g. the volume may be less than 50% of the volume of the scene, less than 20% of the volume of the scene, and/or less than 10% of the volume of the scene).
An example of such a viewing zone is illustrated in Figure 4, which figure shows a scene 41 that contains a (first) viewing zone 42. The viewing zone enables a viewer to move from a first position 43 in the viewing zone to a second position 44 in the viewing zone so as to view the scene from these different viewpoints/perspectives. Figure 4 further shows a second viewing zone 45, where a user may be able to move between the first and second viewing zones.
While Figure 4 shows viewing zones in two dimensions, it will be appreciated that viewing zones are typically implemented as three-dimensional volumes (and viewing zones may also be four-dimensional, where a three-dimensional location of the viewing zone changes over time). It will be appreciated that viewing zones may be formed in any size or shape, with different sizes and shapes being suitable for different scenes.
As is shown in Figure 4, the use of the viewing zones enables a base representation to be generated without the requirement to render or consider every single point within the scene. Considering, as an example, the base representation that is associated with the viewing zone 42, the portion 46 of the scene 41 of Figure 4 is obscured or occluded behind a wall for all of the points within this viewing zone 42. Therefore, the image data that forms the base representation is not required to generate or render data for this occluded portion of the scene. This reduction in the amount of data that is required enables the generation of a high-quality immersive scene with a much lower processing and storage requirement than if the entirety of the scene were to be rendered in high quality allowing full movement. Furthermore, the base representation may be generated so as to show nearby objects (that are near to the boundaries of the viewing zone) in greater detail than distant objects (that are far from the boundaries of the viewing zone). This may involve the base representation being generated using (real or virtual) scanners that are set up at the boundaries of the viewing zone, with point field data being obtained based on beams emitted from these scanners (and with the scanners being arranged to emit beams at regular angles).
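A minimal sketch of generating regularly spaced beam directions for such a boundary scanner is given below; it is purely illustrative, and the angular step counts and function name are assumptions:

```python
import numpy as np

def scanner_directions(azimuth_steps: int = 360, elevation_steps: int = 180) -> np.ndarray:
    """Unit-length beam directions at regular angular spacing for a (real or
    virtual) scanner placed on the viewing-zone boundary; only surfaces hit by
    these beams need to be stored, so geometry occluded from the zone is skipped."""
    az = np.radians(np.arange(azimuth_steps) * 360.0 / azimuth_steps)
    el = np.radians(np.arange(elevation_steps) * 180.0 / elevation_steps - 90.0)
    az_grid, el_grid = np.meshgrid(az, el)
    dirs = np.stack([np.cos(el_grid) * np.cos(az_grid),
                     np.cos(el_grid) * np.sin(az_grid),
                     np.sin(el_grid)], axis=-1)
    return dirs.reshape(-1, 3)

print(scanner_directions(36, 18).shape)  # (648, 3): one unit vector per beam
```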
Typically, the base representation is generated based on a three-dimensional model, where the model may comprise the occluded portion but the generation of the base representation is such that the base representation does not include the occluded portion. Therefore, there is a loss of information moving from the model to the base representation (and a corresponding reduction in file information and size), but since the occluded portion cannot be seen from within the viewing zone the viewer of the base representation is unable to identify this loss of information.
A detailed method of generating image data for such a viewing zone (e.g. to form the base representation) is described in WO 2016/061640 A1, which is incorporated herein by reference. Further methods of generating and transmitting representations are described in WO 2024/003577 A1, which is also incorporated herein by reference.
The combination of both a base representation of a first quality that is generated for a viewing zone and a scene element of a second quality provides a method by which an immersive experience can be provided using a limited bandwidth while also providing personalisation and modification. Specifically, the use of the viewing zone(s) leads to a substantial reduction in the bandwidth needed to transmit the base representation from the image generator 11 to the display device 17. This enables a base representation that provides an immersive experience to feasibly be provided to the display device via a wireless network and thereby greatly increases the versatility of the display device. The use of a scene element as described herein, which scene element has a different quality to the base representation, further enables the personalisation of this immersive experience, where the use of the different (e.g. lower) quality again enables the possibility of providing the scene element to the display device over a wireless connection and enables the possibility of generating the scene element, or of performing the combination of the base representation and the scene element, on a computing device without huge amounts of computing power.
In some embodiments, the base representation is streamed to the display device (or to a computer device that is in the vicinity of the display device) from a separate server (e.g. over the Internet), with the scene element being generated on the display device or the proximate computer device. This reduces the amount of information that needs to be transmitted to the display device to present the scene since a portion of the scene (the scene element) is not being streamed.
Present VR systems typically require the use of a specialised computer device, e.g. a gaming computer, where viewing an immersive scene requires this scene to be pre-downloaded to the computer, sent to a VR headset over a wired connection, and then continuously processed by the computer in order to warp the scene appropriately based on the actions of a viewer. The present disclosure opens the possibility of the base representation being generated on a server with substantial processing power before being streamed, over the Internet, to a display device that is relatively low-powered (e.g. a smartphone or a standalone VR headset). This display device may then generate the (lower quality) scene element and combine this element with the base representation. Equally, another specialised computer device may generate the scene element and stream this scene element to the display device. Even very powerful computers are typically not able to generate a high-quality virtual reality scene in real time, so even where the base representation is generated at a supercomputer it is typically still not feasible to generate this base representation in real time. Due to this limitation, conventional systems have had to choose between quality and real-time presentation (where real-time and/or interactive scenes cannot normally be provided at high quality). The present disclosure provides a high-quality scene together with a means for real-time modification via the scene element.
As described above, the base representation typically comprises one or more viewing zones 42, 45, which viewing zones provide a high quality representation of the scene while enabling movement in 6DoF within these viewing zones (so that the viewer can experience the scene from a plurality of different viewpoints). Enabling this movement may comprise generating a point field (as has been described above) where this enables an image to be generated for each possible position and orientation of a user within the viewing zone. As the user moves through the scene, the display device 17 is able to determine a location and orientation of the user and to present an appropriate image to the user based on this location and orientation. For example, the display device may transmit the location and orientation to the image generator 11 so that the image generator can generate the appropriate image and transmit this image to the display device for presentation at the display device. Equally, the base representation may comprise image data and/or point cloud data that is useable to render images for a plurality of different locations and orientations, where the display device is then able to render the appropriate image from this image data based on the location and orientation.
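One simple, illustrative way of selecting pre-computed image data based on the tracked position is a nearest-viewpoint lookup; the grid spacing, function names, and one-metre-cube example below are assumptions of this sketch only:

```python
import numpy as np

def nearest_viewpoint(precomputed_positions: np.ndarray, user_position: np.ndarray) -> int:
    """Index of the precomputed viewpoint closest to the tracked user position;
    the display device would then render/warp the image data stored for it."""
    d = np.linalg.norm(precomputed_positions - user_position, axis=1)
    return int(np.argmin(d))

# A coarse 3x3x3 grid of viewpoints covering a one-metre-cube viewing zone.
grid = np.stack(np.meshgrid(*[np.linspace(0.0, 1.0, 3)] * 3), axis=-1).reshape(-1, 3)
print(nearest_viewpoint(grid, np.array([0.4, 0.9, 0.1])))
```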
The volume of the viewing zone is such that a user is able to move within the viewing zone in order to view the scene, while still only enabling a limited amount of movement (where this leads to a smaller file size as compared to an implementation where a user is able to fully move about the scene).
The viewing zone may comprise a subset of the scene; and/or may allow movement through only a subset or portion of the scene; and/or may provide a limited or restricted volume in which a user is able to view the scene (in the first quality); and/or may comprise a bounded volume that enables the viewing of the scene (in the first quality), this volume being less than a volume of the scene.
Since the viewing zones have a limited volume (that is only a subset of the volume of the scene), the viewer is able to move towards an edge of the viewing zone 42. In some embodiments, a user is able to move out of the viewing zone by translational or rotational movement. In some embodiments, the display device 17 is arranged to resist and/or prevent this movement as a user approaches a boundary of the viewing zone. In various embodiments, a speed of a movement of the viewpoint of a user may slow (e.g. exponentially) as that user moves towards the boundary of the viewing zone so as to discourage movement out of the viewing zone. Equally, the viewing zone may comprise a wall that prevents movement of the viewpoint out of the viewing zone. In some embodiments, the viewing zone is arranged to wrap around, where for example exiting the viewing zone to the left returns a user to the right of the viewing zone.
Typically, a user is discouraged but not prevented from moving out of the viewing zone. For example, the base representation may be arranged so that, as the user moves towards a boundary of a viewing zone associated with the base representation, a presentation of the base representation changes. The base representation may become blurry as a user moves towards the boundary, playback of the base representation may slow down, a colour range of the base representation may decrease, or an audio playback associated with the base representation may change (e.g. to become quieter or to become slurred). In these ways, a user can be discouraged from exiting the viewing zone without feeling trapped within the viewing zone. Similarly, feedback, such as haptic feedback or an audio feedback may be provided to a user as they approach the boundary, e.g. to provide a warning to the user. In some embodiments, as the user moves towards the boundary, a transparency of the presentation of the base representation changes (e.g. so that the presentation fades to become transparent as the user reaches the boundary). In some embodiments, there is a user input required to exit the viewing zone 42. For example, the viewing zone may comprise a mechanism (such as a wall) that prevents accidental movement out of the viewing zone, but the user may be able to provide an input (such as a mouseclick or pressing a button on an XR controller) that enables or enacts such movement out of the zone. This enables a user to move out of the zone, but prevents such movement from occurring accidentally.
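An illustrative sketch of discouraging such movement by scaling movement speed and fading the presentation as the viewer nears the boundary is given below; the margin, easing curve, and thresholds are assumptions of this sketch rather than prescribed values:

```python
def boundary_effects(distance_to_boundary: float, soft_margin: float = 0.3) -> dict:
    """Slow movement and fade the presentation as the viewer nears the zone
    boundary; full speed and full opacity while well inside the zone."""
    if distance_to_boundary >= soft_margin:
        return {"speed_scale": 1.0, "opacity": 1.0, "haptic_warning": False}
    t = max(distance_to_boundary, 0.0) / soft_margin   # 1 at margin edge, 0 at the boundary
    return {
        "speed_scale": t ** 2,       # movement slows sharply near the boundary
        "opacity": t,                # presentation fades towards transparency
        "haptic_warning": t < 0.5,   # warn the viewer when very close
    }

print(boundary_effects(1.0))   # well inside the viewing zone
print(boundary_effects(0.05))  # close to the boundary
```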
In embodiments where a user is able to move out of the viewing zone 42, this movement may cause a change in the playback of the scene. For example, moving out of the viewing zone may pause or slow down playback of the scene and/or may open an options menu associated with the scene. In some embodiments, moving out of the viewing zone affects (or initiates) the generation or rendering of the scene element. For example, rendering of the scene element may cease once a user exits a viewing zone.
In some embodiments, the rendering of the scene element may continue as the user moves out of the zone so that the scene element can be viewed both from inside and outside the viewing zone (while the base representation can typically be viewed only from inside the viewing zone).
In some embodiments where a user is able to move out of the viewing zone 42, this movement may cause for a user wearing an AR headset to see the real-world around the user. For example, as the user moves towards, or past, a boundary of the viewing zone 42, the display device 17 may be arranged to present image data that has been captured using a camera of the display device.
When the user is outside of the viewing zone, any viewing zone(s) may appear to the user as floating volumes in the user's real room, where a user is then able to enter a viewing zone by moving into the associated floating volume. This may involve moving an avatar of the user into a viewing zone using an XR controller or may involve the user physically moving into a volume in the real world that overlaps the viewing zone in the representation of the scene.
In some embodiments, a trigger of the base representation that causes the generation or presentation of the scene element is associated with movement into or out of the viewing zone 42 (e.g. where exiting a zone and/or entering a zone may result in the generation of a real-time element).
In various embodiments, moving out of the viewing zone 42 leads to one or more of the following (an illustrative sketch combining several of these behaviours follows this list):
The generation and/or presentation of the scene in an altered quality. For example, the scene may be generated in a first, high, quality (e.g. of the base representation), when the viewer is within a viewing zone and in an altered, lower, quality when the viewer is outside of the viewing zone(s). The altered quality may be the second quality (e.g. where the scene is generated or rendered in the same quality as the scene element outside of the limited viewing zone). This can enable the scene to be rendered in real-time when a viewer of the scene moves out of a viewing zone.
In some embodiments, the scene is rendered in a low, e.g. two-dimensional or monochrome, quality when a viewer is outside of the viewing zone 42. In some embodiments, the scene is rendered in a two-dimensional representation when a viewer is outside of the viewing zone and in a three-dimensional representation within the viewing zone. This enables a user to move between viewing zones using the two-dimensional, basic, representation before viewing an immersive representation of the scene within a viewing zone.
A limited freedom of movement, e.g. when a viewer is outside of the viewing zone 42 movement may not be possible or movement may only be possible in three (e.g. translational) degrees of freedom.
The presentation of one or more available viewing zones. For example, when the user is outside of the viewing zone 42, the user may see one or more available viewing zones 42, 45 in which six degrees of freedom movement in high quality is possible. The user may then be able to move into one of these zones, e.g. by moving in the real-world and/or by providing a user input such as a selection of one of the zones. In some embodiments, moving out of a viewing zone results in the scene being paused and the user then being given a choice of a plurality of viewing zones, where the user can then move into one of these viewing zones in order to resume playback of the scene.
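The following sketch combines several of the behaviours listed above into a single illustrative mapping; the particular quality labels and degrees of freedom chosen are assumptions of this sketch only:

```python
def presentation_state(inside_viewing_zone: bool) -> dict:
    """One possible mapping from zone membership to presentation settings:
    high quality and 6DoF inside the zone; a reduced-quality, reduced-freedom
    view plus a display of the available zones outside it."""
    if inside_viewing_zone:
        return {"quality": "first (high)", "degrees_of_freedom": 6,
                "show_available_zones": False, "paused": False}
    return {"quality": "second (reduced)", "degrees_of_freedom": 3,
            "show_available_zones": True, "paused": True}

print(presentation_state(True))
print(presentation_state(False))
```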
The present disclosure envisages the presentation of a scene with a plurality of viewing zones 42, 45, each viewing zone being a subset of the scene, where these viewing zones each provide an (e.g. separate) immersive base representation of the scene through which a user is able to move. The viewing zones may present different angles of the scene and/or show different parts of the scene to a user, where the user is then able to move between the viewing zones in order to view different parts of the scene (e.g. based on a selection in a media menu, where the viewing zone to which a user moves is dependent on this selection). The use of viewing zones (as opposed to simply rendering the whole of the scene in high quality) reduces the processing power/bandwidth required for the display device 17 to present the scene to the user.
The different viewing zones may be associated with different scene elements and/or different triggers where a user of the display device 17 may be able to move between the zones in order to experience a story, with each zone being associated with a different part of the story (e.g. each zone may be associated with a different cutscene) and/or with each zone showing this story from a different perspective or viewpoint.
The system may be arranged so that a scene element that is generated in relation to a first viewing zone 42 is replicated when a user moves to a second viewing zone 45. That is, the method of presenting the scene may comprise generating a scene element and combining this scene element with each of a first base representation associated with a first viewing zone (when the user is in that first viewing zone) and a second base representation associated with a second viewing zone (when the user is in that second viewing zone). Where the real-time element relates to an object, this real-time element may be generated in the first viewing zone following the triggering of a first trigger associated with the first base representation and then, when the viewer moves to a second zone, this object may be re-generated (e.g. immediately, based on the user entering the second viewing zone, and/or following the triggering of a trigger associated with the second base representation), where this enables the object to be viewed from both the first viewing zone and the second viewing zone. Figure 4 shows an example of such an object 47, where this object may be a part of the base representation associated with one of the viewing zones and/or may be a scene element that is combinable with base representations of each of the first viewing zone and the second viewing zone.
The (or each) viewing zone 42, 45 is typically arranged to be of a limited size, where this provides an immersive experience within this limited-size viewing zone while reducing the amount of processing power/bandwidth required to provide this experience (as compared to a scene in which free movement is possible throughout the scene). Typically, the viewing zones are arranged to enable a user to move their head while they are sitting or standing, but not to freely roam around a room.
The (or one or more, or each) viewing zone 42, 45 may have a volume of less than five cubic metres (5 m3), less than one cubic metre (1 m3), less than one-tenth of a cubic metre (0.1 m3) and/or less than one-hundredth of a cubic metre (0.01 m3). The scene may be associated with a plurality of viewing zones of different size, where, for example, a first viewing zone 42 enables only small head movements while a second viewing zone 45 enables a user to walk through the scene.
The or each viewing zone may also have a minimum size, e.g. the or each viewing zone may have a volume of at least 1% of the volume of the scene, at least 5% of the volume of the scene, and/or at least 10% of the volume of the scene. Similarly, the or each viewing zone may have a volume of at least one-thousandth of a cubic metre (0.001 m3); at least one-hundredth of a cubic metre (0.01 m3); and/or at least one cubic metre (1 m3).
The ‘size’ of the viewing zone typically relates to a size in the real world, wherein if the viewing zone has a length of one metre this means that a user is able to move one metre in the real world while staying within the viewing zone. The size of the viewing zone in the scene may be greater than, equal to, or less than the size of the viewing zone in the real world. For example, the viewing zone may scale a real world distance so that moving one metre in the real world moves the user less than (or more than) one metre in the scene. This enables the scene to provide different perceptions to the user (e.g. to make the user feel larger or smaller than they are in real life). Similarly, the viewing zone may scale a real world angle so that rotating one degree in the real world rotates the user less than (or more than) one degree in the scene.
Therefore, a viewing zone with a volume of one cubic metre typically connotes a viewing zone in which the user is able to move about a one cubic metre volume in the real world while remaining in the viewing zone.
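A minimal sketch of such scaling between real-world movement and movement in the scene is shown below; the scale factors and function names are illustrative assumptions:

```python
import numpy as np

def real_to_scene_translation(real_delta_metres, distance_scale: float = 0.5) -> np.ndarray:
    """Map a real-world head translation (metres) into scene units; with a scale
    below 1.0, each real-world stride covers less scene distance, which is one
    way the scene can be made to feel larger than life."""
    return np.asarray(real_delta_metres, dtype=float) * distance_scale

def real_to_scene_rotation(real_degrees: float, rotation_scale: float = 1.0) -> float:
    """Map a real-world head rotation (degrees) into a scene rotation."""
    return real_degrees * rotation_scale

# Walking one metre forward in the real world advances half a metre in the scene.
print(real_to_scene_translation([0.0, 0.0, 1.0]))  # [0.  0.  0.5]
print(real_to_scene_rotation(30.0))                # 30.0
```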
While the viewing zones shown in Figure 4 are rectangular (in two-dimensions), more generally the viewing zones may be any shape, e.g. cuboid, spherical, ovoid, etc.
In some embodiments, the scene is associated with a plurality of viewing zones 42, 45, where each viewing zone is associated with a different base representation. These different base representations may have different qualities, where the quality may be associated with a size of the viewing zone and/or a perspective of the viewing zone. Equally, the quality may be associated with a perceived importance of the viewing zone. This enables a balance to be struck between the processing power/bandwidth required to provide the scene and the immersivity of the experience, where certain perspectives of a scene may benefit more from a higher quality (e.g. a higher resolution or frame rate) than other perspectives. Furthermore, the quality of a scene element that may be generated for combination with the base representations may depend on the quality of that base representation, so that different viewing zones may enable the generation of different scene elements and/or of scene elements of different qualities.
Triggers
As described above, typically, the generation and/or the presentation of the scene element is dependent on the triggering of a trigger, where the trigger may be a part of the base representation and/or may be associated with the base representation (e.g. the trigger may be a part of the image data that forms the base representation).
Such a method of generating the scene element based on a trigger is shown in Figure 5.
In a first step 51, a computer device such as the display device 17 begins presentation of the scene based on the base representation of the scene. In a second step 52, the computer determines the triggering of a trigger. Typically, this trigger is a part of the base representation, which trigger defines a condition for the generation of a scene element. In a third step 53, a computer device such as the display device or the image generator 11 generates the scene element. Typically, the scene element is generated in real time and is combined with the base representation, with the combined representation then being presented to a user of the display device.
In some embodiments, the trigger comprises a contextual trigger, where the trigger is dependent on a context of a viewer of the base representation and/or of the display device 17 that is displaying the base representation. For example, the contextual trigger may be dependent on a location, a time, an environmental condition, a weather, a condition of a viewer of the base representation, a number of viewers that are presently viewing the base representation, etc. In some embodiments, the trigger comprises an activated trigger, where a viewer is able to interact with the base representation (e.g. via a user interface) to activate the trigger. For example, the viewer may be able to click on a specific section of the base representation in order to activate the trigger (and generate the scene element).
The triggering of the trigger may be determined by a sensor of the display device 17, for example a light sensor, a temperature sensor, an accelerometer, and/or a GPS sensor. Equally, the triggering of the trigger may be determined by a user interface of the display device (e.g. based on a user input). Equally, the triggering of the trigger may be determined by a processor. For example, the trigger may be associated with a frame of the base representation being displayed (e.g. the video being 20% complete) or based on a user looking at and/or interacting with a portion of the scene; the processor may determine that such an in-scene trigger condition has been triggered.
The scene element may be generated in dependence on the trigger and/or on a condition at a time when the trigger is triggered. For example, the scene element may be selected from among a database of possible scene elements based on the environment of the display device at the time of generating the scene element and/or based on an active user profile of the display device at the time of generating the scene element.
In some embodiments, the base representation is associated with a plurality of triggers, where the scene element is generated in dependence on which of these triggers is triggered. For example, the base representation may comprise a plurality of triggering areas, where a viewer looking at, or interacting with, any of these triggering areas triggers the generation of a scene element. The scene element that is generated may depend on the triggering area with which the viewer has interacted. For example, each triggering area may be associated with a different object, where the user is then able to generate a desired object (as the scene element) by interacting with the corresponding triggering area.
The scene element may be dependent on a condition at the time of the triggering of a trigger where, for example, the trigger may depend on a progress of the viewer through the scene and the scene element may then be generated in dependence on a sensor reading. For example, when a viewer is 20% of the way through the playback of a scene, the scene element may be generated (automatically) based on an environmental condition that is sensed at the time of triggering the trigger. The scene element may comprise a weather filter that is generated at a certain time in a scene (e.g. to depict rain, clouds, or sunshine within the scene) where this enables a scene to be modified based on a current condition of a user.
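An illustrative sketch of such a contextual trigger is given below, in which a weather-filter scene element is selected once playback passes a trigger point; the 20% threshold, filter names, and sensed-weather labels are assumptions of this sketch only:

```python
def select_weather_element(progress: float, sensed_weather: str):
    """At a fixed point in playback, choose a weather-filter scene element
    matching the condition sensed at the moment the trigger fires."""
    if progress < 0.2:                  # trigger point not yet reached
        return None
    filters = {"rain": "rain_overlay", "cloud": "darkening_filter", "clear": None}
    return filters.get(sensed_weather)

print(select_weather_element(0.25, "rain"))   # 'rain_overlay'
print(select_weather_element(0.10, "rain"))   # None: the trigger has not fired yet
```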
In some embodiments, the scene element (e.g. the quality of the scene element) is dependent on a capability of the display device 17, e.g. the hardware, software, and/or a condition of the display device. For example, the scene element may be generated in dependence on an available bandwidth at a communication interface of the display device. The capability of the display device may be determined prior to the displaying of the scene and/or at the time of triggering the trigger, where, for example, the available bandwidth of the communication interface may be determined when the trigger is triggered. This enables more powerful or capable devices to render scene elements with higher qualities and in this way enables balancing between accessibility and quality. Devices with low processing power are able to view the scene with a lower-quality scene element with comparatively high processing power devices being able to view the scene - e.g. with the same base representation - with a higher-quality scene element.
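By way of a non-limiting illustration, a quality tier for the scene element might be chosen from the capability measured when the trigger fires; the thresholds, tier names, and capability metrics below are assumptions of this sketch:

```python
def scene_element_quality(bandwidth_mbps: float, gpu_score: float) -> str:
    """Pick a scene-element quality tier from device capability measured at the
    time the trigger fires; less capable devices still see the same base representation."""
    if bandwidth_mbps >= 50 and gpu_score >= 0.8:
        return "high"     # e.g. full resolution and frame rate
    if bandwidth_mbps >= 10 and gpu_score >= 0.4:
        return "medium"   # e.g. reduced resolution
    return "low"          # e.g. reduced resolution and frame rate

print(scene_element_quality(bandwidth_mbps=80, gpu_score=0.9))  # 'high'
print(scene_element_quality(bandwidth_mbps=5, gpu_score=0.2))   # 'low'
```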
The base representation and/or the scene element may comprise layered image data, where this representation and element may be generated or presented in dependence on a capability of the display device. For example, the quality of each of the base representation and the scene element may be determined in dependence on this capability, where a base level of the representation and/or scene element may be combined with one or more enhancement layers depending on the capability of the device.
Multiple viewers
The disclosures described above may be applied to scenes that comprise multiple viewers. In particular, a plurality of users may each view a scene using different display devices. These display devices may each have different capabilities or hardware, for example a first user may view the scene through a virtual-reality headset and a second user may view the scene through a personal computer.
In such embodiments, the scene element generated for a first viewer may be associated with one or more other viewers. For example, the base representation may provide a high-quality meeting environment, where each of the viewers of the scene is a participant in the meeting. The scene element generated for the first viewer may then be an avatar of another participant in the meeting.
Where the scene is arranged to be viewed by multiple viewers, each of the viewers may be shown the same base representation or each of these viewers may be shown a different base representation. For example, where the meeting is a meeting to discuss a schematic, each of the viewers may be shown the same base representation so as to see this schematic from the same perspective. Equally, each of the viewers may be shown a different base representation so as to see this schematic from a different perspective.
Similarly, each viewer may be shown the same scene element, or each viewer may be shown a different scene element. The scene element may be generated where a first viewer triggers the generation of the scene element, and this scene element may then be shown to each of the users. So the method may comprise generating a plurality of base representations for a plurality of users (which base representations may each be the same base representation) and then generating a scene element for combination with each of the representations.
In a practical example, the base representation may provide an immersive background for a meeting, where a scene element can then be generated based on an input from a first viewer with each viewer thereafter being able to see the scene element. Where the scene element comprises an interactive element, the scene element may be arranged so that each viewer is able to interact with the scene element or so that only a subset of the viewers are able to interact with the scene element.
In another practical example, the base representation may be associated with an object to be discussed, where each user may view a meeting room in low quality (e.g. in real-time, where the avatars of other users are shown in the meeting room) and then may be able to step into the base representation to view the object before returning to the meeting room.
Alternatives and modifications
It will be understood that the present invention has been described above purely by way of example, and modifications of detail can be made within the scope of the invention.
For example, the base representation may be arranged to be transmitted to a plurality of users and/or a plurality of display devices 17. Each base representation may be arranged to be combinable with the same scene element(s). Equally, the base representation may be arranged to be combinable with different possible scene elements. This can enable the personalisation of the scene, where different scene elements are shown to different users.
The scene is typically arranged to be viewed using an extended reality (XR) technology, where the user may be presented with a representation of a real scene or a digital scene that contains one or more digital elements. The term extended reality (XR) covers each of virtual reality (VR), augmented reality (AR), and mixed reality (MR) and it will be appreciated that the disclosures herein are applicable to any of these technologies. It will be appreciated that the disclosures herein may be applied to numerous contexts. For example, the scene may comprise or be a part of a movie, a music video, a game, a shopping experience, a sports experience, etc. The scene, and either or both of the base representation and the scene element, may be computer generated. Equally, the scene, and either or both of the base representation and the scene element, may be captured using a sensor such as a camera.
In some embodiments, the base representation is pre-computed and the scene element comprises a live (or near-live) capture of an object. For example, the scene element may comprise a band or a sports team, where this element may be captured live and then combined with a pre-generated base representation, e.g. which forms a background for the live element. In some embodiments, the base representation and/or the scene element is generated using an artificial intelligence (AI) or machine learning (ML) model, where this may involve generating the AI or ML model on a powerful computer device before transmitting the model (e.g. the weights of a neural network) to a less powerful device. This can enable the generation of high quality base representations and scene elements on a device with limited processing power.
Reference numerals appearing in the claims are by way of illustration only and shall have no limiting effect on the scope of the claims.

Claims
1. A method of presenting a three-dimensional representation of a scene, the method comprising: presenting a base representation of the scene, the base representation having a first quality; generating a scene element, the scene element having a second quality; and combining the base representation with the scene element.
2. The method of any preceding claim, wherein the second quality is lower than the first quality.
3. The method of any preceding claim, wherein the first quality and the second quality are associated with a first resolution and a second resolution, preferably wherein the scene element is upsampled prior to the combining of the scene element with the base representation.
4. The method of any preceding claim, wherein the first quality and the second quality are associated with a first frame rate and a second frame rate, preferably wherein the scene element is subjected to a motion interpolation process prior to the combining of the scene element with the base representation.
5. The method of any preceding claim, wherein the first quality and the second quality are associated with a first colour range and/or a second colour range.
6. The method of any preceding claim, wherein the base representation is generated using a ray-tracing process.
7. The method of any preceding claim, wherein the scene element is generated using a rasterization process, preferably a real-time or near real-time rasterization process.
8. The method of any preceding claim, wherein the base representation is associated with a viewing zone, wherein the viewing zone comprises a subset of the scene.
9. The method of claim 8, wherein the viewing zone enables a user to move through a subset of the scene; and wherein a viewer is able to move within the viewing zone while viewing the base representation, preferably wherein the viewer is able to move within the viewing zone with six degrees of freedom (6DoF).
10. The method of claim 8 or 9, wherein the viewing zone has a volume of less than 50% of the volume of the scene, less than 20% of the volume of the scene, and/or less than 10% of the volume of the scene.
11. The method of any of claims 8 to 10, wherein the viewing zone has, or is associated with, a volume, preferably a real-world volume, of less than five cubic metres (5 m3), less than one cubic metre (1 m3), less than one-tenth of a cubic metre (0.1 m3) and/or less than one-hundredth of a cubic metre (0.01 m3).
12. The method of any of claims 8 to 11, wherein the viewing zone has a volume of at least 1% of the volume of the scene, at least 5% of the volume of the scene, and/or at least 10% of the volume of the scene.
13. The method of any of claims 8 to 12, wherein the viewing zone has, or is associated with, a volume, preferably a real-world volume, of at least one-thousandth of a cubic metre (0.001 m3); at least one-hundredth of a cubic metre (0.01 m3); and/or at least one cubic metre (1 m3).
14. The method of any of claims 8 to 13, wherein the scene is associated with a plurality of viewing zones, wherein each zone is associated with a corresponding base representation and/or wherein each viewing zone provides a different set of viewpoints or perspectives for viewing the scene.
15. The method of claim 14, wherein the viewing zones are associated with one or more of: different shapes or sizes of viewing zone; respective base representations of different quality; different scene elements; different sets of available scene elements; different viewers of the scene; different qualities of scene elements.
16. The method of any of claims 8 to 15, wherein the scene comprises an obscured portion, the obscured portion not being visible from the viewing zone and the obscured portion not being rendered within the base representation.
17. The method of any of claims 8 to 16, wherein the viewing zone is arranged to resist and/or prevent movement out of the viewing zone.
18. The method of claim 17, wherein the base representation is arranged so that feedback is provided to a viewer as that viewer moves towards a boundary of the viewing zone.
19. The method of claim 18, wherein the base representation is arranged so that playback of the scene is altered as the viewer moves towards a boundary of the viewing zone.
20. The method of claim 19, wherein the base representation is arranged to show a pass-through view of the actual surroundings of the viewer when the viewer goes outside the limits of the viewing zone.
21. The method of any of claims 8 to 17, wherein the viewing zone is arranged to enable movement out of the viewing zone, preferably to enable movement out of the viewing zone in dependence on a user input, preferably wherein movement out of the viewing zone causes one or more of: pausing of playback of the scene and/or the scene element; presentation of an options menu associated with the scene; and display of one or more available viewing zones.
22. The method of claim 21, wherein movement out of the viewing zone causes presentation of the scene in an altered quality, preferably wherein the altered quality is lower than the first quality, more preferably wherein the altered quality is the second quality.
23. The method of claim 21 or 22, wherein movement out of the viewing zone reduces a freedom of movement through the scene, preferably wherein outside of the viewing zone the viewer is able to move through the scene in less than 6DoF, no more than 3DoF, and/or less than 3DoF.
24. The method of any of claims 21 to 23, wherein movement out of the viewing zone causes one or more of: pausing of playback of the scene; presentation of an options menu associated with the scene; and display of one or more available viewing zones.
25. The method of any of claims 21 to 24, wherein movement out of the viewing zone causes presentation of the scene in an altered quality, preferably wherein: the altered quality is lower than the first quality; and/or the altered quality is associated with a two-dimensional representation of the scene.
26. The method of any of claims 21 to 25, comprising generating (e.g. rendering and/or displaying) the scene element in dependence on a viewer moving towards a boundary of the viewing zone.
27. The method of any of claims 21 to 26, comprising generating (e.g. rendering and/or displaying) the scene element in dependence on a viewer moving into and/or out of the viewing zone.
28. The method of any of claims 8 to 27, wherein the scene is associated with a plurality of viewing zones, wherein each zone is associated with a corresponding base representation.
29. The method of claim 28, wherein the viewing zones are associated with: differently sized viewing zones; base representations of different qualities; and different scene elements; different sets of available scene elements; different viewers of the scene; different qualities of scene elements.
30. The method of any of claims 8 to 29, wherein the method comprises generating (e.g. presenting) the scene element so that the scene element is visible from a plurality of the viewing zones.
31. The method of any preceding claim, wherein the method is carried out at a first device and wherein generating the base representation comprises generating the base representation based on a transmission received from a second device, preferably wherein the transmission comprises the base representation in an encoded format, more preferably a layered format and/or a low-complexity enhancement video codec (LCEVC) format.
32. The method of claim 31 , wherein the base representation is streamed at the first device based on a transmission streamed from the second device.
33. The method of claim 31 or 32, wherein generating the scene element comprises generating the scene element based on a transmission received from a further device.
34. The method of claim 33 when dependent on claim 32, wherein the second device and the further device are different devices.
35. The method of any preceding claim, wherein the scene element comprises a real-time element.
36. The method of claim 35, wherein: the scene element is generated during the presentation of the base representation; and/or a rate of generation of the scene element is equal to or greater than a rate of playback of the scene.
37. The method of any preceding claim, wherein: the base representation comprises a pre-generated representation; and/or the method is carried out by a first device and the base representation is downloaded to the first device prior to the presenting of the base representation.
38. The method of any preceding claim, wherein generating the base representation comprises processing an initial version of the base representation so as to generate the base representation of the scene based on a perspective of a viewer of the scene.
39. The method of any preceding claim, wherein the scene element is generated in dependence on one or more of: a context of a viewer of the scene, preferably a context of the viewer at the time of generation of the scene element; a feature of an environment of the viewer; an object and/or a person in the environment of the viewer; a user profile of the viewer; a communication from a further device; a feature of the base representation; an origin of the base representation and/or an original generator of the base representation; a current viewpoint and/or perspective of the viewer; a current viewing zone of the user; and an input of the viewer.
40. The method of any preceding claim, wherein the scene element is selected based on a capability of a device displaying the scene, preferably based on a processing power and/or bandwidth of the device.
41. The method of any preceding claim, wherein the scene element is selected from a database of available scene elements.
42. The method of any preceding claim, wherein the scene element comprises a real object that is captured with a camera.
43. The method of any preceding claim, wherein the scene element comprises one or more of: an object; a real object; a virtual object; a filter; an overlay; a weather effect; a personalisation; an avatar; an animated object and/or an animation; and an interactive element.
44. The method of any preceding claim, wherein the combining of the scene element with the base representation is dependent on a feature of the base representation, preferably wherein the feature defines a location onto which the scene element may be imposed and/or a time at which the scene element may be combined with the base representation.
45. The method of any preceding claim, comprising detecting an interaction between a viewer of the scene and the scene element, preferably comprising modifying the scene element and/or generating a further scene element in dependence on the interaction.
46. The method of any preceding claim, wherein the scene and/or the base representation and/or the scene element is associated with a trigger, wherein the generation of the scene element is dependent on the triggering of the trigger.
47. The method of claim 46, wherein the triggering of the trigger is associated with one or more of: a context and/or a change in context of a viewer of the scene; a current viewpoint and/or perspective of the viewer; a playback progress and/or a frame of the scene being presented; an input of the viewer, preferably an input associated with a location in the scene; and an input from a third party that is not viewing the scene, preferably wherein the scene is being viewed on a first device and the triggering of the trigger is associated with a transmission being received at the first device from a further device.
48. The method of claim 46 or 47, wherein the trigger is a part of the base representation.
49. The method of claim 46 or 47, wherein the trigger is received separately to the base representation, preferably wherein the base representation is received from a first device and the trigger is received from a second device, more preferably wherein the scene element is also received from the second device.
50. The method of any preceding claim, wherein the base representation comprises image data, preferably encoded image data, more preferably wherein: the base representation is encoded based on a low-complexity enhancement video codec (LCEVC) process; and/or the base representation comprises layered image data so that the base representation can be generated in different levels of quality.
51. The method of any preceding claim, wherein the scene comprises one or more of: an extended reality (XR) scene; a virtual reality (VR) scene; an augmented reality (AR) scene; a mixed reality (MR) scene; and a part of a movie, a music video, a game, a shopping experience or a sports experience.
52. The method of any preceding claim, comprising storing the three-dimensional representation and/or outputting the three-dimensional representation, preferably comprising outputting the three-dimensional representation to a further computer device.
53. The method of any preceding claim, comprising generating an image and/or a video based on the three-dimensional representation.
54. A method of generating a base representation of a three-dimensional scene, the method comprising: generating the base representation of a scene, the base representation having a first quality; and inserting a trigger into the base representation, the trigger being associated with the display and/or generation of a scene element, the scene element having a second quality, and the scene element being arranged to be combined with the base representation.
55. A method of generating a three-dimensional representation of a scene, the method comprising: identifying a base representation of a scene, the base representation having a first quality; and determining a scene element for combining with the base representation, the scene element having a second quality; preferably, comprising associating a trigger with the scene element, wherein the scene element is arranged to be combined with the base representation based on the triggering of the trigger.
56. A method of generating a representation of a three-dimensional scene, the method comprising: generating (e.g. rendering and/or presenting) a base representation of the scene, wherein the base representation is associated with a viewing zone; wherein the viewing zone comprises a subset of the scene and/or wherein the viewing zone enables a user to move through a subset of the scene; wherein a viewer is able to move within the viewing zone while viewing the base representation; and wherein the viewing zone is arranged to resist and/or prevent movement out of the viewing zone; preferably wherein the base representation is arranged so that feedback is provided to a viewer as that viewer moves towards a boundary of the viewing zone and/or wherein the base representation is arranged so that playback of the scene is altered (e.g. slowed or blurred) as the viewer moves towards a boundary of the viewing zone.
57. A method of presenting a representation of a three-dimensional scene, the method comprising: receiving a base representation of a scene, the base representation having a first quality; receiving and/or generating a scene element, the scene element having a second quality; and combining the base representation with the scene element.
58. A system for carrying out the method of any preceding claim, the system comprising one or more of: a processor; a communication interface; and a display.
59. A computer program product comprising software code that, when executed on a computer device, causes the computer device to perform the method of any of claims 1 to 57.
60. A machine-readable storage medium that includes instructions that, when executed by one or more processors of a machine, cause the machine to perform the method of any of claims 1 to 57.
61. An apparatus for presenting a three-dimensional representation of a scene, the apparatus comprising: means for (e.g. a processor for) presenting a base representation of the scene, the base representation having a first quality; means for (e.g. a processor for) generating a scene element, the scene element having a second quality; and means for (e.g. a processor for) combining the base representation with the scene element.
62. An apparatus for generating a base representation of a three-dimensional scene, the apparatus comprising: means for (e.g. a processor for) generating the base representation of a scene, the base representation having a first quality; and means for (e.g. a processor for) inserting a trigger into the base representation, the trigger being associated with the display and/or generation of a scene element, the scene element having a second quality, and the scene element being arranged to be combined with the base representation.
63. An apparatus for generating a three-dimensional representation of a scene, the apparatus comprising: means for (e.g. a processor for) identifying a base representation of a scene, the base representation having a first quality; and means for (e.g. a processor for) determining a scene element for combining with the base representation, the scene element having a second quality; preferably, comprising means for (e.g. a processor for) associating a trigger with the scene element, wherein the scene element is arranged to be combined with the base representation based on the triggering of the trigger.
64. An apparatus for generating a representation of a three-dimensional scene, the apparatus comprising: means for (e.g. a processor for) generating (e.g. rendering and/or presenting) a base representation of the scene, wherein the base representation is associated with a viewing zone; wherein the viewing zone comprises a subset of the scene and/or wherein the viewing zone enables a user to move through a subset of the scene; wherein a viewer is able to move within the viewing zone while viewing the base representation; and wherein the viewing zone is arranged to resist and/or prevent movement out of the viewing zone; preferably wherein the base representation is arranged so that feedback is provided to a viewer as that viewer moves towards a boundary of the viewing zone and/or wherein the base representation is arranged so that playback of the scene is altered (e.g. slowed or blurred) as the viewer moves towards a boundary of the viewing zone.
65. An apparatus for presenting a representation of a three-dimensional scene, the apparatus comprising: means for (e.g. a processor for) receiving a base representation of a scene, the base representation having a first quality; means for (e.g. a processor for) receiving and/or generating a scene element, the scene element having a second quality; and means for (e.g. a processor for) combining the base representation with the scene element.
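By way of illustration only, and not as part of the claims or the described embodiments, the following sketches outline how some of the claimed operations might be realised in software. This first sketch shows one possible way of combining a base representation having a first (lower) quality with a scene element having a second (higher) quality, in the spirit of claims 1, 50 and 57. All function names, array shapes and parameters are illustrative assumptions; the nearest-neighbour upscale merely stands in for a proper layered reconstruction of the base representation.

```python
"""Minimal sketch (not the claimed implementation): composite a lower-quality
base representation with a higher-quality scene element. Function names,
shapes and the nearest-neighbour upscale are illustrative assumptions."""

import numpy as np


def upscale_nearest(frame: np.ndarray, factor: int) -> np.ndarray:
    """Upscale an (H, W, C) base frame by an integer factor; a stand-in for a
    proper layered reconstruction of the base representation."""
    return np.repeat(np.repeat(frame, factor, axis=0), factor, axis=1)


def composite(base: np.ndarray, element_rgb: np.ndarray, element_alpha: np.ndarray,
              top_left: tuple[int, int]) -> np.ndarray:
    """Alpha-composite a scene element (second, higher quality) onto the
    upscaled base representation (first quality)."""
    out = base.astype(np.float32)
    y, x = top_left
    h, w, _ = element_rgb.shape
    region = out[y:y + h, x:x + w]
    alpha = element_alpha[..., None].astype(np.float32)
    out[y:y + h, x:x + w] = alpha * element_rgb + (1.0 - alpha) * region
    return out.astype(np.uint8)


# Example: a 480x270 base frame upscaled 4x, with a 128x128 element overlaid at (100, 200).
base_frame = np.zeros((270, 480, 3), dtype=np.uint8)
element = np.full((128, 128, 3), 255, dtype=np.uint8)
opacity = np.ones((128, 128), dtype=np.float32)
frame = composite(upscale_nearest(base_frame, 4), element, opacity, top_left=(100, 200))
```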
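A further minimal sketch, assuming a generic layered bitstream rather than the LCEVC specification itself: choosing how many enhancement layers to apply on top of the base layer in dependence on device capability and the viewer's current viewing zone, loosely corresponding to claims 28, 29, 31, 40 and 50. The thresholds and the zone_priority parameter are hypothetical.

```python
"""Minimal sketch, assuming a generic layered bitstream (not the LCEVC
specification): pick how many enhancement layers to apply on top of the base
layer, given device capability and the viewer's current viewing zone."""


def choose_enhancement_layers(available_layers: int,
                              bandwidth_mbps: float,
                              zone_priority: int) -> int:
    """Return the number of enhancement layers to decode above the base layer.

    zone_priority: higher values stand for viewing zones associated with a
    higher-quality base representation; 0 means base layer only.
    """
    # Illustrative rule of thumb: one extra layer per ~10 Mbps of headroom,
    # never more than the zone allows or the stream provides.
    layers_by_bandwidth = int(bandwidth_mbps // 10)
    return max(0, min(available_layers, layers_by_bandwidth, zone_priority))


# Example: a stream with 3 enhancement layers, 25 Mbps available, high-priority zone.
print(choose_enhancement_layers(available_layers=3, bandwidth_mbps=25.0, zone_priority=2))  # -> 2
```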
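A minimal sketch, assuming simple frame-index and context-flag conditions, of evaluating triggers so that scene elements are generated when a trigger fires, in the spirit of claims 46 to 49. The Trigger and ScenePlayback types and their fields are illustrative assumptions, not the claimed data structures; a trigger may arrive embedded in the base representation or be received separately, for example from a second device.

```python
"""Minimal sketch, assuming frame-index and context-flag conditions only:
evaluate triggers so that scene elements are generated when a trigger fires.
The Trigger/ScenePlayback types and their fields are illustrative assumptions."""

from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Trigger:
    # A trigger may arrive embedded in the base representation or be received
    # separately, e.g. from a second device.
    element_id: str
    at_frame: Optional[int] = None                       # playback-progress condition
    condition: Optional[Callable[[dict], bool]] = None   # viewer-context/input condition


@dataclass
class ScenePlayback:
    triggers: list[Trigger] = field(default_factory=list)
    fired: set[str] = field(default_factory=set)

    def step(self, frame_index: int, viewer_context: dict) -> list[str]:
        """Return the ids of scene elements whose triggers fire on this frame."""
        to_generate = []
        for trigger in self.triggers:
            if trigger.element_id in self.fired:
                continue  # each trigger fires at most once in this sketch
            frame_ok = trigger.at_frame is None or frame_index >= trigger.at_frame
            context_ok = trigger.condition is None or trigger.condition(viewer_context)
            if frame_ok and context_ok:
                self.fired.add(trigger.element_id)
                to_generate.append(trigger.element_id)
        return to_generate


# Example: a weather effect triggered at frame 240, and an avatar triggered by a
# flag set when a transmission is received from another device.
playback = ScenePlayback(triggers=[
    Trigger("rain_overlay", at_frame=240),
    Trigger("friend_avatar", condition=lambda ctx: ctx.get("remote_message", False)),
])
elements = playback.step(frame_index=240, viewer_context={"remote_message": True})
```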
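Finally, a minimal sketch, assuming a spherical viewing zone, of resisting movement out of the zone and altering playback (slowing and blurring) as the viewer nears its boundary, as described in claims 21 to 27, 56 and 64. The geometry, the feedback band width and the 0.2x minimum playback rate are arbitrary illustrative choices.

```python
"""Minimal sketch, assuming a spherical viewing zone: clamp the viewer inside
the zone and derive playback-rate and blur feedback as the boundary is
approached. The names, feedback band and 0.2x floor are assumptions."""

from dataclasses import dataclass
import math


@dataclass
class ViewingZone:
    centre: tuple[float, float, float]
    radius: float  # the zone is modelled as a sphere purely for simplicity


def distance_to_boundary(zone: ViewingZone, position: tuple[float, float, float]) -> float:
    """Signed distance from the viewer to the zone boundary (positive = inside)."""
    return zone.radius - math.dist(zone.centre, position)


def boundary_feedback(zone: ViewingZone, position: tuple[float, float, float],
                      feedback_band: float = 0.5):
    """Return (playback_rate, blur_strength, clamped_position): playback slows
    and blurs as the viewer nears the boundary, and positions outside the zone
    are clamped back onto it (resisting/preventing movement out of the zone)."""
    margin = distance_to_boundary(zone, position)
    if margin <= 0.0:
        cx, cy, cz = zone.centre
        px, py, pz = position
        d = math.dist(zone.centre, position) or 1e-9
        scale = zone.radius / d
        position = (cx + (px - cx) * scale, cy + (py - cy) * scale, cz + (pz - cz) * scale)
        margin = 0.0
    closeness = 1.0 - min(margin, feedback_band) / feedback_band  # 0 well inside, 1 at the edge
    playback_rate = 1.0 - 0.8 * closeness  # slow from 1.0x down to 0.2x
    blur_strength = closeness              # 0 = sharp, 1 = fully blurred
    return playback_rate, blur_strength, position


# Example: a viewer 10 cm from the edge of a 1.5 m zone gets slowed, blurred playback.
zone = ViewingZone(centre=(0.0, 0.0, 0.0), radius=1.5)
rate, blur, pos = boundary_feedback(zone, position=(1.4, 0.0, 0.0))
```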
PCT/GB2025/050280 (published as WO2025172714A1, en), "Generating a representation of a scene"; priority date: 2024-02-16; filing date: 2025-02-14; status: pending.

Applications Claiming Priority (2)

Application Number: GB2402263.4A (published as GB2637357A, en); Priority Date: 2024-02-16; Filing Date: 2024-02-16; Title: Generating a representation of a scene
Application Number: GB2402263.4; Priority Date: 2024-02-16

Publications (1)

Publication Number: WO2025172714A1 (en); Publication Date: 2025-08-21

Family

ID=94820833

Family Applications (1)

Application Number: PCT/GB2025/050280 (WO2025172714A1, en); Title: Generating a representation of a scene; Status: pending; Priority Date: 2024-02-16; Filing Date: 2025-02-14

Country Status (2)

GB: GB2637357A (en)
WO: WO2025172714A1 (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016061640A1 (en) 2014-10-22 2016-04-28 Parallaxter Method for collecting image data for producing immersive video and method for viewing a space on the basis of the image data
US20160148420A1 (en) * 2014-11-20 2016-05-26 Samsung Electronics Co., Ltd. Image processing apparatus and method
WO2018046940A1 (en) 2016-09-08 2018-03-15 V-Nova Ltd Video compression using differences between a higher and a lower layer
US20180276824A1 (en) * 2017-03-27 2018-09-27 Microsoft Technology Licensing, Llc Selective application of reprojection processing on layer sub-regions for optimizing late stage reprojection power
WO2019111010A1 (en) 2017-12-06 2019-06-13 V-Nova International Ltd Methods and apparatuses for encoding and decoding a bytestream
WO2020188273A1 (en) 2019-03-20 2020-09-24 V-Nova International Limited Low complexity enhancement video coding
WO2024003577A1 (en) 2022-07-01 2024-01-04 V-Nova International Ltd Applications of layered encoding in split computing

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022242855A1 (en) * 2021-05-19 2022-11-24 Telefonaktiebolaget Lm Ericsson (Publ) Extended reality rendering device prioritizing which avatar and/or virtual object to render responsive to rendering priority preferences
EP4483568A1 (en) * 2022-02-23 2025-01-01 Qualcomm Incorporated Foveated sensing

Also Published As

GB2637357A (en), published 2025-07-23

Legal Events

121 EP: The EPO has been informed by WIPO that EP was designated in this application. Ref document number: 25708485; Country of ref document: EP; Kind code of ref document: A1.