
WO2025215364A1 - Processing a three-dimensional representation of a scene - Google Patents

Processing a three-dimensional representation of a scene

Info

Publication number
WO2025215364A1
Authority
WO
WIPO (PCT)
Prior art keywords
points
point
container
scene
viewing zone
Prior art date
Legal status
Pending
Application number
PCT/GB2025/050760
Other languages
French (fr)
Inventor
Tristan SALOME
Cyril CLAVAUD
Gael HONOREZ
Jeroen DE CONNINCK
Current Assignee
V Nova International Ltd
Original Assignee
V Nova International Ltd
Priority date
Filing date
Publication date
Application filed by V Nova International Ltd filed Critical V Nova International Ltd
Publication of WO2025215364A1

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding
    • G06T9/001Model-based coding, e.g. wire frame
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00Manipulating 3D models or images for computer graphics
    • G06T19/20Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/06Topological mapping of higher dimensional structures onto lower dimensional surfaces

Definitions

  • the present disclosure relates to methods, systems, and apparatuses for processing a three-dimensional representation of a scene and/or for determining a point of a three-dimensional representation of a scene.
  • Three-dimensional representations of environments are used in many contexts, including for the generation of virtual reality videos, in which depth information for a plurality of points of the representation is used to generate different images for a left eye and a right eye of a user.
  • substantial processing power is required to determine such a three-dimensional representation, and the file size of files associated with these representations is typically large so that substantial amounts of storage are needed to keep the files and substantial amounts of bandwidth are required to transfer the files.
  • a method of processing a three-dimensional representation of a scene comprising: identifying a plurality of points of a three-dimensional representation; determining a coordinate system for the representation; determining a plurality of containers associated with the coordinate system, wherein each container covers a volume of the three-dimensional representation, and wherein a depth of each container is dependent on a distance of that container from a centre of the coordinate system; and allocating each of the plurality of points to one of the containers.
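The following Python sketch illustrates one way the allocation step above could be realised, assuming spherical containers indexed by a radial shell and an angular section around the centre of the coordinate system; the function and parameter names (allocate_points, num_shells, max_distance, and so on) are illustrative assumptions rather than terms used in the disclosure.

```python
import math
from collections import defaultdict

def shell_index(distance, num_shells=64, max_distance=1000.0):
    """Map a radial distance to a shell index using an arctan curve,
    so that shells near the centre are thin and distant shells are deep."""
    t = math.atan(distance / (max_distance / 10.0)) / (math.pi / 2)  # roughly 0..1
    return min(int(t * num_shells), num_shells - 1)

def angular_index(x, y, z, azimuth_bins=360, elevation_bins=180):
    """Quantise the direction of a point into an angular section."""
    azimuth = math.atan2(y, x)                       # -pi..pi
    elevation = math.atan2(z, math.hypot(x, y))      # -pi/2..pi/2
    a = int((azimuth + math.pi) / (2 * math.pi) * azimuth_bins) % azimuth_bins
    e = min(int((elevation + math.pi / 2) / math.pi * elevation_bins), elevation_bins - 1)
    return a, e

def allocate_points(points, centre=(0.0, 0.0, 0.0)):
    """Allocate each (x, y, z, attributes) point to a container keyed by
    (shell, azimuth_bin, elevation_bin) relative to the coordinate-system centre."""
    containers = defaultdict(list)
    cx, cy, cz = centre
    for (x, y, z, attrs) in points:
        dx, dy, dz = x - cx, y - cy, z - cz
        distance = math.sqrt(dx * dx + dy * dy + dz * dz)
        key = (shell_index(distance),) + angular_index(dx, dy, dz)
        containers[key].append((x, y, z, attrs))
    return containers

if __name__ == "__main__":
    pts = [(1.0, 2.0, 0.5, {"colour": (255, 0, 0)}),
           (400.0, -30.0, 12.0, {"colour": (0, 255, 0)})]
    for key, members in allocate_points(pts).items():
        print(key, len(members))
```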
  • the method comprises determining the coordinate system based on a viewing zone, preferably wherein the coordinate system is centred at a centre of the viewing zone.
  • the method comprises determining the coordinate system based on a capture device used to capture one or more of the plurality of points, preferably wherein the coordinate system is centred on the capture device.
  • the method comprises processing one or more of the points in dependence on a container that contains said points.
  • the method comprises: identifying a first container; identifying one or more points in the first container; and processing the points in the first container.
  • the method comprises: identifying a second container; identifying one or more points in the second container; and processing the points in the second container separately to the points in the first container.
  • processing the points separately comprises: processing the points at different times; and/or processing the points using different computer devices.
  • processing the points comprises one or more of: generating a new point based on one or more points; modifying a location and/or value of the points; removing and/or filtering out one or more of the points; and assigning a new parameter to a point.
  • the new parameter indicates a container comprising the point.
  • the method comprises processing the points in dependence on a threshold, preferably a threshold associated with the attributes of the points.
  • the threshold depends on the container.
  • the method comprises: identifying a first plurality of points in a first container of the plurality of containers; and processing the first plurality of points so as to generate a new point in the first container.
  • the method comprises identifying a plurality of points in a first container; and combining the points to generate the new point.
  • combining the points comprises one or more of: combining an attribute of each point; combining a transparency of each point; and combining a distance of each point.
  • the method comprises generating the new points based on one or more of: a minimum attribute value of the first plurality of points; a maximum attribute value of the first plurality of points; an average value of the first plurality of points; and a variance of the attribute values of the first plurality of points.
  • the method comprises determining the coordinate system based on a capture device associated with a viewing zone.
  • the capture device is located at the centre of the viewing zone.
  • the depth of the containers increases as the distance of the containers from the centre of the coordinate system increases.
  • the depth of each container is determined based on an arctan curve.
  • the depth of the containers increases linearly as the distance of the containers from the centre of the coordinate system increases.
  • the depth of the containers is determined based on an arctan curve.
  • each container is associated with each of: an inner axial boundary; an outer axial boundary; a first radial boundary; and a second radial boundary.
  • the inner axial boundary and the outer axial boundary are determined based on a quantisation curve; preferably, the quantisation curve is based on an arctan curve.
  • the locations of the first radial boundary and the second radial boundary are dependent on an angular resolution of the scene.
  • each container is associated with one or more angular sections.
  • each container is associated with one or more quantisation levels.
  • each container is associated with the same number of quantisation levels.
  • the quantisation levels are determined based on a curve, preferably an arctan curve.
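As a sketch of how an arctan quantisation curve could set the inner and outer axial boundaries of each container (so that container depth grows with distance from the centre), consider the following; the number of levels and the scale parameter are illustrative assumptions.

```python
import math

def arctan_quantise(distance, num_levels, scale):
    """Map a distance in [0, inf) to a quantisation level in [0, num_levels),
    with fine levels near the centre and coarse levels far away."""
    t = math.atan(distance / scale) / (math.pi / 2)      # normalised 0..1
    return min(int(t * num_levels), num_levels - 1)

def arctan_dequantise(level, num_levels, scale):
    """Inverse mapping: the inner boundary (distance) of a quantisation level."""
    t = level / num_levels
    return scale * math.tan(t * (math.pi / 2))

def container_boundaries(level, num_levels, scale):
    """Inner and outer axial boundaries of the container holding `level`."""
    inner = arctan_dequantise(level, num_levels, scale)
    outer = arctan_dequantise(level + 1, num_levels, scale)
    return inner, outer

if __name__ == "__main__":
    # Container depth (outer - inner) grows with distance from the centre.
    for level in (0, 8, 32, 60):
        inner, outer = container_boundaries(level, num_levels=64, scale=10.0)
        print(f"level {level:2d}: depth = {outer - inner:.3f}")
```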
  • the method comprises combining the plurality of points in dependence on a location of each point.
  • the method comprises combining the plurality of points in dependence on each point being within the first container.
  • the method comprises combining the plurality of points in dependence on each point being associated with the same quantisation level within the container.
  • the method comprises combining the plurality of points in dependence on each point being within the same angular section within the container.
  • combining the points comprises taking one or more of: a minimum, an average, a weighted average, and a maximum of the points.
  • the method comprises combining points associated with a plurality of quantisation levels of the container.
  • the method comprises processing (e.g. combining) the points based on a threshold value.
  • the method comprises processing each point (and/or each set of points) only if a parameter value associated with that point exceeds the threshold value.
  • the parameter value is determined based on one or more of: an average parameter of a set of points; a maximum parameter of the set of points; a minimum parameter of the set of points; and a variance of the parameters of the set of points.
  • the parameter relates to the attributes of the points and/or the locations of the points.
  • the threshold value is associated with one or more of: a similarity of the points; a complexity of the points; the attribute values of the points; the container that contains the points; and the locations of the points.
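One plausible reading of the threshold-based processing above is a similarity gate: a set of points is only combined when an aggregate of a per-point parameter (average, maximum, minimum, or variance) indicates the points are similar enough, with the threshold possibly varying per container. The sketch below assumes a variance-based gate and an illustrative 'luma' parameter; the disclosure also contemplates gates in which the parameter must exceed the threshold.

```python
from statistics import mean, pvariance

def aggregate_parameter(values, mode="variance"):
    """Aggregate a per-point parameter (e.g. a colour channel) over a set of points."""
    if mode == "average":
        return mean(values)
    if mode == "maximum":
        return max(values)
    if mode == "minimum":
        return min(values)
    if mode == "variance":
        return pvariance(values)
    raise ValueError(f"unknown mode: {mode}")

def should_combine(points, threshold, mode="variance", key=lambda p: p["luma"]):
    """Combine a set of points only if their aggregated parameter stays below
    the threshold, i.e. the points are sufficiently similar."""
    return aggregate_parameter([key(p) for p in points], mode) < threshold

if __name__ == "__main__":
    cluster = [{"luma": 120}, {"luma": 122}, {"luma": 119}]
    # A per-container threshold could be looser for distant containers.
    print(should_combine(cluster, threshold=25.0))   # True: the points are similar
```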
  • the method comprises combining the points based on a complexity value associated with the points and/or the container.
  • a number of quantisation levels for which points are combined is dependent on the complexity value.
  • the complexity value is dependent on one or more of: a user input; an artificial intelligence algorithm; a machine learning model; and a distribution of points and/or attributes of points within a region.
  • the method comprises determining a threshold complexity for combining a plurality of points.
  • the threshold complexity is dependent on one or more of: a container that contains the points; and a distance of the points from a viewing zone and/or from the centre of the coordinate system.
  • the method comprises combining the points based on attribute values associated with the points.
  • the method comprises combining the points based on a separation of the points.
  • the combined points are associated with a plurality of different capture devices.
  • the method comprises associating the new point with a new capture device.
  • the method comprises associating the new point with a new capture device at the centre of the coordinate system.
  • the method comprises determining and storing a distance of the new point from the new capture device.
  • the three-dimensional representation is associated with a viewing zone, the viewing zone comprising a subset of the scene and/or the viewing zone enabling a user to move through a subset of the scene.
  • the user is able to move within the viewing zone with six degrees of freedom (6DoF).
  • the viewing zone has a volume of less than 50% of the volume of the scene, less than 20% of the volume of the scene, and/or less than 10% of the volume of the scene.
  • the viewing zone has, or is associated with, a volume, preferably a real-world volume, of less than five cubic metres (5 m³), less than one cubic metre (1 m³), less than one-tenth of a cubic metre (0.1 m³) and/or less than one-hundredth of a cubic metre (0.01 m³).
  • the three-dimensional representation comprises a point cloud.
  • the method comprises storing the three-dimensional representation and/or outputting the three-dimensional representation.
  • the method comprises outputting the three-dimensional representation to a further computer device.
  • the method comprises generating an image and/or a video based on the three-dimensional representation.
  • the method comprises forming one or more two-dimensional representations of the scene based on the three-dimensional representation.
  • the method comprises forming a two-dimensional representation for each eye of a viewer.
  • the point is associated with one or more of: a location; an attribute; a transparency; a colour; and a size.
  • the point is associated with an attribute for a right eye and an attribute for a left eye.
  • the scene comprises one or more of: an extended reality (XR) scene; a virtual reality (VR) scene; an augmented reality (AR) scene; and a mixed reality (MR) scene.
  • XR extended reality
  • VR virtual reality
  • AR augmented reality
  • MR mixed reality
  • a system for carrying out the aforesaid method comprising one or more of: a processor; a communication interface; and a display.
  • an apparatus for processing a three-dimensional representation of a scene comprising: means for (e.g. a processor for) identifying a plurality of points of a three-dimensional representation; means for (e.g. a processor for) determining a coordinate system for the representation; means for (e.g. a processor for) determining a plurality of containers associated with the coordinate system, wherein each container covers a volume of the three-dimensional representation, and wherein a depth of each container is dependent on a distance of that container from a centre of the coordinate system; and means for (e.g. a processor for) allocating each of the plurality of points to one of the containers.
  • Any apparatus feature as described herein may also be provided as a method feature, and vice versa.
  • means plus function features may be expressed alternatively in terms of their corresponding structure, such as a suitably programmed processor and associated memory.
  • the disclosure also provides a computer program and a computer program product comprising software code adapted, when executed on a data processing apparatus, to perform any of the methods described herein, including any or all of their component steps.
  • the disclosure also provides a computer program and a computer program product comprising software code which, when executed on a data processing apparatus, comprises any of the apparatus features described herein.
  • the disclosure also provides a computer program and a computer program product having an operating system which supports a computer program for carrying out any of the methods described herein and/or for embodying any of the apparatus features described herein.
  • the disclosure also provides a computer readable medium having stored thereon the computer program as aforesaid.
  • the disclosure also provides a signal carrying the computer program as aforesaid, and a method of transmitting such a signal.
  • Figure 1 shows a system for generating a sequence of images.
  • Figure 2 shows a computer device on which components of the system of Figure 1 may be implemented.
  • Figure 3 shows a method of determining a three-dimensional representation of a scene.
  • Figures 4a and 4b show a method of determining a point based on a plurality of sub-points.
  • Figure 5 shows a scene comprising a viewing zone.
  • Figures 6a and 6b show arrangements of capture devices for determining points of the three-dimensional representation.
  • Figure 7 shows a point that can be captured by a plurality of capture devices.
  • Figures 8a and 8b show grids formed by the different capture devices.
  • Figure 9 shows a method of determining whether to combine a plurality of sub-points.
  • Figure 10 shows a coordinate system that can be used to analyse the points.
  • Figures 11a to 11d show containers of the coordinate system of Figure 10.
  • Figure 12 shows a method of combining a plurality of points of the three-dimensional representation.
  • Referring to Figure 1, there is shown a system for generating a sequence of images.
  • This system can be used to generate, and then display, a representation of an environment, which may comprise a VR environment (or an XR environment).
  • the system comprises an image generator 11, an encoder 12, a transmitter 13, a network 14, a receiver 15, a decoder 16 and a display device 17.
  • these components may each be implemented on separate apparatuses. Equally, various combinations of these components may be implemented on a shared apparatus; for example, the image generator 11, the encoder 12, and the transmitter 13 may all be part of a single image data generation device. Similarly, the receiver 15, the decoder 16, and the display device 17 may all be a part of a single image rendering device.
  • the system comprises at least one encoding computer device (e.g. a server of a content provider) and at least one rendering computer device (e.g. a VR headset).
  • each of the components, and in particular the image generator 11, the encoder 12, the transmitter 13, the receiver 15, the decoder 16 and the display device 17, is typically implemented on a computer device 20, where, as described above, a plurality of these components may be implemented on a shared computer device.
  • Each computer device comprises one or more of: a processor 21 for executing instructions (e.g. so as to perform one or more of the steps of the various methods described below); a communication interface 22 for facilitating communication between computer devices (e.g. an Ethernet interface, a Bluetooth® interface, or a universal serial bus (USB) interface); a memory 23 and/or storage 24 for storing information and instructions (e.g. a random access memory (RAM), a read only memory (ROM), a hard disk drive (HDD), a solid state drive (SSD), and/or a flash memory); and a user interface 25 (e.g. a display, a mouse, and/or a keyboard) for enabling a user to interact with the computer device.
  • the computer device 20 may comprise further (or fewer) components.
  • the computer device (e.g. the display device 17) may comprise one or more sensors, such as an accelerometer, a GPS sensor, or a light sensor. These sensors typically enable the computer device to identify an environmental condition and/or an action of a wearer of the display device.
  • the image generator 11 is configured to generate a sequence of image data (e.g. a sequence of image frames) to enable the display device 17 to use this image data to display a plurality of images.
  • the image data may comprise one or more digital objects and the image data may be generated or encoded in any format.
  • the image data may comprise point cloud data, where each point has a 3D position and one or more attributes. These attributes may, for example, include a surface colour, a transparency value, an object size and a surface normal direction. Each attribute may have a value chosen from a continuous range or may have a value chosen from a discrete set.
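A minimal Python representation of such a point might look as follows; the field names and defaults are assumptions for illustration, not a format defined by the disclosure.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Point:
    """A single point of the three-dimensional representation.
    Attribute values may be continuous (e.g. transparency) or drawn from
    a discrete set (e.g. a size index); the field names here are illustrative."""
    position: Tuple[float, float, float]
    colour: Tuple[int, int, int] = (0, 0, 0)
    transparency: float = 0.0                      # 0.0 opaque .. 1.0 fully transparent
    size: float = 1.0                              # object/point size
    normal: Optional[Tuple[float, float, float]] = None   # surface normal direction

if __name__ == "__main__":
    p = Point(position=(1.0, 2.0, 3.0), colour=(200, 180, 40), normal=(0.0, 0.0, 1.0))
    print(p)
```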
  • the image data enables the later rendering of images.
  • This image data may enable a direct rendering (e.g. the image data may directly represent an image).
  • the image data may require further processing in order to enable rendering.
  • the image data may comprise three-dimensional point cloud data, where rendering a two-dimensional image using this data requires processing based on a viewpoint of this two-dimensional image.
  • the image data may comprise depth map data, where one or more pixels or objects in the image is associated with a depth that is specified by the depth map data.
  • the depth map data may be provided as a depth map layer, separate from an image layer.
  • the image layer may instead be described as a texture layer.
  • the depth map layer may instead be described as a geometry layer.
  • the image data may include a predicted display window location.
  • the predicted display window location may indicate a portion of an image that is likely to be displayed by the display device 17.
  • the predicted display window location may be based on a viewing position (such as a virtual position and/or orientation of the user in a 3D environment) of the user, where this viewing position may be obtained from the display device.
  • the predicted display window location may be defined using one or more coordinates. For example, the predicted display window location may be defined using the coordinates of a corner or center of a predicted display window, and may be defined using a size of the predicted display window.
  • the predicted display window location may be encoded as part of metadata included with the frame.
  • the image data for each image may include further information, which may be provided as a part of an image, e.g. as part of the point cloud data, or as separate layers.
  • the image data may include audio information or haptic feedback information indicating audio or haptics which can accompany displayed visual data.
  • An audio layer or haptic layer may accompany each image, and may be omitted for images where no accompanying audio or haptics are required.
  • the image data may comprise interactivity information, where the image data may contain or indicate elements with which a user can interact.
  • the interactivity information may, for example, define a behaviour of an element, where a user is able to interact with the element based on this behaviour.
  • the behaviour typically defines a change in an element that occurs as a result of a user interaction where this change may comprise a change in the attributes of the element or in the rendering of the element.
  • the target element may be arranged to disappear when a user interacts with this element, or to provide feedback indicating that the user has interacted with the target.
  • This interactivity data may be provided as part of, or separately to, the image data.
  • the image data may indicate, or may be combinable with, a state of the virtual environment, a position of a user, or a viewing direction of the user.
  • the position and viewing direction may be physical properties of the user in the real world, or the position and viewing direction may be purely virtual, for example being controlled using a handheld controller.
  • the image generator 11 may, for example, obtain information from the display device 17 that indicates the position, viewing direction, or motion of the user. Equally, the image generator may generate image data such that it can later be combined with this position, viewing direction, or motion, where the image generator may generate a full scene which is only partially viewed by a user depending on the position of that user.
  • the generated image may be independent of user position and viewing direction.
  • This type of image generation typically requires significant computer resources such as a powerful GPU, and may be implemented in a cloud service, or on a local but powerful computer.
  • a cloud service such as a Cloud Rendering Service (CRN)
  • CRN Cloud Rendering Service
  • rendering refers at least to an initial stage of rendering to generate an image. Further rendering may occur at the display device 17 based on the generated image to produce a final image which is displayed.
  • the image generator 11 may, for example, comprise a rendering engine for initially rendering a virtual environment such as a game or a virtual meeting room.
  • the encoder 12 is configured to encode frames to be transmitted to the display device 17.
  • the encoder may be implemented using executable software or may be implemented on specific hardware such as an ASIC.
  • the image generator 11 may transmit raw, unencoded, data through the network 14. However, such transmission typically leads to a high file size and requires a high bandwidth so that it is typically desirable to encode the data prior to the transmission.
  • the encoder 12 may encode the image data in a lossless manner or may encode the data in a lossy manner.
  • the encoder may apply inter-frame or intra-frame compression based on a currently-encoded frame and optionally one or more previously encoded frames.
  • the encoder may be a multi-layer encoder, such as a low complexity enhancement video codec (LCEVC) enabled encoder.
  • LCEVC low complexity enhancement video codec
  • the encoder 12 may perform layered encoding on each instance of image data (e.g. each frame) to generate an encoded frame comprising a base depth map layer and an enhancement depth map layer. Encoding a depth map in this way may improve compression.
  • depth maps are desirably highly detailed with a bit depth of up to twelve or fourteen bits, which is a significant increase in the data to be transmitted.
  • providing ways to improve compression of the depth map can make more realistic depth map-based displays viable when performing rendering or transmission of rendered data in real-time.
  • this type of layered encoding makes it easy to drop (and then pick back up) one or more of the layers, which provides flexibility and tools for bandwidth management.
  • Layered encoding is also helpful as the final decoder/user device (such as a user display device) can choose whether to process these extra layers.
  • Otherwise, the best the end device (i.e. the receiver, decoder or display device associated with a user that will view the images) can do is to signal to the controller/renderer/encoder that it does not have enough resources.
  • the controller then will send future images at a lower quality.
  • the end device still unfortunately has to process the higher quality data until the lower quality data arrives, if it can process the received images at all.
  • this situation is improved upon because when/if the end device determines for example that it does not have the processing capabilities to handle the highest level of quality, then it can drop and/or choose not to process certain layers.
  • the end device may also signal to the controller that it needs a lower level of quality, but in the meantime the end device can only process the number of layers that it can handle. Therefore, the end device can react to conditions much more quickly.
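A sketch of how an end device might react locally, assuming per-layer decode-cost estimates and a per-frame processing budget; the layer names and costs are illustrative.

```python
def choose_layers(layers, budget_ms):
    """Decode the base layer plus as many enhancement layers as fit in the
    per-frame processing budget; the rest are dropped locally.

    `layers` is an ordered list of (name, estimated_decode_ms) pairs, base first.
    Returns (layers_to_decode, layers_to_drop)."""
    keep, drop, spent = [], [], 0.0
    for i, (name, cost) in enumerate(layers):
        if i == 0 or spent + cost <= budget_ms:   # the base layer is always kept
            keep.append(name)
            spent += cost
        else:
            drop.append(name)
    return keep, drop

if __name__ == "__main__":
    frame_layers = [("base", 4.0), ("depth_base", 2.0),
                    ("lcevc_enhancement", 3.5), ("depth_enhancement", 3.0)]
    keep, drop = choose_layers(frame_layers, budget_ms=8.0)
    print("decode:", keep)   # the dropped layers could also be signalled upstream
    print("drop:  ", drop)
```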
  • depth map data may be embedded in image data.
  • the base depth map layer may be a base image layer with embedded depth map data
  • the enhancement depth map layer may be an enhancement image layer with embedded depth map data.
  • the encoded depth map layers may be separate from the encoded image layers.
  • the encoded depth map layers can be dropped under some conditions while still retaining image layers that can be displayed (albeit with a lower level of realism).
  • the encoded depth map layers can be dropped by a transmitter or encoder when available communication resources are reduced, or can be dropped by an end device which lacks the processing resources to handle the highest level of quality.
  • where some images comprise an audio base layer, a haptic feedback base layer, an audio enhancement layer or a haptic feedback enhancement layer, these can be processed or dropped flexibly.
  • where some images comprise an interactivity data base layer or an interactivity enhancement layer, these can be processed or dropped flexibly.
  • certain interactions may only be possible where a threshold bandwidth is available, where complex interactions (e.g. those enabling a conversation with a digital object) may be disabled before less complex interactions (e.g. changing a pixel colour) are disabled.
  • the encoder may apply a point cloud data encoding technique such as described in European patent application EP21386059.6, which is incorporated herein by reference.
  • a point cloud encoder may act as a base encoder for a layered encoding technique such as LCEVC or VC-6.
  • LCEVC and VC-6 techniques encode and decode a layered signal, but are agnostic about the content type of data encoded in the signal.
  • the signal can include textures, video frames, geometry or depth data, meshes, point clouds, rendering attributes or physics engine attributes.
  • the transmitter 13 may be any known type of transmitter for wired or wireless communications, including an Ethernet transmitter or a Bluetooth transmitter.
  • the transmitter 13 may be configured to make decisions about how to transmit the image data, and/or may provide feedback to the encoder 12 or the image generator 11 .
  • the transmitter may determine available communication resources (e.g. bandwidth) for transmitting image data, and may drop one or more layers from an encoded frame, or indicate to the image generator and/or encoder that image data should be generated and encoded with fewer layers, when insufficient bandwidth is available for transmission of all generated data.
  • the transmitter may be configured to drop a depth map layer, an LCEVC enhancement layer, or a VC-6 enhancement layer from a frame when insufficient communication resources are available.
  • the network 14 provides a channel for communication between the transmitter 13 and the receiver 15, and may be any known type of network such as a WAN or LAN or a wireless Wi-Fi or Bluetooth network.
  • the network may further be a composite of several networks of different types. Many users only have access to a network with a bandwidth of 30MBps which can lead to latency jitter when streaming. The required bandwidth and the observed latency can be reduced by means of tactics such as forward-looking rendering and last-millisecond reprojection, which are enabled by improved compression.
  • the receiver 15 may be any known type of receiver for wired or wireless communications, including an Ethernet receiver or a Bluetooth receiver.
  • the decoder 16 is configured to receive and decode an encoded frame.
  • the decoder may be implemented using executable software or may be implemented on specific hardware such as an ASIC.
  • the display device 17 may for example be a television screen or a VR headset.
  • the timing of the display may be linked to a configured frame rate, such that the display device may wait before displaying the image.
  • the display device may be configured to perform warping, that is, to obtain a final display window location, adjust a warpable image to obtain a final image corresponding to a final viewing direction of the user, and display the final image.
  • the image data is typically arranged to provide a warpable image for which a portion of the image that is displayed at the display device 17 is dependent on a position or orientation of a viewer.
  • the warpable image may then be rendered before a most up to date viewing direction of the user is known.
  • the warpable image may be transmitted to the display device, or the warpable image may be transmitted to a rendering node which is near to the display device, and the display device or rendering node may perform time warping to generate a displayed image portion based on the warpable image and the most up to date viewing direction of the user.
  • a single device may provide a plurality of the described components.
  • a first rendering node may comprise the image generator 11, encoder 12 and transmitter 13. Additional similar rendering nodes may be included in the system, and may work together to generate the sequence of frames.
  • multiple rendering nodes may each provide separate image data to an image data assembling node; for example, each rendering node may provide a part of a sequence of frames to a frame assembling node.
  • the receiver 15, decoder 16 or display device 17 may be configured to assemble parts of image data from multiple sources to generate a sequence of images for display on the display device.
  • the image data assembling node may be separate from the receiver 15, decoder 16 and display device 17.
  • multiple rendering nodes may be chained.
  • successive rendering nodes may add to a sequence of image data as it passes from rendering node to rendering node, and eventually a complete sequence of image data is then provided to the receiver 15.
  • each rendering node may obtain components of a render from multiple upstream rendering nodes and/or distribute components of a render to multiple downstream rendering nodes.
  • a chain of rendering nodes may be useful for performing different rendering tasks that require different quantities of processing resources, or different frame rates.
  • a company may provide distributed processing in the form of a centralised hub which has abundant processing resources but is distant from users, and peripheral locations which have more scarce processing resources but are closer to users.
  • Expensive but fairly static rendering features such as background lighting or environmental impact on sound may be generated at the central hub (for example using ray tracing), while features that require fewer resources but faster responses or higher frame rates may be generated closer to the user.
  • the more responsive a rendering feature needs to be the lower latency it needs between the rendering node which generates the feature and the user display and, in a chain of rendering nodes, the node which generates each rendering feature can be chosen based on a required maximum latency of that feature.
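A simple sketch of choosing, for each rendering feature, the node closest to the central hub whose latency to the user still meets that feature's latency budget; the node names, latencies and budgets below are invented for illustration.

```python
def assign_features(features, nodes):
    """Assign each rendering feature to the node nearest the central hub whose
    latency to the user still meets the feature's required maximum latency.

    `features`: dict of feature name -> required maximum latency (ms).
    `nodes`: list of (node name, latency to user in ms), central hub first."""
    assignment = {}
    for feature, max_latency in features.items():
        for name, latency in nodes:                  # nodes ordered hub -> edge
            if latency <= max_latency:
                assignment[feature] = name
                break
        else:
            assignment[feature] = nodes[-1][0]       # fall back to the closest node
    return assignment

if __name__ == "__main__":
    nodes = [("central-hub", 80.0), ("regional", 25.0), ("edge", 8.0)]
    features = {"background_lighting": 200.0, "sound_environment": 100.0,
                "head_locked_ui": 15.0}
    print(assign_features(features, nodes))
```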
  • a set of surfaces may be constructed where each surface has different sound reflection and absorption properties depending upon material and shape.
  • the frame rates may be matched by creating multiple frames with features generated at the lower frame rate, and combining them with the frames with features generated at the higher frame rate.
  • a preliminary rendering generates volumetric object data including motion vectors at a first (lowest) frame rate, then produces 2D rendered frames plus depth information for a specific user at a second (higher) frame rate, then transmits video plus depth data to the user device, which produces final frames for display via space warping (depth-based reprojections) at a third (highest) frame rate.
  • One or more of these steps may be performed in combination with the other described embodiments.
  • the viewing position of the user may change as additional rendering tasks are performed at different rendering nodes in the chain. Each or any rendering node may obtain an updated viewing position before performing its respective rendering task.
  • the system may simultaneously generate multiple sequences of image data for different respective users or different respective display devices.
  • each user or display device may view a different 3D environment, or may view different parts of a same 3D environment.
  • each node may serve multiple users or just one user.
  • a starting rendering node may serve a large group of users.
  • the group of users may be viewing nearby parts of a same 3D environment.
  • the starting node may render a wide zone of view (“field of view”) which is relevant for all users in the large group.
  • the starting node may send this wide field of view to a first middle rendering node which renders additional aspects of the 3D environment. These additional aspects may for example be aspects which require less processing power to render, or may be aspects which are specific to individual users of the group. Additionally, the middle rendering node may render features in a smaller field of view than the starting node - this smaller field of view may be relevant to each user rather than the group of users.
  • the first middle rendering node may additionally only serve a smaller number of users (e.g. half of the large group of users), with the remaining users being served by a second middle rendering node which also receives the wide field of view from the starting node.
  • the middle rendering node(s) may then send sequences of second partially or fully rendered frames to an end device for each user.
  • the end device may perform further processes such as warping or focal distance adjustments, optionally using depth map data.
  • each rendering node encodes the partially or fully rendered frames before transmitting them on to a next rendering node or to the receiver 15.
  • the required communication resources can be reduced when the rendering nodes are separated by one or more networks, or more generally are implemented in a distributed system such as a cloud.
  • each rendering node in a chain is encoding a different partially or fully rendered frame, with different data. Therefore, it may be advantageous for different rendering nodes to use different rendering formats and/or encoding formats.
  • the output from a first rendering node may be point cloud data which logically describes a 3D scene. This point cloud data can be encoded using the techniques of EP21386059.6.
  • a second rendering node may then operate on the point cloud data to generate image data that is more readily displayed by a generic display device, without requiring the display device to model the 3D environment. This image data may be encoded using video coding techniques.
  • the chaining of rendering nodes may be extended to arbitrary tree structures, where a rendering node obtains partially rendered frames from more than one preceding rendering node, and generates further partially or fully rendered frames based on the multiple obtained sequences of partially rendered frames.
  • a content rendering network comprising numerous rendering nodes may be used to serve a volumetric event to a large number of same-time users, such as users participating in a shared virtual environment. Rendering the same event for each user is far more expensive in terms of computation time and power consumption than rendering the volumetric effect once and performing the rendering equivalent of multicasting the volumetric effect for multiple users.
  • each user may have a second rendering node (such as a VR headset), and the network may comprise a central first rendering node.
  • the first rendering node may render the volumetric event, and distribute partially rendered frames depicting the volumetric event to the different second rendering nodes.
  • the second rendering node for each user may then integrate the partially rendered frames depicting the volumetric event into a view of the virtual environment which is currently being shown to each user, based on parameters such as the user’s virtual position.
  • the receiver 15, decoder 16 and display device 17 may be consolidated into a single device, or may be separated into two or more devices.
  • some VR headset systems comprise a base unit and a headset unit which communicate with each other.
  • the receiver 15 and decoder 16 may be incorporated into such a base unit.
  • a home display system may comprise a base unit configured as an image source, and a portable display unit comprising the display device 17.
  • the receiver 15 or another transmitter associated with the decoder or display device may send a corresponding layer drop indication back through the network 14.
  • the layer drop indication may be received by each rendering node.
  • a rendering node which generates partially or fully rendered frames for that specific decoder or display device may cease generating the dropped layer.
  • a rendering node which generates partially or fully rendered frames for multiple end devices may disregard a layer drop indication received from one end device (as the dropped layer is still needed for other devices).
  • rendering nodes which serve multiple end devices may record received layer drop indications, and may cease generating the dropped layer only when all end devices served by the rendering node indicate that the layer is to be dropped.
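The bookkeeping described above might be sketched as follows, assuming each rendering node knows the set of end devices it serves; the class and method names are illustrative.

```python
from collections import defaultdict

class LayerDropTracker:
    """Records layer drop indications from the end devices served by a
    rendering node; a layer is only dropped once every device has asked."""

    def __init__(self, served_devices):
        self.served = set(served_devices)
        self.drops = defaultdict(set)            # layer name -> devices that dropped it

    def record_drop(self, device_id, layer):
        self.drops[layer].add(device_id)

    def should_generate(self, layer):
        """Keep generating the layer while at least one device still needs it."""
        return self.drops[layer] != self.served

if __name__ == "__main__":
    tracker = LayerDropTracker(served_devices=["headset-a", "headset-b"])
    tracker.record_drop("headset-a", "depth_enhancement")
    print(tracker.should_generate("depth_enhancement"))   # True: headset-b still needs it
    tracker.record_drop("headset-b", "depth_enhancement")
    print(tracker.should_generate("depth_enhancement"))   # False: all devices dropped it
```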
  • the encoders or decoders are part of a tier-based hierarchical coding scheme or format.
  • Hierarchical coding enables frames to be communicated with higher resolution and/or higher frame rate than is possible in single-tier coding schemes.
  • one or more enhancement layers are communicated with base data, where the enhancement layers can be used to up-sample the base data at the decoder, for example providing up-sampling in a spatial or temporal dimension.
  • hierarchical coding can overall provide lossless compression of data, with higher resolution and/or higher frame rate for a given transmission bit rate.
  • Examples of a tier-based hierarchical coding scheme include LCEVC: MPEG-5 Part 2 LCEVC (“Low Complexity Enhancement Video Coding”) and VC-6: SMPTE VC-6 ST-2117, the former being described in PCT/GB2020/050695, published as WO 2020/188273 (and the associated standard document), and the latter being described in PCT/GB2018/053552, published as WO 2019/111010 (and the associated standard document), all of which are incorporated by reference herein.
  • LCEVC MPEG-5 Part 2 LCEVC (“Low Complexity Enhancement Video Coding”)
  • VC-6 SMPTE VC-6 ST-2117
  • A further example is described in WO2018/046940, which is incorporated by reference herein.
  • a set of residuals are encoded relative to the residuals stored in a temporal buffer.
  • LCEVC Low-Complexity Enhancement Video Coding
  • Low-Complexity Enhancement Video Coding is a standardised coding method set out in standard specification documents including the Text of ISO/IEC 23094-2 Ed 1 Low Complexity Enhancement Video Coding published in November 2021, which is incorporated by reference herein.
  • the system described above is suitable for generating and presenting a representation of a scene, where this scene displays media content to a user.
  • the scene typically comprises an environment, where the user is able to move (e.g. to move their head or to turn their head) to look around the environment and/or to move around the environment.
  • the scene may be a scene of a room in a building, where the user is able to move around the room (e.g. by moving in the real-world and/or by providing an input to a user interface) in order to inspect various parts of the room.
  • the scene is an XR (e.g. a VR) scene, where the user is able to move about the scene in three degrees of freedom (3DoF) or six degrees of freedom (6DoF) so as to experience the scene.
  • 3DoF three degrees of freedom
  • 6DoF six degrees of freedom
  • the image generator 11 may be arranged to determine point cloud data, where each point of the point cloud has a 3D position and one or more attributes. More generally, the image generator (or another component) is arranged to determine a three-dimensional representation of a scene, where this three-dimensional representation is thereafter used to generate two-dimensional images that are presented to a user at the display device 17. While the points are typically points of a point cloud, more generally the disclosure extends to any point that is associated with a location and a value.
  • the points may, more generally, be considered to be data (or datapoints), which data is associated with a location and a value, and the ‘points’ may comprise polygons, planes (regular or irregular), Gaussian splats, etc.
  • the method comprises determining the attribute using a capture device, such as a camera or a scanner.
  • the scene may comprise a real scene, in which attribute values are captured using a camera, or a virtual scene (e.g. a three-dimensional model of a scene), in which attribute values are captured using a virtual scanner.
  • Where reference is made to determining a point, it will be understood that this generally refers to determining a point that has a location and an attribute value, where determining the point comprises determining the attribute value and/or storing a point that comprises at least an attribute value and a location value (these values may be indirect values, e.g. where the location is identified relative to another point).
  • these points can be stored as a three-dimensional representation (e.g. a point cloud) so as to enable the reconstruction of the three-dimensional scene based on this representation.
  • the scene comprises a simulated scene that exists only on a computer.
  • a scene may, for example, be generated using software such as the Maya software produced by Autodesk®.
  • the attributes determined using the methods described herein may then depend on virtual objects located within the scene as well as a virtual lighting arrangement used in the scene.
  • a computer device initiates a capture process for a capture device, the capture process being initiated with an initial azimuth angle (e.g. of 0°) and an initial elevation angle (e.g. of 0°).
  • the computer device causes a point to be captured using the capture device at the current azimuth angle and current elevation angle.
  • Capturing a point typically comprises assigning an attribute value to the point, which attribute value may, for example, be a color of the point and/or a transparency value of the point.
  • the point has one or more color values associated with each of a left eye and a right eye of a viewer.
  • Capturing the point may also comprise determining a normal value associated with the point, e.g. a normal of a surface on which the point lies.
  • capturing the point further comprises determining a location of the point, e.g. by determining a distance of the point from the camera.
  • determining the point may comprise sending a ‘ray’ from the capture device and then stepping through a computer model to determine which surface of the computer model is impacted by the ray. The color, transparency, and normal of this surface are then recorded alongside the distance of the surface from the capture device.
  • In a third step 33, the computer device determines whether a point has been captured for the capture device at each azimuth of a range of azimuths and, in a fourth step 34, if points have not been captured at each azimuth, then the azimuth angle is incremented and the method returns to the second step 32 and another point is captured.
  • the azimuth angle may, for example, be incremented by between 0.01° and 1° and/or by between 0.025° and 0.1°.
  • the range of azimuth angles is selected to be 360° (i.e. so that the capture device captures points surrounding the entirety of the capture device), but it will be appreciated that other ranges are possible.
  • In a fifth step 35, the computer device determines whether a point has been captured for the capture device at each elevation of a range of elevations and, in a sixth step 36, if points have not been captured at each elevation, then the azimuth angle is reset to the initial value, the elevation angle is incremented and the method returns to the second step 32 and another point is captured.
  • the elevation angles may, for example, be incremented by between 0.01° and 1° and/or by between 0.025° and 0.1°.
  • the range of elevation angles is selected to be 360° (i.e. so that the capture device captures points surrounding the entirety of the capture device), but it will be appreciated that other ranges are possible.
  • In a seventh step 37, once points have been captured for each azimuth angle and each elevation angle, the scanning process ends.
  • This method enables a capture device to capture points at a range of elevation and azimuth angles.
  • This point data is typically stored in a matrix.
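A sketch of the scan loop of Figure 3, assuming a cast_ray callback that returns the attribute, transparency, normal and distance of the surface hit at a given azimuth/elevation; the matrix layout and the toy ray caster are illustrative assumptions.

```python
import numpy as np

def capture(cast_ray, azimuth_step=0.1, elevation_step=0.1,
            azimuth_range=360.0, elevation_range=360.0):
    """Sweep the capture device over the full range of azimuth and elevation
    angles (Figure 3, steps 31-37) and store one point per angle pair.

    `cast_ray(azimuth_deg, elevation_deg)` is assumed to return
    (colour, transparency, normal, distance) for the surface hit by the ray."""
    n_az = int(round(azimuth_range / azimuth_step))
    n_el = int(round(elevation_range / elevation_step))
    distances = np.zeros((n_el, n_az), dtype=np.float32)
    colours = np.zeros((n_el, n_az, 3), dtype=np.uint8)
    for i in range(n_el):
        for j in range(n_az):
            colour, transparency, normal, distance = cast_ray(j * azimuth_step,
                                                              i * elevation_step)
            distances[i, j] = distance
            colours[i, j] = colour
    return distances, colours

if __name__ == "__main__":
    # Toy ray caster: a uniform grey sphere of radius 5 around the capture device.
    def toy_ray(azimuth_deg, elevation_deg):
        return (128, 128, 128), 0.0, (0.0, 0.0, 1.0), 5.0

    # A coarse 10-degree step keeps the demo fast; the text contemplates
    # increments between 0.01 and 1 degree.
    d, c = capture(toy_ray, azimuth_step=10.0, elevation_step=10.0)
    print(d.shape, c.shape)     # (36, 36) and (36, 36, 3)
```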
  • the point data may then be used to provide a representation of the scene to a user, e.g. the three-dimensional representation formed by the point data may be processed to produce two-dimensional images for each eye of a user, with these images then being shown to a user via the display device 17 to provide a virtual reality experience to the viewer.
  • a video can be provided to a viewer that enables the viewer to move their head to look around the scene (while remaining at the location of the capture device).
  • the capture pattern (or scanning pattern) described with reference to Figure 3 is purely exemplary and that numerous capture patterns are possible.
  • the capture process for each capture device comprises capturing one or more points at one or more azimuth angles and/or one or more elevation angles.
  • the ‘points’ captured by the capture device are typically associated with a size, such as a height, a width, or a depth. That is, the points typically relate to two-dimensional planes/pixels and/or three-dimensional voxels. In this regard, there is necessarily some space between the locations of adjacent points (since if the points had no width, then an infinite number of points would be required to capture points at each angle).
  • the size provides points that depict a non-negligible area of the three-dimensional space so that a plurality of points can be fit together to provide a depiction of the scene to a viewer.
  • the width and height of each point is typically dependent on the distance of that point from the capture device, where more distant points have a larger width/height.
  • the width and height of each point is typically determined so that when each point is displayed, there is no space between adjacent points (indeed, there may be some overlap between points to ensure that no gaps appear between points). This height/width of each point can be determined at the time of capturing the points, or can be determined or defined after the capture of the points.
  • the points comprise a size value, which is stored as a part of the point data.
  • the points may be stored with a width value and/or a height value.
  • the minimum width and the minimum height of a point are set by the angle increment of the azimuth angle and the elevation angle respectively.
  • the size may be then specified in terms of this angle increment and/or in terms of this minimum width/minimum height (e.g. as being a multiple of the angle increment).
  • the size value is stored as an index, which index relates to a known list of sizes (e.g. if the size may be any of 1x1, 2x1, 1x2, or 2x2 pixels, this may be specified by using 3 bits and a list that relates each combination of bits to a size).
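For the 1x1/2x1/1x2/2x2 example above, the size index might be handled as follows; the table contents are taken from the example and the helper names are assumptions.

```python
# Known list of point sizes (width, height) in units of the angle increment;
# the list itself would be agreed between encoder and decoder.
SIZE_TABLE = [(1, 1), (2, 1), (1, 2), (2, 2)]

def encode_size(width, height):
    """Return the 3-bit index for a (width, height) size."""
    index = SIZE_TABLE.index((width, height))
    assert index < 2 ** 3, "index must fit in 3 bits"
    return index

def decode_size(index):
    return SIZE_TABLE[index]

if __name__ == "__main__":
    idx = encode_size(2, 1)
    print(idx, decode_size(idx))   # 1 (2, 1)
```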
  • the size may be stored based on an underscan value.
  • In this regard, where an object is very near to the viewing zone it may be captured using an unnecessarily dense arrangement of points. Therefore, certain surfaces or areas of the representation may be associated with an underscan value, which underscan value defines a reduction in the number of points captured as compared to a representation without underscan.
  • the size of the points may be defined so as to indicate this underscan value.
  • the underscan value is an integer value between 0 and 3 and the size is stored as a combination of point dimensions (e.g. a width in the range [0,2]) and a height in the range ([0,2]) and an underscan factor (e.g.
  • a plurality of sub-points SP1, SP2, SP3, SP4, SP5 is determined.
  • if the azimuth angle increment is 0.1°, then for an azimuth angle of 0°, sub-points may be determined at azimuth angles of -0.05°, -0.025°, 0°, 0.025°, and 0.05° (and similar sub-points may be determined for a plurality of elevation angles). Attribute values of these sub-points may then be combined to obtain an attribute value for the point.
  • a maximum attribute value of the sub-points may be used as the value for the point
  • an average attribute value of the sub-points may be used as the value for the point
  • a weighted average of the sub-points may be used as the value for the point. It will be appreciated that numerous other methods for combining the attribute values of the sub-points are possible.
  • the accuracy of the capture process can be increased. While it would be possible to simply reduce the increment of the angle steps to provide a higher resolution scene, by considering sub-points but only storing attributes for points, a balance can be struck between accuracy and file size (since storing every sub-point would lead to a substantial increase in the amount of data that needs storing).
  • this capture device may obtain attributes associated with each of the sub-points SP1, SP2, SP3, SP4, SP5, combine these attributes to obtain a point attribute, and then store a point with a distance that is an average (e.g. a weighted average) of the distances of the sub-points from the capture device, at the nominal angle of the point, with the point attribute.
  • these points may have different distances from the location of the capture device.
  • the attributes of the sub-points may be combined in dependence on this distance, e.g. so that sub-points nearer to the capture device have higher weightings.
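A sketch of combining sub-point attributes into a single point attribute, including an inverse-distance weighting so that nearer sub-points count more; the weighting scheme itself is an illustrative assumption.

```python
def combine_subpoints(subpoints, mode="weighted_average"):
    """Combine sub-point attribute values into a single point attribute.

    `subpoints` is a list of (attribute_value, distance) pairs captured at
    fractional angle offsets around the nominal capture angle."""
    values = [v for v, _ in subpoints]
    if mode == "maximum":
        return max(values)
    if mode == "average":
        return sum(values) / len(values)
    if mode == "weighted_average":
        # Nearer sub-points get higher weights (inverse-distance weighting).
        weights = [1.0 / (d + 1e-6) for _, d in subpoints]
        total = sum(weights)
        return sum(v * w for (v, _), w in zip(subpoints, weights)) / total
    raise ValueError(f"unknown mode: {mode}")

if __name__ == "__main__":
    # Five sub-points: an attribute (e.g. a luma value) and a distance from the device.
    sp = [(100, 2.0), (110, 2.1), (105, 2.0), (200, 9.8), (210, 10.1)]
    print(round(combine_subpoints(sp, "weighted_average"), 1))
```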
  • the possibility of sub-points with substantially different distances raises a potential problem.
  • the distances for the sub-points are averaged. But where the sub-points have substantially different distances and/or are related to different surfaces in the scene, this may result in the point having a distance that does not correspond to any actual surface in the scene. Therefore, the point may seem to hang in space (e.g. to hang between the front and rear surfaces shown in Figure 4b).
  • the attribute value of the point may be substantially different to the attribute value of other points in the scene.
  • the point may appear as a grey point hanging in space between these objects.
  • the computer device is arranged to aggregate sub-points so as not to create any floating points. For example, the computer device may determine whether the sub-points are spatially coherent by employing a clustering algorithm (e.g. a k-means clustering algorithm). Where the sub-points are spatially coherent (e.g. where a difference in the distance of the sub-points is below a threshold value), these distances may be averaged to obtain a distance for the point.
  • the sub-points may be processed to ensure that the distance of any point places it upon a surface; for example, in the system of Figure 4b, sub-points SP1, SP2, and SP3 may be grouped into a first point and sub-points SP4 and SP5 may be grouped into a second point. Since each sub-point is associated with the same capture device and capture angle (all of these sub-points being associated with a capture step that has a particular azimuth angle and elevation angle), these points may be located at the same angle with respect to a capture device.
  • the first point (made up of sub-points SP1, SP2, and SP3) may have a smaller distance value than the second point (made up of sub-points SP4 and SP5) and the first point may be assigned a nonzero transparency value so that the second point can be seen through the first point.
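The grouping described above might be sketched as follows, using a simple distance-gap threshold in place of a full clustering algorithm (a k-means clustering could equally be used); the threshold and the 0.5 transparency value are illustrative assumptions.

```python
def group_subpoints(subpoints, distance_threshold=1.0):
    """Split sub-points (attribute, distance) into spatially coherent groups.

    A new group is started whenever the gap to the previous distance exceeds
    the threshold, so no resulting point floats between two surfaces."""
    ordered = sorted(subpoints, key=lambda s: s[1])
    groups = [[ordered[0]]]
    for sp in ordered[1:]:
        if sp[1] - groups[-1][-1][1] > distance_threshold:
            groups.append([sp])
        else:
            groups[-1].append(sp)
    return groups

def to_points(groups):
    """One point per group: averaged attribute and distance; every point except
    the farthest gets a non-zero transparency so it can be seen through."""
    points = []
    for i, group in enumerate(groups):
        attr = sum(a for a, _ in group) / len(group)
        dist = sum(d for _, d in group) / len(group)
        transparency = 0.5 if i < len(groups) - 1 else 0.0
        points.append({"attribute": attr, "distance": dist, "transparency": transparency})
    return points

if __name__ == "__main__":
    # SP1-SP3 lie on a near surface, SP4-SP5 on a far surface (cf. Figure 4b).
    sp = [(100, 2.0), (110, 2.1), (105, 2.05), (200, 9.8), (210, 10.1)]
    for p in to_points(group_subpoints(sp)):
        print(p)
```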
  • By capturing points at a plurality of azimuth angles and elevation angles, e.g. using the method described with reference to Figure 3, it is possible to provide a three-dimensional representation of the scene that can later be used to enable a viewer to view the scene from a plurality of angles. More specifically, given the three-dimensional points captured by the capture device, a computer device is able to render a two-dimensional representation (e.g. a two-dimensional image) of the scene for each eye of a viewer so as to provide a representation with an impression of depth. The computer device may render a series of two-dimensional representations to enable the viewer to look around the scene, where the two-dimensional representations are rendered based on an orientation of the viewer’s head. In this way, the determined representation is useable to provide, for example, a virtual reality (VR), mixed reality (MR), augmented reality (AR), and/or extended reality (XR) experience to the viewer.
  • VR virtual reality
  • MR mixed reality
  • AR augmented reality
  • XR extended reality
  • the display device 17 is typically a virtual reality headset that comprises a plurality of sensors to track a head movement of the user. By tracking this head movement, the display device is able to update the images being displayed to the viewer as the viewer moves their head to look about the scene. Typically, this involves the display device sending the sensor data to an external computer device (e.g. a computer connected to the display device via a wire).
  • the external computer device may comprise powerful graphical processing units (GPUs) and/or central processing units (CPUs) so that the external computer device is able to rapidly render appropriate two-dimensional images for the viewer based on the three-dimensional images and the sensor data.
  • the processing of data and the rendering of images may be performed by various computer devices; for example, a standalone virtual reality headset may be provided, which headset is capable of processing data and rendering images without any connection to an external computer device.
  • the external computer device may comprise a server device, where the display device 17 may be connected to this server device wirelessly.
  • This enables the two-dimensional images to be streamed from the server to the display device so as to enable the display of high-quality images without the need for a viewer to purchase expensive computer equipment.
  • operations that require large amounts of computing power such as the rendering of two-dimensional images based on the three- dimensional representation, may be performed by the server, so that the display device is only required to perform relatively simple operations. This enables the experience to be provided to a wide range of viewers.
  • a first two-dimensional image is provided to the display device 17 (and/or a connected device) and this first image is ‘warped’ in order to provide an image for viewing at the display device.
  • the warping of the image comprises processing the image based on the sensor data in order to provide an image that matches a current viewpoint of the viewer.
  • the three-dimensional representation of the scene may be captured using a plurality of capture devices placed at different locations (or the same capture device placed at different locations). A viewer is then able to move around the scene translationally (e.g. by moving between these locations).
  • a three- dimensional representation of a scene may be captured that allows a suitable two-dimensional representation of this scene to be rendered regardless of a location of a viewer (e.g. regardless of where a user is standing within a virtual room).
  • the three-dimensional representation may be associated with a viewing zone, a zone of view (ZOV), or a zone of viewpoints (ZVP), where the three-dimensional representation is arranged to enable a user to move about the viewing zone so as to view the scene.
  • Figure 5 illustrates such a viewing zone 1 and illustrates how the use of a viewing zone limits the amount of image data that needs to be stored to provide a three-dimensional representation of the scene.
  • While Figure 5 shows a two-dimensional viewing zone, it will be appreciated that in practice the viewing zone 1 is typically a three-dimensional zone or volume.
  • the viewing zone 1 may, for example, comprise a rectangular volume, or a rectangular parallelepiped, and the viewing zone may have a height of at least 30 cm, a depth of at least 30 cm, and/or a width of at least 30 cm, where these dimensions enable a user to move their head while remaining in the viewing zone.
  • This is merely an exemplary arrangement of the viewing zone; it will be appreciated that viewing zones of various shapes and sizes may be used (e.g. spherical viewing zones). That being said, it is preferable that the viewing zone is limited so as to cover only a part of the volume of the scene, e.g. no more than 50% of the scene, no more than 25% of the scene, and/or no more than 10% of the scene.
  • where the viewing zone is not so limited (e.g. where it covers the whole of the scene), the three-dimensional representation will simply be a standard representation for virtual reality (that enables a user to move freely about the scene) - and so the use of the viewing zone will not provide any reduction in file size.
  • the viewing zone 1 enables movement of a viewer around (a portion of) the scene.
  • the base representation may enable a user to walk around the room so as to view the room from different angles.
  • the viewing zone enables a user to move through the scene with six degrees-of-freedom (6DoF) movement, where this aids in the provision of an immersive experience.
  • the viewing zone 1 may be four-dimensional, where a three-dimensional location of the viewing zone changes over time - and in such embodiments the size and location of the occluded surface 2 may also change over time. More generally, it will be appreciated that viewing zones may be formed in any size or shape, with different sizes and shapes being suitable for different scenes.
  • the volume of the viewing zone 1 is typically selected so that a user is able to move to a degree sufficient to avoid motion sickness and to provide an immersive sensation, while still only enabling a limited amount of movement (where this leads to a smaller file size as compared to an implementation where a user is able to fully move about the scene).
  • the viewing zone is arranged to enable a user to move their head while they are sitting or standing, but not to freely roam around a room.
  • the viewing zone 1 may have a (e.g. real-world) volume of less than five cubic metres (5 m³), less than one cubic metre (1 m³), less than one-tenth of a cubic metre (0.1 m³), and/or less than one-hundredth of a cubic metre (0.01 m³).
  • the viewing zone 1 may also have a minimum size, e.g. the viewing zone may have a volume of at least 1% of the volume of the scene, at least 5% of the volume of the scene, and/or at least 10% of the volume of the scene. Similarly, the viewing zone may have a volume of at least one-thousandth of a cubic metre (0.001 m³); at least one-hundredth of a cubic metre (0.01 m³); and/or at least one cubic metre (1 m³).
  • the ‘size’ of the viewing zone 1 typically relates to a size in the real world, where if the viewing zone has a length of one metre this means that a user is able to move one metre in the real world while staying within the viewing zone.
  • the size of the viewing zone in the scene may be greater than, equal to, or less than the size of the viewing zone in the real world.
  • the viewing zone may scale a real-world distance so that moving one metre in the real world moves the user less than (or more than) one metre in the scene. This enables the scene to provide different perceptions to the user (e.g. to make the user feel larger or smaller than they are in real life).
  • the viewing zone may scale a real-world angle so that rotating one degree in the real world rotates the user less than (or more than) one degree in the scene.
  • a viewing zone with a volume of one cubic metre typically connotes a viewing zone in which the user is able to move about a one cubic metre volume in the real world while remaining in the viewing zone. And this may cause the user to move about a volume that is more than, or less than, one cubic metre in the scene.
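As a brief sketch of the scaling described above (the scale factor and function name are illustrative, not taken from the disclosure), a real-world head displacement may be mapped into scene coordinates as follows:

```python
def real_to_scene_offset(real_offset_m, scene_scale=0.5):
    """Map a real-world displacement (in metres) into a displacement in the
    scene. With scene_scale below 1.0, moving one metre in the real world
    moves the viewpoint less than one metre in the scene; above 1.0, more."""
    return tuple(component * scene_scale for component in real_offset_m)

# Moving 0.2 m to the right in the real world moves the viewpoint 0.1 m in the scene.
print(real_to_scene_offset((0.2, 0.0, 0.0)))  # (0.1, 0.0, 0.0)
```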
  • a plurality of capture devices C1, C2, ..., C9 may be used (e.g. a plurality of virtual scanners and/or a plurality of cameras).
  • Each capture device is typically arranged to perform a capture process, e.g. as described with reference to Figure 3, in which the capture device captures points at a plurality of azimuth angles and elevation angles.
  • a first capture device C1 is located at a centrepoint of the viewing zone 1.
  • one or more capture devices C2, C3, C4, C5 may be located at the centres of faces of the viewing zone; and/or one or more capture devices C6, C7, C8, C9 may be located at edges of and/or corners of the viewing zone.
  • Figure 6a shows a two-dimensional view (e.g. a plan view) of a rectangular viewing zone. It will be appreciated that within this viewing zone each capture device may be located on a shared plane. Equally, the various capture devices may be located on different planes. Referring, for example, to Figure 6b, there is shown a three-dimensional view of a cuboid viewing zone, where there is a capture device located: at the centre of the viewing zone; at the centre of each face of the viewing zone; and at each corner of the viewing zone. With this arrangement, many locations in the scene (e.g. specific surfaces) will be captured by a plurality of capture devices so that there will be overlapping points relating to different capture devices.
  • Figure 7 shows a first point P1 being captured by each of a first capture device C1, a sixth capture device C6, and a seventh capture device C7.
  • Each capture device captures this point at a different angle and distance and may be considered to capture a different ‘version’ of the point.
  • this version may be the highest quality version of the point and/or may be the version of the point associated with the nearest and/or least angled capture device.
  • capturing a point for a given azimuth angle and elevation angle typically comprises capturing a plurality of sub-points at varying sub-point azimuth and elevation angles spread around the point azimuth and elevation angles. Due to the different spreads of sub-points, each capture device will capture a different version of the point (that has a different attribute) even when the points are at the same location. Capture devices that are close to the point and less angled with respect to the point typically have a smaller spread of sub-points and so typically obtain a version of a point that is sharper than a version of that point captured by more distant capture devices.
  • a quality value of a version of the point is determined based on the spread of subpoints associated with this version (e.g. based on the perimeter formed by these sub-points and/or based on a surface area or volume bounded by these sub-points).
  • the version of the point that is stored may depend on the respective quality values of possible versions of the points.
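The following is a hedged sketch of selecting which version of a point to store, treating the spread of a version's sub-points as its quality indicator (smaller spread, higher quality); the dictionary key and helper names are assumptions for illustration:

```python
import math

def sub_point_spread(sub_point_positions):
    """Approximate the spread of a version's sub-points as the maximum pairwise
    distance between them (a perimeter or bounded area could equally be used).
    Assumes at least two sub-point positions, each an (x, y, z) tuple."""
    return max(math.dist(a, b)
               for i, a in enumerate(sub_point_positions)
               for b in sub_point_positions[i + 1:])

def best_version(versions):
    """Given several captured versions of (approximately) the same point, keep
    the version whose sub-points have the smallest spread."""
    return min(versions,
               key=lambda version: sub_point_spread(version["sub_point_positions"]))
```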
  • two ‘points’ in approximately the same location captured by each capture device may not have exactly the same location in the three-dimensional representation. More specifically, since each capture device typically projects a ‘ray’ at a given angle, the rays of differing capture devices may contact the surface at different locations for each capture device. Two points may be considered to be two ‘versions’ of a single point when they are within a certain proximity, e.g. a threshold proximity.
  • this further point may be considered to be a ‘version’ of one of the first point and the second point.
  • Figures 8a and 8b show the separate captured grids that are formed by two different capture devices.
  • each capture device will capture a slightly different ‘version’ of a point at a given location and these captured points will have different sizes.
  • Each capture step is associated with a particular range of angles (e.g. a nominal capture angle of 1° might encompass angles from 0.9° to 1.1°), and therefore capture devices that are far from a point to be captured represent a wider region at the capture distance than capture devices closer to that point to be captured.
  • the capture device C1 would capture the points P1 and P2 in separate brackets, whereas for the capture device C2 these points are in the same bracket. Therefore, the capture device C2 might determine a single point that encompasses both points P1 and P2, whereas the capture device C1 would determine separate points for these two points.
  • the ‘sizes’ of these captured points, and the locations in space that are encompassed by the captured points will be based on different grids.
  • the width of the captured point P2 captured by the capture device C2 will be larger than the width of the captured point P1 captured by the capture device C1.
  • the capture process may be determined based on the existence of these different grids, and on the different bracket widths that occur at different distances from a capture device.
  • Figure 8a shows an exaggerated difference between grids for the sake of illustration.
  • Figure 8b shows a more realistic embodiment in which the three-dimensional representation comprises a plurality of points associated with different capture devices, where these points lie on different grids associated with these different capture devices.
  • the capturing of a point by a capture device typically involves capturing a plurality of sub-points associated with this point and then combining these sub-points. And as has been described, e.g. with reference to Figure 4b, in some embodiments the subpoints may be associated with different distances from the capture device.
  • at small distances, viewers are able to identify details of small surfaces and small separations between surfaces. For example, if the two surfaces shown in Figure 4b were located 5 centimetres apart with the front surface being 50 centimetres from a viewer, then that viewer would be able to identify that the rear surface is slightly angled and the viewer would be able to roughly estimate the separation between the front surface and the rear surface. In contrast, at large distances, viewers are typically unable to identify small details of surfaces or separations between surfaces.
  • the present disclosure envisages a method in which the determination of a point, and in particular the combining of sub-points to form a point, is dependent on a distance of that point (and/or the component sub-points) from one or more of: a capture device that is capturing the point; and a viewing zone.
  • a distance from a viewing zone typically connotes being dependent on a minimum distance from a viewing zone (e.g. dependent on a distance between the point of the three- dimensional representation and a proximate point on the viewing zone).
  • a computer device determines a plurality of sub-points that are associated with a point. Each of these sub-points is associated with a capture device that has used a different angle (e.g. a different azimuth angle and/or a different elevation angle) to capture the sub-points.
  • a computer device determines a separation between the sub-points.
  • the separation may be an absolute separation, may be a radial separation, and/or may be an axial separation (e.g. a difference in a depth of the points).
  • the separation may comprise a difference between the distance values of the sub-points from the capture device.
  • the separation may be the maximum difference in distance values between the sub-points.
  • a computer device determines a distance of the sub-points from the viewing zone (and/or the capture device). The distance may, for example, be a minimum distance of a sub-point, a maximum distance of a sub-point, or an average distance of a sub-point. Typically, the distance is the distance of the closest sub-point.
  • the computer device determines whether to combine the sub-points (in order to form a point) based on the separation and the distance. Typically, this comprises comparing the separation to a threshold separation and combining the sub-points in dependence on the separation being beneath the separation threshold.
  • the separation threshold is a function of the distance of the points from the capture device and/or the viewing zone.
  • the separation threshold may increase linearly with the distance, but typically the threshold increases at an increasing rate as the distance increases (e.g. increases exponentially as the distance increases), based on a function of the distance, and/or based on an arctan function.
  • the separation threshold may change (e.g. in a discrete step) from a first threshold for a first range of distances to a second threshold for a second range of distances.
  • where the sub-points are not combined (e.g. where the separation exceeds the threshold), the computer device may determine that it is appropriate to divide the point into a plurality of points.
  • points are typically formed from a combination of sub-points. While each point may, by default, be formed from the same number of sub-points, it may be possible to form points from different numbers of sub-points. In practice, this may lead to most points in the three-dimensional representation being formed of, e.g., four or sixteen sub-points, with certain points near the edges of surfaces being formed of fewer sub-points, e.g. one or four.
  • the combination of the sub-points may alternatively, or additionally, depend on a difference between the attribute values of the points, where this attribute difference may be compared to an attribute threshold, and this attribute threshold may also depend on the distance of the sub-points from the viewing zone.
  • Each of the threshold separation value and the threshold attribute difference may depend on the distance of the sub-points from the capture device.
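A minimal sketch of the distance-dependent combination test described above, assuming an arctan-shaped threshold curve with illustrative constants (the disclosure leaves the exact curve and constants open):

```python
import math

def separation_threshold(distance_from_viewing_zone, a=0.05, b=0.01, base=0.001):
    """Threshold separation below which sub-points are combined; the threshold
    grows with distance from the viewing zone. a, b and base are illustrative."""
    return base + a * math.atan(b * distance_from_viewing_zone)

def should_combine(sub_point_distances, distance_from_viewing_zone):
    """Combine the sub-points only when the maximum difference between their
    distance values is below the distance-dependent threshold."""
    separation = max(sub_point_distances) - min(sub_point_distances)
    return separation <= separation_threshold(distance_from_viewing_zone)
```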
  • One aspect of the present disclosure relates to a method of dividing the three-dimensional representation into (e.g. allocating the points of the three-dimensional representation to) a plurality of ‘froxels’ (frustum voxels), where the dimensions - in particular the depths and/or volumes - of each froxel are dependent on a distance of that froxel to the viewing zone.
  • the three-dimensional representation is divided into froxels using a coordinate system that is: based on a location of/in the viewing zone; based on a centrepoint of the viewing zone; and/or based on a capture device.
  • the method may comprise allocating the points of the three-dimensional representation into a plurality of froxels, where the depth of each froxel depends on the distance of that froxel from the centre of the viewing zone. Equally, the volume of each froxel typically depends on the distance of that froxel from the centre of the viewing zone.
  • the division of the three-dimensional representation into froxels is described as being a ‘froxelised’ space. It will be appreciated that the centrepoint of this froxelised space may be the centre of the viewing zone, may be a specific capture device, may be another location in the viewing zone, etc.
  • the froxels may equally be termed as volumes, containers, or boundaries. In general, the froxels each occupy a volume within the three-dimensional representation, which volume encompasses zero or more points of the three-dimensional representation.
  • Each froxel is a segment of space that is defined by: an inner axial boundary; an outer axial boundary; and four radial boundaries (two elevational radial boundaries and two azimuthal radial boundaries).
  • typically, the inner axial boundary and the outer axial boundary of each froxel are sections of spheres centred on the centrepoint of the viewing zone.
  • the radial boundaries of each froxel are formed by planes extending outwards from the centrepoint of the viewing zone. While Figure 9 shows a two-dimensional (plan) view of the space (with circles and lines), it will be appreciated that in practice the space is three-dimensional, so that the circles and lines of Figure 9 represent spheres and planes.
  • the froxels are typically formed by generating a plurality of axial boundaries (e.g. spheres) that are centred on the centrepoint of the viewing zone (or, more generally, based on a point within the viewing zone).
  • the distance between axial boundaries increases with distance from the centrepoint of the viewing zone. Therefore, a first axial boundary that is the closest boundary to the viewing zone has a first radius r(1), a second axial boundary that is the second closest boundary to the viewing zone has a second radius r(2), and so on, where r(n+1) - r(n) > r(n) - r(n-1) for at least some, and typically all, values of n.
  • the radius of a given axial boundary may be determined by a function of its index, r(n) = f(n). The function may, for example, be a curve; typically, the function is based on an arctan curve, e.g. r(n) = a * arctan(n * b), where a and b are constants.
  • the froxelisation is therefore similar to a process of quantisation, where the depth of froxels increases step-wise as the froxels move away from the viewing zone.
  • the rate of increase of the radius of subsequent axial boundaries increases as distance from the viewing zone 1 increases. Therefore, near to the viewing zone there is a very high density of froxels which decreases with distance. This accounts for the loss in separation ability of users at high distances, where points near to the viewing zone can be processed in smaller/shallower froxels, and therefore with more precision, than points far from the viewing zone (as described below).
  • the rate of increase of the radius of subsequent froxel boundaries may increase linearly and/or exponentially.
  • the radiuses follow a quantisation curve, where the possible values for each radius are fixed to be one of a predetermined list (as, e.g. may be set by a party generating the representation and/or may be set based on a size of the scene/representation).
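As a sketch of generating the axial boundary radii (here using a constant growth factor so that the gap between consecutive boundaries increases with distance, which is one of the options described above; an arctan-based or linear curve could be substituted, and the starting radius and growth factor are illustrative):

```python
def axial_boundaries(num_shells, first_radius=1.0, growth=1.3):
    """Radii of the spherical axial boundaries centred on the viewing zone.
    Each gap between consecutive boundaries is larger than the previous one."""
    radii = []
    radius, gap = first_radius, first_radius
    for _ in range(num_shells):
        radii.append(radius)
        gap *= growth        # the next shell is deeper than this one
        radius += gap
    return radii

# axial_boundaries(5) -> approximately [1.0, 2.3, 3.99, 6.19, 9.04]
```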
  • Such a method of dividing the three-dimensional space provides a plurality of froxels where froxels near to the viewing zone are smaller than froxels further from the viewing zone. And the size of froxels increases exponentially as the froxels move further from the viewing zone.
  • Each froxel encompasses a volume of space within the three-dimensional representation and encompasses zero or more points within that volume of space.
  • the points in each froxel are processed separately and independently (e.g. as described below). In this way, points from different froxels may be processed separately, e.g. by parallel processors. This enables a scene to be divided into froxels and then these different froxels to be processed by different computer devices so as to increase the speed of processing steps used to process the points of the three-dimensional representation.
  • the embodiment of Figure 10 shows a froxel that is determined based on a spherical coordinate system so that each froxel is determined based on an inner axial boundary; an outer axial boundary; and four radial boundaries.
  • the present disclosure may equally be applied to other coordinate systems (e.g. Cartesian systems), where the froxels may then be associated with inner and outer ‘z’ boundaries as well as two ‘x’ boundaries and two ‘y’ boundaries (or more generally the froxels may be associated with two depth boundaries, two width boundaries, and two height boundaries).
  • the present disclosure considers the determination of a plurality of froxels based on a plurality of boundaries, where a depth of the froxels increases with the distance of the froxels from a centre of the coordinate system.
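The following sketch allocates points to froxels under a spherical coordinate system centred on the viewing zone, using the illustrative axial_boundaries helper above and an assumed fixed angular step of one degree; the shell index comes from the point's radius and the radial indices from its azimuth and elevation:

```python
import math
from bisect import bisect_left
from collections import defaultdict

def froxel_index(point_xyz, boundaries, angular_step_deg=1.0):
    """Return a (shell, azimuth_bin, elevation_bin) index for a point expressed
    in coordinates centred on the centre of the viewing zone."""
    x, y, z = point_xyz
    radius = math.sqrt(x * x + y * y + z * z) or 1e-9   # guard against the exact centre
    azimuth = math.degrees(math.atan2(y, x))            # -180 .. 180
    elevation = math.degrees(math.asin(z / radius))     # -90 .. 90
    shell = bisect_left(boundaries, radius)             # index of first boundary >= radius
    return (shell,
            int(azimuth // angular_step_deg),
            int(elevation // angular_step_deg))

def allocate_points(points_xyz, boundaries):
    """Group points by froxel so that each froxel's points can be processed
    separately (e.g. in parallel, or by different computer devices)."""
    froxels = defaultdict(list)
    for point in points_xyz:
        froxels[froxel_index(point, boundaries)].append(point)
    return dict(froxels)
```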
  • Referring to Figure 11a, there is shown an exemplary froxel that encompasses a plurality of points, which points may have different attribute values, different locations, different normal values, different transparencies, etc.
  • each froxel is typically associated with a plurality of angular sections 111, 112, 113, 114, where the angular sections relate to angular sections of the froxelised space.
  • the boundaries of the first angular section of the froxel of Figure 11b may be angular lines that extend from the centre of the froxelised space at angles of, e.g. 0° and 1°.
  • the spacing of the angular lines depends on a desired angular resolution of the three-dimensional representation, which may, for example, be set by a user.
  • in some embodiments, the angular size of each froxel is set to be the same as this angular resolution so that each froxel has only a single angular section.
  • in other embodiments, each froxel is arranged to contain a plurality of angular sections, where points in different angular sections may (in some situations) be considered together.
  • each froxel is typically associated with a plurality of (discrete) (in-froxel) quantisation levels 121, 122, 123, 124, where typically the distance between the quantisation levels increases with distance from the centre of the froxelised space (e.g. based on a curve, such as an arctan curve). Therefore, there are a greater number of quantisation levels available nearer to the centre of the froxelised space, and nearer to the viewing zone, than further from the centre.
  • each froxel is associated with the same number of quantisation levels. Since froxels near to the viewing zone have a smaller depth than froxels further from the viewing zone, the use of the same number of quantisation levels for each froxel provides an implementation with a higher density of quantisation levels nearer to the viewing zone. In various implementations, each froxel may be associated with, for example, five quantisation levels, or ten quantisation levels.
  • points are processed based on the quantisation levels so that one or more of the initially captured points of the three-dimensional representation are processed to form a series of points at the available quantisation levels of the froxel.
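A hedged sketch of in-froxel quantisation, assuming the same fixed number of levels per froxel and, for simplicity, even spacing between the froxel's inner and outer axial boundaries (the disclosure also mentions curve-based, e.g. arctan, spacing):

```python
def quantisation_levels(inner_radius, outer_radius, num_levels=5):
    """Quantisation levels between a froxel's inner and outer axial boundaries.
    Five levels per froxel is one of the example counts mentioned above."""
    step = (outer_radius - inner_radius) / (num_levels + 1)
    return [inner_radius + step * (i + 1) for i in range(num_levels)]

def quantise_distance(distance, levels):
    """Snap a point's distance from the centre of the froxelised space to the
    nearest available quantisation level of its froxel."""
    return min(levels, key=lambda level: abs(level - distance))

# Example: a froxel between radii 4.0 and 6.0 with five levels.
levels = quantisation_levels(4.0, 6.0)
print(quantise_distance(5.1, levels))    # 5.0
```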
  • each of the points within each froxel is quantised (where the distance of that point is modified so as to be at a quantisation level of the froxel), where the quantisation of each point may depend on a characteristic of that point. Referring to Figure 12, there is described a method of processing points within an angular section of a froxel.
  • This method can alternatively (and more generally) be implemented as a method of processing points within a container, or a volume, of a three-dimensional representation (where the froxel is an example of such a container and the angular section within the froxel is also an example of such a container).
  • the method is carried out by a computer device, such as the image generator 11.
  • the computer device identifies a plurality of points within the angular section.
  • this comprises the computer device querying the three-dimensional representation (e.g. a point cloud) to identify a plurality of points within a region of the three-dimensional representation, that region falling within the angular section.
  • the computer device may, for example, iterate through the points of the three-dimensional representation to sort the points first into froxels and then into angular sections within these froxels (and then, optionally, into segments of the angular sections, the segments being associated with the in-froxel quantisation levels).
  • in a second step, the computer device identifies a feature of each point of the plurality of points, and in a third step 53, the computer device combines the points in dependence on these features.
  • Combining the points typically comprises determining (and storing) a new point with an attribute that is dependent on the attributes of the combined points.
  • the combined points may then be removed from the three-dimensional representation. This method therefore reduces the number of points within the representation and so reduces the size of the representation.
  • the new point may, for example, have a colour value for each of the left eye and the right eye of a user, where each of these colour values for the new point is determined to be a combination of the corresponding (left eye and right eye) colour values of the identified plurality of points.
  • Determining the attribute of the new point may comprise taking: a maximum attribute of the combined points, a minimum attribute of the combined points, an average attribute of the combined points, and/or a weighted average attribute of the combined points.
  • the new point is located at one of the quantisation levels of the froxel, where typically the combined points are located about this quantisation level and are replaced with a new point at this quantisation level that is a combination of these combined points.
  • the new point is typically located at an angle that is an angle of the angular section.
  • each initial point is determined using a capture device, and the locations of the points are typically stored initially by storing an index of the capture device, an angular identifier associated with the angle of the point from the capture device, and a distance of the point from the capture device.
  • the froxelised representation is centred on the location of a capture device, e.g. a capture device located at the centre of the viewing zone. Therefore, the location of the (new) combined points may be defined based on this capture device on which the froxelised representation is centred (e.g. the capture device located at the centre of the viewing zone). More specifically, the location of the combined point may be defined based on an index of this central capture device, an angle with respect to the central capture device (which angle is used to determine the angular section), and a distance from this central capture device.
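As an illustrative sketch of combining points into a single new point whose location is expressed relative to the central capture device (field names, per-eye colour tuples, and simple per-channel averaging are assumptions for illustration):

```python
from statistics import mean

def combine_points(points, quantisation_level, azimuth_deg, elevation_deg,
                   central_device_index=0):
    """Replace a group of points (possibly captured by different devices) with
    one new point located at the given quantisation level and at the angle of
    the angular section, defined relative to the central capture device."""
    def average_colour(key):
        # Average each colour channel across the combined points.
        return tuple(mean(channel) for channel in zip(*(p[key] for p in points)))

    return {
        "capture_device": central_device_index,   # index of the central capture device
        "azimuth_deg": azimuth_deg,               # angle of the angular section
        "elevation_deg": elevation_deg,
        "distance": quantisation_level,           # distance from the central device
        "colour_left": average_colour("colour_left"),
        "colour_right": average_colour("colour_right"),
    }
```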
  • the present disclosure considers a situation in which a new point is generated based on a combination of (the attributes of) a plurality of other points, these other points optionally being captured by different capture devices.
  • the location of the new point is defined with reference to a first capture device that is different to a second capture device used to capture one or more of the other points.
  • the method of Figure 12 has a dependency on a distance of these points from the viewing zone.
  • This dependency is, to some extent, a feature of the froxelisation in that froxels further from the viewing zone have a greater depth than froxels close to the viewing zone.
  • a further dependency on distance from the viewing zone may be considered during the third step, where the computer device may combine the points in dependence on a distance of these points from the viewing zone (e.g. in dependence on a distance of the nearest quantisation level from the centre of the froxelised space).
  • combining of the points may occur for each angular step within the froxel, where, for each of the angular steps, all points within the froxel in that angular step are replaced by a single point having the appearance of all of the combined points when viewed from the centre of the viewing zone. This may involve, for each angular step, combining the locations and/or attributes of the points in that angular step. This process can be thought of as a form of rendering the points of a froxel from the centre of the viewing zone.
  • the combining of the points may further depend on the attributes of the points and/or a difference between the attributes of the points. For example, points with similar attributes may be combined more readily than points with substantially different attributes. And the threshold level of similarity (for combining to occur) may depend on the distance of the points from the viewing zone (and/or the distance of the points from the centre of the froxelised space).
  • the combining of the points is dependent on a complexity of the points and/or of the region containing the points.
  • small differences in complex shapes, such as foliage, can typically be noticed by a user at close range but not at large range.
  • a user may be able to identify separate leaves in foliage when the foliage is near to the user; but separate leaves may be unidentifiable by this user from further away.
  • small differences in simple shapes, such as a colour change on a smooth wall of colour are typically more noticeable even at large ranges. Therefore, in some embodiments, the combining of the points is dependent on a complexity value of the points, where the computer device may be arranged to combine points that are identified as exceeding a threshold complexity, but not points below this threshold complexity.
  • the threshold complexity required for combining points may depend on the distance of the points from the viewing zone.
  • the complexity of the points may, for example, be indicated by a user, where a user may be able to define complex regions in which points may be combined. Equally, the complexity of the points may be determined using automated methods, e.g. using artificial intelligence algorithms or machine learning models. The complexity of the points may, for example, be determined by comparing patterns of points to a database of patterns, these patterns being associated with complexity levels.
  • the complexity of a region and/or of a plurality of points is determined based on one or more of: a distribution of the attributes of the points (e.g. a maximum difference between the attributes and/or a standard deviation of the attributes); a distribution of the normals of the points; and a distribution of the capture devices used to capture the points.
  • Complexity can also be linked to non-planarity of neighbouring points, where the computer device may determine the complexity based on a difference in the normals of the points and/or based on a distance of the points from a shared plane.
  • the computer device may determine a plane associated with the points (e.g. a plane that passes through the plurality of points with a minimum average distance to the points) and then determine the complexity based on the distance between these points and the plane.
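As a sketch of the non-planarity measure described above (assuming NumPy, a principal-component plane fit, and at least three points), the complexity of a group of points may be estimated from their deviation from a best-fit plane:

```python
import numpy as np

def planarity_complexity(points_xyz):
    """Estimate complexity as the root-mean-square distance of the points from
    their best-fit plane. Near-planar groups (e.g. a smooth wall) give values
    close to zero; foliage-like groups give larger values."""
    points = np.asarray(points_xyz, dtype=float)
    centred = points - points.mean(axis=0)
    # The plane normal is the direction of least variance (smallest singular value).
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    normal = vt[-1]
    signed_distances = centred @ normal
    return float(np.sqrt(np.mean(signed_distances ** 2)))
```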
  • combining a plurality of points may comprise combining a plurality of points associated with a quantisation level of the angular section. Therefore, for example, a first set of points 131, 132, 133, 134, 135 associated with a fourth quantisation level 124 may be combined into a first combined point 139 and a second set of points 141, 142, 143 associated with a third quantisation level 123 may be combined into a second combined point 149.
  • combining the points in dependence on the features comprises combining points across one or more (e.g. a plurality of) quantisation levels in dependence on these features.
  • combining the points may comprise combining the points across the entirety of the angular section.
  • where the points in the angular section are of a medium complexity and/or where the angular section is of a medium complexity, the points may be combined into a plurality of new points, where one or more of the new points is formed from points associated with a plurality of quantisation levels (e.g. points associated with each of the first quantisation level 121 and the second quantisation level 122 may be combined into a first new point and points associated with each of the third quantisation level 123 and the fourth quantisation level 124 may be combined into a second new point).
  • where the points in the angular section are of a low complexity and/or where the angular section is of a low complexity, the points may be combined into a plurality of new points, with each new point associated with a single quantisation level.
  • where the points in the angular section are of a very low complexity, the points may not be combined at all.
  • the aforementioned combining of points may then be dependent on a complexity level of the points and/or the region, where points associated with a number of quantisation levels are combined, this number of quantisation levels being dependent on the complexity level.
  • the complexity level is an exemplary feature.
  • the combining of points and/or the number of quantisation levels across which points are combined may be dependent on another feature, such as a distribution of attributes of the points, a number of points in the angular section, and/or a user input.
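A hedged sketch of turning such a feature into a combining decision, mapping a complexity value to the number of quantisation levels whose points are merged together; the thresholds are illustrative and the disclosure leaves the exact mapping open:

```python
def levels_to_merge(complexity, num_levels, high_threshold=0.05, medium_threshold=0.01):
    """Return how many quantisation levels' worth of points to combine together.
    High complexity merges across the entire angular section; very low
    complexity keeps each quantisation level separate."""
    if complexity >= high_threshold:
        return num_levels                  # combine across the whole angular section
    if complexity >= medium_threshold:
        return max(1, num_levels // 2)     # combine across groups of levels
    return 1                               # combine only within a single level
```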
  • the result of the processing may be a froxel that has a point for each of one or more of the angular sections.
  • where the first angular section 111 is a highly complex section, each of the points in the first angular section may be combined to provide a single new point 161 at the third quantisation level 123.
  • the second angular section 112 may have a lesser complexity so that the points in this angular section are combined into a second new point 162 at the fourth quantisation level 124 and a third new point 163 at the fifth quantisation level, and so on.
  • where a fourth new point 164 in the third angular section 113 and a fifth new point 165 in the fourth angular section are located on the same quantisation level (i.e. where points in adjacent angular sections are located on the same quantisation level), these points may be combined in dependence on a depth of the froxel and/or a feature of the points. So the fourth and fifth new points 164, 165 may be combined if these points have similar attributes.
  • the above embodiments have primarily considered a method of processing within a froxel that involves combining a plurality of points in this froxel in order to generate a new point. More generally, the current disclosure envisages a method of processing one or more points of a three-dimensional representation in dependence on a container comprising those points. This processing may comprise combining a plurality of points in order to generate a new point. Equally, this processing may comprise a different processing operation. For example, the processing may comprise quantising a single point in the froxel so that this point is located at a quantisation level. Equally, the processing may comprise removing and/or filtering out one or more of the points.
  • the processing (e.g. the combining) of the points may be associated with values (e.g. attribute values of the points), where these values may be compared to a threshold value. For example, points may be removed from the three-dimensional representation if an attribute value of these points is beneath a threshold value.
  • the threshold value may depend on the froxel containing the points that are being processed (e.g. so that the threshold attribute value for the points increases as the points become more distant from the centre of the froxelised coordinate system).
  • a plurality of points may be processed in dependence on a combined value associated with this plurality of points.
  • This combined value may be compared to a threshold, where e.g. the points may be combined to generate a new point in dependence on the combined value exceeding the threshold.
  • the value may be associated with one or more of: the attributes of the points; the locations of the points; the similarity of the attributes of the points; and the complexity of the points.
  • each froxel (and the group of points in each froxel) is processed separately. Therefore, a plurality of different froxels may be processed by a plurality of different components or different (e.g. separate) computer devices. This enables the froxels of the three-dimensional representation to be processed in parallel and also enables sections of the representation to be processed separately (e.g. where a user only wishes to view a portion of the representation then only the froxels relating to this portion may be processed). Essentially, (the points in) a first froxel and (the points in) a second froxel may be processed separately, e.g. at different times or by different computer devices.
  • the representation is typically arranged to provide an extended reality (XR) experience (e.g. a representation that is useable to render a XR video).
  • the term extended reality (XR) covers each of virtual reality (VR), augmented reality (AR), and mixed reality (MR) and it will be appreciated that the disclosures herein are applicable to any of these technologies.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Computer Graphics (AREA)
  • Architecture (AREA)
  • Digital Computer Display Output (AREA)
  • Image Generation (AREA)
  • Apparatus For Radiation Diagnosis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

Processing a three-dimensional representation of a scene
There is described a method of processing a three-dimensional representation of a scene, the method comprising: identifying a plurality of points of a three-dimensional representation; determining a coordinate system for the representation; determining a plurality of containers associated with the coordinate system, wherein each container covers a volume of the three-dimensional representation, and wherein a depth of each container is dependent on a distance of that container from a centre of the coordinate system; and allocating each of the plurality of points to one of the containers.

Description

Processing a three-dimensional representation of a scene
Field of the Disclosure
The present disclosure relates to methods, systems, and apparatuses for processing a three-dimensional representation of a scene and/or for determining a point of a three-dimensional representation of a scene.
Background to the Disclosure
Three-dimensional representations of environments are used in many contexts, including for the generation of virtual reality videos, in which depth information for a plurality of points of the representation is used to generate different images for a left eye and a right eye of a user. Typically, substantial processing power is required to determine such a three-dimensional representation, and the file size of files associated with these representations is typically large so that substantial amounts of storage are needed to keep the files and substantial amounts of bandwidth are required to transfer the files.
Summary of the Disclosure
According to an aspect of the present disclosure, there is described: A method of processing a three- dimensional representation of a scene, the method comprising: identifying a plurality of points of a three- dimensional representation; determining a coordinate system for the representation; determining a plurality of containers associated with the coordinate system, wherein each container covers a volume of the three- dimensional representation, and wherein a depth of each container is dependent on a distance of that container from a centre of the coordinate system; and allocating each of the plurality of points to one of the containers.
Preferably, the method comprises determining the coordinate system based on a viewing zone, preferably wherein the coordinate system is centred at a centre of the viewing zone.
Preferably, the method comprises determining the coordinate system based on a capture device used to capture one or more of the plurality of points, preferably wherein the coordinate system is centred on the capture device.
Preferably, the method comprises processing one or more of the points in dependence on a container that contains said points.
Preferably, the method comprises: identifying a first container; identifying one or more points in the first container; and processing the points in the first container.
Preferably, the method comprises: identifying a second container; identifying one or more points in the second container; and processing the points in the second container separately to the points in the first container. Preferably, processing the points separately comprises: processing the points at different times; and/or processing the points using different computer devices.
Preferably, processing the points comprises one or more of: generating a new point based on one or more points; modifying a location and/or value of the points; removing and/or filtering out one or more of the points; and assigning a new parameter to a point. Preferably, the new parameter indicates a container comprising the point.
Preferably, the method comprises processing the points in dependence on a threshold, preferably a threshold associated with the attributes of the points.
Preferably, the threshold depends on the container.
Preferably, the method comprises: identifying a first plurality of points in a first container of the plurality of containers; and processing the first plurality of points so as to generate a new point in the first container. Preferably, the method comprises identifying a plurality of points in a first container; and combining the points to generate the new point.
Preferably, combining the points comprises one or more of: combining an attribute of each point; combining a transparency of each point; and combining a distance of each point.
Preferably, the method comprises generating the new points based on one or more of: a minimum attribute value of the first plurality of points; a maximum attribute value of the first plurality of points; an average value of the first plurality of points; and a variance of the attribute values of the first plurality of points.
Preferably, the method comprises determining the coordinate system based on a capture device associated with a viewing zone. Preferably, the capture device is located at the centre of the viewing zone.
Preferably, the depth of the containers increases as the distance of the containers from the centre of the coordinate system increases. Preferably, the depth of each container is determined based on an arctan curve.
Preferably, the depth of the containers increases linearly as the distance of the containers from the centre of the coordinate system increases.
Preferably, the depth of the containers is determined based on an arctan curve.
Preferably, each container is associated with each of: an inner axial boundary; an outer axial boundary; a first radial boundary; and a second radial boundary. Preferably, the inner axial boundary and the outer axial boundary are determined based on a quantisation curve; preferably, the quantisation curve is based on an arctan curve.
Preferably, the locations of the first radial boundary and the second radial boundary are dependent on an angular resolution of the scene.
Preferably, each container is associated with one or more angular sections.
Preferably, each container is associated with one or more quantisation levels. Preferably, each container is associated with the same number of quantisation levels.
Preferably, the quantisation levels are determined based on a curve, preferably an arctan curve.
Preferably, the method comprises combining the plurality of points in dependence on a location of each point. Preferably, the method comprises combining the plurality of points in dependence on each point being within the first container. Preferably, the method comprises combining the plurality of points in dependence on each point being associated with the same quantisation level within the container. Preferably, the method comprises combining the plurality of points in dependence on each point being within the same angular section within the container.
Preferably, combining the points comprises taking one or more of: a minimum, an average, a weighted average, and a maximum of the points.
Preferably, the method comprises combining points associated with a plurality of quantisation levels of the container.
Preferably, the method comprises processing (e.g. combining) the points based on a threshold value. Preferably, the method comprises processing each point (and/or each set of points) only if a parameter value associated with that point exceeds the threshold value.
Preferably, the parameter value is determined based on one or more of: an average parameter of a set of points; a maximum parameter of the set of points; a minimum parameter of the set of points; and a variance of the parameters of the set of points.
Preferably, the parameter relates to the attributes of the points and/or the locations of the points. Preferably, the threshold value is associated with one or more of: a similarity of the points; a complexity of the points; the attribute values of the points; the container that contains the points; and the locations of the points.
Preferably, the method comprises combining the points based on a complexity value associated with the points and/or the container. Preferably, a number of quantisation levels for which points are combined is dependent on the complexity value.
Preferably, the complexity value is dependent on one or more of: a user input; an artificial intelligence algorithm; a machine learning model; and a distribution of points and/or attributes of points within a region.
Preferably, the method comprises determining a threshold complexity for combining a plurality of points. Preferably, the threshold complexity is dependent on one or more of: a container that contains the points; and a distance of the points from a viewing zone and/or from the centre of the coordinate system.
Preferably, the method comprises combining the points based on attribute values associated with the points.
Preferably, the method comprises combining the points based on a separation of the points.
Preferably, the combined points are associated with a plurality of different capture devices.
Preferably, the method comprises associating the new point with a new capture device. Preferably, the method comprises associating the new point with a new capture device at the centre of the coordinate system. Preferably, the method comprises determining and storing a distance of the new point from the new capture device.
Preferably, the three-dimensional representation is associated with a viewing zone, the viewing zone comprising a subset of the scene and/or the viewing zone enabling a user to move through a subset of the scene. Preferably, the user is able to move within the viewing zone with six degrees of freedom (6DoF).
Preferably, the viewing zone has a volume of less than 50% of the volume of the scene, less than 20% of the volume of the scene, and/or less than 10% of the volume of the scene.
Preferably, the viewing zone has, or is associated with, a volume, preferably a real-world volume, of less than five cubic metres (5 m³), less than one cubic metre (1 m³), less than one-tenth of a cubic metre (0.1 m³) and/or less than one-hundredth of a cubic metre (0.01 m³).
Preferably, the three-dimensional representation comprises a point cloud.
Preferably, the method comprises storing the three-dimensional representation and/or outputting the three- dimensional representation. Preferably, the method comprises outputting the three-dimensional representation to a further computer device.
Preferably, the method comprises generating an image and/or a video based on the three-dimensional representation.
Preferably, the method comprises forming one or more two-dimensional representations of the scene based on the three-dimensional representation. Preferably, the method comprises forming a two-dimensional representation for each eye of a viewer.
Preferably, the point is associated with one or more of: a location; an attribute; a transparency; a colour; and a size.
Preferably, the point is associated with an attribute for a right eye and an attribute for a left eye.
Preferably, the scene comprises one or more of: an extended reality (XR) scene; a virtual reality (VR) scene; an augmented reality (AR) scene; and a mixed reality (MR) scene. According to another aspect of the present disclosure, there is described a system for carrying out the aforesaid method, the system comprising one or more of: a processor; a communication interface; and a display.
According to another aspect of the present disclosure, there is described an apparatus for processing a three-dimensional representation of a scene, the apparatus comprising: means for (e.g. a processor for) identifying a plurality of points of a three-dimensional representation; means for (e.g. a processor for) determining a coordinate system for the representation; means for (e.g. a processor for) determining a plurality of containers associated with the coordinate system, wherein each container covers a volume of the three-dimensional representation, and wherein a depth of each container is dependent on a distance of that container from a centre of the coordinate system; and means for (e.g. a processor for) allocating each of the plurality of points to one of the containers.
Any feature in one aspect of the disclosure may be applied to other aspects of the invention, in any appropriate combination. In particular, method aspects may be applied to apparatus aspects, and vice versa.
Furthermore, features implemented in hardware may be implemented in software, and vice versa. Any reference to software and hardware features herein should be construed accordingly.
Any apparatus feature as described herein may also be provided as a method feature, and vice versa. As used herein, means plus function features may be expressed alternatively in terms of their corresponding structure, such as a suitably programmed processor and associated memory.
It should also be appreciated that particular combinations of the various features described and defined in any aspects of the disclosure can be implemented and/or supplied and/or used independently.
The disclosure also provides a computer program and a computer program product comprising software code adapted, when executed on a data processing apparatus, to perform any of the methods described herein, including any or all of their component steps.
The disclosure also provides a computer program and a computer program product comprising software code which, when executed on a data processing apparatus, comprises any of the apparatus features described herein.
The disclosure also provides a computer program and a computer program product having an operating system which supports a computer program for carrying out any of the methods described herein and/or for embodying any of the apparatus features described herein.
The disclosure also provides a computer readable medium having stored thereon the computer program as aforesaid.
The disclosure also provides a signal carrying the computer program as aforesaid, and a method of transmitting such a signal.
The disclosure extends to methods and/or apparatus substantially as herein described with reference to the accompanying drawings.
The disclosure will now be described, by way of example, with reference to the accompanying drawings.
Description of the Drawings
Figure 1 shows a system for generating a sequence of images.
Figure 2 shows a computer device on which components of the system of Figure 1 may be implemented.
Figure 3 shows a method of determining a three-dimensional representation of a scene.
Figures 4a and 4b show a method of determining a point based on a plurality of sub-points.
Figure 5 shows a scene comprising a viewing zone.
Figures 6a and 6b show arrangements of capture devices for determining points of the three-dimensional representation.
Figure 7 shows a point that can be captured by a plurality of capture devices.
Figures 8a and 8b show grids formed by the different capture devices.
Figure 9 shows a method of determining whether to combine a plurality of sub-points.
Figure 10 shows a coordinate system that can be used to analyse the points.
Figures 11 a - 11 d show containers of the coordinate system of Figure 10.
Figure 12 shows a method of combining a plurality of points of the three-dimensional representation.
Description of the Preferred Embodiments
Referring to Figure 1, there is shown a system for generating a sequence of images. This system can be used to generate, and then display, a representation of an environment, which may comprise a VR environment (or an XR environment).
The system comprises an image generator 11, an encoder 12, a transmitter 13, a network 14, a receiver 15, a decoder 16 and a display device 17.
These components may each be implemented on separate apparatuses. Equally, various combinations of these components may be implemented on a shared apparatus; for example, the image generator 11 , the encoder 12, and the transmitter 13 may all be part of a single image data generation device. Similarly, the receiver 15, the decoder 16, and the display device 17 may all be a part of a single image rendering device.
Typically, the system comprises at least one encoding computer device (e.g. a server of a content provider) and at least one rendering computer device (e.g. a VR headset).
Referring to Figure 2, each of the components, and in particular the image generator 11, the encoder 12, the transmitter 13, the receiver 15, the decoder 16 and the display device 17, is typically implemented on a computer device 20, where, as described above, a plurality of these components may be implemented on a shared computer device.
Each computer device comprises one or more of: a processor 21 for executing instructions (e.g. so as to perform one or more of the steps of the various methods described below); a communication interface 22 for facilitating communication between computer devices (e.g. an Ethernet interface, a Bluetooth® interface, or a universal serial bus (USB) interface); a memory 23 and/or storage 24 for storing information and instructions (e.g. a random access memory (RAM), a read only memory (ROM), a hard disk drive (HDD), a solid state drive (SSD), and/or a flash memory); and a user interface 25 (e.g. a display, a mouse, and/or a keyboard) for enabling a user to interact with the computer device. These components may be coupled to one another by a bus 25 of the computer device.
The computer device 20 may comprise further (or fewer) components. In particular, the computer device (e.g. the display device 17) may comprise one or more sensors, such as an accelerometer, a GPS sensor, or a light sensor. These sensors typically enable the computer device to identify an environmental condition and/or an action of a wearer of the display device.
Turning back to Figure 1, the image generator 11 is configured to generate a sequence of image data (e.g. a sequence of image frames) to enable the display device 17 to use this image data to display a plurality of images. The image data may comprise one or more digital objects and the image data may be generated or encoded in any format. For example, the image data may comprise point cloud data, where each point has a 3D position and one or more attributes. These attributes may, for example, include a surface colour, a transparency value, an object size and a surface normal direction. Each attribute may have a value chosen from a continuous range or may have a value chosen from a discrete set.
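By way of a non-limiting illustration only, a point of such point cloud data might be represented as in the following Python sketch; the field names and default values are assumptions chosen for readability and do not form part of the described system.

```python
from dataclasses import dataclass

@dataclass
class Point:
    """One element of the point cloud: a 3D position plus one or more attributes."""
    x: float
    y: float
    z: float
    colour: tuple = (255, 255, 255)   # surface colour (RGB)
    transparency: float = 0.0         # 0.0 = fully opaque
    size: float = 1.0                 # object size
    normal: tuple = (0.0, 0.0, 1.0)   # surface normal direction

# Example: a red, opaque point one metre in front of the origin
p = Point(x=0.0, y=0.0, z=1.0, colour=(255, 0, 0))
```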
The image data enables the later rendering of images. This image data may enable a direct rendering (e.g. the image data may directly represent an image). Equally, the image data may require further processing in order to enable rendering. For example, the image data may comprise three-dimensional point cloud data, where rendering a two-dimensional image using this data requires processing based on a viewpoint of this two-dimensional image.
The image data may comprise depth map data, where one or more pixels or objects in the image is associated with a depth that is specified by the depth map data. The depth map data may be provided as a depth map layer, separate from an image layer. In some contexts, such as MPEG Immersive Video (MIV), the image layer may instead be described as a texture layer. Similarly, in some contexts, the depth map layer may instead be described as a geometry layer.
The image data may include a predicted display window location. The predicted display window location may indicate a portion of an image that is likely to be displayed by the display device 17. The predicted display window location may be based on a viewing position (such as a virtual position and/or orientation of the user in a 3D environment) of the user, where this viewing position may be obtained from the display device. The predicted display window location may be defined using one or more coordinates. For example, the predicted display window location may be defined using the coordinates of a corner or center of a predicted display window, and may be defined using a size of the predicted display window. The predicted display window location may be encoded as part of metadata included with the frame.
The image data for each image (e.g. each frame) may include further information, which may be provided as a part of an image, e.g. as part of the point cloud data, or as separate layers. In particular, the image data may include audio information or haptic feedback information indicating audio or haptics which can accompany displayed visual data. An audio layer or haptic layer may accompany each image, and may be omitted for images where no accompanying audio or haptics are required.
Similarly, the image data may comprise interactivity information, where the image data may contain or indicate elements with which a user can interact. The interactivity information may, for example, define a behaviour of an element, where a user is able to interact with the element based on this behaviour. The behaviour typically defines a change in an element that occurs as a result of a user interaction where this change may comprise a change in the attributes of the element or in the rendering of the element. As an example, where an image contains a target element, the target element may be arranged to disappear when a user interacts with this element, or to provide feedback indicating that the user has interacted with the target. This interactivity data may be provided as part of, or separately to, the image data.
The image data may indicate, or may be combinable with, a state of the virtual environment, a position of a user, or a viewing direction of the user. Here, the position and viewing direction may be physical properties of the user in the real world, or the position and viewing direction may be purely virtual, for example being controlled using a handheld controller. The image generator 11 may, for example, obtain information from the display device 17 that indicates the position, viewing direction, or motion of the user. Equally, the image generator may generate image data such that it can later be combined with this position, viewing direction, or motion, where the image generator may generate a full scene which is only partially viewed by a user depending on the position of that user.
In some cases, the generated image may be independent of user position and viewing direction. This type of image generation typically requires significant computer resources such as a powerful GPU, and may be implemented in a cloud service, or on a local but powerful computer. For example, a cloud service (such as a content rendering network (CRN)) may reduce the cost per user and thereby make the image frame generation more accessible to a wider range of users. Here “rendering” refers at least to an initial stage of rendering to generate an image. Further rendering may occur at the display device 17 based on the generated image to produce a final image which is displayed.
The image generator 11 may, for example, comprise a rendering engine for initially rendering a virtual environment such as a game or a virtual meeting room.
The encoder 12 is configured to encode frames to be transmitted to the display device 17. The encoder may be implemented using executable software or may be implemented on specific hardware such as an ASIC. In some embodiments, the image generator 11 may transmit raw, unencoded, data through the network 14. However, such transmission typically leads to a high file size and requires a high bandwidth so that it is typically desirable to encode the data prior to the transmission.
The encoder 12 may encode the image data in a lossless manner or may encode the data in a lossy manner. The encoder may apply inter-frame or intra-frame compression based on a currently-encoded frame and optionally one or more previously encoded frames. The encoder may be a multi-layer encoder, such as a low complexity enhancement video coding (LCEVC) enabled encoder.
Where the generated frames comprise depth map data, the encoder 12 may perform layered encoding on each instance of image data (e.g. each frame) to generate an encoded frame comprising a base depth map layer and an enhancement depth map layer. Encoding a depth map in this way may improve compression. In some applications, such as HDR video, depth maps are desirably highly detailed with a bit depth of up to twelve or fourteen bits, which is a significant increase in the data to be transmitted. As a result, providing ways to improve compression of the depth map can make more realistic depth map-based displays viable when performing rendering or transmission of rendered data in real-time. Furthermore, this type of layered encoding makes it easy to drop (and then pick back up) one or more of the layers, which provides flexibility and tools for bandwidth management.
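By way of a non-limiting illustration, the following Python sketch shows one simple way a depth map could be split into a base layer and an enhancement (residual) layer; it assumes a naive downsample-and-subtract scheme and is not a description of LCEVC or any other standardised codec.

```python
import numpy as np

def encode_depth_layers(depth: np.ndarray, factor: int = 2):
    """Split a depth map into a base layer and an enhancement (residual) layer.

    Purely illustrative: the base layer is a downsampled copy of the depth map
    and the enhancement layer holds the residuals needed to recover the
    full-resolution map after naive up-sampling.
    """
    base = depth[::factor, ::factor]                      # base depth map layer
    upsampled = np.kron(base, np.ones((factor, factor)))  # naive up-sampling by replication
    upsampled = upsampled[:depth.shape[0], :depth.shape[1]]
    enhancement = depth - upsampled                       # enhancement depth map layer
    return base, enhancement

# A 12-bit depth map, reflecting the bit depths mentioned above
depth_map = np.random.randint(0, 2**12, size=(8, 8))
base_layer, enhancement_layer = encode_depth_layers(depth_map)
```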
Layered encoding is also helpful as the final decoder/user device (such as a user display device) can choose whether to process these extra layers. For example, in a non-layered approach, the best the end device (i.e. the receiver, decoder or display device associated with a user that will view the images) can do is determine that it does not have enough resources for a given quality (be it resolution, frame rate, inclusion of depth map) and then signal to the controller/renderer/encoder that it does not have enough resources. The controller then will send future images at a lower quality. In that alternative scenario, the end device still unfortunately has to process the higher quality data until the lower quality data arrives, if it can process the received images at all.
In some of the described embodiments, this situation is improved upon because when/if the end device determines for example that it does not have the processing capabilities to handle the highest level of quality, then it can drop and/or choose not to process certain layers. The end device may also signal to the controller that it needs a lower level of quality, but in the meantime the end device can only process the number of layers that it can handle. Therefore, the end device can react to conditions much more quickly.
In some cases, depth map data may be embedded in image data. In this case, the base depth map layer may be a base image layer with embedded depth map data, and the enhancement depth map layer may be an enhancement image layer with embedded depth map data.
Alternatively, when the generated images comprise a depth map layer separate from an image layer and multi-layer encoding is applied, the encoded depth map layers may be separate from the encoded image layers. This has the advantage that the encoded depth map layers can be dropped under some conditions while still retaining image layers that can be displayed (albeit with a lower level of realism). For example, the encoded depth map layers can be dropped by a transmitter or encoder when available communication resources are reduced, or can be dropped by an end device which lacks the processing resources to handle the highest level of quality. Similarly, if some images comprise an audio base layer, a haptic feedback base layer, an audio enhancement layer or a haptic feedback enhancement layer, these can be processed or dropped flexibly.
Again similarly, if some images comprise an interactivity data base layer or an interactivity enhancement layer these can be processed or dropped flexibly. For example, certain interactions may only be possible where a threshold bandwidth is available, where complex interactions (e.g. those enabling a conversation with a digital object) may be disabled before less complex interactions (e.g. changing a pixel colour) are disabled.
Additionally or alternatively, where the image data comprises point cloud data, the encoder may apply a point cloud data encoding technique such as described in European patent application EP21386059.6, which is incorporated herein by reference. Such a point cloud encoder may act as a base encoder for a layered encoding technique such as LCEVC or VC-6. Notably LCEVC and VC-6 techniques encode and decode a layered signal, but are agnostic about the content type of data encoded in the signal. For example, the signal can include textures, video frames, geometry or depth data, meshes, point clouds, rendering attributes or physics engine attributes.
The transmitter 13 may be any known type of transmitter for wired or wireless communications, including an Ethernet transmitter or a Bluetooth transmitter.
The transmitter 13 may be configured to make decisions about how to transmit the image data, and/or may provide feedback to the encoder 12 or the image generator 11 . For example, the transmitter may determine available communication resources (e.g. bandwidth) for transmitting image data, and may drop one or more layers from an encoded frame, or indicate to the image generator and/or encoder that image data should be generated and encoded with fewer layers, when insufficient bandwidth is available for transmission of all generated data. As specific examples, the transmitter may be configured to drop a depth map layer, an LCEVC enhancement layer, or a VC-6 enhancement layer from a frame when insufficient communication resources are available.
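As a non-limiting sketch of such a layer-dropping decision, the following Python fragment drops optional layers in an assumed priority order until the frame fits within the available bandwidth; the layer names, priorities, and sizes are purely illustrative assumptions.

```python
def select_layers(layers: dict, available_bandwidth: float) -> dict:
    """Drop optional layers, lowest priority first, until the frame fits.

    `layers` maps a layer name to (priority, size_in_bits); lower priority
    numbers are dropped first, and the base layer is never dropped.
    """
    kept = dict(layers)
    total = sum(size for _, size in kept.values())
    for name, (_, size) in sorted(kept.items(), key=lambda kv: kv[1][0]):
        if total <= available_bandwidth or name == "base":
            break
        del kept[name]        # drop this enhancement layer
        total -= size
    return kept

frame_layers = {
    "base": (100, 2_000_000),
    "depth_enhancement": (1, 600_000),
    "lcevc_enhancement": (2, 800_000),
}
print(select_layers(frame_layers, available_bandwidth=2_500_000))
```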
The network 14 provides a channel for communication between the transmitter 13 and the receiver 15, and may be any known type of network such as a WAN or LAN or a wireless Wi-Fi or Bluetooth network. The network may further be a composite of several networks of different types. Many users only have access to a network with a bandwidth of 30 MBps, which can lead to latency jitter when streaming. The required bandwidth and the observed latency can be reduced by means of tactics such as forward-looking rendering and last-millisecond reprojection, which are enabled by improved compression.
The receiver 15 may be any known type of receiver for wired or wireless communications, including an Ethernet receiver or a Bluetooth receiver.
The decoder 16 is configured to receive and decode an encoded frame. The decoder may be implemented using executable software or may be implemented on specific hardware such as an ASIC.
The display device 17 may for example be a television screen or a VR headset. The timing of the display may be linked to a configured frame rate, such that the display device may wait before displaying the image. The display device may be configured to perform warping, that is, to obtain a final display window location, adjust a warpable image to obtain a final image corresponding to a final viewing direction of the user, and display the final image.
In this regard, the image data is typically arranged to provide a warpable image for which a portion of the image that is displayed at the display device 17 is dependent on a position or orientation of a viewer. The warpable image may then be rendered before a most up-to-date viewing direction of the user is known. The warpable image may be transmitted to the display device, or the warpable image may be transmitted to a rendering node which is near to the display device, and the display device or rendering node may perform time warping to generate a displayed image portion based on the warpable image and the most up-to-date viewing direction of the user.
As mentioned above, a single device may provide a plurality of the described components. For example, a first rendering node may comprise the image generator 11 , encoder 12 and transmitter 13. Additional similar rendering nodes may be included in the system, and may work together to generate the sequence of frames.
In one case, multiple rendering nodes may each provide separate image data to an image data assembling node; for example, each rendering node may provide a part of a sequence of frames to a frame assembling node.
For example, the receiver 15, decoder 16 or display device 17 may be configured to assemble parts of image data from multiple sources to generate a sequence of images for display on the display device.
Alternatively, the image data assembling node may be separate from the receiver 15, decoder 16 and display device 17.
Additionally or alternatively, multiple rendering nodes may be chained. In other words, successive rendering nodes may add to a sequence of image data as it passes from rendering node to rendering node, and eventually a complete sequence of image data is then provided to the receiver 15. Furthermore, each rendering node may obtain components of a render from multiple upstream rendering nodes and/or distribute components of a render to multiple downstream rendering nodes.
A chain of rendering nodes may be useful for performing different rendering tasks that require different quantities of processing resources, or different frame rates. For example, a company may provide distributed processing in the form of a centralised hub which has abundant processing resources but is distant from users, and peripheral locations which have more scarce processing resources but are closer to users. Expensive but fairly static rendering features such as background lighting or environmental impact on sound may be generated at the central hub (for example using ray tracing), while features that require fewer resources but faster responses or higher frame rates may be generated closer to the user. In other words, the more responsive a rendering feature needs to be, the lower the latency it needs between the rendering node which generates the feature and the user display and, in a chain of rendering nodes, the node which generates each rendering feature can be chosen based on a required maximum latency of that feature. On the other hand, if it is expensive to generate a rendering feature, then it may be preferable to generate the feature less frequently and with a higher maximum latency. For example, a static, high-quality background feature may be generated early in the chain of rendering nodes and a dynamic, but potentially lower-quality, foreground feature may be generated later in the chain of rendering nodes, closer to the user device.
Here, environmental impact on sound means, for example, that a set of surfaces may be constructed where each surface has different sound reflection and absorption properties depending upon material and shape. The frame rates may be matched by creating multiple frames with features generated at the lower frame rate, and combining them with the frames with features generated at the higher frame rate. In a non-limiting embodiment, a preliminary rendering generates volumetric object data including motion vectors at a first (lowest) frame rate, then produces 2D rendered frames plus depth information for a specific user at a second (higher) frame rate, then transmits video plus depth data to the user device, which produces final frames for display via space warping (depth-based reprojections) at a third (highest) frame rate. One or more of these steps may be performed in combination with the other described embodiments.
The viewing position of the user may change as additional rendering tasks are performed at different rendering nodes in the chain. Each or any rendering node may obtain an updated viewing position before performing its respective rendering task.
Additionally, the system may simultaneously generate multiple sequences of image data for different respective users or different respective display devices. For example, in the context of a VR or AR experience, each user or display device may view a different 3D environment, or may view different parts of a same 3D environment. When using a chain of rendering nodes, each node may serve multiple users or just one user.
For example, a starting rendering node (e.g. at a centralised hub) may serve a large group of users. For example, the group of users may be viewing nearby parts of a same 3D environment. In this case, the starting node may render a wide zone of view (“field of view”) which is relevant for all users in the large group.
The starting node may send this wide field of view to a first middle rendering node which renders additional aspects of the 3D environment. These additional aspects may for example be aspects which require less processing power to render, or may be aspects which are specific to individual users of the group. Additionally, the middle rendering node may render features in a smaller field of view than the starting node - this smaller field of view may be relevant to each user rather than the group of users. The first middle rendering node may additionally only serve a smaller number of users (e.g. half of the large group of users), with the remaining users being served by a second middle rendering node which also receives the wide field of view from the starting node.
The middle rendering node(s) may then send sequences of second partially or fully rendered frames to an end device for each user. The end device may perform further processes such as warping or focal distance adjustments, optionally using depth map data.
Preferably, each rendering node encodes the partially or fully rendered frames before transmitting them on to a next rendering node or to the receiver 15. This means that the required communication resources can be reduced when the rendering nodes are separated by one or more networks, or more generally are implemented in a distributed system such as a cloud.
However, each rendering node in a chain is encoding a different partially or fully rendered frame, with different data. Therefore, it may be advantageous for different rendering nodes to use different rendering formats and/or encoding formats. For example, the output from a first rendering node may be point cloud data which logically describes a 3D scene. This point cloud data can be encoded using the techniques of EP21386059.6. A second rendering node may then operate on the point cloud data to generate image data that is more readily displayed by a generic display device, without requiring the display device to model the 3D environment. This image data may be encoded using video coding techniques.
The chaining of rendering nodes may be extended to arbitrary tree structures, where a rendering node obtains partially rendered frames from more than one preceding rendering node, and generates further partially or fully rendered frames based on the multiple obtained sequences of partially rendered frames.
For example, a content rendering network (CRN) comprising numerous rendering nodes may be used to serve a volumetric event to a large number of same-time users, such as users participating in a shared virtual environment. Rendering the same event for each user is far more expensive in terms of computation time and power consumption than rendering the volumetric event once and performing the rendering equivalent of multicasting the volumetric event for multiple users. For example, each user may have a second rendering node (such as a VR headset), and the network may comprise a central first rendering node. The first rendering node may render the volumetric event, and distribute partially rendered frames depicting the volumetric event to the different second rendering nodes. The second rendering node for each user may then integrate the partially rendered frames depicting the volumetric event into a view of the virtual environment which is currently being shown to each user, based on parameters such as the user’s virtual position.
The receiver 15, decoder 16 and display device 17 may be consolidated into a single device, or may be separated into two or more devices. For example, some VR headset systems comprise a base unit and a headset unit which communicate with each other. The receiver 15 and decoder 16 may be incorporated into such a base unit.
In some embodiments, the network 14 may be omitted. For example, a home display system may comprise a base unit configured as an image source, and a portable display unit comprising the display device 17.
In the event that the decoder 16 or the display device 17 does not or cannot handle one or more layers, the receiver 15 or another transmitter associated with the decoder or display device may send a corresponding layer drop indication back through the network 14. The layer drop indication may be received by each rendering node. A rendering node which generates partially or fully rendered frames for that specific decoder or display device may cease generating the dropped layer. On the other hand, a rendering node which generates partially or fully rendered frames for multiple end devices may disregard a layer drop indication received from one end device (as the dropped layer is still needed for other devices). Alternatively, rendering nodes which serve multiple end devices may record received layer drop indications, and may cease generating the dropped layer only when all end devices served by the rendering node indicate that the layer is to be dropped.
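A non-limiting Python sketch of how a rendering node serving multiple end devices might record layer drop indications is given below; the device and layer names are illustrative assumptions.

```python
class LayerDropTracker:
    """Records layer drop indications from end devices served by a rendering node.

    A layer is only dropped once every served device has indicated that it
    does not need that layer.
    """
    def __init__(self, served_devices):
        self.served = set(served_devices)
        self.drops = {}   # layer name -> set of devices that dropped it

    def record_drop(self, device_id, layer_name):
        self.drops.setdefault(layer_name, set()).add(device_id)

    def should_generate(self, layer_name) -> bool:
        # Keep generating the layer unless all served devices have dropped it.
        return self.drops.get(layer_name, set()) != self.served

tracker = LayerDropTracker(served_devices={"headset_a", "headset_b"})
tracker.record_drop("headset_a", "depth_enhancement")
assert tracker.should_generate("depth_enhancement")       # headset_b still needs it
tracker.record_drop("headset_b", "depth_enhancement")
assert not tracker.should_generate("depth_enhancement")   # now safe to stop generating it
```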
In preferred examples, the encoders or decoders are part of a tier-based hierarchical coding scheme or format. Hierarchical coding enables frames to be communicated with higher resolution and/or higher frame rate than is possible in single-tier coding schemes. In hierarchical coding, one or more enhancement layers is communicated with base data, where the enhancement layers can be used to up-sample the base data at the decoder, for example providing up-sampling in a spatial or temporal dimension. When combined with equivalent down-sampling of the original frames and generation of the enhancement layer at an encoder, hierarchical coding can overall provide lossless compression of data, with higher resolution and/or higher frame rate for a given transmission bit rate. Examples of a tier-based hierarchical coding scheme include LCEVC: MPEG-5 Part 2 LCEVC (“Low Complexity Enhancement Video Coding”) and VC-6: SMPTE VC-6 ST-2117, the former being described in PCT/GB2020/050695, published as WO 2020/188273, (and the associated standard document) and the latter being described in PCT/GB2018/053552, published as WO 2019/111010, (and the associated standard document), all of which are incorporated by reference herein. However, the concepts illustrated herein need not be limited to these specific hierarchical coding schemes.
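For illustration only, the decoder-side principle of tier-based coding (up-sample the base data, then add the enhancement residuals) can be sketched as follows; the simple replication up-sampler used here is an assumption and does not correspond to the up-samplers defined by LCEVC or VC-6.

```python
import numpy as np

def decode_tiered(base: np.ndarray, enhancement: np.ndarray, factor: int = 2) -> np.ndarray:
    """Reconstruct a full-resolution plane from base data and enhancement residuals.

    Illustrative only: the base layer is up-sampled by simple replication and
    the enhancement-layer residuals are then added back.
    """
    upsampled = np.kron(base, np.ones((factor, factor), dtype=base.dtype))
    return upsampled + enhancement

base = np.array([[10, 20], [30, 40]])
residuals = np.zeros((4, 4), dtype=int)
residuals[0, 0] = 3                      # a single corrected sample
print(decode_tiered(base, residuals))
```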
A further example is described in WO2018/046940, which is incorporated by reference herein. In this example, a set of residuals are encoded relative to the residuals stored in a temporal buffer.
LCEVC (Low-Complexity Enhancement Video Coding) is a standardised coding method set out in standard specification documents including the Text of ISO/IEC 23094-2 Ed 1 Low Complexity Enhancement Video Coding published in November 2021 , which is incorporated by reference herein.
The system described above is suitable for generating and presenting a representation of a scene, where this scene displays media content to a user. The scene typically comprises an environment, where the user is able to move (e.g. to move their head or to turn their head) to look around the environment and/or to move around the environment. For example, the scene may be a scene of a room in a building, where the user is able to move around the room (e.g. by moving in the real world and/or by providing an input to a user interface) in order to inspect various parts of the room. Typically, the scene is an XR (e.g. a VR) scene, where the user is able to move about the scene in three degrees of freedom (3DoF) or six degrees of freedom (6DoF) so as to experience the scene.
As has been described with reference to Figure 1, the image generator 11 may be arranged to determine point cloud data, where each point of the point cloud has a 3D position and one or more attributes. More generally, the image generator (or another component) is arranged to determine a three-dimensional representation of a scene, where this three-dimensional representation is thereafter used to generate two-dimensional images that are presented to a user at the display device 17. While the points are typically points of a point cloud, more generally the disclosure extends to any point that is associated with a location and a value. Therefore, the points may, more generally, be considered to be data (or datapoints), which data is associated with a location and a value, and the ‘points’ may comprise polygons, planes (regular or irregular), Gaussian splats, etc.
Referring to Figure 3, there is described a method of determining (an attribute for) a point of such a three-dimensional representation. The method comprises determining the attribute using a capture device, such as a camera or a scanner. The scene may comprise a real scene, in which attribute values are captured using a camera, or a virtual scene (e.g. a three-dimensional model of a scene), in which attribute values are captured using a virtual scanner.
Where this disclosure describes ‘determining a point’ it will be understood that this generally refers to determining a point that has a location and an attribute value, where determining the point comprises determining the attribute value and/or storing a point that comprises at least an attribute value and a location value (these values may be indirect values, e.g. where the location is identified relative to another point). Once a plurality of points have been captured, these points can be stored as a three-dimensional representation (e.g. a point cloud) so as to enable the reconstruction of the three-dimensional scene based on this representation.
Typically, the scene comprises a simulated scene that exists only on a computer. Such a scene may, for example, be generated using software such as the Maya software produced by Autodesk®. The attributes determined using the methods described herein may then depend on virtual objects located within the scene as well as a virtual lighting arrangement used in the scene.
In a first step 31 , a computer device initiates a capture process for a capture device, the capture process being initiated with an initial azimuth angle (e.g. of 0°) and an initial elevation angle (e.g. of 0°).
In a second step 32, the computer device causes a point to be captured using the capture device at the current azimuth angle and current elevation angle. Capturing a point typically comprises assigning an attribute value to the point, which attribute value may, for example, be a colour of the point and/or a transparency value of the point. Typically, the point has one or more colour values associated with each of a left eye and a right eye of a viewer. Capturing the point may also comprise determining a normal value associated with the point, e.g. a normal of a surface on which the point lies. Typically, capturing the point further comprises determining a location of the point, e.g. by determining a distance of the point from the camera.
In practice, determining the point may comprise sending a ‘ray’ from the capture device and then stepping through a computer model to determine which surface of the computer model is impacted by the ray. The colour, transparency, and normal of this surface are then recorded alongside the distance of the surface from the capture device.
In a third step 33, the computer device determines whether a point has been captured for the capture device at each azimuth of a range of azimuths, and in a fourth step 34, if points have not been captured at each azimuth, then the azimuth angle is incremented and the method returns to the second step 32 and another point is captured. The azimuth angle may, for example, be incremented by between 0.01° and 1° and/or by between 0.025° and 0.1°. Typically, the range of azimuth angles is selected to be 360° (i.e. so that the capture device captures points surrounding the entirety of the capture device), but it will be appreciated that other ranges are possible.
Once a point has been captured for each azimuth, in a fifth step 35, the computer device determines whether a point has been captured for the capture device at each elevation of a range of elevations, and in a sixth step 36, if points have not been captured at each elevation, then the azimuth angle is reset to the initial value, the elevation angle is incremented, and the method returns to the second step 32 and another point is captured. The elevation angle may, for example, be incremented by between 0.01° and 1° and/or by between 0.025° and 0.1°. Typically, the range of elevation angles is selected to be 360° (i.e. so that the capture device captures points surrounding the entirety of the capture device), but it will be appreciated that other ranges are possible.
In a seventh step 37, once points have been captured for each azimuth angle and each elevation angle, the scanning process ends.
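A compact, non-limiting sketch of this scan loop is given below; capture_point() is a hypothetical routine standing in for the ray-cast described above, and the default angle increments are taken from the exemplary ranges mentioned in the preceding steps.

```python
def scan_scene(capture_point, azimuth_step=1.0, elevation_step=1.0):
    """Capture a point for every azimuth/elevation combination (Figure 3).

    `capture_point(azimuth, elevation)` is a hypothetical routine that returns
    the captured point data (attribute, normal, distance) for one direction.
    """
    matrix = []
    elevation = 0.0
    while elevation < 360.0:                 # steps 35/36: cover all elevations
        row = []
        azimuth = 0.0
        while azimuth < 360.0:               # steps 33/34: cover all azimuths
            row.append(capture_point(azimuth, elevation))   # step 32
            azimuth += azimuth_step
        matrix.append(row)
        elevation += elevation_step
    return matrix                            # step 37: scan complete

# Example with a dummy capture routine that simply records the direction
matrix = scan_scene(lambda az, el: (az, el))
```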
This method enables a capture device to capture points at a range of elevation and azimuth angles. This point data is typically stored in a matrix. The point data may then be used to provide a representation of the scene to a user, e.g. the three-dimensional representation formed by the point data may be processed to produce two-dimensional images for each eye of a user, with these images then being shown to a user via the display device 17 to provide a virtual reality experience to the viewer. By using the captured data, a video can be provided to a viewer that enables the viewer to move their head to look around the scene (while remaining at the location of the capture device).
It will be appreciated that the capture pattern (or scanning pattern) described with reference to Figure 3 is purely exemplary and that numerous capture patterns are possible. In general, the capture process for each capture device comprises capturing one or more points at one or more azimuth angles and/or one or more elevation angles.
The ‘points’ captured by the capture device are typically associated with a size, such as a height, a width, or a depth. That is, the points typically relate to two-dimensional planes/pixels and/or three-dimensional voxels. In this regard, there is necessarily some space between the locations of adjacent points (since if the points had no width, then an infinite number of points would be required to capture points at each angle). The size provides points that depict a non-negligible area of the three-dimensional space so that a plurality of points can be fit together to provide a depiction of the scene to a viewer.
The width and height of each point is typically dependent on the distance of that point from the capture device, where more distant points have a larger width/height. The width and height of each point is typically determined so that when each point is displayed, there is no space between adjacent points (indeed, there may be some overlap between points to ensure that no gaps appear between points). This height/width of each point can be determined at the time of capturing the points, or can be determined or defined after the capture of the points.
Typically, the points comprise a size value, which is stored as a part of the point data. For example, the points may be stored with a width value and/or a height value. Typically, the minimum width and the minimum height of a point are set by the angle increment of the azimuth angle and the elevation angle respectively. The size may then be specified in terms of this angle increment and/or in terms of this minimum width/minimum height (e.g. as being a multiple of the angle increment). In some embodiments, the size value is stored as an index, which index relates to a known list of sizes (e.g. if the size may be any of 1x1, 2x1, 1x2, or 2x2 pixels, this may be specified by using 3 bits and a list that relates each combination of bits to a size).
The size may be stored based on an underscan value. In this regard, where an object is very near to the viewing zone it may be captured using an unnecessarily dense arrangement of points. Therefore, certain surfaces or areas of the representation may be associated with an underscan value, which underscan value defines a reduction in the number of points captured as compared to a representation without underscan. The size of the points may be defined so as to indicate this underscan value. In an exemplary embodiment, the underscan value is an integer value between 0 and 3 and the size is stored as a combination of point dimensions (e.g. a width in the range [0,2] and a height in the range [0,2]) and an underscan factor (e.g. an underscan factor in the range [0,3]). In some embodiments, the width and the height are dependent on the underscan factor. For example, when the underscan factor exceeds a threshold value, the possible height and width values may be limited. In a specific example, when the underscan factor is 3, the width and the height may be limited to the range [0,1]. The size may then be defined as size = underscan*9 + height*3 + width. Such a method provides efficient storage and indication of width, height, and underscan values.
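For illustration, the packing of width, height, and underscan values into a single size index according to the size = underscan*9 + height*3 + width scheme can be sketched as follows; the helper names are assumptions.

```python
def pack_size(width: int, height: int, underscan: int) -> int:
    """Encode width, height (each in [0,2]) and underscan (in [0,3]) as one index.

    Implements the size = underscan*9 + height*3 + width scheme described above.
    """
    assert 0 <= width <= 2 and 0 <= height <= 2 and 0 <= underscan <= 3
    if underscan == 3:
        # As noted above, a high underscan factor may restrict width and height.
        assert width <= 1 and height <= 1
    return underscan * 9 + height * 3 + width

def unpack_size(size: int) -> tuple:
    """Recover (width, height, underscan) from the packed size index."""
    underscan, rest = divmod(size, 9)
    height, width = divmod(rest, 3)
    return width, height, underscan

assert unpack_size(pack_size(2, 1, 1)) == (2, 1, 1)
```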
As shown in Figure 4a, typically, for each capture step (e.g. each azimuth angle and/or each elevation angle), a plurality of sub-points SP1, SP2, SP3, SP4, SP5 is determined. For example, where the azimuth angle increment is 0.1°, then for an azimuth angle of 0°, sub-points may be determined at azimuth angles of -0.05°, -0.025°, 0, 0.025°, and 0.05° (and similar sub-points may be determined for a plurality of elevation angles). Attribute values of these sub-points may then be combined to obtain an attribute value for the point. For example, a maximum attribute value of the sub-points may be used as the value for the point, an average attribute value of the sub-points may be used as the value for the point, and/or a weighted average of the sub-points may be used as the value for the point. It will be appreciated that numerous other methods for combining the attribute values of the sub-points are possible.
By determining the attribute of a point based on the attributes of sub-points, the accuracy of the capture process can be increased. While it would be possible to simply reduce the increment of the angle steps to provide a higher resolution scene, by considering sub-points but only storing attributes for points, a balance can be struck between accuracy and file size (since storing every sub-point would lead to a substantial increase in the amount of data that needs storing).
With the example of Figure 4a, for each point of the three-dimensional representation that is captured by a capture device, this capture device may obtain attributes associated with each of the sub-points SP1 , SP2, SP3, SP4, SP5, combine these attributes to obtain a point attribute, and then store a point with a distance that is an average (e.g. a weighted average) of the distances of the sub-points from the capture device, at the nominal angle of the point, with the point attribute.
As shown in Figure 4b, where a plurality of sub-points SP1 , SP2, SP3, SP4, SP5 are considered, these points may have different distances from the location of the capture device. In some embodiments, the attributes of the sub-points may be combined in dependence on this distance, e.g. so that sub-points nearer to the capture device have higher weightings.
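A non-limiting Python sketch of one such combination scheme is given below, in which sub-points nearer to the capture device receive higher weightings; the inverse-distance weighting is an illustrative assumption, and many other weightings are possible.

```python
def combine_subpoints(subpoints):
    """Combine captured sub-points into a single point (one possible scheme).

    Each sub-point is a (colour, distance) pair; nearer sub-points get higher
    weightings, and the point distance is the weighted average of the sub-point
    distances.
    """
    weights = [1.0 / max(d, 1e-6) for _, d in subpoints]
    total = sum(weights)
    colour = tuple(
        sum(w * c[i] for (c, _), w in zip(subpoints, weights)) / total
        for i in range(3)
    )
    distance = sum(w * d for (_, d), w in zip(subpoints, weights)) / total
    return colour, distance

# Five sub-points SP1..SP5: lighter surfaces near the front, darker ones further back
subs = [((255, 255, 255), 0.50), ((255, 255, 255), 0.51),
        ((200, 200, 200), 0.52), ((40, 40, 40), 0.90), ((40, 40, 40), 0.92)]
print(combine_subpoints(subs))
```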
However, the possibility of sub-points with substantially different distances raises a potential problem. Typically, in order to determine a distance for a point, the distances for the sub-points are averaged. But where the sub-points have substantially different distances and/or are related to different surfaces in the scene, this may result in the point having a distance that does not correspond to any actual surface in the scene. Therefore, the point may seem to hang in space (e.g. to hang between the front and rear surfaces shown in Figure 4b).
Similarly, where the attribute values of the sub-points greatly differ, e.g. if the sub-points SP1 and SP2 are white in colour and the sub-points SP3 and SP4 are black in colour, then the attribute value of the point may be substantially different to the attribute value of other points in the scene. In an example, if the scene were composed of black and white objects, the point may appear as a grey point hanging in space between these objects.
In some embodiments, the computer device is arranged to aggregate sub-points so as not to create any floating points. For example, the computer device may determine whether the sub-points are spatially coherent by employing a clustering algorithm (e.g. a k-means clustering algorithm). Where the sub-points are spatially coherent (e.g. where a difference in the distance of the sub-points is below a threshold value), these distances may be averaged to obtain a distance for the point. Where the sub-points are not spatially coherent, the sub-points may be processed to ensure that the distance of any point places it upon a surface; for example, in the system of Figure 4b, sub-points SP1 , SP2, and SP3 may be grouped into a first point and sub-points SP4 and SP5 may be grouped into a second point. Since each sub-point is associated with the same capture device and capture angle (all of these sub-points being associated with a capture step that has a particular azimuth angle and elevation angle), these points may be located at the same angle with respect to a capture device. Therefore, to ensure that each sub-point affects the representation considered, the first point (made up of sub-points SP1 , SP2, and SP3) may have a smaller distance value than the second point (made up of sub-points SP4 and SP5) and the first point may be assigned a nonzero transparency value so that the second point can be seen through the first point.
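By way of a non-limiting illustration, the following sketch groups sub-points into spatially coherent clusters using a simple distance-gap threshold rather than a full k-means clustering; the threshold value is an assumption.

```python
def group_subpoints(distances, threshold=0.1):
    """Group sub-point distances into spatially coherent clusters.

    A simplified stand-in for the clustering step described above: sub-points
    are sorted by distance and a new group is started whenever the gap to the
    previous sub-point exceeds `threshold` (in scene units).
    """
    groups = []
    for d in sorted(distances):
        if groups and d - groups[-1][-1] <= threshold:
            groups[-1].append(d)
        else:
            groups.append([d])
    return groups

# Sub-points SP1-SP3 lie on a front surface, SP4-SP5 on a rear surface
print(group_subpoints([1.00, 1.02, 1.05, 3.40, 3.42]))
# -> [[1.0, 1.02, 1.05], [3.4, 3.42]]: two points, with the nearer one given
#    a nonzero transparency so the further one remains visible
```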
By capturing points at a plurality of azimuth angles and elevation angles, e.g. using the method described with reference to Figure 3, it is possible to provide a three-dimensional representation of the scene that can later be used to enable a viewer to view the scene from a plurality of angles. More specifically, given the three-dimensional points captured by the capture device, a computer device is able to render a two-dimensional representation (e.g. a two-dimensional image) of the scene for each eye of a viewer so as to provide a representation with an impression of depth. The computer device may render a series of two-dimensional representations to enable the viewer to look around the scene, where the two-dimensional representations are rendered based on an orientation of the viewer’s head. In this way, the determined representation is useable to provide, for example, a virtual reality (VR), mixed reality (MR), augmented reality (AR), and/or extended reality (XR) experience to the viewer.
To enable such a display, the display device 17 is typically a virtual reality headset that comprises a plurality of sensors to track a head movement of the user. By tracking this head movement, the display device is able to update the images being displayed to the viewer as the viewer moves their head to look about the scene. Typically, this involves the display device sending the sensor data to an external computer device (e.g. a computer connected to the display device via a wire). The external computer device may comprise powerful graphics processing units (GPUs) and/or central processing units (CPUs) so that the external computer device is able to rapidly render appropriate two-dimensional images for the viewer based on the three-dimensional representation and the sensor data. It will be appreciated that the use of a combination of a headset and an external device is exemplary. More generally, the processing of data and the rendering of images may be performed by various computer devices; for example, a standalone virtual reality headset may be provided, which headset is capable of processing data and rendering images without any connection to an external computer device.
In some embodiments, the external computer device may comprise a server device, where the display device 17 may be connected to this server device wirelessly. This enables the two-dimensional images to be streamed from the server to the display device so as to enable the display of high-quality images without the need for a viewer to purchase expensive computer equipment. In other words, operations that require large amounts of computing power, such as the rendering of two-dimensional images based on the three- dimensional representation, may be performed by the server, so that the display device is only required to perform relatively simple operations. This enables the experience to be provided to a wide range of viewers.
In some embodiments, a first two-dimensional image is provided to the display device 17 (and/or a connected device) and this first image is ‘warped’ in order to provide an image for viewing at the display device. The warping of the image comprises processing the image based on the sensor data in order to provide an image that matches a current viewpoint of the viewer. By performing the warping at the display device or another local device, the lag between a head movement of the user and an updating of the two-dimensional representation of the scene can be reduced.
One issue with the above-described method of capturing a three-dimensional representation is that it only enables a viewer to make rotational movements. That is, since the points are captured using a single capture device at a single capture location, there is no possibility of enabling translational movements of a viewer through a scene. This inability to move translationally can induce motion sickness within a viewer, can reduce a degree of immersion of the viewer, and can reduce the viewer’s enjoyment of the scene. However, enabling a viewer to move translationally through a three-dimensional representation that has been captured using a single capture device would lead to holes in the scene wherever a viewer moves away from the capture location of this single capture device (since this movement will cause parts of the scene that were not captured by the capture device to come into view of the viewer).
Therefore, it is desirable to enable translational movements through the scene while avoiding the display of these holes. To enable such movements, the three-dimensional representation of the scene may be captured using a plurality of capture devices placed at different locations (or the same capture device placed at different locations). A viewer is then able to move around the scene translationally (e.g. by moving between these locations).
More generally, by capturing points for every possible surface that might be viewed by a viewer, a three-dimensional representation of a scene may be captured that allows a suitable two-dimensional representation of this scene to be rendered regardless of a location of a viewer (e.g. regardless of where a user is standing within a virtual room).
This need to capture points for every possible surface (so as to enable movement about a scene) greatly increases the amount of data that needs to be stored to form the three-dimensional representation.
Therefore, as has been described in the application WO 2016/061640 A1 , which is hereby incorporated by reference, the three-dimensional representation may be associated with a viewing zone, a zone of view (ZOV), or a zone of viewpoints (ZVP), where the three-dimensional representation is arranged to enable a user to move about the viewing zone so as to view the scene.
Figure 5 illustrates such a viewing zone 1 and illustrates how the use of a viewing zone limits the amount of image data that needs to be stored to provide a three-dimensional representation of the scene. With the scene shown in this figure, and the viewing zone 1 shown in this figure, it is not necessary to determine attribute data for the occluded surface 2 since this occluded surface cannot be viewed from any point in the viewing zone. Therefore, by enabling the user to only move within the viewing zone (as opposed to around the whole scene) the amount of data needed to depict the scene is greatly reduced, while still enabling the user to move to some extent and thereby to avoid the motion sickness that can be induced by augmented reality scenes with only three degrees of freedom.
While Figure 5 shows a two-dimensional viewing zone, it will be appreciated that in practice the viewing zone 1 is typically a three-dimensional zone or volume.
The viewing zone 1 may, for example, comprise a rectangular volume, or a rectangular parallelepiped, and the viewing zone may have a height of at least 30 cm, a depth of at least 30 cm, and/or a width of at least 30 cm, where these dimensions enable a user to move their head while remaining in the viewing zone. This is merely an exemplary arrangement of the viewing zone; it will be appreciated that viewing zones of various shapes and sizes may be used (e.g. spherical viewing zones). That being said, it is preferable that the viewing zone is limited so as to cover only a part of the volume of the scene, e.g. no more than 50% of the scene, no more than 25% of the scene, and/or no more than 10% of the scene. In this regard, if the viewing zone is the same size as the scene, then the three-dimensional representation will simply be a standard representation for virtual reality (that enables a user to move freely about the scene) - and so the use of the viewing zone will not provide any reduction in file size.
The viewing zone 1 enables movement of a viewer around (a portion of) the scene. For example, where the scene is a room, the base representation may enable a user to walk around the room so as to view the room from different angles. In particular, the viewing zone enables a user to move through the scene with six degrees-of-freedom (6DoF) movement through the scene, where this aids in the provision of an immersive experience. In some embodiments, the viewing zone 1 may be four-dimensional, where a three-dimensional location of the viewing zone changes over time - and in such embodiments the size and location of the occluded surface 2 may also change over time. More generally, it will be appreciated that viewing zones may be formed in any size or shape, with different sizes and shapes being suitable for different scenes.
The volume of the viewing zone 1 is typically selected so that a user is able to move to a degree sufficient to avoid motion sickness and to provide an immersive sensation, while still only enabling a limited amount of movement (where this leads to a smaller file size as compared to an implementation where a user is able to fully move about the scene). Typically, the viewing zone is arranged to enable a user to move their head while they are sitting or standing, but not to freely roam around a room.
The viewing zone 1 may have a (e.g. real-world) volume of less than five cubic metres (5 m3), less than one cubic metre (1 m3), less than one-tenth of a cubic metre (0.1 m3) and/or less than one-hundredth of a cubic metre (0.01 m3).
The viewing zone 1 may also have a minimum size, e.g. the viewing zone may have a volume of at least 1% of the volume of the scene, at least 5% of the volume of the scene, and/or at least 10% of the volume of the scene. Similarly, the viewing zone may have a volume of at least one-thousandth of a cubic metre (0.001 m3); at least one-hundredth of a cubic metre (0.01 m3); and/or at least one cubic metre (1 m3).
The ‘size’ of the viewing zone 1 typically relates to a size in the real world, where if the viewing zone has a length of one metre this means that a user is able to move one metre in the real world while staying within the viewing zone. The size of the viewing zone in the scene may be greater than, equal to, or less than the size of the viewing zone in the real world. For example, the viewing zone may scale a real-world distance so that moving one metre in the real world moves the user less than (or more than) one metre in the scene. This enables the scene to provide different perceptions to the user (e.g. to make the user feel larger or smaller than they are in real life). Similarly, the viewing zone may scale a real-world angle so that rotating one degree in the real world rotates the user less than (or more than) one degree in the scene.
Therefore, a viewing zone with a volume of one cubic metre typically connotes a viewing zone in which the user is able to move about a one cubic metre volume in the real world while remaining in the viewing zone. And this may cause the user to move about a volume that is more than, or less than, one cubic metre in the scene.
Referring to Figure 6a, in order to capture points for each surface and location that is visible from the viewing zone 1, a plurality of capture devices C1, C2, ..., C9 may be used (e.g. a plurality of virtual scanners and/or a plurality of cameras). Each capture device is typically arranged to perform a capture process, e.g. as described with reference to Figure 3, in which the capture device captures points at a plurality of azimuth angles and elevation angles. By locating the capture devices appropriately, e.g. by locating a capture device at each corner of the viewing zone, it can be ensured that most (or all) points of a scene are captured.
Typically, a first capture device C1 is located at a centrepoint of the viewing zone 1. In various embodiments, one or more capture devices C2, C3, C4, C5 may be located at the centres of faces of the viewing zone; and/or one or more capture devices C6, C7, C8, C9 may be located at edges of and/or corners of the viewing zone.
Figure 6a shows a two-dimensional view (e.g. a plan view) of a rectangular viewing zone. It will be appreciated that within this viewing zone each capture device may be located on a shared plane. Equally, the various capture devices may be located on different planes. Referring, for example, to Figure 6b, there is shown a three-dimensional view of a cuboid viewing zone, where there is a capture device located: at the centre of the viewing zone; at the centre of each face of the viewing zone; and at each corner of the viewing zone. With this arrangement, many locations in the scene (e.g. specific surfaces) will be captured by a plurality of capture devices so that there will be overlapping points relating to different capture devices. This is shown in Figure 7, which shows a first point P1 being captured by each of a first capture device C1 , a sixth capture device C6, and a seventh capture device C7. Each capture device captures this point at a different angle and distance and may be considered to capture a different ‘version’ of the point.
Typically, only a single version of the point is stored, where this version may be the highest quality version of the point and/or may be the version of the point associated with the nearest and/or least angled capture device.
In this regard, the highest ‘quality’ version of the point is captured by the capture device with the smallest distance and smallest angle to the point (e.g. the smallest solid angle). In this regard, as described with reference to Figures 4a and 4b, capturing a point for a given azimuth angle and elevation angle typically comprises capturing a plurality of sub-points at varying sub-point azimuth and elevation angles spread around the point azimuth and elevation angles. Due to the different spreads of sub-points, each capture device will capture a different version of the point (that has a different attribute) even when the points are at the same location. Capture devices that are close to the point and less angled with respect to the point typically have a smaller spread of sub-points and so typically obtain a version of a point that is sharper than a version of that point captured by more distant capture devices.
In some embodiments, a quality value of a version of the point is determined based on the spread of sub-points associated with this version (e.g. based on the perimeter formed by these sub-points and/or based on a surface area or volume bounded by these sub-points). The version of the point that is stored may depend on the respective quality values of possible versions of the points.
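As a non-limiting illustration, the following sketch treats the spread of a version's sub-points (here, the largest pairwise distance between them) as an inverse proxy for quality and keeps the version with the smallest spread; the spread measure is an assumption, and the perimeter or bounded area mentioned above could equally be used.

```python
import math

def spread(subpoint_positions):
    """Approximate the spread of a version's sub-points as the largest pairwise distance."""
    return max(
        math.dist(a, b)
        for i, a in enumerate(subpoint_positions)
        for b in subpoint_positions[i + 1:]
    )

def best_version(versions):
    """Keep the version of a point whose sub-points are least spread out.

    `versions` maps a capture-device label to the 3D positions of its sub-points;
    a smaller spread is treated as a higher-quality capture.
    """
    return min(versions, key=lambda device: spread(versions[device]))

versions = {
    "C1": [(0.00, 0.00, 2.0), (0.01, 0.00, 2.0), (0.00, 0.01, 2.0)],
    "C6": [(0.00, 0.00, 2.0), (0.05, 0.00, 2.0), (0.00, 0.05, 2.0)],
}
print(best_version(versions))   # -> "C1", the nearer/less angled capture device
```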
Regarding the ‘versions’ of the points, it will be appreciated that two ‘points’ in approximately the same location captured by each capture device may not have exactly the same location in the three-dimensional representation. More specifically, since each capture device typically projects a ‘ray’ at a given angle, the rays of differing capture devices may contact the surface at different locations for each capture device. Two points may be considered to be two ‘versions’ of a single point when they are within a certain proximity, e.g. a threshold proximity. For example, where the first capture device C1 captures a first point and a second point at subsequent azimuth angles, and the sixth capture device C6 captures a further point that is in between the locations of the first point and the second point, this further point may be considered to be a ‘version’ of one of the first point and the second point.
This difference in the points captured by different capture devices is illustrated by Figures 8a and 8b, which show the separate captured grids that are formed by two different capture devices. As shown by these figures, each capture device will capture a slightly different ‘version’ of a point at a given location and these captured points will have different sizes. Each capture step is associated with a particular range of angles (e.g. a nominal capture angle of 1° might encompass angles from 0.9° to 1.1°), and therefore capture devices that are far from a point to be captured represent a wider region at the capture distance than capture devices closer to that point to be captured. As shown in Figure 8a, the capture device C1 would capture the points P1 and P2 in separate brackets, whereas for the capture device C2 these points are in the same bracket. Therefore, the capture device C2 might determine a single point that encompasses both points P1 and P2, whereas the capture device C1 would determine separate points for these two points.
Considering then a situation in which points P1 and P2 are captured separately, with capture device C1 being used to capture point P1 while capture device C2 is used to capture point P2, it should be apparent that the ‘sizes’ of these captured points, and the locations in space that are encompassed by the captured points, will be based on different grids. For example, the width of the captured point P2 captured by the capture device C2 will be larger than the width of the captured point P1 captured by the capture device C1. The capture process may be determined based on the existence of these different grids, and on the different bracket widths that occur at different distances from a capture device.
Figure 8a shows an exaggerated difference between grids for the sake of illustration. Figure 8b shows a more realistic embodiment in which the three-dimensional representation comprises a plurality of points associated with different capture devices, where these points lie on different grids associated with these different capture devices.
Distance-dependency
As has been described, e.g. with reference to Figure 4a, the capturing of a point by a capture device typically involves capturing a plurality of sub-points associated with this point and then combining these sub-points. And as has been described, e.g. with reference to Figure 4b, in some embodiments the sub-points may be associated with different distances from the capture device.
Typically, at short distances, viewers are able to identify details of small surfaces and small separations between surfaces. For example, if the two surfaces shown in Figure 4b were located 5 centimetres apart with the front surface being 50 centimetres from a viewer, then that viewer would be able to identify that the rear surface is slightly angled and would be able to roughly estimate the separation between the front surface and the rear surface. In contrast, at large distances, viewers are typically unable to identify small details of surfaces or separations between surfaces. For example, if the two surfaces of Figure 4b were located 5 centimetres apart with the front surface being 1 kilometre from the viewer, the viewer might only be able to identify that the rear surface is behind the front surface (without being able to provide any reasonable estimate of the separation between the two surfaces), or the two surfaces might even appear to the viewer as a single surface.
In relation to Figure 4b, it has been described that combining the sub-points in the situation shown by that figure can lead to an inaccurate combined point being determined, one that appears to float away from a surface with an inaccurate colour.
While this float is problematic for nearby surfaces, this inaccuracy is not necessarily problematic for further surfaces. In practice, where a surface is far away, a user may not be able to identify a slight separation of a point from this surface, and so it may be acceptable to store points even where they do not correspond to a single surface. Therefore, the present disclosure envisages a method in which the determination of a point, and in particular the combining of sub-points to form a point, is dependent on a distance of that point (and/or the component sub-points) from one or more of: a capture device that is capturing the point; and a viewing zone. Being dependent on a distance from a viewing zone typically connotes being dependent on a minimum distance from a viewing zone (e.g. dependent on a distance between the point of the three-dimensional representation and a proximate point on the viewing zone).
Such a method of determining a point in dependence on a distance of that point from a viewing zone is described with reference to Figure 9.
In a first step 41, a computer device determines a plurality of sub-points that are associated with a point. Each of these sub-points is associated with a capture device that has used a different angle (e.g. a different azimuth angle and/or a different elevation angle) to capture that sub-point.
In a second step 42, a computer device determines a separation between the sub-points. The separation may be an absolute separation, may be a radial separation, and/or may be an axial separation (e.g. a difference in a depth of the points). Equally, the separation may comprise a difference between the distances of the sub-points from the capture device; for example, the separation may be the maximum difference in distance values between the sub-points. In a third step 43, a computer device determines a distance of the sub-points from the viewing zone (and/or the capture device). The distance may, for example, be a minimum distance of a sub-point, a maximum distance of a sub-point, or an average distance of a sub-point. Typically, the distance is the distance of the closest sub-point.
In a fourth step 44, the computer device determines whether to combine the sub-points (in order to form a point) based on the separation and the distance. Typically, this comprises comparing the separation to a threshold separation and combining the sub-points in dependence on the separation being beneath the separation threshold.
Typically, the separation threshold is a function of the distance of the points from the capture device and/or the viewing zone. The separation threshold may increase linearly with the distance but, more typically, increases at an increasing rate as the distance increases (e.g. exponentially), and/or may be based on a function such as an arctan function. The separation threshold may also change (e.g. in a discrete step) from a first threshold for a first range of distances to a second threshold for a second range of distances.
The general principle being applied here is that a separation becomes less noticeable as the distance from the viewing zone increases; for example, a separation of 5 cm in depth at a distance of 1 km cannot be distinguished from any viewpoint inside the viewing zone, whereas a separation of 5 cm at a distance of 50 cm is readily perceived.
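As a minimal sketch of steps 42 to 44, the following code keeps the separation threshold proportional to distance (a fixed angular resolution of roughly one arcminute), which is one of the options described above; the constant, the function names, and the use of the mean location for the combined point are all assumptions made for illustration.

```python
import math
import numpy as np

def separation_threshold(distance: float, angular_res_deg: float = 1.0 / 60.0) -> float:
    """Separation that is just noticeable at the given distance from the viewing zone,
    assuming a fixed angular resolution (here about one arcminute)."""
    return distance * math.tan(math.radians(angular_res_deg))

def combine_sub_points(sub_points: np.ndarray, viewing_zone_centre: np.ndarray):
    """Steps 42-44: measure separation and distance, then combine or keep separate."""
    # Step 42: separation, taken here as the maximum pairwise distance between sub-points.
    diffs = sub_points[:, None, :] - sub_points[None, :, :]
    separation = float(np.linalg.norm(diffs, axis=-1).max())
    # Step 43: distance, taken here as the distance of the closest sub-point to the zone.
    distance = float(np.linalg.norm(sub_points - viewing_zone_centre, axis=1).min())
    # Step 44: combine only if the separation is below the distance-dependent threshold.
    if separation < separation_threshold(distance):
        return sub_points.mean(axis=0)   # single combined point (location average)
    return None                          # too separated: keep apart / split the point
```

With these assumed values, a 5 cm depth separation is combined (treated as one surface) at 1 km but kept separate at 50 cm, matching the example above.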
If, in the fourth step 44, the separation of the points is deemed too great for the points to be combined (e.g. because the separation exceeds the separation threshold), then the computer device may determine that it is appropriate to divide the point into a plurality of points. In this regard, points are typically formed from a combination of sub-points. While each point may, by default, be formed from the same number of sub-points, it is possible to form points from different numbers of sub-points. In practice, this may lead to most points in the three-dimensional representation being formed of, e.g., four or sixteen sub-points, with certain points near the edges of surfaces being formed of fewer, e.g. one or four, sub-points.
While the above description has considered the combination of sub-points based on a separation of these sub-points and a distance of these sub-points from the viewing zone, the combination of the sub-points may alternatively, or additionally, depend on a difference between the attribute values of the sub-points, where this attribute difference may be compared to an attribute threshold, and this attribute threshold may also depend on the distance of the sub-points from the viewing zone.
For example, for a given distance, points that are highly separated, but have similar attribute values may be combined to form a point. Similarly, points that are close together, but have very different attribute values may not be combined. Each of the threshold separation value and the threshold attribute difference may depend on the distance of the sub-points from the capture device.
Froxelisation
A method of determining points in dependence on a distance of these points from the viewing zone has been described with reference to Figure 9. The present disclosure also envisages the processing of already captured points based on distance.
One aspect of the present disclosure relates to a method of dividing the three-dimensional representation into (e.g. allocating the points of the three-dimensional representation to) a plurality of ‘froxels’ (frustum voxels), where the dimensions - in particular the depths and/or volumes - of each froxel are dependent on a distance of that froxel from the viewing zone. Typically, the three-dimensional representation is divided into froxels using a coordinate system that is: based on a location of/in the viewing zone; based on a centrepoint of the viewing zone; and/or based on a capture device. Typically, there is a capture device located at the centrepoint of the viewing zone, so that the division of the three-dimensional representation into froxels may use a coordinate system that is centred on both the centre of the viewing zone and this central capture device. Therefore, the method may comprise allocating the points of the three-dimensional representation to a plurality of froxels, where the depth of each froxel depends on the distance of that froxel from the centre of the viewing zone. Equally, the volume of each froxel typically depends on the distance of that froxel from the centre of the viewing zone.
In this disclosure, the division of the three-dimensional representation into froxels is described as being a ‘froxelised’ space. It will be appreciated that the centrepoint of this froxelised space may be the centre of the viewing zone, may be a specific capture device, may be another location in the viewing zone, etc.
The froxels may equally be termed as volumes, containers, or boundaries. In general, the froxels each occupy a volume within the three-dimensional representation, which volume encompasses zero or more points of the three-dimensional representation.
Referring to Figure 10, there is shown a (cross-sectional view of a) three-dimensional space that has been divided into froxels in dependence on a centrepoint of the viewing zone 1. Each froxel is a segment of space that is defined by: an inner axial boundary; an outer axial boundary; and four radial boundaries (two elevational radial boundaries and two azimuthal radial boundaries).
Typically, the inner axial boundary and the outer axial boundary of each froxel are sections of spheres centred on the centrepoint of the viewing zone. Typically, the radial boundaries of each froxel are formed by planes extending outwards from the centrepoint of the viewing zone. While Figure 10 shows a two-dimensional plan view of the three-dimensional space (with circles and lines), it will be appreciated that in practice the space is three-dimensional, so that the circles and lines of Figure 10 represent spheres and planes.
Of relevance to the present disclosure, the froxels are typically formed by generating a plurality of axial boundaries (e.g. spheres) that are centred on the centrepoint of the viewing zone (or, more generally, based on a point within the viewing zone). Typically, the distance between axial boundaries increases with distance from the centrepoint of the viewing zone. Therefore, a first axial boundary that is the closest boundary to the viewing zone has a first radius r(1), a second axial boundary that is the second closest boundary to the viewing zone has a second radius r(2), and so on, where r(n+1) - r(n) > r(n) - r(n-1) for at least some, and typically all, values of n. In general, r(n) = f(n). The function that determines the radius of a given axial boundary may, for example, be a curve. Typically, the function is based on an arctan curve, e.g. r(n) = a * arctan(n * b), where a and b are constants. The froxelisation is therefore similar to a process of quantisation, where the depth of froxels increases step-wise as the froxels move away from the viewing zone.
In some embodiments, the rate of increase of the radius of subsequent axial boundaries increases as distance from the viewing zone 1 increases. Therefore, near to the viewing zone there is a very high density of froxels, which density decreases with distance. This accounts for the reduced ability of users to perceive separations at large distances: points near to the viewing zone can be processed in smaller/shallower froxels, and therefore with more precision, than points far from the viewing zone (as described below).
The rate of increase of the radius of subsequent froxel boundaries may increase linearly and/or exponentially. Typically, the radii follow a quantisation curve, where the possible values for each radius are fixed to be one of a predetermined list (as may, for example, be set by a party generating the representation and/or be set based on a size of the scene/representation).
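The sketch below shows one assumed way of realising such boundaries and allocating a point to a froxel: the radial index is obtained by quantising an arctan curve of the distance from the centre, which makes consecutive boundaries closer together near the viewing zone and further apart at large distances. The constants, the angular step, and the function names are illustrative only and are not taken from this disclosure.

```python
import math
import numpy as np

def froxel_index(point: np.ndarray,
                 centre: np.ndarray,
                 angular_step_deg: float = 1.0,
                 radial_levels: int = 64,
                 radial_scale: float = 10.0) -> tuple:
    """Allocate a point to a froxel, identified by (azimuth index, elevation index,
    radial index) relative to the centre of the froxelised space."""
    v = point - centre
    x, y, z = float(v[0]), float(v[1]), float(v[2])
    r = math.sqrt(x * x + y * y + z * z)
    azimuth = math.degrees(math.atan2(y, x))                                        # -180 .. 180
    elevation = math.degrees(math.asin(max(-1.0, min(1.0, z / r)))) if r > 0 else 0.0
    # Radial boundaries: quantise an arctan curve of the distance, so that froxel
    # depth grows step-wise with distance from the viewing zone.
    radial_index = int(radial_levels * (2.0 / math.pi) * math.atan(r / radial_scale))
    azimuth_index = int((azimuth + 180.0) // angular_step_deg) % int(360 / angular_step_deg)
    elevation_index = int((elevation + 90.0) // angular_step_deg)
    return azimuth_index, elevation_index, radial_index
```

Points that share the same index triple can then be grouped and processed independently, e.g. on separate workers, as described below.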
Such a method of dividing the three-dimensional space provides a plurality of froxels in which froxels near to the viewing zone are smaller than froxels further from the viewing zone, with the size of the froxels typically increasing (e.g. exponentially) as the froxels move further from the viewing zone. Each froxel encompasses a volume of space within the three-dimensional representation and encompasses zero or more points within that volume of space.
During a stage of processing, the points in each froxel are processed separately and independently (e.g. as described below). In this way, points from different froxels may be processed separately, e.g. by parallel processors. This enables a scene to be divided into froxels and then these different froxels to be processed by different computer devices so as to increase the speed of processing steps used to process the points of the three-dimensional representation.
The embodiment of Figure 10 shows froxels that are determined based on a spherical coordinate system, so that each froxel is determined based on an inner axial boundary, an outer axial boundary, and four radial boundaries. The present disclosure may equally be applied to other coordinate systems (e.g. Cartesian systems), where the froxels may then be associated with inner and outer ‘z’ boundaries as well as two ‘x’ boundaries and two ‘y’ boundaries (or, more generally, the froxels may be associated with two depth boundaries, two width boundaries, and two height boundaries). In general, the present disclosure considers the determination of a plurality of froxels based on a plurality of boundaries, where a depth of the froxels increases with the distance of the froxels from a centre of the coordinate system.
Referring then to Figure 11a, there is shown an exemplary froxel that encompasses a plurality of points, which points may have different attribute values, different locations, different normal values, different transparencies, etc.
Referring to Figure 11b, each froxel is typically associated with a plurality of angular sections 111, 112, 113, 114, where the angular sections relate to angular sections of the froxelised space. For example, the boundaries of the first angular section of the froxel of Figure 11b may be angular lines that extend from the centre of the froxelised space at angles of, e.g., 0° and 1°. The spacing of the angular lines depends on a desired angular resolution of the three-dimensional representation, which may, for example, be set by a user.
In some embodiments, the width of each froxel is set to be the same as this angular resolution, so that each froxel has only a single angular section. However, typically, each froxel is arranged to contain a plurality of angular sections, where points in different angular sections may (in some situations) be considered together.
Similarly, each froxel is typically associated with a plurality of (discrete) (in-froxel) quantisation levels 121, 122, 123, 124, where typically the distance between the quantisation levels increases with distance from the centre of the froxelised space (e.g. based on a curve, such as an arctan curve). Therefore, there are a greater number of quantisation levels available nearer to the centre of the froxelised space, and hence nearer to the viewing zone, than further from the centre.
In some embodiments, each froxel is associated with the same number of quantisation levels. Since froxels near to the viewing zone have a smaller depth than froxels further from the viewing zone, the use of the same number of quantisation levels for each froxel provides an implementation with a higher density of quantisation levels nearer to the viewing zone. In various implementations, each froxel may be associated with, for example, five quantisation levels, or ten quantisation levels.
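The following sketch shows one assumed way of placing the same number of quantisation levels in every froxel, with the spacing between levels widening towards the outer boundary; the curvature constant and the nearest-level snapping are illustrative choices rather than a prescribed implementation.

```python
import math

def in_froxel_levels(r_inner: float, r_outer: float, n_levels: int = 10) -> list:
    """Radial quantisation levels inside one froxel. Every froxel gets the same number
    of levels, so shallow froxels near the viewing zone have a higher density of levels
    than deep froxels far away; within the froxel, level spacing widens outwards."""
    c = 1.2  # curvature constant (assumed), 0 < c < pi/2
    positions = [math.tan(c * (i + 1) / n_levels) / math.tan(c) for i in range(n_levels)]
    return [r_inner + p * (r_outer - r_inner) for p in positions]

def snap_to_level(r: float, levels: list) -> float:
    """Quantise a point's radial distance to the nearest level of its froxel."""
    return min(levels, key=lambda level: abs(level - r))
```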
According to an aspect of the present disclosure, within an angular section of a froxel, points are processed based on the quantisation levels, so that one or more of the initially captured points of the three-dimensional representation are processed to form a series of points at the available quantisation levels of the froxel. In some embodiments, each of the points within each froxel is quantised (where the distance of that point is modified so as to lie at a quantisation level of the froxel), and the quantisation of each point may depend on a characteristic of that point. Referring to Figure 12, there is described a method of processing points within an angular section of a froxel. This method can alternatively (and more generally) be implemented as a method of processing points within a container, or a volume, of a three-dimensional representation (where the froxel is an example of such a container and the angular section within the froxel is also an example of such a container). The method is carried out by a computer device, such as the image generator 11.
In a first step 51, the computer device identifies a plurality of points within the angular section. Typically, this comprises the computer device querying the three-dimensional representation (e.g. a point cloud) to identify a plurality of points within a region of the three-dimensional representation, that region falling within the angular section.
The computer device may, for example, iterate through the points of the three-dimensional representation to sort the points first into froxels and then into angular sections within these froxels (and then, optionally, into segments of the angular sections, the segments being associated with the in-froxel quantisation levels).
In a second step 52, the computer device identifies a feature of each point of the plurality of points, and in a third step 53, the computer device combines the points in dependence on these features. Combining the points typically comprises determining (and storing) a new point with an attribute that is dependent on the attributes of the combined points. The combined points may then be removed from the three-dimensional representation. This method therefore reduces the number of points within the representation and so reduces the size of the representation. The new point may, for example, have a colour value for each of the left eye and the right eye of a user, where each of these colour values for the new point is determined to be a combination of the corresponding (left eye and right eye) colour values of the identified plurality of points.
Determining the attribute of the new point may comprise taking one or more of: a maximum attribute of the combined points; a minimum attribute of the combined points; an average attribute of the combined points; and a weighted average attribute of the combined points.
The new point is located at one of the quantisation levels of the froxel, where typically the combined points are located about this quantisation level and are replaced with a new point at this quantisation level that is a combination of these combined points. The new point is typically located at an angle that is an angle of the angular section.
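A minimal sketch of this combining for one angular section is given below; the size-weighted colour average and the placement of the new point along the section's central direction are assumed choices (the disclosure equally allows a minimum, maximum, or unweighted average), and per-eye colour channels can simply be stacked in the colour array.

```python
import numpy as np

def combine_section_points(colours: np.ndarray,
                           sizes: np.ndarray,
                           section_direction: np.ndarray,
                           quantisation_level: float):
    """Replace the points of one angular section with a single new point located at the
    section's angle and at a quantisation level, with a combined colour attribute."""
    weights = sizes / sizes.sum()                       # larger points weigh more
    new_colour = (weights[:, None] * colours).sum(axis=0)
    unit = section_direction / np.linalg.norm(section_direction)
    new_location = quantisation_level * unit            # distance along the section angle
    return new_location, new_colour
```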
Regarding the storage of the location of the new points, each initial point is determined using a capture device, and the locations of the points are typically stored initially by storing an index of the capture device, an angular identifier associated with the angle of the point from the capture device, and a distance of the point from the capture device.
Typically, the froxelised representation is centred on the location of a capture device, e.g. a capture device located at the centre of the viewing zone. Therefore, the location of the (new) combined points may be defined based on this capture device on which the froxelised representation is centred (e.g. the capture device located at the centre of the viewing zone). More specifically, the location of the combined point may be defined based on an index of this central capture device, an angle with respect to the central capture device (which angle is used to determine the angular section), and a distance from this central capture device. Therefore, the present disclosure considers a situation in which a new point is generated based on a combination of (the attributes of) a plurality of other points, these other points optionally being captured by different capture devices. The location of the new point is defined with reference to a first capture device that is different to a second capture device used to capture one or more of the other points.
Typically, the method of Figure 12, and the combining of points into new points, has a dependency on a distance of these points from the viewing zone. This dependency is, to some extent, a feature of the froxelisation, in that froxels further from the viewing zone have a greater depth than froxels close to the viewing zone. A further dependency on distance from the viewing zone may be considered during the third step, where the computer device may combine the points in dependence on a distance of these points from the viewing zone (e.g. in dependence on a distance of the nearest quantisation level from the centre of the froxelised space).
In some embodiments, the combining of the points may occur for each angular step within the froxel, where, for each of the angular steps, all points within the froxel in that angular step are replaced by a single point having the appearance of all of the combined points when viewed from the centre of the viewing zone. This may involve, for each angular step, combining the locations and/or attributes of the points in that angular step. This process can be thought of as a form of rendering the points of a froxel from the centre of the viewing zone.
The combining of the points may further depend on the attributes of the points and/or a difference between the attributes of the points. For example, points with similar attributes may be combined more readily than points with substantially different attributes. And the threshold level of similarity (for combining to occur) may depend on the distance of the points from the viewing zone (and/or the distance of the points from the centre of the froxelised space).
Typically, the combining of the points is dependent on a complexity of the points and/or of the region containing the points. In this regard, small differences in complex shapes, such as foliage, can typically be noticed by a user at close range but not at long range. For example, a user may be able to identify separate leaves in foliage when the foliage is near to the user, but separate leaves may be unidentifiable by this user from further away. In contrast, small differences in simple shapes, such as a colour change on a smooth wall of colour, are typically more noticeable even at long range. Therefore, in some embodiments, the combining of the points is dependent on a complexity value of the points, where the computer device may be arranged to combine points that are identified as exceeding a threshold complexity, but not points below this threshold complexity. The threshold complexity required for combining points may depend on the distance of the points from the viewing zone.
The complexity of the points may, for example, be indicated by a user, where a user may be able to define complex regions in which points may be combined. Equally, the complexity of the points may be determined using automated methods, e.g. using artificial intelligence algorithms or machine learning models. The complexity of the points may, for example, be determined by comparing patterns of points to a database of patterns, these patterns being associated with complexity levels.
In some embodiments, the complexity of a region and/or of a plurality of points is determined based on one or more of: a distribution of the attributes of the points (e.g. a maximum difference between the attributes and/or a standard deviation of the attributes); a distribution of the normals of the points; and a distribution of the capture devices used to capture the points. Complexity can also be linked to non-planarity of neighbouring points, where the computer device may determine the complexity based on a difference in the normals of the points and/or based on a distance of the points from a shared plane. In this regard, the computer device may determine a plane associated with the points (e.g. a plane that passes through the plurality of points with a minimum average distance to the points) and then determine the complexity based on the distance between these points and the plane.
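One assumed realisation of the plane-based complexity measure is sketched below: a least-squares plane is fitted through the points and the root-mean-square distance of the points from that plane is used as the complexity value. The disclosure equally allows complexity measures based on attribute, normal, or capture-device distributions; the SVD-based fit is just one convenient option.

```python
import numpy as np

def planarity_complexity(points_xyz: np.ndarray) -> float:
    """Complexity from non-planarity: fit a least-squares plane through the points and
    return the root-mean-square distance of the points from that plane."""
    centred = points_xyz - points_xyz.mean(axis=0)
    # The plane normal is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    normal = vt[-1]
    distances = centred @ normal                      # signed distances to the plane
    return float(np.sqrt((distances ** 2).mean()))
```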
Referring to Figure 11b, combining a plurality of points may comprise combining a plurality of points associated with a quantisation level of the angular section. Therefore, for example, a first set of points 131, 132, 133, 134, 135 associated with a fourth quantisation level 124 may be combined into a first combined point 139, and a second set of points 141, 142, 143 associated with a third quantisation level 123 may be combined into a second combined point 149. Referring to Figure 11c, in some embodiments, combining the points in dependence on the features comprises combining points across one or more (e.g. a plurality of) quantisation levels in dependence on these features.
For example, where the points in the angular section 111 are all highly complex and/or where the entirety of the angular section is associated with a region of high complexity, combining the points may comprise combining the points across the entirety of the angular section. Equally, where the points in the angular section are of a medium complexity and/or where the angular section is of a medium complexity, the points may be combined into a plurality of new points, where one or more of the new points is formed from points associated with a plurality of quantisation levels (e.g. points associated with each of the first quantisation level 121 and the second quantisation level 122 may be combined into a first new point, and points associated with each of the third quantisation level 123 and the fourth quantisation level 124 may be combined into a second new point). Where the points in the angular section are of a low complexity and/or where the angular section is of a low complexity, the points may be combined into a plurality of new points, with each new point associated with a single quantisation level. Finally, where the points in the angular section are of a very low complexity, the points may not be combined at all.
The aforementioned combining of points may then be dependent on a complexity level of the points and/or the region, where points associated with a number of quantisation levels are combined, this number of quantisation levels being dependent on the complexity level.
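As an illustration of how the number of merged quantisation levels could follow the complexity tiers described above, the sketch below maps a normalised complexity value to a merge group size; the tier thresholds are assumptions for illustration only.

```python
def merge_group_size(complexity: float, n_levels: int) -> int:
    """Number of adjacent quantisation levels whose points are merged into one new point.
    'complexity' is assumed to be normalised to [0, 1]; thresholds are illustrative."""
    if complexity >= 0.75:
        return n_levels   # highly complex: one new point for the whole angular section
    if complexity >= 0.5:
        return 2          # medium complexity: merge pairs of adjacent levels
    if complexity >= 0.25:
        return 1          # low complexity: one new point per quantisation level
    return 0              # very low complexity: do not combine at all
```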
It will be appreciated that the complexity level is an exemplary feature, and the combining of points and/or the number of quantisation levels across which points are combined may be dependent on another feature, such as a distribution of attributes of the points, a number of points in the angular section, and/or a user input.
Referring to Figure 11d, the result of the processing may be a froxel that has a point for one or more of the angular sections. For example, where the first angular section 111 is a highly complex section, each of the points in the first angular section may be combined to provide a single new point 161 at the third quantisation level 123. The second angular section 112 may have a lesser complexity, so that the points in this angular section are combined into a second new point 162 at the fourth quantisation level 124 and a third new point 163 at the fifth quantisation level, and so on.
Referring to a fourth new point 164 in the third angular section 113 and a fifth new point 165 in the fourth angular section 114, where points in adjacent angular sections are located on the same quantisation level, these points may be combined in dependence on a depth of the froxel and/or a feature of the points. So the fourth and fifth new points 164, 165 may be combined if these points have similar attributes.
The above embodiments have primarily considered a method of processing within a froxel that involves combining a plurality of points in this froxel in order to generate a new point. More generally, the current disclosure envisages a method of processing one or more points of a three-dimensional representation in dependence on a container comprising those points. This processing may comprise combining a plurality of points in order to generate a new point. Equally, this processing may comprise a different processing operation. For example, the processing may comprise quantising a single point in the froxel so that this point is located at a quantisation level. Equally, the processing may comprise: removing and/or filtering out one or more of the points (e.g. to remove points of a certain colour or to alter a colour range of the three-dimensional representation); modifying a location of one or more of the points; altering an attribute of one or more points; and identifying points with certain features (e.g. identifying all points with a certain attribute value, and then outputting a list of these points).
The processing (e.g. the combining) of the points may be associated with values (e.g. attribute values of the points), where these values may be compared to a threshold value. For example, points may be removed from the three-dimensional representation if an attribute value of these points is beneath a threshold value. The threshold value may depend on the froxel containing the points that are being processed (e.g. so that the threshold attribute value for the points increases as the points become more distant from the centre of the froxelised coordinate system). A plurality of points may be processed in dependence on a combined value associated with this plurality of points; e.g. in dependence on a minimum value of the points, a maximum value of the points, an average value of the points, and/or a variance of the values of the points. This combined value may be compared to a threshold, where e.g. the points may be combined to generate a new point in dependence on the combined value exceeding the threshold. The value may be associated with one or more of: the attributes of the points; the locations of the points; the similarity of the attributes of the points; and the complexity of the points.
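A minimal sketch of such threshold-based filtering is given below, with a threshold that grows with the radial index of the froxel containing the point; the base value and per-froxel increment are assumptions for illustration.

```python
def keep_point(attribute_value: float,
               radial_index: int,
               base_threshold: float = 0.05,
               per_froxel: float = 0.01) -> bool:
    """Keep a point only if its attribute value meets a threshold that increases as the
    point's froxel gets further from the centre of the froxelised coordinate system."""
    return attribute_value >= base_threshold + per_froxel * radial_index
```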
Typically, each froxel (and the group of points in each froxel) is processed separately. Therefore, a plurality of different froxels may be processed by a plurality of different components or different (e.g. separate) computer devices. This enables the froxels of the three-dimensional representation to be processed in parallel and also enables sections of the representation to be processed separately (e.g. where a user only wishes to view a portion of the representation, only the froxels relating to this portion may be processed). Essentially, (the points in) a first froxel and (the points in) a second froxel may be processed separately, e.g. at different times or by different computer devices.
Alternatives and modifications
It will be understood that the present invention has been described above purely by way of example, and modifications of detail can be made within the scope of the invention.
The representation is typically arranged to provide an extended reality (XR) experience (e.g. a representation that is useable to render a XR video). The term extended reality (XR) covers each of virtual reality (VR), augmented reality (AR), and mixed reality (MR) and it will be appreciated that the disclosures herein are applicable to any of these technologies.
Reference numerals appearing in the claims are by way of illustration only and shall have no limiting effect on the scope of the claims.

Claims
1. A method of processing a three-dimensional representation of a scene, the method comprising: identifying a plurality of points of a three-dimensional representation; determining a coordinate system for the representation; determining a plurality of containers associated with the coordinate system, wherein each container covers a volume of the three-dimensional representation, and wherein a depth of each container is dependent on a distance of that container from a centre of the coordinate system; and allocating each of the plurality of points to one of the containers.
2. The method of any preceding claim, comprising determining the coordinate system based on one or more of: a viewing zone, preferably wherein the coordinate system is centred at a centre of the viewing zone; and a capture device used to capture one or more of the plurality of points, preferably wherein the coordinate system is centred on the capture device.
3. The method of any preceding claim, comprising processing one or more of the points in dependence on a container that contains said points.
4. The method of any preceding claim, comprising: identifying a first container; identifying one or more points in the first container; and processing the points in the first container.
5. The method of any preceding claim, comprising: identifying a second container; identifying one or more points in the second container; and processing the points in the second container separately to the points in the first container, preferably wherein processing the points separately comprises: processing the points at different times; and/or processing the points using different computer devices.
6. The method of any preceding claim, wherein processing the points comprises one or more of: generating a new point based on one or more points; modifying a location and/or value of the points; removing and/or filtering out one or more of the points; and assigning a new parameter to a point, preferably wherein the new parameter indicates a container comprising the point.
7. The method of any preceding claim, comprising processing the points in dependence on a threshold, preferably a threshold associated with the attributes of the points.
8. The method of any preceding claim, wherein the threshold depends on the container.
9. The method of any preceding claim, comprising: identifying a first plurality of points in a first container of the plurality of containers; and processing the first plurality of points so as to generate a new point in the first container.
10. The method of any preceding claim, comprising: identifying a plurality of points in a first container; and combining the points to generate the new point.
11. The method of any preceding claim, wherein combining the points comprises one or more of: combining an attribute of each point; combining a transparency of each point; and combining a distance of each point.
12. The method of any preceding claim, comprising generating the new points based on one or more of: a minimum attribute value of the first plurality of points; a maximum attribute value of the first plurality of points; an average value of the first plurality of points; and a variance of the attribute values of the first plurality of points.
13. The method of any preceding claim, comprising determining the coordinate system based on a capture device associated with a viewing zone, preferably wherein the capture device is located at the centre of the viewing zone.
14. The method of any preceding claim, wherein the depth of the containers increases as the distance of the containers from the centre of the coordinate system increases, preferably wherein the depth of each container is determined based on an arctan curve.
15. The method of any preceding claim, wherein the depth of the containers increases linearly as the distance of the containers from the centre of the coordinate system increases.
16. The method of any preceding claim, wherein the depth of the containers is determined based on an arctan curve.
17. The method of any preceding claim, wherein each container is associated with each of: an inner axial boundary; an outer axial boundary; a first radial boundary; and a second radial boundary, preferably wherein the inner axial boundary and the outer axial boundary are determined based on a quantisation curve, more preferably wherein the quantisation curve is based on an arctan curve.
18. The method of any preceding claim, wherein the locations of the first radial boundary and the second radial boundary are dependent on an angular resolution of the scene.
19. The method of any preceding claim, wherein each container is associated with one or more angular sections.
20. The method of any preceding claim, wherein each container is associated with one or more quantisation levels, preferably wherein each container is associated with the same number of quantisation levels.
21. The method of any preceding claim, wherein the quantisation levels are determined based on a curve, preferably an arctan curve.
22. The method of any preceding claim, comprising combining the plurality of points in dependence on a location of each point, preferably in dependence on each point being within the first container, more preferably in dependence on each point being associated with the same quantisation level within the container and/or within the same angular section within the container.
23. The method of any preceding claim, wherein combining the points comprises taking one or more of: a minimum, an average, a weighted average, and a maximum of the points.
24. The method of any preceding claim, comprising combining points associated with a plurality of quantisation levels of the container.
25. The method of any preceding claim, comprising processing (e.g. combining) the points based on a threshold value, preferably comprising processing each point (and/or each set of points) only if a parameter value associated with that point exceeds the threshold value.
26. The method of any preceding claim, wherein the parameter value is determined based on one or more of: an average parameter of a set of points; a maximum parameter of the set of points; a minimum parameter of the set of points; and a variance of the parameters of the set of points.
27. The method of any preceding claim, wherein the parameter relates to the attributes of the points and/or the locations of the points.
28. The method of any preceding claim, wherein the threshold value is associated with one or more of: a similarity of the points; a complexity of the points; the attribute values of the points; the container that contains the points; and the locations of the points.
29. The method of any preceding claim, comprising combining the points based on a complexity value associated with the points and/or the container, preferably wherein a number of quantisation levels for which points are combined is dependent on the complexity value.
30. The method of any preceding claim, wherein the complexity value is dependent on one or more of: a user input; an artificial intelligence algorithm; a machine learning model; and a distribution of points and/or attributes of points within a region.
31. The method of any preceding claim, comprising determining a threshold complexity for combining a plurality of points, preferably wherein the threshold complexity is dependent on one or more of: a container that contains the points; and a distance of the points from a viewing zone and/or from the centre of the coordinate system.
32. The method of any preceding claim, comprising combining the points based on attribute values associated with the points.
33. The method of any preceding claim, comprising combining the points based on a separation of the points.
34. The method of any preceding claim, wherein the combined points are associated with a plurality of different capture devices.
35. The method of any preceding claim, comprising associating the new point with a new capture device, preferably a new capture device at the centre of the coordinate system, preferably comprising determining and storing a distance of the new point from the new capture device.
36. The method of any preceding claim, wherein the three-dimensional representation is associated with a viewing zone, the viewing zone comprising a subset of the scene and/or the viewing zone enabling a user to move through a subset of the scene, preferably wherein the user is able to move within the viewing zone with six degrees of freedom (6DoF).
37. The method of claim 36, wherein: the viewing zone has a volume of less than 50% of the volume of the scene, less than 20% of the volume of the scene, and/or less than 10% of the volume of the scene; and/or the viewing zone has, or is associated with, a volume, preferably a real-world volume, of less than five cubic metres (5 m³), less than one cubic metre (1 m³), less than one-tenth of a cubic metre (0.1 m³), and/or less than one-hundredth of a cubic metre (0.01 m³).
38. The method of any preceding claim, wherein the three-dimensional representation comprises a point cloud.
39. The method of any preceding claim, comprising storing the three-dimensional representation and/or outputting the three-dimensional representation, preferably outputting the three-dimensional representation to a further computer device.
40. The method of any preceding claim, comprising generating an image and/or a video based on the three-dimensional representation.
41. The method of any preceding claim, comprising forming one or more two-dimensional representations of the scene based on the three-dimensional representation, preferably comprising forming a two-dimensional representation for each eye of a viewer.
42. The method of any preceding claim, wherein the point is associated with one or more of: a location; an attribute; a transparency; a colour; and a size.
43. The method of any preceding claim, wherein the point is associated with an attribute for a right eye and an attribute for a left eye.
44. The method of any preceding claim, wherein the scene comprises one or more of: an extended reality (XR) scene; a virtual reality (VR) scene; an augmented reality (AR) scene; and a mixed reality (MR) scene.
45. A computer program product comprising software code that, when executed on a computer device, causes the computer device to perform the method of any preceding claim.
46. A machine-readable storage medium that includes instructions that, when executed by one or more processors of a machine, cause the machine to perform the method of any of claims 1 to 44.
47. A system for carrying out the method of any of claims 1 to 44, the system comprising one or more of: a processor; a communication interface; and a display.
48. An apparatus for processing a three-dimensional representation of a scene, the apparatus comprising: means for identifying a plurality of points of a three-dimensional representation; means for determining a coordinate system for the representation; means for determining a plurality of containers associated with the coordinate system, wherein each container covers a volume of the three-dimensional representation, and wherein a depth of each container is dependent on a distance of that container from a centre of the coordinate system; and means for allocating each of the plurality of points to one of the containers.

