WO2025233629A1 - Determining a point of a three-dimensional representation of a scene - Google Patents
Determining a point of a three-dimensional representation of a scene
- Publication number
- WO2025233629A1 (PCT/GB2025/051000)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- points
- point
- texture
- determining
- scene
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T11/00—2D [Two Dimensional] image generation
- G06T11/001—Texturing; Colouring; Generation of texture or colour
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
- G06T15/04—Texture mapping
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/11—Region-based segmentation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/187—Segmentation; Edge detection involving region growing; involving region merging; involving connected component labelling
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T9/00—Image coding
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T9/00—Image coding
- G06T9/001—Model-based coding, e.g. wire frame
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10028—Range image; Depth image; 3D point clouds
Definitions
- the present disclosure relates to methods, systems, and apparatuses for determining (e.g. defining) a point of a three-dimensional representation of a scene, in particular determining a point in a point cloud.
- Three-dimensional representations of environments are used in many contexts, including for the generation of virtual reality videos, in which depth information for a plurality of points of the representation is used to generate different images for a left eye and a right eye of a user.
- substantial processing power is required to determine such a three-dimensional representation, and the file size of files associated with these representations is typically large so that substantial amounts of storage are needed to keep the files and substantial amounts of bandwidth are required to transfer the files.
- a method of determining a point of a three-dimensional representation of a scene comprising: identifying a plurality of points of the representation; determining that the plurality of points lie on a shared plane; in dependence on the plurality of points lying on a shared plane, determining a texture patch based on attributes of the plurality of points; and determining a texture point, the texture point comprising a reference to the texture patch.
- the representation comprises a plurality of points captured using a plurality of different capture devices
- the method comprises identifying a first plurality of points captured using a first capture device.
- the method comprises identifying a first plurality of adjacent points of the representation.
- determining the texture patch comprises: identifying attribute values for each of the identified points; and forming the texture patch based on the identified attribute values.
- the texture patch comprises the attribute values arranged in the arrangement of the identified points.
- the method comprises storing the texture patch in a database, the database comprising a plurality of texture patches.
- the texture patch is associated with an index.
- determining the texture patch comprises one or more of: forming the texture patch based on attribute values of the identified points; forming the texture patch based on transparency values of the identified points; and forming the texture patch based on normal values of the identified points.
- determining the texture point comprises modifying one of the identified points.
- the method comprises removing at least one, and preferably all, of the identified points from the three-dimensional representation.
- the method comprises replacing the identified points with the texture point.
- the method comprises: determining attribute values associated with the identified points; and determining the texture patch in dependence on a difference of said attributes exceeding a threshold, preferably wherein the difference is associated with a variance of the attributes.
- determining that the plurality of points lie on a shared plane comprises: determining a plane that passes through the plurality of points; determining distances of one or more of the points from the plane; and determining that the plurality of points lie on a shared plane based on the determined distances, preferably based on a maximum distance, an average distance, and/or a variance of the distances.
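- As a purely illustrative sketch (not the claimed implementation), the shared-plane test described above could be realised as follows; the least-squares plane fit via SVD, the use of numpy, and the example thresholds are assumptions for this sketch.

```python
import numpy as np

def lies_on_shared_plane(points, max_dist=0.01, max_var=1e-4):
    """Illustrative planarity test: fit a plane to Nx3 points by least squares
    (via SVD of the centred coordinates) and check point-to-plane distances.
    The thresholds are arbitrary example values."""
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    # The plane normal is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(pts - centroid)
    normal = vt[-1]
    distances = np.abs((pts - centroid) @ normal)
    return distances.max() <= max_dist and distances.var() <= max_var

# Example: four points of a (slightly noisy) flat patch.
print(lies_on_shared_plane([[0, 0, 0], [1, 0, 0.001], [0, 1, 0], [1, 1, 0.002]]))
```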
- the method comprises: identifying a normal for each of the points; and determining that the plurality of points lie on a shared plane in dependence on one or more of the identified normals, preferably based on a variance of the normals.
- the method comprises determining that the plurality of points lie on a curved plane.
- a threshold associated with the determination that the plurality of points lie on a plane is dependent on a distance of the points from a viewing zone associated with the three-dimensional representation.
- identifying the plurality of points comprises identifying a plurality of adjacent points (e.g. points in adjacent angular brackets) and/or a plurality of points in an 8x8 arrangement.
- the three-dimensional representation is associated with a viewing zone.
- the three-dimensional representation is associated with a plurality of capture devices and/or a plurality of the points of the three-dimensional representation are associated with different capture devices.
- the method comprises determining a size of the texture point, the size being based on a number and/or an arrangement of the identified points.
- identifying the plurality of points comprises determining a plurality of pluralities of points of the three-dimensional representation and, for each identified plurality of points: determining whether the plurality of points lie on a shared plane; in dependence on the plurality of points lying on a shared plane, determining a texture patch based on attributes of the plurality of points; and determining a texture point, the texture point comprising a reference to the texture patch.
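- The per-block procedure described above might, purely as a sketch, look like the following; the 8x8 walk over a capture grid, the array shapes, and the fields of the texture point are assumptions, and `planarity_test` stands in for a shared-plane check such as the one sketched earlier.

```python
import numpy as np

def encode_blocks(positions, colours, planarity_test, block=8):
    """Illustrative encoder pass (a sketch, not the claimed implementation):
    walk the capture grid in 8x8 blocks; where a block of points is planar,
    store its colours as a texture patch and emit a single texture point that
    references the patch. `positions` is assumed to be (H, W, 3) and `colours`
    (H, W, 3); `planarity_test` takes an (N, 3) array and returns a bool."""
    patches, texture_points, kept_points = [], [], []
    height, width = positions.shape[:2]
    for y in range(0, height - block + 1, block):
        for x in range(0, width - block + 1, block):
            block_pos = positions[y:y + block, x:x + block].reshape(-1, 3)
            block_col = colours[y:y + block, x:x + block]
            if planarity_test(block_pos):
                patches.append(block_col.copy())            # the texture patch
                texture_points.append({
                    "location": block_pos.mean(axis=0),     # representative location
                    "size": (block, block),                 # extent covered by the patch
                    "patch_index": len(patches) - 1,        # reference to the patch
                })
            else:
                kept_points.append((block_pos, block_col))  # keep these points as-is
    return patches, texture_points, kept_points
```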
- the method comprises defining a transparency value of the texture point so as to signal the texture point as being opaque.
- the method comprises determining a motion vector for each corner of the texture point.
- a method of determining an attribute of a point of a three-dimensional representation of a scene comprising: identifying, in the point, a reference to a texture patch, the texture patch being associated with a plurality of attributes; and determining the attribute of the point based on the attributes of the texture patch.
- determining the attribute comprises determining a plurality of attributes associated with the point.
- the method comprises determining attribute values for a plurality of locations of the representation based on the attributes of the texture patch.
- the method comprises determining attributes for a plurality of adjacent angular brackets associated with a capture device of the representation based on the attributes of the texture patch.
- the method comprises determining the arrangement of the attributes in dependence on the texture patch.
- the method comprises identifying the point based on a size of the point.
- the method comprises identifying the point based on the location of the point in a file associated with the three-dimensional representation.
- the method comprises identifying the point based on a transparency value of the point.
- the method comprises identifying the point based on: the point being in a section of a file associated with transparent points; and the point having a transparency value that signals the point as being opaque.
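- Purely as a hedged example, the decoder-side expansion described above could be sketched as follows; the dictionary field names (such as `patch_index`) and the way attributes are fanned out over azimuth/elevation offsets are assumptions, not the claimed format.

```python
def expand_texture_points(points, patch_db):
    """Illustrative decoder-side expansion (a sketch, not the claimed method):
    a point carrying a 'patch_index' is treated as a texture point, and the
    attributes of the referenced patch are fanned out to the locations it
    covers. The field names are assumptions for this example."""
    expanded = []
    for point in points:
        patch_index = point.get("patch_index")
        if patch_index is None:
            expanded.append(point)                 # ordinary point: keep as-is
            continue
        patch = patch_db[patch_index]              # rows x cols of attribute values
        rows, cols = len(patch), len(patch[0])
        for r in range(rows):
            for c in range(cols):
                expanded.append({
                    "azimuth_offset": c,           # arrangement follows the patch
                    "elevation_offset": r,
                    "colour": patch[r][c],
                    "location": point["location"],
                })
    return expanded

# Example usage with a single 2x2 patch of grey colours.
patch_db = [[[(128, 128, 128), (130, 130, 130)],
             [(126, 126, 126), (128, 128, 128)]]]
points = [{"location": (1.0, 2.0, 3.0), "patch_index": 0}]
print(len(expand_texture_points(points, patch_db)))  # 4 expanded attribute entries
```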
- the three-dimensional representation is associated with a viewing zone.
- the viewing zone has a volume of less than 50% of the volume of the scene, less than 20% of the volume of the scene, and/or less than 10% of the volume of the scene.
- the viewing zone has, or is associated with, a volume, preferably a real-world volume, of less than five cubic metres (5m3), less than one cubic metre (1 m3), less than one-tenth of a cubic metre (0.1 m3) and/or less than one-hundredth of a cubic metre (0.01 m3).
- the three-dimensional representation comprises a point cloud.
- the method comprises storing the three-dimensional representation and/or outputting the three-dimensional representation.
- the method comprises outputting the three-dimensional representation to a further computer device.
- the method comprises generating an image and/or a video based on the three-dimensional representation.
- the method comprises forming one or more two-dimensional representations of the scene based on the three-dimensional representation.
- the method comprises forming a two-dimensional representation for each eye of a viewer.
- the point is associated with one or more of: a location; an attribute; a transparency; a colour; and a size.
- the point is associated with an attribute for a right eye and an attribute for a left eye.
- the scene comprises one or more of: an extended reality (XR) scene; a virtual reality (VR) scene; an augmented reality (AR) scene; and a mixed reality (MR) scene.
- the method comprises forming a bitstream that includes the texture point and/or the texture patch.
- a system for carrying out the aforesaid method comprising one or more of: a processor; a communication interface; and a display.
- an apparatus for determining a point of a three-dimensional representation of a scene comprising: means for (e.g. a processor for) identifying a plurality of points of the representation; means for (e.g. a processor for) determining that the plurality of points lie on a shared plane; means for (e.g. a processor for) in dependence on the plurality of points lying on a shared plane, determining a texture patch based on attributes of the plurality of points; and means for (e.g. a processor for) determining a texture point, the texture point comprising a reference to the texture patch.
- an apparatus for determining an attribute of a point of a three-dimensional representation of a scene comprising: means for (e.g. a processor for) identifying, in the point, a reference to a texture patch, the texture patch being associated with a plurality of attributes; and means for (e.g. a processor for) determining the attribute of the point based on the attributes of the texture patch.
- a bitstream comprising one or more texture points and/or texture patches determined using the aforesaid method.
- a bitstream comprising a texture point, the texture point comprising a reference to a texture patch that comprises attribute values associated with the texture point.
- an apparatus (e.g. an encoder) for forming and/or encoding the aforesaid bitstream.
- an apparatus (e.g. a decoder) for receiving and/or decoding the aforesaid bitstream.
- Any apparatus feature as described herein may also be provided as a method feature, and vice versa.
- means plus function features may be expressed alternatively in terms of their corresponding structure, such as a suitably programmed processor and associated memory.
- the disclosure also provides a computer program and a computer program product comprising software code adapted, when executed on a data processing apparatus, to perform any of the methods described herein, including any or all of their component steps.
- the disclosure also provides a computer program and a computer program product comprising software code which, when executed on a data processing apparatus, comprises any of the apparatus features described herein.
- the disclosure also provides a computer program and a computer program product having an operating system which supports a computer program for carrying out any of the methods described herein and/or for embodying any of the apparatus features described herein.
- the disclosure also provides a computer readable medium having stored thereon the computer program as aforesaid.
- the disclosure also provides a signal carrying the computer program as aforesaid, and a method of transmitting such a signal.
- Figure 1 shows a system for generating a sequence of images.
- Figure 2 shows a computer device on which components of the system of Figure 1 may be implemented.
- Figure 3 shows a method of determining a three-dimensional representation of a scene.
- Figures 4a and 4b show a method of determining a point based on a plurality of sub-points.
- Figure 5 shows a scene comprising a viewing zone.
- Figures 6a and 6b show arrangements of capture devices for determining points of the three-dimensional representation.
- Figure 7 shows a point that can be captured by a plurality of capture devices.
- Figures 8a and 8b show grids formed by the different capture devices.
- Figure 9 describes a method of determining a location of a point of the three-dimensional representation.
- Figure 10 shows a method of determining an angle of a point from a capture device used to capture the point.
- Figure 11 shows a method of determining a texture patch associated with a point of the three-dimensional representation.
- Figures 12a, 12b, 12c, and 12d illustrate the determination and use of a texture patch.
- Figures 13 and 14 show detailed methods of determining texture patches.
- Referring to Figure 1, there is shown a system for generating a sequence of images.
- This system can be used to generate, and then display, a representation of an environment, which may comprise a VR environment (or an XR environment).
- the system comprises an image generator 11, an encoder 12, a transmitter 13, a network 14, a receiver 15, a decoder 16 and a display device 17.
- these components may each be implemented on separate apparatuses. Equally, various combinations of these components may be implemented on a shared apparatus; for example, the image generator 11, the encoder 12, and the transmitter 13 may all be part of a single image data generation device. Similarly, the receiver 15, the decoder 16, and the display device 17 may all be a part of a single image rendering device.
- the system comprises at least one encoding computer device (e.g. a server of a content provider) and at least one rendering computer device (e.g. a VR headset).
- each of the components, and in particular the image generator 11, the encoder 12, the transmitter 13, the receiver 15, the decoder 16 and the display device 17, is typically implemented on a computer device 20, where, as described above, a plurality of these components may be implemented on a shared computer device.
- Each computer device comprises one or more of: a processor 21 for executing instructions (e.g. so as to perform one or more of the steps of the various methods described below); a communication interface 22 for facilitating communication between computer devices (e.g. an ethernet interface, a Bluetooth® interface, or a universal serial bus (USB) interface); a memory 23 and/or storage 24 for storing information and instructions (e.g. a random access memory (RAM), a read only memory (ROM), a hard disk drive (HDD), a solid state drive (SSD), and/or a flash memory); and a user interface 25 (e.g. a display, a mouse, and/or a keyboard) for enabling a user to interact with the computer device.
- the computer device 20 may comprise further (or fewer) components.
- the computer device (e.g. the display device 17) may comprise one or more sensors, such as an accelerometer, a GPS sensor, or a light sensor. These sensors typically enable the computer device to identify an environmental condition and/or an action of a wearer of the display device.
- the image generator 11 is configured to generate a sequence of image data (e.g. a sequence of image frames) to enable the display device 17 to use this image data to display a plurality of images.
- the image data may comprise one or more digital objects and the image data may be generated or encoded in any format.
- the image data may comprise point cloud data, where each point has a 3D position and one or more attributes. These attributes may, for example, include a surface colour, a transparency value, an object size and a surface normal direction. Each attribute may have a value chosen from a continuous range or may have a value chosen from a discrete set.
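- As an illustrative sketch only, a point carrying the attributes mentioned above might be represented as follows; the field names, types, and defaults are assumptions rather than the format used by the image generator 11.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Point:
    """Minimal sketch of a point-cloud entry with the attributes mentioned
    above; the field names and defaults are illustrative assumptions."""
    position: Tuple[float, float, float]
    colour: Tuple[int, int, int] = (0, 0, 0)       # surface colour (e.g. RGB)
    transparency: float = 0.0                      # 0.0 = opaque, 1.0 = fully transparent
    size: Tuple[int, int] = (1, 1)                 # width x height in capture-grid units
    normal: Tuple[float, float, float] = (0.0, 0.0, 1.0)  # surface normal direction

p = Point(position=(1.0, 2.0, 3.0), colour=(200, 120, 40))
```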
- the image data enables the later rendering of images. This image data may enable a direct rendering (e.g. the image data may directly represent an image). Equally, the image data may require further processing in order to enable rendering.
- the image data may comprise three-dimensional point cloud data, where rendering a two-dimensional image using this data requires processing based on a viewpoint of this two-dimensional image.
- the image data may comprise depth map data, where one or more pixels or objects in the image is associated with a depth that is specified by the depth map data.
- the depth map data may be provided as a depth map layer, separate from an image layer.
- the image layer may instead be described as a texture layer.
- the depth map layer may instead be described as a geometry layer.
- the image data may include a predicted display window location.
- the predicted display window location may indicate a portion of an image that is likely to be displayed by the display device 17.
- the predicted display window location may be based on a viewing position (such as a virtual position and/or orientation of the user in a 3D environment) of the user, where this viewing position may be obtained from the display device.
- the predicted display window location may be defined using one or more coordinates. For example, the predicted display window location may be defined using the coordinates of a corner or center of a predicted display window, and may be defined using a size of the predicted display window.
- the predicted display window location may be encoded as part of metadata included with the frame.
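- A minimal sketch of such metadata is shown below; the key names and example values are assumptions, and the actual encoding of the predicted display window location may differ.

```python
# Illustrative per-frame metadata carrying a predicted display window
# (corner coordinates plus a size); the keys are assumptions for this sketch.
frame_metadata = {
    "predicted_display_window": {
        "corner": (512, 256),   # top-left corner in frame pixel coordinates
        "size": (1920, 1080),   # width and height of the predicted window
    },
    "viewing_position": {
        "position": (0.0, 1.6, 0.0),      # virtual position of the user
        "orientation": (0.0, 0.0, 0.0),   # e.g. yaw, pitch, roll in degrees
    },
}
```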
- the image data for each image may include further information, which may be provided as a part of an image, e.g. as part of the point cloud data, or as separate layers.
- the image data may include audio information or haptic feedback information indicating audio or haptics which can accompany displayed visual data.
- An audio layer or haptic layer may accompany each image, and may be omitted for images where no accompanying audio or haptics are required.
- the image data may comprise interactivity information, where the image data may contain or indicate elements with which a user can interact.
- the interactivity information may, for example, define a behaviour of an element, where a user is able to interact with the element based on this behaviour.
- the behaviour typically defines a change in an element that occurs as a result of a user interaction where this change may comprise a change in the attributes of the element or in the rendering of the element.
- the target element may be arranged to disappear when a user interacts with this element, or to provide feedback indicating that the user has interacted with the target.
- This interactivity data may be provided as part of, or separately to, the image data.
- the image data may indicate, or may be combinable with, a state of the virtual environment, a position of a user, or a viewing direction of the user.
- the position and viewing direction may be physical properties of the user in the real world, or they may be purely virtual, for example being controlled using a handheld controller.
- the image generator 11 may, for example, obtain information from the display device 17 that indicates the position, viewing direction, or motion of the user. Equally, the image generator may generate image data such that it can later be combined with this position, viewing direction, or motion, where the image generator may generate a full scene which is only partially viewed by a user depending on the position of that user.
- the generated image may be independent of user position and viewing direction.
- This type of image generation typically requires significant computer resources, such as a powerful GPU, and may be implemented in a cloud service (e.g. a cloud rendering service) or on a local but powerful computer.
- rendering refers at least to an initial stage of rendering to generate an image. Further rendering may occur at the display device 17 based on the generated image to produce a final image which is displayed.
- the image generator 11 may, for example, comprise a rendering engine for initially rendering a virtual environment such as a game or a virtual meeting room.
- the encoder 12 is configured to encode frames to be transmitted to the display device 17.
- the encoder may be implemented using executable software or may be implemented on specific hardware such as an ASIC.
- the image generator 11 may transmit raw, unencoded, data through the network 14. However, such transmission typically leads to a high file size and requires a high bandwidth, so it is typically desirable to encode the data prior to transmission.
- the encoder 12 may encode the image data in a lossless manner or may encode the data in a lossy manner.
- the encoder may apply inter-frame or intra-frame compression based on a currently-encoded frame and optionally one or more previously encoded frames.
- the encoder may be a multi-layer encoder, such as a low complexity enhancement video codec (LCEVC) enabled encoder.
- the encoder 12 may perform layered encoding on each instance of image data (e.g. each frame) to generate an encoded frame comprising a base depth map layer and an enhancement depth map layer. Encoding a depth map in this way may improve compression.
- depth maps are desirably highly detailed with a bit depth of up to twelve or fourteen bits, which is a significant increase in the data to be transmitted.
- providing ways to improve compression of the depth map can make more realistic depth map-based displays viable when performing rendering or transmission of rendered data in real-time.
- this type of layered encoding makes it easy to drop (and then pick back up) one or more of the layers, which provides flexibility and tools for bandwidth management.
- Layered encoding is also helpful as the final decoder/user device (such as a user display device) can choose whether to process these extra layers.
- without this flexibility, the best the end device (i.e. the receiver, decoder or display device associated with a user that will view the images) can do is to signal to the controller/renderer/encoder that it does not have enough resources.
- the controller will then send future images at a lower quality.
- the end device still unfortunately has to process the higher quality data until the lower quality data arrives, if it can process the received images at all.
- this situation is improved upon because when/if the end device determines for example that it does not have the processing capabilities to handle the highest level of quality, then it can drop and/or choose not to process certain layers.
- the end device may also signal to the controller that it needs a lower level of quality, but in the meantime the end device can only process the number of layers that it can handle. Therefore, the end device can react to conditions much more quickly.
- depth map data may be embedded in image data.
- the base depth map layer may be a base image layer with embedded depth map data
- the enhancement depth map layer may be an enhancement image layer with embedded depth map data.
- the encoded depth map layers may be separate from the encoded image layers. This has the advantage that the encoded depth map layers can be dropped under some conditions while still retaining image layers that can be displayed (albeit with a lower level of realism).
- the encoded depth map layers can be dropped by a transmitter or encoder when available communication resources are reduced, or can be dropped by an end device which lacks the processing resources to handle the highest level of quality.
- where some images comprise an audio base layer, a haptic feedback base layer, an audio enhancement layer, or a haptic feedback enhancement layer, these can be processed or dropped flexibly.
- where some images comprise an interactivity data base layer or an interactivity enhancement layer, these can be processed or dropped flexibly. For example, certain interactions may only be possible where a threshold bandwidth is available, where complex interactions (e.g. those enabling a conversation with a digital object) may be disabled before less complex interactions (e.g. changing a pixel colour) are disabled.
- the encoder may apply a point cloud data encoding technique such as described in European patent application EP21386059.6, which is incorporated herein by reference.
- a point cloud encoder may act as a base encoder for a layered encoding technique such as LCEVC or VC-6.
- LCEVC and VC-6 techniques encode and decode a layered signal, but are agnostic about the content type of data encoded in the signal.
- the signal can include textures, video frames, geometry or depth data, meshes, point clouds, rendering attributes or physics engine attributes.
- the transmitter 13 may be any known type of transmitter for wired or wireless communications, including an Ethernet transmitter or a Bluetooth transmitter.
- the transmitter 13 may be configured to make decisions about how to transmit the image data, and/or may provide feedback to the encoder 12 or the image generator 11.
- the transmitter may determine available communication resources (e.g. bandwidth) for transmitting image data, and may drop one or more layers from an encoded frame, or indicate to the image generator and/or encoder that image data should be generated and encoded with fewer layers, when insufficient bandwidth is available for transmission of all generated data.
- the transmitter may be configured to drop a depth map layer, an LCEVC enhancement layer, or a VC-6 enhancement layer from a frame when insufficient communication resources are available.
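- As an illustration only, one possible layer-dropping policy at the transmitter 13 is sketched below; the layer names, bitrates, priorities, and the greedy budget check are all assumptions for the example.

```python
def select_layers(layers, available_bitrate):
    """Illustrative layer-dropping policy (an assumption, not the claimed one):
    always keep the base layer, then add optional layers (e.g. depth map or
    LCEVC/VC-6 enhancement layers) in priority order while the budget allows."""
    chosen, used = [], 0
    for layer in sorted(layers, key=lambda l: l["priority"]):
        if layer["priority"] == 0 or used + layer["bitrate"] <= available_bitrate:
            chosen.append(layer["name"])
            used += layer["bitrate"]
    return chosen

layers = [
    {"name": "base_image", "priority": 0, "bitrate": 6_000_000},
    {"name": "base_depth_map", "priority": 1, "bitrate": 2_000_000},
    {"name": "enhancement_image", "priority": 2, "bitrate": 4_000_000},
    {"name": "enhancement_depth_map", "priority": 3, "bitrate": 3_000_000},
]
print(select_layers(layers, available_bitrate=10_000_000))
# -> ['base_image', 'base_depth_map']: enhancement layers are dropped.
```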
- the network 14 provides a channel for communication between the transmitter 13 and the receiver 15, and may be any known type of network such as a WAN or LAN or a wireless Wi-Fi or Bluetooth network.
- the network may further be a composite of several networks of different types. Many users only have access to a network with a bandwidth of 30MBps which can lead to latency jitter when streaming. The required bandwidth and the observed latency can be reduced by means of tactics such as forward-looking rendering and last-millisecond reprojection, which are enabled by improved compression.
- the receiver 15 may be any known type of receiver for wired or wireless communications, including an Ethernet receiver or a Bluetooth receiver.
- the decoder 16 is configured to receive and decode an encoded frame.
- the decoder may be implemented using executable software or may be implemented on specific hardware such as an ASIC.
- the display device 17 may for example be a television screen or a VR headset.
- the timing of the display may be linked to a configured frame rate, such that the display device may wait before displaying the image.
- the display device may be configured to perform warping, that is, to obtain a final display window location, adjust a warpable image to obtain a final image corresponding to a final viewing direction of the user, and display the final image.
- the image data is typically arranged to provide a warpable image for which a portion of the image that is displayed at the display device 17 is dependent on a position or orientation of a viewer.
- the warpable image may then be rendered before a most up to date viewing direction of the user is known.
- the warpable image may be transmitted to the display device, or the warpable image may be transmitted to a rendering node which is near to the display device, and the display device or rendering node may perform time warping to generate a displayed image portion based on the warpable image and the most up to date viewing direction of the user.
- a single device may provide a plurality of the described components.
- a first rendering node may comprise the image generator 11, encoder 12 and transmitter 13. Additional similar rendering nodes may be included in the system, and may work together to generate the sequence of frames.
- multiple rendering nodes may each provide separate image data to an image data assembling node; for example, each rendering node may provide a part of a sequence of frames to a frame assembling node.
- the receiver 15, decoder 16 or display device 17 may be configured to assemble parts of image data from multiple sources to generate a sequence of images for display on the display device.
- the image data assembling node may be separate from the receiver 15, decoder 16 and display device 17.
- multiple rendering nodes may be chained.
- successive rendering nodes may add to a sequence of image data as it passes from rendering node to rendering node, and eventually a complete sequence of image data is then provided to the receiver 15.
- each rendering node may obtain components of a render from multiple upstream rendering nodes and/or distribute components of a render to multiple downstream rendering nodes.
- a chain of rendering nodes may be useful for performing different rendering tasks that require different quantities of processing resources, or different frame rates.
- a company may provide distributed processing in the form of a centralised hub which has abundant processing resources but is distant from users, and peripheral locations which have more scarce processing resources but are closer to users.
- Expensive but fairly static rendering features such as background lighting or environmental impact on sound may be generated at the central hub (for example using ray tracing), while features that require fewer resources but faster responses or higher frame rates may be generated closer to the user.
- the more responsive a rendering feature needs to be, the lower the latency required between the rendering node which generates the feature and the user display; in a chain of rendering nodes, the node which generates each rendering feature can therefore be chosen based on a required maximum latency of that feature.
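- The following sketch illustrates one way such an assignment could be made; the node names, latency figures, and feature budgets are invented example values, and the rule (pick the furthest node up the chain that still meets the feature's latency budget) is an assumption.

```python
# Illustrative assignment of rendering features to nodes in a chain, based on
# each feature's maximum tolerable latency; all figures are example assumptions.
node_latency_ms = {"central_hub": 80, "regional_edge": 25, "headset": 5}

feature_budget_ms = {
    "background_lighting": 200,        # expensive but static: tolerates high latency
    "character_animation": 40,
    "head_tracked_reprojection": 10,   # must be generated very close to the user
}

def assign_node(feature):
    """Pick the node furthest up the chain (assumed to have the most processing
    headroom, i.e. the highest latency) that still meets the feature's budget."""
    candidates = [n for n, lat in node_latency_ms.items()
                  if lat <= feature_budget_ms[feature]]
    return max(candidates, key=lambda n: node_latency_ms[n])

print({f: assign_node(f) for f in feature_budget_ms})
```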
- a set of surfaces may be constructed where each surface has different sound reflection and absorption properties depending upon material and shape.
- the frame rates may be matched by creating multiple frames with features generated at the lower frame rate, and combining them with the frames with features generated at the higher frame rate.
- a preliminary rendering generates volumetric object data including motion vectors at a first (lowest) frame rate, then produces 2D rendered frames plus depth information for a specific user at a second (higher) frame rate, then transmits video plus depth data to the user device, which produces final frames for display via space warping (depth-based reprojections) at a third (highest) frame rate.
- One or more of these steps may be performed in combination with the other described embodiments.
- the viewing position of the user may change as additional rendering tasks are performed at different rendering nodes in the chain. Each or any rendering node may obtain an updated viewing position before performing its respective rendering task.
- the system may simultaneously generate multiple sequences of image data for different respective users or different respective display devices.
- each user or display device may view a different 3D environment, or may view different parts of a same 3D environment.
- each node may serve multiple users or just one user.
- a starting rendering node may serve a large group of users.
- the group of users may be viewing nearby parts of a same 3D environment.
- the starting node may render a wide zone of view (“field of view”) which is relevant for all users in the large group.
- the starting node may send this wide field of view to a first middle rendering node which renders additional aspects of the 3D environment. These additional aspects may for example be aspects which require less processing power to render, or may be aspects which are specific to individual users of the group. Additionally, the middle rendering node may render features in a smaller field of view than the starting node - this smaller field of view may be relevant to each user rather than the group of users.
- the first middle rendering node may additionally only serve a smaller number of users (e.g. half of the large group of users), with the remaining users being served by a second middle rendering node which also receives the wide field of view from the starting node.
- the middle rendering node(s) may then send sequences of second partially or fully rendered frames to an end device for each user.
- the end device may perform further processes such as warping or focal distance adjustments, optionally using depth map data.
- each rendering node encodes the partially or fully rendered frames before transmitting them on to a next rendering node or to the receiver 15.
- the required communication resources can be reduced when the rendering nodes are separated by one or more networks, or more generally are implemented in a distributed system such as a cloud.
- each rendering node in a chain is encoding a different partially or fully rendered frame, with different data. Therefore, it may be advantageous for different rendering nodes to use different rendering formats and/or encoding formats.
- the output from a first rendering node may be point cloud data which logically describes a 3D scene. This point cloud data can be encoded using the techniques of EP21386059.6.
- a second rendering node may then operate on the point cloud data to generate image data that is more readily displayed by a generic display device, without requiring the display device to model the 3D environment. This image data may be encoded using video coding techniques.
- the chaining of rendering nodes may be extended to arbitrary tree structures, where a rendering node obtains partially rendered frames from more than one preceding rendering node, and generates further partially or fully rendered frames based on the multiple obtained sequences of partially rendered frames.
- a content rendering network comprising numerous rendering nodes may be used to serve a volumetric event to a large number of same-time users, such as users participating in a shared virtual environment. Rendering the same event for each user is far more expensive in terms of computation time and power consumption than rendering the volumetric effect once and performing the rendering equivalent of multicasting the volumetric effect for multiple users.
- each user may have a second rendering node (such as a VR headset), and the network may comprise a central first rendering node.
- the first rendering node may render the volumetric event, and distribute partially rendered frames depicting the volumetric event to the different second rendering nodes.
- the second rendering node for each user may then integrate the partially rendered frames depicting the volumetric event into a view of the virtual environment which is currently being shown to each user, based on parameters such as the user’s virtual position.
- the receiver 15, decoder 16 and display device 17 may be consolidated into a single device, or may be separated into two or more devices.
- some VR headset systems comprise a base unit and a headset unit which communicate with each other.
- the receiver 15 and decoder 16 may be incorporated into such a base unit.
- a home display system may comprise a base unit configured as an image source, and a portable display unit comprising the display device 17.
- the receiver 15 or another transmitter associated with the decoder or display device may send a corresponding layer drop indication back through the network 14.
- the layer drop indication may be received by each rendering node.
- a rendering node which generates partially or fully rendered frames for that specific decoder or display device may cease generating the dropped layer.
- a rendering node which generates partially or fully rendered frames for multiple end devices may disregard a layer drop indication received from one end device (as the dropped layer is still needed for other devices).
- rendering nodes which serve multiple end devices may record received layer drop indications, and may cease generating the dropped layer only when all end devices served by the rendering node indicate that the layer is to be dropped.
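- A small sketch of this bookkeeping is given below; the class and method names are assumptions, and the rule (stop generating a layer only once every served end device has indicated a drop) follows the behaviour described above.

```python
class LayerDropTracker:
    """Sketch (not the claimed implementation) of a rendering node that records
    layer-drop indications per end device and only stops generating a layer
    once every device it serves has indicated that the layer can be dropped."""

    def __init__(self, served_devices):
        self.served = set(served_devices)
        self.drops = {}                  # layer name -> set of devices that dropped it

    def record_drop(self, device, layer):
        self.drops.setdefault(layer, set()).add(device)

    def should_generate(self, layer):
        # Keep generating the layer while any served device still needs it.
        return self.drops.get(layer, set()) != self.served

tracker = LayerDropTracker(["headset_a", "headset_b"])
tracker.record_drop("headset_a", "enhancement_depth_map")
print(tracker.should_generate("enhancement_depth_map"))  # True: headset_b still needs it
tracker.record_drop("headset_b", "enhancement_depth_map")
print(tracker.should_generate("enhancement_depth_map"))  # False: all devices dropped it
```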
- the encoders or decoders are part of a tier-based hierarchical coding scheme or format.
- Hierarchical coding enables frames to be communicated with higher resolution and/or higher frame rate than is possible in single-tier coding schemes.
- one or more enhancement layers are communicated with base data, where the enhancement layers can be used to up-sample the base data at the decoder, for example providing up-sampling in a spatial or temporal dimension.
- hierarchical coding can overall provide lossless compression of data, with higher resolution and/or higher frame rate for a given transmission bit rate.
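- Conceptually (and only as a sketch, not the LCEVC or VC-6 specification), tier-based reconstruction combines an up-sampled base picture with enhancement-layer residuals; nearest-neighbour up-sampling and integer residuals are simplifying assumptions here.

```python
import numpy as np

def reconstruct_tiered(base, residuals, scale=2):
    """Conceptual sketch of tier-based reconstruction: up-sample the decoded
    base picture and add the enhancement-layer residuals. Nearest-neighbour
    up-sampling is a simplifying assumption, not the standardised filter."""
    upsampled = np.repeat(np.repeat(base, scale, axis=0), scale, axis=1)
    return upsampled + residuals

base = np.full((2, 2), 100, dtype=np.int32)   # low-resolution base layer
residuals = np.zeros((4, 4), dtype=np.int32)  # enhancement-layer residuals
residuals[0, 0] = 5                           # one corrected detail
print(reconstruct_tiered(base, residuals))
```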
- Examples of a tier-based hierarchical coding scheme include LCEVC: MPEG-5 Part 2 LCEVC (“Low Complexity Enhancement Video Coding”) and VC-6: SMPTE VC-6 ST-2117, the former being described in PCT/GB2020/050695, published as WO 2020/188273, (and the associated standard document) and the latter being described in PCT/GB2018/053552, published as WO 2019/111010, (and the associated standard document), all of which are incorporated by reference herein.
- A further example is described in WO2018/046940, which is incorporated by reference herein.
- a set of residuals are encoded relative to the residuals stored in a temporal buffer.
- Low-Complexity Enhancement Video Coding is a standardised coding method set out in standard specification documents including the Text of ISO/IEC 23094-2 Ed 1 Low Complexity Enhancement Video Coding published in November 2021, which is incorporated by reference herein.
- the system described above is suitable for generating and presenting a representation of a scene, where this scene displays media content to a user.
- the scene typically comprises an environment, where the user is able to move (e.g. to move their head or to turn their head) to look around the environment and/or to move around the environment.
- the scene may be a scene of a room in a building, where the user is able to move around the room (e.g. by moving in the real-world and/or by providing an input to a user interface) in order to inspect various parts of the room.
- the scene is an XR (e.g. a VR) scene, where the user is able to move about the scene in three degrees of freedom (3DoF) or six degrees of freedom (6DoF) so as to experience the scene.
- the image generator 11 may be arranged to determine point cloud data, where each point of the point cloud has a 3D position and one or more attributes. More generally, the image generator (or another component) is arranged to determine a three-dimensional representation of a scene, where this three-dimensional representation is thereafter used to generate two-dimensional images that are presented to a user at the display device 17. While the points are typically points of a point cloud, more generally the disclosure extends to any point that is associated with a location and a value.
- the points may, more generally, be considered to be data (or datapoints), which data is associated with a location and a value, and the ‘points’ may comprise polygons, planes (regular or irregular), Gaussian splats, etc.
- the method comprises determining the attribute using a capture device, such as a camera or a scanner.
- the scene may comprise a real scene, in which attribute values are captured using a camera, or a virtual scene (e.g. a three-dimensional model of a scene), in which attribute values are captured using a virtual scanner.
- where reference is made to determining a point, it will be understood that this generally refers to determining a point that has a location and an attribute value, where determining the point comprises determining the attribute value and/or storing a point that comprises at least an attribute value and a location value (these values may be indirect values, e.g. where the location is identified relative to another point).
- these points can be stored as a three-dimensional representation (e.g. a point cloud) so as to enable the reconstruction of the three-dimensional scene based on this representation.
- the scene comprises a simulated scene that exists only on a computer.
- a scene may, for example, be generated using software such as the Maya software produced by Autodesk®.
- the attributes determined using the methods described herein may then depend on virtual objects located within the scene as well as a virtual lighting arrangement used in the scene.
- a computer device initiates a capture process for a capture device, the capture process being initiated with an initial azimuth angle (e.g. of 0°) and an initial elevation angle (e.g. of 0°).
- the computer device causes a point to be captured using the capture device at the current azimuth angle and current elevation angle.
- Capturing a point typically comprises assigning an attribute value to the point, which attribute value may, for example, be a color of the point and/or a transparency value of the point.
- the point has one or more color values associated with each of a left eye and a right eye of a viewer.
- Capturing the point may also comprise determining a normal value associated with the point, e.g. a normal of a surface on which the point lies.
- capturing the point further comprises determining a location of the point, e.g. by determining a distance of the point from the camera.
- determining the point may comprise sending a ‘ray’ from the capture device and then stepping through a computer model to determine which surface of the computer model is impacted by the ray. The color, transparency, and normal of this surface are then recorded alongside the distance of the surface from the capture device.
- in a third step 33, the computer device determines whether a point has been captured for the capture device at each azimuth of a range of azimuths and, in a fourth step 34, if points have not been captured at each azimuth, then the azimuth angle is incremented and the method returns to the second step 32 and another point is captured.
- the azimuth angle may, for example, be incremented by between 0.01° and 1° and/or by between 0.025° and 0.1°.
- the range of azimuth angles is selected to be 360° (i.e. so that the capture device captures points surrounding the entirety of the capture device), but it will be appreciated that other ranges are possible.
- in a fifth step 35, the computer device determines whether a point has been captured for the capture device at each elevation of a range of elevations and, in a sixth step 36, if points have not been captured at each elevation, then the azimuth angle is reset to the initial value, the elevation angle is incremented, and the method returns to the second step 32 and another point is captured.
- the elevation angles may, for example, be incremented by between 0.01° and 1° and/or by between 0.025° and 0.1°.
- the range of elevation angles is selected to be 360° (i.e. so that the capture device captures points surrounding the entirety of the capture device), but it will be appreciated that other ranges are possible.
- in a seventh step 37, once points have been captured for each azimuth angle and each elevation angle, the scanning process ends.
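- Purely as a sketch of the scanning loop described with reference to Figure 3, the sweep over azimuth and elevation angles could be written as follows; the `capture_point` callable and its return values are assumptions, while the 360° ranges and the angle increment follow the text above.

```python
def scan_scene(capture_point, increment_deg=0.1):
    """Sketch of the Figure 3 capture loop: sweep azimuth and elevation in
    fixed increments and record one point per direction. `capture_point` is an
    assumed callable returning (colour, transparency, normal, distance)."""
    points = {}
    elevation = 0.0
    while elevation < 360.0:                  # range of elevations (per the text)
        azimuth = 0.0
        while azimuth < 360.0:                # range of azimuths
            points[(azimuth, elevation)] = capture_point(azimuth, elevation)
            azimuth += increment_deg
        elevation += increment_deg
    return points

# Example with a coarse 30 degree increment and a dummy capture function.
dummy = lambda az, el: ((128, 128, 128), 0.0, (0.0, 0.0, 1.0), 2.5)
print(len(scan_scene(dummy, increment_deg=30.0)))   # 12 x 12 = 144 points
```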
- This method enables a capture device to capture points at a range of elevation and azimuth angles.
- This point data is typically stored in a matrix.
- the point data may then be used to provide a representation of the scene to a user, e.g. the three-dimensional representation formed by the point data may be processed to produce two-dimensional images for each eye of a user, with these images then being shown to a user via the display device 17 to provide a virtual reality experience to the viewer.
- a video can be provided to a viewer that enables the viewer to move their head to look around the scene (while remaining at the location of the capture device).
- the capture pattern (or scanning pattern) described with reference to Figure 3 is purely exemplary and that numerous capture patterns are possible.
- the capture process for each capture device comprises capturing one or more points at one or more azimuth angles and/or one or more elevation angles.
- the ‘points’ captured by the capture device are typically associated with a size, such as a height, a width, or a depth. That is, the points typically relate to two-dimensional planes/pixels and/or three-dimensional voxels. In this regard, there is necessarily some space between the locations of adjacent points (since if the points had no width, then an infinite number of points would be required to capture points at each angle).
- the size provides points that depict a non-negligible area of the three-dimensional space so that a plurality of points can be fit together to provide a depiction of the scene to a viewer.
- the size of each point is typically dependent on the distance of that point from the capture device, where more distant points have a larger width/height.
- the width and height of each point is typically determined so that when each point is displayed, there is no space between adjacent points (indeed, there may be some overlap between points to ensure that no gaps appear between points). This height/width of each point can be determined at the time of capturing the points, or can be determined or defined after the capture of the points.
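- As a worked illustration (an assumption, not the patent's sizing rule), one way to derive a point width from the capture distance and the angular increment is the footprint subtended by that increment, padded slightly so adjacent points overlap and leave no visible gaps.

```python
import math

def point_width(distance, increment_deg=0.1, overlap=1.1):
    """One possible sizing rule (an assumption, not the claimed one): the width
    of the footprint subtended by one angular increment at the given distance,
    padded slightly so adjacent points overlap and leave no visible gaps."""
    return overlap * distance * math.tan(math.radians(increment_deg))

for d in (1.0, 5.0, 20.0):
    print(f"distance {d:>5} m -> width ~{point_width(d) * 100:.2f} cm")
```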
- the points comprise a size value, which is stored as a part of the point data.
- the points may be stored with a width value and/or a height value.
- the minimum width and the minimum height of a point are set by the angle increment of the azimuth angle and the elevation angle respectively.
- the size may be then specified in terms of this angle increment and/or in terms of this minimum width/minimum height (e.g. as being a multiple of the angle increment).
- the size value is stored as an index, which index relates to a known list of sizes (e.g. the size may be any of 1x1, 2x1, 1x2, or 2x2 pixels; this may be specified by using 3 bits and a list that relates each combination of bits to a size).
- the size may be stored based on an underscan value.
- in this regard, where an object is very near to the viewing zone it may be captured using an unnecessarily dense arrangement of points. Therefore, certain surfaces or areas of the representation may be associated with an underscan value, which underscan value defines a reduction in the number of points captured as compared to a representation without underscan.
- the size of the points may be defined so as to indicate this underscan value.
- the underscan value is an integer value between 0 and 3 and the size is stored as a combination of point dimensions (e.g. a width in the range [0,2] and a height in the range [0,2]) and an underscan factor (e.g. an underscan factor in the range [0,3]).
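- A small sketch of packing these fields into a single value is shown below; the 2-bit fields and their ordering are assumptions that merely match the ranges given above.

```python
def pack_size(width, height, underscan):
    """Sketch of packing the size fields described above into one byte:
    2 bits each for width and height (values 0-2) and 2 bits for the underscan
    factor (values 0-3). The exact bit layout is an assumption."""
    assert 0 <= width <= 2 and 0 <= height <= 2 and 0 <= underscan <= 3
    return (width << 4) | (height << 2) | underscan

def unpack_size(packed):
    return (packed >> 4) & 0b11, (packed >> 2) & 0b11, packed & 0b11

packed = pack_size(width=2, height=1, underscan=3)
print(bin(packed), unpack_size(packed))   # 0b100111 (2, 1, 3)
```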
- a plurality of sub-points SP1, SP2, SP3, SP4, SP5 is determined.
- where the azimuth angle increment is 0.1°, then for an azimuth angle of 0°, sub-points may be determined at azimuth angles of -0.05°, -0.025°, 0°, 0.025°, and 0.05° (and similar sub-points may be determined for a plurality of elevation angles). Attribute values of these sub-points may then be combined to obtain an attribute value for the point.
- a maximum attribute value of the sub-points may be used as the value for the point
- an average attribute value of the sub-points may be used as the value for the point
- a weighted average of the sub-points may be used as the value for the point. It will be appreciated that numerous other methods for combining the attribute values of the sub-points are possible.
- the accuracy of the capture process can be increased. While it would be possible to simply reduce the increment of the angle steps to provide a higher resolution scene, by considering sub-points but only storing attributes for points, a balance can be struck between accuracy and file size (since storing every sub-point would lead to a substantial increase in the amount of data that needs storing).
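- The combination options listed above might, as a sketch, be implemented as follows; the scalar attribute values and the inverse-distance weighting are assumptions for the example.

```python
def combine_subpoints(colours, distances, mode="weighted"):
    """Sketch of combining sub-point attribute values into one point value,
    per the options above: maximum, plain average, or an average weighted so
    that nearer sub-points count for more (the 1/d weighting is an assumption)."""
    if mode == "max":
        return max(colours)
    if mode == "average":
        return sum(colours) / len(colours)
    weights = [1.0 / d for d in distances]
    return sum(w * c for w, c in zip(weights, colours)) / sum(weights)

# Five sub-point luminance samples at slightly different distances.
print(combine_subpoints([100, 110, 105, 200, 210], [2.0, 2.0, 2.1, 4.0, 4.1]))
```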
- this capture device may obtain attributes associated with each of the sub-points SP1, SP2, SP3, SP4, SP5, combine these attributes to obtain a point attribute, and then store a point with a distance that is an average (e.g. a weighted average) of the distances of the sub-points from the capture device, at the nominal angle of the point, with the point attribute.
- these points may have different distances from the location of the capture device.
- the attributes of the sub-points may be combined in dependence on this distance, e.g. so that sub-points nearer to the capture device have higher weightings.
- the possibility of sub-points with substantially different distances raises a potential problem.
- typically, the distances for the sub-points are averaged. But where the sub-points have substantially different distances and/or are related to different surfaces in the scene, this may result in the point having a distance that does not correspond to any actual surface in the scene. Therefore, the point may seem to hang in space (e.g. to hang between the front and rear surfaces shown in Figure 4b).
- the attribute value of the point may be substantially different to the attribute value of other points in the scene.
- the point may appear as a grey point hanging in space between these objects.
- the computer device is arranged to aggregate sub-points so as not to create any floating points. For example, the computer device may determine whether the sub-points are spatially coherent by employing a clustering algorithm (e.g. a k-means clustering algorithm). Where the sub-points are spatially coherent (e.g. where a difference in the distance of the sub-points is below a threshold value), these distances may be averaged to obtain a distance for the point.
- the sub-points may be processed to ensure that the distance of any point places it upon a surface; for example, in the system of Figure 4b, sub-points SP1, SP2, and SP3 may be grouped into a first point and sub-points SP4 and SP5 may be grouped into a second point. Since each sub-point is associated with the same capture device and capture angle (all of these sub-points being associated with a capture step that has a particular azimuth angle and elevation angle), these points may be located at the same angle with respect to a capture device.
- the first point (made up of sub-points SP1, SP2, and SP3) may have a smaller distance value than the second point (made up of sub-points SP4 and SP5) and the first point may be assigned a nonzero transparency value so that the second point can be seen through the first point.
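- As a hedged illustration of this grouping, the following sketch splits sub-points into surface groups whenever their sorted distances jump by more than a threshold; this simple gap test stands in for the clustering (e.g. k-means) mentioned above, and the threshold is an arbitrary example value.

```python
def group_subpoints(distances, gap_threshold=0.5):
    """Sketch of splitting sub-points into surface groups when their distances
    are not coherent, so that no aggregated point 'floats' between surfaces.
    A sorted gap test stands in for the clustering mentioned above."""
    ordered = sorted(range(len(distances)), key=lambda i: distances[i])
    groups, current = [], [ordered[0]]
    for prev, nxt in zip(ordered, ordered[1:]):
        if distances[nxt] - distances[prev] > gap_threshold:
            groups.append(current)
            current = []
        current.append(nxt)
    groups.append(current)
    return groups

# Sub-points SP1-SP3 on a near surface and SP4-SP5 on a far surface (Figure 4b).
print(group_subpoints([2.0, 2.05, 2.1, 5.0, 5.02]))   # [[0, 1, 2], [3, 4]]
```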
- by capturing points at a plurality of azimuth angles and elevation angles, e.g. using the method described with reference to Figure 3, it is possible to provide a three-dimensional representation of the scene that can later be used to enable a viewer to view the scene from a plurality of angles. More specifically, given the three-dimensional points captured by the capture device, a computer device is able to render a two-dimensional representation (e.g. a two-dimensional image) of the scene for each eye of a viewer so as to provide a representation with an impression of depth. The computer device may render a series of two-dimensional representations to enable the viewer to look around the scene, where the two-dimensional representations are rendered based on an orientation of the viewer's head. In this way, the determined representation is useable to provide, for example, a virtual reality (VR), mixed reality (MR), augmented reality (AR), and/or extended reality (XR) experience to the viewer.
- the display device 17 is typically a virtual reality headset that comprises a plurality of sensors to track a head movement of the user. By tracking this head movement, the display device is able to update the images being displayed to the viewer as the viewer moves their head to look about the scene. Typically, this involves the display device sending the sensor data to an external computer device (e.g. a computer connected to the display device via a wire).
- the external computer device may comprise powerful graphical processing units (GPUs) and/or computer processing units (CPUs) so that the external computer device is able to rapidly render appropriate two-dimensional images for the viewer based on the three-dimensional images and the sensor data.
- the external computer device may comprise a server device, where the display device 17 may be connected to this server device wirelessly.
- This enables the two-dimensional images to be streamed from the server to the display device so as to enable the display of high-quality images without the need for a viewer to purchase expensive computer equipment.
- operations that require large amounts of computing power such as the rendering of two-dimensional images based on the three- dimensional representation, may be performed by the server, so that the display device is only required to perform relatively simple operations. This enables the experience to be provided to a wide range of viewers.
- a first two-dimensional image is provided to the display device 17 (and/or a connected device) and this first image is ‘warped’ in order to provide an image for viewing at the display device.
- the warping of the image comprises processing the image based on the sensor data in order to provide an image that matches a current viewpoint of the viewer.
- the three-dimensional representation of the scene may be captured using a plurality of capture devices placed at different locations (or the same capture device placed at different locations). A viewer is then able to move around the scene translationally (e.g. by moving between these locations). More generally, by capturing points for every possible surface that might be viewed by a viewer, a three-dimensional representation of a scene may be captured that allows a suitable two-dimensional representation of this scene to be rendered regardless of a location of a viewer (e.g. regardless of where a user is standing within a virtual room).
- the three-dimensional representation may be associated with a viewing zone, or a zone of viewpoints (ZVP), where the three-dimensional representation is arranged to enable a user to move about the viewing zone so as to view the scene.
- Figure 5 illustrates such a viewing zone 1 and shows how the use of a viewing zone limits the amount of image data that needs to be stored to provide a three-dimensional representation of the scene.
- While Figure 5 shows a two-dimensional viewing zone, it will be appreciated that in practice the viewing zone 1 is typically a three-dimensional zone or volume.
- the viewing zone 1 may, for example, comprise a rectangular volume, or a rectangular parallelepiped, and the viewing zone may have a height of at least 30 cm, a depth of at least 30 cm, and/or a width of at least 30 cm, where these dimensions enable a user to move their head while remaining in the viewing zone.
- This is merely an exemplary arrangement of the viewing zone; it will be appreciated that viewing zones of various shapes and sizes may be used (e.g. spherical viewing zones). That being said, it is preferable that the viewing zone is limited so as to cover only a part of the volume of the scene, e.g. no more than 50% of the scene, no more than 25% of the scene, and/or no more than 10% of the scene.
- Otherwise, the three-dimensional representation will simply be a standard representation for virtual reality (that enables a user to move freely about the scene), and so the use of the viewing zone will not provide any reduction in file size.
- the viewing zone 1 enables movement of a viewer around (a portion of) the scene.
- the base representation may enable a user to walk around the room so as to view the room from different angles.
- the viewing zone enables a user to move through the scene with six degrees-of-freedom (6DoF) movement, where this aids in the provision of an immersive experience.
- the viewing zone 1 may be four-dimensional, where a three-dimensional location of the viewing zone changes over time - and in such embodiments the size and location of the occluded surface 2 may also change over time. More generally, it will be appreciated that viewing zones may be formed in any size or shape, with different sizes and shapes being suitable for different scenes.
- the volume of the viewing zone 1 is typically selected so that a user is able to move to a degree sufficient to avoid motion sickness and to provide an immersive sensation, while still only enabling a limited amount of movement (where this leads to a smaller file size as compared to an implementation where a user is able to fully move about the scene).
- the viewing zone is arranged to enable a user to move their head while they are sitting or standing, but not to freely roam around a room.
- the viewing zone 1 may have a (e.g. real-world) volume of less than five cubic metres (5 m³), less than one cubic metre (1 m³), less than one-tenth of a cubic metre (0.1 m³), and/or less than one-hundredth of a cubic metre (0.01 m³).
- the viewing zone 1 may also have a minimum size, e.g. the viewing zone may have a volume of at least 1% of the volume of the scene, at least 5% of the volume of the scene, and/or at least 10% of the volume of the scene.
- the viewing zone may have a volume of at least one-thousandth of a cubic metre (0.001 m³); at least one-hundredth of a cubic metre (0.01 m³); and/or at least one cubic metre (1 m³).
- the ‘size’ of the viewing zone 1 typically relates to a size in the real world, where if the viewing zone has a length of one metre this means that a user is able to move one metre in the real world while staying within the viewing zone.
- the size of the viewing zone in the scene may be greater than, equal to, or less than the size of the viewing zone in the real world.
- the viewing zone may scale a real-world distance so that moving one metre in the real world moves the user less than (or more than) one metre in the scene. This enables the scene to provide different perceptions to the user (e.g. to make the user feel larger or smaller than they are in real life).
- the viewing zone may scale a real-world angle so that rotating one degree in the real world rotates the user less than (or more than) one degree in the scene.
- a viewing zone with a volume of one cubic metre typically connotes a viewing zone in which the user is able to move about a one cubic metre volume in the real world while remaining in the viewing zone. And this may cause the user to move about a volume that is more than, or less than, one metre in the scene.
- a plurality of capture devices C1, C2, ..., C9 may be used (e.g. a plurality of virtual scanners and/or a plurality of cameras).
- Each capture device is typically arranged to perform a capture process, e.g. as described with reference to Figure 3, in which the capture device captures points at a plurality of azimuth angles and elevation angles.
- a first capture device C1 is located at a centrepoint of the viewing zone 1.
- one or more capture devices C2, C3, C4, C5 may be located at the centres of faces of the viewing zone; and/or one or more capture devices C6, C7, C8, C9 may be located at edges of and/or corners of the viewing zone.
- Figure 6a shows a two-dimensional view (e.g. a plan view) of a rectangular viewing zone. It will be appreciated that within this viewing zone each capture device may be located on a shared plane. Equally, the various capture devices may be located on different planes. Referring, for example, to Figure 6b, there is shown a three-dimensional view of a cuboid viewing zone, where there is a capture device located: at the centre of the viewing zone; at the centre of each face of the viewing zone; and at each corner of the viewing zone.
- FIG. 7 shows a first point P1 being captured by each of a first capture device C1, a sixth capture device C6, and a seventh capture device C7.
- Each capture device captures this point at a different angle and distance and may be considered to capture a different ‘version’ of the point.
- One of these versions may be selected for storage in the three-dimensional representation; this version may be the highest quality version of the point and/or may be the version of the point associated with the nearest and/or least angled capture device.
- capturing a point for a given azimuth angle and elevation angle typically comprises capturing a plurality of sub-points at varying sub-point azimuth and elevation angles spread around the point azimuth and elevation angles. Due to the different spreads of sub-points, each capture device will capture a different version of the point (that has a different attribute) even when the points are at the same location. Capture devices that are close to the point and less angled with respect to the point typically have a smaller spread of sub-points and so typically obtain a version of a point that is sharper than a version of that point captured by more distant capture devices.
- a quality value of a version of the point is determined based on the spread of sub-points associated with this version (e.g. based on the perimeter formed by these sub-points and/or based on a surface area or volume bounded by these sub-points).
- the version of the point that is stored may depend on the respective quality values of possible versions of the points.
- two ‘points’ in approximately the same location captured by each capture device may not have exactly the same location in the three-dimensional representation. More specifically, since each capture device typically projects a ‘ray’ at a given angle, the rays of differing capture devices may contact the surface at different locations for each capture device. Two points may be considered to be two ‘versions’ of a single point when they are within a certain proximity, e.g. a threshold proximity.
- Where a further point lies within this threshold proximity of the first point or the second point, this further point may be considered to be a ‘version’ of one of the first point and the second point.
- FIGs 8a and 8b show the separate captured grids that are formed by two different capture devices.
- each capture device will capture a slightly different ‘version’ of a point at a given location and these captured points will have different sizes.
- Each capture step is associated with a particular range of angles (e.g. a nominal capture angle of 1° might encompass angles from 0.9° to 1.1°), and therefore capture devices that are far from a point to be captured represent a wider region at the capture distance than capture devices closer to that point to be captured.
- the capture device C1 would capture the points P1 and P2 in separate brackets, whereas for the capture device C2 these points are in the same bracket. Therefore, the capture device C2 might determine a single point that encompasses both points P1 and P2, whereas the capture device C1 would determine separate points for these two points.
- the ‘sizes’ of these captured points, and the locations in space that are encompassed by the captured points will be based on different grids.
- the width of the captured point P2 captured by the capture device C2 will be larger than the width of the captured point P1 captured by the capture device C1.
- the capture process may be determined based on the existence of these different grids, and on the different bracket widths that occur at different distances from a capture device.
- Figure 8a shows an exaggerated difference between grids for the sake of illustration.
- Figure 8b shows a more realistic embodiment in which the three-dimensional representation comprises a plurality of points associated with different capture devices, where these points lie on different grids associated with these different capture devices.
- the points may be stored as a string of bits, where a first portion of the string indicates a location of the point (e.g. using x, y, z coordinates) and a second portion of the string indicates an attribute of the point.
- further portions of the string may be used to indicate, for example, a transparency of the point, a size of the point, and/or a shape of the point.
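- Purely by way of illustration, the sketch below shows one way such a string of bits could be packed and unpacked; the particular field widths, ordering, and RGB attribute format are assumptions made for the example rather than an encoding defined by this disclosure.

```python
import struct

def pack_point(x, y, z, attribute, transparency=255, size=0):
    """Pack a point as a byte string: a location portion, then an attribute
    portion, then further portions (transparency, size). Illustrative only."""
    r, g, b = attribute
    # 3 x float32 for the location, then five single-byte fields.
    return struct.pack("<fffBBBBB", x, y, z, r, g, b, transparency, size)

def unpack_point(data):
    """Recover the location and attribute portions from the packed bytes."""
    x, y, z, r, g, b, transparency, size = struct.unpack("<fffBBBBB", data)
    return {"location": (x, y, z), "attribute": (r, g, b),
            "transparency": transparency, "size": size}
```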
- a computer device that processes the three-dimensional representation after the generation of this representation is then able to determine the location and attribute of each point so as to recreate the scene.
- This location and attribute may then be used to render a two-dimensional representation of the scene that can be displayed to a viewer wearing the display device 17.
- the locations and attributes of the points of the three-dimensional representation can be used to render a two-dimensional image for each of the left eye of the viewer and the right eye of the viewer so as to provide an immersive extended reality (XR) experience to the viewer.
- the present disclosure considers an efficient method of storing the locations of the points (e.g. at an encoder) and of determining the locations of the points (e.g. at a decoder).
- the points of the three-dimensional representation are determined using a set of capture devices placed at locations about the viewing zone, where these capture devices are arranged to capture points at a series of azimuth angles and elevation angles.
- each of the capture devices is arranged to use the same capture process (e.g. the same series of azimuth angles and elevation angles), though it will be appreciated that different series of capture angles are possible. For example, there may be a plurality of possible series of capture angles, where different capture devices use different capture angles.
- the present disclosure considers a method in which points are stored based on a capture device identifier and an indication of a distance of the point from the capture device associated with this capture device identifier.
- the point is also associated with an angular indicator, which indicates an azimuth angle and/or an elevation angle of the point relative to the identified capture device.
- the storage of the distance and the angle may take many forms.
- the distance and the angle of each point may be converted into a universal coordinate system, where each capture device has a different location in this universal coordinate system.
- each point may be stored with reference to a centre of this universal coordinate system, which centre may be co-located with a central capture device.
- the coordinates of the point in this universal coordinate system can be determined trivially - and the location of the point may then be stored either relative to the capture device or as a coordinate in the universal coordinate system.
- the capture device identifier may comprise a location of a capture device (e.g. a location in a co-ordinate system of the three-dimensional representation). Equally, the capture device identifier may comprise an index of a capture device. Similarly, the indication of the azimuth angle and the elevation angle for a point may comprise an angle with reference to a zero-angle of a co-ordinate system of the three-dimensional representation. Equally, the azimuth angle and/or the elevation angle may be indicated using an angle index.
- the three-dimensional representation is associated with configuration information, which configuration information comprises one or more of: a set of capture device indexes; locations associated with the capture devices and/or the capture device indexes; a spacing of capture devices (e.g. so that locations of the capture devices can be determined from a location of a first capture device and the spacing); angles associated with a capture process for the capture devices; an azimuth angle increment and/or an elevation angle increment associated with the capture process; and a set of angle indexes (e.g. to match an angle index to an angle).
- By signalling a capture device index and an angle index that is associated with a combination of a specific azimuth angle and a specific elevation angle, a location of a capture device and a direction of a point from this capture device can be determined.
- By further signalling a distance of the point from the signalled capture device, a precise location of the point in the three-dimensional space can be signalled efficiently.
- the point is associated with each of: a camera index, a distance, a first angular index (e.g. an azimuth index), and a second angular index (e.g. an elevation index).
- This method of indicating a location of a point enables point locations to be identified using a much smaller number of bits than if each point location is identified using x, y, z coordinates.
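- As a rough, hypothetical illustration of this saving, the snippet below compares the bits needed for explicit x, y, z coordinates with the bits needed for a capture device index, an angle index, and a distance; all of the field widths are values assumed purely for the example.

```python
# Explicit coordinates: three 32-bit values per point.
xyz_bits = 3 * 32

# Indexed signalling (assumed widths): device index, combined azimuth/elevation
# bracket index, and a quantised distance from that device.
indexed_bits = (
    5       # capture device index (up to 32 devices)
    + 20    # combined angle index (up to ~10^6 angular brackets)
    + 16    # quantised distance from the capture device
)

print(xyz_bits, indexed_bits)   # 96 vs 41 bits per point in this example
```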
- Referring to Figure 9, there is shown a method of determining a location of a point. This method is carried out by a computer device, e.g. the image generator 11 and/or the decoder 15.
- In a first step 41, the computer device identifies an indicator of a capture device used to capture the point. Typically, this comprises identifying a portion of a string of bits associated with a capture device index.
- In a second step 42, the computer device identifies an indicator of an angle of the point from the capture device.
- this comprises identifying an angle index, e.g. an azimuth index and/or an elevation index and/or a combined azimuth/elevation index, which index(es) identifies a step of the capture process during which the point was captured.
- In a third step 43, based on the identifiers, the computer device determines the location of the capture device and the angle of the point from the capture device.
- the capture device identifier is typically a capture device index, which is related to a capture device location based on configuration information that has been sent before, or along with, the point data.
- the configuration information may specify:
- Step between capture devices is (0,0,1) along the grid, then across the grid, then up the grid.
- the grid is (10,10,10).
- a capture device with an index of 1 can be determined to be located at (0,0,0); a capture device with an index of 5 can be determined to be located at (0,0,4); a capture device with an index of 11 can be determined to be located at (0,1,0), and so on.
- the configuration information may specify a list of camera indexes and locations associated with these indexes, where this enables the use of a wide range of setups of capture devices.
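- A minimal sketch of how a decoder could recover a capture device location from the exemplary configuration above (a 10x10x10 grid stepped along, then across, then up, with 1-based indexes) is given below; the function name and the 1-based indexing are assumptions made for the example.

```python
def capture_device_location(index, grid=(10, 10, 10)):
    """Map a capture device index onto a grid location for the example
    configuration: step (0,0,1) along the grid, then across, then up."""
    n = index - 1                       # 1-based index -> 0-based offset
    z = n % grid[2]                     # along the grid first
    y = (n // grid[2]) % grid[1]        # then across the grid
    x = n // (grid[2] * grid[1])        # then up the grid
    return (x, y, z)

assert capture_device_location(1) == (0, 0, 0)
assert capture_device_location(5) == (0, 0, 4)
assert capture_device_location(11) == (0, 1, 0)
```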
- the three-dimensional representation is associated with a frame of video.
- the configuration information may be constant over the frames of the video so that the configuration information needs to be signalled only once for an entire video. Therefore, the configuration information may be transmitted alongside a three-dimensional representation of a first frame of the video, with this same information being used for any subsequent frames (e.g. until updated configuration information is sent).
- the angle identifier may similarly be related to an angle by a location and an increment that are signalled in a configuration file.
- the configuration information may specify:
- An azimuth increment and an elevation increment are each 1°.
- In a fourth step 44, a location of the point is determined. Typically, this comprises determining the location of the point based on the location of the capture device, the capture angle, and a distance of the point from the capture device (where this distance is specified in the point data for the point). Determining the location of the point typically comprises determining the location of the point relative to a centrepoint of the three-dimensional representation. This location of the point may then be converted into a desired coordinate system and/or the point may be processed based on its location (e.g. to stitch together adjacent points).
- the angular identifier typically comprises a first angular identifier and a second angular identifier, where the first identifier provides the azimuthal angle of the point and the second identifier provides the elevation angle of the point.
- each angular identifier may be provided as an index of a segment of the three- dimensional representation, where, for example, an index of 0 may identify the point as being in a first angular bracket 101 and an index of 1 may identify the point as being in a second angular bracket 102.
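- The following sketch illustrates one possible implementation of this step, converting a capture device location, an azimuth/elevation bracket, and a distance into a point location; the particular angular convention (azimuth about the vertical axis, elevation from the horizontal plane, brackets centred on index * increment) is an assumption for illustration, not a convention fixed by the disclosure.

```python
import math

def point_location(device_xyz, azimuth_index, elevation_index,
                   distance, increment_deg=1.0):
    """Convert (device location, angle indexes, distance) into an x, y, z point."""
    azimuth = math.radians(azimuth_index * increment_deg)
    elevation = math.radians(elevation_index * increment_deg)
    # Spherical-to-Cartesian offset from the capture device.
    dx = distance * math.cos(elevation) * math.cos(azimuth)
    dy = distance * math.cos(elevation) * math.sin(azimuth)
    dz = distance * math.sin(elevation)
    x0, y0, z0 = device_xyz
    return (x0 + dx, y0 + dy, z0 + dz)
```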
- the capture devices are arranged to perform a capture process, e.g. as described with reference to Figure 3, with a non-infinite angular resolution.
- each point is not a one-dimensional point located at a precise angle. Instead, each point is a point for a particular area of space, with the size of this area being dependent on the angular resolution as well as the distance of the point from the capture device.
- each capture angle determines a point for an angular range (with the range being dependent on the angular resolution).
- FIG 10 shows a series of angular brackets, with the size of these angular brackets at a given distance being dependent on the angular resolution.
- the angular identifier(s) typically comprise a reference to such an angular bracket.
- each capture device has the same capture pattern so that the angular bracketing of each device is the same (albeit centred differently at the location of the relevant capture device).
- the angle for each bracket may be 360°/1000 (i.e. 0.36°).
- different capture devices are associated with different capture patterns, where this may be signalled in configuration information relating to the three-dimensional representation.
- each capture device is arranged to capture a point for a plurality of angular brackets, where each bracket is associated with a different angle.
- the angular spread of each bracket (that is, the angle between a first, e.g. left, angular boundary of the bracket and a second, e.g. right, angular boundary of the bracket) may be the same; equally, this angular spread may vary.
- the angular spread may vary so as to be smaller for points which are directly in front of (or behind, or to a side of) the capture device.
- the embodiment shown in Figure 7 shows an angular bracketing system that is based on a cube.
- a cube is placed such that a capture device is located at the centre of the cube and the cube is then split into 1000 sections of equal size (it will be appreciated that the use of 1000 sections is exemplary and any number of sections may be used). Each of these sections is then associated with an angular index. With this arrangement, the angular spread of each section (or bracket) varies, as has been described above.
- Figure 10 shows a two-dimensional square, where each angular bracket of the square is referenced by an index number (between 1 and 100).
- an angular bracket of a cube could be indicated with two separate numbers (with a first azimuthal indicator that identifies a ‘column’ of the cube and a second elevational indicator that identifies a ‘row’ of the cube).
- a singular indicator may be provided that indicates a specific bracket of the cube. Therefore, for a cube that is divided into 1000 elevational sections and 1000 azimuthal sections, the bracket may be indicated with two separate indicators that are each between 0 and 999 or with a single indicator that is between 0 and 999999.
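- By way of example, the snippet below shows the two equivalent indexing schemes described above for a face divided into 1000 x 1000 angular sections; the chosen mapping (elevation-major) is an assumption for illustration.

```python
SECTIONS = 1000

def combine(azimuth_index, elevation_index, sections=SECTIONS):
    """Two indicators in 0..999 -> a single indicator in 0..999999."""
    return elevation_index * sections + azimuth_index

def split(combined_index, sections=SECTIONS):
    """A single indicator in 0..999999 -> (azimuth_index, elevation_index)."""
    return combined_index % sections, combined_index // sections

assert split(combine(123, 456)) == (123, 456)
```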
- It will be appreciated that these brackets are exemplary and that other bracketing systems are possible.
- a spherical bracketing system may be used (where this leads to curved angular brackets).
- a lookup table may be provided that relates angular indexes to angles, where this enables irregularly spaced brackets to be used.
- determining the location of the point comprises determining the location of the point so as to be at the centre of the angular bracket identified by the angular identifier(s).
- In a first step 71, the computer device identifies a plurality of points of the representation; in a second step 72, the computer device determines that the points lie on a shared plane; in a third step 73, the computer device determines a texture patch based on the attributes of the points; and in a fourth step 74, the computer device determines a new point that references the texture patch (this new point may be referred to as a ‘texture point’).
- the texture patch typically comprises a patch with a plurality of attribute values, which attribute values may be the same as the attribute values of the identified points. Therefore, the texture patch enables the recreation of the plurality of points.
- a benefit of using the texture patch is that a single point, with a single location value and a (single) reference to the texture point, can replace the plurality of identified points.
- the attribute values of each point are contained in the texture patch so that little (or no) information is lost from the original representation, but by representing all of these attribute values by reference to the texture patch, only a single location needs to be signalled (saving on the computational cost of signalling locations for a plurality of points).
- an 8x8 square of identified points that each have separate locations and attribute values may be replaced by a single point with a single location and an attribute value that is a reference to a texture patch (which texture patch comprises the attribute values of the identified points arranged in the relative positions of the identified points); this would reduce the size of the representation by 63 points (where a single point replaces an 8x8 grid of points) at the cost of needing to signal an 8x8 texture patch (that has 64 attribute values and/or transparency values and/or normal values).
- Figure 12a shows a plurality of points of a three-dimensional representation that lie on a shared plane. Each of these points has a location and an attribute value.
- Figure 12b shows how these points may be replaced by a single point (e.g. a ‘texture point’) that contains a reference to the texture patch shown in Figure 12c.
- This texture patch may comprise the attribute values of the plurality of points without separately storing the locations of the attributes (instead, the attribute values are laid out in a predetermined pattern, which is a 5x5 grid in the example of Figure 12c).
- FIG. 12d shows an 8x8 arrangement of values laid out in the form of a texture patch.
- the texture patch provides a continuous grid of pixel values (e.g. that can be used to form a continuous image) - in this regard, the points shown in Figures 12a - 12c are shown as separated points. In practice, these ‘points’ are typically abutting points that form a joined arrangement of values.
- the method comprises determining the texture patch in dependence on a difference of the attributes of the identified points exceeding a threshold (e.g. in dependence on a variance of the attributes exceeding a threshold variance).
- points that are similar in both location and attribute may be aggregated into a single point with a location and attribute that is based on the initial points and a size that covers both of the initial points (e.g. two adjacent points of the same colour and a size of 1 may be aggregated into a single point of this colour with a size of 2).
- Such an aggregation does not require any determination of a texture patch.
- a texture patch may be determined where there is a plurality of dissimilar points (e.g. points with dissimilar attributes) that lie on a shared plane, where the use of the texture patch enables the attributes of each of these points to be signalled in an efficient manner.
- the second step 72 of determining that the points lie on a shared plane may comprise determining that the points lie on a shared surface (e.g. on the same object), where the method may comprise identifying a surface associated with the identified points.
- Determining that the points lie on a shared plane may comprise comparing a distance of (each of) the points from this plane and/or surface to a threshold distance and determining that the points lie on the plane and/or surface if they are within this threshold distance from the plane.
- This second step 72 may also, or alternatively, comprise identifying normals for each of the points, which normals may be contained in point data of the points, and determining a similarity of the normals (e.g. determining that each of the normals is within a threshold value of an average normal and/or determining that a variance of the normals is below a threshold value).
- the texture patch is typically determined based on this determination in the second step 72, where if (e.g. only if) the identified points lie on a shared surface or plane then they may be replaced by a single point that references a texture patch.
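- One possible way of implementing this shared-plane test is sketched below: a plane is fitted to the identified points, the maximum point-to-plane distance is compared to a threshold, and the point normals are optionally checked for agreement. The specific fitting approach (singular value decomposition) and the threshold values are assumptions for illustration rather than requirements of the disclosure.

```python
import numpy as np

def lie_on_shared_plane(points, normals=None,
                        distance_threshold=0.01, normal_threshold=0.9):
    """Return True if the points lie (approximately) on a shared plane."""
    pts = np.asarray(points, dtype=float)            # shape (N, 3)
    centroid = pts.mean(axis=0)
    # The plane normal is the direction of least variance of the points.
    _, _, vt = np.linalg.svd(pts - centroid)
    plane_normal = vt[-1]
    distances = np.abs((pts - centroid) @ plane_normal)
    if distances.max() > distance_threshold:
        return False
    if normals is not None:
        n = np.asarray(normals, dtype=float)
        n = n / np.linalg.norm(n, axis=1, keepdims=True)
        mean_normal = n.mean(axis=0)
        mean_normal /= np.linalg.norm(mean_normal)
        # Require every normal to be close to the average normal.
        if np.min(n @ mean_normal) < normal_threshold:
            return False
    return True
```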
- the texture patch may be determined for points that lie on a curved plane, where the second step 72 may comprise determining that the points lie on a curved plane or a curved surface.
- Such a texture patch may be associated with a bend value to enable the reproduction of the identified points.
- the texture patch comprises a quadrilateral, where the texture patch may be able to bend about a line that is formed between opposite corners of this quadrilateral so as to map the texture patch to a curved surface.
- the threshold distance may depend on the distance of the points from the viewing zone; in particular, points that are located far from the viewing zone may have a higher threshold separation than points that are located nearer to the viewing zone.
- users are better able to identify separations between a plurality of surfaces when these surfaces are near to the viewing zone whereas users may not be able to identify separations between surfaces that are distant from the viewing zone. Therefore, the maximum (threshold) acceptable distance between the identified points and a plane passing through the identified points may be dependent on the distance of the identified points from the viewing zone (e.g. the threshold may increase from a first value when the identified points are within 1 km of the viewing zone to a second value when the identified points are more than 1 km from the viewing zone).
- the texture patch is associated with a size, where there may also be provided a plurality of texture patches of different sizes. Identifying the plurality of points may then comprise identifying a plurality of points that could potentially be replaced with a single point that references a texture patch. This may involve the computer device iterating through a plurality of pluralities of identified points and then evaluating each of these pluralities of identified points in order to determine whether the points lie on a shared plane. If these identified points are found to lie on such a shared plane, then a texture patch may be determined based on the attributes of these points and this texture patch may be added to a database of texture patches (e.g. a texture ‘atlas’).
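- A minimal sketch of such a texture ‘atlas’ is shown below, in which each determined patch is appended to a database and the returned index is what the texture point stores in its attribute datafield; the class and method names are illustrative only.

```python
class TextureAtlas:
    """A simple database of texture patches, addressed by index."""

    def __init__(self):
        self.patches = []

    def add_patch(self, attribute_grid):
        """Store an (e.g. 8x8) grid of attribute values and return its index."""
        self.patches.append(attribute_grid)
        return len(self.patches) - 1

    def get_patch(self, index):
        """Recover the attribute grid referenced by a texture point."""
        return self.patches[index]
```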
- the texture patch may be associated with one or more of: one or more attribute values; one or more transparency values; one or more normals; etc.
- the texture patch may comprise a plurality of points that correspond to the points used to form the texture patch (e.g. where each point of the texture patch comprises an attribute, a normal, and/or a transparency of a corresponding point of the three-dimensional representation).
- the method may comprise determining a multi-layered texture patch and/or a plurality of texture patches. For example, the method may comprise determining a texture patch for each eye of a user (where these texture patches may be located at the same index of separate databases of texture patches so that they can be signalled by a single reference in a point).
- texture patches for each of a left eye and a right eye are stored in a shared database.
- the index for a texture patch for the left eye may then be set as being one greater than the index for a texture patch for the right eye, where this simplifies the signalling of the texture patches.
- each eye may be associated with the same texture patch. In these situations, only a single texture patch may be included in the database (with the index that would otherwise contain a second texture patch instead pointing to the single texture patch). Equally, the same texture patch may be stored twice.
- Determining the new point may comprise replacing (one or more of, or all of) the identified points with the new point (the texture point). While the identified points each comprise an attribute value (e.g. to indicate the colour of the point), the new point may instead comprise a reference to the texture patch. This reference may be located in an attribute datafield associated with the point so that the new point has the same form as the identified points (and the same form as the other points of the three-dimensional representation).
- determining the new point comprises modifying one of the identified points.
- the attribute value of one of the identified points may be replaced with a reference to the texture patch.
- the size of this point may be modified, e.g. increased, to signal that the point is now associated with a texture patch.
- the first step 71 of identifying the plurality of points comprises identifying a plurality of points associated with the same capture device (and/or similar, e.g. adjacent capture devices).
- each point is typically associated with a capture device, a distance (from that capture device) and one or more angles (from the capture device), where this method of defining points enables the location of each point to be determined relative to the associated capture device (and then absolute locations of each point can be determined using the locations of the various capture devices).
- each point of the representation comprises a size, which size may relate to a number of angular brackets that is encompassed by the point. Therefore, a point that has a size of 1x1 may relate to a point that covers a single angular bracket, a point that has a size of 2x1 may cover two angular brackets in a row, etc.
- the size is stored via an index so that, for example, a size value of 0 (which may be signalled by a binary value of 000) may signal a point that covers a 1x1 arrangement of angular brackets, a size value of 1 (which may be signalled by a binary value of 001) may signal a point that covers a 2x1 arrangement of angular brackets, a size value of 2 (which may be signalled by a binary value of 010) may signal a point that covers a 1x2 arrangement of angular brackets, etc.
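- The size index described above could, for example, be implemented as a small lookup table; only the first three entries below are taken from the text, and the remainder of the table would be defined by the relevant configuration information.

```python
SIZE_TABLE = {
    0b000: (1, 1),   # size value 0 -> 1x1 arrangement of angular brackets
    0b001: (2, 1),   # size value 1 -> 2x1 arrangement of angular brackets
    0b010: (1, 2),   # size value 2 -> 1x2 arrangement of angular brackets
    # ... further entries defined by configuration information
}

def brackets_covered(size_value):
    """Number of angular brackets covered by a point with this size value."""
    cols, rows = SIZE_TABLE[size_value]
    return cols * rows
```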
- the texture patch is similarly associated with a size, where the method of determining texture patches may be performed such that every texture patch is of the same size (where this can simplify the storage of the texture patches).
- a computer device determining the texture patches may be arranged to determine texture patches of different sizes.
- the size of the texture patch may refer to a number of angular brackets covered by a texture patch.
- each texture patch may cover an 8x8 arrangement of angular brackets.
- the first step 71 of identifying the points may comprise identifying a plurality of contiguous points, where these points may be in a predetermined arrangement.
- the first step may comprise identifying an 8x8 arrangement of points in adjacent (and contiguous) angular brackets.
- the three-dimensional representation comprises a plurality of separately signalled points and/or comprises a plurality of different sections, with each section comprising a different type of points.
- the representation may be associated with a file that has at least two sections, where a first section comprises points for which an attribute datafield contains an attribute value and a second section comprises points for which an attribute datafield comprises a reference to an external value (e.g. to a texture patch).
- points for which the attribute value references a texture patch may comprise an identifier that enables these points to be distinguished from points for which the attribute datafield comprises an attribute value (e.g. each attribute datafield that references a texture patch may begin with a recognisable string or pattern of bits).
- points that reference a texture patch are identified based on a size of that point, where typically these points have a greater size than other points (equally a size of 0 may be used to signal a texture patch where this size would not otherwise be used).
- the three-dimensional representation comprises at least two of the following sections:
- a first section comprising opaque points with attribute values (e.g. points for which a transparency (α) value is 255).
- a second section comprising opaque points that reference texture patches (e.g. points that have an attribute datafield that comprises a reference to a texture patch and also a transparency value of 255).
- a third section comprising transparent points with attribute values and transparent points that reference texture patches.
- the transparency values for the texture patch are a part of the texture patch (e.g. these transparency values are stored in the texture atlas).
- the points with attribute values and the points referencing texture patches may then be distinguished by setting the transparency values of the points referencing texture patches to 255.
- a computer device parsing the three-dimensional representation is able to determine that a point with a transparency value of 255 is not opaque (since otherwise it would be located in the first or second section) and therefore the computer device can identify that such points include references to texture patches.
- points referencing transparent texture patches may be signalled by including these points in a section associated with transparent points and transparent texture patches and setting a transparency value of the points referencing transparent texture patches to a value that indicates an opaque point.
- Such a method of signalling points that reference texture patches enables an increase in the efficiency of the storage of the three-dimensional representation.
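- The sketch below illustrates how a parser might apply this sectioned signalling; the section names, the use of 255 as the ‘opaque’ transparency value, and the dictionary layout of a point are assumptions made for the example.

```python
OPAQUE = 255

def is_texture_reference(point, section):
    """Return True if the point's attribute datafield references a texture patch."""
    if section == "opaque_attribute":      # first section: opaque points with attributes
        return False
    if section == "opaque_texture":        # second section: opaque texture points
        return True
    if section == "transparent":           # third section: transparent points (mixed)
        # A point in the transparent section flagged as fully opaque cannot really
        # be opaque (it would otherwise sit in the first or second section), so the
        # value is reused to signal a reference to a (transparent) texture patch.
        return point["transparency"] == OPAQUE
    raise ValueError(f"unknown section: {section}")
```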
- one or more of the points of the three-dimensional representation is associated with a motion vector, where this vector indicates a motion of that point that occurs between frames of a video associated with the three-dimensional representation.
- a texture patch may similarly be associated with a single vector (e.g. where the texture patch is a rigid structure and so moves as a single structure).
- a texture patch may be associated with a plurality of motion vectors, e.g. each corner of the texture patch may be associated with a different motion vector, where this enables the texture patch to flex and to change in size during the course of a video (and enables this flexing to be signalled via motion vectors).
- Referring to Figure 13, there is described a method for determining one or more texture patches for a three-dimensional representation.
- This method is typically performed by a computer device as a post-processing step, after the three-dimensional representation has been generated (e.g. after each point of the three-dimensional representation has been captured).
- This method of Figure 13 may then be used to replace a number of points within this three-dimensional representation with one or more new points that reference texture patches so as to reduce the number of points in the three-dimensional representation.
- In a first step 81, the computer device identifies a plurality of points associated with a given location in the representation. Typically, this step comprises identifying a plurality of points in adjacent angular brackets. For example, the computer device may identify an 8x8 arrangement of points (it will be appreciated that various other arrangements may be identified).
- In a second step 82, the computer device determines whether these identified points lie on a shared plane. This determination may comprise identifying a plane that passes through these identified points and determining that each of the identified points is within a threshold distance of this plane.
- the plane may comprise a curved plane. Additionally, or alternatively, the second step may comprise comparing the normals of the identified points to determine that the identified points have similar directional values (e.g. face in the same direction).
- In a third step 83, if the points lie on a shared plane, the computer device determines a texture patch based on the identified points, e.g. where the texture patch comprises a two-dimensional plane that has attribute values corresponding to the attribute values of the identified points. The computer device may then replace the identified points with a new point that has a location related to the identified points and that references the texture patch.
- In a fourth step 84, the computer device considers a next location in the representation, and the method then returns to the first step 81 so that a next plurality of points can be evaluated.
- This method may be performed so as to move through a plurality of pluralities of points of the representation and to determine, for each of these pluralities of points, whether the points lie on a shared plane (and so can be replaced with a texture quad).
- the fourth step 84 of considering a next location may comprise incrementing an angular identifier so as to move incrementally through the angular brackets for a computer device and, at each stage, to compare the points in a number of adjacent angular brackets.
- the computer device essentially initialises a window of points and then moves the window so as to evaluate a moving window of points that passes about the entirety of the representation.
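- A simplified sketch of this moving-window search for a single capture device is given below; it relies on the illustrative lie_on_shared_plane and TextureAtlas sketches above, only identifies candidate texture points (the bookkeeping of removing the replaced points is omitted), and is not intended as a definitive implementation of the method of Figure 13.

```python
def find_texture_points(bracket_grid, atlas, window=(8, 8)):
    """Slide a window over a grid of angular brackets (each holding a point
    dict or None) and build a texture point wherever the covered points lie
    on a shared plane."""
    texture_points = []
    rows, cols = len(bracket_grid), len(bracket_grid[0])
    for r in range(0, rows - window[0] + 1):
        for c in range(0, cols - window[1] + 1):
            block = [bracket_grid[r + i][c + j]
                     for i in range(window[0]) for j in range(window[1])]
            if any(p is None for p in block):       # window not fully populated
                continue
            if lie_on_shared_plane([p["location"] for p in block]):
                patch = [p["attribute"] for p in block]
                index = atlas.add_patch(patch)
                texture_points.append({
                    "location": block[0]["location"],  # e.g. first covered point
                    "attribute": index,                # reference into the atlas
                    "size": window,
                })
    return texture_points
```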
- the method of Figure 13 is performed separately for each capture device so that, for one or more capture devices, one or more pluralities of points are evaluated to determine whether these points lie on a shared plane and to determine a texture patch if the points do lie on a shared plane.
- points captured by separate capture devices may be considered together (e.g. points captured by adjacent capture devices may be considered together for the purposes of determining texture patches).
- the iterative process described with reference to Figure 13 is capable of obtaining a three-dimensional representation that efficiently represents a space.
- the three-dimensional representation is associated with a video, e.g. a VR video, and so the three-dimensional representation (and the texture atlas associated with this three-dimensional representation) may relate to a single frame of the video.
- the video may be composed of a plurality of frames, with each frame relating to a different three-dimensional representation.
- the texture patch may represent a point that is moving throughout a scene. And so the same (or a similar) texture patch may be identified in a plurality of three-dimensional representations, e.g. where a movement of this texture patch is signalled by one or more motion vectors of this texture patch.
- If each representation is considered separately, then a texture patch that could be present in each of a first and second three-dimensional representation may not be identified in the second three-dimensional representation.
- a number of the constituent points of this texture patch may instead be included in a different texture patch.
- the processing of the entire representation might change based on a change to only a few points (since any formation of a new texture patch will have a knock-on effect throughout the three-dimensional representation).
- Referring to Figure 14, there is envisaged a grid-based method of processing the representation that can be used to ensure that isolated changes in a first area of the scene do not cause wholesale changes in the processing of a three-dimensional representation. While this grid-based system is useable for the aggregation process, it will be appreciated that more generally the grid system may be used as the basis for any processing procedure.
- In a first step 91, a grid system of a representation is determined.
- a grid may be determined that has grid sizes of 8x8.
- each grid square may, for example, have a size of 16x16 (where the size indicates a number of angular brackets covered by the grid square).
- the method comprises determining a grid system such that each of a plurality of three-dimensional representations (e.g. relating to frames of a video) has a similar grid system.
- This may comprise determining the grid system such that a first grid square in a first three-dimensional representation is located at the same space in the scene as a first grid square in a second three-dimensional representation.
- a grid system is typically determined for each of the capture devices, where the grid system may be based on the angular brackets of that capture device. Therefore, for each of the three-dimensional representations (e.g. for each of a plurality of three-dimensional representations associated with a certain viewing zone), a grid system may be determined for each of the capture devices such that the grid squares of each grid system cover the same angular brackets for each three-dimensional representation.
- the process of determining texture patches mirrors that described with reference to Figure 13. That is, for each grid square, the method involves identifying a plurality of points in a second step 92, determining whether these points lie on a shared plane in a third step 93, if the points do lie on a shared plane then determining a texture patch in a fourth step 94, and then considering a next location in the grid square in a fifth step 95.
- Once the computer device determines that all possible pluralities of points in the grid square have been identified, then, in an eighth step 98, a next grid square is considered and this process is repeated.
- the grid squares typically cover the same angular brackets in a plurality of three- dimensional representations that are associated with the same video and/or the same viewing zone.
- the grid squares may be associated with a motion vector so that the grid system may differ (in a calculable way) between the three-dimensional representations. Such embodiments can enable the efficient encoding of three-dimensional representations that are associated with consistent movements.
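- A sketch of this grid-based variant is shown below: the angular brackets are partitioned into fixed grid squares (16x16 in this example) and the texture-point search is run within each square independently, so that a change confined to one square cannot alter how the remaining squares are processed; find_texture_points is the illustrative sketch given above.

```python
def process_by_grid_squares(bracket_grid, atlas, square=16):
    """Run the texture-point search separately within each grid square so that
    isolated changes stay local to that square."""
    results = {}
    rows, cols = len(bracket_grid), len(bracket_grid[0])
    for r0 in range(0, rows, square):
        for c0 in range(0, cols, square):
            sub_grid = [row[c0:c0 + square]
                        for row in bracket_grid[r0:r0 + square]]
            results[(r0, c0)] = find_texture_points(sub_grid, atlas)
    return results
```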
- While the above method has referred to ‘grid squares’, it will be appreciated that the grid system may be associated with subdivisions of any shape or size (these subdivisions being consistent through a plurality of three-dimensional representations).
- the determination of the texture patches and of an aggregate point may be a part of the same process.
- the computer device may identify a plurality of points with a similar location (e.g. being captured by the same capture device and being located in adjacent angular brackets) and the computer device may: in dependence on the points having similar attribute values, determine an aggregate point based on these points; in dependence on the points having different attribute values, determine a texture point (and a texture patch) based on the points.
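- The combined decision could, for example, take the following form, in which similar points are merged into an aggregate point and dissimilar coplanar points become a texture point; the variance-based similarity measure and its threshold are assumptions for illustration.

```python
import statistics

def reduce_points(points, atlas, attribute_variance_threshold=10.0):
    """Merge similarly coloured points into an aggregate point, or replace
    dissimilar coplanar points with a texture point referencing a patch."""
    channels = list(zip(*(p["attribute"] for p in points)))   # per-channel values
    variance = max(statistics.pvariance(c) for c in channels)
    if variance <= attribute_variance_threshold:
        # Similar attributes: one aggregate point covering all of the points.
        mean = tuple(sum(c) / len(c) for c in channels)
        return {"location": points[0]["location"], "attribute": mean,
                "size": len(points)}
    # Dissimilar attributes: a texture point referencing a new texture patch.
    index = atlas.add_patch([p["attribute"] for p in points])
    return {"location": points[0]["location"], "attribute": index,
            "size": len(points)}
```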
- the representation is typically arranged to provide an extended reality (XR) experience (e.g. a representation that is useable to render a XR video).
- the term extended reality (XR) covers each of virtual reality (VR), augmented reality (AR), and mixed reality (MR) and it will be appreciated that the disclosures herein are applicable to any of these technologies.
- the representation may be encoded into, and/or transmitted using, a bitstream, which bitstream typically comprises point data for one or more points of the three-dimensional representation.
- the point data may be compressed or encoded to form the bitstream.
- the bitstream may then be transmitted between devices before being decoded at a receiving device so that this receiving device can determine the point data and reform the three-dimensional representation (or form one or more two-dimensional images based on this three-dimensional representation).
- the encoder 13 may be arranged to encode (e.g. one or more points of) the three-dimensional representation in order to form the bitstream and the decoder 14 may be arranged to decode the bitstream to generate the one or more two-dimensional images.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Computer Graphics (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Geometry (AREA)
- Software Systems (AREA)
- Processing Or Creating Images (AREA)
- Image Generation (AREA)
Abstract
There is described a method of determining a point of a three-dimensional representation of a scene, the method comprising: identifying a plurality of points of the representation; determining that the plurality of points lie on a shared plane; in dependence on the plurality of points lying on a shared plane, determining a texture patch based on attributes of the plurality of points; and determining a texture point, the texture point comprising a reference to the texture patch.
Description
Determining a point of a three-dimensional representation of a scene
Field of the Disclosure
The present disclosure relates to methods, systems, and apparatuses for determining (e.g. defining) a point of a three-dimensional representation of a scene, in particular determining a point in a point cloud.
Background to the Disclosure
Three-dimensional representations of environments are used in many contexts, including for the generation of virtual reality videos, in which depth information for a plurality of points of the representation is used to generate different images for a left eye and a right eye of a user. Typically, substantial processing power is required to determine such a three-dimensional representation, and the file size of files associated with these representations is typically large so that substantial amounts of storage are needed to keep the files and substantial amounts of bandwidth are required to transfer the files.
Summary of the Disclosure
According to an aspect of the present disclosure, there is described a method of determining a point of a three-dimensional representation of a scene, the method comprising: identifying a plurality of points of the representation; determining that the plurality of points lie on a shared plane; in dependence on the plurality of points lying on a shared plane, determining a texture patch based on attributes of the plurality of points; and determining a texture point, the texture point comprising a reference to the texture patch.
Preferably, the representation comprises a plurality of points captured using a plurality of different capture devices, and the method comprises identifying a first plurality of points captured using a first capture device.
Preferably, the method comprises identifying a first plurality of adjacent points of the representation.
Preferably, determining the texture patch comprises: identifying attribute values for each of the identified points; and forming the texture patch based on the identified attribute values.
Preferably, the texture patch comprises the attribute values arranged in the arrangement of the identified points.
Preferably, the method comprises storing the texture patch in a database, the database comprising a plurality of texture patches. Preferably, the texture patch is associated with an index.
Preferably, determining the texture patch comprises one or more of: forming the texture patch based on attribute values of the identified points; forming the texture patch based on transparency values of the identified points; and forming the texture patch based on normal values of the identified points.
Preferably, determining the texture point comprises modifying one of the identified points.
Preferably, the method comprises removing at least one, and preferably all, of the identified points from the three-dimensional representation.
Preferably, the method comprises replacing the identified points with the texture point.
Preferably, the method comprises: determining attribute values associated with the identified points; and determining the texture patch in dependence on a difference of said attributes exceeding a threshold, preferably wherein the difference is associated with a variance of the attributes.
Preferably, determining that the plurality of points lie on a shared plane comprises: determining a plane that passes through the plurality of points; determining distances of one or more of the points from the plane; and determining that the plurality of points lie on a shared plane based on the determined distances, preferably based on a maximum distance, an average distance, and/or a variance of the distances.
Preferably, the method comprises: identifying a normal for each of the points; and determining that the plurality of points lie on a shared plane in dependence on one or more of the identified normals, preferably based on a variance of the normals.
Preferably, the method comprises determining that the plurality of points lie on a curved plane.
Preferably, a threshold associated with the determination that the plurality of points lie on a plane is dependent on a distance of the points from a viewing zone associated with the three-dimensional representation.
Preferably, identifying the plurality of points comprises identifying a plurality of adjacent points (e.g. points in adjacent angular brackets) and/or a plurality of points in an 8x8 arrangement.
Preferably, the three-dimensional representation is associated with a viewing zone.
Preferably, the three-dimensional representation is associated with a plurality of capture devices and/or a plurality of the points of the three-dimensional representation are associated with different capture devices.
Preferably, the method comprises determining a size of the texture point, the size being based on a number and/or an arrangement of the identified points.
Preferably, identifying the plurality of points comprises determining a plurality of pluralities of points of the three-dimensional representation and, for each identified plurality of points: determining whether the plurality of points lie on a shared plane; in dependence on the plurality of points lying on a shared plane, determining a texture patch based on attributes of the plurality of points; and determining a texture point, the texture point comprising a reference to the texture patch.
Preferably, the method comprises defining a transparency value of the texture point so as to signal the texture point as being opaque.
Preferably, the method comprises determining a motion vector for each corner of the texture point.
According to another aspect of the present disclosure, there is described a method of determining an attribute of a point of a three-dimensional representation of a scene, the method comprising: identifying, in the point, a reference to a texture patch, the texture patch being associated with a plurality of attributes; and determining the attribute of the point based on the attributes of the texture patch.
Preferably, determining the attribute comprises determining a plurality of attributes associated with the point. Preferably, the method comprises determining attribute values for a plurality of locations of the representation based on the attributes of the texture patch.
Preferably, the method comprises determining attributes for a plurality of adjacent angular brackets associated with a capture device of the representation based on the attributes of the texture patch.
Preferably, the method comprises determining the arrangement of the attributes in dependence on the texture patch.
Preferably, the method comprises identifying the point based on a size of the point.
Preferably, the method comprises identifying the point based on the location of the point in a file associated with the three-dimensional representation.
Preferably, the method comprises identifying the point based on a transparency value of the point. Preferably, the method comprises identifying the point based on: the point being in a section of a file associated with transparent points; and the point having a transparency value that signals the point as being opaque.
Preferably, the three-dimensional representation is associated with a viewing zone.
Preferably, the viewing zone has a volume of less than 50% of the volume of the scene, less than 20% of the volume of the scene, and/or less than 10% of the volume of the scene.
Preferably, the viewing zone has, or is associated with, a volume, preferably a real-world volume, of less than five cubic metres (5 m³), less than one cubic metre (1 m³), less than one-tenth of a cubic metre (0.1 m³), and/or less than one-hundredth of a cubic metre (0.01 m³).
Preferably, the three-dimensional representation comprises a point cloud.
Preferably, the method comprises storing the three-dimensional representation and/or outputting the three- dimensional representation. Preferably, the method comprises outputting the three-dimensional representation to a further computer device.
Preferably, the method comprises generating an image and/or a video based on the three-dimensional representation.
Preferably, the method comprises forming one or more two-dimensional representations of the scene based on the three-dimensional representation. Preferably, the method comprises forming a two-dimensional representation for each eye of a viewer.
Preferably, the point is associated with one or more of: a location; an attribute; a transparency; a colour; and a size.
Preferably, the point is associated with an attribute for a right eye and an attribute for a left eye.
Preferably, the scene comprises one or more of: an extended reality (XR) scene; a virtual reality (VR) scene; an augmented reality (AR) scene; and a mixed reality (MR) scene.
Preferably, the method comprises forming a bitstream that includes the texture point and/or the texture patch.
According to another aspect of the present disclosure, there is described a system for carrying out the aforesaid method, the system comprising one or more of: a processor; a communication interface; and a display.
According to another aspect of the present disclosure, there is described an apparatus for determining a point of a three-dimensional representation of a scene, the apparatus comprising: means for (e.g. a processor for) identifying a plurality of points of the representation; means for (e.g. a processor for) determining that the plurality of points lie on a shared plane; means for (e.g. a processor for) in dependence on the plurality of points lying on a shared plane, determining a texture patch based on attributes of the plurality of points; and means for (e.g. a processor for) determining a texture point, the texture point comprising a reference to the texture patch.
According to another aspect of the present disclosure, there is described an apparatus for determining an attribute of a point of a three-dimensional representation of a scene, the apparatus comprising: means for (e.g. a processor for) identifying, in the point, a reference to a texture patch, the texture patch being associated with a plurality of attributes; and means for (e.g. a processor for) determining the attribute of the point based on the attributes of the texture patch.
According to another aspect of the present disclosure, there is described a bitstream comprising one or more texture points and/or texture patches determined using the aforesaid method.
According to another aspect of the present disclosure, there is described a bitstream comprising a texture point, the texture point comprising a reference to a texture patch that comprises attribute values associated with the texture point.
According to another aspect of the present disclosure, there is described an apparatus (e.g. an encoder) for forming and/or encoding the aforesaid bitstream.
According to another aspect of the present disclosure, there is described an apparatus (e.g. a decoder) for receiving and/or decoding the aforesaid bitstream.
Any feature in one aspect of the disclosure may be applied to other aspects of the invention, in any appropriate combination. In particular, method aspects may be applied to apparatus aspects, and vice versa.
Furthermore, features implemented in hardware may be implemented in software, and vice versa. Any reference to software and hardware features herein should be construed accordingly.
Any apparatus feature as described herein may also be provided as a method feature, and vice versa. As used herein, means plus function features may be expressed alternatively in terms of their corresponding structure, such as a suitably programmed processor and associated memory.
It should also be appreciated that particular combinations of the various features described and defined in any aspects of the disclosure can be implemented and/or supplied and/or used independently.
The disclosure also provides a computer program and a computer program product comprising software code adapted, when executed on a data processing apparatus, to perform any of the methods described herein, including any or all of their component steps.
The disclosure also provides a computer program and a computer program product comprising software code which, when executed on a data processing apparatus, comprises any of the apparatus features described herein.
The disclosure also provides a computer program and a computer program product having an operating system which supports a computer program for carrying out any of the methods described herein and/or for embodying any of the apparatus features described herein.
The disclosure also provides a computer readable medium having stored thereon the computer program as aforesaid.
The disclosure also provides a signal carrying the computer program as aforesaid, and a method of transmitting such a signal.
The disclosure extends to methods and/or apparatus substantially as herein described with reference to the accompanying drawings.
The disclosure will now be described, by way of example, with reference to the accompanying drawings.
Description of the Drawings
Figure 1 shows a system for generating a sequence of images.
Figure 2 shows a computer device on which components of the system of Figure 1 may be implemented.
Figure 3 shows a method of determining a three-dimensional representation of a scene.
Figures 4a and 4b show a method of determining a point based on a plurality of sub-points.
Figure 5 shows a scene comprising a viewing zone.
Figures 6a and 6b show arrangements of capture devices for determining points of the three-dimensional representation.
Figure 7 shows a point that can be captured by a plurality of capture devices.
Figures 8a and 8b show grids formed by the different capture devices.
Figure 9 describes a method of determining a location of a point of the three-dimensional representation.
Figure 10 shows a method of determining an angle of a point from a capture device used to capture the point.
Figure 11 shows a method of determining a texture patch associated with a point of the three-dimensional representation.
Figures 12a, 12b, 12c, and 12d illustrate the determination and use of a texture patch.
Figures 13 and 14 show detailed methods of determining texture patches.
Description of the Preferred Embodiments
Referring to Figure 1 , there is shown a system for generating a sequence of images. This system can be used to generate, and then display, a representation of an environment, which may comprise a VR environment (or an XR environment).
The system comprises an image generator 11 , an encoder 12, a transmitter 13, a network 14, a receiver 15, a decoder 16 and a display device 17.
These components may each be implemented on separate apparatuses. Equally, various combinations of these components may be implemented on a shared apparatus; for example, the image generator 11 , the encoder 12, and the transmitter 13 may all be part of a single image data generation device. Similarly, the receiver 15, the decoder 16, and the display device 17 may all be a part of a single image rendering device.
Typically, the system comprises at least one encoding computer device (e.g. a server of a content provider) and at least one rendering computer device (e.g. a VR headset).
Referring to Figure 2, each of the components, and in particular the image generator 11 , the encoder 12, the transmitter 13, the receiver 15, the decoder 16 and the display device 17 is typically implemented on a computer device 20, where, as described above, a plurality of these components may be implemented on a shared computer device.
Each computer device comprises one or more of: a processor 21 for executing instructions (e.g. so as to perform one or more of the steps of the various methods described below); a communication interface 22 for facilitating communication between computer devices (e.g. an Ethernet interface, a Bluetooth® interface, or a universal serial bus (USB) interface); a memory 23 and/or storage 24 for storing information and instructions (e.g. a random access memory (RAM), a read only memory (ROM), a hard disk drive (HDD), a solid state drive (SSD), and/or a flash memory); and a user interface 25 (e.g. a display, a mouse, and/or a keyboard) for enabling a user to interact with the computer device. These components may be coupled to one another by a bus 25 of the computer device.
The computer device 20 may comprise further (or fewer) components. In particular, the computer device (e.g. the display device 17) may comprise one or more sensors, such as an accelerometer, a GPS sensor, or a light sensor. These sensors typically enable the computer device to identify an environmental condition and/or an action of a wearer of the display device.
Turning back to Figure 1, the image generator 11 is configured to generate a sequence of image data (e.g. a sequence of image frames) to enable the display device 17 to use this image data to display a plurality of images. The image data may comprise one or more digital objects and the image data may be generated or encoded in any format. For example, the image data may comprise point cloud data, where each point has a 3D position and one or more attributes. These attributes may, for example, include a surface colour, a transparency value, an object size and a surface normal direction. Each attribute may have a value chosen from a continuous range or may have a value chosen from a discrete set.
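Purely by way of illustration, such a point might be represented by a data structure along the following lines (a minimal sketch; the field names, types, and defaults are assumptions rather than anything mandated by this disclosure):

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Point:
    """A single point of a point cloud: a 3D position plus one or more attributes."""
    position: Tuple[float, float, float]                   # x, y, z location in the scene
    colour: Tuple[int, int, int] = (0, 0, 0)               # surface colour (e.g. RGB)
    transparency: float = 0.0                              # 0.0 opaque, 1.0 fully transparent
    size: float = 1.0                                      # object/point size
    normal: Tuple[float, float, float] = (0.0, 0.0, 1.0)   # surface normal direction
```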
The image data enables the later rendering of images. This image data may enable a direct rendering (e.g. the image data may directly represent an image). Equally, the image data may require further processing in order to enable rendering. For example, the image data may comprise three-dimensional point cloud data, where rendering a two-dimensional image using this data requires processing based on a viewpoint of this two-dimensional image.
The image data may comprise depth map data, where one or more pixels or objects in the image is associated with a depth that is specified by the depth map data. The depth map data may be provided as a depth map layer, separate from an image layer. In some contexts, such as MPEG Immersive Video (MIV), the image layer may instead be described as a texture layer. Similarly, in some contexts, the depth map layer may instead be described as a geometry layer.
The image data may include a predicted display window location. The predicted display window location may indicate a portion of an image that is likely to be displayed by the display device 17. The predicted display window location may be based on a viewing position (such as a virtual position and/or orientation of the user in a 3D environment) of the user, where this viewing position may be obtained from the display device. The predicted display window location may be defined using one or more coordinates. For example, the predicted display window location may be defined using the coordinates of a corner or centre of a predicted display window, and may be defined using a size of the predicted display window. The predicted display window location may be encoded as part of metadata included with the frame.
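For instance, a predicted display window location carried as frame metadata could take roughly the following form (a sketch only; the field names and the corner-plus-size convention are illustrative assumptions):

```python
from dataclasses import dataclass

@dataclass
class PredictedDisplayWindow:
    """Frame metadata indicating the portion of an image likely to be displayed."""
    corner_x: int   # image coordinates of the window's top-left corner
    corner_y: int
    width: int      # size of the predicted display window, in pixels
    height: int

    def contains(self, x: int, y: int) -> bool:
        """Return True if pixel (x, y) falls inside the predicted display window."""
        return (self.corner_x <= x < self.corner_x + self.width
                and self.corner_y <= y < self.corner_y + self.height)
```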
The image data for each image (e.g. each frame) may include further information, which may be provided as a part of an image, e.g. as part of the point cloud data, or as separate layers. In particular, the image data may include audio information or haptic feedback information indicating audio or haptics which can accompany displayed visual data. An audio layer or haptic layer may accompany each image, and may be omitted for images where no accompanying audio or haptics are required.
Similarly, the image data may comprise interactivity information, where the image data may contain or indicate elements with which a user can interact. The interactivity information may, for example, define a behaviour of an element, where a user is able to interact with the element based on this behaviour. The behaviour typically defines a change in an element that occurs as a result of a user interaction where this change may comprise a change in the attributes of the element or in the rendering of the element. As an example, where an image contains a target element, the target element may be arranged to disappear when a user interacts with this element, or to provide feedback indicating that the user has interacted with the target. This interactivity data may be provided as part of, or separately to, the image data.
The image data may indicate, or may be combinable with, a state of the virtual environment, a position of a user, or a viewing direction of the user. Here, the position and viewing direction may be physical properties of the user in the real world, or the position and viewing direction may instead be purely virtual, for example being controlled using a handheld controller. The image generator 11 may, for example, obtain information from the display device 17 that indicates the position, viewing direction, or motion of the user. Equally, the image generator may generate image data such that it can later be combined with this position, viewing direction, or motion, where the image generator may generate a full scene which is only partially viewed by a user depending on the position of that user.
In some cases, the generated image may be independent of user position and viewing direction. This type of image generation typically requires significant computer resources such as a powerful GPU, and may be implemented in a cloud service, or on a local but powerful computer. For example, a cloud service (such as a Cloud Rendering Service (CRN)) may reduce the cost per-user and thereby make the image frame generation more accessible to a wider range of users. Here “rendering” refers at least to an initial stage of rendering to generate an image. Further rendering may occur at the display device 17 based on the generated image to produce a final image which is displayed.
The image generator 11 may, for example, comprise a rendering engine for initially rendering a virtual environment such as a game or a virtual meeting room.
The encoder 12 is configured to encode frames to be transmitted to the display device 17. The encoder may be implemented using executable software or may be implemented on specific hardware such as an ASIC. In some embodiments, the image generator 11 may transmit raw, unencoded, data through the network 14. However, such transmission typically leads to a high file size and requires a high bandwidth so that it is typically desirable to encode the data prior to the transmission.
The encoder 12 may encode the image data in a lossless manner or may encode the data in a lossy manner. The encoder may apply inter-frame or intra-frame compression based on a currently-encoded frame and optionally one or more previously encoded frames. The encoder may be a multi-layer encoder, such as a low complexity enhancement video codec (LCEVC) enabled encoder.
Where the generated frames comprise depth map data, the encoder 12 may perform layered encoding on each instance of image data (e.g. each frame) to generate an encoded frame comprising a base depth map layer and an enhancement depth map layer. Encoding a depth map in this way may improve compression. In some applications, such as HDR video, depth maps are desirably highly detailed with a bit depth of up to twelve or fourteen bits, which is a significant increase in the data to be transmitted. As a result, providing ways to improve compression of the depth map can make more realistic depth map-based displays viable when performing rendering or transmission of rendered data in real-time. Furthermore, this type of layered encoding makes it easy to drop (and then pick back up) one or more of the layers, which provides flexibility and tools for bandwidth management.
Layered encoding is also helpful as the final decoder/user device (such as a user display device) can choose whether to process these extra layers. For example, in a non-layered approach, the best the end device (i.e. the receiver, decoder or display device associated with a user that will view the images) can do is determine that it does not have enough resources for a given quality (be it resolution, frame rate, inclusion of depth map) and then signal to the controller/renderer/encoder that it does not have enough resources. The controller then will send future images at a lower quality. In that alternative scenario, the end device still unfortunately has to process the higher quality data until the lower quality data arrives, if it can process the received images at all.
In some of the described embodiments, this situation is improved upon because when/if the end device determines for example that it does not have the processing capabilities to handle the highest level of quality, then it can drop and/or choose not to process certain layers. The end device may also signal to the controller that it needs a lower level of quality, but in the meantime the end device can only process the number of layers that it can handle. Therefore, the end device can react to conditions much more quickly.
In some cases, depth map data may be embedded in image data. In this case, the base depth map layer may be a base image layer with embedded depth map data, and the enhancement depth map layer may be an enhancement image layer with embedded depth map data.
Alternatively, when the generated images comprise a depth map layer separate from an image layer and multi-layer encoding is applied, the encoded depth map layers may be separate from the encoded image layers. This has the advantage that the encoded depth map layers can be dropped under some conditions while still retaining image layers that can be displayed (albeit with a lower level of realism). For example, the encoded depth map layers can be dropped by a transmitter or encoder when available communication resources are reduced, or can be dropped by an end device which lacks the processing resources to handle the highest level of quality.
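As a rough sketch of how a transmitter or end device might drop separable layers when communication or processing resources are reduced (the layer names, drop order, and bandwidth budget below are assumptions for illustration only):

```python
def select_layers(available_budget: float, layers: dict) -> dict:
    """Drop optional layers, least essential first, until the frame fits the budget.

    `layers` maps a layer name to its estimated size (e.g. in megabits); the base
    image layer is always retained.
    """
    # Assumed drop order: enhancement layers first, then the separate depth map layer.
    drop_order = ["depth_enhancement", "image_enhancement", "depth_base"]
    kept = dict(layers)
    total = sum(kept.values())
    for name in drop_order:
        if total <= available_budget:
            break
        if name in kept:
            total -= kept.pop(name)
    return kept
```

For example, with a budget of 10 and layers {"image_base": 6.0, "image_enhancement": 3.0, "depth_base": 2.0, "depth_enhancement": 2.0}, the sketch would drop the depth enhancement and image enhancement layers while retaining the base image and base depth map layers.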
Similarly, if some images comprise an audio base layer, a haptic feedback base layer, an audio enhancement layer or a haptic feedback enhancement layer, these can be processed or dropped flexibly.
Again similarly, if some images comprise an interactivity data base layer or an interactivity enhancement layer these can be processed or dropped flexibly. For example, certain interactions may only be possible where a threshold bandwidth is available, where complex interactions (e.g. those enabling a conversation with a digital object) may be disabled before less complex interactions (e.g. changing a pixel colour) are disabled.
Additionally or alternatively, where the image data comprises point cloud data, the encoder may apply a point cloud data encoding technique such as described in European patent application EP21386059.6, which is incorporated herein by reference. Such a point cloud encoder may act as a base encoder for a layered encoding technique such as LCEVC or VC-6. Notably LCEVC and VC-6 techniques encode and decode a layered signal, but are agnostic about the content type of data encoded in the signal. For example, the signal can include textures, video frames, geometry or depth data, meshes, point clouds, rendering attributes or physics engine attributes.
The transmitter 13 may be any known type of transmitter for wired or wireless communications, including an Ethernet transmitter or a Bluetooth transmitter.
The transmitter 13 may be configured to make decisions about how to transmit the image data, and/or may provide feedback to the encoder 12 or the image generator 11 . For example, the transmitter may determine available communication resources (e.g. bandwidth) for transmitting image data, and may drop one or more layers from an encoded frame, or indicate to the image generator and/or encoder that image data should be generated and encoded with fewer layers, when insufficient bandwidth is available for transmission of all generated data. As specific examples, the transmitter may be configured to drop a depth map layer, an LCEVC enhancement layer, or a VC-6 enhancement layer from a frame when insufficient communication resources are available.
The network 14 provides a channel for communication between the transmitter 13 and the receiver 15, and may be any known type of network such as a WAN or LAN or a wireless Wi-Fi or Bluetooth network. The network may further be a composite of several networks of different types. Many users only have access to a network with a bandwidth of 30MBps which can lead to latency jitter when streaming. The required bandwidth and the observed latency can be reduced by means of tactics such as forward-looking rendering and last-millisecond reprojection, which are enabled by improved compression.
The receiver 15 may be any known type of receiver for wired or wireless communications, including an Ethernet receiver or a Bluetooth receiver.
The decoder 16 is configured to receive and decode an encoded frame. The decoder may be implemented using executable software or may be implemented on specific hardware such as an ASIC.
The display device 17 may for example be a television screen or a VR headset. The timing of the display may be linked to a configured frame rate, such that the display device may wait before displaying the image. The display device may be configured to perform warping, that is, to obtain a final display window location, adjust a warpable image to obtain a final image corresponding to a final viewing direction of the user, and display the final image.
In this regard, the image data is typically arranged to provide a warpable image for which a portion of the image that is displayed at the display device 17 is dependent on a position or orientation of a viewer. The warpable image may then be rendered before a most up to date viewing direction of the user is known. The warpable image may be transmitted to the display device, or the warpable image may be transmitted to a rendering node which is near to the display device, and the display device or rendering node may perform time warping to generate a displayed image portion based on the warpable image and the most up to date viewing direction of the user.
As mentioned above, a single device may provide a plurality of the described components. For example, a first rendering node may comprise the image generator 11 , encoder 12 and transmitter 13. Additional similar rendering nodes may be included in the system, and may work together to generate the sequence of frames.
In one case, multiple rendering nodes may each provide separate image data to an image data assembling node; for example, each rendering node may provide a part of a sequence of frames to a frame assembling node.
For example, the receiver 15, decoder 16 or display device 17 may be configured to assemble parts of image data from multiple sources to generate a sequence of images for display on the display device.
Alternatively, the image data assembling node may be separate from the receiver 15, decoder 16 and display device 17.
Additionally or alternatively, multiple rendering nodes may be chained. In other words, successive rendering nodes may add to a sequence of image data as it passes from rendering node to rendering node, and eventually a complete sequence of image data is then provided to the receiver 15. Furthermore, each rendering node may obtain components of a render from multiple upstream rendering nodes and/or distribute components of a render to multiple downstream rendering nodes.
A chain of rendering nodes may be useful for performing different rendering tasks that require different quantities of processing resources, or different frame rates. For example, a company may provide distributed processing in the form of a centralised hub which has abundant processing resources but is distant from users, and peripheral locations which have more scarce processing resources but are closer to users. Expensive but fairly static rendering features such as background lighting or environmental impact on sound may be generated at the central hub (for example using ray tracing), while features that require fewer resources but faster responses or higher frame rates may be generated closer to the user. In other words, the more responsive a rendering feature needs to be, the lower the latency it needs between the rendering node which generates the feature and the user display and, in a chain of rendering nodes, the node which generates each rendering feature can be chosen based on a required maximum latency of that feature. On the other hand, if it is expensive to generate a rendering feature, then it may be preferable to generate the feature less frequently and with a higher maximum latency. For example, a static, high-quality background feature may be generated early in the chain of rendering nodes and a dynamic, but potentially lower-quality, foreground feature may be generated later in the chain of rendering nodes, closer to the user device. Here, environmental impact on sound means, for example, a set of surfaces may be constructed where each surface has different sound reflection and absorption properties depending upon material and shape. The frame rates may be matched by creating multiple frames with features generated at the lower frame rate, and combining them with the frames with features generated at the higher frame rate. In a non-limiting embodiment, a preliminary rendering generates volumetric object data including motion vectors at a first (lowest) frame rate, then produces 2D rendered frames plus depth information for a specific user at a second (higher) frame rate, then transmits video plus depth data to the user device, which produces final frames for display via space warping (depth-based reprojections) at a third (highest) frame rate. One or more of these steps may be performed in combination with the other described embodiments. The viewing position of the user may change as additional rendering tasks are performed at different rendering nodes in the chain. Each or any rendering node may obtain an updated viewing position before performing its respective rendering task.
Additionally, the system may simultaneously generate multiple sequences of image data for different respective users or different respective display devices. For example, in the context of a VR or AR experience, each user or display device may view a different 3D environment, or may view different parts of a same 3D environment. When using a chain of rendering nodes, each node may serve multiple users or just one user.
For example, a starting rendering node (e.g. at a centralised hub) may serve a large group of users. For example, the group of users may be viewing nearby parts of a same 3D environment. In this case, the starting node may render a wide zone of view (“field of view”) which is relevant for all users in the large group.
The starting node may send this wide field of view to a first middle rendering node which renders additional aspects of the 3D environment. These additional aspects may for example be aspects which require less processing power to render, or may be aspects which are specific to individual users of the group. Additionally, the middle rendering node may render features in a smaller field of view than the starting node - this smaller field of view may be relevant to each user rather than the group of users. The first middle rendering node may additionally only serve a smaller number of users (e.g. half of the large group of users), with the remaining users being served by a second middle rendering node which also receives the wide field of view from the starting node.
The middle rendering node(s) may then send sequences of second partially or fully rendered frames to an end device for each user. The end device may perform further processes such as warping or focal distance adjustments, optionally using depth map data.
Preferably, each rendering node encodes the partially or fully rendered frames before transmitting them on to a next rendering node or to the receiver 15. This means that the required communication resources can be reduced when the rendering nodes are separated by one or more networks, or more generally are implemented in a distributed system such as a cloud.
However, each rendering node in a chain is encoding a different partially or fully rendered frame, with different data. Therefore, it may be advantageous for different rendering nodes to use different rendering formats and/or encoding formats. For example, the output from a first rendering node may be point cloud data which logically describes a 3D scene. This point cloud data can be encoded using the techniques of EP21386059.6. A second rendering node may then operate on the point cloud data to generate image data that is more readily displayed by a generic display device, without requiring the display device to model the 3D environment. This image data may be encoded using video coding techniques.
The chaining of rendering nodes may be extended to arbitrary tree structures, where a rendering node obtains partially rendered frames from more than one preceding rendering node, and generates further partially or fully rendered frames based on the multiple obtained sequences of partially rendered frames.
For example, a content rendering network (CRN) comprising numerous rendering nodes may be used to serve a volumetric event to a large number of same-time users, such as users participating in a shared virtual environment. Rendering the same event for each user is far more expensive in terms of computation time and power consumption than rendering the volumetric event once and performing the rendering equivalent of multicasting the volumetric event for multiple users. For example, each user may have a second rendering node (such as a VR headset), and the network may comprise a central first rendering node. The first rendering node may render the volumetric event, and distribute partially rendered frames depicting the volumetric event to the different second rendering nodes. The second rendering node for each user may then integrate the partially rendered frames depicting the volumetric event into a view of the virtual environment which is currently being shown to each user, based on parameters such as the user's virtual position.
The receiver 15, decoder 16 and display device 17 may be consolidated into a single device, or may be separated into two or more devices. For example, some VR headset systems comprise a base unit and a headset unit which communicate with each other. The receiver 15 and decoder 16 may be incorporated into such a base unit.
In some embodiments, the network 14 may be omitted. For example, a home display system may comprise a base unit configured as an image source, and a portable display unit comprising the display device 17.
In the event that the decoder 16 or the display device 17 does not or cannot handle one or more layers, the receiver 15 or another transmitter associated with the decoder or display device may send a corresponding layer drop indication back through the network 14. The layer drop indication may be received by each rendering node. A rendering node which generates partially or fully rendered frames for that specific decoder or display device may cease generating the dropped layer. On the other hand, a rendering node which generates partially or fully rendered frames for multiple end devices may disregard a layer drop indication received from one end device (as the dropped layer is still needed for other devices). Alternatively, rendering nodes which serve multiple end devices may record received layer drop indications, and may cease generating the dropped layer only when all end devices served by the rendering node indicate that the layer is to be dropped.
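A rendering node that serves multiple end devices might track layer drop indications along the following lines (a minimal sketch; the class and method names are assumptions):

```python
class LayerDropTracker:
    """Cease generating a layer only once every served end device has asked to drop it."""

    def __init__(self, served_devices: set):
        self.served_devices = set(served_devices)
        self.drop_requests = {}   # layer name -> set of devices that requested the drop

    def record_drop_indication(self, device_id: str, layer: str) -> None:
        self.drop_requests.setdefault(layer, set()).add(device_id)

    def should_generate(self, layer: str) -> bool:
        """The layer is still generated while at least one served device has not dropped it."""
        return self.drop_requests.get(layer, set()) != self.served_devices
```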
In preferred examples, the encoders or decoders are part of a tier-based hierarchical coding scheme or format. Hierarchical coding enables frames to be communicated with higher resolution and/or higher frame rate than is possible in single-tier coding schemes. In hierarchical coding, one or more enhancement layers is communicated with base data, where the enhancement layers can be used to up-sample the base data at the decoder, for example providing up-sampling in a spatial or temporal dimension. When combined with equivalent down-sampling of the original frames and generation of the enhancement layer at an encoder, hierarchical coding can overall provide lossless compression of data, with higher resolution and/or higher frame rate for a given transmission bit rate. Examples of a tier-based hierarchical coding scheme include LCEVC: MPEG-5 Part 2 LCEVC (“Low Complexity Enhancement Video Coding”) and VC-6: SMPTE VC-6 ST-2117, the former being described in PCT/GB2020/050695, published as WO 2020/188273, (and the associated standard document) and the latter being described in PCT/GB2018/053552, published as WO 2019/111010, (and the associated standard document), all of which are incorporated by reference herein. However, the concepts illustrated herein need not be limited to these specific hierarchical coding schemes.
A further example is described in WO2018/046940, which is incorporated by reference herein. In this example, a set of residuals are encoded relative to the residuals stored in a temporal buffer.
LCEVC (Low-Complexity Enhancement Video Coding) is a standardised coding method set out in standard specification documents including the Text of ISO/IEC 23094-2 Ed 1 Low Complexity Enhancement Video Coding published in November 2021 , which is incorporated by reference herein.
The system described above is suitable for generating and presenting a representation of a scene, where this scene displays media content to a user. The scene typically comprises an environment, where the user is able to move (e.g. to move their head or to turn their head) to look around the environment and/or to move around the environment. For example, the scene may be a scene of a room in a building, where the user is able to move around the room (e.g. by moving in the real world and/or by providing an input to a user interface) in order to inspect various parts of the room. Typically, the scene is an XR (e.g. a VR) scene, where the user is able to move about the scene in three degrees of freedom (3DoF) or six degrees of freedom (6DoF) so as to experience the scene.
As has been described with reference to Figure 1, the image generator 11 may be arranged to determine point cloud data, where each point of the point cloud has a 3D position and one or more attributes. More generally, the image generator (or another component) is arranged to determine a three-dimensional representation of a scene, where this three-dimensional representation is thereafter used to generate two-dimensional images that are presented to a user at the display device 17.
While the points are typically points of a point cloud, more generally the disclosure extends to any point that is associated with a location and a value. Therefore, the points may, more generally, be considered to be data (or datapoints), which data is associated with a location and a value, and the ‘points’ may comprise polygons, planes (regular or irregular), Gaussian splats, etc.
Referring to Figure 3, there is described a method of determining (an attribute for) a point of such a three-dimensional representation. The method comprises determining the attribute using a capture device, such as a camera or a scanner. The scene may comprise a real scene, in which attribute values are captured using a camera, or a virtual scene (e.g. a three-dimensional model of a scene), in which attribute values are captured using a virtual scanner.
Where this disclosure describes ‘determining a point’ it will be understood that this generally refers to determining a point that has a location and an attribute value, where determining the point comprises determining the attribute value and/or storing a point that comprises at least an attribute value and a location value (these values may be indirect values, e.g. where the location is identified relative to another point). Once a plurality of points have been captured, these points can be stored as a three-dimensional representation (e.g. a point cloud) so as to enable the reconstruction of the three-dimensional scene based on this representation.
Typically, the scene comprises a simulated scene that exists only on a computer. Such a scene may, for example, be generated using software such as the Maya software produced by Autodesk®. The attributes determined using the methods described herein may then depend on virtual objects located within the scene as well as a virtual lighting arrangement used in the scene.
In a first step 31 , a computer device initiates a capture process for a capture device, the capture process being initiated with an initial azimuth angle (e.g. of 0°) and an initial elevation angle (e.g. of 0°).
In a second step 32, the computer device causes a point to be captured using the capture device at the current azimuth angle and current elevation angle. Capturing a point typically comprises assigning an attribute value to the point, which attribute value may, for example, be a color of the point and/or a transparency value of the point. Typically, the point has one or more color values associated with each of a left eye and a right eye of a viewer. Capturing the point may also comprise determining a normal value associated with the point, e.g. a normal of a surface on which the point lies. Typically, capturing the point further comprises determining a location of the point, e.g. by determining a distance of the point from the camera.
In practice, determining the point may comprise sending a ‘ray’ from the capture device and then stepping through a computer model to determine which surface of the computer model is impacted by the ray. The color, transparency, and normal of this surface are then recorded alongside the distance of the surface from the capture device.
In a third step 33, the computer device determines whether a point has been captured for the capture device at each azimuth of a range of azimuths and, in a fourth step 34, if points have not been captured at each azimuth, then the azimuth angle is incremented and the method returns to the second step 32 and another point is captured. The azimuth angle may, for example, be incremented by between 0.01° and 1° and/or by between 0.025° and 0.1°. Typically, the range of azimuth angles is selected to be 360° (i.e. so that the capture device captures points surrounding the entirety of the capture device), but it will be appreciated that other ranges are possible.
Once a point has been captured for each azimuth, in a fifth step 35, the computer device determines whether a point has been captured for the capture device at each elevation of a range of elevations and, in a sixth step 36, if points have not been captured at each elevation, then the azimuth angle is reset to the initial value, the elevation angle is incremented and the method returns to the second step 32 and another point is captured. The elevation angle may, for example, be incremented by between 0.01° and 1° and/or by between 0.025° and 0.1°. Typically, the range of elevation angles is selected to be 360° (i.e. so that the capture device captures points surrounding the entirety of the capture device), but it will be appreciated that other ranges are possible.
In a seventh step 37, once points have been captured for each azimuth angle and each elevation angle, the scanning process ends.
This method enables a capture device to capture points at a range of elevation and azimuth angles. This point data is typically stored in a matrix. The point data may then be used to provide a representation of the scene to a user, e.g. the three-dimensional representation formed by the point data may be processed to produce two-dimensional images for each eye of a user, with these images then being shown to a user via the display device 17 to provide a virtual reality experience to the viewer. By using the captured data, a video can be provided to a viewer that enables the viewer to move their head to look around the scene (while remaining at the location of the capture device).
It will be appreciated that the capture pattern (or scanning pattern) described with reference to Figure 3 is purely exemplary and that numerous capture patterns are possible. In general, the capture process for each capture device comprises capturing one or more points at one or more azimuth angles and/or one or more elevation angles.
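The particular pattern of Figure 3 can be summarised by the following sketch (assumptions: the angle increment is 0.05°, and a capture_point(azimuth, elevation) helper casts a ray at the given angles and returns the attribute value(s) and distance of the first surface hit):

```python
def scan_scene(capture_point, increment_deg: float = 0.05):
    """Capture one point per (azimuth, elevation) pair around the capture device.

    Mirrors steps 31-37 of Figure 3: both angles start at 0 degrees and are stepped
    through a 360 degree range, capturing a point at every combination of angles.
    """
    points = {}
    elevation = 0.0
    while elevation < 360.0:              # fifth/sixth steps: iterate over elevations
        azimuth = 0.0                     # azimuth reset to its initial value
        while azimuth < 360.0:            # third/fourth steps: iterate over azimuths
            points[(azimuth, elevation)] = capture_point(azimuth, elevation)  # second step
            azimuth += increment_deg
        elevation += increment_deg
    return points                         # point data, typically stored as a matrix
```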
The ‘points’ captured by the capture device are typically associated with a size, such as a height, a width, or a depth. That is, the points typically relate to two-dimensional planes/pixels and/or three-dimensional voxels. In this regard, there is necessarily some space between the locations of adjacent points (since if the points had no width, then an infinite number of points would be required to capture points at each angle). The size provides points that depict a non-negligible area of the three-dimensional space so that a plurality of points can be fit together to provide a depiction of the scene to a viewer.
The width and height of each point is typically dependent on the distance of that point from the capture device, where more distant points have a larger width/height. The width and height of each point is typically determined so that when each point is displayed, there is no space between adjacent points (indeed, there may be some overlap between points to ensure that no gaps appear between points). This height/width of each point can be determined at the time of capturing the points, or can be determined or defined after the capture of the points.
Typically, the points comprise a size value, which is stored as a part of the point data. For example, the points may be stored with a width value and/or a height value. Typically, the minimum width and the minimum height of a point are set by the angle increment of the azimuth angle and the elevation angle respectively. The size may then be specified in terms of this angle increment and/or in terms of this minimum width/minimum height (e.g. as being a multiple of the angle increment). In some embodiments, the size value is stored as an index, which index relates to a known list of sizes (e.g. if the size may be any of 1x1, 2x1, 1x2, or 2x2 pixels, this may be specified by using 3 bits and a list that relates each combination of bits to a size). The size may be stored based on an underscan value. In this regard, where an object is very near to the viewing zone it may be captured using an unnecessarily dense arrangement of points. Therefore, certain surfaces or areas of the representation may be associated with an underscan value, which underscan value defines a reduction in the number of points captured as compared to a representation without underscan. The size of the points may be defined so as to indicate this underscan value. In an exemplary embodiment, the underscan value is an integer value between 0 and 3 and the size is stored as a combination of point dimensions (e.g. a width in the range [0,2] and a height in the range [0,2]) and an underscan factor (e.g. an underscan factor in the range [0,3]).
In some embodiments, the width and the height are dependent on the underscan factor. For example, when the underscan factor exceeds a threshold value, the possible height and width values may be limited. In a specific example, when the underscan factor is 3, the width and the height may be limited to the range [0,1]. The size may then be defined as size = underscan*9 + height*3 + width. Such a method provides efficient storage and indication of width, height, and underscan values.
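Based on the exemplary formula above, the size value could be packed and unpacked as follows (a sketch under the stated ranges; the restriction of width and height to [0,1] when the underscan factor is 3 follows the specific example above):

```python
def encode_size(width: int, height: int, underscan: int) -> int:
    """Pack width and height (each in [0,2]) and an underscan factor (in [0,3])."""
    assert 0 <= underscan <= 3
    limit = 1 if underscan == 3 else 2   # width/height restricted to [0,1] when underscan is 3
    assert 0 <= width <= limit and 0 <= height <= limit
    return underscan * 9 + height * 3 + width

def decode_size(size: int) -> tuple:
    """Recover (width, height, underscan) from a packed size value."""
    underscan, rest = divmod(size, 9)
    height, width = divmod(rest, 3)
    return width, height, underscan
```

For example, encode_size(2, 1, 0) gives 5, and decode_size(5) recovers (2, 1, 0).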
As shown in Figure 4a, typically, for each capture step (e.g. each azimuth angle and/or each elevation angle), a plurality of sub-points SP1, SP2, SP3, SP4, SP5 is determined. For example, where the azimuth angle increment is 0.1°, then for an azimuth angle of 0°, sub-points may be determined at azimuth angles of -0.05°, -0.025°, 0°, 0.025°, and 0.05° (and similar sub-points may be determined for a plurality of elevation angles). Attribute values of these sub-points may then be combined to obtain an attribute value for the point. For example, a maximum attribute value of the sub-points may be used as the value for the point, an average attribute value of the sub-points may be used as the value for the point, and/or a weighted average of the sub-points may be used as the value for the point. It will be appreciated that numerous other methods for combining the attribute values of the sub-points are possible.
By determining the attribute of a point based on the attributes of sub-points, the accuracy of the capture process can be increased. While it would be possible to simply reduce the increment of the angle steps to provide a higher resolution scene, by considering sub-points but only storing attributes for points, a balance can be struck between accuracy and file size (since storing every sub-point would lead to a substantial increase in the amount of data that needs storing).
With the example of Figure 4a, for each point of the three-dimensional representation that is captured by a capture device, this capture device may obtain attributes associated with each of the sub-points SP1, SP2, SP3, SP4, SP5, combine these attributes to obtain a point attribute, and then store a point with a distance that is an average (e.g. a weighted average) of the distances of the sub-points from the capture device, at the nominal angle of the point, with the point attribute.
As shown in Figure 4b, where a plurality of sub-points SP1, SP2, SP3, SP4, SP5 are considered, these points may have different distances from the location of the capture device. In some embodiments, the attributes of the sub-points may be combined in dependence on this distance, e.g. so that sub-points nearer to the capture device have higher weightings.
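One possible distance-weighted combination of sub-point attributes is sketched below (the inverse-distance weighting and the colour representation are assumptions; any of the combinations described above may equally be used):

```python
def combine_subpoints(subpoints):
    """Combine sub-point colours into one point colour, weighting nearer sub-points more.

    `subpoints` is a list of (distance, (r, g, b)) pairs captured around the nominal angle.
    """
    weights = [1.0 / max(distance, 1e-6) for distance, _ in subpoints]  # nearer => heavier
    total = sum(weights)
    colour = [0.0, 0.0, 0.0]
    for weight, (_, rgb) in zip(weights, subpoints):
        for channel in range(3):
            colour[channel] += weight * rgb[channel] / total
    return tuple(round(value) for value in colour)
```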
However, the possibility of sub-points with substantially different distances raises a potential problem. Typically, in order to determine a distance for a point, the distances for the sub-points are averaged. But where the sub-points have substantially different distances and/or are related to different surfaces in the scene, this may result in the point having a distance that does not correspond to any actual surface in the scene. Therefore, the point may seem to hang in space (e.g. to hang between the front and rear surfaces shown in Figure 4b).
Similarly, where the attribute values of the sub-points greatly differ, e.g. if the sub-points SP1 and SP2 are white in colour and the sub-points SP3 and SP4 are black in colour, then the attribute value of the point may be substantially different to the attribute value of other points in the scene. In an example, if the scene were composed of black and white objects, the point may appear as a grey point hanging in space between these objects.
In some embodiments, the computer device is arranged to aggregate sub-points so as not to create any floating points. For example, the computer device may determine whether the sub-points are spatially coherent by employing a clustering algorithm (e.g. a k-means clustering algorithm). Where the sub-points are spatially coherent (e.g. where a difference in the distance of the sub-points is below a threshold value), these distances may be averaged to obtain a distance for the point. Where the sub-points are not spatially coherent, the sub-points may be processed to ensure that the distance of any point places it upon a surface; for example, in the system of Figure 4b, sub-points SP1, SP2, and SP3 may be grouped into a first point and sub-points SP4 and SP5 may be grouped into a second point. Since each sub-point is associated with the same capture device and capture angle (all of these sub-points being associated with a capture step that has a particular azimuth angle and elevation angle), these points may be located at the same angle with respect to a capture device. Therefore, to ensure that each sub-point affects the representation considered, the first point (made up of sub-points SP1, SP2, and SP3) may have a smaller distance value than the second point (made up of sub-points SP4 and SP5) and the first point may be assigned a non-zero transparency value so that the second point can be seen through the first point.
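A simple distance-coherence grouping of this kind might look as follows (a sketch; a fixed threshold stands in for the clustering algorithm, such as k-means, mentioned above):

```python
def group_subpoints(distances, threshold: float):
    """Group sub-point distances so that no stored point 'hangs' between two surfaces.

    `distances` is assumed to be a non-empty, ascending list of sub-point distances
    from the capture device.  Returns one averaged distance per spatially coherent group.
    """
    groups = [[distances[0]]]
    for distance in distances[1:]:
        if distance - groups[-1][-1] <= threshold:   # coherent with the current group
            groups[-1].append(distance)
        else:
            groups.append([distance])                # a new group on a different surface
    return [sum(group) / len(group) for group in groups]
```

If more than one group is returned, the nearer point may be assigned a non-zero transparency value, as described above, so that the farther point remains visible behind it.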
By capturing points at a plurality of azimuth angles and elevation angles, e.g. using the method described with reference to Figure 3, it is possible to provide a three-dimensional representation of the scene that can later be used to enable a viewer to view the scene from a plurality of angles. More specifically, given the three-dimensional points captured by the capture device, a computer device is able to render a two-dimensional representation (e.g. a two-dimensional image) of the scene for each eye of a viewer so as to provide a representation with an impression of depth. The computer device may render a series of two-dimensional representations to enable the viewer to look around the scene, where the two-dimensional representations are rendered based on an orientation of the viewer’s head. In this way, the determined representation is useable to provide, for example, a virtual reality (VR), mixed reality (MR), augmented reality (AR), and/or extended reality (XR) experience to the viewer.
To enable such a display, the display device 17 is typically a virtual reality headset that comprises a plurality of sensors to track a head movement of the user. By tracking this head movement, the display device is able to update the images being displayed to the viewer as the viewer moves their head to look about the scene. Typically, this involves the display device sending the sensor data to an external computer device (e.g. a computer connected to the display device via a wire). The external computer device may comprise powerful graphical processing units (GPUs) and/or central processing units (CPUs) so that the external computer device is able to rapidly render appropriate two-dimensional images for the viewer based on the three-dimensional representation and the sensor data.
In some embodiments, the external computer device may comprise a server device, where the display device 17 may be connected to this server device wirelessly. This enables the two-dimensional images to be streamed from the server to the display device so as to enable the display of high-quality images without the need for a viewer to purchase expensive computer equipment. In other words, operations that require large amounts of computing power, such as the rendering of two-dimensional images based on the three-dimensional representation, may be performed by the server, so that the display device is only required to perform relatively simple operations. This enables the experience to be provided to a wide range of viewers.
In some embodiments, a first two-dimensional image is provided to the display device 17 (and/or a connected device) and this first image is ‘warped’ in order to provide an image for viewing at the display device. The warping of the image comprises processing the image based on the sensor data in order to provide an image that matches a current viewpoint of the viewer. By performing the warping at the display device or another local device, the lag between a head movement of the user and an updating of the two-dimensional representation of the scene can be reduced.
One issue with the above-described method of capturing a three-dimensional representation is that it only enables a viewer to make rotational movements. That is, since the points are captured using a single capture device at a single capture location, there is no possibility of enabling translational movements of a viewer through a scene. This inability to move translationally can induce motion sickness within a viewer, can reduce a degree of immersion of the viewer, and can reduce the viewer’s enjoyment of the scene.
Therefore, it is desirable to enable translational movements through the scene. To enable such movements, the three-dimensional representation of the scene may be captured using a plurality of capture devices placed at different locations (or the same capture device placed at different locations). A viewer is then able to move around the scene translationally (e.g. by moving between these locations).
More generally, by capturing points for every possible surface that might be viewed by a viewer, a three-dimensional representation of a scene may be captured that allows a suitable two-dimensional representation of this scene to be rendered regardless of a location of a viewer (e.g. regardless of where a user is standing within a virtual room).
This need to capture points for every possible surface (so as to enable movement about a scene) greatly increases the amount of data that needs to be stored to form the three-dimensional representation.
Therefore, as has been described in the application WO 2016/061640 A1 , which is hereby incorporated by reference, the three-dimensional representation may be associated with a viewing zone, or a zone of viewpoints (ZVP), where the three-dimensional representation is arranged to enable a user to move about the viewing zone so as to view the scene.
Figure 5 illustrates such a viewing zone 1 and illustrates how the use of a viewing zone limits the amount of image data that needs to be stored to provide a three-dimensional representation of the scene. With the scene shown in this figure, and the viewing zone 1 shown in this figure, it is not necessary to determine attribute data for the occluded surface 2 since this occluded surface cannot be viewed from any point in the viewing zone. Therefore, by enabling the user to only move within the viewing zone (as opposed to around the whole scene) the amount of data needed to depict the scene is greatly reduced.
While Figure 5 shows a two-dimensional viewing zone, it will be appreciated that in practice the viewing zone 1 is typically a three-dimensional zone or volume.
The viewing zone 1 may, for example, comprise a rectangular volume, or a rectangular parallelepiped, and the viewing zone may have a height of at least 30 cm, a depth of at least 30 cm, and/or a width of at least 30 cm, where these dimensions enable a user to move their head while remaining in the viewing zone. This is merely an exemplary arrangement of the viewing zone; it will be appreciated that viewing zones of various shapes and sizes may be used (e.g. spherical viewing zones). That being said, it is preferable that the viewing zone is limited so as to cover only a part of the volume of the scene, e.g. no more than 50% of the scene, no more than 25% of the scene, and/or no more than 10% of the scene. In this regard, if the viewing zone is the same size as the scene, then the three-dimensional representation will simply be a standard representation for virtual reality (that enables a user to move freely about the scene) - and so the use of the viewing zone will not provide any reduction in file size.
The viewing zone 1 enables movement of a viewer around (a portion of) the scene. For example, where the scene is a room, the base representation may enable a user to walk around the room so as to view the room from different angles. In particular, the viewing zone enables a user to move through the scene with six degrees-of-freedom (6DoF) movement through the scene, where this aids in the provision of an immersive experience.
In some embodiments, the viewing zone 1 may be four-dimensional, where a three-dimensional location of the viewing zone changes over time - and in such embodiments the size and location of the occluded surface 2 may also change over time. More generally, it will be appreciated that viewing zones may be formed in any size or shape, with different sizes and shapes being suitable for different scenes.
The volume of the viewing zone 1 is typically selected so that a user is able to move to a degree sufficient to avoid motion sickness and to provide an immersive sensation, while still only enabling a limited amount of movement (where this leads to a smaller file size as compared to an implementation where a user is able to fully move about the scene). Typically, the viewing zone is arranged to enable a user to move their head while they are sitting or standing, but not to freely roam around a room.
The viewing zone 1 may have a (e.g. real-world) volume of less than five cubic metres (5m3), less than one cubic metre (1 m3), less than one-tenth of a cubic metre (0.1 m3) and/or less than one-hundredth of a cubic metre (0.01 m3).
The viewing zone 1 may also have a minimum size, e.g. the viewing zone may have a volume of at least 1% of the volume of the scene, at least 5% of the volume of the scene, and/or at least 10% of the volume of the scene. Similarly, the viewing zone may have a volume of at least one-thousandth of a cubic metre (0.001 m3); at least one-hundredth of a cubic metre (0.01 m3); and/or at least one cubic metre (1 m3).
The ‘size’ of the viewing zone 1 typically relates to a size in the real world, where if the viewing zone has a length of one metre this means that a user is able to move one metre in the real world while staying within the viewing zone. The size of the viewing zone in the scene may be greater than, equal to, or less than the size of the viewing zone in the real world. For example, the viewing zone may scale a real-world distance so that moving one metre in the real world moves the user less than (or more than) one metre in the scene. This enables the scene to provide different perceptions to the user (e.g. to make the user feel larger or smaller than they are in real life). Similarly, the viewing zone may scale a real-world angle so that rotating one degree in the real world rotates the user less than (or more than) one degree in the scene.
Therefore, a viewing zone with a volume of one cubic metre typically connotes a viewing zone in which the user is able to move about a one cubic metre volume in the real world while remaining in the viewing zone. And this may cause the user to move about a volume that is more than, or less than, one cubic metre in the scene.
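As a purely illustrative sketch of the scaling described above (the function name, the use of a single uniform scale factor, and the example values are assumptions chosen for illustration rather than part of the representation format):

```python
def real_to_scene_offset(real_offset_m, scale=1.0):
    """Map a head movement measured in the real world (an (x, y, z) offset
    in metres) to an offset in scene coordinates.

    A scale below 1.0 makes the user feel larger (the scene moves less than
    the head); a scale above 1.0 makes the user feel smaller; a scale of 1.0
    keeps real-world and in-scene distances equal.
    """
    return tuple(scale * component for component in real_offset_m)


# Moving 0.5 m to the right in the real world moves the viewer only 0.25 m
# in the scene when the scene applies a scale factor of 0.5.
print(real_to_scene_offset((0.5, 0.0, 0.0), scale=0.5))  # (0.25, 0.0, 0.0)
```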
Referring to Figure 6a, in order to capture points for each surface and location that is visible from the viewing zone 1, a plurality of capture devices C1, C2, ..., C9 may be used (e.g. a plurality of virtual scanners and/or a plurality of cameras). Each capture device is typically arranged to perform a capture process, e.g. as described with reference to Figure 3, in which the capture device captures points at a plurality of azimuth angles and elevation angles. By locating the capture devices appropriately, e.g. by locating a capture device at each corner of the viewing zone, it can be ensured that most (or all) points of a scene are captured.
Typically, a first capture device C1 is located at a centrepoint of the viewing zone 1. In various embodiments, one or more capture devices C2, C3, C4, C5 may be located at the centres of faces of the viewing zone; and/or one or more capture devices C6, C7, C8, C9 may be located at edges of and/or corners of the viewing zone.
Figure 6a shows a two-dimensional view (e.g. a plan view) of a rectangular viewing zone. It will be appreciated that within this viewing zone each capture device may be located on a shared plane. Equally, the various capture devices may be located on different planes. Referring, for example, to Figure 6b, there is shown a three-dimensional view of a cuboid viewing zone, where there is a capture device located: at the centre of the viewing zone; at the centre of each face of the viewing zone; and at each corner of the viewing zone.
With this arrangement, many locations in the scene (e.g. specific surfaces) will be captured by a plurality of capture devices so that there will be overlapping points relating to different capture devices. This is shown in Figure 7, which shows a first point P1 being captured by each of a first capture device C1 , a sixth capture device C6, and a seventh capture device C7. Each capture device captures this point at a different angle and distance and may be considered to capture a different ‘version’ of the point.
Typically, only a single version of the point is stored, where this version may be the highest quality version of the point and/or may be the version of the point associated with the nearest and/or least angled capture device.
In this regard, the highest ‘quality’ version of the point is captured by the capture device with the smallest distance and smallest angle to the point (e.g. the smallest solid angle). In this regard, as described with reference to Figures 4a and 4b, capturing a point for a given azimuth angle and elevation angle typically comprises capturing a plurality of sub-points at varying sub-point azimuth and elevation angles spread around the point azimuth and elevation angles. Due to the different spreads of sub-points, each capture device will capture a different version of the point (that has a different attribute) even when the points are at the same location. Capture devices that are close to the point and less angled with respect to the point typically have a smaller spread of sub-points and so typically obtain a version of a point that is sharper than a version of that point captured by more distant capture devices.
In some embodiments, a quality value of a version of the point is determined based on the spread of subpoints associated with this version (e.g. based on the perimeter formed by these sub-points and/or based on a surface area or volume bounded by these sub-points). The version of the point that is stored may depend on the respective quality values of possible versions of the points.
Regarding the ‘versions’ of the points, it will be appreciated that two ‘points’ in approximately the same location captured by each capture device may not have exactly the same location in the three-dimensional representation. More specifically, since each capture device typically projects a ‘ray’ at a given angle, the rays of differing capture devices may contact the surface at different locations for each capture device. Two points may be considered to be two ‘versions’ of a single point when they are within a certain proximity, e.g. a threshold proximity. For example, where the first capture device C1 captures a first point and a second point at subsequent azimuth angles, and the sixth capture device C6 captures a further point that is in between the locations of the first point and the second point, this further point may be considered to be a ‘version’ of one of the first point and the second point.
This difference in the points captured by different capture devices is illustrated by Figures 8a and 8b, which show the separate captured grids that are formed by two different capture devices. As shown by these figures, each capture device will capture a slightly different ‘version’ of a point at a given location and these captured points will have different sizes. Each capture step is associated with a particular range of angles (e.g. a nominal capture angle of 1 ° might encompass angles from 0.9° to 1.1 °), and therefore capture devices that are far from a point to be captured represent a wider region at the capture distance than capture devices closer to that point to be captured. As shown in Figure 8a, the capture device C1 would capture the points P1 and P2 in separate brackets, whereas for the capture device C2 these points are in the same bracket. Therefore, the capture device C2 might determine a single point that encompasses both points P1 and P2, whereas the capture device C1 would determine separate points for these two points.
Considering then a situation in which points P1 and P2 are captured separately, and capture device C1 is used to capture point P1 while capture device C2 is used to capture point P2, it should be apparent that the ‘sizes’ of these captured points, and the locations in space that are encompassed by the captured points, will be based on different grids. For example, the width of the captured point P2 captured by the capture device C2 will be larger than the width of the captured point P1 captured by the capture device C1. The capture process may be determined based on the existence of these different grids, and on the different bracket widths that occur at different distances from a capture device.
Figure 8a shows an exaggerated difference between grids for the sake of illustration. Figure 8b shows a more realistic embodiment in which the three-dimensional representation comprises a plurality of points associated with different capture devices, where these points lie on different grids associated with these different capture devices.
In order to store the points of the three-dimensional representation, the points may be stored as a string of bits, where a first portion of the string indicates a location of the point (e.g. using x, y, z coordinates) and a second portion of the string indicates an attribute of the point. In various embodiments, further portions of the string may be used to indicate, for example, a transparency of the point, a size of the point, and/or a shape of the point.
A computer device that processes the three-dimensional representation after the generation of this representation is then able to determine the location and attribute of each point so as to recreate the scene. This location and attribute may then be used to render a two-dimensional representation of the scene that can be displayed to a viewer wearing the display device 17. Specifically, the locations and attributes of the points of the three-dimensional representation can be used to render a two-dimensional image for each of the left eye of the viewer and the right eye of the viewer so as to provide an immersive extended reality (XR) experience to the viewer.
The present disclosure considers an efficient method of storing the locations of the points (e.g. at an encoder) and of determining the locations of the points (e.g. at a decoder).
As has been described with reference to Figures 6a and 6b, the points of the three-dimensional representation are determined using a set of capture devices placed at locations about the viewing zone, where these capture devices are arranged to capture points at a series of azimuth angles and elevation angles. Typically, each of the capture devices is arranged to use the same capture process (e.g. the same series of azimuth angles and elevation angles), though it will be appreciated that different series of capture angles are possible. For example, there may be a plurality of possible series of capture angles, where different capture devices use different capture angles.
In general, the present disclosure considers a method in which points are stored based on a capture device identifier and an indication of a distance of the point from the capture device associated with this capture device identifier. Typically, the point is also associated with an angular indicator, which indicates an azimuth angle and/or an elevation angle of the point relative to the identified capture device.
It will be appreciated that the storage of the distance and the angle may take many forms. For example, the distance and the angle of each point may be converted into a universal coordinate system, where each capture device has a different location in this universal coordinate system. In particular, each point may be stored with reference to a centre of this universal coordinate system, which centre may be co-located with a central capture device. Where a point is determined based on a distance and an angle from a capture device of a known location in this universal coordinate system, the coordinates of the point in this universal coordinate system can be determined trivially - and the location of the point may then be stored either relative to the capture device or as a coordinate in the universal coordinate system.
The capture device identifier may comprise a location of a capture device (e.g. a location in a co-ordinate system of the three-dimensional representation). Equally, the capture device identifier may comprise an index of a capture device. Similarly, the indication of the azimuth angle and the elevation angle for a point may comprise an angle with reference to a zero-angle of a co-ordinate system of the three-dimensional representation. Equally, the azimuth angle and/or the elevation angle may be indicated using an angle index.
In some embodiments, the three-dimensional representation is associated with configuration information, which configuration information comprises one or more of: a set of capture device indexes; locations associated with the capture devices and/or the capture device indexes; a spacing of capture devices (e.g. so that locations of the capture devices can be determined from a location of a first capture device and the spacing); angles associated with a capture process for the capture devices; an azimuth angle increment and/or an elevation angle increment associated with the capture process; and a set of angle indexes (e.g. to match an angle index to an angle).
With this configuration information, it is possible to determine a location of each capture device from an index of that capture device and/or to determine a capture angle from a known capture process. Therefore, given two numbers: a capture device index and an angle index (that is associated with a combination of a specific azimuth angle and a specific elevation angle), a location of a capture device and a direction of a point from this capture device can be determined. By also signalling a distance of the point from the signalled capture device, a precise location of the point in the three-dimensional space can be signalled efficiently.
Typically, the point is associated with each of: a camera index, a distance, a first angular index (e.g. an azimuth index), and a second angular index (e.g. an elevation index).
This method of indicating a location of a point enables point locations to be identified using a much smaller number of bits than if each point location is identified using x, y, z coordinates.
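A rough, illustrative comparison of the two signalling schemes follows; the field widths used (a 27-device setup, a 1000x1000 angular bracketing, a 16-bit quantised distance, and 32-bit floating-point x, y, z coordinates) are assumptions chosen purely to make the arithmetic concrete, not values mandated by this disclosure.

```python
import math

# Assumed configuration: 27 capture devices, a cube bracketing with
# 1000 x 1000 angular brackets per device, and a 16-bit quantised distance.
capture_device_bits = math.ceil(math.log2(27))        # 5 bits
angle_index_bits = math.ceil(math.log2(1000 * 1000))  # 20 bits
distance_bits = 16

indexed_bits = capture_device_bits + angle_index_bits + distance_bits
xyz_bits = 3 * 32  # three single-precision floats

print(indexed_bits, "bits per location vs", xyz_bits, "bits per location")
# 41 bits per location vs 96 bits per location
```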
Referring to Figure 9, there is shown a method of determining a location of a point. This method is carried out by a computer device, e.g. the image generator 11 and/or the decoder 15.
In a first step 41 , the computer device identifies an indicator of a capture device used to capture the point. Typically, this comprises identifying a portion of a string of bits associated with a capture device index.
In a second step 42, the computer device identifies an indicator of an angle of the point from the capture device. Typically, this comprises identifying an angle index, e.g. an azimuth index and/or an elevation index and/or a combined azimuth/elevation index, which index(es) identifies a step of the capture process during which the point was captured.
In a third step 43, based on the identifiers, the computer device determines the location of the capture device and the angle of the point from the capture device.
The capture device identifier is typically a capture device index, which is related to a capture device location based on configuration information that has been sent before, or along with, the point data. For example, the configuration information may specify:
Location of first capture device is (0,0,0).
Step between capture devices is (0,0,1) along the grid, then across the grid, then up the grid.
The grid is (10,10,10).
With this information, a capture device with an index of 1 can be determined to be located at (0,0,0); a capture device with an index of 5 can be determined to be located at (0,0,4); a capture device with an index of 11 can be determined to be located at (0,1,0), and so on.
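A minimal sketch of how a decoder might recover a capture device location from its index using the example configuration above is given below; the 1-based indexing and the fill order (along the grid first, then across, then up) follow the worked example, while the function name and uniform step size are illustrative assumptions.

```python
def capture_device_location(index, origin=(0, 0, 0), step=1, grid=(10, 10, 10)):
    """Convert a 1-based capture device index into an (x, y, z) location,
    assuming devices are laid out along the grid first (z varies fastest),
    then across the grid (y), then up the grid (x)."""
    i = index - 1
    z = i % grid[2]
    y = (i // grid[2]) % grid[1]
    x = i // (grid[2] * grid[1])
    return (origin[0] + x * step, origin[1] + y * step, origin[2] + z * step)


print(capture_device_location(1))   # (0, 0, 0)
print(capture_device_location(5))   # (0, 0, 4)
print(capture_device_location(11))  # (0, 1, 0)
```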
Equally, the configuration information may specify a list of camera indexes and locations associated with these indexes, where this enables the use of a wide range of setups of capture devices.
Typically, the three-dimensional representation is associated with a frame of video. The configuration information may be constant over the frames of the video so that the configuration information needs to be signalled only once for an entire video. Therefore, the configuration information may be transmitted alongside a three-dimensional representation of a first frame of the video, with this same information being used for any subsequent frames (e.g. until updated configuration information is sent).
The angle identifier may similarly be related to an angle by a location and an increment that are signalled in a configuration file. For example, the configuration information may specify:
An azimuth increment and an elevation increment are each 1°.
There are 359 increments (i.e. 360 angles, from 0° to 359°) for each angle type.
With this information: a capture angle with an index of 0 can be determined to be at an azimuth angle of 0° and an elevation angle of 0°; a capture angle with an index of 10 can be determined to be at an azimuth angle of 10° and an elevation angle of 0°; a capture angle with an index of 360 can be determined to be at an azimuth angle of 0° and an elevation angle of 1°; and a capture angle with an index of 370 can be determined to be at an azimuth angle of 10° and an elevation angle of 1°; etc.
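A sketch of the corresponding index-to-angle conversion is given below; the 0-based indexing, the azimuth-varies-fastest ordering, and the function name are assumptions that match the example above.

```python
def angles_from_index(angle_index, increment_deg=1.0, azimuths_per_row=360):
    """Convert a combined, 0-based angle index into (azimuth, elevation)
    angles in degrees, assuming the azimuth angle varies fastest."""
    azimuth = (angle_index % azimuths_per_row) * increment_deg
    elevation = (angle_index // azimuths_per_row) * increment_deg
    return azimuth, elevation


print(angles_from_index(0))    # (0.0, 0.0)
print(angles_from_index(10))   # (10.0, 0.0)
print(angles_from_index(360))  # (0.0, 1.0)
print(angles_from_index(370))  # (10.0, 1.0)
```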
In a fourth step 44, based on the determined location of the capture device and the determined angle, a location of the point is determined. Typically, this comprises determining the location of the point based on the location of the capture device, the capture angle, and a distance of the point from the capture device (where this distance is specified in the point data for the point).
Determining the location of the point typically comprises determining the location of the point relative to a centrepoint of the three-dimensional representation. This location of the point may then be converted into a desired coordinate system and/or the point may be processed based on its location (e.g. to stitch together adjacent points).
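One way the fourth step 44 could be sketched is shown below, assuming the angle pair is interpreted as a standard spherical direction (azimuth about the vertical axis, elevation above the horizontal plane); in practice the mapping from angular identifiers to directions would be defined by the configuration information.

```python
import math

def point_location(capture_location, azimuth_deg, elevation_deg, distance):
    """Recover the (x, y, z) location of a point from the location of the
    capture device that captured it, the capture angles, and the signalled
    distance along the capture ray."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    dx = distance * math.cos(el) * math.cos(az)
    dy = distance * math.cos(el) * math.sin(az)
    dz = distance * math.sin(el)
    return (capture_location[0] + dx,
            capture_location[1] + dy,
            capture_location[2] + dz)


# A point 2 m from a capture device at (0, 1, 0), captured at an azimuth of
# 90 degrees and an elevation of 0 degrees, lies at approximately (0, 3, 0).
print(point_location((0, 1, 0), 90, 0, 2.0))
```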
The angular identifier typically comprises a first angular identifier and a second angular identifier, where the first identifier provides the azimuthal angle of the point and the second identifier provides the elevation angle of the point.
Referring to Figure 10, each angular identifier may be provided as an index of a segment of the three-dimensional representation, where, for example, an index of 0 may identify the point as being in a first angular bracket 101 and an index of 1 may identify the point as being in a second angular bracket 102.
In this regard, the capture devices are arranged to perform a capture process, e.g. as described with reference to Figure 3, with a non-infinite angular resolution. Given this non-infinite resolution, each point is not a zero-dimensional point located at a precise angle. Instead, each point is a point for a particular area of space, with the size of this area being dependent on the angular resolution as well as the distance of the point from the capture device. In other words, each capture angle determines a point for an angular range (with the range being dependent on the angular resolution). That is, if the capture process leads to points being captured at angles of 10°, 11°, and 12° then this can equally be considered to relate to points being captured at a first range of 9.5°-10.5°, a second range of 10.5°-11.5°, and a third range of 11.5°-12.5°.
This is shown in Figure 10, which shows a series of angular brackets, with the size of these angular brackets at a given distance being dependent on the angular resolution. The angular identifier(s) typically comprise a reference to such an angular bracket. Consider, for example, a cube placed with the capture device C1 at the centre of this cube. By dividing this cube into x segments at regular azimuth angles and y segments at regular elevation angles, it is possible to identify any angular range of the representation by reference to an x segment and a y segment (and then the space bracketed by this angular range will depend on both the angular resolution (e.g. the angle between adjacent brackets) and the distance of the point from the capture device).
Typically, each capture device has the same capture pattern so that the angular bracketing of each device is the same (albeit centred differently at the location of the relevant capture device). For example, in an embodiment with 1000 equal angular brackets, the angle for each bracket may be 360/1000.
In some embodiments, different capture devices are associated with different capture patterns, where this may be signalled in configuration information relating to the three-dimensional representation.
In some embodiments, each capture device is arranged to capture a point for a plurality of angular brackets, where each bracket is associated with a different angle. The angular spread of each bracket (that is, the angle between a first, e.g. left, angular boundary of the bracket and a second, e.g. right, angular boundary of the bracket) may be the same; equally, this angular spread may vary. In particular, the angular spread may vary so as to be smaller for points which are directly in front of (or behind, or to a side of) the capture device. For example, the embodiment shown in Figure 7 shows an angular bracketing system that is based on a cube. With this system, a cube is placed such that a capture device is located at the centre of the cube and the cube is then split into 1000 sections of equal size (it will be appreciated that the use of 1000 sections is exemplary and any number of sections may be used). Each of these sections is then associated with an angular index. With this arrangement, the angular spread of each section (or bracket) varies, as has been described above.
Figure 10 shows a two-dimensional square, where each angular bracket of the square is referenced by an index number (between 1 and 100). In a three-dimensional implementation, an angular bracket of a cube could be indicated with two separate numbers (with a first azimuthal indicator that identifies a ‘column’ of the cube and a second elevational indicator that identifies a ‘row’ of the cube). Equally, a singular indicator may be provided that indicates a specific bracket of the cube. Therefore, for a cube that is divided into 1000 elevational sections and 1000 azimuthal sections, the bracket may be indicated with two separate indicators that are each between 0 and 999 or with a single indicator that is between 0 and 999999.
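A sketch of the two indication options described above (two separate indicators versus a single combined indicator) is given below; the row-major packing and the function names are assumptions for illustration.

```python
def to_single_indicator(azimuth_index, elevation_index, azimuth_sections=1000):
    """Combine a 'column' (azimuth) index and a 'row' (elevation) index
    into a single bracket indicator."""
    return elevation_index * azimuth_sections + azimuth_index


def to_indicator_pair(single_indicator, azimuth_sections=1000):
    """Split a single bracket indicator back into (azimuth, elevation) indices."""
    return single_indicator % azimuth_sections, single_indicator // azimuth_sections


print(to_single_indicator(999, 999))  # 999999
print(to_indicator_pair(999999))      # (999, 999)
```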
It will be appreciated that the use of a cube to define the brackets is exemplary and that other bracketing systems are possible. For example, a spherical bracketing system may be used (where this leads to curved angular brackets). Equally, a lookup table may be provided that relates angular indexes to angles, where this enables irregularly spaced brackets to be used.
Typically, determining the location of the point comprises determining the location of the point so as to be at the centre of the angular bracket identified by the angular identifier(s).
Texture patches
In order to reduce the file size of the three-dimensional representation (and the bandwidth required to transmit the three-dimensional representation) it is desirable to reduce the number of points within the three-dimensional representation. Therefore, referring to Figure 11 , there is described a method of determining a texture patch that can replace a plurality of points in the representation.
In a first step 71 , the computer device identifies a plurality of points of the representation; in a second step 72, the computer device determines that the points lie on a shared plane; in a third step 73, the computer device determines a texture patch based on the attributes of the points; and in a fourth step 74, the computer determines a new point that references the texture patch (this new point may be referred to as a ‘texture point’).
The texture patch typically comprises a patch with a plurality of attribute values, which attribute values may be the same as the attribute values of the identified points. Therefore, the texture patch enables the recreation of the plurality of points. A benefit of using the texture patch is that a single point, with a single location value and a (single) reference to the texture patch, can replace the plurality of identified points. The attribute values of each point are contained in the texture patch so that little (or no) information is lost from the original representation, but by representing all of these attribute values by reference to the texture patch, only a single location needs to be signalled (saving on the computational cost of signalling locations for a plurality of points). For example, an 8x8 square of identified points that each have separate locations and attribute values may be replaced by a single point with a single location and an attribute value that is a reference to a texture patch (which texture patch comprises the attribute values of the identified points arranged in the relative positions of the identified points); this would reduce the size of the representation by 63 points (where a single point replaces an 8x8 grid of points) at the cost of needing to signal an 8x8 texture patch (that has 64 attribute values and/or transparency values and/or normal values).
This is shown in Figures 12a, 12b, and 12c. Figure 12a shows a plurality of points of a three-dimensional representation that lie on a shared plane. Each of these points has a location and an attribute value. Figure 12b shows how these points may be replaced by a single point (e.g. a ‘texture point’) that contains a reference to the texture patch shown in Figure 12c. This texture patch may comprise the attribute values of the plurality of points without separately storing the locations of the attributes (instead, the attribute values are laid out in a predetermined pattern, which is a 5x5 grid in the example of Figure 12c).
It will be appreciated that various sizes of texture patch are possible and that the 5x5 grid of Figure 12c is only an example. Another (practical) example of a texture patch is shown in Figure 12d, which shows an 8x8 arrangement of values laid out in the form of a texture patch. As shown in Figure 12d, typically the texture patch provides a continuous grid of pixel values (e.g. that can be used to form a continuous image) - in this regard, the points shown in Figures 12a - 12c are shown as separated points. In practice, these ‘points’ are typically abutting points that form a joined arrangement of values.
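Returning to the 8x8 example above, an illustrative tally of the saving is given below; the per-point and per-patch-value byte counts are assumptions chosen purely to make the arithmetic concrete and do not reflect any particular encoding of this disclosure.

```python
# Assumed sizes, for illustration only.
BYTES_PER_POINT = 8        # e.g. capture index + angle index + distance + attribute
BYTES_PER_PATCH_VALUE = 3  # e.g. one RGB attribute value stored in the texture atlas

points_replaced = 8 * 8
size_before = points_replaced * BYTES_PER_POINT                             # 512 bytes
size_after = 1 * BYTES_PER_POINT + points_replaced * BYTES_PER_PATCH_VALUE  # 200 bytes

print(size_before, "bytes as 64 separate points vs",
      size_after, "bytes as one texture point plus an 8x8 texture patch")
```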
In some embodiments, the method comprises determining the texture patch in dependence on a difference of the attributes of the identified points exceeding a threshold (e.g. in dependence on a variance, a range, or a maximum difference of these attributes exceeding a threshold). In this regard, points that are similar in both location and attribute may be aggregated into a single point with a location and attribute that is based on the initial points and a size that covers both of the initial points (e.g. two adjacent points of the same colour and a size of 1 may be aggregated into a single point of this colour with a size of 2). Such an aggregation does not require any determination of a texture patch. In contrast, a texture patch may be determined where there is a plurality of dissimilar points (e.g. points with dissimilar attributes) that lie on a shared plane, where the use of the texture patch enables the attributes of each of these points to be signalled in an efficient manner.
The second step 72 of determining that the points lie on a shared plane may comprise determining that the points lie on a shared surface (e.g. on the same object), where the method may comprise identifying a surface associated with the identified points.
Determining that the points lie on a shared plane may comprise comparing a distance of (each of) the points from this plane and/or surface to a threshold distance and determining that the points lie on the plane if they are within this threshold distance of the plane.
This second step 72 may also, or alternatively, comprise identifying normals for each of the points, which normals may be contained in point data of the points, and determining a similarity of the normals (e.g. determining that each of the normals is within a threshold value of an average normal and/or determining that a variance of the normals is below a threshold value).
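A sketch of the co-planarity test of the second step 72 is given below, using a least-squares plane fit and an optional normal-similarity check; NumPy is assumed to be available, and the threshold values are placeholders rather than values taken from this disclosure. The distance threshold could itself be made dependent on the distance of the points from the viewing zone, as discussed below.

```python
import numpy as np

def lie_on_shared_plane(points, normals=None,
                        distance_threshold=0.01, normal_threshold=0.1):
    """Return True if `points` (an N x 3 array) lie within
    `distance_threshold` of a best-fit plane and, if `normals` (N x 3) are
    supplied, if they all point in roughly the same direction."""
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    # The singular vector with the smallest singular value is the direction
    # of least variance, i.e. the normal of the best-fit plane.
    _, _, vt = np.linalg.svd(pts - centroid)
    plane_normal = vt[-1]
    distances = np.abs((pts - centroid) @ plane_normal)
    if distances.max() > distance_threshold:
        return False
    if normals is not None:
        n = np.asarray(normals, dtype=float)
        n = n / np.linalg.norm(n, axis=1, keepdims=True)
        mean_normal = n.mean(axis=0)
        mean_normal = mean_normal / np.linalg.norm(mean_normal)
        # Every normal must be close (in dot-product terms) to the average normal.
        if np.max(1.0 - n @ mean_normal) > normal_threshold:
            return False
    return True
```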
The texture patch is typically determined based on this determination in the second step 72, where if (e.g. only if) the identified points lie on a shared surface or plane then they may be replaced by a single point that references a texture patch.
In some embodiments, the texture patch may be determined for points that lie on a curved plane, where the second step 72 may comprise determining that the points lie on a curved plane or a curved surface. Such a texture patch may be associated with a bend value to enable the reproduction of the identified points. Typically, the texture patch comprises a quadrilateral, where the texture patch may be able to bend about a line that is formed between opposite corners of this quadrilateral so as to map the texture patch to a curved surface.
The threshold distance (for the points to be considered co-planar) may depend on the distance of the points from the viewing zone; in particular, points that are located far from the viewing zone may have a higher threshold separation than points that are located nearer to the viewing zone. Typically, users are better able to identify separations between a plurality of surfaces when these surfaces are near to the viewing zone, whereas users may not be able to identify separations between surfaces that are distant from the viewing zone. Therefore, the maximum (threshold) acceptable distance between the identified points and a plane passing through the identified points may be dependent on the distance of the identified points from the viewing zone (e.g. the threshold may increase from a first value when the identified points are within 1 km of the viewing zone to a second value when the identified points are more than 1 km from the viewing zone).
Typically, the texture patch is associated with a size, where there may also be provided a plurality of texture patches of different sizes. Identifying the plurality of points may then comprise identifying a plurality of points that could potentially be replaced with a single point that references a texture patch. This may involve the computer device iterating through a plurality of pluralities of identified points and then evaluating each of these pluralities of identified points in order to determine whether the points lie on a shared plane. If these identified points are found to lie on such a shared plane, then a texture patch may be determined based on the attributes of these points and this texture patch may be added to a database of texture patches (e.g. a texture ‘atlas’).
The texture patch may be associated with one or more of: one or more attribute values; one or more transparency values; one or more normals; etc. The texture patch may comprise a plurality of points that correspond to the points used to form the texture patch (e.g. where each point of the texture patch comprises an attribute, a normal, and/or a transparency of a corresponding point of the three-dimensional representation). The method may comprise determining a multi-layered texture patch and/or a plurality of texture patches. For example, the method may comprise determining a texture patch for each eye of a user (where these texture patches may be located at the same index of separate databases of texture patches so that they can be signalled by a single reference in a point).
In some embodiments, texture patches for each of a left eye and a right eye are stored in a shared database. The index for a texture patch for a first eye may then be set as being one greater than the index for a texture patch for a second eye, where this simplifies the signalling of the texture patches. In some situations, e.g. for diffuse non-reflective materials, each eye may be associated with the same texture patch. In these situations, only a single texture patch may be included in the database (with the index that would otherwise contain a second texture patch instead pointing to the single texture patch). Equally, the same texture patch may be stored twice. By storing the texture patches for each eye adjacent to each other, there is an increased ability to benefit from similarities between these texture patches when encoding the texture patch database.
Determining the new point may comprise replacing (one or more of, or all of) the identified points with the new point (the texture point). While the identified points each comprise an attribute value (e.g. to indicate the colour of the point), the new point may instead comprise a reference to the texture patch. This reference may be located in an attribute datafield associated with the point so that the new point has the same form as the identified points (and the same form as the other points of the three-dimensional representation).
In some embodiments, determining the new point comprises modifying one of the identified points. In particular, the attribute value of one of the identified points may be replaced with a reference to the texture patch. Furthermore, the size of this point may be modified, e.g. increased, to signal that the point is now associated with a texture patch.
Typically, the first step 71 of identifying the plurality of points comprises identifying a plurality of points associated with the same capture device (and/or similar, e.g. adjacent, capture devices). In this regard, as has been described above, each point is typically associated with a capture device, a distance (from that capture device) and one or more angles (from the capture device), where this method of defining points enables the location of each point to be determined relative to the associated capture device (and then absolute locations of each point can be determined using the locations of the various capture devices).
Typically, each point of the representation comprises a size, which size may relate to a number of angular boundaries that is encompassed by the point. Therefore, a point that has a size of 1x1 may relate to a point that covers a single angular bracket, a point that has a size of 2x1 may cover two angular brackets in a row etc. Typically, the size is stored via an index so that, for example, a size value of 0 (which may be signalled by a binary value of 000) may signal a point that covers a 1x1 arrangement of angular brackets, a size value of 1 (which may be signalled by a binary value of 001) may signal a point that covers a 2x1 arrangement of angular brackets, a size value of 2 (which may be signalled by a binary value of 010) may signal a point that covers a 1x2 arrangement of angular brackets, etc.
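As a minimal illustration of the size index described above (only the three mappings given in the text are shown; any further size codes would be defined by the configuration of the representation):

```python
# Size index -> (columns, rows) of angular brackets covered by the point.
SIZE_TABLE = {
    0b000: (1, 1),
    0b001: (2, 1),
    0b010: (1, 2),
}

def brackets_covered(size_index):
    """Look up how many angular brackets a point of the given size covers."""
    cols, rows = SIZE_TABLE[size_index]
    return cols * rows


print(brackets_covered(0b001))  # 2
```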
Typically, the texture patch is similarly associated with a size, where the method of determining texture patches may be performed such that every texture patch is of the same size (where this can simplify the storage of the texture patches). In some embodiments, a computer device determining the texture patches may be arranged to determine texture patches of different sizes.
The size of the texture patch may refer to a number of angular brackets covered by a texture patch. For example, each texture patch may cover an 8x8 arrangement of angular brackets. The first step 71 of identifying the points may comprise identifying a plurality of contiguous points, where these points may be in a predetermined arrangement. For example, the first step may comprise identifying an 8x8 arrangement of points in adjacent (and contiguous) angular brackets.
In some embodiments, the three-dimensional representation comprises a plurality of separately signalled points and/or comprises a plurality of different sections, with each section comprising a different type of points. In particular, the representation may be associated with a file that has at least two sections, where a first section comprises points for which an attribute datafield contains an attribute value and a second section comprises points for which an attribute datafield comprises a reference to an external value (e.g. to a texture patch).
Equally, those points for which the attribute value references a texture patch may comprise an identifier that enables these points to be distinguished from points for which the attribute datafield comprises an attribute value (e.g. each attribute datafield that references a texture patch may begin with a recognisable string or pattern of bits). In some embodiments, points that reference a texture patch are identified based on a size of that point, where typically these points have a greater size than other points (equally a size of 0 may be used to signal a texture patch where this size would not otherwise be used).
In some embodiments, the three-dimensional representation comprises at least two of the following sections:
A first section comprising opaque points with attribute values (e.g. points for which a transparency (α) value is 255).
A second section comprising opaque points that reference texture patches (e.g. points that have an attribute datafield that comprises a reference to a texture patch and also a transparency value of 255).
A third section comprising transparent points with attribute values and transparent points that reference texture patches. Typically, the transparency values for the texture patch are a part of the texture patch (e.g. these transparency values are stored in the texture atlas). The points with attribute values and the points referencing texture patches may then be distinguished by setting the transparency values of the points referencing texture patches to 255. Since the points are in the third section, a computer device parsing the three-dimensional representation is able to determine that a point with a transparency value of 255 is not an opaque point (since otherwise it would be located in the first or second section) and therefore the computer device can identify that such points include references to texture patches.
It will be appreciated that the use of a value of ‘255’ to denote an opaque point is purely exemplary. More generally, points referencing transparent texture patches may be signalled by including these points in a section associated with transparent points and transparent texture patches and setting a transparency value of the points referencing transparent texture patches to a value that indicates an opaque point.
Such a method of signalling points that reference texture patches enables an increase in the efficiency of the storage of the three-dimensional representation.
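A sketch of how a parser might apply this signalling is given below; the section labels, the 0-255 transparency range, and the use of 255 as the flag value follow the example above, while the function and variable names are illustrative assumptions.

```python
OPAQUE_FLAG = 255  # exemplary value used to flag a texture patch reference

def references_texture_patch(section, transparency):
    """Decide whether a point's attribute datafield holds a texture patch
    reference, based on which file section the point sits in.

    In the transparent section a transparency of 255 cannot mean 'opaque'
    (opaque points live in the first two sections), so the value is reused
    as a flag marking a reference to a transparent texture patch.
    """
    if section == "opaque_texture":
        return True
    if section == "transparent" and transparency == OPAQUE_FLAG:
        return True
    return False


print(references_texture_patch("transparent", 255))  # True
print(references_texture_patch("transparent", 128))  # False
```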
In some embodiments, one or more of the points of the three-dimensional representation is associated with a motion vector, where this vector indicates a motion of that point that occurs between frames of a video associated with the three-dimensional representation. A texture patch may similarly be associated with a single vector (e.g. where the texture patch is a rigid structure and so moves as a single structure). Equally, a texture patch may be associated with a plurality of motion vectors, e.g. each corner of the texture patch may be associated with a different motion vector, where this enables the texture patch to flex and to change in size during the course of a video (and enables this flexing to be signalled via the motion vectors).
Referring to Figure 13, there is described a method for determining one or more texture patches for a three-dimensional representation. This method is typically performed by a computer device as a post-processing step, after the three-dimensional representation has been generated (e.g. after each point of the three-dimensional representation has been captured). This method of Figure 13 may then be used to replace a number of points within this three-dimensional representation with one or more new points that reference texture patches so as to reduce the number of points in the three-dimensional representation.
In a first step 81 , the computer device identifies a plurality of points associated with a given location in the representation. Typically, this step comprises identifying a plurality of points in adjacent angular brackets. For example, the computer device may identify an 8x8 arrangement of points (it will be appreciated that various other arrangements may be identified).
In a second step 82, the computer device determines whether these identified points lie on a shared plane. This determination may comprise identifying a plane that passes through these identified points and determining that each of the identified points is within a threshold distance of this plane. The plane may comprise a curved plane. Additionally, or alternatively, the second step may comprise comparing the normals of the identified points to determine that the identified points have similar directional values (e.g. face in the same direction).
In a third step 83, if the points lie on a shared plane, the computer device determines a texture patch based on the identified points; e.g. where the texture patch comprises a two-dimensional plane that has attribute values corresponding to the attribute values of the identified points. The computer device may then replace the identified points with a new point that has a location related to the identified points and that references the texture patch.
In a fourth step 84, the computer device considers a next location in the representation, and the method then returns to the first step 81 so that a next plurality of points can be evaluated.
This method may be performed so as to move through a plurality of pluralities of points of the representation and to determine, for each of these pluralities of points, whether the points lie on a shared plane (and so can be replaced with a point that references a texture patch). For example, the fourth step 84 of considering a next location may comprise incrementing an angular identifier so as to move incrementally through the angular brackets for a capture device and, at each stage, to compare the points in a number of adjacent angular brackets. In this way, the computer device essentially initialises a window of points and then moves the window so as to evaluate a moving window of points that passes about the entirety of the representation.
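A sketch of this moving-window sweep for a single capture device is given below; the grid layout (indexed by elevation then azimuth), the window and step sizes, and the callback-based structure are assumptions for illustration. With a step equal to the window size, the same sweep becomes the non-overlapping, grid-square variant described with reference to Figure 14 below.

```python
def sweep_windows(bracket_grid, window=8, step=1, evaluate=lambda block: None):
    """Slide a `window` x `window` window across a per-capture-device grid
    of points (indexed [elevation][azimuth]) in increments of `step`
    brackets, handing each block of points to `evaluate`, which returns a
    texture patch (or None if the block is not co-planar).

    Returns the list of (elevation, azimuth, patch) triples that were found.
    """
    found = []
    rows, cols = len(bracket_grid), len(bracket_grid[0])
    for el in range(0, rows - window + 1, step):
        for az in range(0, cols - window + 1, step):
            block = [bracket_grid[el + i][az + j]
                     for i in range(window) for j in range(window)]
            patch = evaluate(block)
            if patch is not None:
                found.append((el, az, patch))
    return found
```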
Typically, the method of Figure 13 is performed separately for each capture device so that, for one or more capture devices, one or more pluralities of points are evaluated to determine whether these points lie on a shared plane and to determine a texture patch if the points do lie on a shared plane. In some embodiments, points captured by separate capture devices may be considered together (e.g. points captured by adjacent capture devices may be considered together for the purposes of determining texture patches).
The iterative process described with reference to Figure 13 is capable of obtaining a three-dimensional representation that efficiently represents a space. Typically, the three-dimensional representation is associated with a video, e.g. a VR video, and so the three-dimensional representation (and the texture atlas associated with this three-dimensional representation) may relate to a single frame of the video. The video may be composed of a plurality of frames, with each frame relating to a different three-dimensional representation.
By iterating separately over each three-dimensional representation using the method of Figure 11 it is possible to obtain efficient representations for each frame of the video. However, this method of determining texture patches can result in similar features being sorted into different representations in each frame, which may prevent the use of temporal encoding methods.
In this regard, the texture patch may represent a point that is moving throughout a scene. And so the same (or a similar) texture patch may be identified in a plurality of three-dimensional representations, e.g. where a movement of this texture patch is signalled by one or more motion vectors of this texture patch.
If each representation is considered separately, then a texture patch that could be present in each of a first and second three-dimensional representation may not be identified in the second three-dimensional representation. For example, in the second three-dimensional representation, a number of the constituent points of this texture patch may instead be included in a different texture patch. As a result, the processing of the entire representation might change based on a change to only a few points (since any formation of a new texture patch will have a knock-on effect throughout the three-dimensional representation).
Therefore, referring to Figure 14, there is envisaged a grid-based method of processing the representation that can be used to ensure isolated changes in a first area of the scene do not cause wholesale changes in the processing of a three-dimensional representation. While this grid-based system is useable for the aggregation process, it will be appreciated that more generally the grid system may be used as the basis for any processing procedure.
In a first step 91 , a grid system of a representation is determined.
For example, where 8x8 arrangements of points may be evaluated and potentially replaced with a point referencing a texture patch, a grid may be determined that has grid squares of size 8x8. Equally, each grid square may, for example, have a size of 16x16 (where the size indicates a number of angular brackets covered by the grid square).
Typically, the method comprises determining a grid system such that each of a plurality of three-dimensional representations (e.g. relating to frames of a video) has a similar grid system. This may comprise determining the grid system such that a first grid square in a first three-dimensional representation is located at the same space in the scene as a first grid square in a second three-dimensional representation. A grid system is typically determined for each of the capture devices, where the grid system may be based on the angular brackets of that capture device. Therefore, for each of the three-dimensional representations (e.g. for each of a plurality of three-dimensional representations associated with a certain viewing zone), a grid system may be determined for each of the capture devices such that the grid squares of each grid system cover the same angular brackets for each three-dimensional representation.
These grid squares can then be processed separately. If an object is moving in the first grid square, then the process of determining texture patches that occurs for this first grid square will likely differ in successive three-dimensional representations (relating to successive frames); but with this grid-based system, if there are no moving objects in the second grid square then the determination of texture patches in the second grid square will likely be the same in these successive frames (and will not be affected by the object moving in the first grid square).
For each grid square the process of determining texture patches mirrors that described with reference to Figure 13. That is, for each grid square, the method involves identifying a plurality of points in a second step 92, determining whether these points lie on a shared plane in a third step 93, if the points do lie on a shared plane then determining a texture patch in a fourth step 94, and then considering a next location in the grid square in a fifth step 95.
Once the computer device determines, in a sixth step 96, that all possible pluralities of points in the grid square have been identified, then, in an eighth step 98, a next grid square is considered and this process is repeated.
As described above, the grid squares typically cover the same angular brackets in a plurality of three-dimensional representations that are associated with the same video and/or the same viewing zone. Equally, in some embodiments, the grid squares may be associated with a motion vector so that the grid system may differ (in a calculable way) between the three-dimensional representations. Such embodiments can enable the efficient encoding of three-dimensional representations that are associated with consistent movements.
While the above method has referred to ‘grid squares’, it will be appreciated that the grid system may be associated with subdivisions of any shape or size (these subdivisions being consistent through a plurality of three-dimensional representations).
The determination of the texture patches and of an aggregate point may be a part of the same process. For example, the computer device may identify a plurality of points with a similar location (e.g. being captured by the same capture device and being located in adjacent angular brackets) and the computer device may: in dependence on the points having similar attribute values, determine an aggregate point based on these points; in dependence on the points having different attribute values, determine a texture point (and a texture patch) based on the points.
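A sketch of this combined decision is given below; the per-channel range used as the similarity measure, the threshold value, and the returned labels are assumptions, and the actual construction of aggregate points and texture patches would follow the processes described above.

```python
def choose_replacement(attribute_values, similarity_threshold=8):
    """Decide how a plurality of co-located, co-planar points should be
    replaced.

    attribute_values: list of (r, g, b) tuples, one per identified point.
    Returns 'aggregate_point' when the attributes are near-identical and
    'texture_point' when they differ enough that a texture patch is needed
    to preserve them.
    """
    spreads = [max(channel) - min(channel) for channel in zip(*attribute_values)]
    if max(spreads) <= similarity_threshold:
        return "aggregate_point"
    return "texture_point"


print(choose_replacement([(200, 10, 10), (202, 12, 9)]))    # aggregate_point
print(choose_replacement([(200, 10, 10), (20, 180, 250)]))  # texture_point
```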
Alternatives and modifications
It will be understood that the present invention has been described above purely by way of example, and modifications of detail can be made within the scope of the invention.
The representation is typically arranged to provide an extended reality (XR) experience (e.g. a representation that is useable to render an XR video). The term extended reality (XR) covers each of virtual reality (VR), augmented reality (AR), and mixed reality (MR) and it will be appreciated that the disclosures herein are applicable to any of these technologies.
The representation may be encoded into, and/or transmitted using, a bitstream, which bitstream typically comprises point data for one or more points of the three-dimensional representation. The point data may be compressed or encoded to form the bitstream. The bitstream may then be transmitted between devices before being decoded at a receiving device so that this receiving device can determine the point data and reform the three-dimensional representation (or form one or more two-dimensional images based on this three-dimensional representation). In particular, the encoder 13 may be arranged to encode (e.g. one or more points of) the three-dimensional representation in order to form the bitstream and the decoder 14 may be arranged to decode the bitstream to generate the one or more two-dimensional images.
Reference numerals appearing in the claims are by way of illustration only and shall have no limiting effect on the scope of the claims.
Claims
1 . A method of determining a point of a three-dimensional representation of a scene, the method comprising: identifying a plurality of points of the representation; determining that the plurality of points lie on a shared plane; in dependence on the plurality of points lying on a shared plane, determining a texture patch based on attributes of the plurality of points; and determining a texture point, the texture point comprising a reference to the texture patch.
2. The method of any preceding claim, wherein the representation comprises a plurality of points captured using a plurality of different capture devices, and wherein the method comprises identifying a first plurality of points captured using a first capture device.
3. The method of any preceding claim, comprising identifying a first plurality of adjacent points of the representation.
4. The method of any preceding claim, wherein determining the texture patch comprises: identifying attribute values for each of the identified points; and forming the texture patch based on the identified attribute values; preferably, wherein the texture patch comprises the attribute values arranged in the arrangement of the identified points.
5. The method of any preceding claim, comprising storing the texture patch in a database, the database comprising a plurality of texture patches, preferably wherein each texture patch is associated with an index.
6. The method of any preceding claim, wherein determining the texture patch comprises one or more of: forming the texture patch based on attribute values of the identified points; forming the texture patch based on transparency values of the identified points; and forming the texture patch based on normal values of the identified points.
7. The method of any preceding claim, comprising one or more of: modifying one of the identified points; removing at least one, and preferably all, of the identified points from the three-dimensional representation; and replacing the identified points with the texture point.
8. The method of any preceding claim, comprising: determining attribute values associated with the identified points; and determining the texture patch in dependence on a difference of said attributes exceeding a threshold, preferably wherein the difference is associated with a variance of the attributes.
9. The method of any preceding claim, wherein determining that the plurality of points lie on a shared plane comprises: determining a plane that passes through the plurality of points; determining distances of one or more of the points from the plane; and determining that the plurality of points lie on a shared plane based on the determined distances, preferably based on a maximum distance, an average distance, and/or a variance of the distances.
10. The method of any preceding claim, comprising: identifying a normal for each of the points; and determining that the plurality of points lie on a shared plane in dependence on one or more of the identified normals, preferably based on a variance of the normals.
11 . The method of any preceding claim, comprising determining that the plurality of points lie on a shared plane.
12. The method of any preceding claim, wherein a threshold associated with the determination that the plurality of points lie on a plane is dependent on a distance of the points from a viewing zone associated with the three-dimensional representation.
13. The method of any preceding claim, wherein identifying the plurality of points comprises identifying a plurality of adjacent points, preferably a plurality of points in adjacent angular brackets and/or a plurality of points in an 8x8 arrangement.
14. The method of any preceding claim, comprising determining a size of the texture point, the size being based on a number and/or an arrangement of the identified points.
15. The method of any preceding claim, wherein: the three-dimensional representation is associated with a plurality of capture devices; and/or a plurality of the points of the three-dimensional representation are associated with different capture devices.
16. The method of any preceding claim, wherein identifying the plurality of points comprises determining a plurality of pluralities of points of the three-dimensional representation and, for each identified plurality of points: determining whether the plurality of points lie on a shared plane; in dependence on the plurality of points lying on a shared plane, determining a texture patch based on attributes of the plurality of points; and determining a texture point, the texture point comprising a reference to the texture patch.
17. The method of any preceding claim, comprising defining a transparency value of the texture point so as to signal the texture point as being opaque.
18. The method of any preceding claim, comprising determining a motion vector for each corner of the texture point.
19. A method of determining an attribute of a point of a three-dimensional representation of a scene, the method comprising: identifying, in the point, a reference to a texture patch, the texture patch being associated with a plurality of attributes; and determining the attribute of the point based on the attributes of the texture patch.
20. The method of claim 19, wherein determining the attribute comprises determining a plurality of attributes associated with the point, preferably comprising determining attribute values for a plurality of locations of the representation based on the attributes of the texture patch.
21. The method of claim 19 or 20, comprising determining attributes for a plurality of adjacent angular brackets associated with a capture device of the representation based on the attributes of the texture patch.
22. The method of any preceding claim, comprising determining the arrangement of the attributes in dependence on the texture patch.
23. The method of any preceding claim, comprising identifying the point based on one or more of: a size of the point; the location of the point in a file associated with the three-dimensional representation; and a transparency value of the point, preferably comprising identifying the point based on: the point being in a section of a file associated with transparent points; and the point having a transparency value that signals the point as being opaque.
24. The method of any preceding claim, wherein the three-dimensional representation is associated with a viewing zone, the viewing zone comprising a subset of the scene and/or the viewing zone enabling a user to move through a subset of the scene, preferably wherein: the user is able to move within the viewing zone with six degrees of freedom (6DoF); and/or the viewing zone has a volume of less than 50% of the volume of the scene, less than 20% of the volume of the scene, and/or less than 10% of the volume of the scene; and/or the viewing zone has, or is associated with, a volume, preferably a real-world volume, of less than five cubic metres (5m3), less than one cubic metre (1 m3), less than one-tenth of a cubic metre (0.1 m3) and/or less than one-hundredth of a cubic metre (0.01 m3).
25. The method of any preceding claim, wherein the three-dimensional representation comprises a point cloud.
26. The method of any preceding claim, comprising storing the three-dimensional representation and/or outputting the three-dimensional representation, preferably outputting the three-dimensional representation to a further computer device.
27. The method of any preceding claim, comprising one or more of: generating an image and/or a video based on the three-dimensional representation; forming one or more two-dimensional representations of the scene based on the three-dimensional representation, preferably forming a two-dimensional representation for each eye of a viewer.
28. The method of any preceding claim, wherein: each point is associated with one or more of: a location; an attribute; a transparency; a colour; and a size; and/or each point is associated with an attribute for a right eye and an attribute for a left eye.
29. The method of any preceding claim, wherein the scene comprises one or more of: an extended reality (XR) scene; a virtual reality (VR) scene; an augmented reality (AR) scene; and a mixed reality (MR) scene.
30. The method of any preceding claim, comprising forming a bitstream that includes the texture point.
31. A computer program product comprising software code that, when executed on a computer device, causes the computer device to perform the method of any preceding claim.
32. A machine-readable storage medium that includes instructions that, when executed by one or more processors of a machine, cause the machine to perform the method of any of claims 1 to 30.
33. A system for carrying out the method of any of claims 1 to 30, the system comprising one or more of: a processor; a communication interface; and a display.
34. An apparatus for determining a point of a three-dimensional representation of a scene, the apparatus comprising: means for identifying a plurality of points of the representation; means for determining that the plurality of points lie on a shared plane; means for determining, in dependence on the plurality of points lying on a shared plane, a texture patch based on attributes of the plurality of points; and means for determining a texture point, the texture point comprising a reference to the texture patch.
35. An apparatus for determining an attribute of a point of a three-dimensional representation of a scene, the apparatus comprising: means for identifying, in the point, a reference to a texture patch, the texture patch being associated with a plurality of attributes; and means for determining the attribute of the point based on the attributes of the texture patch.
36. A bitstream comprising one or more texture points and/or texture patches determined using the method of any preceding claim.
37. A bitstream comprising a texture point, the texture point comprising a reference to a texture patch that comprises attribute values associated with the texture point, preferably wherein the texture point has been determined using the method of any of claims 1 to 30.
38. An apparatus, preferably an encoder, for forming and/or encoding the bitstream of claim 36 or 37.
39. An apparatus, preferably a decoder, for receiving and/or decoding the bitstream of claim 36 or 37.
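By way of illustration only, and not as part of the claims or the description, the following Python sketch shows one way the encoder-side steps recited in claims 13 to 17 could be realised. The names (Point, TexturePatch, coplanar, make_texture_point), the SVD-based coplanarity test, its tolerance value and the 8x8 patch layout are all assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import List, Optional

import numpy as np


@dataclass
class Point:
    position: np.ndarray             # 3D location of the point
    attribute: np.ndarray            # e.g. an RGB colour value
    transparency: float = 1.0        # 1.0 is used here to signal "opaque"
    size: float = 1.0
    patch_ref: Optional[int] = None  # index of a referenced texture patch, if any


@dataclass
class TexturePatch:
    attributes: np.ndarray           # e.g. an 8x8 grid of attribute values


def coplanar(points: List[Point], tol: float = 1e-3) -> bool:
    """Return True if the points lie (approximately) on a shared plane."""
    xyz = np.stack([p.position for p in points])
    centred = xyz - xyz.mean(axis=0)
    # The smallest singular value is ~0 when the centred points are coplanar.
    return np.linalg.svd(centred, compute_uv=False)[-1] < tol


def make_texture_point(points: List[Point],
                       patches: List[TexturePatch]) -> Optional[Point]:
    """Collapse coplanar points into a single texture point that references
    a texture patch built from their attributes; return None otherwise."""
    if not coplanar(points):
        return None
    grid = int(round(len(points) ** 0.5))  # e.g. 8 for an 8x8 arrangement
    patch = TexturePatch(
        attributes=np.stack([p.attribute for p in points]).reshape(grid, grid, -1))
    patches.append(patch)
    centre = np.mean([p.position for p in points], axis=0)
    return Point(position=centre,
                 attribute=points[0].attribute,  # coarse value; detail lives in the patch
                 transparency=1.0,               # signalled as opaque
                 size=float(grid),               # size follows the point arrangement
                 patch_ref=len(patches) - 1)
```

Here a transparency value of 1.0 stands in for the opacity signal of claim 17, and the texture-point size is taken from the arrangement of the identified points, as in claim 14; an actual encoder would carry both in its bitstream syntax.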
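Similarly, and again purely as an illustrative sketch (the function names, the transparency and size thresholds, and the list-of-arrays patch store are assumptions, not the application's decoder), the decoder-side method of claims 19 to 23 could look like this:

```python
from typing import List, Optional

import numpy as np


def is_texture_point(in_transparent_section: bool,
                     transparency: float,
                     size: float) -> bool:
    """One possible reading of claim 23: a point stored in the file section
    reserved for transparent points, yet signalled as opaque, is taken to be
    a texture point (its size hinting at the patch it references)."""
    return in_transparent_section and transparency >= 1.0 and size > 1.0


def expand_attributes(attribute: np.ndarray,
                      patch_ref: Optional[int],
                      patches: List[np.ndarray]) -> np.ndarray:
    """Recover per-location attribute values for a decoded point: either the
    referenced texture patch (e.g. an 8x8 grid) or the point's own attribute
    treated as a 1x1 grid."""
    if patch_ref is None:
        return attribute.reshape(1, 1, -1)
    return patches[patch_ref]


# Example: an opaque point found in the transparent section with size 8
# references patch 0, so its attributes expand to the full 8x8 grid.
patches = [np.zeros((8, 8, 3))]
if is_texture_point(in_transparent_section=True, transparency=1.0, size=8.0):
    grid = expand_attributes(np.zeros(3), patch_ref=0, patches=patches)
    assert grid.shape == (8, 8, 3)
```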
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB2406502.1A GB2637364A (en) | 2024-05-09 | 2024-05-09 | Determining a point of a three-dimensional representation of a scene |
| GB2406502.1 | 2024-05-09 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025233629A1 (en) | 2025-11-13 |
Family
ID=91581727
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/GB2025/051000 Pending WO2025233629A1 (en) | 2024-05-09 | 2025-05-09 | Determining a point of a three-dimensional representation of a scene |
Country Status (2)
| Country | Link |
|---|---|
| GB (1) | GB2637364A (en) |
| WO (1) | WO2025233629A1 (en) |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9704055B2 (en) * | 2013-11-07 | 2017-07-11 | Autodesk, Inc. | Occlusion render mechanism for point clouds |
- 2024-05-09: GB application GB2406502.1A filed; published as GB2637364A (status: pending)
- 2025-05-09: PCT application PCT/GB2025/051000 filed; published as WO2025233629A1 (status: pending)
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2016061640A1 (en) | 2014-10-22 | 2016-04-28 | Parallaxter | Method for collecting image data for producing immersive video and method for viewing a space on the basis of the image data |
| WO2018046940A1 (en) | 2016-09-08 | 2018-03-15 | V-Nova Ltd | Video compression using differences between a higher and a lower layer |
| WO2019111010A1 (en) | 2017-12-06 | 2019-06-13 | V-Nova International Ltd | Methods and apparatuses for encoding and decoding a bytestream |
| WO2019202207A1 (en) * | 2018-04-19 | 2019-10-24 | Nokia Technologies Oy | Processing video patches for three-dimensional content |
| WO2020188273A1 (en) | 2019-03-20 | 2020-09-24 | V-Nova International Limited | Low complexity enhancement video coding |
Non-Patent Citations (1)
| Title |
|---|
| CAO CHAO ET AL: "3D Point Cloud Compression: A Survey", NETWORK ON CHIP ARCHITECTURES, ACM, 2 Penn Plaza, Suite 701, New York, NY 10121-0701, USA, 26 July 2019 (2019-07-26), pages 1-9, XP059215996, ISBN: 978-1-4503-6949-7, DOI: 10.1145/3329714.3338130 * |
Also Published As
| Publication number | Publication date |
|---|---|
| GB202406502D0 (en) | 2024-06-26 |
| GB2637364A (en) | 2025-07-23 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US10600233B2 (en) | Parameterizing 3D scenes for volumetric viewing | |
| JP7181233B2 (en) | Processing 3D image information based on texture maps and meshes | |
| TWI786157B (en) | Apparatus and method for generating a tiled three-dimensional image representation of a scene | |
| KR20190038664A (en) | Splitting content-based streams of video data | |
| KR102389157B1 (en) | Method and apparatus for providing 6-dof omni-directional stereoscopic image based on layer projection | |
| EP3564905A1 (en) | Conversion of a volumetric object in a 3d scene into a simpler representation model | |
| CN114930812B (en) | Method and apparatus for decoding 3D video | |
| WO2019138163A1 (en) | A method and technical equipment for encoding and decoding volumetric video | |
| WO2022259632A1 (en) | Information processing device and information processing method | |
| US12142013B2 (en) | Haptic atlas coding and decoding format | |
| JP7462668B2 (en) | An image signal representing a scene | |
| WO2025233629A1 (en) | Determining a point of a three-dimensional representation of a scene | |
| WO2025233631A1 (en) | Determining a point of a three-dimensional representation of a scene | |
| CN113767423B (en) | Apparatus and method for generating image signal | |
| GB2638298A (en) | Determining a point of a three-dimensional representation of a scene | |
| WO2025215362A1 (en) | Determining a location of a point in a point cloud | |
| GB2640002A (en) | Updating a depth buffer | |
| WO2025215364A1 (en) | Processing a three-dimensional representation of a scene | |
| WO2025215363A1 (en) | Determining a point of a three-dimensional representation of a scene | |
| WO2025233632A1 (en) | Bitstream | |
| GB2640349A (en) | Processing a three-dimensional representation of a scene | |
| GB2637367A (en) | Processing a point of a three-dimensional representation | |
| WO2025248250A1 (en) | Processing a point of a three-dimensional representation | |
| KR20220054283A (en) | Methods for transmitting and rendering a 3D scene, a method for generating patches, and corresponding devices and computer programs | |
| RU2817803C2 (en) | Image signal representing scene |